TheTAZZone - Internet Chaos

Tutorial – Google Security

ORIGINALLY POSTED BY NOKIA FOR THETAZZONE/TAZFORUM HERE

Do not use, republish, in whole or in part, without the consent of the Author. TheTAZZone policy is that Authors retain the rights to the work they submit and/or post…we do not sell, publish, transmit, or have the right to give permission for such…TheTAZZone merely retains the right to use, retain, and publish submitted work within its Network

Soda_Popinsky has very kindly allowed this tutorial of his to be hosted on the TAZ.

Google Security

Some background first…
Google as a Hacking Tool by 3rr0r:
http://www.antionline.com/showthrea…threadid=257512
Google Hacking Honeypots:
http://www.antionline.com/showthrea…threadid=260050
Google hacking and Credit Card Security:
http://www.antionline.com/showthrea…threadid=260580
Google: Net Hacker Tool:
http://www.antionline.com/showthrea…threadid=240791
Google Aids Hackers:
http://www.antionline.com/showthrea…threadid=240734
Google is watching you:
http://www.antionline.com/showthrea…threadid=260700

It seems that Google is becoming a problem for some webmasters. I wanted to find out what Google knew about the site I took over, so I wrote this tutorial as a reference while I worked.

Control the Spiders

Nearly all crawlers honor something called the Robots Exclusion Standard, which lets webmasters determine which parts of their website get indexed.

To do this, we stick a text file called robots.txt at the top level of our document root folder. Here is an example file:

Code: Select all
User-agent:  *
Disallow:

This code sucks. It allows all crawlers to index whatever they want. Let's write a rule that denies all crawlers instead.

Code: Select all
User-agent:  *
Disallow: /

Notice the slash: it tells all crawlers to ignore everything under the document root folder.

Code: Select all
User-agent:  *
Disallow: /admin
Disallow: /cgi-bin

This code tells crawlers to ignore everything under the admin and cgi-bin folders in the document root. Now let's define which crawlers we do and don't allow. Each block below is called a record, and hard returns matter for this to work: leave one blank line between records.

Code: Select all
# Denies access to Google's spider
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:

You can also deny a single file:

Code: Select all
User-agent:  *
Disallow: /admin/index.html

Note that the * wildcard only works in the “User-agent” line; under the standard, Disallow values are treated as plain path prefixes.
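For example, something like the following will not work; the directory names here are hypothetical, just to illustrate the point:

Code: Select all
# WRONG - the original standard has no wildcard support in Disallow lines:
# Disallow: /*.bak
#
# Instead, disallow the directories that hold the files you want hidden:
User-agent: *
Disallow: /backups
Disallow: /old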

Meta Tag Crawler Denial

You may not have permission to put a robots.txt file in the document root of your webserver. In that case this method is available, although crawlers do not support it as well. It is simple: place one of these meta tags in the head of your pages:

Permission to index, and follow links:
<meta name="robots" content="index,follow">

Do not index, permission to follow links:
<meta name="robots" content="noindex,follow">

Permission to index, do not follow links:
<meta name="robots" content="index,nofollow">

Do not index, do not follow links:
<meta name="robots" content="noindex,nofollow">

This method is more work and not as well supported, but it requires no special permissions to set up.
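For reference, here is roughly where the tag belongs in a page. This is just a minimal sketch; the title and content are made up:

Code: Select all
<html>
<head>
  <title>Admin login</title>
  <!-- keep this page out of the index and don't follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
<body>
  ...
</body>
</html>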

Dumping info in Google

This is an easy trick, though not practical for large sites. Enter this into the Google search engine:

site:www.YOURSITEHERE.com

You’ll see that it dumps everything Google knows about your site. If your site isn’t too big, you can skim through the results to see exactly what has been indexed.
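You can also narrow the dump by adding extra terms or operators to the same query. These examples are hypothetical, just to show the idea:

Code: Select all
site:www.YOURSITEHERE.com inurl:admin
site:www.YOURSITEHERE.com filetype:txt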

Foundstone’s SiteDigger

In order to use this great tool, you need to register for a Google license key. Get one here:
https://www.google.com/accounts/NewAccount

SiteDigger can be found here-
http://www.foundstone.com/resources…/sitedigger.htm

Install SiteDigger and enter your license key in the bottom right corner. After that, update the signatures via Options > Update Signatures. Enter your domain where it says "please enter your domain" and click Search.

SiteDigger runs automated, signature-based searches against your domain, looking for common indexing mistakes left behind by webmasters. Hackers use this tool, so you should too. Anything it finds should be dealt with accordingly.
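The signatures are essentially canned Google queries run against your domain. These are illustrative examples of the style, not actual SiteDigger signatures:

Code: Select all
site:www.YOURSITEHERE.com intitle:"index of"
site:www.YOURSITEHERE.com filetype:log
site:www.YOURSITEHERE.com "sql syntax near"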

In short, learn to protect your public files. You can learn to use .htaccess files for Apache webservers here:
http://www.antionline.com/showthrea…threadid=231380
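As a starting point, here is a minimal .htaccess sketch (classic Apache "Order/Deny" syntax; the protected folder is assumed to be the admin directory from the robots.txt examples above) that refuses all web access to a directory:

Code: Select all
# Place this .htaccess inside the directory you want to protect (e.g. /admin).
# The server must allow overrides (AllowOverride Limit or All) for it to apply.
Order deny,allow
Deny from all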

All done.
Comments and criticisms encouraged.

SOURCES:
http://www.robotstxt.org/
http://www.antionline.com/
