ORIGINALLY POSTED BY NOKIA FOR THETAZZONE/TAZFORUM HERE
Do not use, republish, in whole or in part, without the consent of the Author. TheTAZZone policy is that Authors retain the rights to the work they submit and/or post… we do not sell, publish, transmit, or have the right to give permission for such… TheTAZZone merely retains the right to use, retain, and publish submitted work within its Network.
Soda_Popinsky has very kindly allowed this tutorial of his to be hosted on the TAZ.
Google Security
Some background first…
Google as a Hacking Tool by 3rr0r:
http://www.antionline.com/showthrea…threadid=257512
Google Hacking Honeypots:
http://www.antionline.com/showthrea…threadid=260050
Google hacking and Credit Card Security:
http://www.antionline.com/showthrea…threadid=260580
Google: Net Hacker Tool:
http://www.antionline.com/showthrea…threadid=240791
Google Aids Hackers:
http://www.antionline.com/showthrea…threadid=240734
Google is watching you:
http://www.antionline.com/showthrea…threadid=260700
It seems that Google is becoming a problem for some webmasters. I decided to check out what Google knew about the site I took over, and wrote this tutorial as a reference while I worked.
Control the Spiders
Nearly all crawlers honor something called the Robots Exclusion Standard, which allows webmasters to determine which parts of their website get indexed.
To use it, we stick a text file called robots.txt at the top level of our document root folder. Here is an example file:
- Code: Select all
User-agent: *
Disallow:
This code sucks. It allows all crawlers to index whatever they want. Let's write code to deny all crawlers.
- Code: Select all
User-agent: *
Disallow: /
Notice the slash: it tells all crawlers to ignore everything under the document root folder.
- Code: Select all
User-agent: *
Disallow: /admin
Disallow: /cgi-bin
This code tells crawlers to ignore everything under the admin and cgi-bin folders in the document root. Now let's define which crawlers we do and don't like. Each of these blocks is called a record, and hard returns matter for this to work: leave one blank line between records.
- Code: Select all
# Denies access to Google's spider (its crawler identifies itself as "Googlebot")
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
You can also deny a single file:
- Code: Select all
User-agent: *
Disallow: /admin/index.html
Note that wildcards only work in the “User-agent” line.
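For example, here is a quick sketch (the admin path is just for illustration). Some modern crawlers, Googlebot included, accept * in Disallow paths as an extension, but the original standard does not, so you cannot count on it:
- Code: Select all
# Valid: the wildcard in the User-agent line matches every crawler
User-agent: *
Disallow: /admin

# Not part of the original standard: a wildcard in the path
# Disallow: /*.bak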
Meta Tag Crawler Denial
You may not have permission to put a robots.txt file in the document root of your webserver. In that case this method is available, though crawlers do not support it as well. It is simple: place one of these meta tags in the <head> of your pages:
Permission to index, and follow links:
<meta name="robots" content="index,follow">
Do not index, permission to follow links:
<meta name="robots" content="noindex,follow">
Permission to index, do not follow links:
<meta name="robots" content="index,nofollow">
Do not index, do not follow links:
<meta name="robots" content="noindex,nofollow">
This method is more work per page and is not as well supported, but it requires no special permissions to set up.
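For placement, here is a minimal sketch; the page title and body are hypothetical, only the meta tag in the <head> matters:
- Code: Select all
<html>
<head>
<title>Staging Area</title>
<!-- keep this page out of search indexes and stop link-following -->
<meta name="robots" content="noindex,nofollow">
</head>
<body>
Private work in progress.
</body>
</html>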
Dumping info in Google
This is an easy trick, though not practical for large sites. Enter this into the Google search engine:
site:www.YOURSITEHERE.com
You’ll see that it dumps all it knows about your site. If you aren’t too popular, you can skim through it to see what it knows.
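If skimming the full dump is too much, you can narrow it with Google's other operators. These queries are only a sketch of the idea (the operators themselves are real), but they mirror the kind of signatures hackers run by hand:
- Code: Select all
site:www.YOURSITEHERE.com inurl:admin
site:www.YOURSITEHERE.com filetype:sql
site:www.YOURSITEHERE.com intitle:"index of"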
Foundstone’s SiteDigger
To use this great tool, you need to register for a Google license key. Get one here:
https://www.google.com/accounts/NewAccount
SiteDigger can be found here-
http://www.foundstone.com/resources…/sitedigger.htm
Install SiteDigger and enter your license key in the bottom right corner. After that, update your signatures by clicking Options, then Update Signatures. Enter your domain where it says "please enter your domain" and click Search.
SiteDigger runs automated signature searches against your domain, looking for common indexing mistakes left behind by webmasters. Hackers use this tool, so you should too. Anything it finds should be handled accordingly.
In short, learn to protect your public files. Learn to use .htaccess files for Apache webservers here:
http://www.antionline.com/showthrea…threadid=231380
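As a rough sketch (assuming an Apache server where .htaccess overrides are allowed; this example is mine, not from the thread above), an .htaccess file in a sensitive folder can turn off listings and deny direct web access:
- Code: Select all
# Turn off automatic directory listings
Options -Indexes

# Deny all direct web access to files in this directory
Order allow,deny
Deny from all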
All done.
Comments and criticisms encouraged.
SOURCES:
http://www.robotstxt.org/
http://www.antionline.com/