Bots are applications that run automated tasks over the internet. They come in many types, categorized by how they are used: Googlebot, for example, fetches and analyzes information from web servers, a process known as web spidering. Other bots are designed to replace humans in online environments such as gaming or chat (chatterbots).


This post is about WWW robots (also called web spiders, crawlers, worms or indexers) such as Googlebot. So what are they?

These applications browse the World Wide Web starting from a primary list of URLs, then follow the links on the pages they visit to reach further pages, and the process continues from there. This is how search engines find their way to your website.
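To make the idea concrete, here is a minimal sketch of that crawl loop in Python, using only the standard library. The seed URL and page limit are placeholders, and a real crawler would also check robots.txt (covered below) before fetching anything.

    # Minimal crawl loop, for illustration only: start from seed URLs,
    # fetch each page, and queue the links found on it.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(seed_urls, max_pages=10):
        queue = list(seed_urls)      # the primary list of URLs
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.pop(0)
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue             # skip pages that fail to load
            collector = LinkCollector()
            collector.feed(html)
            # Links found on this page become new pages to visit.
            queue.extend(urljoin(url, link) for link in collector.links)
        return visited


    # Hypothetical seed list; a real crawler starts from many more URLs.
    print(crawl(["https://example.com/"]))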

To stop search bots from indexing some folders and files on your domain, a robots.txt file is used.

robots.txt contains rules for this kind of bot to follow; they are referred to as the Robots Exclusion Standard. The structure of the file is very simple, as follows:

User-agent: *
Disallow: /private/
Disallow: /code/

User-agent: Googlebot-Image
Disallow: /images/haider.jpg

The first line determines which specific bot the rules should apply to. Each Disallow rule specifies a file or folder you want to keep away from search bots. In my example I am excluding two folders from all bots and an image from Googlebot-Image only.
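If you want to see how these rules are interpreted, Python ships a robots.txt parser in its standard library. The sketch below feeds it the example rules and asks which fetches would be allowed; the example.com URLs are placeholders.

    # Sketch of how a well-behaved bot tests URLs against the rules above,
    # using Python's built-in robots.txt parser.
    from urllib.robotparser import RobotFileParser

    rules = """
    User-agent: *
    Disallow: /private/
    Disallow: /code/

    User-agent: Googlebot-Image
    Disallow: /images/haider.jpg
    """

    parser = RobotFileParser()
    parser.parse(rules.splitlines())

    # All bots are kept out of the two folders...
    print(parser.can_fetch("SomeBot", "http://example.com/private/notes.txt"))  # False
    print(parser.can_fetch("SomeBot", "http://example.com/code/app.py"))        # False
    # ...while the image is blocked only for Googlebot-Image.
    print(parser.can_fetch("Googlebot-Image", "http://example.com/images/haider.jpg"))  # False
    print(parser.can_fetch("SomeBot", "http://example.com/images/haider.jpg"))          # True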

The robots.txt file should be uploaded to the root folder of your site; crawlers only look for it there, and its rules apply to the whole host. Note that subdomains are treated as separate sites, so each subdomain needs its own copy.

However, be warned that bots are also used by malicious people to spread viruses, run DDoS attacks, or collect email addresses and other kinds of information. These bots do not respect robots.txt rules; in other words, robots.txt does not guarantee privacy. In addition, the file is accessible from any web browser, so if you list your important folders and files explicitly, some people may use it to locate them. That would be a security risk, so to reduce it, apply the Disallow rule to folders rather than individual files, and try to name them smartly.

Another possible way to control bots is by using special metadata tags in your HTML files. The following tag, for example, tells the bot not to index the page:

<META NAME="ROBOTS" CONTENT="NOINDEX">

Another tag is:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

And by using it, you tell bots not to follow the links on your page.
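As an illustration of how a cooperative bot might act on these tags, here is a small Python sketch (standard library only) that parses a page and checks for NOINDEX and NOFOLLOW; the sample page string is made up.

    # Sketch of how a polite bot might read the robots meta tag before
    # indexing a page or following its links.
    from html.parser import HTMLParser


    class RobotsMetaParser(HTMLParser):
        """Records the CONTENT of a <meta name="robots"> tag, if present."""
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "robots":
                content = attrs.get("content", "")
                self.directives.update(d.strip().lower() for d in content.split(","))


    # Hypothetical page containing the first tag from the post.
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body></body></html>'

    parser = RobotsMetaParser()
    parser.feed(page)

    if "noindex" in parser.directives:
        print("Skip indexing this page")
    if "nofollow" in parser.directives:
        print("Do not follow links on this page")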
