Today is the 20th birthday of robots.txt, the file that lets webmasters ask search engines not to crawl their pages. It was created by Martijn Koster in 1994.
What is robots.txt?
Robots.txt is a plain-text (not HTML) file placed on a site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but well-behaved search engines generally obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from accessing your site (it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note saying “Please do not enter” on an unlocked door: you cannot stop thieves from coming in, but the good guys will not open the door and enter. That is why, if you have really sensitive data, it is naive to rely on robots.txt to keep it from being indexed and displayed in search results.
The location of robots.txt is very important. It must be in the root directory of the site, because otherwise user agents (search engine crawlers) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look only in the root directory, and if they do not find it there, they simply assume that the site does not have a robots.txt file and therefore index everything they find along the way.
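As a sketch of how a crawler resolves these rules, Python's standard urllib.robotparser module can be pointed at a site's root robots.txt (example.com below is just a placeholder domain, and the rules are fed in directly for illustration rather than fetched over the network):

```python
from urllib.robotparser import RobotFileParser

# A crawler looks for robots.txt only at the site root,
# e.g. https://www.example.com/robots.txt (placeholder domain).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")

# Feed the parser a hypothetical rule set instead of calling rp.read(),
# which would fetch the file over HTTP.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://www.example.com/public/page.html"))   # True
```

If no robots.txt is found at the root, the parser (like a real crawler) treats everything as allowed.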
Structure of a Robots.txt File
The structure of a robots.txt file is pretty simple: it is a list of user agents and the files and directories they are disallowed from crawling. Basically, the syntax is as follows:

User-agent: *
Disallow:

This does not disallow any file from crawling, while:

User-agent: *
Disallow: /

disallows the whole website from crawling.
“User-agent:” names a search engine’s crawler, and “Disallow:” lists the files and directories to be excluded from crawling. In addition to “User-agent:” and “Disallow:” entries, you can include comment lines – just put the # sign at the beginning of the line:
# All user agents are disallowed to see the /temp directory.
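Putting the pieces together, a minimal robots.txt using such a comment might look like this (the /temp directory is just an illustrative path):

```text
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
```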
For the last 20 years, this simple text file has been the signpost that tells a search engine whether or not to crawl a file or website.