What is the robots.txt file?
The robots.txt file is a plain text document that tells search engine crawlers which parts of a website they may crawl, and it can also point to the site's XML sitemap.
Put more simply, a robots.txt file tells search engine crawlers which URLs of a site they can access. However, it is not a mechanism for keeping a page out of the index: a blocked page can still be indexed if it is linked from elsewhere, and some crawlers ignore robots.txt instructions altogether.
Note that noindex is not a robots.txt directive but a meta tag or HTTP response header sent by the page itself. When crawlers ignore that indexing denial, the only reliable option is to password-protect the web page in question so they cannot reach it at all.
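As a reference, a typical form of the noindex signal is a meta tag in the page's head:

```html
<!-- Tells compliant crawlers not to include this page in their index -->
<meta name="robots" content="noindex">
```

A crawler must be able to fetch the page to see this tag, so the URL should not also be blocked in robots.txt.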
1. What is the robots.txt file used for?
In general, the robots.txt file is used to give specific instructions to the crawlers of the different search engines. Its most common uses are the following:
- It controls access to graphic resources. You can prevent image files on your website from appearing in search results. This is useful for keeping control over infographics and product images containing technical information, so that anyone interested in that material has to visit your website to see it.
- It restricts access to certain web pages. Websites are made up of many pages, but you may not want all of them visited during the crawl. There are many reasons for this: usually to keep crawler visits from hurting the performance of the web server, or to support an SEO strategy by removing or adding pages that could affect rankings.
- It blocks access to files and directories. This is very useful for keeping crawlers out of directories and files that exist only to support the operation of the website, contain information restricted to certain users, or simply duplicate content.
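The three uses above can be sketched in a single robots.txt; the paths here (/images/, /drafts/, /internal/) are placeholders:

```
User-agent: *
# Keep image files out of search results
Disallow: /images/
# Keep unfinished pages from being crawled
Disallow: /drafts/
# Keep crawlers out of internal resources
Disallow: /internal/
```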
2. How to create a robots.txt file?
Most default CMS installations, such as WordPress, generate a robots.txt file automatically. You can also create one yourself with any plain text editor: save a plain text file in ASCII or UTF-8 format, add the desired crawling instructions inside it, and upload it to the root of the site.
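For instance, a minimal permissive robots.txt can be created from the command line (assuming a Unix-like shell; the file then has to be uploaded to the root of the domain, e.g. https://example.com/robots.txt):

```shell
# Write a minimal robots.txt: an empty Disallow value means nothing is blocked
printf 'User-agent: *\nDisallow:\n' > robots.txt
cat robots.txt
```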
Below we show some of the most common directives used in robots.txt.
3. robots.txt commands
The most commonly used commands in the robots.txt document are:
- User-agent: indicates which robot or search engine spider the rules that follow apply to. It is important to note that the instructions are grouped per crawler: a single block is used for Googlebot (Google's search crawler) to tell it what it is and is not allowed to do.
Its basic syntax is:
- User-agent: [specific robot the rules apply to]
- Disallow: tells the robot that it must not access or crawl a particular URL, subdirectory or directory.
- Disallow: [directory you want to block]
- Allow: the opposite of Disallow:; it tells the user-agent a URL, subdirectory or directory that it may access and crawl, typically as an exception within a blocked directory.
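Putting the three directives together, a block might look like this (the paths are illustrative):

```
User-agent: *
Disallow: /private/
Allow: /private/public-report.html
```

Here everything under /private/ is blocked for all crawlers, except the single page listed in the Allow: line.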
4. Examples for robots.txt
Here are the names of the bots or agents of the main search engines:
- Googlebot (Google search engine)
- Googlebot-Image (Google-image search)
- Adsbot-Google (Google AdWords)
- Slurp (Yahoo)
- bingbot (Bing)
Now let's look at some example lines you can place in robots.txt and what each one does:
- Block all agents from certain directories or files:
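For example, assuming /admin/ and /backup/ are the directories to hide (placeholder paths):

```
User-agent: *
Disallow: /admin/
Disallow: /backup/
```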
- Block all website images from Googlebot-Image:
User-agent: Googlebot-Image
Disallow: /
- Block all PDF files from Googlebot:
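A common pattern for this uses the * wildcard and the $ end-of-URL anchor, both of which Googlebot supports:

```
User-agent: Googlebot
Disallow: /*.pdf$
```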