Optimize robots.txt for Better SEO
In case you didn’t know, search engines are not as smart as you might think. A crawler can’t tell on its own which pages of your site should be included in the index and which files it should ignore. Therefore we need to tell it what to do using a robots.txt file.
What is a robots.txt file?
A robots.txt file is a set of instructions that tells search engine robots which pages of your site should be crawled and indexed. In most cases, your site consists of many files and folders (e.g. admin folders, cgi-bin, image folders) that are not relevant to search engines.
Basically, the purpose of a robots.txt file is to improve site indexation by telling search engine crawlers to index only your content pages and to ignore other pages (e.g. monthly archives, category folders or your admin files) that you do not want to appear in the search index, because they may lead to duplicate content issues.
How to create a robots.txt file?
A robots.txt file is just a plain text file. You can create it with any text editor, such as Notepad, as long as you save it under the exact name robots.txt.
Here is the basic syntax of a robots.txt file:
User-agent: *
Disallow: [files or folders that are excluded]
User-agent: * The asterisk (*), or wildcard, means that the instructions in that section apply to all search engine spiders (Google, Yahoo, MSN and others).
Disallow: When left blank, this tells those spiders to crawl and index the entire site, which is of course not the most prudent thing to do with respect to SEO.
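To see the effect of a blank Disallow line, here is a minimal sketch using the robots.txt parser from Python's standard library. The bot name "ExampleBot" and the paths are made up for illustration, and real search engine crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# A blank Disallow value excludes nothing.
rules = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" and these paths are hypothetical, for illustration only.
everything_open = all(
    parser.can_fetch("ExampleBot", path)
    for path in ["/", "/wp-admin/", "/feed/"]
)
print(everything_open)
```

With nothing listed after Disallow, every path stays open to every robot, including folders such as /wp-admin/.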
Basic example of robots.txt file
If you include an instruction that looks like this:
User-agent: *
Disallow: /wp-admin/
It tells all robots that they must not crawl the contents of the /wp-admin/ directory. I hope that makes sense so far.
There seem to be a number of different viewpoints on what should and shouldn’t be included in the robots.txt file. If you’d like to see what the big guys are doing, you can have a look at this collection of robots.txt files from Daily Blog Tips.
If you are interested in mine, here is the robots.txt file that I use (feel free to copy and tinker with it):
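If you want to try this rule out before uploading it, the sketch below feeds it to Python's built-in urllib.robotparser. The bot name and the two test paths are assumptions made for the example, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "AnyBot" and both paths are hypothetical names for illustration.
admin_allowed = parser.can_fetch("AnyBot", "/wp-admin/options.php")
post_allowed = parser.can_fetch("AnyBot", "/my-first-post/")

print("admin page crawlable:", admin_allowed)
print("blog post crawlable:", post_allowed)
```

The rule is a simple prefix match: anything whose path starts with /wp-admin/ is blocked, and everything else stays crawlable.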
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-*
Disallow: /feed/
Disallow: /trackback/
Disallow: /tag/
Disallow: /cgi-bin/
Disallow: /2008/
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
Once you have created the robots.txt file, double-check it for errors. A mistake can stop the spiders from indexing files that are meant to be indexed.
Optimizing your robots.txt file will help prevent Google from penalizing you for duplicate content and may also improve your search engine rankings.
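One way to do that double check is a small script that lists the pages you care about and reports any that your rules would block. This is only a sketch: the helper name blocked_paths, the bot name, the shortened rule set and the sample paths are all assumptions for illustration. Note also that Python's standard parser matches plain path prefixes and does not expand wildcard rules such as /wp-*, so those are left out here.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, must_crawl, agent="ExampleBot"):
    """Return the paths from must_crawl that the rules would block for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in must_crawl if not parser.can_fetch(agent, path)]


# A shortened, wildcard-free subset of rules, used only for this example.
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /2008/
"""

# Hypothetical pages that should stay visible to search engines.
important = ["/", "/about/", "/2008/07/some-post/"]

problems = blocked_paths(rules, important)
print(problems)
```

In this made-up case the check flags /2008/07/some-post/: if your permalinks start with the year, a rule like Disallow: /2008/ blocks the posts themselves and not just the yearly archive.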
Have you optimized yours today?
Update:
It is important to note that your robots.txt file should always reside in the root directory of your domain, even if your blog itself is installed in a sub-directory. Validate it by entering the full URL www.yoursite.com/robots.txt in your web browser.
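To make the "root directory" point concrete, here is a short sketch that works out where crawlers will look for robots.txt, starting from any page address. The function name robots_url and the sample address are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical blog installed in a /blog/ sub-directory.
location = robots_url("http://www.yoursite.com/blog/2008/some-post/")
print(location)
```

Whatever sub-directory the page lives in, the result points at the top of the domain, which is the only place the file will be picked up.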
22 Comments
Thanks for the tip! And hop, rss feed added :)
Great job, you know….I knew that file was important, but the full knowledge of how it worked had escaped me. Thank you lol :)
@Ludo: You are welcome!
@Joe: Hey, what a pleasant surprise to see you here and thanks for your lovely comment.
Yeah, more web developers need to be aware of robots.txt and its advantages and disadvantages. Great post!
As usual, a great overview Yan. I will keep this post handy. Thanks.
Nice tip.
I have installed a sitemaps plugin and it handles everything. If I add the rules mentioned above to robots.txt, will the sitemap plugin overwrite them when it updates the file?
Any idea?
@Nihar: Certainly not as I have the same plugin installed on this blog. In fact I did add the following at the end of the robots.txt file as a nice companion to the plugin
hey im really diggin your site. lots of good info and im glad i found it.
Hi Earl
Welcome to my little playground. This is a blog intended to help beginners to blog. I’m glad you find it useful. Let me know if there is anything I could help.
Cheers
Yan
Hi Yan,
that was nice to read.
It is interesting that directives in robots.txt are not cumulative for different user agents and the * wild card user agent.
If you have something like this:
User-agent: Googlebot
Disallow: /something/
User-agent: *
Disallow: /something/
Disallow: /another/
In this case only the directory “something” will be forbidden for the Googlebot, but not “another”.
@AM: Hey, thanks for that little tutorial. I’m actually a bit hesitating about the use of
What do you say? Is this the correct syntax?
As far as I have seen, Google’s crawlers respect only the directives that are most specific to them. Googlebot-Image will discover the Allow directive, but it will not read or respect the Disallow directives for User-agent: *, because they are less specific.
That means that crawling of all directories, that are currently forbidden for all other crawlers, is now allowed for Googlebot-Image. This is maybe not what you want. You should repeat all Disallow directives in the specific section for Googlebot-Image, if you want to block these directories as well.
Write a complete Disallow section for each crawler and everything should be fine :)
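This behaviour can be reproduced with a minimal sketch using Python's standard urllib.robotparser, which also uses the group that names the crawler and falls back to User-agent: * only when no group matches. The rules are a shortened version of the ones discussed here, "SomeOtherBot" is a made-up name, and this shows how Python's parser reads the file, not a guarantee about any particular search engine.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot-Image has its own group, so the * rules are not applied to it.
image_bot_admin = parser.can_fetch("Googlebot-Image", "/wp-admin/")

# "SomeOtherBot" is hypothetical; it has no group and falls back to *.
other_bot_admin = parser.can_fetch("SomeOtherBot", "/wp-admin/")

print(image_bot_admin, other_bot_admin)
```

In this sketch /wp-admin/ is open to Googlebot-Image but closed to the other bot, which is why the Disallow lines need to be repeated in every specific section.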
I would omit the use of wildcards * in the User-agent: * section, if possible. In the past not all search engines supported this syntax. Google supports wildcards now.
@AM: I really appreciate your taking the time to explain to me on the proper use of robots.txt. If I understand you correctly, you are actually telling me to rewrite it this way
Do I miss anything here?
Yes, this is the right way.
Now /wp-content/uploads/
is not blocked, so Googlebot-Image can access it.
Do you know the blog “Sebastian’s Pamphlets”? Sebastian has written many good articles about the robots.txt and different ways to control search engine crawlers.
@AM: Thank you. Your input certainly has helped me to understand the working of robots.txt. I’ll update the above article and my robots.txt accordingly.
Sebastian’s Pamphlets? Not that I know of but I’ll be checking on it in a while and thanks for the intro.
I’m so sorry, may I have your name?
Hi Yan,
just visit the about page on my blog, the page is called “editorial” (second link from the top in the navi). I have written a little paragraph in english :)
@York - Thank you and now it’s more personal. Since your blog is in German I was having a hard time the other day trying to figure out your name.
I hope I could learn more from you as SEO is always the subject of my interest. If there is an article on SEO that you think may interest the readers here, I’ll be glad if you would write for this blog as a guest blogger sharing your knowledge for the benefit of others.
Have a good day, York..
Hi Yan,
thanks for this huge compliment! Writing articles in English is very hard for me, because it is not my first language. And I’m far from being an SEO expert. But I will share my experiences in the comments, if there is something that I can contribute.
I usually stop Google from indexing my feed too. I hope this will reduce duplicate content on my blog.
Nice tips.
Btw, I allow Google to index tags :)
wow nice. im gonna add one to my server