Optimize robots.txt for Better SEO

by Yan

In case you didn’t know, a search engine is not as smart as you might think. On its own, it can’t tell which pages of your site should be included in the index and which files it should ignore. Therefore we need to tell it what to do using a robots.txt file.

What is a robots.txt file?

A robots.txt file is a set of instructions that tells search engine robots which pages of your site to crawl and index. In most cases, your site consists of many files and folders (e.g. admin folders, cgi-bin, image folders) that are not relevant to the search engines.

Basically, the purpose of creating a robots.txt file is to improve site indexation by telling search engine crawlers to index only your content pages and to ignore other pages (e.g. monthly archives, category folders or your admin files) that you do not want to appear in the search index, because they may lead to duplicate content issues.

How to create a robots.txt file?

A robots.txt file is just a simple text file. You can create it with any text editor, such as Notepad; just make sure you save it with the exact name robots.txt.

Here is the basic syntax of a robots.txt file:


User-agent: *
Disallow: [files or folders that are excluded]

User-agent: * The asterisk (*), or wildcard, means that the instructions listed under it apply to all search engine spiders (Google, Yahoo, MSN and others).

Disallow: When left blank, this tells those spiders that they may crawl and index the entire site, which is of course not the most prudent thing to do with respect to SEO.

Basic example of a robots.txt file

If you include an instruction that looks like this:


User-agent: *
Disallow: /wp-admin/

It tells all robots that they should not crawl or index the contents of the /wp-admin/ directory. I hope that makes sense so far.
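
If you want to see how a rule like this behaves, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name ExampleBot, the example.com URLs and the build_parser helper are made up for illustration, and Python's parser is only an approximation of how a real search engine reads the file.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The basic example from this post.
basic = build_parser("User-agent: *\nDisallow: /wp-admin/\n")

# A blank Disallow, which blocks nothing.
allow_all = build_parser("User-agent: *\nDisallow:\n")

# "ExampleBot" and example.com are placeholders, not a real crawler or site.
admin_url = "http://www.example.com/wp-admin/options.php"
post_url = "http://www.example.com/my-first-post/"

print(basic.can_fetch("ExampleBot", admin_url))      # blocked by the rule
print(basic.can_fetch("ExampleBot", post_url))       # not covered by any rule
print(allow_all.can_fetch("ExampleBot", admin_url))  # blank Disallow allows it
```

A Disallow value is matched against the start of the URL path, so /wp-admin/ covers everything inside that folder.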

There seem to be a number of different viewpoints on what should and shouldn’t be included in a robots.txt file. If you’d like to see what the big guys are doing, have a look at this collection of robots.txt files from Daily Blog Tips.

If you are curious about mine, here is the robots.txt file that I use (feel free to copy and tinker with it):


User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-*
Disallow: /feed/
Disallow: /trackback/
Disallow: /tag/
Disallow: /cgi-bin/
Disallow: /2008/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

Once you have created the robots.txt file, double-check that it contains no errors. A mistake can stop the spiders from indexing files that are meant to be indexed.
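
One way to do that double check is with a few lines of Python. The following is a minimal sketch using the standard urllib.robotparser module; the sample rules, the bot name ExampleBot and the find_mistakes helper are made up for illustration. Note that Python's parser does simple path-prefix matching and, as far as I know, does not understand wildcard paths such as /wp-*, so treat it as an approximation of what a real search engine does.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only; paste in your own file's contents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"""


def find_mistakes(robots_txt, agent, must_allow, must_block):
    """Return a list of messages for paths whose result is not what you expect."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for path in must_allow:
        if not parser.can_fetch(agent, path):
            problems.append(f"{agent}: {path} is blocked but should be crawlable")
    for path in must_block:
        if parser.can_fetch(agent, path):
            problems.append(f"{agent}: {path} is crawlable but should be blocked")
    return problems


# Any crawler without its own section falls back to the "*" rules.
print(find_mistakes(ROBOTS_TXT, "ExampleBot",
                    must_allow=["/", "/my-first-post/"],
                    must_block=["/wp-admin/", "/feed/"]))

# Googlebot-Image has its own section, so the "*" rules are not applied to it.
print(find_mistakes(ROBOTS_TXT, "Googlebot-Image",
                    must_allow=["/wp-content/uploads/photo.jpg"],
                    must_block=["/wp-admin/"]))
```

With these sample rules the first check reports nothing, while the second one flags /wp-admin/. A crawler that has its own User-agent section follows only that section, so any Disallow lines you want it to respect must be repeated there.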

Optimizing your robots.txt file can help you avoid duplicate content problems with Google, and it may also improve your search engine rankings.

Have you optimized yours today?

Update:
It is important to note that your robots.txt file should always reside in the root directory of your domain, even if your site itself lives in a sub-directory. Validate it by entering the full URL www.yoursite.com/robots.txt in your web browser.
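
To make the "always in the root" rule concrete, here is a small Python sketch that works out where robots.txt should live for any page on a site. The function name robots_txt_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# www.yoursite.com is the placeholder used in this post.
print(robots_txt_url("http://www.yoursite.com/blog/2008/07/some-post/?p=42"))
# prints http://www.yoursite.com/robots.txt
```

However deep the page sits, the answer is the same file at the top of the domain.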

{ 22 comments }

Ludo July 17, 2008 at 1:49 am

Thanks for the tip! And hop, rss feed added :)

JK Swopes July 18, 2008 at 2:43 am

Great job, you know….I knew that file was important, but the full knowledge of how it worked had escaped me. Thank you lol :)

Yan July 18, 2008 at 3:55 am

@Ludo: You are welcome!

@Joe: Hey, what a pleasant surprise to see you here and thanks for your lovely comment.

Chris July 18, 2008 at 4:36 am

Yeah, more web developers need to be aware of robots.txt and its advantages and disadvantages. Great post!

Technology For Non Techies July 18, 2008 at 6:45 am

As usual, a great overview Yan. I will keep this post handy. Thanks.

Nihar July 20, 2008 at 5:05 am

Nice tip.

I have installed a sitemaps plugin. It does all the things. If I add the lines mentioned above to robots.txt, will the sitemap plugin overwrite them when it updates the file?

Any idea?

Yan July 20, 2008 at 10:07 am

@Nihar: Certainly not, as I have the same plugin installed on this blog. In fact, I added the following at the end of the robots.txt file as a nice companion to the plugin:


# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://thoushallblog.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN

Earl July 21, 2008 at 9:13 am

hey im really diggin your site. lots of good info and im glad i found it.

Yan July 21, 2008 at 9:50 am

Hi Earl

Welcome to my little playground. This is a blog intended to help beginners to blog. I’m glad you find it useful. Let me know if there is anything I can help with.

Cheers
Yan

Alphane Moon July 23, 2008 at 5:11 pm

Hi Yan,

that was nice to read.
It is interesting that directives in robots.txt are not cumulative for different user agents and the * wild card user agent.

If you have something like this:

User-agent: Googlebot
Disallow: /something/

User-agent: *
Disallow: /something/
Disallow: /another/

In this case only the directory “something” will be forbidden for the Googlebot, but not “another”.

Yan July 23, 2008 at 7:51 pm

@AM: Hey, thanks for that little tutorial. I’m actually a bit hesitant about the use of


User-agent: Googlebot-Image
Allow: /wp-content/uploads/

What do you say? Is this the correct syntax?

Alphane Moon July 23, 2008 at 8:34 pm

As far as I have seen, Google’s crawlers respect only the directives that are most specific to them. Googlebot-Image will discover the Allow directive, but it will not read or respect the Disallow directives for User-agent: *, because they are less specific.

That means that crawling of all the directories that are currently forbidden for all other crawlers is now allowed for Googlebot-Image. This is maybe not what you want. You should repeat all the Disallow directives in the specific section for Googlebot-Image if you want to block these directories as well.

Write a complete Disallow section for each crawler and everything should be fine :)

I would avoid using wildcards (*) in the paths of the User-agent: * section, if possible. In the past, not all search engines supported this syntax. Google supports wildcards now.

Yan July 23, 2008 at 10:26 pm

@AM: I really appreciate your taking the time to explain the proper use of robots.txt to me. If I understand you correctly, you are telling me to rewrite it this way:


User-agent: Googlebot-Image
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes

Am I missing anything here?

Alphane Moon July 23, 2008 at 11:43 pm

Yes, this is the right way.

Now /wp-content/uploads/
is not blocked, so Googlebot-Image can access it.

Do you know the blog “Sebastian’s Pamphlets”? Sebastian has written many good articles about the robots.txt and different ways to control search engine crawlers.

Yan July 24, 2008 at 8:57 am

@AM: Thank you. Your input has certainly helped me to understand the workings of robots.txt. I’ll update the above article and my robots.txt accordingly.

Sebastian’s Pamphlets? Not that I know of, but I’ll check it out in a while. Thanks for the intro.

I’m so sorry, but may I ask your name?

Alphane Moon July 25, 2008 at 4:52 am

Hi Yan,

just visit the about page on my blog; the page is called “editorial” (second link from the top in the navi). I have written a little paragraph in English :)

Yan July 25, 2008 at 11:12 am

@York – Thank you and now it’s more personal. Since your blog is in German I was having a hard time the other day trying to figure out your name.

I hope I can learn more from you, as SEO is always a subject of interest to me. If there is an article on SEO that you think may interest the readers here, I’d be glad if you would write for this blog as a guest blogger, sharing your knowledge for the benefit of others.

Have a good day, York..

Alphane Moon July 31, 2008 at 10:54 pm

Hi Yan,

thanks for this huge compliment! Writing articles in English is very hard for me, because it is not my first language. And I’m far from being an SEO expert. But I will share my experiences in the comments, if there is something that I can contribute.

dewaji August 2, 2008 at 8:50 pm

I usually stop Google from indexing my feed too. I hope this will reduce duplicate content on my blog.

ThemeLib.com August 3, 2008 at 2:04 am

Nice tips.

Btw, I allow Google to index tags :)

josh August 7, 2008 at 11:02 am

wow nice. im gonna add one to my server

Joel Drapper October 31, 2008 at 10:34 am

Thanks Yan. I have just been going through all your old posts. Never thought of this before.
