I was recently looking at Google’s Webmaster Tools, specifically the robots.txt section. As I reviewed the robots.txt file for my blog, I wondered whether I could make it better. For those who aren’t familiar with the file, it tells web robots which files and directories they shouldn’t access. Search engines use web robots to add pages to their search results.
While reviewing the file, I started thinking about a more efficient way to prevent duplicate content from appearing in search results without having to keep modifying robots.txt. After a quick search, I found a good way of handling it within WordPress.
Preventing Duplicate Content
One common concern for those who manage web sites and blogs is preventing duplicate content from appearing in search engine results. On a WordPress blog, the content of a post can appear on the post itself, in a category listing, and in an archive listing. Anyone concerned about duplication would want only the actual post to be listed, not the other pages that happen to contain that post.
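For example, the same post could be reachable through several URLs like the following (these are hypothetical URLs; the exact paths depend on your permalink settings):

http://www.example.com/my-sample-post/
http://www.example.com/category/tips/
http://www.example.com/2008/05/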
As mentioned earlier, editing the robots.txt file is one method of preventing duplicate content: you can simply disallow the web robots from crawling the category and archive pages. The problem is that whenever a new archive date is added, you have to remember to update the robots.txt file. A better solution is to have WordPress do this for you.
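For reference, the robots.txt approach would look something like this (the category and year paths below are only examples and depend on how your blog’s URLs are structured):

User-agent: *
Disallow: /category/
Disallow: /2007/
Disallow: /2008/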
Besides the robots.txt file, search engine web robots also use specific meta tags, located in the head section of a web page, to determine which pages they can index. By editing the header.php file of your WordPress template, you can allow only specific pages to be indexed by search engines.
Here is the code:
<meta name="googlebot" content="index,follow" />
<meta name="robots" content="index,follow" />
<meta name="msnbot" content="index,follow" />
<?php } else { ?>
<meta name="googlebot" content="noindex,follow" />
<meta name="robots" content="noindex,follow" />
<meta name="msnbot" content="noindex,follow" />
<?php } ?>
As you can see, an if statement is used to determine the page type. In the above case, the if statement checks whether the current page is any of the following:
- A single post page.
- A static web page.
- The home page of the blog.
If the current page matches any of the above page types, then the first set of meta tags is written to the head section of the web page. Those meta tags indicate that the page should be indexed and that all of its links should be followed; you can see this in the content attribute. The name attribute indicates which web robot the tag applies to, and the name "robots" is a generic catch-all value.
For pages that don’t meet the above criteria, such as the category and archive pages, the second set of meta tags is applied to the web page. In this case, the web robots are told not to index the page, but to still follow the links.
After updating the header.php file with the above code, I verified the meta tag values on various pages. Static pages, post pages, and the home page all indicated that they should be indexed. All other pages told the robots to not index the page.
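If you want to spot-check a page yourself, viewing its source in the browser and looking for the robots meta tag is enough. From a command line, something like the following also works (the URL is just a placeholder for one of your own category or archive pages):

curl -s http://www.example.com/category/tips/ | grep -i 'name="robots"'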
A simple change like this lets me control which pages are indexed by search engines while preventing duplicate content from my blog from appearing in the search results.
For Blogger users, I’ll write a post in the future that shows how to accomplish the same task on that platform.