Chapter 4: Duplicate Content in a Post-Panda World

Editor's Note: In a post-Panda world, ignoring duplicate content is not an option. The potential damage is too great to risk. In this post, which was originally published on The Moz Blog on Nov. 16, 2011, “Dr. Pete” explains what duplicate content is, why it's such a big deal, and how inbound marketers can diagnose and fix duplicate content issues.

In early 2011, Google launched the first phase of the “Panda” update, which would prove to be a wake-up call for SEO issues webmasters had been ignoring for too long. One of those issues was duplicate content. While duplicate content as an SEO problem has been around for years, the way Google handles it has evolved dramatically and seems to only get more complicated with every update. Panda upped the ante even more.

This chapter is an attempt to cover the topic of duplicate content, as it stands today, in depth. This is designed to be a comprehensive resource—a complete discussion of what duplicate content is, how it happens, how to diagnose it, and how to fix it. Maybe we'll even round up a few rogue pandas along the way.

What Is Duplicate Content?

Let's start with the basics. Duplicate content exists when any two (or more) pages share the same content. If you're a visual learner, here's an illustration for you:

[Figure: two pages sharing the same content]

Easy enough, right? So, why does such a simple concept cause so much difficulty? One problem is that people often make the mistake of thinking that a “page” is a file or document sitting on their web server. To a crawler (like Googlebot), a page is any unique URL it happens to find, usually through internal or external links. Especially on large, dynamic sites, creating two URLs that land on the same content is surprisingly easy (and often unintentional).

Why Do Duplicates Matter?

Duplicate content as an SEO issue was around long before the Panda update, and has taken many forms as the algorithm has changed. Here's a brief look at some major issues with duplicate content over the years.

The Supplemental Index

In the early days of Google, just indexing the web was a massive computational challenge. To deal with this challenge, some pages that were seen as duplicates or very low quality were stored in a secondary index called the supplemental index. These pages automatically became second-class citizens, from an SEO perspective, and lost any competitive ranking ability.

Around late 2006, Google integrated supplemental results back into the main index, but those results were still often filtered out. You know you've hit filtered results anytime you see this warning at the bottom of a Google SERP (search engine results page):

[Figure: Google's omitted-results notice at the bottom of a SERP]

Even though the index was unified, results were still omitted, with obvious consequences for SEO. Of course, in many cases, these pages really were duplicates or had very little search value, and the practical SEO impact was negligible, but not always.

The Crawl Budget

It's always tough to talk limits when it comes to Google, because people want to hear an absolute number. There is no absolute crawl budget or fixed number of pages that Google will crawl on a site. There is, however, a point at which Google may give up crawling your site for a while, especially if you keep sending spiders down winding paths.

Although the crawl budget isn't absolute, even for a given site, you can get a sense of Google's crawl allocation for your site in Google Webmaster Tools (under Diagnostics, select Crawl Stats).

[Figure: the Crawl Stats report in Google Webmaster Tools]

So, what happens when Google hits so many duplicate paths and pages that it gives up for the day? Practically speaking, the pages you want indexed may not get crawled. At best, they probably won't be crawled as often.

The Indexation Cap

Similarly, there's no set cap on how many pages of a site Google will index. There does seem to be a dynamic limit, though, and that limit is relative to the authority of the site. If you fill up your index with useless, duplicate pages, you may push out more important, deeper pages. For example, if you load up on thousands of internal search results, Google may not index all of your product pages. Many people make the mistake of thinking that having more pages indexed is better. I've seen too many situations where the opposite was true. All else being equal, bloated indexes dilute your ranking ability.

The Penalty Debate

Long before Panda, a debate would erupt every few months over whether or not there was a duplicate content penalty. While these debates raised valid points, they often focused on semantics—whether or not duplicate content caused a Capital-P Penalty. While I think the conceptual difference between penalties and filters is important, the upshot for a site owner is often the same. If a page isn't ranking (or even indexed) because of duplicate content, then you've got a problem, no matter what you call it.

The Panda Update

Since Panda (starting in February 2011), the impact of duplicate content has become much more severe in some cases. It used to be that duplicate content could only harm that content itself. If you had a duplicate, it might go supplemental or get filtered out. Usually, that was okay. In extreme cases, a large number of duplicates could bloat your index or cause crawl problems and start impacting other pages.

Panda made duplicate content part of a broader quality equation—now, a duplicate content problem can impact your entire site. If you're hit by Panda, non-duplicate pages may lose ranking power, stop ranking altogether, or even fall out of the index. Duplicate content is no longer an isolated problem.

Three Kinds of Duplicates

Before we dive into examples of duplicate content and the tools for dealing with them, I'd like to cover three broad categories of duplicates: true duplicates, near duplicates, and cross-domain duplicates. I'll be referencing these three main types in the examples later in the chapter.

True Duplicates

A true duplicate is any page that is 100 percent identical (in content) to another page. These pages only differ by the URL.

[Figure: true duplicates, identical content at two different URLs]

Near Duplicates

A near duplicate differs from another page (or pages) by a very small amount—it could be a block of text, an image, or even the order of the content.

[Figure: near duplicates, two pages differing only slightly]

An exact definition of “near” is tough to pin down, but I'll discuss some examples in detail later.

Cross-domain Duplicates

Cross-domain duplicates occur when two or more websites share the same piece of content.

[Figure: cross-domain duplicates, the same content on two different sites]

These duplicates could be either true or near duplicates. Contrary to what some people believe, cross-domain duplicates can be a problem even for legitimate, syndicated content.

Tools for Fixing Duplicates

This may seem out of order, but I want to discuss the tools for dealing with duplicates before I dive into specific examples. That way, I can recommend the appropriate tools to fix each example without confusing anyone.

404 (Not Found)

Of course, the simplest way to deal with duplicate content is to just remove it and return a 404 error. If the content really has no value to visitors or search, and if it has no significant inbound links or traffic, then total removal is a perfectly valid option.

301-Redirect

Another way to remove a page is via a 301-redirect. Unlike a 404, the 301 tells visitors (humans and bots) that the page has permanently moved to another location. Human visitors seamlessly arrive at the new page. From an SEO perspective, most of the inbound link equity is also passed to the new page. If your duplicate content has a clear canonical URL, and the duplicate has traffic or inbound links, then a 301-redirect may be a good option.
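
How you implement the redirect depends on your server. As a minimal sketch, assuming an Apache server and placeholder page names, a single-page 301 in an .htaccess file could look like this:

# Permanently redirect the duplicate page to its canonical counterpart
Redirect 301 /dupe-page.htm http://www.example.com/canonical-page.htm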

Robots.txt

Another option is to leave the duplicate content available for human visitors, but block it for search crawlers. The oldest and probably still easiest way to do this is with a robots.txt file (generally located in your root directory). It looks something like this:

User-agent: *

Disallow: /dupe-page.htm

Disallow: /dupe-folder/

One advantage of robots.txt is that it's relatively easy to block entire folders or even URL parameters. The disadvantage is that it's an extreme and sometimes unreliable solution. While robots.txt is effective for blocking uncrawled content, it's not great for removing content already in the index. The major search engines also seem to frown on its overuse, and don't generally recommend robots.txt for duplicate content.

Meta Robots

You can also control the behavior of search bots at the page level, with a header-level directive known as the Meta Robots tag (or sometimes Meta Noindex). In its simplest form, the tag looks something like this:

<head>

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />

</head>

This page-level directive tells search bots not to index this particular page or follow links on it. Anecdotally, I find it a bit more SEO-friendly than robots.txt, and because the tag can be created dynamically with code, it can often be more flexible.

The other common variant for Meta Robots is the content value “NOINDEX, FOLLOW”, which allows bots to crawl the paths on the page without adding the page to the search index. This can be useful for pages like internal search results, where you may want to block certain variations (I'll discuss this more later) but still follow the paths to product pages.
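
For reference, that variant is identical to the earlier tag except for the content value:

<head>
<meta name="ROBOTS" content="NOINDEX, FOLLOW" />
</head>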

One quick note: There is no need to ever add a Meta Robots tag with “INDEX, FOLLOW” to a page. All pages are indexed and followed by default (unless blocked by other means).

News publishers who syndicate their content can ensure that only the original versions of their articles appear in Google News. They can do this by directing their syndication partners to use a Meta Robots tag that disallows Google News from indexing the syndicated version of the article. The tag looks like this:

<head>

<meta name="GOOGLEBOT-NEWS" content="NOINDEX" />

</head>

Using this tag alone will prevent syndicated content from appearing in Google News search results, but still allow the content to be indexed by other crawlers. If a publisher would like to restrict syndicated content from appearing in any type of search results, they should have their syndication partners use the universal Meta Noindex tag instead.

News publishers can also point Google News toward the original version of their articles by using the Rel=Canonical directive, just as they would to specify a canonical version of any other type of page. For details on how to use the Rel=Canonical directive, see the following section.

Editor's Note: In November of 2010, Google introduced a set of “experimental” tags for publishers of syndicated content to indicate the original source of a republished article. The Meta Syndication-Source directive, shown following, was deprecated in 2012.

<head>

<meta name="syndication-source" content="http://example.com/news" />

</head>

Rel=Canonical

In 2009, the search engines banded together to create the Rel=Canonical directive, sometimes called just Rel-canonical or the Canonical Tag. This allows webmasters to specify a canonical version for any page. The tag goes in the page header (like Meta Robots), and a simple example looks like this:

<head>

<link rel="canonical" href="http://www.example.com" />

</head>

When search engines arrive on a page with a canonical tag, they attribute the page to the canonical URL, regardless of the URL they used to reach the page. So, for example, if a bot reached the above page using the URL www.example.com/index.html, the search engine would not index the additional, non-canonical URL. Typically, it seems that inbound link equity is also passed through the canonical tag.

It's important to note that you need to clearly understand what the proper canonical page is for any given website template. Canonicalizing your entire site to just one page, or the wrong pages, can be very dangerous.

Google URL Removal

In Google Webmaster Tools (GWT), you can request that an individual page (or directory) be manually removed from the index. Click Optimization in the lefthand nav, and then choose Remove URLs from the dropdown list and you'll see a button called “Create a new removal request” in the main content area. Click that button and you'll see something like this:

[Figure: the Remove URLs request form in Google Webmaster Tools]

Since this tool only removes one URL or path at a time, completely at Google's discretion, it's usually a last-ditch approach to fixing duplicate content. If you want to remove a page permanently, you need to 404 it, block it with robots.txt, or Meta Noindex it before requesting removal. Pages removed via GWT that are not blocked by other methods can reappear in the index after 90 days.

Google Parameter Blocking

You can also use GWT to specify URL parameters that you want Google to ignore (which essentially blocks indexation of pages with those parameters). If you click Site Configuration, followed by URL parameters, you'll get a list that looks something like this:

[Figure: the URL Parameters list in Google Webmaster Tools]

This list shows URL parameters that Google has detected, as well as the settings for how those parameters should be crawled. Keep in mind that the “Let Googlebot decide” setting doesn't reflect other blocking tactics, like robots.txt or Meta Robots. If you click Edit, you'll see the following options:

[Figure: the parameter editing options in Google Webmaster Tools]

Personally, I find the wording a bit confusing. “Yes” means the parameter is important and should be indexed, while “No” means the parameter indicates a duplicate. The GWT tool seems to be effective (and can be fast), but I don't usually recommend it as a first line of defense. It won't impact other search engines, and it can't be read by SEO tools and monitoring software. It could also be modified by Google at any time.

Bing URL Removal

Bing Webmaster Center (BWC) has tools very similar to GWT's options. In fairness to Bing, I think their parameter blocking tool came before Google's version. To request a URL removal in Bing, select the Index tab, followed by Block URLs and Block URL and Cache. You'll get a pop-up that looks like this:

[Figure: the Block URLs dialog in Bing Webmaster Center]

BWC actually gives you a wider range of options, including blocking a directory and your entire site. Obviously, that last one usually isn't a good idea.

Bing Parameter Blocking

In the same section of BWC (Index) is an option called URL Normalization. The name implies that Bing treats this more like canonicalization, but there's only one option, ignore. Like Google, you get a list of auto-detected parameters that you can add or modify.

[Figure: the URL Normalization settings in Bing Webmaster Center]

As with the GWT tools, I'd consider the Bing versions to be a last resort. Generally, I'd only use these tools if other methods have failed, and one search engine is just giving you grief.

Rel=Prev and Rel=Next

In September of 2011, Google gave us a new tool for fighting a particular form of near-duplicate content known as paginated search results. I describe the problem in more detail in the next section. Essentially, though, paginated results are any content or searches where the results are broken up into chunks, with each chunk (say, ten results) having its own page/URL.

You can now tell Google how paginated content connects by using a pair of tags much like Rel=canonical. They're called Rel-Prev and Rel-Next. Implementation is a bit tricky, but here's a simple example.

<head>

<link rel="prev" href="http://www.example.com/search/2" />

<link rel="next" href="http://www.example.com/search/4" />

</head>

In this example, the search bot has landed on page 3 of search results, so you need two tags: a Rel-Prev pointing to page 2, and a Rel-Next pointing to page 4. Where it gets tricky is that you're almost always going to have to generate these tags dynamically, as your search results are probably driven by one template.
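
Note, too, that the endpoints of a series only get one tag each. Since page 1 has no previous page, a sketch of its header would carry only the Rel-Next tag (and, likewise, the last page would carry only Rel-Prev):

<head>
<link rel="next" href="http://www.example.com/search/2" />
</head>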

Google has pushed this approach harder over time, although the data on its effectiveness seems mixed. Bing didn't originally honor Rel=Prev/Next, but announced limited support for the tags in April of 2012. I'll briefly discuss other methods for dealing with paginated content in the next section.

Internal Linking

It's important to remember that your best tool for dealing with duplicate content is to not create it in the first place. Granted, that's not always possible, but if you find yourself having to patch dozens of problems, you may need to re-examine your internal linking structure and site architecture.

When you do correct a duplication problem, such as with a 301-redirect or the canonical tag, it's also important to make your other site cues reflect that change. It's amazing how often I see someone set a 301 or canonical to one version of a page, while they continue to link internally to the non-canonical version and fill their XML sitemap with non-canonical URLs. Internal links are strong signals, and sending mixed signals will only cause you problems.

Don't Do Anything

Finally, you can let the search engines sort it out. This is what Google recommended you do for years, actually. Unfortunately, in my experience, especially for large sites, this is almost always a bad idea. It's important to note, though, that not all duplicate content is a disaster, and Google certainly can filter some of it out without huge consequences. If you only have a few isolated duplicates floating around, leaving them alone is a perfectly valid option.

Rel=“alternate” hreflang=“x”

In 2012, Google introduced a new way of dealing with translated content and same-language content with regional variations (such as US English versus UK English). Implementation of these tags is complex and very situational, but here's a simple example of a site that has both an English and Spanish version (under a sub-domain):

<head>

<link rel="alternate" hreflang="en" href="http://www.example.com/" />

<link rel="alternate" hreflang="es" href="http://es.example.com/" />

</head>

These tags would tell Google where to find both the English and Spanish-language versions of the content. Like canonical tags, they are page-level and need to be placed on all relevant pages (pointing to the alternate language version of each specific page).

Examples of Duplicate Content

So, now that we've worked backwards and sorted out the tools for fixing duplicate content, what does duplicate content actually look like in the wild? I'm going to cover a wide range of examples that represent the issues you can expect to encounter on a real website.

www versus Non-www

For site-wide duplicate content, this is probably the biggest culprit. Whether you've got bad internal paths or have attracted links and social mentions to the wrong URL, you've got both the www version and non-www (root domain) version of your URLs indexed:

www.example.com

example.com

Most of the time, a 301-redirect is your best choice here. This is a common problem, and Google is good about honoring redirects for cases like these.
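
If your site runs on Apache, a site-wide rule in .htaccess along the lines of the following sketch (assuming mod_rewrite is enabled, and assuming www is your preferred version) consolidates the root-domain URLs onto www:

# Send any request for example.com to the www version, preserving the path
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]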

You may also want to set your preferred address in Google Webmaster Tools. Under Configuration and Settings, you should see a section called Preferred domain that looks like this:

[Figure: the Preferred domain setting in Google Webmaster Tools]

There's a quirk in GWT where, to set a preferred domain, you may have to create GWT profiles for both your www and non-www versions of the site. While this is annoying, it won't cause any harm. If you're having major canonicalization issues, I'd recommend it. If you're not, then you can leave well enough alone and let Google determine the preferred domain.

Staging Servers

While much less common than the preceding issue, this problem is often also caused by subdomains. In a typical scenario, you're working on a new site design for a relaunch, your dev team sets up a subdomain with the new site, and they accidentally leave it open to crawlers. What you end up with is two sets of indexed URLs that look something like this:

www.example.com

staging.example.com

Your best bet is to prevent this problem before it happens by blocking the staging site with robots.txt. If you find your staging site indexed, though, you'll probably need to 301-redirect those pages or Meta Noindex them.
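
For the staging subdomain, the robots.txt file (at staging.example.com/robots.txt) just needs to block everything:

User-agent: *
Disallow: /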

Trailing Slashes (“/”)

This is a problem people often have questions about, although it's less of an SEO issue than it once was. Technically, a URL with a trailing slash and the same URL without it are two different URLs, and a web server can serve them as two different pages. Here's a simple example:

www.example.com/products

www.example.com/products/

The first URL would represent a page, whereas the second URL would signal a folder. These days, almost all browsers automatically add the trailing slash behind the scenes and resolve both versions the same way. Google automatically canonicalizes these URLs in the majority of cases, but a 301-redirect is the preferred solution if you see duplicates in the index.

Secure (https) Pages

If your site has secure pages (designated by the https: protocol), you may find that both secure and non-secure versions are getting indexed. This most frequently happens when navigation links from secure pages—like shopping cart pages—also end up secured. This is usually due to relative paths creating variants like this:

www.example.com

https://www.example.com

Ideally, these problems are solved by the site architecture itself. In many cases, it's best to Noindex secure pages—shopping cart and check-out pages have no place in the search index. After the fact, though, your best option is a 301-redirect. Be cautious with any site-wide solutions—if you 301-redirect all “https:” pages to their “http:” versions, you could end up removing security entirely. This is a tricky problem to solve, and it should be handled carefully.
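
If you do go the redirect route on an Apache server, a rough .htaccess sketch might look like the following. It assumes your secure pages live under hypothetical /cart and /checkout paths, which are left alone, while every other https: URL is sent back to its http: version:

RewriteEngine On
# Only act on secure requests
RewriteCond %{HTTPS} on
# Leave the (hypothetical) cart and checkout paths secure
RewriteCond %{REQUEST_URI} !^/(cart|checkout)
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]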

Home page Duplicates

While the three preceding issues can all create home page duplicates, the home page has a couple unique problems of its own. The most typical problem is that both the root domain and the actual home page document name get indexed. For example:

www.example.com

www.example.com/index.htm

Although this problem can be solved with a 301-redirect, it's often a good idea to add a canonical tag on your home page. Home pages are uniquely afflicted by duplicates, and a proactive canonical tag can prevent a lot of problems.

Of course, it's important to also be consistent with your internal paths. If you want the root version of the URL to be canonical, but then link to /index.htm in your navigation, you're sending mixed signals to Google every time the crawlers visit.

Session IDs

Some websites (especially e-commerce platforms) tag each new visitor with a tracking parameter. On occasion, that parameter ends up in the URL and gets indexed, creating something like this:

www.example.com

www.example.com/?session=12345678

That example really doesn't do the problem justice, because in reality you can end up with a duplicate for every single session ID and page combination that gets indexed. Session IDs in the URL can easily add thousands of duplicate pages to your index.

The best option, if it is possible to implement on your particular site/platform, is to remove the session ID from the URL altogether and store it in a cookie. There are very few good reasons to create these URLs, and no reason to let bots crawl them. If that's not feasible, implementing the canonical tag site-wide is a good bet. If you really get stuck, you can block the parameter in Google Webmaster Tools and Bing Webmaster Center.
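
With the canonical-tag approach, the session-tagged URL simply points back to the clean version of itself. A sketch of the header on www.example.com/?session=12345678 would be:

<head>
<link rel="canonical" href="http://www.example.com/" />
</head>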

Affiliate Tracking

This problem looks a lot like session IDs and happens when sites provide a tracking variable to their affiliates. This variable is typically appended to landing page URLs, like so:

www.example.com

www.example.com/?affiliate=mozbot99

The damage is usually a bit less extreme than with session IDs, but it can still cause large-scale duplication, and the solutions are the same. Ideally, you can capture the affiliate ID in a cookie and 301-redirect to the canonical version of the page. Otherwise, you'll probably either need to use canonical tags or block the affiliate URL parameter.

Duplicate Paths

Having duplicate paths to a page is perfectly fine, but when duplicate paths generate duplicate URLs, then you've got a problem. Let's say a product page can be reached one of three ways:

www.example.com/electronics/ipad2

www.example.com/apple/ipad2

www.example.com/tag/favorites/ipad2

Here, the iPad2 product page can be reached by two categories and a user-generated tag. User-generated tags are especially problematic, because they can theoretically spawn unlimited versions of a page.

Ideally, these path-based URLs shouldn't be created at all. However a page is navigated to, it should only have one URL for SEO purposes. Some will argue that including navigation paths in the URL is a positive cue for site visitors, but even as someone with a usability background, I think the cons almost always outweigh the pros here.

If you already have variations indexed, then a 301-redirect or canonical tag are probably your best options. In many cases, implementing the canonical tag will be easier, since there may be too many variations to easily redirect. Long-term, though, you'll need to re-evaluate your site architecture.

Functional Parameters

Functional parameters are URL parameters that change a page slightly, but have no value for search, and are essentially duplicates. For example, let's say that all of your product pages have a printable version, and that each version has its own URL.

www.example.com/product.php?id=1234

www.example.com/product.php?id=1234&print=1

Here, the print=1 URL variable indicates a printable version, which normally would have the same content but a modified template. Your best bet is to not index these at all, with something like a Meta Noindex, but you could also use a canonical tag to consolidate these pages.

International Duplicates

These duplicates occur when you have content for different countries which share the same language, all hosted on the same root domain (it could be subfolders or subdomains). For example, you may have an English version of your product pages for the US, UK, and Australia.

www.example.com/us/product/ipad2

www.example.com/uk/product/ipad2

www.example.com/au/product/ipad2

Unfortunately, this one's a bit tough—in some cases, Google will handle it perfectly well and rank the appropriate content in the appropriate countries. In other cases, even with proper geo-targeting, they won't. It's often better to target the language itself than the country, but there are legitimate reasons to split off country-specific content, such as pricing.

If your international content does get treated as duplicate content, you may want to try using the hreflang attribute (http://mz.cm/YET7la).

Even though Google seems to be taking it more seriously, this can be a complex situation to resolve, as Google uses many cues for internationalization.

Search Sorts

So far, all of the examples I've given have been true duplicates. I'd like to dive into a few examples of near duplicates, since that concept is a bit fuzzy. A few common examples pop up with internal search engines, which tend to spin off many variants—sortable results, filters, and paginated results being the most frequent problems.

Search sort duplicates pop up whenever a sort (ascending/descending) creates a separate URL. While the two sorted results are technically different pages, they add no additional value to the search index and contain the same content, just in a different order. URLs might look like this:

www.example.com/search.php?keyword=ipad

www.example.com/search.php?keyword=ipad&sort=desc

In most cases, it's best just to block the sortable versions completely, usually by adding a Meta Noindex selectively to pages called with that parameter. In a pinch, you could block the sort parameter in Google Webmaster Tools and Bing Webmaster Center.
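
In practice, that means the header of the sorted URL carries the tag while the default version doesn't. A sketch for the sort=desc URL above, using the NOINDEX, FOLLOW variant so that the result links still get crawled:

<head>
<meta name="ROBOTS" content="NOINDEX, FOLLOW" />
</head>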

Search Filters

Search filters are used to narrow an internal search—it could be price, color, features, etc. Filters are very common on e-commerce sites that sell a wide variety of products. Search filter URLs look a lot like search sorts, in many cases:

www.example.com/search.php?category=laptop

www.example.com/search.php?category=laptop&price=1000

The solution here is similar to the preceding one—don't index the filters. As long as Google has a clear path to products, indexing every variant usually causes more harm than good.

Search Pagination

Pagination is an easy problem to describe and an incredibly difficult one to solve. Any time you split internal search results into separate pages, you have paginated content. The URLs are easy enough to visualize:

www.example.com/search.php?category=laptop

www.example.com/search.php?category=laptop&page=2

Of course, over hundreds of results, one search can easily spin out dozens of near duplicates. While the results themselves differ, many important features of the pages (titles, meta descriptions, headers, copy, templates, etc.) are identical. Add to that the fact that Google isn't a big fan of “search within search” (i.e., having their search pages land on yours).

In the past, Google has said to let it sort pagination out—problem is, Google hasn't done it very well. Recently, Google introduced Rel-Prev and Rel-Next, as described earlier. Google has championed this solution, but the effectiveness of it can be very situational.

You have three other, viable options (in my opinion), although how and when they're viable depends a lot on the situation:

1. You can Meta Noindex, Follow pages 2+ of search results. Let Google crawl the paginated content, but don't let it be indexed.

2. You can create a View All page that links to all search results at one URL, and let Google auto-detect it. This seems to be Google's other preferred option.

3. You can create a View All page and set the canonical tag of paginated results back to that page. This is unofficially endorsed, but the pages aren't really duplicates in the traditional sense, so some claim it violates the intent of Rel-canonical.
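
To make option 3 concrete: assuming a hypothetical view-all URL for the laptop search above, each paginated page would point its canonical tag at that single page, something like this on page 2:

<head>
<link rel="canonical" href="http://www.example.com/search.php?category=laptop&view=all" />
</head>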

Pagination for SEO is a very difficult topic, and can be complicated by search sorts, filters, and other variables. Often, multiple solutions have to be brought into play.

Product Variations

Product variant pages are pages that branch off from the main product page and only differ by one feature or option. For example, you might have a page for each color a product comes in:

www.example.com/product/ipod/nano

www.example.com/product/ipod/nano/blue

www.example.com/product/ipod/nano/red

It can be tempting to want to index every color variation, hoping it pops up in search results, but in most cases I think the cons outweigh the pros. If you have a handful of product variations and are talking about dozens of pages, fine. If product variations spin out into hundreds or thousands, though, it's best to consolidate. Although these pages aren't technically true duplicates, I think it's okay to Rel-canonical the options back up to the main product page.
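
For example, the headers of the blue and red variant pages above would each point back to the main product page:

<head>
<link rel="canonical" href="http://www.example.com/product/ipod/nano" />
</head>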

I purposely used “static” URLs in this example to demonstrate a point. Just because a URL doesn't have parameters, that doesn't make it immune to duplication. Static URLs (i.e., parameter-free URLs) may look prettier, but they can be duplicates just as easily as dynamic URLs.

Geo-keyword Variations

Once upon a time, Local SEO meant just copying all of your pages hundreds of times, adding a city name to the URL, and swapping out that city in the page copy. It created URLs like these:

www.example.com/product/ipad2/new-york

www.example.com/product/ipad2/chicago

www.example.com/product/ipad2/miami

These days, not only is Local SEO a lot more sophisticated, but these pages are almost always going to look like near duplicates. If you have any chance of ranking, you're going to need to invest in legitimate, unique content for every geographic region you spin out. If you aren't willing to make that investment, then don't create the pages. They'll probably backfire.

Other “Thin” Content

This isn't really an example, but I wanted to stop and explain a word we throw around a lot when it comes to content: thin. While thin content can mean a variety of things, I think many examples of thin content are near duplicates, like product variations. Whenever you have pages that vary by only a tiny percentage of content, you risk those pages looking low-value to Google. If those pages are heavy on ads (with more ads than unique content), you're at even more risk. When too much of your site is thin, it's time to revisit your content strategy.

Syndicated Content

These last three examples all relate to cross-domain content. Here, the URLs don't really matter—they could be wildly different. Syndicated content and scraped content (see the following section) differ only by intent. Syndicated content is any content you use with permission from another site. However you retrieve and integrate it, that content is available on another site (and, often, many sites).

While syndication is legitimate, it's still likely that one or more copies will get filtered out of search results. You could roll the dice and see what happens, but conventional SEO wisdom says that you should link back to the source and probably set up a cross-domain canonical tag. A cross-domain canonical looks just like a regular canonical, but with a reference to someone else's domain.
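
So, if you republished an article whose original lives at (hypothetically) http://original-site.com/article, the copy on your own domain would carry:

<head>
<link rel="canonical" href="http://original-site.com/article" />
</head>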

Of course, a cross-domain canonical tag means that, assuming Google honors the tag, your page won't get indexed or rank. In some cases, that's fine—you're using the content for its value to visitors. Practically, I think it depends on the scope. If you occasionally syndicate content to beef up your own offerings but also have plenty of unique material, then link back and leave it alone. If a larger part of your site is syndicated content, then you could find yourself running into trouble. Unfortunately, using the canonical tag means you'll lose the ranking ability of that content, but it could keep you from getting penalized or having Panda-related problems.

Scraped Content

Scraped content is just like syndicated content, except that you didn't ask permission (and might even be breaking the law). The best solution: QUIT BREAKING THE LAW!

Seriously, no de-duping solution is going to satisfy the scrapers among you, because most solutions will knock your content out of ranking contention. The best you can do is pad the scraped content with as much of your own, unique content as possible.

Cross-ccTLD Duplicates

Finally, it's possible to run into trouble when you copy same-language content across countries—see the earlier international duplicates example—even with separate Top-Level Domains (TLDs). Fortunately, this problem is fairly rare, but we see it with English-language content and even with some content in European languages. For example, I frequently see questions about Dutch content on Dutch and Belgian domains ranking improperly.

Unfortunately, there's no easy answer here, and most of the solutions aren't traditional duplicate-content approaches. In most cases, you need to work on your targeting factors and clearly show Google that the domain is tied to the country in question.

Which URL Is Canonical?

I'd like to take a quick detour to discuss an important question—whether you use a 301-redirect or a canonical tag, how do you know which URL is actually canonical? I often see people making a mistake like this:

<head>

<link rel="canonical" href="http://www.example.com/product.php" />

</head>

The problem is that product.php is just a template—you've now collapsed all of your products down to a single page (that probably doesn't even display a product). In this case, the canonical version probably includes a parameter, like id=1234.

The canonical page isn't always the simplest version of the URL—it's the simplest version of the URL that generates UNIQUE content. Let's say you have these three URLs that all generate the same product page:

www.example.com/product.php?id=1234

www.example.com/product.php?id=1234&print=1

www.example.com/product.php?id=1234&session=12345678

Two of these versions are essentially duplicates, and the print and session parameters represent variations on the main product page that should be de-duped. The id parameter is essential to the content, though—it determines which product is actually being displayed.
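
In other words, all three URLs should carry a canonical tag that keeps the id parameter and drops everything else:

<head>
<link rel="canonical" href="http://www.example.com/product.php?id=1234" />
</head>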

So, consider yourself warned. As much trouble as rampant duplicates can be, bad canonicalization can cause even more damage in some cases. Plan carefully, and make absolutely sure you select the correct canonical versions of your pages before consolidating them.

Tools for Diagnosing Duplicates

So, now that you recognize what duplicate content looks like, how do you go about finding it on your own site? Here are a few tools to get you started—I won't claim it's a complete list, but it covers the bases.

Google Webmaster Tools

In Google Webmaster Tools, you can pull up a list of duplicate title tags and meta descriptions that Google has crawled. While these don't tell the whole story, they're a good starting point. Many URL-based duplicates will naturally generate identical metadata. In your GWT account, navigate to Optimization, followed by HTML Improvements, and you'll see a table like this:

[Figure: the HTML Improvements report in Google Webmaster Tools]

You can click on “Duplicate meta descriptions” and “Duplicate title tags” to pull up a list of the duplicates. This is a great first stop for finding your trouble spots.

Google's Site: Command

When you already have a sense of where you might be running into trouble and need to take a deeper dive, Google's site: command is a very powerful and flexible tool. What really makes site: powerful is that you can use it in conjunction with other search operators.

Let's say, for example, that you're worried about home page duplicates. To find out if Google has indexed any copies of your home page, you could use the site: command with the intitle: operator, like this:

[Figure: a Google query combining the site: and intitle: operators]
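
In text form, that query would look something like this (the domain and home page title here are just placeholders):

site:example.com intitle:"Your Home Page Title"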

Put the title in quotes to capture the full phrase, and always use the root domain (leave off www) when making a wide sweep for duplicate content. This will detect both www and non-www versions, as well as any other subdomains.

Another powerful combination is site: plus the inurl: operator. You could use this to detect parameters, such as the search-sort problem mentioned above.

[Figure: a Google query combining the site: and inurl: operators]
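
In text form, a query targeting the sort parameter from the earlier example might look like this:

site:example.com inurl:sort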

The inurl: operator can also detect the protocol used, which is handy for finding out whether any secure (https:) copies of your pages have been indexed.

[Figure: a Google query using inurl: to find indexed https pages]
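
That check looks something like this:

site:example.com inurl:https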

You can also combine the site: operator with regular search text, to find near duplicates (such as blocks of repeated content). To search for a block of content across your site, just include it in quotes.

[Figure: a Google query combining site: with a quoted block of content]
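
For example (the quoted text is a stand-in for a sentence lifted from the page you're checking):

site:example.com "a distinctive sentence copied from the page in question"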

I should also mention that searching for a unique block of content in quotes is a cheap and easy way to find out if people have been scraping your site. Just leave off the site: operator and search for a long or unique block entirely in quotes.

Of course, these are just a few examples, but if you really need to dig deep, these simple tools can be used in powerful ways. Ultimately, the best way to tell if you have a duplicate content problem is to see what Google sees.

Your Own Brain

Finally, it's important to remember to use your own brain. Finding duplicate content often requires some detective work, and relying too heavily on tools can leave some gaps in what you find. One critical step is to systematically navigate your site to find where duplicates are being created. For example, does your internal search have sorts and filters? Do those sorts and filters get translated into URL variables, and are they crawlable? If they are, you can use the site: command to dig deeper. Even finding a handful of trouble spots using your own sleuthing skills can end up revealing thousands of duplicate pages, in my experience.

I Hope That Covers It

If you've made it this far—congratulations—you're probably as exhausted as I am. I hope that covers everything you'd want to know about the current state of duplicate content. Some of these topics, like pagination, are extremely tricky in practice, and there's often not one “right” answer.
