Reducing Duplicate Content with ASP.NET MVC

As you're all no doubt aware, ASP.NET MVC recently went RTM. This brings the MVC style of coding, made very popular by Ruby on Rails, to the ASP.NET world. I've been eager to start using MVC for months, but I held off until I knew the API was locked down so that I wouldn't have to change anything later.

Unfortunately, like WebForms, MVC has some "issues" with regard to duplicate content, which make it less SEO-friendly than it could be.

What do you mean, Duplicate Content?

Duplicate content is just that - the same content repeated on multiple pages/sites. This might not sound like a big deal, but it's not something search engines like. They don't want the search results to show the same content multiple times across different websites, so they often penalise or hide duplicate content. Additionally, if you have two pages with the same content, your inbound links might become split between the two - reducing the PageRank passed to either.

What's this got to do with ASP.NET MVC?

Unfortunately ASP.NET MVC makes it easy to have the same content indexed multiple times. I've listed the main problems below.

Case-Sensitivity. In ASP.NET (or rather IIS and Windows), URLs are not case sensitive. That means you can write Default.asp, default.asp or even DeFaUlT.aSp and still get the same page. While you'll probably stick to the same case within your website, it wouldn't be hard for someone to create links to your site with different casing (e.g. they might have CAPS LOCK turned on).

Default Documents. Most websites have a default document set up to serve when a filename is not provided in the request. E.g. http://mydomain.com/ might actually serve up http://mydomain.com/default.asp, but it won't tell the browser that's what it did. It will serve it up as if the two are different URLs.

Trailing Slashes. While the above problems are general ASP.NET/IIS issues, trailing slashes are something that only really become a problem with MVC or other URL rewriting/routing. In ASP.NET if you requested http://mydomain.com/files and you had a folder named files, IIS would issue a redirect to mydomain.com/files/. However, in ASP.NET MVC the URL routing will treat trailing slashes the same as requests without. So http://mydomain.com/controller/action is exactly the same as http://mydomain.com/controller/action/ and therefore results in duplicate content.

Query Strings. Query strings can be a big problem for duplicate content. Imagine if you can add ?sort=field to the end of your page to have a table re-ordered. To a search engine this looks like another page, but the content is mostly the same. Fortunately, ASP.NET MVC doesn't really use query strings thanks to the excellent URL routing.

So, what can we do?

Lowercase URLs. We can force all requests to our application to be lowercase by catching them in BeginRequest in Global.asax and redirecting to the lowercase version if they contain any uppercase characters.

protected void Application_BeginRequest(Object sender, EventArgs e)
{
	// Get the requested URL so we can do some validation on it.
	// We exclude the query string, and add that later, so it's not included
	// in the validation
	string url = Request.Url.Scheme + "://" + Request.Url.Authority + Request.Url.AbsolutePath;

	// If we've got uppercase characters, fix
	// (Regex needs "using System.Text.RegularExpressions;" at the top of the file)
	if (Regex.IsMatch(url, @"[A-Z]"))
		PermanentRedirect(url.ToLowerInvariant() + Request.Url.Query);
}

/// <summary>
/// Redirects with a 301 header to pass along any incoming
/// PageRank/link value.
/// </summary>
/// <param name="url">The URL to redirect to</param>
private void PermanentRedirect(string url)
{
	Response.Clear();
	Response.Status = "301 Moved Permanently";
	Response.AddHeader("Location", url);
	Response.End();
}

Now if anyone requests a URL with uppercase characters, they'll be redirected with a 301 redirect. This works great, but we have a problem: all URLs generated internally by MVC will continue to use Action and Controller names in Pascal case (assuming that's how your classes are named). This means every link within our site will cause two requests (the first being a redirect). To fix this, we can override the default behaviour for creating URLs. We'll create a new extension method for the RouteCollection class called MapRouteLowercase which, instead of creating a Route, will create an instance of a new class called LowercaseRoute. This class overrides the GetVirtualPath method to lowercase the URL before passing it back. I can't take credit for this code; I pretty much just copied it from Graham O'Neale's blog.

public class LowercaseRoute : System.Web.Routing.Route
{
	public LowercaseRoute(string url, IRouteHandler routeHandler)
		: base(url, routeHandler) { }
	public LowercaseRoute(string url, RouteValueDictionary defaults, IRouteHandler routeHandler)
		: base(url, defaults, routeHandler) { }
	public LowercaseRoute(string url, RouteValueDictionary defaults, RouteValueDictionary constraints, IRouteHandler routeHandler)
		: base(url, defaults, constraints, routeHandler) { }
	public LowercaseRoute(string url, RouteValueDictionary defaults, RouteValueDictionary constraints, RouteValueDictionary dataTokens, IRouteHandler routeHandler)
		: base(url, defaults, constraints, dataTokens, routeHandler) { }

	public override VirtualPathData GetVirtualPath(RequestContext requestContext, RouteValueDictionary values)
	{
		VirtualPathData path = base.GetVirtualPath(requestContext, values);

		if (path != null)
			path.VirtualPath = path.VirtualPath.ToLowerInvariant();

		return path;
	}
}

public static class RouteCollectionExtensions
{
	public static void MapRouteLowercase(this RouteCollection routes, string name, string url, object defaults)
	{
		routes.MapRouteLowercase(name, url, defaults, null);
	}

	public static void MapRouteLowercase(this RouteCollection routes, string name, string url, object defaults, object constraints)
	{
		if (routes == null)
			throw new ArgumentNullException("routes");

		if (url == null)
			throw new ArgumentNullException("url");

		var route = new LowercaseRoute(url, new MvcRouteHandler())
		{
			Defaults = new RouteValueDictionary(defaults),
			Constraints = new RouteValueDictionary(constraints)
		};

		if (String.IsNullOrEmpty(name))
			routes.Add(route);
		else
			routes.Add(name, route);
	}
}

You can put these classes anywhere. Because MapRouteLowercase is an extension method, you can just call it on your RouteCollection instance in place of the existing MapRoute call in your Global.asax.

// Home stuff
routes.MapRouteLowercase(
	"Default",
	"{page}",
	new { controller = "Home", action = "Index", page = 1 },
	new { page = @"\d+" }
);

Default Documents. While this issue doesn't affect MVC in the same way, there's a very similar problem. In ASP.NET MVC the default routing is {controller}/{action} but it sets a default action of Index. That means on a newly-created project, both /Home/Index and /Home will serve up the same content.

To work around this, and provide some nicer URLs, I changed the routing a little so that my default actions were mapped to the root and a separate route dealt with the homepage (which accepts page numbers, to allow browsing to older posts).

public static void RegisterRoutes(RouteCollection routes)
{
	routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

	// Posts
	routes.MapRouteLowercase(
		"Posts",
		"posts/{url}",
		new { controller = "Post", action = "Display" }
	);

	// Tags
	routes.MapRouteLowercase(
		"Tags",
		"tags/{url}/{page}",
		new { controller = "Tag", action = "Display", page = 1 },
		new { page = @"\d+" }
	);

	// Home stuff
	routes.MapRouteLowercase(
		"Default",
		"{page}",
		new { controller = "Home", action = "Index", page = 1 },
		new { page = @"\d+" }
	);

	// Other Home actions, mapped to the root
	routes.MapRouteLowercase(
		"Home",
		"{action}",
		new { controller = "Home", action = "" }
	);

	// Catch-all for any unmatched URL
	routes.MapRouteLowercase(
		"Error Catch-All",
		"{*path}",
		new { controller = "Home", action = "NotFound" } // NotFound doesn't exist, so HandleUnknownAction will be fired
	);
}

Trailing Slashes. To avoid trailing slashes and a few other minor issues (such as people adding /1 to a URL to get page 1, which is served up without the /1) I added some additional rules to my Global.asax as below.

protected void Application_BeginRequest(Object sender, EventArgs e)
{
	// Get the requested URL so we can do some validation on it.
	// We exclude the query string, and add that later, so it's not included
	// in the validation
	string root = Request.Url.Scheme + "://" + Request.Url.Authority;
	string path = Request.Url.AbsolutePath;
	string query = Request.Url.Query;
	string url = root + path;

	// If we're not a request for the root, and end with a slash, strip it off
	if (path != "/" && path.EndsWith("/"))
		PermanentRedirect(url.Substring(0, url.Length - 1) + query);

	// If we end with /1 we're a page 1, and don't need (shouldn't have) the page number
	else if (path.EndsWith("/1"))
		PermanentRedirect(url.Substring(0, url.Length - 2) + query);

	// If we have double-slashes, strip them out of the path only
	// (replacing on the whole URL would also break the "//" after the scheme)
	else if (path.Contains("//"))
		PermanentRedirect(root + path.Replace("//", "/") + query);

	// If we've got uppercase characters, fix
	else if (Regex.IsMatch(url, @"[A-Z]"))
		PermanentRedirect(url.ToLowerInvariant() + query);
}

This seems to stop many of the issues I came up with. However, the double-slash seems to be passed through (in AbsolutePath) as a single slash here (Vista/IIS7), so that check never fires. I've left it in just in case this behaves differently on other web servers.
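The checks above each issue their own redirect, so a request that breaks several rules at once can bounce through more than one 301. As a rough sketch only, the same rules could be folded into one pure function that returns the tidied path, leaving BeginRequest to redirect once if the result differs from the requested path. UrlNormalizer and Normalize are hypothetical names of my own, not part of the code above or of ASP.NET, and the sketch assumes the path contains nothing (such as case-sensitive identifiers) that must keep its casing.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of ASP.NET): applies every rule in one pass
// so that at most one redirect is needed per request.
public static class UrlNormalizer
{
    public static string Normalize(string absolutePath)
    {
        if (string.IsNullOrEmpty(absolutePath))
            return "/";

        // Lowercase with the invariant culture so the result doesn't depend on server locale
        string path = absolutePath.ToLowerInvariant();

        // Collapse any run of slashes into a single slash
        path = Regex.Replace(path, "/{2,}", "/");

        // Strip a trailing slash, unless this is the root
        if (path.Length > 1 && path.EndsWith("/", StringComparison.Ordinal))
            path = path.Substring(0, path.Length - 1);

        // Page 1 doesn't need (shouldn't have) the page number
        if (path.EndsWith("/1", StringComparison.Ordinal))
            path = path.Substring(0, path.Length - 2);

        // Stripping "/1" from the root leaves an empty string, so restore the root
        return path.Length == 0 ? "/" : path;
    }
}
```

In Application_BeginRequest you would then compare Normalize(Request.Url.AbsolutePath) with Request.Url.AbsolutePath using an ordinal comparison and, if they differ, call PermanentRedirect once with the scheme, authority, normalised path and original query string.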

Is there anything else I should do?

As of February 2009, Google, Yahoo, ASK and Microsoft Live Search support a new canonical tag (strictly speaking a link element, <link rel="canonical">, rather than a meta tag). It lets you state on a page that the page is duplicate content and that any incoming links should instead be attributed to another page. If your site has query strings or other ways for multiple URLs to serve up the same content, I would recommend inserting this tag to make sure the search engines choose your preferred page.
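For illustration, here is a minimal sketch of a helper that builds the tag for a given URL. CanonicalLink and Build are hypothetical names of my own, and the sketch assumes .NET 4 or later for WebUtility.HtmlEncode; on older versions HttpUtility.HtmlAttributeEncode from System.Web would be the nearest equivalent.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of ASP.NET MVC): builds the canonical
// link element that belongs in the <head> of a page.
public static class CanonicalLink
{
    public static string Build(string canonicalUrl)
    {
        if (string.IsNullOrEmpty(canonicalUrl))
            throw new ArgumentException("A canonical URL is required.", "canonicalUrl");

        // HTML-encode the URL so characters such as & and " are safe inside the attribute
        return "<link rel=\"canonical\" href=\"" + WebUtility.HtmlEncode(canonicalUrl) + "\" />";
    }
}
```

Calling CanonicalLink.Build("http://mydomain.com/posts/my-post") returns <link rel="canonical" href="http://mydomain.com/posts/my-post" />, which you would write into the head section of every page that serves that content, whatever URL it was requested by.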

Related Reading

Ok, so this one's not related to duplicate content, but it's a great ASP.NET MVC resource, so it's worth taking a look.

Comments

Derek Fowler, Thursday, 16 April 2009

Nice post.

In the uppercase checking bit it would be quicker to just ToLower() everything without performing the Regex check. Even if the Regex is compiled, just doing a ToLower() on its own takes about a third of the time.

Danny Tuppeny, Thursday, 16 April 2009

Good point. That part of the code actually came from Graham O'Neale's blog; I just added to it.

The code has evolved slightly from the above, since it now handles legacy redirects (to take care of my .html pages here on Blogger and 301 them to the nicer format in my new blog) and other things.

I've got plans to change that code again because currently it'll perform multiple chain-redirects if your request fails multiple conditions (eg. it's uppercase, and the wrong domain, and is an old redirect). The new code will fix everything in one redirect.

Anonymous, Friday, 24 April 2009

This post was helpful. However, I think the argument can be made that 301-redirecting URLs containing upper case letters is a bit of overkill.

If you use all-lowercase URLs to begin with (the recommended practice), this problem mostly goes away. If all your URLs are lowercase, then all your inbound links tend to be lower-case. Why? Because people linking to you cut-and-paste those URLs 99% of the time.

Also, case sensitivity is not an issue for the Googlebot when it spiders your site. This issue only affects your inbound links from other sites. And again, most people just cut and paste.

I understand the mentality of 301-redirecting all bogus or aliased URLs and I've used it frequently (for example, the trailing slash issue, and to route from www- to non-www- and vice-versa).

At the same time, there's such a thing as too much. URL aliasing is one of those things which sounds like it causes a lot of problems in theory, but once you get a dozen or so sites under your belt you discover that by and large, Google has actually done their job pretty well.

A great example of this: there's *very little* SEO value to the www-to-non-www redirect anymore. Approaching zero. Google has two mechanisms to determine a "canonical" page. Even without those mechanisms, the idea that "OMG half my link juice is going to be squandered" is just patently false. That's not how PageRank was calculated back in the day, and of course, nowadays PageRank has taken a back seat to other indicators.

My point here is that if you only expose lowercase URLs to the outside world, most of these issues go away.

Danny Tuppeny, Friday, 24 April 2009

> 301-redirecting URLs containing upper case letters is a bit of overkill

It has advantages and it's only 2 lines of code. I wouldn't say it's overkill.

> Also, case sensitivity is not an issue for the Googlebot when it spiders your site.

That's not true. There are websites online that have the same page indexed with different cases (correctly, as per the W3C spec):

"URLs in general are case-sensitive (with the exception of machine names). There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive."

> A great example of this: there's *very little* SEO value to the www-to-non-www redirect anymore. Approaching zero.

Again, "very little" is not none. It's trivial to do, why not?

> My point here is that if you only expose lowercase URLs to the outside world, most of these issues go away.

"Most". Again, every little helps. I like my URLs to be consistent and I like to avoid issues that can be solved so easily, even if they are small.

Nowhere in my post does it say "OMG U MUST DO THIS OR UR WEBSITE IS DOOMED", it's simply there for people that want to do it. It's very little effort and gives you consistent URLs. Some people want it - let them have it.

Kevin, Friday, 22 May 2009

Awesome, awesome post. Thanks!

In my experience, tight control of your URLs, status codes, and overall site architecture from the bot perspective is essential to making the most of inbound links and linkjuice flow throughout the site. Don't leave it up to the engines to figure it out. Do it for them and you'll head off indexing problems and be better poised for success.

introspective, Monday, 13 July 2009

I used to publish my articles, but now I wonder whether I should stop because of the risk of a duplicate content penalty. Should I stop publishing my articles on article directories?

Danny Tuppeny, Friday, 24 July 2009

@introspective: I don't know :(

Everyone seems to do it (even blogs of Google and their employees!), but it should be treated as duplicate content. I'm not sure what Google expect us to do, nor how badly this affects us. I would be very surprised if they couldn't handle blog-style listings/categories/tags/archives.

I'd say don't worry about it. It's probably an exception you can live with, but I'm no SEO expert.

Chris, Wednesday, 25 November 2009

Excellent post, very useful.
