Prevent Crawlers On Your Data Service

by KodefuGuru 29. June 2009 17:01

At ConvergeSC 2009, an attendee asked me how to prevent crawlers from trawling an ADO.NET Data Service. I explained as best I could, but I felt a blog post might make it clearer.

To a bot, your data service looks like any other site on the web. Sure, it’s reading Atom, POX, JSON, or some other bizarre format you’ve concocted, but it’s still data coming down over HTTP and discoverable via links. There are ways to discourage a bot from crawling, but any information on the web that doesn’t require authentication can be crawled.

The first way to keep a legitimate bot from crawling your service is to put a robots.txt file in the root of the site. Inside the file, put the following lines:

User-agent: *
Disallow: /

This tells compliant bots not to crawl any part of the site. If your service coexists with a site you want crawled, change the Disallow value to /MyService.svc/. Be sure to include the closing slash so other pages aren’t accidentally matched.
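To see why the closing slash matters, here is a minimal sketch using Python's standard urllib.robotparser module, which applies the same prefix matching a compliant bot would. The host example.com, the page MyService.svc-help.html, and the bot name AnyBot are made up for illustration; only /MyService.svc/ comes from the post.

```python
from urllib.robotparser import RobotFileParser


def build_parser(disallow_path):
    """Return a parser loaded with a one-rule robots.txt for all user agents."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return parser


# Hypothetical URLs: one inside the service, one unrelated page whose
# name happens to start with the same characters.
service_url = "http://example.com/MyService.svc/Customers"
other_page = "http://example.com/MyService.svc-help.html"

with_slash = build_parser("/MyService.svc/")
without_slash = build_parser("/MyService.svc")

# With the closing slash, only paths under the service are blocked.
print(with_slash.can_fetch("AnyBot", service_url))   # False
print(with_slash.can_fetch("AnyBot", other_page))    # True

# Without it, the rule is a bare prefix and also catches the other page.
print(without_slash.can_fetch("AnyBot", other_page))  # False
```

This only models a bot that chooses to obey robots.txt; nothing here stops a crawler that ignores the file.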

The conference attendee seemed to be concerned specifically about anchor tags and AJAX. If you’re using the onclick event of the anchor tag, most spiders will not follow it. However, if the URI is in an href, a crawler will pick it up. Bing and Google honor a rel attribute with the “nofollow” value and will not follow the link to index the page. Yahoo and Ask, however, will still follow and index a link with that attribute.
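The difference between those three kinds of anchors can be sketched with Python's standard html.parser module. This is a toy model of a crawler that honors nofollow, not how any particular search engine works; the markup, the loadProducts() handler, and the LinkCollector class are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href targets the way a simple, well-behaved crawler might."""

    def __init__(self):
        super().__init__()
        self.followed = []  # links this crawler would queue for fetching
        self.skipped = []   # links it leaves alone because of rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            # An onclick-only anchor exposes no URI, so there is nothing to follow.
            return
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.skipped.append(href)
        else:
            self.followed.append(href)


# Hypothetical page markup with one anchor of each kind.
page = """
<a href="/MyService.svc/Customers">Customers</a>
<a href="/MyService.svc/Orders" rel="nofollow">Orders</a>
<a onclick="loadProducts()">Products</a>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.followed)  # ['/MyService.svc/Customers']
print(collector.skipped)   # ['/MyService.svc/Orders']
```

A crawler that ignores nofollow would simply treat the second anchor like the first, which is why the attribute is no substitute for robots.txt or authentication.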

If your service is publicly available, using robots.txt is the way to go. If it’s not publicly available, the service should already be locked down through authentication.

Comments

7/8/2009 1:33:16 PM #

Jonathan Bates

Have you ever tried using a custom HttpModule to trap and redirect bots?


7/17/2009 9:31:50 AM #

chris

I haven't done that, but I do see how that would be more secure than depending on them to honor robots.txt. However, wouldn't a bot that disregards robots.txt also change its user-agent to look human?


Whois KodefuGuru

Chris Eargle

.NET Community Champion


MVP - Visual C#

 


Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2010