
Stingray Spider Catcher

by ben_1 on 02-21-2013, edited on 07-08-2015 by PaulWallace

Web spiders are clever critters - they are automated programs designed to crawl over web pages, retrieving information from the whole of a site (for example, spiders power search engines and shopping comparison sites). But what do you do if your website is being overrun by the bugs? How can you prevent your service from being slowed down by a badly written, over-eager web spider?

 

Web spiders (sometimes called robots, or bots) are meant to adhere to the Robots Exclusion Standard. By putting a file called robots.txt at the top of your site, you can restrict the pages that a web spider should load. However, not all spiders bother to check this file. Even worse, the standard gives no control over how often a spider may fetch pages. A poorly written spider could hammer your site with requests, trying to discover the price of everything you are selling every minute of the day. The problem is: how do you stop these spiders while allowing your normal visitors to use the site without restrictions?

 

As you might expect, Stingray has the answer! The key feature to use is Request Rate Shaping: a rate shaping class prevents any one user from fetching too many pages from your site.

 

Let's see how to put them to use:

 

Create a Rate Shaping Class

 

You can create a new class from the Catalogs page. You need to pick at least one rate limit: the maximum allowed requests per minute, or per second. For our example, we'll create a class called 'limit' that allows up to 100 requests a minute.

 

Put the rate shaping class into use - first attempt

 

Now, create a TrafficScript rule to use this class. Don't forget to add this new rule to the Virtual Server that runs your web site.

 

rate.use( "limit" );

 

This rule runs for every HTTP request to your site and applies the rate shaping class to each one.

 

However, this isn't good enough. We have just limited all visitors to the site, collectively, to 100 requests a minute in total. Left like this, the limit would have a terrible effect on the site. We need to apply the rate shaping limits to each individual user.

 

Rate shaping - second attempt

 

Edit the TrafficScript rule and use this code instead:

 

rate.use( "limit", connection.getRemoteIP() );

 

We have provided a second parameter to the rate.use() function. The rule takes the client IP address and uses it to identify a user, then applies the rate shaping class separately to each unique IP address. So, a user coming from IP address 1.2.3.4 can make up to 100 requests a minute, and a user from 5.6.7.8 can also make 100 requests a minute at the same time.

 

Now, if a web spider browses your site, it will be rate limited.

 

Improvements

 

We can make this rate shaping work even better. One slight problem with the code above is that you may sometimes have multiple users arriving at your site from one IP address. For example, a company may route all its traffic through a single web proxy, so everyone in that company appears to come from the same IP address. We don't want to slow them all down collectively.

 

To work around this, we can use cookies to identify individual users. Let's assume your site already sets a cookie called 'USERID' whose value is unique for each visitor. We can use this in the rate shaping rule:

 

# Try reading the cookie
$userid = http.getCookie( "USERID" );
if( $userid == "" ) { $userid = connection.getRemoteIP(); }
rate.use( "limit", $userid );

 

This TrafficScript rule tries to use the cookie to identify a user. If the cookie isn't present, it falls back to using the client IP address.

 

Even more improvements

 

There are many other possibilities for further improvement. We could detect web spiders by their User-Agent names (a rough sketch follows below), or perhaps only rate shape users who aren't accepting cookies. But we have already achieved our goal: we now have a means to limit page requests from automated programs, while allowing ordinary users to use the site in full.
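
 

As a rough illustration of the first idea, here is a minimal TrafficScript sketch (not part of the original article). It applies the 'limit' class only to requests whose User-Agent header is empty or matches a simple crawler keyword pattern; the pattern itself is an assumption that you would tune for the spiders you actually see in your logs:

 

$ua = http.getHeader( "User-Agent" );

# Treat a missing User-Agent, or one containing common crawler keywords
# (an assumed heuristic), as a likely spider and rate shape it per client IP.
if( $ua == "" || string.regexmatch( $ua, "(?i)(bot|crawler|spider)" ) ) {
   rate.use( "limit", connection.getRemoteIP() );
}

 

Ordinary browsers fall through this rule untouched, so only suspected spiders are queued by the rate shaping class.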

 

This article was originally written by Ben Mansell in December 2006.
