vADC Blog

Merging RSS feeds using Java Extensions (12/17/2008)

by chrisboyle on 06-29-2012 02:02 PM - last edited on 07-14-2015 01:44 PM by rickl44

One of Stingray Traffic Manager's most powerful features is the ability to run Java on your traffic manager, allowing you to use a wide variety of existing libraries. For example, using Java's XML APIs, you can manipulate data on the fly more intelligently than with TrafficScript alone.

 

As a simple demonstration, this article includes a code walkthrough to fetch RSS feeds from several locations and produce one merged, sorted feed, which is more convenient to subscribe to and can be manipulated in other ways at the same time.

 

Why use Stingray for this?

 

The Servlet API lets you write Java code for this sort of task, but setting up and maintaining a Java application server can be a pain, especially considering that you might have to set up a new host and OS. In many situations, this is overkill. Fortunately, Stingray includes a Java application server, which is a good place to develop this sort of functionality (you can attach a remote debugger to your servlet), deploy it quickly and manage it side-by-side with other services.

 

Anatomy of an RSS feed

 

Before we walk through the source, let's take a look at the structure of an RSS feed. We're only considering version 2.0 here, to keep the code simple. Wikipedia has a *complete example feed*, but the important elements for our purposes are the rss root element, the channel element inside it, and the item elements within the channel, each of which normally carries a pubDate.

There are *several different libraries* for handling XML in Java. We're using JAXP (https://jaxp.dev.java.net/), the Java API for XML Processing, which is included in the JDK on Stingray appliances, so this example will work out of the box.
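For readers without the Wikipedia example to hand, here is a minimal sketch of that structure and of reading it with JAXP. The feed content, the class name RssStructureDemo and the helper itemTitles() are invented for illustration; only the element names (rss, channel, item, title, link, pubDate) come from RSS 2.0.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RssStructureDemo {
    // A hand-written, minimal RSS 2.0 document (illustrative content only)
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
      + "<rss version=\"2.0\">"
      + "<channel>"
      + "<title>Example feed</title>"
      + "<link>http://example.com/</link>"
      + "<description>An example</description>"
      + "<item><title>First post</title>"
      + "<link>http://example.com/1</link>"
      + "<pubDate>Mon, 15 Dec 2008 10:00:00 +0000</pubDate></item>"
      + "<item><title>Second post</title>"
      + "<link>http://example.com/2</link>"
      + "<pubDate>Tue, 16 Dec 2008 09:30:00 +0000</pubDate></item>"
      + "</channel></rss>";

    // Parse the XML text into a DOM and collect the title of each item
    static List<String> itemTitles(String xmlText) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xmlText)));
        List<String> titles = new ArrayList<String>();
        NodeList items = d.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Look for <title> inside this item only, not the channel title
            titles.add(item.getElementsByTagName("title").item(0).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(itemTitles(SAMPLE));
    }
}
```

The merging servlet works on the same tree: it looks up the item elements by name and reads the pubDate child of each one.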

 

To see it in action, download http://blogs.riverbed.com/files/mergefeeds.class and add it to your Stingray Traffic Manager (upload it under Catalogs/Java, then add the resulting rule to a virtual server as a request rule). To try it on different feeds, find the extension under Catalogs/Java and put a space-separated list of RSS 2.0 feed URLs in a parameter called feeds. You can also add a title parameter, and a dateformat parameter if your feeds use a different *date format*.

 

Code walkthrough

 

The code below is slightly abridged; you can *download the full source*. We begin with the usual servlet skeleton and a couple of factories we'll use later, one for building DOMs, the other for transforming them back to XML:

 

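// Imports needed by the code shown in this walkthrough
import java.io.*;
import java.text.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.w3c.dom.*;

public class MergeFeeds extends HttpServlet
{
    // Factory for building DOMs from the fetched feeds
    static final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // Factory for transforming the merged DOM back to XML
    static final TransformerFactory tf = TransformerFactory.newInstance();

    public void doGet( HttpServletRequest req, HttpServletResponse res )
        throws ServletException, IOException
    {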

 

The first thing we need to do is look at the configuration we mentioned earlier. You can retrieve parameters set in the Stingray UI using getInitParameter(), which will return either a String or null.

 

        // Default feeds, used when no "feeds" parameter is configured
        String[] urls = {
            "http://knowledgehub.zeus.com/xmlsrv/rss2.php",
            "http://knowledgehub.zeus.com/xmlsrv/rss2.comments.php"
        };
        String urlList = getInitParameter( "feeds" );
        if( urlList != null ) urls = urlList.split(" ");

 

We handle the other parameters similarly, and then set our output content type. Many sites still serve RSS as text/html, which might be accepted by most readers, but is obviously incorrect.

 

res.setContentType( "application/rss+xml" );
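The handling of the other parameters is not shown in the abridged listing. The following is a sketch of one way it could look, not the author's code: a plain Map stands in for getInitParameter() so that the example runs outside a servlet container, and the class name FeedParams, the default title and the default date pattern are all assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class FeedParams {
    // Assumed default: a pattern for RFC 822 style dates,
    // e.g. "Wed, 17 Dec 2008 12:00:00 +0000"
    static final String DEFAULT_DATE_FORMAT = "EEE, dd MMM yyyy HH:mm:ss Z";
    // Hypothetical default title for the merged feed
    static final String DEFAULT_TITLE = "Merged feed";

    // Return the configured value, or the fallback when it is unset
    static String orDefault(String configured, String fallback) {
        return configured != null ? configured : fallback;
    }

    // Build the date parser from the optional "dateformat" parameter
    static SimpleDateFormat dateFormat(Map<String, String> params) {
        String pattern = orDefault(params.get("dateformat"), DEFAULT_DATE_FORMAT);
        // English locale so day and month names parse on any host locale
        return new SimpleDateFormat(pattern, Locale.ENGLISH);
    }

    public static void main(String[] args) throws Exception {
        // Stands in for the parameters configured in the Stingray UI
        Map<String, String> params = new HashMap<String, String>();
        params.put("title", "My feeds");

        System.out.println(orDefault(params.get("title"), DEFAULT_TITLE));
        System.out.println(
            dateFormat(params).parse("Wed, 17 Dec 2008 12:00:00 +0000"));
    }
}
```

Note that SimpleDateFormat is not thread-safe, so in a servlet it is safer to create one per request than to share a single instance between requests.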

 

Next we create our output document, the channel element (but not the root element yet) and a *TreeMap*, which will keep the entries in order. The Java libraries already know how to compare *Date* objects and can trivially be told to reverse the comparison.

Each feed is then fetched and parsed into a DOM document, d, so we have the entire structure of that feed to work with. As well as pulling out all the items, we're going to use a slight hack here to get all the correct attributes on the root element, which will mostly be XML namespaces, such as xmlns:dc="http://purl.org/dc/elements/1.1/". We'll simply copy the root element and its attributes (but not its children) from the first feed we process. You could easily construct the root element manually and use *setAttribute()* if you prefer. We then connect that to our channel element from earlier.
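The code that creates the output document, the channel element and the reverse-ordered TreeMap does not appear in the abridged listing. A minimal sketch of that setup, assuming standard JAXP and java.util calls, might look like the following; the class and method names are hypothetical and this is not the author's code.

```java
import java.util.Collections;
import java.util.Date;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class OutputSetup {
    // Create an empty output document
    static Document newDocument() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
    }

    // Create a <channel> with a <title> child; the channel is owned by
    // the document but not yet attached to it
    static Element newChannel(Document xml, String title) {
        Element channel = xml.createElement("channel");
        Element t = xml.createElement("title");
        t.appendChild(xml.createTextNode(title));
        channel.appendChild(t);
        return channel;
    }

    // A map that iterates from the newest date to the oldest
    static TreeMap<Date, Node> newestFirst() {
        return new TreeMap<Date, Node>(Collections.<Date>reverseOrder());
    }

    public static void main(String[] args) throws Exception {
        Document xml = newDocument();
        Element channel = newChannel(xml, "Merged feed");
        TreeMap<Date, Node> items = newestFirst();
        items.put(new Date(1000L), xml.createElement("item"));
        items.put(new Date(2000L), xml.createElement("item"));

        // The later date comes first because the comparison is reversed
        System.out.println(items.firstKey().getTime());
        // Still no root element, which is what the listing checks for
        System.out.println(xml.getFirstChild() == null);
    }
}
```

Because the channel is created but not appended, xml.getFirstChild() stays null until the root element has been copied from the first feed.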

 

            // Copy the root element from the first feed, for xmlns attributes
            if( xml.getFirstChild() == null ) {
                // false = shallow import: attributes are copied, children are not
                Node rss = xml.importNode( d.getElementsByTagName("rss").item(0), false );
                xml.appendChild( rss );
                rss.appendChild( channel );
            }
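The difference between a shallow and a deep importNode() call can be checked in isolation. The sketch below uses a made-up feed string and a hypothetical class name; it relies only on the standard DOM behaviour that an element's attributes are imported even when its children are not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class ImportNodeDemo {
    // Illustrative source feed with a namespace declaration on the root
    static final String FEED =
        "<rss version=\"2.0\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
      + "<channel><title>Source feed</title></channel></rss>";

    // Import the root of FEED into a fresh document, shallowly or deeply
    static Element importRoot(boolean deep) throws Exception {
        DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = db.parse(new InputSource(new StringReader(FEED)));
        Document target = db.newDocument();
        // importNode() creates a copy owned by 'target'; the copy still
        // has to be attached with appendChild()
        Element copy =
            (Element) target.importNode(source.getDocumentElement(), deep);
        target.appendChild(copy);
        return copy;
    }

    public static void main(String[] args) throws Exception {
        Element shallow = importRoot(false);
        System.out.println(shallow.getAttribute("xmlns:dc"));
        System.out.println(shallow.hasChildNodes());
        System.out.println(importRoot(true).hasChildNodes());
    }
}
```

The shallow copy keeps the version and xmlns:dc attributes but has no children, which is why the merged feed can attach its own channel element to it.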

 

Now we just need to pull out the item elements, parse their dates using *SimpleDateFormat* and put them into our sorted map. We use *importNode()* again to import the nodes into our document, as we did with the root element, but this time we copy the children too.

 

            // For each item in the feed...
            NodeList feedItems = d.getElementsByTagName( "item" );
            for( int i = 0; i < feedItems.getLength(); i++ ) {
                // Get the date
                NodeList nl = feedItems.item(i).getChildNodes();
                Date date = new Date(); // now
                for( int j = 0; j < nl.getLength(); j++ ) {
                    if( ! nl.item(j).getNodeName().equalsIgnoreCase( "pubDate" ) ) continue;
                    // getTextContent() avoids a NullPointerException on an empty pubDate
                    try { date = sdf.parse( nl.item(j).getTextContent().trim() ); }
                    catch( ParseException ignored ) {} // use current time
                }
                // Store the item (the map keeps reverse date order).
                // Note: items with identical dates replace one another in the map.
                items.put( date, xml.importNode( feedItems.item(i), true ) );
            }
        }

 

Finally, we just check that we have a valid document and transform it back into XML.

 

        if( xml.getFirstChild() == null ) throw new ServletException( "No valid feeds!" );

        // Append all the items (sorted), and output the resulting document
        for( Node n : items.values() ) channel.appendChild( n );
        PrintWriter out = res.getWriter();
        try {
            tf.newTransformer().transform( new DOMSource( xml ), new StreamResult( out ) );
            out.flush();
        } catch( TransformerConfigurationException e ) { throw new ServletException(e); }
        catch( TransformerException e ) {} // Probably the client went away
    }
}
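To experiment with the output step outside a servlet container, the same transformation can be pointed at a StringWriter instead of the response writer. This is a sketch with hypothetical class and method names, using a tiny hand-built document in place of the merged feed.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class SerializeDemo {
    // Serialize a DOM document to a string with an identity transformer
    static String toXml(Document xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(xml), new StreamResult(out));
        return out.toString();
    }

    // Build a tiny stand-in for the merged feed: <rss><channel/></rss>
    static Document tinyFeed() throws Exception {
        Document xml = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element rss = xml.createElement("rss");
        rss.setAttribute("version", "2.0");
        xml.appendChild(rss);
        rss.appendChild(xml.createElement("channel"));
        return xml;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(tinyFeed()));
    }
}
```

A Transformer created with no stylesheet performs an identity transformation, so the output is simply the serialized form of the DOM.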

 

Exercises for the reader

 

Depending on the nature of your feeds, you might want to include support for:

 

  • Atom
  • older RSS formats
  • other date formats (some sites use non-RFC822 formats)
  • duplicate removal using the guid or the link address (perhaps several feeds post the same link and you only want to see it once)
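As a starting point for the last exercise, here is one possible sketch of duplicate removal. It keys each item on its guid, falling back to its link, and keeps the first item seen for each key. The class name, helper names and sample feed are hypothetical; in the servlet, a check like this could run before an item is stored in the sorted map.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class Dedup {
    // Text of the first direct child with the given name, or null if absent
    static String childText(Node item, String name) {
        NodeList nl = item.getChildNodes();
        for (int i = 0; i < nl.getLength(); i++) {
            if (nl.item(i).getNodeName().equalsIgnoreCase(name)) {
                return nl.item(i).getTextContent().trim();
            }
        }
        return null;
    }

    // Identity of an item: its guid if present, otherwise its link
    static String key(Node item) {
        String guid = childText(item, "guid");
        return guid != null ? guid : childText(item, "link");
    }

    // Keep the first item for each key; items with no key are always kept
    static List<Node> dedupe(List<Node> items) {
        Set<String> seen = new HashSet<String>();
        List<Node> out = new ArrayList<Node>();
        for (Node item : items) {
            String k = key(item);
            if (k == null || seen.add(k)) out.add(item);
        }
        return out;
    }

    // Helper for the demo: parse XML text and list its item elements
    static List<Node> itemsOf(String xmlText) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xmlText)));
        NodeList nl = d.getElementsByTagName("item");
        List<Node> items = new ArrayList<Node>();
        for (int i = 0; i < nl.getLength(); i++) items.add(nl.item(i));
        return items;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss><channel>"
            + "<item><guid>a</guid><title>One</title></item>"
            + "<item><link>http://example.com/2</link></item>"
            + "<item><guid>a</guid><title>One again</title></item>"
            + "</channel></rss>";
        System.out.println(dedupe(itemsOf(feed)).size());
    }
}
```

Keeping the first occurrence is an arbitrary choice; depending on your feeds you might prefer to keep the newest copy instead.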

 

Java Extensions don't just allow you to do arbitrary XML processing; you can also choose which vendor's XML implementation you want to use. As Michael noted *in his article on XML validation*, you can install the Intel XML Suite on your Stingray Traffic Manager, and since it provides the same JAXP API, the code we've used here will start using it, with no source changes or recompilation required.
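This works because the JAXP factories choose their implementation at run time, for example from a system property such as javax.xml.parsers.DocumentBuilderFactory or from a provider found on the classpath. A quick way to see which implementation is in use is to print the class names of the factories. The class name JaxpProviders is hypothetical, and how any particular vendor suite registers itself is not covered here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;

final class JaxpProviders {
    // Class name of the DOM parser factory that JAXP selected
    static String parserImplementation() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of the transformer factory that JAXP selected
    static String transformerImplementation() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("Parser:      " + parserImplementation());
        System.out.println("Transformer: " + transformerImplementation());
    }
}
```

Running this before and after installing a different XML implementation shows whether the swap has taken effect.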