I’ve made several posts on Hashdot and the RJack gems, but have yet to post on the why? or what? of it. This post should fill that gap by way of a non-trivial example.

At Evri I’m responsible for content acquisition and text extraction. We’re parsing volumes of news and general web content, and separating the kernels of semantically rich text from the chaff, i.e. site navigation boxes, advertisements, etc. Normally this content is processed “offline”, but I recently had reason to package it as an HTTP-XML service for transforming web content in real time. As the content processing system already existed as a mixture of java (for the heavy lifting) and ruby, I went looking for a way to construct the service that would be simple, perform well, and be flexible enough to reuse the existing processing code and setup.

As the XML production (output) code was already in java, and because I wanted the performance advantages of the JVM’s native threads, I decided to write two java servlets: one to simply test obtaining the web content by URL as a proxy (using the thread-safe, connection-pooling commons-httpclient), and a second extending the first to actually transform the content HTML into an internal XML representation. So now I have two servlets that need to be set up in an HTTP server. I want them to share a common HTTPClient setup, and the transforming servlet also needs a reference to a custom-configured content processor. How was I going to wire all this together?

Standards, Frameworks, and XML configuration

Classical java would have me create a Web Application Archive (WAR) which contains the compiled Servlets plus a web.xml Deployment Descriptor providing the mapping of servlets to URIs. Since the container will be constructing the servlets, access to the HTTPClient and processor must be arranged by indirect means, presumably through a ServletContext, which allows Object-type attributes. Back in the day, with earlier Servlet specs, this would have been done with some form of initialization servlet. Now it’s possible to register a context listener which does the same. I could write all the java classes and XML myself, or depend on something like the Spring Framework’s ContextLoaderListener. In either case it will be a challenge to keep the ruby-based content processor setup without resorting to an additionally complex integration, with java setting up a jruby interpreter (via BSF or otherwise). Note also that none of this complexity helps to address such matters as how to get this all installed in a production environment, how the service will be started, monitored, etc. The two servlets are themselves pretty simple. How might we shed some of the surrounding baggage?

Un-Inversion of Control

Enter the beautifully designed Jetty Web Server which offers full compliance with all of the committee Servlet/JSP standards, but also offers direct programmatic setup under the guise of embedding. It was quick work to prototype the same embedded HTTP server setup in jruby. I later took up authoring a generic jetty gem packaging and setup façade as a side project.

As I’ve set up a lot of classic java web applications in the past, reviewing the script for this post gave me an eerie feeling that somehow there must be a few more XML configuration files or bash scripts somewhere that were needed to make this all work. That’s not the case here. Let’s quickly review the prerequisites:

  1. Java JDK or JRE
  2. Hashdot installed with JRuby in /opt.
  3. RJack logback and jetty gems:
      % jgem install logback jetty
    
  4. A commons-httpclient gem which has not yet been published.
  5. The Evri proprietary content_processor and proxy_transformer ruby Modules which, sorry to say, I won’t be sharing.

(Note: I’ve also changed a few of the setup details of these to avoid giving too many details.)

Dependency Injection

Fancy terminology, eh? The HTTP Client and content processor are simply created as local variables http (l28) and processor (l33) in the script, and passed directly on construction of the ProxyServlet and TransformServlet (l46-7). Isn’t that refreshingly direct? The Jetty::ServerFactory provided in the jetty gem is used to create the complete HTTP server, with a healthy dose of sensible defaults. The set_context_servlets call (l35) uses a ruby hash for mapping paths of the root '/' context to these servlets. There is no Web Application, because there is no need for one. (However, the jetty gem supports full webapps as well when needed.)
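To make the wiring idea concrete outside of JRuby, here is a minimal sketch in plain ruby. All of the Stub* classes are hypothetical stand-ins invented for illustration; they are not part of the jetty or commons-httpclient gems, and they only mimic the shape of the real servlets' constructors.

```ruby
# Hypothetical stand-ins; none of these names come from the real gems.
class StubHttpClient
  attr_accessor :max_total_connections
end

class StubProcessor
  attr_accessor :filter_threads
end

class StubProxyServlet
  attr_reader :client

  def initialize(client)
    @client = client
  end
end

class StubTransformServlet < StubProxyServlet
  attr_reader :processor

  def initialize(client, processor)
    super(client)
    @processor = processor
  end
end

# Wiring: plain local variables, handed straight to the constructors.
http = StubHttpClient.new
http.max_total_connections = 100

processor = StubProcessor.new
processor.filter_threads = 1

servlets = {
  '/proxy'     => StubProxyServlet.new(http),
  '/transform' => StubTransformServlet.new(http, processor)
}

# Both servlets hold the very same client instance.
puts servlets['/proxy'].client.equal?(servlets['/transform'].client) # prints true
```

No container, context attribute, or listener sits between the objects and the code that uses them; the "injection" is an ordinary constructor argument.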

Lifecycle Events

More fancy terminology. We have some shutdown work we’d like to do after the server port is closed. Once Jetty is started via server.start (l51), the server.join call (l52) will block until the server exits, for example via SIGTERM. We then have an opportunity to clean up, for example by gracefully closing the HTTP client and any kept-alive connections (l55). Other examples might include writing out some final reporting information.
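The start, join, then clean up sequence can be sketched in plain ruby as follows. StubServer is a hypothetical stand-in whose join blocks on a queue, and a helper thread plays the role of SIGTERM; this is not the Jetty API. The sketch also swaps in an ensure clause so the cleanup runs even if join raises, which is a variation on the script in this post (it simply calls http.shutdown after join).

```ruby
require 'thread' # Queue; a no-op on recent rubies where it is built in

# Hypothetical stand-in for a server; not the Jetty API.
class StubServer
  def initialize(events)
    @events = events
    @stop = Queue.new
  end

  def start
    @events << :started
  end

  # Blocks until stop is called from another thread.
  def join
    @stop.pop
    @events << :stopped
  end

  def stop
    @stop << :stop
  end
end

# Start, block until the server exits, then always run cleanup.
def run_service(server, events)
  server.start
  server.join
ensure
  events << :cleanup
end

events = []
server = StubServer.new(events)
stopper = Thread.new { server.stop } # stands in for SIGTERM
run_service(server, events)
stopper.join
p events # prints [:started, :stopped, :cleanup]
```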

Code or Configuration?

The original Jetty embedding examples are written in Java and thus need to be compiled. Any settings such as the HTTP port number are thus hard coded in the jetty examples. The distinction is a good bit less clear in the case of this ruby script. The script itself is almost entirely in a declarative style, save perhaps for the final server start, join and cleanup (l51-55). It also uses some local variables, but this isn’t unlike named object references in Jetty or Spring XML configuration. It would certainly be easy enough to break a final set of configuration elements out into a separate XML configuration file (or YAML, or even java properties), but what is the gain? Is it not easy enough to change any necessary settings in this script directly? Is there not a clear advantage to having all aspects of the service setup in a single unified file, rather than spread out into multiple XML configuration files and the typically required bash wrapper script?
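For what it is worth, if external overrides ever did become necessary, a few lines of ruby would cover it. The sketch below is an assumption-laden illustration, not something the service in this post does: the DEFAULTS hash and the load_settings helper are hypothetical names, and the keys merely echo the factory settings from the script.

```ruby
require 'yaml'

# Hypothetical defaults, echoing the factory settings in the script.
DEFAULTS = {
  'port'             => 8080,
  'max_threads'      => 50,
  'max_idle_time_ms' => 10_000
}.freeze

# Merge optional YAML overrides over the defaults; unknown keys are rejected.
def load_settings(yaml_text = nil)
  overrides = (yaml_text && YAML.safe_load(yaml_text)) || {}
  raise ArgumentError, 'settings must be a YAML mapping' unless overrides.is_a?(Hash)

  unknown = overrides.keys - DEFAULTS.keys
  raise ArgumentError, "unknown settings: #{unknown.join(', ')}" unless unknown.empty?

  DEFAULTS.merge(overrides)
end

settings = load_settings("port: 9090\n")
puts settings['port']        # prints 9090
puts settings['max_threads'] # prints 50
```

Even then the defaults still live in code, which rather supports the point: the script remains the single place to look.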

Where is the Ruby at Runtime?

While ruby is used for setup, wiring, dependency injection, or whatever you’d like to call it, ruby is interestingly absent as an explicit reference in any of the Java code, where there is not a single JRuby or BSF dependency reference. Furthermore, the ProxyServlet never utilizes a single line of ruby code in its runtime execution. We’ve effectively made good use of ruby to set up this servlet but take on absolutely zero ruby performance penalties during runtime.

The content processor by comparison, as used by the TransformServlet, uses a pipeline or chain-of-responsibility pattern with some links in the chain implemented in ruby. The ruby links are processing logic steps that are not performance intensive, and they don’t measurably detract from the performance. Note that if they did become a performance bottleneck, they could be rewritten in pure java without needing to modify or recompile the TransformServlet. Thus by putting ruby in control of service bootstrapping, ruby can easily be injected into the service and mixed with java components where advantageous, without any complexity incurred on the java side. Some additional example code for this kind of ruby injection can be found in the Jetty::ServerFactory rdoc, where, for example, I demonstrate implementing a Servlet in ruby and passing it to Jetty.
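Since the real content processor is proprietary, here is only a generic sketch of the pipeline idea in plain ruby. The Pipeline and StripTags classes and the squeeze_space lambda are hypothetical; the point is just that each link only needs to respond to call, so a link could be swapped for a different implementation without touching the code that runs the pipeline.

```ruby
# Each link is anything that responds to call(text) and returns text.
class Pipeline
  def initialize(*links)
    @links = links
  end

  # Feed the output of each link into the next one.
  def call(text)
    @links.reduce(text) { |acc, link| link.call(acc) }
  end
end

# A link written as a class (imagine this one being the java-backed step).
class StripTags
  def call(text)
    text.gsub(/<[^>]*>/, '')
  end
end

# A link written as a lambda (a cheap ruby logic step).
squeeze_space = ->(text) { text.gsub(/\s+/, ' ').strip }

pipeline = Pipeline.new(StripTags.new, squeeze_space)
puts pipeline.call("<p>Hello,   <b>world</b></p>\n") # prints Hello, world
```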

Logging, Daemonizing, and other oft ignored Subtleties

The Hashdot launcher and hashdot properties are used to arrange for the service to daemonize, redirect STDOUT/STDERR to a log file, and set a Java heap size. This will make it simple to add the service to inittab or another UNIX process monitor. The logback gem is loaded (l10,14), which loads the slf4j gem as a dependency. Jetty in turn detects the presence of SLF4J and will use it for logging. SLF4J via Logback output to STDOUT at INFO level is arranged programmatically (l22-5), thus avoiding a separate logback.xml. Below is an example of log output using the jetty-service script provided with the jetty gem itself (and with the logback gem available).

1    INFO  org.mortbay.log - Logging to Logger[org.mortbay.log] via org.mortbay.log.Slf4jLog
58   INFO  org.mortbay.log - jetty-6.1.12
280  INFO  org.mortbay.log - NO JSP Support for /, did not find org.apache.jasper.servlet.JspServlet
388  INFO  org.mortbay.log - Started SelectChannelConnector@0.0.0.0:38225
388  INFO  jetty-service - Listening on port: 38225

It’s important to redirect STDOUT and STDERR because any fatal errors like uncaught runtime exceptions or java crash dumps will end up here. It’s also convenient when debugging a problem to be able to pkill -HUP proxy-transform-service and get all of the thread stack dumps into the log. Finally, it’s quite common for an application which brings in many open-source components to have cases where these components write to STDOUT or STDERR under unusual circumstances. An approach that redirects STDOUT/STDERR to /dev/null or an obscure alternative log file will commonly result in this information being lost. Thus I find a certain old school elegance in coalescing all the logging output to STDOUT, using Hashdot to redirect this to a file at the system (not just java) level, and using an external tool like logrotate to provide log rotation.
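The same "everything to STDOUT, let the launcher redirect it" approach can be sketched with ruby's standard Logger. This is only an analogy to the Logback setup, not what the service uses; the build_logger helper and the output format are assumptions made for illustration.

```ruby
require 'logger'

# Build an INFO-level logger that writes one plain line per event.
# No file handling here: redirection and rotation are left to the
# launcher and to external tools.
def build_logger(io = $stdout)
  io.sync = true if io.respond_to?(:sync=) # avoid losing buffered lines on a crash
  logger = Logger.new(io)
  logger.level = Logger::INFO
  logger.formatter = proc do |severity, _time, progname, msg|
    format("%-5s %s - %s\n", severity, progname, msg)
  end
  logger
end

log = build_logger
log.progname = 'proxy-transform-service'
log.info('Listening on port: 8080')
log.debug('suppressed at INFO level')
```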

proxy-transform-service script

#!/opt/bin/jruby
#. hashdot.profile += daemon
#. hashdot.vm.options += -Xmx256m
#. hashdot.io_redirect.file = /var/log/proxy-transformer.log

require 'rubygems'

gem 'logback', '~> 0.9'
gem 'jetty', '~> 6.1'
gem 'commons-httpclient', '~> 3.1'

require 'logback'
require 'jetty'
require 'commons-httpclient'

require 'content_processor'
require 'proxy_transformer'

# Logback
Logback.configure do
  Logback.root.add_appender( Logback::ConsoleAppender.new )
  Logback.root.level = Logback::INFO
end

# HTTP Client
http = CommonsHttpClient::Facade.new
http.max_total_connections = 100
http.connection_timeout = 1500 #ms

# Content Processor
processor = ContentProcessor.new do |p|
  p.centroid_weight = 0.3
  p.filter_threads = 1
end

# HTTP Server
factory = Jetty::ServerFactory.new
factory.max_threads = 50
factory.port = 8080
factory.max_idle_time_ms = 10000

include ProxyTransformer
factory.set_context_servlets( '/',
  { '/proxy'     => ProxyServlet.new( http.client ),
    '/transform' => TransformServlet.new( http.client, processor ) } )

server = factory.create

server.start
server.join

# Shutdown, cleanup
http.shutdown