Friday, November 11, 2011

Optimizing the Heck Out of F#: HTTP Request Parsing

As part of the WebSharper web server effort, I have been writing an HTTP request parser. Tuning the parser for performance for the common simple case (small, correct HTTP request) has improved performance 8-fold, from 30K to 250K sample requests parsed every second on a single core of my Core i3. Let me review what I have learned from this.


Accessing array elements goes through a bounds check. Unmanaged C++ code clearly wins here. C# has unsafe regions, but F# does not. So what can we do in F# to be competitive? The only option left is using bulk operations from the standard library. BCL is not at all helpful here - it is not clear from the documentation which functions drop the check. Also, many operations one would want are simply missing.

For an example where it matters, I was not able to match the performance of System.Array.FindIndex with any F# code I wrote to do the same job.

I imagine this is a killer problem for numerical computing. With unavoidable bounds checking, one really cannot hope to design numerical code in safe managed .NET that would match fortran routines.


Generic code gets a staggering performance hit when certain simple operations such as equality do not get specialized to the simple value type you are using. Polymorphism has a cost. Inline F# functions sometimes help here. But it is unfortunate there is no flag to monomorphise some code. MLton users, I envy you here.

Value Types

Using value types such as structs and enums reduces the GC pressure. Note, however, that they still get boxed sometimes. For example, if a struct implements an interface, code that expects an interface implementation will receive the struct boxed. This code has to be specialized to structs.


If we care about every bit of performance, mutation matters. However, I found myself wasting lots of time trying to wrap my head around a problem thinking about it in terms of mutation. Clearly, the diagnosis is premature optimization. What I found more helpful is writing a purely functional solution and then transforming it to eliminate allocations and introduce mutation.

Note also that the GC is good enough in most cases. One cannot afford to allocate on the heap per every byte, but allocating short-lived objects does not matter much until you need to do it 100K times a second.


Profiling is a life-saver. I used a SlimTune profiler this time. My first discovery was that using System.Collections.Specialized.NameValueCollection for headers is really expensive. It spends a lot of time computing case-insensitive hash values for the header keys. What a bother, especially when the application does not look into the headers. I settled for queuing the headers instead and exposing them as a sequence.

The profiler helps to spend your time effectively - optimizing what really matters.

Specifics of HTTP Request Parsing

The problem is rather simple: HTTP requests keep arriving and need to be parsed and forwarded to the responder thread. In the keep-alive scenario many requests arrive on the same socket. If there is pipelining, they all come at once.

What I wanted to solve here is parsing the requests incrementally, so that if half of a request arrives we say OK and suspend in mid-air until more data is available.

Iteratees are the general solution here. However, iteratees are allocating on the heap, and F#, unlike Haskell, does not do any program transformation magic to simplify them. For this reason it seems that it is not the ideal solution, at least on the byte level.

What I ended up doing instead with incomplete requests is re-parsing. The parsing logic is expressed over a TextReader-like interface. Parser return codes are Done, Error, or Waiting. If the parser says Waiting, I keep the data in the buffer. If it succeeds, the data is discarded. Errors cannot be recovered from.

To some extent micro-parsers can be combined without using the heap. The trick here is to use mutation to return the result on success. Since the return code is an enum, I can join parsers with `&&&`:

parseMethod r req
&&& skipChar r ' '
&&& parseUntil r ' ' &req.uri
&&& parseVersion r req
&&& parseHeaders r req

In case of an early error, parsing does not stop, but there is no reason to care since most requests are well-formed.

To work with TextReader-like interface and avoid allocation, I use a constant-space ring buffer that acts as a limited-size queue for bytes. Most servers limit the size of the request head to 8192, this is what I do as well. It provides its own TextReader that assumes ASCII encoding.

The most rewarding optimization was adding custom methods to the buffer and its reader to make parseUntil and r.ReadLine possible. Instead of going byte-by-byte through several layers of indirection, I switched to System.Array.IndexOf. A ring buffer needs to do it at most twice per operation.

Monday, November 7, 2011

An F# Web Server From Sockets and Up

I have implemented a simple web server in F#. The idea was to try to marry .NET asynchronous socket operations with F# async. Result: F# async seems to be the right tool for the job of webserver implementation: it makes asynchronous programming intuitive without adding too much performance overhead. The server executes 3500 keep-alive or 1000 normal request per second on my Core i3 machine, compared to 2500/500 requests per second using IIS or System.Net.HttpListener.

Asynhronous Socket Operations

Working with sockets in .NET is done with the Socket class. From the MSDN documentation, the recommended approach is to use the asynchronous methods such as AcceptAsync, SendAsync and ReceiveAsync. These methods register callbacks to be executed when data arrives or ships through the socket. As a result of the callback approach, no threads are blocked by slow connections.

Sockets and F# Async

Unfortunately, the default interface is not very intuitive. The example code is atrocious. Since the operations are callback-based, this seems like a good match for
F# async. I went for the first mapping that came to mind:

Implementing this interface is easy - it is just working around the boilerplate: creating a SocketAsyncEventArgs object, registering a callback, calling the method, checking the result for errors. I was able to express all of it in a single helper method:


It seems that the common optimization paths include pooling sockets, pooling SocketAsyncEventArgs, and pooling buffers to prevent memory fragmentation. The latest point is the most interesting. Socket code itself is written in unmanaged C and passing data between garbage-collected .NET code and C code is done by pinning the .NET array used as a buffer. A pinned array is never relocated by the garbage collector, so the C code
has no trouble finding it. A lot of pinned arrays to work around make garbage collector's job harder - memory gets fragmented.

To avoid fragmentation issues, instead of allocating a lot of small arrays I allocate one huge array and then lease sections of it to be used as buffers by the socket operations.

I have not yet tried pooling `Socket` or `SocketAsyncEventArgs` objects in a similar manner.


For benchmarking I have used Apache Bench (ab) tool running on Arch Linux inside a VirtualBox VM. All benchmarks involved dynamically generating and serving a "HELLO, WORLD" document on my Core i3 laptop, with ab -k -c 1000 -n 10000:

ServerKeep-alive r/sRegular r/s
F# WebServer35001000
Haskell warp/wai GHC 735003500
IIS 2500500
node.js (Windows)800400
node.js (Linux)?3000

I do not feel very good about these numbers, in particular because I have seen claims of Haskell WARP doing 90000 r/s on only slightly faster hardware (8-core Core i5). It may be that I am hitting VirtualBox networking overhead or I have not built the Haskell code with proper flags.

But for what they are worth, the numbers seem to indicate that F# async is a good enough foundation for web servers with performance in the IIS league. It does not need to be faster, it just needs to be good enough. The real advantage is that F# async code is tremendously easier to read and write than explicit callback code.

EDIT: Please do take the benchmarks with a grain of salt. They are far from comprehensive or correctly done.