Testing complex server code

As I mentioned in the release notes for v6.3 here, I’ve added some code to prevent potential recursion issues if certain performance improvements are enabled.

In Windows Vista and later it’s possible to set the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS flag on a socket using SetFileCompletionNotificationModes(). When this flag is set an overlapped operation can complete “in-line” and the completion can be handled on the thread that issued the operation rather than on one of the threads that is servicing the IO completion port that is associated with the socket. This is great as it means that if, for example, data is already available when an overlapped read is issued then we avoid a potentially costly context switch to an I/O thread to handle that data.

The downside is that the code for handling overlapped completions becomes potentially recursive. If we issue a read and it completes straight away and is handled on the thread that issued it then the code that handles the read completion is likely to issue another read, which may itself complete “in-line”, and so on. With a suitable rate of supply of inbound data this can lead to stack overflows due to unconstrained recursion.
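
For reference, opting in to this behaviour is a single call against the socket handle. A minimal sketch (this is illustrative code rather than code from The Server Framework, and the function name is mine) might look like this:

#include <winsock2.h>
#include <windows.h>

// Ask the system NOT to queue a completion packet to the associated I/O
// completion port when an overlapped operation on this socket completes
// immediately; the issuing thread handles the completion instead.
// Requires Windows Vista or later.
bool EnableSkipCompletionPortOnSuccess(
   SOCKET s)
{
   return 0 != ::SetFileCompletionNotificationModes(
      reinterpret_cast<HANDLE>(s),
      FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);
}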

The Server Framework is built in such a way that you can selectively take advantage of functionality available on later Windows platforms without needing to change your own code. The interface that the framework presents to you remains the same, and the same code that performs well on Windows 2000 performs better on Windows Server 2008; in this case you simply define a value in your Config.h file and recompile. One of the downsides (for me, at least) of this kind of flexibility is that it can be harder to adjust the framework code to operate correctly when various options are enabled. For example, if I were writing custom code for a server implementation rather than using the framework then I might move the point at which I issue the next read operation so that it doesn’t cause recursion when FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is enabled.
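
In practice that just means flipping a preprocessor switch and rebuilding; something along these lines, though the macro name below is purely illustrative and not the framework’s actual option name:

// Config.h - EXAMPLE_SKIP_COMPLETION_PORT_ON_SUCCESS is a made-up name,
// shown only to illustrate the define-and-recompile approach; see the
// framework's documentation for the real configuration define.
#define EXAMPLE_SKIP_COMPLETION_PORT_ON_SUCCESS 1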

Given that I can’t easily adjust the code to avoid the potential recursion, that some recursion in this area is actually beneficial when there IS data immediately available, and that in most servers the chance of the recursion getting out of control is slim, I decided that the best approach was to break the recursion at a configurable depth. Breaking the recursion is actually pretty easy in this situation as all we need to do is force a trip through the IO completion port, either for an I/O completion or for the issuing of an I/O operation. The latter is easier as we already have code in place to deal with the marshalling of I/O operations through the IO completion port in some situations; all we need to do is force an operation to be marshalled when it otherwise wouldn’t be.
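
At the Win32 level, “forcing a trip through the IO completion port” boils down to posting a completion packet that one of the I/O threads will pick up on a fresh stack. As a simplified illustration of the idea (this isn’t the framework’s marshalling code, which is hidden behind the pool’s dispatch call):

#include <windows.h>

// Queue a completion packet to the port rather than calling the handler
// directly; whichever thread is servicing the port dequeues it and
// processes the operation with no recursion on the issuing thread's stack.
bool MarshalViaCompletionPort(
   HANDLE iocp,
   const DWORD numberOfBytes,
   const ULONG_PTR completionKey,
   OVERLAPPED *pOverlapped)
{
   return 0 != ::PostQueuedCompletionStatus(
      iocp,
      numberOfBytes,
      completionKey,
      pOverlapped);
}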

The resulting code looks something like this:

template <class Base>
void TStreamSocketConnectionManager<Base>::PostIoOperation(
   JetByteTools::IO::IHandler &handler,
   JetByteTools::IO::IBuffer &buffer,
   const JetByteTools::IO::IBuffer::Operation operation,
   const bool forceDispatchToPool)
{
   buffer.SetOperation(operation);
   buffer.AddRef();

   CIORecursionLimiter recursionLimiter(RECURSION_LIMIT);

   if (forceDispatchToPool || !recursionLimiter.CanMakeRecursiveCall())
   {
      m_pool.Dispatch(
         handler,
         &buffer,
         buffer.GetUsed());
   }
   else
   {
      handler.HandleOperation(
         &buffer,
         buffer.GetUsed(),
         ERROR_SUCCESS);
   }
}

The recursion limiter uses thread local storage to manage a counter which counts the number of times CanMakeRecursiveCall() is called recursively on this thread and returns false when the limit is exceeded. It’s not ideal but it works nicely, is pretty efficient, is totally optional (all of the conditional code selection has been removed from the code above) and doesn’t require major structural changes to the surrounding code. BUT it’s a significant change that has quite fundamental effects on a small number of server designs, and I foresee that I’ll want to adjust the implementation in the future to make it better, so it’s the kind of code that would really benefit from a unit test or two to ensure that it works as designed and continues to work as designed in the face of subsequent changes. Luckily I have a comprehensive set of unit tests for most of the code in The Server Framework, but how do you unit test something like this?
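
Before getting to that question: such a limiter can be as simple as a per-thread depth counter. The sketch below uses a C++11 thread_local to show the shape of the thing; it’s illustrative only and not the framework’s actual CIORecursionLimiter implementation:

#include <cstddef>

class CIORecursionLimiter
{
   public:

      explicit CIORecursionLimiter(
         const size_t limit)
         :  m_limit(limit)
      {
         // One more level of nesting on this thread for the lifetime of
         // this instance.
         ++s_depth;
      }

      ~CIORecursionLimiter()
      {
         --s_depth;
      }

      bool CanMakeRecursiveCall() const
      {
         // Allow the direct (recursive) call until this thread's depth
         // exceeds the configured limit; after that the caller should
         // marshal the operation through the IOCP instead.
         return s_depth <= m_limit;
      }

   private:

      const size_t m_limit;

      static thread_local size_t s_depth;
};

thread_local size_t CIORecursionLimiter::s_depth = 0;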

The unit tests that I have for much of The Server Framework are probably best described as integration tests; though they’re unit testing very small and isolated areas of code, they require a lot of the rest of the framework to be operating before the target code can be tested. For example, many of the tests that test read completion handling, filtering, etc. require that the test harness sets up a complete client and server system that actually communicates via sockets and which is created, started, tested, stopped and destroyed all under the control of the unit test. This is very worthwhile, as each test may only exercise a small area of code but requires that much of the framework is operating correctly. Tests occasionally flush out completely unrelated race conditions, etc., simply because code is executed so often in so many contrived situations.

The actual recursive read completion test looks something like this.

template <class T>
void ConnectionManagerTest<T>::TestRecursiveReadCompletions()
{
   TestManager manager(10);

   manager.ReadOnConnect();
   manager.ReadOnReadComplete();
   manager.IOThreadsAreInPool();

   {
      JetByteTools::Socket::Mock::CTestSocketServer server;

      {
         CSmartStreamSocket socket = manager.Connect(
            server.GetPort());

         THROW_ON_FAILURE(true == server.WaitForAccept());

         manager.CheckResult(
            _T("|OnPreOutgoingConnect")
            _T("|OnConnectionEstablished")
            _T("|RequestRead")
            _T("|HandleOperation|"));

         // Force our I/O operations to play by the real rules
         // so that we can see when the recursion is broken
         // by a dispatch through the IOCP.

         manager.SetDispatchViaIOCP(true);

         // first read isn't recursive

         const size_t buffersFull = RECURSION_LIMIT + 1;

         for (size_t i = 0; i < buffersFull; ++i)
         {
            server.Write(1, "0123456789");
         }

         // sleep here to ensure that the sends have all arrived
         // in the tcp stack before we trigger the first read to
         // complete - :(

         ::Sleep(1000);

         THROW_ON_FAILURE(true == manager.ProcessEvent());

         JetByteTools::Win32::_tstring result;

         for (size_t i = 0; i < RECURSION_LIMIT - 1; ++i)
         {
            result += _T("|HandleOperation")
                      _T("|ReadCompleted")
                      _T("|RequestRead")
                      _T("|HandleOperation");
         }

         // The call that breaks the recursion
         // doesn't get handled...

         result += _T("|HandleOperation")
                   _T("|ReadCompleted")
                   _T("|RequestRead|");

         manager.CheckResult(result);

         THROW_ON_FAILURE_EX(true == manager.ProcessEvent());

         manager.CheckResult(
            _T("|HandleOperation")
            _T("|HandleOperation")
            _T("|ReadCompleted")
            _T("|RequestRead")
            _T("|HandleOperation|"));
      }
   }

   manager.SetDispatchViaIOCP(false);

   THROW_ON_FAILURE_EX(true == manager.ProcessEvent());

   manager.CheckResult(
      _T("|HandleOperation")
      _T("|OnClientClose")
      _T("|HandleOperation")
      _T("|OnConnectionClientClose")
      _T("|HandleOperation")
      _T("|OnConnectionClosure: OrderlyClose")
      _T("|OnConnectionClosed")
      _T("|ReleaseSocket")
      _T("|OnSocketReleased|"));
}

The test is templatised so that it can be run against various connection manager base classes; at present this allows the same tests to be run on the filtering connection manager (used for layering SSL, compression, etc. on a TCP stream) and the non-filtering connection manager. We first create our ’test manager’. This is a class derived from the code under test (a connection manager) that has a selection of mock objects plugged into it so that it can run under the control and monitoring of the test harness. For example, the IO completion port that it uses logs the operations that occur on it and allows dispatches that would normally go through the IOCP to be made directly on the thread making the call; this makes for easier testing in some scenarios but we disable this functionality for this test. We configure the test manager so that it automatically issues a read operation when a connection is established and issues a new read when the previous read completes; this is quite a common server design. The 10 that we pass in to the manager is the size of the buffers to use; we deliberately select a small value so that it’s easy to provide enough data in the TCP stack that we can rely on reads completing immediately.

Next we create a test server. This is the counterpart to our test connection manager and provides a similarly instrumented socket server that we can run during the test; it gives us something to connect to. We obtain the port from the server (it selects one itself and keeps trying until it finds a free one, so we can be sure the test won’t fail due to the port already being in use) and then we have the connection manager connect to the server. We wait for the instrumented server to accept the connection and then check that the expected callbacks have been called on the manager. Note that since the manager’s connect call is synchronous we can be sure that these callbacks have all been executed before the connect call returns, so there’s no race condition to worry about here.

Now we write enough data to ensure that our recursion limiter will limit the recursion. Then we have a bit of a hack to ensure that the data actually gets into the connection manager’s TCP stack read buffer before we allow the first read to be issued. Whilst this kind of thing is regrettable, I’m pragmatic in my testing and it’s better to have a test with a nasty hack in it than no test at all.

The connection manager’s read has completed but our instrumented IO completion port hasn’t let it out yet; we manually trigger the processing of the completion and this causes a new read to be issued. Since the correct options are enabled in the code and data is available, the read completes in-line and the completion handler is called recursively. This continues until the recursion limiter breaks the recursion and forces a read request to be marshalled through the IOCP. Since our IOCP is instrumented for the test we control it, and at this point we’re not allowing any more completions to be processed, so our code returns to the test harness with a marshalled read operation pending in the IOCP. We handle that with a call to ProcessEvent() and check that all of the callbacks we expected actually occurred.

Finally we allow the server to be destroyed, allow the final read completion to occur in the manager and check that we get the correct sequence of callbacks for the connection closure and socket release. The test is complete.

There’s a lot of scaffolding in place to make up this test but it works, it’s repeatable and it’s reliable. The actual test contains a fair amount of conditional code for testing the various combinations of build options that adjust how this test actually operates. For example, the same test will also test that recursion doesn’t happen when it’s built with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS disabled. There are, as of today, 518 tests in the Socket Tools library alone, many of them run up client and server systems in the test and all of these tests are built and executed on all supported compilers and all supported platforms. I haven’t quite got every combination of build options built independently and combined but we cover most of the key options and it’s easy to set up a custom test run should a particular combination of options be required to reproduce a client problem.

Many people would look at what we’re trying to test here and say that it can’t be tested. Many people will nitpick over whether what I have here is a unit test or an integration test or whatever; it doesn’t matter. The important thing is that every time the build machines build the code this test is part of the unit test suite that determines whether the code can be released.

I’m lucky; I got hooked on testing a long time ago and much of The Server Framework has been built in a Test Driven Development style, which makes it easy to test. Testing some scenarios is quite complex but it’s well worth the effort as it makes complex changes and refactorings easy.