Windows 8 Registered I/O and I/O Completion Ports
In my last blog post I introduced the Windows 8 Registered I/O Networking Extensions (RIO). As I explained, there are three ways to retrieve completions from RIO: polled, event driven and via an I/O Completion Port (IOCP). This makes RIO pretty flexible and allows it to be used in many different server designs. The polled scenario is likely aimed at very high performance UDP or High Frequency Trading style situations where you may be happy to burn CPU so as to process inbound datagrams as fast as possible. The event driven style may also help here, allowing you to wait efficiently rather than spin, but it’s the IOCP style that interests me most at present, as this promises to provide increased performance to more general purpose networking code.
Please bear in mind the caveats from my last blog post: this stuff is new, I’m still finding my way, the docs aren’t in sync with the headers in the SDK and much of this is based on assumption and intuition.
How do RIO and IOCP work together?
RIO’s completions arrive via a completion queue, which is a fixed-size data structure that is shared between user space and kernel space (via locked memory?) and which does not require a kernel mode transition to dequeue from (see this BUILD video for more details on RIO’s internals). As I showed last time, you specify how you want to retrieve completions when you create the queue: you provide an event to be signalled, an IOCP to be posted to, or nothing if you will simply poll the queue. When using an IOCP, you first indicate that you want to receive completions by calling RIONotify(); a notification is then sent to the IOCP when the completion queue is no longer empty.
Simplified code for handling an IOCP driven RIO completion queue might look like this:
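    DWORD numberOfBytes = 0;
    ULONG_PTR completionKey = 0;
    OVERLAPPED *pOverlapped = nullptr;

    // Block until the RIO completion queue posts its notification to the IOCP.
    if (::GetQueuedCompletionStatus(
        hIOCP,
        &numberOfBytes,
        &completionKey,
        &pOverlapped,
        INFINITE))
    {
        const DWORD numResults = 10;
        RIORESULT results[numResults];

        // Dequeuing does not need a kernel mode transition, so keep going
        // until the queue is empty. Error handling is omitted here.
        ULONG numCompletions = rio.RIODequeueCompletion(
            queue,
            results,
            numResults);

        while (numCompletions)
        {
            for (ULONG i = 0; i < numCompletions; ++i)
            {
                // deal with request completion...
            }

            numCompletions = rio.RIODequeueCompletion(
                queue,
                results,
                numResults);
        }

        // Ask for another IOCP notification; none will arrive until we do.
        rio.RIONotify(queue);
    }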
Of course, in real code you’d likely use the completionKey to pass details of the queue that’s being operated on; a pointer to a structure that you can plug the queue into once you’ve created it, perhaps.
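The completion key idea can be sketched as follows. This is a minimal, portable illustration rather than code from the sample: `QueueContext`, `ToCompletionKey()` and `FromCompletionKey()` are hypothetical names, and `CompletionKey` and `QueueHandle` are stand-ins for `ULONG_PTR` and `RIO_CQ` so that the sketch compiles without the Windows headers. The point is the ordering: the key has to be supplied when the queue is created, so the context has to exist first and the queue is plugged into it afterwards.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the Windows types so that the sketch compiles anywhere.
using CompletionKey = std::uintptr_t;   // plays the role of ULONG_PTR
using QueueHandle = void *;             // plays the role of RIO_CQ

// Hypothetical per-queue context. Its address travels as the IOCP
// completion key and comes back from GetQueuedCompletionStatus().
struct QueueContext
{
    QueueHandle queue = nullptr;   // filled in once the queue has been created
    int id = 0;                    // illustrative extra per-queue state
};

// The key is needed before the queue exists, so hand out the address of
// the context and store the queue handle in it afterwards.
inline CompletionKey ToCompletionKey(QueueContext *pContext)
{
    return reinterpret_cast<CompletionKey>(pContext);
}

// Recover the context from the key that the IOCP hands back.
inline QueueContext *FromCompletionKey(CompletionKey key)
{
    return reinterpret_cast<QueueContext *>(key);
}
```

In the loop above, the I/O thread would then call something like `FromCompletionKey(completionKey)->queue` to find the queue to drain, instead of relying on a single `queue` variable.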
Anyway, the gist of it is that once the RIO completion queue is not empty you will get an IOCP completion if you have called RIONotify(), but you will not get another IOCP completion until you call RIONotify() again. At first this seems a little strange. After all, the completion queue could send another IOCP completion whenever it is, or becomes, non-empty after you have called RIODequeueCompletion() once. However, having to call RIONotify() to explicitly request a new notification is probably a good thing. It places you in complete control of which threads are currently accessing the RIO completion queue and, given the nature of RIO completion queues, this also means that by using the pattern above you can be sure that only one thread is processing completions for a given socket at a given time. Of course, if you are using separate RIO completion queues for send and receive operations then you may have one IOCP thread processing send completions and one processing receive completions at the same time. If you use just one RIO completion queue for both send and receive then you can be sure that only one thread is processing completions for a given socket at any one time. This is different to the normal IOCP model, where all of the threads in your I/O pool could be processing completions for the same socket if enough operations have completed.
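To make the "one notification per RIONotify() call" behaviour concrete, here is a toy model of it in portable C++. It is not the real API: `ModelCompletionQueue` and its members are hypothetical, and the model assumes that asking for a notification while results are already waiting posts the notification straight away, which is my assumption rather than something I have confirmed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model, not the real API: a completion queue that posts at most one
// notification for each call to Notify().
class ModelCompletionQueue
{
public:
    // Stands in for the kernel adding a completed request to the queue.
    void Complete(int result)
    {
        results_.push_back(result);
        PostIfArmed();
    }

    // Stands in for RIONotify(): request a single notification.
    // Assumption: if results are already waiting it is posted immediately.
    void Notify()
    {
        armed_ = true;
        PostIfArmed();
    }

    // Stands in for RIODequeueCompletion(): copy out up to 'max' results.
    std::size_t Dequeue(int *pOut, std::size_t max)
    {
        std::size_t count = 0;
        while (count < max && !results_.empty())
        {
            pOut[count++] = results_.front();
            results_.pop_front();
        }
        return count;
    }

    // How many notifications would have been posted to the IOCP so far.
    int NotificationsPosted() const { return notificationsPosted_; }

private:
    void PostIfArmed()
    {
        if (armed_ && !results_.empty())
        {
            armed_ = false;            // disarm until Notify() is called again
            ++notificationsPosted_;
        }
    }

    std::deque<int> results_;
    bool armed_ = false;
    int notificationsPosted_ = 0;
};
```

Because the queue disarms itself as soon as it posts, at most one notification is outstanding per queue, and so at most one I/O thread is woken to drain it.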
Why is this a good thing?
This behaviour is good because it means that, hopefully, you need do nothing clever to retain sequencing in completions. Assuming reads and writes complete to the RIO completion queue in the expected order (and I’d be very surprised if they didn’t) then the fact that you can guarantee that you only have one thread processing completions for a given socket means that you’re guaranteed, at that point, at least, that the completions are in the correct order. With IOCP the completions are placed in the queue in order but the fact that one or more threads from your I/O pool could be processing completions for the same socket simultaneously means that you need to actively ensure that the completions are processed in order (if that matters to you, and, more often than not, it does). This is more important in RIO as (according to the RIO BUILD video) your socket’s send and receive buffers are not used and so you need to have ample receives pending to ensure that you don’t stall your TCP connection or lose UDP datagrams. With TCP, multiple pending receives require sequencing to ensure the stream is processed in the correct order.
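For comparison, one common way to "actively ensure" ordering in the normal IOCP model is to tag each read with a sequence number when it is issued and to hold back any completion that arrives ahead of its turn. The sketch below is a generic illustration of that technique, not code from this post or from any particular framework; `ReadSequencer` is a hypothetical name, and it assumes the caller serialises calls per connection (with a per-connection lock, say) and processes the returned data before releasing it.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-connection helper: reads are numbered as they are issued
// and their data is released to the application strictly in that order.
// Not thread safe on its own; the caller must serialise access.
class ReadSequencer
{
public:
    // Call when issuing a read; carry the returned number with the request.
    std::uint64_t NextIssueNumber() { return nextToIssue_++; }

    // Call when a read completes, possibly out of order. Returns the data
    // that is now ready to be processed, in issue order (may be empty).
    std::vector<std::string> OnCompleted(std::uint64_t sequence, std::string data)
    {
        pending_.emplace(sequence, std::move(data));

        std::vector<std::string> ready;
        auto it = pending_.find(nextToDeliver_);
        while (it != pending_.end())
        {
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            it = pending_.find(++nextToDeliver_);
        }
        return ready;
    }

private:
    std::uint64_t nextToIssue_ = 0;
    std::uint64_t nextToDeliver_ = 0;
    std::map<std::uint64_t, std::string> pending_;   // completed early, held back
};
```

With a single thread draining a RIO completion queue, the hope is that this kind of bookkeeping simply isn't needed.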
It’s also good because, quite frankly, there’s no need to be notified again until you’ve drained the RIO completion queue and if completions keep arriving you get to stay on one thread and dequeue them. As I mentioned earlier, dequeuing completions doesn’t involve a kernel mode transition and so we’ve suddenly switched to a polled design where we only need a kernel mode transition when we run out of work to do.
Why is this a bad thing?
Unfortunately this behaviour may make it a little harder to get full utilisation of all your I/O threads. You need to make sure that you have enough RIO completion queues so that each of your I/O threads can do some work, and you need to hope that you’ve spread your connections across your queues in such a way that one RIO completion queue (i.e. a subset of connections) doesn’t have more work to do than the others. How a general purpose framework should assign new connections to RIO completion queues, how many queues it should use and how big they should be are all things that I expect to work out over time with some experimentation. If you have more RIO completion queues than you do I/O threads then you also need to be careful that you don’t stick your thread to one completion queue by looping for long periods on a single RIO completion queue; I expect that a configurable limit on how many RIO results to process per IOCP completion would do the trick.
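A configurable limit of that kind might look something like the sketch below. It is portable C++ rather than real RIO code: `DrainWithLimit()` is a hypothetical helper, `dequeue` stands in for a call to RIODequeueCompletion() on a specific queue, `handle` stands in for whatever you do with each result, and the batch size of 10 just mirrors the earlier example.

```cpp
#include <cassert>
#include <cstddef>

// Dequeue and handle results until the queue is empty or 'maxResults' have
// been handled, whichever comes first. Returns the number handled.
//   dequeue(pBuffer, capacity) stands in for RIODequeueCompletion() and
//   returns how many results it wrote into pBuffer (0 when the queue is empty).
template <typename Result, typename Dequeue, typename Handle>
std::size_t DrainWithLimit(Dequeue dequeue, Handle handle, std::size_t maxResults)
{
    constexpr std::size_t batchSize = 10;
    Result results[batchSize];

    std::size_t processed = 0;
    while (processed < maxResults)
    {
        // Never ask for more than we are still allowed to process.
        const std::size_t remaining = maxResults - processed;
        const std::size_t wanted = remaining < batchSize ? remaining : batchSize;

        const std::size_t count = dequeue(results, wanted);
        if (count == 0)
        {
            break;   // queue is empty
        }
        for (std::size_t i = 0; i < count; ++i)
        {
            handle(results[i]);
        }
        processed += count;
    }
    return processed;
}
```

Whether the limit was hit or the queue was emptied, the thread would then call RIONotify() and go back to the IOCP. My assumption is that if results are still waiting, the next notification arrives promptly and may well be picked up by a different thread, which is what gives the other queues a turn.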
Something else to think about is the fact that now if an I/O thread blocks before it has called RIONotify() then you’re blocking all of the sockets that are associated with the RIO completion queue that you’re currently processing, not just a single connection.
Wrapping up
It looks pretty easy to scale RIO completion processing using IOCP notification but the details are not going to be the same as you’re used to with normal IOCP completions. Each IOCP completion represents a potentially infinite block of RIO results for a subset of your connections. Expect your RIO architectures to be familiar, yet different to what you’re doing now with IOCP.
Code is here
Full source can be found here on GitHub.
This isn’t production code; error handling is simply “panic and run away”.
This code is licensed with the MIT license.