< Winsock 2 APIs & Scalability | Scalability Main | I/O Completion Port Client-Server Example >



Scalable Winsock Applications: Chapter 6 Part 2



What do we have in Chapter 6 Part 2?

  1. Scalable Server Architecture

  2. Accepting Connections

  3. Data Transfers

  4. TransmitFile() and TransmitPackets()

  5. Resource Management

  6. Server Strategies

  7. High Throughput

  8. Maximizing Connections

  9. Performance Numbers

  10. Winsock Direct and Sockets Direct Protocol


Scalable Server Architecture


Now that we've introduced the Microsoft-specific extensions, we'll get into the details of implementing a scalable server. Because this chapter focuses on connection-oriented protocols such as TCP/IP, we will first discuss accepting connections followed by managing data transfers. The last section will discuss resource management in more detail.


Accepting Connections


The most common action a server performs is accepting connections. The Microsoft-specific extension AcceptEx() is the only Winsock function capable of accepting a client connection via overlapped I/O. As we mentioned previously, AcceptEx() requires that the client socket be created beforehand by calling socket(). The socket must be unbound and unconnected, although it is possible to re-use socket handles after calling TransmitFile(), TransmitPackets(), or DisconnectEx().

A responsive server must always have enough AcceptEx() calls outstanding so that incoming client connections may be handled immediately. However, there is no magic number of outstanding AcceptEx() calls that will guarantee that the server will be able to accept every connection immediately. Remember that the TCP/IP stack will automatically accept connections on behalf of the listening application, up to the backlog limit. For Windows NT Server, the maximum backlog value is currently 200. If a server posts 15 AcceptEx() calls and a burst of 50 clients then connect to the server, none of the connections will be rejected. The server's posted accepts satisfy the first 15 connections, and the system silently accepts the remaining 35. Those 35 dip into the backlog, leaving the system able to queue 165 additional connections. When the server later posts additional AcceptEx() calls, they will complete immediately because one of the system-queued connections will be returned.
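The accounting in the example above can be sketched in portable C. The struct and function names here are hypothetical, invented for illustration; the numbers come straight from the text (backlog of 200, 15 posted accepts, a burst of 50 clients).

```c
/* Hypothetical accounting for how a burst of connections is absorbed:
 * posted AcceptEx() calls are satisfied first, and the overflow is
 * accepted silently by the stack, consuming backlog slots. */
typedef struct {
    int satisfied_by_accepts;  /* connections matched to posted AcceptEx() calls */
    int queued_in_backlog;     /* connections accepted silently by the stack */
    int backlog_remaining;     /* backlog slots still free after the burst */
} burst_result;

burst_result absorb_burst(int backlog, int posted_accepts, int burst)
{
    burst_result r;
    r.satisfied_by_accepts = burst < posted_accepts ? burst : posted_accepts;
    r.queued_in_backlog = burst - r.satisfied_by_accepts;
    r.backlog_remaining = backlog - r.queued_in_backlog;
    return r;
}
```

With the chapter's numbers, absorb_burst(200, 15, 50) yields 15 connections satisfied by accepts, 35 queued by the system, and 165 backlog slots remaining.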

The nature of the server plays an important role in determining how many AcceptEx() operations to post. For example, a server expected to handle many short-lived connections from a great number of clients may want to post more concurrent AcceptEx() operations than a server that handles fewer connections with longer lifetimes. A good strategy is to allow the number of AcceptEx() calls to vary between a low and a high watermark. The application keeps track of the number of outstanding AcceptEx() operations pending; when one or more of them completes and the outstanding count drops below the low watermark, additional AcceptEx() calls may be posted. Of course, if at some point an AcceptEx() completes and the number of outstanding accepts is greater than or equal to the high watermark, no additional calls should be posted while handling the current completion.
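The watermark bookkeeping can be sketched in portable C. This is only the counting logic, with hypothetical names; the actual AcceptEx() posting would be done by the accept-issuing thread discussed later in this section.

```c
/* Hypothetical low/high-watermark bookkeeping for outstanding AcceptEx()
 * calls. When a completion drops the count below the low watermark, post
 * enough new accepts to return to the high watermark; otherwise post none. */
typedef struct {
    int outstanding;
    int low_watermark;
    int high_watermark;
} accept_pool;

/* Called when an AcceptEx() completion is handled. Returns how many new
 * AcceptEx() calls the accept-issuing thread should post. */
int accepts_to_post(accept_pool *p)
{
    p->outstanding--;                    /* account for the completed accept */
    if (p->outstanding >= p->high_watermark)
        return 0;                        /* already at or above the ceiling */
    if (p->outstanding >= p->low_watermark)
        return 0;                        /* still within the comfort band */
    int n = p->high_watermark - p->outstanding;
    p->outstanding += n;                 /* account for the accepts we will post */
    return n;
}
```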

On Windows 2000 and later versions, Winsock provides a mechanism for determining whether an application is falling behind in posting adequate AcceptEx() calls. When creating the listening socket, associate it with an event by using the WSAEventSelect() API call and registering for FD_ACCEPT notification. If there are no pending AcceptEx() operations but there are incoming client connections (accepted by the system according to the backlog value), the event will be signaled. This signal can be used as an indication to post additional AcceptEx() operations.

One significant benefit of using AcceptEx() is the capability to receive data in addition to accepting the client connection. For servers whose clients send an initial request, this is ideal. However, as we mentioned in the previous chapter, the AcceptEx() operation will not complete until at least one byte of data has been received. To prevent malicious attacks or stale connections, a server should cycle through the client socket handles of all outstanding AcceptEx() operations and call getsockopt with SO_CONNECT_TIME, which returns a value regardless of whether the socket is actually connected. If the socket is connected, the value returned is the number of seconds the connection has been established; a value of -1 indicates it is not connected. If the WSAEventSelect() suggestion is implemented, then when the event is signaled it is a good time to check whether the client socket handles in outstanding accept calls are connected. Once an AcceptEx() call accepts an incoming connection, it waits to receive data, at which point there is one less outstanding accept call; once there are no remaining accepts, the event will be signaled on the next incoming client connection. As a word of warning, applications should not under any circumstances close a client socket handle used in an AcceptEx() call that has not been accepted, because doing so can lead to memory leaks. For performance reasons, the kernel-mode structures associated with an AcceptEx() call are not cleaned up when the unconnected client handle is closed; cleanup happens only when a new client connection is established or when the listening socket is closed.
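The stale-connection sweep can be sketched in portable C. Here the per-socket connect times are supplied as a plain array standing in for repeated getsockopt(s, SOL_SOCKET, SO_CONNECT_TIME, ...) calls, and all names are hypothetical; on Windows the values would come from the real socket handles.

```c
#define MAX_PENDING   64
#define NOT_CONNECTED -1   /* SO_CONNECT_TIME reports -1 for an unconnected socket */

/* Sweep the client sockets of outstanding AcceptEx() calls and flag any
 * connection that has been established longer than a threshold (seconds)
 * without completing its accept, i.e. without sending any data.
 * connect_time[i] is the seconds-connected value for pending accept i.
 * Indices of stale connections are written to stale_out; the count is
 * returned so the caller can forcibly close those client sockets. */
int flag_stale(const int connect_time[], int n, int threshold,
               int stale_out[])
{
    int stale = 0;
    for (int i = 0; i < n; i++) {
        if (connect_time[i] != NOT_CONNECTED && connect_time[i] > threshold)
            stale_out[stale++] = i;  /* candidate for forcible close */
    }
    return stale;
}
```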

Although it may seem logical and simpler to post AcceptEx() requests in one of the worker threads handling notification from the completion port, you should avoid this because the socket creation process is expensive. In addition, any complex computations should be avoided within the worker threads so that the server can process completion notifications as quickly as possible. One reason socket creation is expensive is the layered architecture of Winsock 2.0: when the server creates a socket, the request may be routed through multiple providers, each performing its own tasks, before the socket is created and returned to the application. Instead, a server should create client sockets and post AcceptEx() operations from a separate thread. When an overlapped AcceptEx() completes in a worker thread, an event can be used to signal the accept-issuing thread.


Data Transfers


Once clients are connected, the server will need to transfer data. This process is fairly straightforward, and once again, all data sent or received should be performed with overlapped I/O. By default, each socket has an associated send and receive buffer that is used to buffer outgoing and incoming data, respectively. In most cases these buffers should be left alone, but it is possible to change them or set them to zero by calling setsockopt with the SO_SNDBUF or SO_RCVBUF options.

Let's look at how the system handles a typical send call when the send buffer size is non-zero. When an application makes a send call, if there is sufficient buffer space, the data is copied into the socket's send buffers, the call completes immediately with success, and the completion is posted. On the other hand, if the socket's send buffer is full, then the application's send buffer is locked and the send call fails with WSA_IO_PENDING. After the data in the send buffer is processed (for example, handed down to TCP for processing), then Winsock will process the locked buffer directly. That is, the data is handed directly to TCP from the application's buffer and the socket's send buffer is completely bypassed.

The opposite is true for receiving data. When an overlapped receive call is performed, if data has already been received on the connection, it will be buffered in the socket's receive buffer. This data will be copied directly into the application's buffer (as much as will fit), the receive call returns success, and a completion is posted. However, if the socket's receive buffer is empty, when the overlapped receive call is made, the application's buffer is locked and the call fails with WSA_IO_PENDING. Once data arrives on the connection, it will be copied directly into the application's buffer, bypassing the socket's receive buffer altogether.

Setting the per-socket buffers to zero generally will not increase performance, because the extra memory copy can be avoided as long as there are always enough overlapped send and receive operations posted. Disabling the socket's send buffer has less of a performance impact than disabling the receive buffer, because the application's send buffer is always locked until it can be passed down to TCP for processing. However, if the receive buffer is set to zero and there are no outstanding overlapped receive calls, any incoming data can be buffered only at the TCP level. The TCP driver will buffer only up to the receive window size, which is 17 KB; TCP increases these buffers as needed up to this limit, and normally the buffers are much smaller. These TCP buffers (one per connection) are allocated out of non-paged pool, which means that if the server has 1000 connections and no receives posted at all, 17 MB of the non-paged pool can be consumed! The non-paged pool is a limited resource, and unless the server can guarantee there are always receives posted for a connection, the per-socket receive buffer should be left intact.

Only in a few specific cases will leaving the receive buffer intact lead to decreased performance. Consider the situation in which a server handles many thousands of connections and cannot have a receive posted on each connection (this can become very expensive, as you'll see in the next section), and in addition the clients send data sporadically. Incoming data is buffered in the per-socket receive buffer, so when the server does issue an overlapped receive, it is performing unnecessary work: the overlapped operation issues an I/O request packet (IRP) that completes immediately, after which notification is sent to the completion port. In this case, because the server cannot keep enough receives posted, it is better off performing simple non-blocking receive calls.


TransmitFile() and TransmitPackets()


For sending data, servers should consider using the TransmitFile() and TransmitPackets() API functions where applicable. The benefit of these functions is that a great deal of data can be queued for sending on a connection while incurring just a single user-to-kernel mode transition. For example, if the server is sending file data to a client, it simply needs to open a handle to that file and issue a single TransmitFile() instead of calling ReadFile() followed by a WSASend(), which would invoke many user-to-kernel mode transitions. Likewise, if a server needs to send several memory buffers, it also can build an array of TRANSMIT_PACKETS_ELEMENT structures and use the TransmitPackets() API. As we mentioned, these APIs allow you to disconnect and re-use the socket handles in subsequent AcceptEx() calls.


Resource Management


On a machine with sufficient resources, a Winsock server should have no problem handling thousands of concurrent connections. However, as the server handles increasingly more concurrent connections, a resource limitation will eventually be encountered. The two limits most likely to be encountered are the number of locked pages and non-paged pool usage. The locked pages limitation is less serious and more easily avoided than running out of the non-paged pool.

With every overlapped send or receive operation, it is probable that the data buffers submitted will be locked. When memory is locked, it cannot be paged out of physical memory. The operating system imposes a limit on the amount of memory that may be locked, and when this limit is reached, overlapped operations will fail with the WSAENOBUFS error. If a server posts many overlapped receives on each connection, this limit will be reached as the number of connections grows. If a server anticipates handling a very high number of concurrent clients, it can instead post a single zero-byte receive on each connection. Because there is no buffer associated with the receive operation, no memory needs to be locked. With this approach, the per-socket receive buffer should be left intact, because once the zero-byte receive completes, the server can simply perform non-blocking receives to retrieve all the data buffered in the socket's receive buffer; there is no more data pending when a non-blocking receive fails with WSAEWOULDBLOCK. This design is for servers that require the maximum possible number of concurrent connections while sacrificing data throughput on each connection.
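The drain loop that follows the zero-byte receive can be sketched in portable C. Here nb_recv() is a hypothetical stand-in for a non-blocking recv() on a simulated socket buffer, returning -1 where Winsock would fail with WSAEWOULDBLOCK; everything else about the sketch is invented for illustration.

```c
#include <string.h>

/* Hypothetical stand-in for a non-blocking recv(): copies up to len bytes
 * from a simulated per-socket receive buffer and returns the count, or -1
 * when the buffer is empty (the moral equivalent of WSAEWOULDBLOCK). */
typedef struct { const char *data; int avail; } sim_socket;

static int nb_recv(sim_socket *s, char *buf, int len)
{
    if (s->avail == 0) return -1;
    int n = s->avail < len ? s->avail : len;
    memcpy(buf, s->data, n);
    s->data += n; s->avail -= n;
    return n;
}

/* After the zero-byte receive completes, drain everything buffered on the
 * connection with non-blocking receives until "would block" is returned. */
int drain(sim_socket *s, char *out, int cap)
{
    int total = 0, n;
    char chunk[16];
    while ((n = nb_recv(s, chunk, sizeof chunk)) > 0) {
        if (total + n > cap) break;      /* out of room in caller's buffer */
        memcpy(out + total, chunk, n);
        total += n;
    }
    return total;                        /* bytes retrieved; socket now drained */
}
```

No locked pages are consumed while the connection is idle; memory is touched only during the brief drain after data actually arrives.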

Of course, the more you are aware of how the clients will be interacting with the server, the better. In the previous example, a non-blocking receive is performed once the zero-byte receive completes to retrieve the buffered data. If the server knows that clients send data in bursts, then once the zero-byte receive completes, it may post one or more overlapped receives in case the client sends a substantial amount of data (greater than the per-socket receive buffer, which is 8 KB by default).

Another important consideration is the page size on the architecture the server is running on. When the system locks memory passed into overlapped operations, it does so on page boundaries. On the x86 architecture, pages are locked in multiples of 4 KB. If an operation posts a 1 KB buffer, then the system is actually locking a 4 KB chunk of memory. To avoid this waste, overlapped send and receive buffers should be a multiple of the page size. The Windows API GetSystemInfo() can be called to obtain the page size for the current architecture.
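The rounding described above is a one-line calculation. A minimal sketch in portable C, assuming the common x86 page size of 4 KB (on Windows the real value would come from GetSystemInfo()):

```c
#include <stddef.h>

/* Round an overlapped buffer size up to a multiple of the system page size
 * so that locking the buffer wastes no partial page. The 4096-byte page
 * size used in the test below is an assumption for x86; query the real
 * value at run time. */
size_t round_to_page(size_t size, size_t page)
{
    return (size + page - 1) / page * page;
}
```

For example, a 1 KB buffer rounds up to 4 KB, matching the text's observation that posting a 1 KB buffer actually locks a 4 KB chunk.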

Hitting the non-paged pool limit is a much more serious error and is difficult to recover from. Non-paged pool is the portion of memory that is always resident in physical memory and can never be paged out. It is typically used by kernel-mode operating system components such as drivers, which include Winsock and protocol drivers such as tcpip.sys. Each socket created consumes a small portion of non-paged pool that is used to maintain socket state information. When the socket is bound to an address, the TCP/IP stack allocates additional non-paged pool for the local address information. When a socket is then connected, a remote address structure is also allocated by the TCP/IP stack. In all, a connected socket consumes about 2 KB of non-paged pool, and a socket returned from accept or AcceptEx() uses about 1.5 KB (because an accepted socket needs to store only the remote address). In addition, each overlapped operation issued on a socket requires an I/O request packet to be allocated, which uses approximately 500 bytes of non-paged pool.

As you can see, the amount of non-paged pool each connection uses is not great; however, as the number of connected clients increases, the server's total non-paged pool usage can become significant. For example, consider a server running Windows 2000 (or greater) with 1 GB of physical memory. For this amount of memory, 256 MB will be set aside for the non-paged pool. In general, the amount of non-paged pool allocated is one quarter the amount of physical memory, with a limit of 256 MB on Windows 2000 and later versions and 128 MB on Windows NT 4.0. With 256 MB of non-paged pool, it is possible to handle 50,000 or more connections, but care must be taken to limit the number of overlapped operations queued for accepting new connections as well as for sending and receiving on existing connections. In this example, the connected sockets alone consume 75 MB of non-paged pool (assuming each socket uses 1.5 KB, as mentioned). If the zero-byte overlapped receive strategy is used, a single IRP is allocated for each connection, which uses another 25 MB of non-paged pool.
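A rough budget calculation along these lines can be sketched in C. The per-object costs are the approximate figures quoted above (about 1.5 KB per accepted socket, about 500 bytes per pending IRP), not exact kernel numbers, and the function name is invented for illustration.

```c
/* Rough non-paged pool budget for accepted connections, using the
 * approximate per-object costs from the text: ~1.5 KB per accepted
 * socket and ~500 bytes per pending overlapped IRP. These are estimates,
 * not exact kernel allocation sizes. */
long pool_bytes(long connections, long irps_per_connection)
{
    const long per_socket = 1536;   /* ~1.5 KB of socket state */
    const long per_irp    = 512;    /* ~500 bytes per IRP */
    return connections * (per_socket + irps_per_connection * per_irp);
}
```

For 50,000 connections with one zero-byte receive each, this estimates roughly 100 MB of non-paged pool, in line with the 75 MB + 25 MB figures above.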

If the system does run out of non-paged pool, there are two possibilities. In the best-case scenario, Winsock calls fail with WSAENOBUFS. In the worst-case scenario, the system crashes with a fatal error. This typically occurs when a kernel-mode component (such as a third-party driver) doesn't handle a failed memory allocation correctly. As such, there is no guaranteed way to recover from exhausting the non-paged pool; furthermore, there is no reliable way of monitoring the amount of available non-paged pool, because any kernel-mode component can chew it up. The main point of this discussion is that there is no magical or programmatic method of determining how many concurrent connections and overlapped operations are acceptable. It is also virtually impossible to determine whether the system has run out of non-paged pool or exceeded the locked-page count, because both result in Winsock calls failing with WSAENOBUFS. Testing must be performed on the server: the developer must test the server's performance with varying numbers of concurrent connections and overlapped operations in order to find a happy medium. If programmatic limits are imposed to prevent the server from exhausting non-paged pool, you will know that any WSAENOBUFS failures are generally the result of exceeding the locked-page limit, and that can be handled gracefully in code, such as by further restricting the number of outstanding operations or closing some of the connections.



Server Strategies


In this section, we'll take a look at several strategies for handling resources, depending on the nature of the server. In addition, the more control you have over the design of both the client and the server, the better you can design each to avoid the limitations and bottlenecks discussed previously. Again, there is no foolproof method that will work 100 percent of the time in all situations. Servers can be divided roughly into two categories: high throughput and high connections. A high-throughput server is more concerned with pushing data on a small number of connections. Of course, the meaning of the phrase “small number of connections” is relative to the amount of resources available on the server. A high-connection server is more concerned with handling a large number of connections and is not attempting to push large amounts of data.


High Throughput


An FTP server is an example of a high-throughput server. It is concerned with delivering bulk content, and with processing each connection so as to minimize the time required to transfer the data. To do so, the server must limit the number of concurrent connections, because the greater the number of simultaneous connections, the lower the throughput on each connection. An example would be an FTP server that refuses a connection because it is too busy.

The goal of this strategy is maximizing I/O: the server should keep enough receives or sends posted to maximize throughput. Because each overlapped operation requires memory to be locked, as well as a small portion of non-paged pool for the IRP associated with the operation, it is important to limit I/O to a small set of connections. It is possible for the server to continually accept connections and have a relatively high number of established connections, but I/O must be limited to a smaller set.

In this case, the server may post a number of sends or receives on a subset of the established clients. For example, the server could handle client connections in a first-in, first-out manner and post a number of overlapped sends and/or receives on the first 100 connections. After those clients are handled, the server can move on to the next set of clients in the queue. In this model, the outstanding send and receive operations are limited to a smaller set of connections, which prevents the server from blindly posting I/O operations on every connection and quickly exhausting its resources.

The server should take care to monitor the number of operations outstanding on each connection so that it can defend itself against malicious clients. For example, a server designed to receive data from a client, process it, and send some sort of response should keep track of how many sends are outstanding. If the client is simply flooding the server with data but not posting any receives, the server may end up posting dozens of overlapped sends that will never complete. Once the server finds that there are too many outstanding operations on a connection, it can close the connection.
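The per-connection guard can be sketched in portable C. All names here are hypothetical; the real server would call guard_post_send() before each WSASend() and guard_send_done() when the corresponding completion is dequeued.

```c
/* Hypothetical per-connection guard against a client that reads nothing
 * while flooding the server: count overlapped sends in flight and mark
 * the connection for closing once a threshold is exceeded. */
typedef struct {
    int sends_outstanding;
    int limit;
    int should_close;
} conn_guard;

/* Call before posting an overlapped send. Returns 1 if the send may be
 * posted; returns 0 and marks the connection for closing if too many
 * sends are already pending (the client is not draining its data). */
int guard_post_send(conn_guard *c)
{
    if (c->sends_outstanding >= c->limit) {
        c->should_close = 1;
        return 0;
    }
    c->sends_outstanding++;
    return 1;
}

/* Call when a send completion is dequeued from the completion port. */
void guard_send_done(conn_guard *c)
{
    c->sends_outstanding--;
}
```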


Maximizing Connections


Maximizing the number of concurrent client connections is the more difficult of the two strategies, because handling the I/O on each connection becomes difficult. A server cannot simply post one or more sends or receives on each connection, because the amount of memory required (both in terms of locked pages and non-paged pool) is great. In this scenario, the server is interested in handling many connections at the expense of throughput on each connection. An example of this would be an instant messenger server: the server handles many thousands of connections but needs to send or receive only a small number of bytes at a time.

For this strategy, the server does not necessarily want to post an overlapped receive on each connection, because this would require locking many pages for each of the receive buffers. Instead, the server can post an overlapped zero-byte receive. Once the receive completes, the server performs non-blocking receives until WSAEWOULDBLOCK is returned. This allows the server to immediately retrieve all buffered data received on that connection. Because this model is geared toward clients that send data intermittently, it minimizes the number of locked pages but still allows processing of data on each connection.


Performance Numbers


This section covers performance numbers from the different servers provided in the previous chapter and in this one. The servers tested are those using blocking sockets, non-blocking sockets with select, WSAAsyncSelect(), WSAEventSelect(), overlapped I/O with events, and overlapped I/O with completion ports. Table 6-3 summarizes the results of these tests. For all of the tests, the server is an echo server: for each connection that is accepted, data is received and sent back to the client. The first entry for each I/O model represents a high-throughput server, where 7000 connections were attempted from three clients, each sending data as fast as possible; none of the sample servers limits the number of concurrent connections. The remaining entries represent tests in which the clients limit the rate at which they send data so as to not overrun the bandwidth available on the network. The second entry for each I/O model represents 12,000 rate-limited connections from the clients. If the server was able to handle the majority of the 12,000 connections, the third entry is the maximum number of clients the server was able to handle.

As we mentioned, the servers used are those provided in the previous chapter, except for the I/O completion port server, which is a slightly modified version of the previous chapter's completion port server in that it limits the number of outstanding operations. This completion port server limits the number of outstanding send operations to 200 and posts just a single receive on each client connection. The client used in this test is the I/O completion port client from the previous chapter. Connections were established in blocks of 1000 clients by specifying the '-c 1000' option on the client. The two x86-based clients initiated a maximum of 12,000 connections, and the Itanium system was used to establish the remaining clients in blocks of 4000. In the rate-limited tests, each client block was limited to 200,000 bytes per second (using the '-r 200000' switch), so the average send throughput for an entire block of clients was limited to 200,000 bytes per second (each individual client was not limited to this amount).



Table 6-3 I/O Method Performance Comparison
(Empty cells indicate data not available.)

I/O Model                    | Attempted/Connected | Memory Used (KB) | Non-Paged Pool | CPU Usage | Throughput (Send/Receive Bytes Per Second)
Blocking                     | 7000/1008           |                  |                |           | 2,198,148/2,198,148
Blocking                     | 12,000/1008         |                  |                | 5-40%     | 404,227/402,227
Non-blocking                 | 7000/4011           |                  |                |           |
Non-blocking                 | 12,000/5779         |                  |                |           |
WSAAsyncSelect               | 7000/1956           |                  |                |           | 1,610,204/1,637,819
WSAAsyncSelect               | 12,000/4077         |                  |                |           | 652,902/652,902
WSAEventSelect               | 7000/6999           |                  |                |           | 4,921,350/5,186,297
WSAEventSelect               | 12,000/11,080       |                  |                |           | 3,217,493/3,217,493
WSAEventSelect               | 46,000/45,933       |                  |                |           | 3,851,059/3,851,059
Overlapped (events)          | 7000/5558           |                  |                |           | 5,024,723/4,095,644
Overlapped (events)          |                     |                  |                |           | 1,803,878/1,803,878
Overlapped (events)          |                     |                  |                |           | 3,865,152/3,834,511
Overlapped (completion port) | 7000/7000           |                  |                |           | 6,282,473/3,893,507
Overlapped (completion port) |                     |                  |                |           | 5,027,914/5,027,095
Overlapped (completion port) |                     |                  |                |           | 4,326,946/4,326,496
The server was a 1.7 GHz Pentium 4 Xeon with 768 MB of memory. Clients were established from three machines: a 233 MHz Pentium II with 128 MB of memory, a 350 MHz Pentium II with 128 MB of memory, and a 733 MHz Itanium with 1 GB of memory. The test network was an isolated 100 Mbps hub. All of the machines tested had Windows XP installed.

The blocking model is the poorest performing of all the models. The blocking server spawns two threads for each client connection: one for sending data and one for receiving it. In both test cases, the server was able to handle only a fraction of the connections because it hit a system resource limit on creating threads: the CreateThread() call failed with ERROR_NOT_ENOUGH_MEMORY. The remaining client connections failed with WSAECONNREFUSED.

The non-blocking model fared only somewhat better. It was able to accept more connections but ran into a CPU limitation. The non-blocking server puts all the connected sockets into an FD_SET, which is passed into select. When select completes, the server uses the FD_ISSET macro to determine whether each socket is signaled. This becomes inefficient as the number of connections increases: just to determine whether a single socket is signaled, a linear search through the array is required. To partially alleviate this problem, the server can be redesigned to iterate through the FD_SETs returned from select. The only issue is that the server then needs to be able to quickly find the SOCKET_INFO structure associated with each socket handle; for this, the server can provide a more sophisticated cataloging mechanism, such as a hash tree, which allows quicker lookups. Also note that the non-paged pool usage is extremely high. This is because both AFD and TCP are buffering data on the client connections: the server is unable to read the data fast enough, as indicated by the zero throughput and the high CPU usage.

The WSAAsyncSelect() model is acceptable for a small number of clients but does not scale well, because the overhead of the message loop quickly bogs down its ability to process messages fast enough. In both tests, the server was able to handle only about a third of the connections made. The clients received many WSAECONNREFUSED errors, indicating that the server could not handle the FD_ACCEPT messages quickly enough, so the listen backlog became exhausted. However, even for those connections accepted, you will notice that the average throughput is rather low (even in the case of the rate-limited clients).

Surprisingly, the WSAEventSelect() model performed very well. In all the tests, the server was, for the most part, able to handle all the incoming connections while obtaining very good data throughput. The drawback to this model is the overhead required to manage the thread pool for new connections. Because each thread can wait on only 64 events, when new connections are established new threads have to be created to handle them. Also, in the last test case in which more than 45,000 connections were established, the machine became very sluggish. This was most likely due to the great number of threads created to service the many connections. The overhead for switching between the 791 threads becomes significant. The server reached a point at which it was unable to accept any more connections due to numerous WSAENOBUFS errors. In addition, the client application reached its limitation and was unable to sustain the already established connections (we'll discuss this in detail later).



The overlapped I/O with events model is similar to the WSAEventSelect() model in terms of scalability. Both models rely on thread pools for event notification, and both reach a limit at which the thread-switching overhead becomes a factor in how well they handle client communication. The performance numbers for this model almost exactly mirror those of WSAEventSelect(): it does surprisingly well until the number of threads increases.

The last entry is for overlapped I/O with completion ports, which is the best performing of all the I/O models. The memory usage (both user and non-paged pool) and accepted clients are similar to both the overlapped I/O with events and WSAEventSelect() model. However, the real difference is in CPU usage. The completion port model used only around 60 percent of the CPU, but the other two models required substantially more horsepower to maintain the same number of connections. Another significant difference is that the completion port model also allowed for slightly better throughput.

While carrying out these tests, it became apparent that there was a limitation introduced due to the nature of the data interaction between client and server. The server is designed to be an echo server such that all data received from the client was sent back. Also, each client continually sends data (even if it's at a lower rate) to the server. This results in data always pending on the server's socket (either in the TCP buffers or in AFD's per-socket buffers, which are all non-paged pool). For the three well-performing models, only a single receive is performed at a time; however, this means that for the majority of the time, there is still data pending. It is possible to modify the server to perform a non-blocking receive once data is indicated on the connection. This would drain the data buffered on the machine. The drawback to this approach in this instance is that the client is constantly sending and it is possible that the non-blocking receive could return a great deal of data, which would lead to starvation of other connections (as the thread or completion thread would not be able to handle other events or completion notices). Typically, calling a non-blocking receive until WSAEWOULDBLOCK works on connections where data is transmitted in intervals and not in a continuous manner.

From these performance numbers it is easily deduced that WSAEventSelect() and overlapped I/O offer the best performance. For the two event-based models, setting up a thread pool for handling event notification is cumbersome but still allows excellent performance for a moderately stressed server. Once the number of connections, and with it the number of threads, increases, scalability suffers as more CPU is consumed by context switching between threads. The completion port model still offers the ultimate scalability, because CPU usage is less of a factor as the number of clients increases.


Winsock Direct and Sockets Direct Protocol


Winsock Direct is a protocol for high-speed interconnects introduced in Windows 2000 Datacenter Server. It runs over special hardware available from several vendors, such as Giganet, Compaq, and others. What is so special about Winsock Direct is that it completely bypasses the TCP stack and goes directly to the network interface card, which allows for extremely high-speed data communications. The advantage of Winsock Direct is that it is completely transparent to a TCP Winsock application. That is, if a TCP application is run on a machine with a Winsock Direct-capable card, its traffic transparently goes over the Winsock Direct route (given that it is the appropriate route) instead of over a regular Ethernet connection.

The Sockets Direct Protocol is the next evolution of the Winsock Direct protocol. It is designed to run over InfiniBand-compatible hardware in newer releases of the Windows operating system; a typical application is a system area network (SAN). The on-the-wire protocol is slightly different from that of Winsock Direct, but it is still transparent to applications.

Because Winsock Direct is designed to be transparent, the same issues encountered with “regular” Winsock applications still apply when running over Winsock Direct. Applications still have to manage the number of outstanding overlapped operations so as to not exceed the locked pages or non-paged pool limits.



