


Scalable Winsock Applications 6 Part 1



What do we have in this chapter 6 part 1?

  1. APIs and Scalability

  2. AcceptEx()

  3. GetAcceptExSockaddrs()

  4. TransmitFile()

  5. TransmitPackets()

  6. ConnectEx()

  7. DisconnectEx()

  8. WSARecvMsg()


Developing Winsock applications has always been considered cryptic and difficult to learn. In reality, there are only a few basic principles, such as socket creation, connecting a socket, accepting connections, and sending and receiving data. The real difficulty lies in developing a scalable Winsock application that can handle anything from a single connection to thousands of connections. This chapter describes how to write scalable Winsock applications for Windows NT. The main focus is the server side of the client-server model; however, some of the topics apply equally to both.

This discussion of writing scalable applications applies to server applications and therefore only applies to Windows NT 4.0 and later versions. We don't include earlier versions of Windows NT because many of the features we will cover require Winsock 2, which is available only on Windows NT 4.0 and later versions. Finally, the focus of our discussion will be on the TCP/IP protocol. However, all of the topics we cover can easily apply to other connection-oriented, stream-based protocols. Some of the topics apply to UDP/IP as well (such as resource management) but connectionless, message-based protocols themselves will not be covered.

This chapter will first discuss the different Winsock API functions designed for use in scalable, high-performance applications such as AcceptEx(), TransmitFile(), and ConnectEx(). Typically, these are Microsoft-specific extensions added with different versions of the operating system because the original Winsock specification leaves out several key asynchronous functions. We'll then cover the necessary steps for implementing a scalable server and discuss how to handle low resource conditions that occur when the number of connections becomes very large.


APIs and Scalability


The only I/O model that provides true scalability on Windows NT platforms is overlapped I/O using completion ports for notification. In the previous chapter, we covered the various methods of socket I/O and explained that for a large number of connections, completion ports offer the greatest flexibility and ease of implementation. Mechanisms like WSAAsyncSelect() and select() are provided for easier porting from Windows 3.1 and UNIX, respectively, but are not designed to scale. The event-based models are not scalable because of the operating system limit on the number of events a thread can wait on simultaneously.

Another major advantage of overlapped I/O is that several Microsoft-specific extensions can only be called in an overlapped manner. When you use overlapped I/O, there are several options for how the notifications can be received. Event-based notification is not scalable because the operating system limit of waiting on 64 objects per thread necessitates using many threads. This is not only inefficient but also requires a lot of housekeeping overhead to assign events to available worker threads.

Overlapped I/O with callbacks is not an option for several reasons. First, many of the Microsoft-specific extensions do not allow Asynchronous Procedure Calls (APCs) for completion notification. Second, due to the way APCs are handled on Windows, it is possible for an application thread to starve. Once a thread goes into an alertable wait, all pending APCs are handled on a first in, first out (FIFO) basis. Now consider the situation in which a server has a connection established and posts an overlapped WSARecv() with a completion function. When there is data to receive, the completion routine fires and posts another overlapped WSARecv(). Depending on timing conditions and how much work is performed within the APC, another completion function is queued (because there is more data to be read). This can cause the server's thread to starve for as long as there is pending data on that socket.
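To put the 64-object limit in numbers, the following is a small arithmetic sketch rather than anything taken from Winsock itself. The helper name wait_threads_needed and the constant name MAX_WAIT_OBJECTS are invented for this illustration, and it assumes that every connection owns exactly one event and that each thread waits on as many events as it is allowed to.

```c
#include <stddef.h>

/* Per-thread wait limit described in the text (64 objects). The name is
   made up for this sketch. */
#define MAX_WAIT_OBJECTS 64

/* Hypothetical helper: number of waiting threads needed when each
   connection has one event and a thread can wait on at most
   MAX_WAIT_OBJECTS events. Rounds up. */
size_t wait_threads_needed(size_t connections)
{
    return (connections + MAX_WAIT_OBJECTS - 1) / MAX_WAIT_OBJECTS;
}
```

Under these assumptions, wait_threads_needed(10000) evaluates to 157, which is the kind of thread count that makes the event-based model impractical for large servers.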

Before delving deeper into the architecture of scalable Winsock applications, let's discuss the Microsoft-specific extensions that will aid us in developing scalable servers. These APIs are TransmitFile(), AcceptEx(), ConnectEx(), TransmitPackets(), DisconnectEx(), and WSARecvMsg(). There is a related extension function, GetAcceptExSockaddrs(), which is used in conjunction with AcceptEx().

Before describing each of the extension API functions, it is important to point out that these functions are defined in MSWSOCK.H. Only three of them (TransmitFile(), AcceptEx(), and GetAcceptExSockaddrs()) are actually exported from MSWSOCK.DLL, and applications should avoid calling those exports. Instead, applications should load each extension function dynamically by calling WSAIoctl() with SIO_GET_EXTENSION_FUNCTION_POINTER, which is the only way to obtain the remaining extension APIs. Not all providers have to support these APIs, so it is best to explicitly load them from the provider you are using.




AcceptEx()


Perhaps the most useful extension API for scalable TCP/IP servers is AcceptEx(). This function allows the server to post an asynchronous call that will accept the next incoming client connection. This function is defined as:



BOOL PASCAL FAR AcceptEx (
    IN SOCKET sListenSocket,
    IN SOCKET sAcceptSocket,
    IN PVOID lpOutputBuffer,
    IN DWORD dwReceiveDataLength,
    IN DWORD dwLocalAddressLength,
    IN DWORD dwRemoteAddressLength,
    OUT LPDWORD lpdwBytesReceived,
    IN LPOVERLAPPED lpOverlapped
);



The first parameter is the listening socket, and sAcceptSocket is a valid, unbound socket handle that will be assigned to the next client connection. So the socket handle for the client needs to be created before posting the AcceptEx() call. This is necessary because socket creation is expensive, and if a server is interested in handling client connections as fast as possible, it needs to have a pool of sockets already created on which new connections will be assigned.
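The idea of keeping a pool of sockets ready for AcceptEx() can be sketched with a small fixed-size stack of handles. This is only an illustration under stated assumptions: the names SocketPool, pool_init, pool_put, and pool_get, the socket_handle stand-in type, and the capacity of 256 are all hypothetical, and a real server would also need locking if several threads share the pool.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_CAPACITY 256          /* arbitrary size chosen for this sketch */

/* Stand-in for SOCKET so the sketch compiles without the Windows SDK. */
typedef uintptr_t socket_handle;

/* A simple last-in, first-out pool of socket handles created in advance. */
typedef struct {
    socket_handle items[POOL_CAPACITY];
    size_t        count;
} SocketPool;

void pool_init(SocketPool *p)
{
    p->count = 0;
}

/* Add a ready handle to the pool. Returns 0 on success, -1 if full. */
int pool_put(SocketPool *p, socket_handle s)
{
    if (p->count == POOL_CAPACITY)
        return -1;
    p->items[p->count++] = s;
    return 0;
}

/* Take a handle to pass as sAcceptSocket. Returns 0 on success, or -1 if
   the pool is empty, in which case the caller would create a new socket. */
int pool_get(SocketPool *p, socket_handle *out)
{
    if (p->count == 0)
        return -1;
    *out = p->items[--p->count];
    return 0;
}
```

A server following this pattern would fill the pool at startup and take one handle from it for each AcceptEx() call it posts.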

The four parameters that follow sAcceptSocket are related. The lpOutputBuffer is required and is filled in with the local and remote addresses for the client connection as well as an optional buffer to receive the first data chunk received from the client. The dwReceiveDataLength indicates how many bytes of the supplied buffer should be used to receive data sent by the client. An application may choose not to receive data and may specify zero. The dwLocalAddressLength specifies the size of the socket address structure corresponding to the address family of the client socket plus 16 bytes. The local address of the client socket connection is placed in the lpOutputBuffer following the receive data if specified. The dwRemoteAddressLength is the same. The remote address of the client connection will be written to the lpOutputBuffer following the receive data (if specified) and the local address. Note that dwReceiveDataLength may be zero but dwLocalAddressLength and dwRemoteAddressLength cannot be.
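The way a single output buffer is divided among the receive data and the two address blocks can be worked out with a little arithmetic. The sketch below is a hypothetical helper, not part of Winsock: the names AcceptBufferLayout, accept_buffer_layout, and ACCEPTEX_ADDR_PADDING are invented here, and the caller supplies the size of the socket address structure (16 bytes for SOCKADDR_IN).

```c
#include <stddef.h>

/* Extra bytes required on top of each socket address, as described above. */
#define ACCEPTEX_ADDR_PADDING 16

/* Hypothetical description of how one AcceptEx() output buffer is split. */
typedef struct {
    size_t recv_len;      /* value for dwReceiveDataLength */
    size_t addr_len;      /* value for dwLocalAddressLength and
                             dwRemoteAddressLength */
    size_t local_offset;  /* where the local address block begins */
    size_t remote_offset; /* where the remote address block begins */
} AcceptBufferLayout;

/* Split a buffer of buflen bytes for an address family whose socket
   address structure is sockaddr_size bytes long. Returns 0 on success,
   or -1 if the buffer cannot hold both address blocks. */
int accept_buffer_layout(size_t buflen, size_t sockaddr_size,
                         AcceptBufferLayout *out)
{
    size_t addr_len = sockaddr_size + ACCEPTEX_ADDR_PADDING;

    if (out == NULL || buflen < 2 * addr_len)
        return -1;

    out->addr_len      = addr_len;
    out->recv_len      = buflen - 2 * addr_len;   /* may be zero */
    out->local_offset  = out->recv_len;
    out->remote_offset = out->recv_len + addr_len;
    return 0;
}
```

For a 1024-byte buffer and a 16-byte address structure this gives 960 bytes of receive data followed by two 32-byte address blocks. The offsets only show where the regions lie; the addresses themselves should still be decoded with GetAcceptExSockaddrs().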

lpdwBytesReceived indicates the number of bytes received on the newly established client connection if the operation succeeds immediately. Finally, lpOverlapped is the WSAOVERLAPPED structure for this overlapped operation. This parameter is required; if you want to perform a blocking accept call, just use accept() or WSAAccept().

Before going any further, let's take a quick look at an example using the AcceptEx() function. The following code creates an IPv4 listening socket and posts a single AcceptEx().


SOCKET            s, sclient;
HANDLE            hCompPort;
LPFN_ACCEPTEX     lpfnAcceptEx = NULL;
GUID              GuidAcceptEx = WSAID_ACCEPTEX;

// The WSAOVERLAPPEDPLUS type will be described in detail in
// another chapter and includes a WSAOVERLAPPED structure as well as
// context information for the overlapped operation.
WSAOVERLAPPEDPLUS ol;
SOCKADDR_IN       salocal;
DWORD             dwBytes;
char              buf[1024];
int               buflen = 1024;

// Create the completion port
hCompPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, (ULONG_PTR)0, 0);

// Create the listening socket
s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

// Associate listening socket to completion port
CreateIoCompletionPort((HANDLE)s, hCompPort, (ULONG_PTR)0, 0);

// Bind the socket to the local port
salocal.sin_family = AF_INET;
salocal.sin_port   = htons(5150);
salocal.sin_addr.s_addr = htonl(INADDR_ANY);
bind(s, (SOCKADDR *)&salocal, sizeof(salocal));

// Set the socket to listening
listen(s, 200);

// Load the AcceptEx function
WSAIoctl(s,
         SIO_GET_EXTENSION_FUNCTION_POINTER,
         &GuidAcceptEx,
         sizeof(GuidAcceptEx),
         &lpfnAcceptEx,
         sizeof(lpfnAcceptEx),
         &dwBytes,
         NULL,
         NULL);

// Create the client socket for the accepted connection
sclient = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

// Initialize our "extended" overlapped structure
memset(&ol, 0, sizeof(ol));
ol.operation = OP_ACCEPTEX;
ol.client    = sclient;

// Post the overlapped accept. The "overlapped" member name is assumed here
// because the declaration of WSAOVERLAPPEDPLUS is not shown.
lpfnAcceptEx(s,
             sclient,
             buf,
             buflen - ((sizeof(SOCKADDR_IN) + 16) * 2),
             sizeof(SOCKADDR_IN) + 16,
             sizeof(SOCKADDR_IN) + 16,
             &dwBytes,
             &ol.overlapped);

// Call GetQueuedCompletionStatus within the completion function
// After the AcceptEx() operation completes associate the accepted client
// socket with the completion port


This sample is a bit simplified, but it shows the necessary steps. It shows how to set up the listening socket, which you've seen before. Then it shows how to load the AcceptEx() function. Applications should always load the extension functions themselves to avoid the performance penalty of the extension functions exported from MSWSOCK.DLL, which simply end up loading the same function on every call. Next, the application-specific overlapped structure is established. It contains the information about the asynchronous operation that the server needs in order to figure out what happened when the operation completes. The actual declaration of this type is not included for the sake of simplicity. Finally, once the AcceptEx() operation completes, the newly accepted client socket should be associated with the completion port.

Also be aware that because of the high-performance nature of AcceptEx(), the listening socket's attributes are not automatically inherited by the client socket. To have the client socket inherit them, the server must call setsockopt() with SO_UPDATE_ACCEPT_CONTEXT on the client socket handle, passing the listening socket handle as the option value.

Another point to be aware of, which we mentioned in the previous chapter, is that if a receive buffer is specified to AcceptEx() (that is, dwReceiveDataLength is greater than zero), then the overlapped operation will not complete until at least one byte of data has been received on the connection. So a malicious client could open many connections but never send any data. The previous chapter discusses methods to prevent this by using the SO_CONNECT_TIME socket option. The AcceptEx() function is available on Windows NT 4.0 and later versions.




GetAcceptExSockaddrs()


GetAcceptExSockaddrs() is really a companion function to AcceptEx() because it is required to decode the local and remote addresses contained within the buffer passed to the accept call. As you remember, a single buffer will contain any data received on the connection as well as the local and remote addresses for that connection. Any data indicated to be received will always be placed at the start of this buffer, followed by the addresses. However, these addresses are in a packed form and the GetAcceptExSockaddrs() function will decode them into the appropriate SOCKADDR structure for the address family. This function is defined as:


VOID PASCAL FAR GetAcceptExSockaddrs (
    IN PVOID lpOutputBuffer,
    IN DWORD dwReceiveDataLength,
    IN DWORD dwLocalAddressLength,
    IN DWORD dwRemoteAddressLength,
    OUT struct sockaddr **LocalSockaddr,
    OUT LPINT LocalSockaddrLength,
    OUT struct sockaddr **RemoteSockaddr,
    OUT LPINT RemoteSockaddrLength
);



The first four parameters are the same as those in the AcceptEx() call and they must match the values passed to AcceptEx(). That is, if 1024 was specified as dwReceiveDataLength in AcceptEx then the same value must be passed to GetAcceptExSockaddrs. The remaining four parameters are SOCKADDR pointers and their lengths for the local and remote addresses. These parameters are all output parameters. The following code illustrates how you would call GetAcceptExSockaddrs() after the AcceptEx() call in our previous example completes:


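// s, buf, buflen and dwBytes were defined previously
SOCKADDR *lpLocalSockaddr=NULL, *lpRemoteSockaddr=NULL;
int       LocalSockaddrLen=0, RemoteSockaddrLen=0;
LPFN_GETACCEPTEXSOCKADDRS lpfnGetAcceptExSockaddrs=NULL;
GUID      GuidGetAcceptExSockaddrs=WSAID_GETACCEPTEXSOCKADDRS;

// Load the GetAcceptExSockaddrs function
WSAIoctl(s, SIO_GET_EXTENSION_FUNCTION_POINTER,
         &GuidGetAcceptExSockaddrs, sizeof(GuidGetAcceptExSockaddrs),
         &lpfnGetAcceptExSockaddrs, sizeof(lpfnGetAcceptExSockaddrs),
         &dwBytes, NULL, NULL);

// Decode the addresses; the lengths must match those passed to AcceptEx()
lpfnGetAcceptExSockaddrs(buf,
                         buflen - ((sizeof(SOCKADDR_IN) + 16) * 2),
                         sizeof(SOCKADDR_IN) + 16,
                         sizeof(SOCKADDR_IN) + 16,
                         &lpLocalSockaddr,
                         &LocalSockaddrLen,
                         &lpRemoteSockaddr,
                         &RemoteSockaddrLen);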






After the function completes, the lpLocalSockaddr and lpRemoteSockaddr point to within the specified buffer where the socket addresses have been unpacked into the correct socket address structure.





TransmitFile()


TransmitFile() is an extension API that allows an open file to be sent on a socket connection. This frees the application from having to manually open the file and repeatedly perform a read from the file, followed by writing that chunk of data on the socket. Instead, an open file handle is given along with the socket connection and the file data is read and sent on the socket all within kernel mode. This prevents the multiple kernel transitions required when you perform the file read yourself. This API is defined as:


BOOL PASCAL FAR TransmitFile (
    IN SOCKET hSocket,
    IN HANDLE hFile,
    IN DWORD nNumberOfBytesToWrite,
    IN DWORD nNumberOfBytesPerSend,
    IN LPOVERLAPPED lpOverlapped,
    IN LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers,
    IN DWORD dwReserved
);



The first parameter is the connected socket. The hFile parameter is a handle to an open file. This parameter can be NULL, in which case only the buffers in lpTransmitBuffers are transmitted. Of course, it doesn't make much sense to use TransmitFile() to send memory-only buffers. nNumberOfBytesToWrite is the number of bytes to send from the file; a value of zero means send the entire file. nNumberOfBytesPerSend indicates the size of each block of data sent in each send operation. If zero is specified, the system uses the default send size, which is 4 KB on Windows NT Workstation and 64 KB on Windows Server. The lpOverlapped structure is optional. If the OVERLAPPED structure is omitted, the file transfer begins at the current file pointer position; otherwise, the offset values in the OVERLAPPED structure indicate where the operation starts. lpTransmitBuffers is an optional TRANSMIT_FILE_BUFFERS structure that contains memory buffers to transmit before and after the file. The last parameter holds optional flags that affect the behavior of the operation. Table 6-1 contains the possible flags and their meanings. Multiple flags may be specified.


Table 6-1 TransmitFile() Flags

Flag                     Description
TF_DISCONNECT            Start a transport-level disconnect after the TransmitFile() operation has been queued.
TF_REUSE_SOCKET          Prepare the socket handle to be reused. After the TransmitFile() completes, the socket handle may be used as the client socket in AcceptEx(). This flag is valid only if TF_DISCONNECT is also specified.
TF_USE_DEFAULT_WORKER    Directs the file transfer to use the system's default thread. This is useful for large file sends.
TF_USE_SYSTEM_THREAD     Also directs the TransmitFile() operation to use system threads for processing.
TF_USE_KERNEL_APC        Indicates that kernel asynchronous procedure calls should be used instead of worker threads to process the TransmitFile() request. Note that kernel APCs can only be scheduled to run when the application is in a wait state (not necessarily an alertable wait state though).
TF_WRITE_BEHIND          Indicates that the TransmitFile() request should return immediately even though the data may not have been acknowledged by the remote end. This flag should not be used with TF_DISCONNECT or TF_REUSE_SOCKET.


The TransmitFile() function is useful for file-based I/O such as Web servers. In addition, one beneficial feature of TransmitFile() is the capability of specifying the flags TF_DISCONNECT and TF_REUSE_SOCKET. When both of these flags are specified, the file and/or memory buffers are transmitted and the socket is disconnected once the send operation has completed. Also, the socket handle passed to the API can then be used as the client socket in AcceptEx() or the connecting socket in ConnectEx(). This is extremely beneficial because socket creation is very expensive. A server can use AcceptEx to handle client connections, then use TransmitFile() to send data (specifying these flags), and afterward the socket handle may be used in a subsequent call to AcceptEx().

Note that you can call TransmitFile() with a NULL file handle and NULL lpTransmitBuffers but still specify TF_DISCONNECT and TF_REUSE_SOCKET. This call will not send any data but allows the socket to be reused in AcceptEx(). This is a good workaround for platforms that do not support the DisconnectEx() API discussed later in this chapter. Finally, the TransmitFile() function is available on Windows NT 4.0 and later versions. Also, because TransmitFile() is geared toward server applications, it is fully functional only on server versions of Windows. On home and professional versions, there may be only two outstanding TransmitFile() (or TransmitPackets()) calls at any given time. If there are more, they are queued and not processed until the executing calls are finished.
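The restrictions on combining flags in Table 6-1 can be captured in a small check that a server runs before issuing the call. This is a sketch only: transmit_flags_valid is a hypothetical helper, TF_WRITE_BEHIND is the MSWSOCK.H name for the flag that lets the request return before the data is acknowledged, and the fallback values are assumed to match MSWSOCK.H so that the code also compiles without the Windows SDK.

```c
/* Fallback definitions for builds without the Windows SDK; the values are
   assumed to match MSWSOCK.H. */
#ifndef TF_DISCONNECT
#define TF_DISCONNECT   0x01
#define TF_REUSE_SOCKET 0x02
#define TF_WRITE_BEHIND 0x04
#endif

/* Hypothetical helper: returns 1 if the flag combination obeys the rules
   stated in Table 6-1, 0 otherwise. */
int transmit_flags_valid(unsigned long flags)
{
    /* TF_REUSE_SOCKET is valid only together with TF_DISCONNECT. */
    if ((flags & TF_REUSE_SOCKET) && !(flags & TF_DISCONNECT))
        return 0;

    /* TF_WRITE_BEHIND should not be combined with either of them. */
    if ((flags & TF_WRITE_BEHIND) &&
        (flags & (TF_DISCONNECT | TF_REUSE_SOCKET)))
        return 0;

    return 1;
}
```

With this check, the disconnect-and-reuse combination TF_DISCONNECT | TF_REUSE_SOCKET is accepted, while TF_REUSE_SOCKET on its own is rejected.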




TransmitPackets()


The TransmitPackets() extension is similar to TransmitFile() because it too is used to send data. The difference between them is that TransmitPackets() can send both files and memory buffers in any number and order. This function is defined as:



BOOL TransmitPackets (
    SOCKET hSocket,
    LPTRANSMIT_PACKETS_ELEMENT lpPacketArray,
    DWORD nElementCount,
    DWORD nSendSize,
    LPOVERLAPPED lpOverlapped,
    DWORD dwFlags
);



The first parameter is the connected socket on which to send the data. Unlike TransmitFile(), TransmitPackets() works over both datagram and stream-oriented protocols (such as UDP/IP and TCP/IP). lpPacketArray is an array of one or more TRANSMIT_PACKETS_ELEMENT structures, which we'll define shortly. nElementCount indicates the number of elements in that array. nSendSize is the same as the nNumberOfBytesPerSend parameter of TransmitFile(). lpOverlapped is an optional overlapped structure. dwFlags takes the same flags as TransmitFile() (see Table 6-1), except that the flag names begin with TP instead of TF; their meanings are the same. Because TP_DISCONNECT and TP_REUSE_SOCKET have no meaning for datagrams, specifying them on a datagram socket results in an error. The TRANSMIT_PACKETS_ELEMENT structure is defined as:




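typedef struct _TRANSMIT_PACKETS_ELEMENT {
    ULONG dwElFlags;
#define TP_ELEMENT_MEMORY   1
#define TP_ELEMENT_FILE     2
#define TP_ELEMENT_EOP      4
    ULONG cLength;
    union {
        struct {
            LARGE_INTEGER nFileOffset;
            HANDLE        hFile;
        };
        PVOID pBuffer;
    };
} TRANSMIT_PACKETS_ELEMENT;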




The first field indicates the type of buffer contained in this element, either memory or file, as given by TP_ELEMENT_MEMORY and TP_ELEMENT_FILE, respectively. The TP_ELEMENT_EOP flag can be bitwise OR'ed in with one of the other two flags. It indicates that this element should not be combined with the following element in a single send operation. This allows the application to shape how the traffic is placed on the wire. The cLength field indicates how many bytes to transfer from the file or memory buffer. If the element refers to a file, then a cLength of zero means transmit the entire file. The union contains either a pointer to a buffer in memory or a handle to an open file as well as an offset value into that file. It is possible to reference the same file handle in multiple elements of the TRANSMIT_PACKETS_ELEMENT array. In this case, the offset can specify where to begin the transfer. Alternatively, a value of -1 means begin transmitting at the current file pointer position in that file.
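As an illustration of how the fields fit together, the sketch below fills a three-element array that describes a memory header, an entire file, and a memory trailer. The function build_packet_array is hypothetical, and the declarations in the #else branch are stand-ins that are assumed to mirror MSWSOCK.H so that the example compiles without the Windows SDK; on Windows the real headers are used instead.

```c
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>
#include <mswsock.h>
#else
/* Stand-in declarations (assumed to mirror MSWSOCK.H) for other platforms. */
typedef void *HANDLE;
typedef void *PVOID;
typedef unsigned long ULONG;
typedef union { long long QuadPart; } LARGE_INTEGER;
#define TP_ELEMENT_MEMORY 1
#define TP_ELEMENT_FILE   2
#define TP_ELEMENT_EOP    4
typedef struct _TRANSMIT_PACKETS_ELEMENT {
    ULONG dwElFlags;
    ULONG cLength;
    union {
        struct {
            LARGE_INTEGER nFileOffset;
            HANDLE        hFile;
        };
        PVOID pBuffer;
    };
} TRANSMIT_PACKETS_ELEMENT;
#endif

/* Hypothetical helper: describe "header, whole file, trailer" in three
   elements. The header is marked TP_ELEMENT_EOP so that it is not combined
   with the file data in a single send. */
void build_packet_array(TRANSMIT_PACKETS_ELEMENT elems[3],
                        char *header, ULONG header_len,
                        HANDLE file,
                        char *trailer, ULONG trailer_len)
{
    memset(elems, 0, 3 * sizeof(elems[0]));

    /* Element 0: memory buffer sent on its own. */
    elems[0].dwElFlags = TP_ELEMENT_MEMORY | TP_ELEMENT_EOP;
    elems[0].cLength   = header_len;
    elems[0].pBuffer   = header;

    /* Element 1: the entire file (cLength of zero), from offset zero. */
    elems[1].dwElFlags            = TP_ELEMENT_FILE;
    elems[1].cLength              = 0;
    elems[1].nFileOffset.QuadPart = 0;
    elems[1].hFile                = file;

    /* Element 2: memory buffer that follows the file. */
    elems[2].dwElFlags = TP_ELEMENT_MEMORY;
    elems[2].cLength   = trailer_len;
    elems[2].pBuffer   = trailer;
}
```

On Windows, the filled array would then be passed as lpPacketArray with nElementCount set to 3.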

A word of caution about using TransmitPackets() with datagram sockets: the system is able to process and queue the send requests extremely fast, and it is possible that too many datagrams will pile up in the protocol driver. At this point, for unreliable protocols it is perfectly acceptable for the system to drop packets before they are even sent on the wire!

The TransmitPackets() extension API is available on Windows XP and later versions and is subject to the same type of limitation as TransmitFile(). On a non-server version of Windows NT, there can be only two outstanding TransmitPackets() (or TransmitFile()) calls at any given time.




ConnectEx()


The ConnectEx() extension function is a much-needed API available with Windows XP and later versions. This function allows for overlapped connect calls. Previously, the only way to issue multiple connect calls without using one thread for each connect was to use multiple non-blocking connects, which can be cumbersome to manage. This function is defined as:



BOOL PASCAL FAR ConnectEx (
    IN SOCKET s,
    IN const struct sockaddr FAR *name,
    IN int namelen,
    IN PVOID lpSendBuffer,
    IN DWORD dwSendDataLength,
    OUT LPDWORD lpdwBytesSent,
    IN LPOVERLAPPED lpOverlapped
);


The first parameter is a previously bound socket. The name parameter indicates the remote address to connect to and namelen is the length of that socket address structure. The lpSendBuffer is an optional pointer to a block of memory to send after the connection has been established, and dwSendDataLength indicates the number of bytes to send. lpdwBytesSent is updated to indicate the number of bytes sent successfully after the connection was established, if the operation completed immediately. lpOverlapped is the OVERLAPPED structure associated with this operation. This extension function can be called only in an overlapped manner.

As with AcceptEx(), because ConnectEx() is designed for performance, any previously set socket options or attributes are not automatically copied to the connected socket. To do so, the application must call setsockopt() with SO_UPDATE_CONNECT_CONTEXT on the socket after the connection is established. In addition, as with AcceptEx(), socket handles that have been "disconnected and re-used" by TransmitFile(), TransmitPackets(), or DisconnectEx() may be used as the socket parameter to ConnectEx().

There isn't anything difficult about the ConnectEx() API; the only requirement is that the socket passed into ConnectEx() be previously bound with a call to bind(). There are no special flags, and it is simply an overlapped version of connect() with the optional bonus of sending a block of data after the connection is established.




DisconnectEx()


The DisconnectEx() extension API is simple. It takes a socket handle, performs a transport-level disconnect, and prepares the socket handle for re-use in a subsequent AcceptEx() call. Both the TransmitFile() and TransmitPackets() APIs allow the socket to be disconnected and re-used after the send operation completes, but this standalone API was introduced for those applications that don't use either of those two APIs before shutting down. This extension API is available with Windows XP or later versions. However, for Windows 2000 or Windows NT 4.0 it is possible to call TransmitFile() with a null file handle and buffers but specify the disconnect and re-use flags, which will achieve the same results. This API is defined as:



BOOL PASCAL FAR DisconnectEx (
    IN SOCKET s,
    IN LPOVERLAPPED lpOverlapped,
    IN DWORD  dwFlags,
    IN DWORD  dwReserved
);



The first two parameters are self-explanatory. The dwFlags parameter can specify zero or TF_REUSE_SOCKET. If the flags are zero, then this function simply disconnects the connection. To be able to re-use the socket in AcceptEx, the TF_REUSE_SOCKET flag must be specified. The last parameter must be zero; otherwise, WSAEINVAL will be returned. If this function is invoked with an overlapped structure and if there are still pending operations on the socket, the DisconnectEx() call will return FALSE with the error WSA_IO_PENDING. The operation will complete once all pending operations are finished and the transport level disconnect has been issued. Otherwise, if it is called in a blocking manner, the function will not return until pending I/O is completed and the disconnect has been issued. Note that the DisconnectEx() function works only on connection-oriented sockets.




WSARecvMsg()


WSARecvMsg(), the last extension function, is not too interesting in the discussion of high-performance, scalable I/O, but it is new to Windows XP (and later versions) and we chose to be consistent and cover it with the rest of the extension APIs. WSARecvMsg() is nothing more than a complicated WSARecv() with the exception that it returns information about which interface the packet was received on. This is useful for datagram sockets that are bound to the local wildcard address on a multihomed machine and need to know which interface a packet arrived on. This function is defined as:



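INT PASCAL FAR WSARecvMsg (
    IN SOCKET s,
    IN OUT LPWSAMSG lpMsg,
    OUT LPDWORD lpdwNumberOfBytesRecvd,
    IN LPWSAOVERLAPPED lpOverlapped,
    IN LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine
);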




Most of the parameters are self-explanatory. Unlike the other extension functions, which cannot be called with an overlapped completion routine, this one can. The parameter that requires explaining is lpMsg. This is a WSAMSG structure that contains the buffers for receiving data as well as the informational buffers that will contain information about the data received. This structure is defined as:


typedef struct _WSAMSG {
    LPSOCKADDR   name;            /* Remote address */
    INT          namelen;         /* Remote address length */
    LPWSABUF     lpBuffers;       /* Data buffer array */
    DWORD        dwBufferCount;   /* Number of elements in the array */
    WSABUF       Control;         /* Control buffer */
    DWORD        dwFlags;         /* Flags */
} WSAMSG, *PWSAMSG, *LPWSAMSG;



The first field is a buffer that will contain the address of the remote system and namelen specifies how large the address buffer is. lpBuffers and dwBufferCount are the same as in WSARecv(). The Control field specifies a buffer that will contain the optional control data. Lastly, dwFlags is also the same as in WSARecv() and WSARecvFrom(). However, there are additional flags that can be returned that provide information about the packet received. These new flags are described in Table 6-2.


Table 6-2 Flags Returned from WSARecvMsg()

Flag          Description
MSG_BCAST     Datagram was received as a link-layer broadcast or with a destination address that was a broadcast address.
MSG_TRUNC     Datagram was truncated. There was more data than could be copied to the supplied receive buffer.
MSG_CTRUNC    Control data was truncated. The buffer supplied in the WSAMSG Control field was too small to receive the control data.


By default, no control information is returned when WSARecvMsg() is called. To enable control information, one or more socket options must be set on the socket, indicating the type of information to be returned. Currently, only one option is supported, which is IP_PKTINFO for IPv4 and IPV6_PKTINFO for IPv6. These options return information about which local interface the packet was received on.

Once the appropriate socket option is set and the WSARecvMsg() completes, the control information requested is returned via the Control buffer specified in the WSAMSG parameter. Each type of information requested is preceded by a WSACMSGHDR structure that indicates the type of information following as well as its size. This header structure is defined as:


typedef struct _WSACMSGHDR {
    SIZE_T      cmsg_len;
    INT         cmsg_level;
    INT         cmsg_type;
    /* followed by UCHAR cmsg_data[ ] */
} WSACMSGHDR, *PWSACMSGHDR, *LPWSACMSGHDR;



Within MSWSOCK.H, several useful macros are defined that extract the message headers and their data, such as WSA_CMSG_FIRSTHDR, WSA_CMSG_NXTHDR, and WSA_CMSG_DATA.



