IBM HTTP Server - Diagnosing problems with sidd

During the initial SSL handshake between browser and web server, an SSL session is established, and characteristics such as client authentication and allowable ciphers are determined. This initial handshake is computationally intensive.

For subsequent TCP connections, the browser normally attempts to resume the prior SSL session instead of establishing a new SSL session, in order to avoid the expense of a full handshake. In order to support this resumption of SSL sessions, IBM HTTP Server maintains a cache of SSL sessions which can be resumed.

For the Windows platform, only one IBM HTTP Server child process is used to handle client connections, so an in-process cache maintained by the security library is sufficient. This document about sidd does not apply to the Windows platform.

For platforms other than Windows, multiple IBM HTTP Server child processes are normally used to handle client connections, so the cache of sessions must be accessible to all of those child processes. A session id cache daemon is provided (IHSROOT/bin/sidd), and it is started automatically when SSL support is enabled. It runs as a separate process.

Disabling sidd

If problems are experienced with sidd, there are certain circumstances where it can be safely disabled. Otherwise, most problems can be resolved with a configuration change.

AIX, HP-UX, Linux, Solaris

If a single long-lived child process is used to serve requests, sidd can be disabled and the internal security library cache used instead.

Disable the IBM HTTP Server sidd with the SSLCacheDisable directive and remove any existing SSLCacheEnable directives in httpd.conf.
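Before editing httpd.conf it can help to list the cache directives already present. The following is a minimal sketch, not part of IBM HTTP Server: the function name find_sslcache_directives is hypothetical, and the path in the usage comment assumes the /opt/IBMIHS example install directory.

```shell
# Hypothetical helper: print every SSLCacheEnable / SSLCacheDisable line
# (with its line number) found in the given configuration file.
# Directive names are matched case-insensitively. Commented-out lines are
# skipped because the pattern only allows whitespace before the directive.
find_sslcache_directives() {
    grep -n -i -E '^[[:space:]]*SSLCache(Enable|Disable)([[:space:]]|$)' "$1"
}

# Usage (the path is an assumption; substitute your own httpd.conf):
#   find_sslcache_directives /opt/IBMIHS/conf/httpd.conf
```

Any SSLCacheEnable lines it reports are the ones to remove when adding SSLCacheDisable.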

z/OS

The IBM HTTP Server InfoCenter explains how to use the native z/OS equivalent of sidd.

Diagnosing sidd connect failures

For every SSL handshake, the httpd process handling the connection will communicate with the session id daemon. The communication takes place over a Unix (AF_UNIX) socket.

Certain types of problems can result in a connect failure, and one of the following messages may be seen:

Failure reason: ECONNREFUSED
Example message: [crit] (146)Connection refused: SSL0600S: Unable to connect to session ID cache
Description: The session id cache is not running, is temporarily overloaded, or an operating system-specific issue has been encountered.

Failure reason: EPERM
Example message: [crit] (13)Permission denied: SSL0600S: Unable to connect to session ID cache
Description: The filesystem permissions in the path to the session id cache socket do not permit the web server user id to access it. (Error number 13 with the text "Permission denied" is the errno value EACCES; this document refers to the failure as EPERM.)

Failure reason: generic
Example message: [crit] SSL0600S: Unable to connect to session ID cache
Description: The customer is using an older level of IBM HTTP Server which does not log the exact failure reason.

IBM HTTP Server 1.3.26.x and 1.3.28.x users can upgrade to cumulative fix PK05084 or later to get the more descriptive message and save time diagnosing the problem.

IBM HTTP Server 2.0.x users can upgrade to cumulative fix PQ94389 or later to get the more descriptive message and save time diagnosing the problem.

If the customer cannot upgrade, work through the diagnosis steps for all of the other variations of this message.

All service levels of IBM HTTP Server 6.0 and later releases will log the specific failure reason.

The wording of "Connection refused" or "Permission denied" can vary from one platform to another.

Communication between httpd processes and sidd is required to reuse SSL sessions (i.e., avoid the expensive handshake on every TCP connection). Thus, if the connect error is occurring very frequently, it will result in a substantial increase in CPU utilization because SSL sessions will be reused infrequently.
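To see which of the failure reasons listed above dominates in a given error log, the SSL0600S messages can be grouped by the text that precedes the message id. This is a rough sketch: the function name tally_sidd_errors is hypothetical, and the two match strings assume the English wording shown in the example messages, which, as noted, can differ by platform.

```shell
# Hypothetical helper: count SSL0600S messages in an error log, grouped by
# the failure reason text that appears on the same line.
# "other" covers the generic message logged by older service levels and
# any platform-specific wording the two patterns do not match.
tally_sidd_errors() {
    awk '
        /SSL0600S/ {
            if ($0 ~ /Connection refused/)     refused++
            else if ($0 ~ /Permission denied/) denied++
            else                               other++
        }
        END {
            printf "refused=%d denied=%d other=%d\n", refused, denied, other
        }
    ' "$1"
}

# Usage (log path relative to the server root, as in this document):
#   tally_sidd_errors logs/error_log
```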

General diagnosis steps

  1. Make sure that the Unix socket used by sidd resides on a normal, local filesystem, as some network or other filesystems don't support Unix sockets. The default location for the Unix socket is

    IHSROOT/logs/siddport.

    If that does not reside on a normal filesystem, use the SSLCachePortFilename directive to place the Unix socket in a directory which resides on a local filesystem.

    Example:

    SSLCachePortFilename /var/run/siddport                         
    
  2. If more than one IBM HTTP Server instance is used on this machine, make sure that each has been configured with a specific Unix socket. This is usually a problem when two instances share the same server root or install location.

    Example:

    httpd-app1.conf
    SSLCachePortFilename /var/run/app1-siddport                         
    
    httpd-app2.conf
    SSLCachePortFilename /var/run/app2-siddport                         
    
  3. Make sure that one sidd process is running for every IBM HTTP Server instance. The parent process of sidd will be the parent httpd process of that web server instance.

    If sidd is not running for one or more instances, use the SSLCacheErrorLog directive in the conf file to specify the name of a sidd error log. Restart the web server. Once sidd exits or fails to start up, check the sidd error log.
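The check in step 3 can be sketched as a small filter over process listing output. This is an illustration only: the function name list_sidd is hypothetical, and it assumes input lines of the form "pid ppid command", such as those produced by the POSIX ps -o format options shown in the usage comment.

```shell
# Hypothetical helper: read "pid ppid command" lines on stdin and print one
# line per sidd process, together with the command name of its parent.
# The command may be reported as "sidd" or as a full path ending in /sidd.
list_sidd() {
    awk '
        { comm[$1] = $3 }
        $3 == "sidd" || $3 ~ /\/sidd$/ { parent[$1] = $2 }
        END {
            for (pid in parent) {
                pp = parent[pid]
                name = (pp in comm) ? comm[pp] : "unknown"
                print "sidd pid=" pid " ppid=" pp " parent_cmd=" name
            }
        }
    '
}

# Usage:
#   ps -e -o pid= -o ppid= -o comm= | list_sidd
```

Each web server instance should account for exactly one output line, and the reported parent should be that instance's parent httpd process.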

Specific diagnosis steps for the EPERM failure

This problem is caused by the web server user id (e.g., "www" or "nobody") not having permission to read the Unix socket used by sidd, the session id cache daemon.

Consider /opt/IBMIHS as the example IBM HTTP Server install directory, and assume that the customer did not use the SSLCachePortFilename directive to specify the location of the sidd socket, and that www is the web server user id (value of the User directive).

When IBM HTTP Server starts up, sidd will create the file /opt/IBMIHS/logs/siddport. When a new client SSL connection is received, mod_ibm_ssl will be running as user "www" and will try to connect to the sidd socket. So user "www" must have read and execute permissions to these directories:

        
/opt
/opt/IBMIHS
/opt/IBMIHS/logs

And user "www" must have read permission to this "file":

/opt/IBMIHS/logs/siddport

Normally, when IBM HTTP Server is installed the directories will be world readable and executable. If the customer changes those permissions (on /opt, /opt/IBMIHS, or /opt/IBMIHS/logs) then permission errors will be received when new SSL connections are being established and mod_ibm_ssl tries to connect to the sidd socket. The SSLCachePortFilename directive can be used to place the sidd socket somewhere else.

Example:

SSLCachePortFilename /var/run/siddport                         

The actual file needs to be in a directory structure which, on your system, the web server user id can access.
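The permission requirements described in this section can be checked mechanically. The following is a sketch under stated assumptions: the function name check_sidd_path is hypothetical, it tests exactly the permissions this document lists (read and execute on each directory, read on the socket), and it only gives a meaningful answer when run from a shell started as the web server user id.

```shell
# Hypothetical helper: report every directory on the way to the sidd socket
# that the current user cannot read and search, then check the socket itself.
# Returns 0 if no problem was found, 1 otherwise.
check_sidd_path() {
    sock=$1
    status=0
    dir=$(dirname "$sock")
    # Visit the socket directory and each of its ancestors.
    while :; do
        if [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
            echo "no read/execute access: $dir"
            status=1
        fi
        # Stop at the filesystem root (or "." for a relative path).
        case $dir in
            /|.) break ;;
        esac
        dir=$(dirname "$dir")
    done
    if [ ! -S "$sock" ]; then
        echo "not a socket (is sidd running?): $sock"
        status=1
    elif [ ! -r "$sock" ]; then
        echo "no read access: $sock"
        status=1
    fi
    return $status
}

# Usage, as the web server user (path from the example above):
#   check_sidd_path /opt/IBMIHS/logs/siddport
```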

If you have two instances of IBM HTTP Server that share an installation directory, each instance should specify a different socket path with its own SSLCachePortFilename directive.
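A quick way to spot instances that were given the same socket is to compare the directive across their configuration files. This is a minimal sketch: the function name find_shared_sidd_sockets is hypothetical, and the file names in the usage comment are the illustrative ones from the general diagnosis steps.

```shell
# Hypothetical helper: print each SSLCachePortFilename value that appears in
# more than one of the given configuration files.
# A value repeated inside a single file is counted once for that file.
find_shared_sidd_sockets() {
    awk '
        tolower($1) == "sslcacheportfilename" && !seen[FILENAME, $2]++ {
            count[$2]++
        }
        END {
            for (path in count)
                if (count[path] > 1) print path
        }
    ' "$@" | sort
}

# Usage:
#   find_shared_sidd_sockets httpd-app1.conf httpd-app2.conf
```

Instances that omit the directive fall back to the default IHSROOT/logs/siddport and are not reported by this sketch, so instances sharing a server root still need to be checked by hand.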

Specific diagnosis steps for the ECONNREFUSED failure

There are two classes of this error:

For solid failures, follow the general diagnosis steps above.

For intermittent failures, find how many handshakes are impacted by comparing the number of failures to the number of total handshakes.

Set LogLevel info in the web server configuration file, rename error_log so that a new one is created, and restart. After sufficient data has been gathered:

  1. Find the total number of SSL handshakes
  2. Find the number of sidd connect errors
    $ grep "SSL0600" logs/error_log | wc -l
    49
    
  3. Find the percentage of failures
    49 / 5073 (connect errors from step 2 divided by total handshakes from step 1) is a little less than 1%
    

If the percentage of failure is less than 10%, it should have only a small impact on CPU usage.

If the percentage of failure is higher, check the operating system-specific notes below for known issues.
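The arithmetic in the steps above can be scripted so that it is easy to repeat on fresh logs. This is a sketch only: the function names count_sidd_errors and failure_pct are hypothetical, and the total handshake count must still be obtained separately and passed in.

```shell
# Hypothetical helper: count log lines that contain the sidd connect error id.
count_sidd_errors() {
    grep -c "SSL0600" "$1"
}

# Hypothetical helper: print failures as a percentage of total handshakes,
# rounded to two decimal places. Prints "n/a" if the total is not positive.
failure_pct() {
    awk -v f="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * f / t
    }'
}

failure_pct 49 5073    # figures from the example above; prints 0.97
```

A result below 10 corresponds to the small CPU impact case described above.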

Operating system-specific notes

Solaris

Solaris 10

Solaris 10 has an apparent problem, seen both on SPARC and x64 platforms, which results in the ECONNREFUSED failure even under relatively light loads. This issue is tracked by Sun under bug id 6460268. Customers encountering the "Connection refused: SSL0600S" message on Solaris 10 should check with Sun on the availability of a fix for this problem.

Solaris 8 and 9

These levels of Solaris have a hard-coded queue length for the number of connections to an AF_UNIX socket. This hard-coded queue length is 32. The ECONNREFUSED failure will occur with 33 or more simultaneous attempts to communicate with sidd. This is tracked by Sun bug id 4352289. A fix is available for Solaris 9.

Linux

On Linux systems tested (2.4 and 2.6 kernels), the ECONNREFUSED error can only occur due to a configuration problem and/or the sidd process exiting. It will not occur intermittently, because the AF_UNIX support in the kernel will block a thread waiting to connect to sidd once the connect queue becomes full.