NLS considerations in requests and filenames


Why the character set of a request matters

As an example, the character à (accented 'a') falls outside the 7-bit US-ASCII range (its value is greater than 127) and is represented by the single byte E0 in the ISO8859-1 character set. The UTF-8 encoded version is represented by the two-byte sequence C3 A0. Depending on the browser configuration, the request can then be generated with one of the following strings:

URL                     ISO8859-1                 UTF-8
http://host/à           http://host/%E0           http://host/%C3%A0
http://host/càt.html    http://host/c%E0t.html    http://host/c%C3%A0t.html
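These encoded forms can be reproduced outside of a browser. As a minimal sketch (using Python's standard library, not anything shipped with IHS), encode the same string in each character set and URL-escape the resulting bytes:

# Derive the percent-encoded forms shown above from the two byte
# encodings of the character 'à'.
from urllib.parse import quote

name = "càt.html"
print(quote(name.encode("iso-8859-1")))   # c%E0t.html    (single byte E0)
print(quote(name.encode("utf-8")))        # c%C3%A0t.html (two bytes C3 A0)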

If the request is not properly URL-escaped by the browser, it will show in your logs as:

URL                     ISO8859-1                  UTF-8
http://host/à           http://host/\xe0           http://host/\xc3\xa0
http://host/càt.html    http://host/c\xe0t.html    http://host/c\xc3\xa0t.html

IHS will generally (see below) not map between different character sets when trying to resolve a request URI to a filename. You must have agreement between the links and the filenames on disk, or URL-encode the links in advance to avoid ambiguity.

Simplest Solution

The typical problem our users encounter is that, depending on browser settings or how a user finds a page, requests for the same resource arrive in both kinds of encodings -- and only one of them works at a time.

The best short answer we have is: URL-encode your links ahead of time to take the decision out of the hands of the client.

Simply put, if your link contains characters outside of US-ASCII, create links in the form of
http://host/%E0 or http://host/%C3%A0 (depending on what exists in the filesystem).
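One way to take the guesswork out of pre-encoding is to generate the link from the raw filename bytes as they exist on disk. The sketch below (Python; the document root path and link format are only assumptions for illustration) percent-encodes exactly the bytes the filesystem holds, so the browser never has to choose an encoding:

# Build pre-encoded hrefs from the raw filename bytes on disk.
import os
from urllib.parse import quote

docroot = b"/opt/IBM/HTTPServer/htdocs"   # assumed document root

for raw_name in os.listdir(docroot):      # bytes in, bytes out: no decoding
    print('<a href="/%s">...</a>' % quote(raw_name))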

An example alternative solution using mod_rewrite to map UTF-8 sequences to other character sets is provided below.


Platform Considerations

Windows with IHS 2.x/6.x

IHS 2.0.42 and later on Windows will convert all requests into Unicode and use the appropriate Windows API to find the matching file in the filesystem, regardless of the local character set used when the file was created. The Windows platform is unique in that it always stores an unambiguous Unicode representation of the filename. In the above example, links of either type will resolve to the same filename.

Windows with IHS 1.3

The filename must match exactly (byte-for-byte) the filename as created in the filesystem, which typically means in a local codepage.

Note: On Windows, files are typically created with filenames in a local codepage but stored internally in Unicode.

Unix: All IHS releases

The filename must match exactly (byte-for-byte) the filename as created in the filesystem. Because there are no guarantees or metadata available that describe the character set of a filesystem, neither IHS nor the OS performs any translation. You can see these low-level bytes (without the risk of environment or terminal issues interfering) by doing the following:

ls -1 *somewildcard* | od -t x1 -c

Example: I've created two files named càt -- each with an accented 'a' in the middle, but with filenames composed of differing character sets. The first file is the UTF-8 version and the second uses the ISO8859-1 single-byte encoding. Command: ls -1 c*t.html | od -t x1 -c

Output for UTF-8 representation of càt:

0000000  63  c3  a0  74        (hex)
          c 303 240   t        (character/octal)

Output for ISO8859-1 representation of càt:

0000020  63  e0  74
          c 340   t

Without any other precautions in place, an ISO8859-1 encoded request and a UTF-8 encoded request would each find a different file in the filesystem. If one of the files did not exist, then one of the requests would result in a 404. Read on for information on how these types of requests are manifested.
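The same situation can be reproduced and inspected without od. The following minimal sketch (Python, Unix only; the filenames are the ones from the example above) creates the two directory entries and prints their raw bytes:

# Create "càt.html" once per encoding, then list the raw filename bytes
# to show that they are two distinct files.
import os

for encoding in ("utf-8", "iso-8859-1"):
    raw = "càt.html".encode(encoding)
    with open(raw, "wb") as f:            # open() accepts a bytes filename
        f.write(b"test\n")

for raw_name in sorted(os.listdir(b".")):
    if raw_name.startswith(b"c") and raw_name.endswith(b"t.html"):
        print(raw_name)                   # b'c\xc3\xa0t.html' and b'c\xe0t.html'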

Issues

Problem 1: Browsers may send UTF-8 or ISO8859-1 requests.

When a user clicks a link or types a URL in the address bar that contains characters outside the 7-bit US-ASCII range, most browsers by default send the request in a single-byte encoding such as ISO8859-1. These bytes are then URL-encoded so that only 7-bit US-ASCII characters are present in the request.

Both Mozilla and Internet Explorer have non-default options to send requests as UTF-8. When this is selected by the user, characters that don't map to 7-bit US-ASCII characters are converted to their multi-byte UTF-8 form before being URL-encoded.

The most straightforward way to circumvent these types of problems is to URL-encode the links in your HTML documents so that the client only sees 7-bit characters. The string you choose to pre-encode would be determined by what sequence of bytes exists in your actual filesystem. Another option is creating symbolic links from the real files to the alternatively named versions, as sketched below.
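For the symbolic-link approach, the alias can be created with the byte sequence of the other encoding so that both request forms resolve to the same content. A minimal sketch (Python, Unix only; filenames from the example above, with the UTF-8 name assumed to be the real file):

# Make the ISO8859-1-named request resolve to the UTF-8-named file by
# creating a symbolic link whose own name uses the ISO8859-1 bytes.
import os

utf8_name = "càt.html".encode("utf-8")         # c \xc3\xa0 t ...  (real file)
latin1_name = "càt.html".encode("iso-8859-1")  # c \xe0 t ...      (alias)

if not os.path.lexists(latin1_name):
    os.symlink(utf8_name, latin1_name)         # link name -> real file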

Problem 2: Links in HTML documents may be composed of a different encoding than what exists on the filesystem

Saving or converting HTML files as UTF-8, when they contain links with characters outside of 7-bit US-ASCII, may change the behavior of your links if the filenames on disk are still encoded in a local codepage.

If your HTML is UTF-8 encoded and contains characters outside of the 7-bit US-ASCII range, you have the following options: make sure your filenames on disk are also UTF-8, or pre-encode links in a URL-escaped form.
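If you take the first option, existing filenames can be converted in place. The sketch below (Python, Unix only) assumes the local codepage is ISO8859-1 and renames any filename that is not already valid UTF-8; treat it as illustrative rather than a supported tool:

# Rename filenames from an assumed local codepage (ISO8859-1) to UTF-8.
import os

docroot = b"."                            # assumed directory to convert

for raw_name in os.listdir(docroot):
    try:
        raw_name.decode("utf-8")          # already valid UTF-8: leave it alone
        continue
    except UnicodeDecodeError:
        pass
    new_name = raw_name.decode("iso-8859-1").encode("utf-8")
    os.rename(os.path.join(docroot, raw_name), os.path.join(docroot, new_name))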

Map selected byte sequences from UTF-8 to local codepage:

An example usage of mod_rewrite to map UTF-8 requests to a local codepage is provided below. Because UTF-8 has strict rules about which byte sequences are invalid, the script assumes that any request containing invalid UTF-8 is already in the proper local codepage and doesn't alter it. If the request is a valid UTF-8 string, it performs only the replacements it is configured for (it does not do a blanket conversion to the local character set, which minimizes false positives).

The script linked below is an UNSUPPORTED example to be used as a REFERENCE. We'd advise limiting the scope of what requests are seen by scripts of this nature (for example, use a directory context or restrict with a RewriteCond) for a variety of reasons.

nlsMap.pl

 
RewriteMap nlsmap prg:/opt/nlsMap.pl
RewriteCond ...
RewriteRule (.*) ${nlsmap:$1}
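The linked script is not reproduced here, but a RewriteMap "prg:" program simply reads one lookup key per line on stdin and writes the mapped value to stdout, flushing after each line. The stand-in below is NOT nlsMap.pl; it is a minimal sketch of the same idea described above (requests that are not valid UTF-8 are passed through unchanged, valid UTF-8 has a configured set of sequences replaced), and the replacement table is purely an assumption:

#!/usr/bin/env python3
# Illustrative stand-in for a RewriteMap "prg:" mapping program.
import sys

# Assumed mapping of UTF-8 byte sequences to their local-codepage equivalents.
REPLACEMENTS = {
    b"\xc3\xa0": b"\xe0",   # à
}

for line in sys.stdin.buffer:
    url = line.rstrip(b"\n")
    try:
        url.decode("utf-8")               # raises if not valid UTF-8
        for utf8_seq, local_seq in REPLACEMENTS.items():
            url = url.replace(utf8_seq, local_seq)
    except UnicodeDecodeError:
        pass                              # assume it is already local codepage
    sys.stdout.buffer.write(url + b"\n")
    sys.stdout.buffer.flush()             # the map program must not buffer its replies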

Known Problems

  • Plugin incorrectly decodes certain incoming URLs into signed characters with negative values.

    This was fixed by APAR PK09023. The fix was included in WAS 5.0.2.13, 5.1.1.7, and 6.0.2.1 service pack plug-ins.