URL | ISO8859-1 | UTF-8 |
---|---|---|
http://host/à | http://host/%E0 | http://host/%C3%A0 |
http://host/càt.html | http://host/c%E0t.html | http://host/c%C3%A0t.html |
If the request is not properly URL-escaped by the browser, it will show in your logs as:
URL | ISO8859-1 | UTF-8 |
---|---|---|
http://host/à | http://host/\xe0 | http://host/\xc3\xa0 |
http://host/càt.html | http://host/c\xe0t.html | http://host/c\xc3\xa0t.html |
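The two tables above can be reproduced with a short Python sketch. It is illustrative only: the helper name `request_forms` is hypothetical, and the "logged" form is a rough approximation of how unescaped high bytes appear in a log, not the server's actual logging code.

```python
from urllib.parse import quote

def request_forms(path: str) -> dict:
    """Return (escaped, logged) forms of a path for ISO8859-1 and UTF-8."""
    forms = {}
    for charset in ("iso8859-1", "utf-8"):
        raw = path.encode(charset)
        # Escaped form: what a browser sends when it URL-escapes correctly.
        escaped = quote(raw)
        # Approximate log form of an unescaped request: bytes >= 0x80 as \xhh.
        logged = "".join(chr(b) if b < 0x80 else "\\x%02x" % b for b in raw)
        forms[charset] = (escaped, logged)
    return forms

for charset, (escaped, logged) in request_forms("/c\u00e0t.html").items():
    print(charset, escaped, logged)
```

The same character yields one byte in ISO8859-1 and two bytes in UTF-8, which is why the two requests name different files on a byte-for-byte filesystem.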
IHS will generally (see below) not map between different character sets when trying to resolve a request URI to a filename. You must have agreement between the links and the filenames on disk, or URL-encode the links in advance to avoid ambiguity.
The typical problem our users encounter is that depending on browser settings or how a user finds a page, the request for the same resource comes in both kinds of encodings -- and only one works at a time.
The best short answer we have is:
URL-encode your links ahead of time to take the decision out of the hands of the client.
Simply put, if your link contains characters outside of US-ASCII, create links in the form of
An example alternative solution using mod_rewrite to map UTF-8 sequences to other charsets is provided below.
Note: On Windows, files are typically created with filenames in a local codepage but stored internally in Unicode.
Example: I've created two files named càt, each with an accented 'a' in the middle, but with filenames composed in different character sets. The first file is the UTF-8 version and the second uses the ISO8859-1 single-byte encoding.
Command: ls -1 c*t.html | od -t x1 -c
Output for UTF-8 file representation of càt:
Without any other precautions in place, an ISO8859-1 encoded request and a UTF-8 encoded request would each find a different file in the filesystem. If one of the files did not exist, then one of the requests would result in a 404. Read on for information on how these types of requests are manifested.
Both Mozilla and Internet Explorer have non-default options to send requests as UTF-8. When this is selected by the user, characters that don't map to 7-bit US-ASCII characters are converted to their multi-byte UTF-8 form before being URL-encoded.
The most straightforward way to circumvent these types of problems is to URL-encode the links in your HTML documents so that the client only sees 7-bit characters. The string you choose to pre-encode is determined by the sequence of bytes that exists in your actual filesystem. Another option is creating symbolic links from the real files to the alternatively named versions.
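As a minimal sketch of pre-encoding, the Python snippet below percent-encodes the exact bytes of a filename as it exists on disk, so the resulting link contains only 7-bit characters. The helper name `link_for` and the `base` parameter are hypothetical, chosen for illustration.

```python
from urllib.parse import quote

def link_for(filename_bytes: bytes, base: str = "http://host/") -> str:
    """Build a pure-ASCII link from the exact bytes of the on-disk filename."""
    # quote() accepts bytes and escapes every byte outside the unreserved set,
    # so the browser never has to choose a character set for the link.
    return base + quote(filename_bytes)

print(link_for(b"c\xe0t.html"))      # file stored with an ISO8859-1 name
print(link_for(b"c\xc3\xa0t.html"))  # file stored with a UTF-8 name
```

Because the input is the raw byte string rather than decoded text, the link always matches whatever is actually in the filesystem.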
Saving or converting HTML files as UTF-8 may change the behavior of any external links in them that contain characters outside of 7-bit US-ASCII, if the filenames on disk are still encoded in a local codepage.
If your HTML is UTF-8 encoded and contains characters outside of the 7-bit US-ASCII range, you have the following options: make sure your filenames on disk are also UTF-8, or pre-encode links in a URL-escaped form.
The script linked below is an UNSUPPORTED example to be used as a REFERENCE. For a variety of reasons, we advise limiting which requests are seen by scripts of this nature (for example, by using it in a directory context or restricting it with a RewriteCond).
This was fixed by APAR PK09023. The fix was included in WAS 5.0.2.13, 5.1.1.7, and 6.0.2.1 service pack plug-ins.
Simplest Solution
http://host/%E0 or http://host/%C3%A0 (depending on what exists in the filesystem)
Platform Considerations
Windows with IHS 2.x/6.x
IHS 2.0.42 and later on Windows will convert all requests into Unicode and use the appropriate Windows API to find the matching file in the filesystem, regardless of the local character set used when the file was created. The Windows platform is unique in that it always stores an unambiguous Unicode representation of the filename. In the above example, links of either type will resolve to the same filename.
Windows 1.3
The filename must match exactly (byte-for-byte) the filename as created in the filesystem, which typically means in a local codepage.
Unix: All IHS releases
The filename must match exactly (byte-for-byte) the filename as created in the filesystem. Because there are no guarantees or metadata available that describe the character set of a filesystem, neither IHS nor the OS performs any translation. You can see these low-level bytes (without the risk of environment or terminal settings interfering) by doing the following:
ls -1 *somewildcard* | od -t x1 -c
Output for UTF-8 representation of càt:
0000000 63 c3 a0 74 (hex)
c 303 240 t (character/octal)
Output for ISO8859-1 representation of càt:
0000020 63 e0 74 (hex)
c 340 t (character/octal)
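Where many files need checking, the same byte-level inspection can be scripted. The following Python sketch is an illustration, not part of IHS: the names `classify_name` and `report` are hypothetical. It lists a directory as raw bytes and labels each name as plain ASCII, valid UTF-8, or something else (most likely a single-byte local codepage).

```python
import os

def classify_name(name: bytes) -> str:
    """Label a raw filename as 'ascii', 'utf-8', or 'non-utf-8'."""
    if all(b < 0x80 for b in name):
        return "ascii"
    try:
        name.decode("utf-8")  # strict decoding rejects invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "non-utf-8"

def report(directory: str = ".") -> None:
    """Print each filename as hex bytes together with its classification."""
    # Passing a bytes path makes os.listdir return undecoded bytes names.
    for name in sorted(os.listdir(os.fsencode(directory))):
        print(" ".join("%02x" % b for b in name), classify_name(name))
```

Calling `report("/path/to/docroot")` (path assumed) prints one line per file. Names labeled non-utf-8 will only match requests sent in that same single-byte encoding.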
Issues
Problem 1: Browsers may send UTF-8 or ISO8859-1 requests.
When a user clicks a link or types a URL in the address bar that contains characters outside the 7-bit US-ASCII range, most browsers by default send the request in a single-byte encoding such as ISO8859-1. These bytes are then URL-encoded so that only 7-bit US-ASCII characters are present in the request.
Problem 2: Links in HTML documents may be composed in a different encoding than what exists on the filesystem
Map selected byte sequences from UTF-8 to local codepage:
An example usage of mod_rewrite to map UTF-8 requests to a local codepage is provided below. Because UTF-8 has strict rules about which byte sequences are valid, this script assumes that any request containing invalid UTF-8 is already in the proper local codepage and doesn't alter it. If the request is a valid UTF-8 string, it performs only the replacements it is configured for. To minimize false positives, it does not convert the whole request to the local character set.
RewriteMap nlsmap prg:/opt/nlsMap.pl
RewriteCond ...
RewriteRule (.*) ${nlsmap:$1}
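The logic described above can be sketched in Python as follows. This is not the nlsMap.pl script itself: the mapping table, `map_request`, and `serve` are hypothetical names. It assumes the usual mod_rewrite external-program convention of one lookup key per line on standard input and one newline-terminated result on standard output, flushed immediately.

```python
# Hypothetical table: UTF-8 byte sequence -> ISO8859-1 byte. Only the
# sequences listed here are ever rewritten.
UTF8_TO_LOCAL = {
    b"\xc3\xa0": b"\xe0",  # a with grave accent
    b"\xc3\xa9": b"\xe9",  # e with acute accent
}

def map_request(raw: bytes, table=UTF8_TO_LOCAL) -> bytes:
    """Map configured UTF-8 sequences to the local codepage."""
    try:
        raw.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        # Not valid UTF-8: assume it is already in the local codepage.
        return raw
    for utf8_seq, local_seq in table.items():
        raw = raw.replace(utf8_seq, local_seq)
    return raw

def serve(inp, out, table=UTF8_TO_LOCAL) -> None:
    """Answer lookups: one key per input line, one mapped value per output line."""
    for line in inp:
        out.write(map_request(line.rstrip(b"\n"), table) + b"\n")
        out.flush()  # unbuffered replies, so the server is not left waiting
```

Wired up as a map program, the script would end with `serve(sys.stdin.buffer, sys.stdout.buffer)` after `import sys`. Working on bytes rather than decoded text keeps the invalid-UTF-8 passthrough exact.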
Known Problems