Collaboration API Documentation

com.filenet.bso.api.util
Class BsoByteArrayWithCharset

java.lang.Object
  |
  +--com.filenet.bso.api.util.BsoByteArrayWithCharset

public class BsoByteArrayWithCharset
extends java.lang.Object

Converts a byte array into a string, guessing the proper charset if one is not known when the instance is constructed. If the charset is not specified, a guess is made by:

  1. examining the first few bytes of the byte array for a Unicode BOM (Byte Order Mark), the standard marker that Unicode-enabled editors write at the start of their text files. UTF-8 BOMs and both big- and little-endian UTF-16 BOMs are checked. The BOM is the Unicode code point U+FEFF, which encodes to the following sequences: UTF-8: 0xEF 0xBB 0xBF; UTF-16LE: 0xFF 0xFE; UTF-16BE: 0xFE 0xFF.
  2. if there is no BOM, trying to decode the byte array as UTF-8. Because UTF-8 is self-checking, this fails most of the time when the bytes were encoded with some other charset. Watch out, however, for the small remaining percentage of cases where the bytes decode cleanly as UTF-8 but were really encoded with something else.
  3. if the UTF-8 decoding fails, decoding in the default system charset.
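
The three steps above can be sketched roughly as follows. This is only an illustration of the documented order of checks, not the actual FileNet implementation; the class name CharsetGuessSketch and all of its methods are hypothetical, and null is used to stand for "system default charset" by analogy with the getCharset description.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the three-step guess; not the FileNet code. */
final class CharsetGuessSketch {

    /** Step 1: returns the charset implied by a leading BOM, or null if there is none. */
    static String charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    /** Step 2: strict UTF-8 check; malformed input is reported instead of replaced. */
    static boolean isValidUtf8(byte[] b) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Runs steps 1 to 3 in order; null means "fall back to the system default". */
    static String guessCharset(byte[] b) {
        String fromBom = charsetFromBom(b);
        if (fromBom != null) {
            return fromBom;
        }
        return isValidUtf8(b) ? "UTF-8" : null;
    }

    /**
     * Decodes with the guessed charset. Any BOM is left in the result here;
     * whether the real class strips it is not stated in this documentation.
     */
    static String decode(byte[] b) {
        String name = guessCharset(b);
        Charset cs = (name == null) ? Charset.defaultCharset() : Charset.forName(name);
        return new String(b, cs);
    }
}
```

Note that plain ASCII input always passes the strict UTF-8 check, so in this sketch it is reported as UTF-8.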

If the charset is known when the instance is constructed, it is used and no guessing takes place. The one exception is ISO-8859-1: Apache 2.0.52 adds a default charset of ISO-8859-1 to every text/plain, text/html, and similar file that does not declare a charset, so that particular charset is ignored if passed to the constructor.
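
The ISO-8859-1 rule can be pictured as a small filter applied to the constructor argument. The sketch below is an assumption about how such a filter might look; the class and method names are hypothetical, and whether the real class compares names case-insensitively or recognizes aliases such as "latin1" is not stated here.

```java
/** Hypothetical sketch of filtering the charset passed to the constructor. */
final class DeclaredCharsetSketch {

    /**
     * Returns the charset to trust, or null when the caller's value should be
     * treated as "unknown" so that guessing can take over.
     */
    static String effectiveCharset(String declared) {
        if (declared == null) {
            return null; // nothing was declared
        }
        // Assumption: a simple case-insensitive match on the canonical name.
        if ("ISO-8859-1".equalsIgnoreCase(declared.trim())) {
            return null; // likely the Apache default, so not trusted
        }
        return declared;
    }
}
```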

After the string is decoded with a guessed-at charset, the text itself may indicate the charset that should actually be used to decode it (ideally close to the start of the string). Examples are the encoding attribute of an XML declaration, the content-type meta tag of an HTML document, or an {$encoding} command in one of our email templates. In such cases, the resetCharset method may be used to return a redecoded string.
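
A caller might look for such an embedded declaration along the following lines. This is a minimal sketch under stated assumptions: the class name, the regular expressions, and the 1024-character search window are all illustrative choices, and the {$encoding} template command is left out because its syntax is not described here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that looks for a charset declared inside decoded text. */
final class EmbeddedCharsetSketch {

    // XML declaration, e.g. <?xml version="1.0" encoding="ISO-8859-2"?>
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // HTML meta tag, e.g. <meta ... content="text/html; charset=Shift_JIS">
    private static final Pattern HTML_CHARSET = Pattern.compile(
            "<meta[^>]*\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if none is found near the start. */
    static String findDeclaredCharset(String text) {
        // Only the head of the text is searched (the window size is an arbitrary choice).
        String head = text.length() > 1024 ? text.substring(0, 1024) : text;
        Matcher m = XML_ENCODING.matcher(head);
        if (m.find()) {
            return m.group(1);
        }
        m = HTML_CHARSET.matcher(head);
        if (m.find()) {
            return m.group(1);
        }
        return null;
    }

    // Intended use with the class documented here (not compiled in this sketch):
    //   BsoByteArrayWithCharset data = new BsoByteArrayWithCharset(bytes);
    //   String declared = findDeclaredCharset(data.getString());
    //   String text = (declared != null) ? data.resetCharset(declared) : data.getString();
}
```

This works because the markup of an XML declaration or HTML meta tag is plain ASCII, which usually survives a wrong guess among ASCII-compatible charsets.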

This class makes some attempt to handle synchronization in a multithreaded environment, but note that the value returned by getCharset might not match the charset used to decode the string returned by getString if another thread has called resetCharset in between.
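
One way a caller could avoid that mismatch is to guard every access with a lock of its own, so that the charset and the string are read as a pair. The sketch below assumes that all threads touching the instance cooperate by using the same lock; the DecodedBytes interface is a hypothetical stand-in for the three methods named above, not part of this API.

```java
/** Hypothetical caller-side locking so charset and string are read together. */
final class ConsistentReadSketch {

    /** Stand-in for the getters and reset method of the documented class. */
    interface DecodedBytes {
        String getCharset();
        String getString();
        String resetCharset(String charset);
    }

    /** Immutable pair captured while the lock is held. */
    static final class Snapshot {
        final String charset;
        final String text;

        Snapshot(String charset, String text) {
            this.charset = charset;
            this.text = text;
        }
    }

    /** Reads both values under one lock so no reset can slip in between them. */
    static Snapshot read(DecodedBytes data, Object lock) {
        synchronized (lock) {
            return new Snapshot(data.getCharset(), data.getString());
        }
    }

    /** Resets must take the same lock for read() to be meaningful. */
    static String reset(DecodedBytes data, Object lock, String charset) {
        synchronized (lock) {
            return data.resetCharset(charset);
        }
    }
}
```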


Constructor Summary
BsoByteArrayWithCharset(byte[] bytes)
          Construct a BsoByteArrayWithCharset and guess at the proper charset.
BsoByteArrayWithCharset(byte[] bytes, java.lang.String charset)
          Construct a BsoByteArrayWithCharset with a known charset.
 
Method Summary
 byte[] getBytes()
          Return the byte array.
 java.lang.String getCharset()
          Return the charset as it is currently assumed.
 java.lang.String getString()
          Return the string version of the byte array.
 java.lang.String resetCharset(java.lang.String charset)
          Attempt to reconvert the string using a new charset.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BsoByteArrayWithCharset

public BsoByteArrayWithCharset(byte[] bytes)
Construct a BsoByteArrayWithCharset and guess at the proper charset.
Parameters:
bytes - a byte array encoded using an unknown charset

BsoByteArrayWithCharset

public BsoByteArrayWithCharset(byte[] bytes,
                               java.lang.String charset)
Construct a BsoByteArrayWithCharset with a known charset. This charset may be null, in which case the normal guessing is performed.
Parameters:
bytes - a byte array
charset - the charset to use when decoding the byte array into a string. This may be null if the charset is not known.

Method Detail

getBytes

public byte[] getBytes()
Return the byte array.
Returns:
the byte array

resetCharset

public java.lang.String resetCharset(java.lang.String charset)
Attempt to reconvert the string using a new charset. This redoes the decoding that used the previously guessed-at charset, applying the new charset instead. If there was no guessing (i.e., the charset was already known), this throws an exception if charset does not match the known charset. This may be called more than once, but charset must be the same each time.
Parameters:
charset - the new charset decoding to use (cannot be null)
Returns:
the newly decoded string
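
The rules above can be modeled as a small state check. This sketch tracks charset names only and performs no decoding. The class name is hypothetical, and several details are assumptions: the use of IllegalArgumentException (the exception type is not stated here), the case-insensitive name comparison, and the choice to throw when a later call passes a different charset.

```java
/** Hypothetical model of the resetCharset rules; tracks names only, no decoding. */
final class ResetRuleSketch {

    private final String knownCharset; // non-null when supplied at construction
    private String resetTo;            // non-null once a reset has been accepted

    ResetRuleSketch(String knownCharset) {
        this.knownCharset = knownCharset;
    }

    /** Validates a reset request and records it; throws if a rule is broken. */
    synchronized void checkReset(String charset) {
        if (charset == null) {
            throw new IllegalArgumentException("charset cannot be null");
        }
        if (knownCharset != null && !knownCharset.equalsIgnoreCase(charset)) {
            throw new IllegalArgumentException("charset does not match the known charset");
        }
        if (resetTo != null && !resetTo.equalsIgnoreCase(charset)) {
            // Assumption: a differing charset on a later call is rejected.
            throw new IllegalArgumentException("charset differs from the earlier reset");
        }
        resetTo = charset;
    }

    /** Charset currently in effect under these rules; null if still guessed. */
    synchronized String current() {
        return resetTo != null ? resetTo : knownCharset;
    }
}
```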

getCharset

public java.lang.String getCharset()
Return the charset as it is currently assumed. This might return null if the charset was guessed to be the system default charset. If resetCharset has already been called, this will return its charset.

Note that this method is not really safe to use in a multithreaded environment, since the charset returned might not match the one used to decode the value returned by getString if another thread has called resetCharset in between.

Returns:
the charset as it is currently assumed

getString

public java.lang.String getString()
Return the string version of the byte array. The decoding used is the currently assumed charset.
Returns:
a decoded string version of the byte array

Copyright © 2002 - 2005 FileNet Corporation. All rights reserved.