Using intrinsic functions to process UTF-8 encoded data
If it is more convenient to keep your data encoded in UTF-8, use the Unicode intrinsic functions to facilitate testing and processing the UTF-8 data.
You can use the following intrinsic functions:
UVALID
- To verify that the UTF-8 character data is well-formed
USUPPLEMENTARY
- To determine whether a valid UTF-8 character string contains a Unicode supplementary code point; that is, a code point with a Unicode scalar value above U+FFFF, requiring a 4-byte representation in UTF-8. Use this function if the data is to be converted to national and it is important that every character can be represented by a single 16-bit encoding unit.
USUBSTR
- To refer to substrings of the UTF-8 character string, as a convenient alternative to reference modification. USUBSTR expects character position and length arguments, whereas reference modification requires computed byte locations and counts.
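The difference between USUBSTR and reference modification can be sketched as follows. This is a minimal illustration, not a definitive implementation: the program name, the sample string, and the data names sub-by-char and sub-by-byte are assumptions made for this example. The sample string contains six code points ('a', 'b', U+0430 in 2 bytes, U+4E8C in 3 bytes, 'c', 'd'), so code points 3 and 4 occupy bytes 3 through 7.

```cobol
       Identification division.
       Program-id. USUBDEMO.
       Data division.
       Working-storage section.
      * Sample data (an assumption for this sketch): 'a', 'b', U+0430
      * (2 bytes), U+4E8C (3 bytes), 'c', 'd' encoded in UTF-8.
       01 UTF-8-testStr  pic X(9) value x'6162D0B0E4BA8C6364'.
       01 sub-by-char    pic X(5).
       01 sub-by-byte    pic X(5).
       Procedure division.
      * Code points 3 and 4, given as character position and count.
           Move function USUBSTR(UTF-8-testStr 3 2) to sub-by-char
      * The same substring by reference modification: it starts at
      * byte 3 and spans 2 + 3 = 5 bytes, computed by hand here.
           Move UTF-8-testStr(3:5) to sub-by-byte
           If sub-by-char = sub-by-byte
             Display 'Both forms select the same bytes.'
           End-if
           Goback.
```

With USUBSTR, the arguments stay the same if the byte widths of the earlier characters change; the reference-modification form must be recomputed.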
Auxiliary functions can provide additional information about a valid UTF-8 character string:
ULENGTH
- To determine the total number of Unicode code points in the string
UPOS
- To determine the byte position in the string of the nth Unicode code point
UWIDTH
- To determine the width in bytes of the nth Unicode code point in the string
The following code fragment illustrates UTF-8 validity checking, and the use of the auxiliary functions:
 checkUTF-8-validity.
     Compute u = function UVALID(UTF-8-testStr)
     If u not = 0
       Display 'checkUTF-8-validity failure:'
       Display ' The UTF-8 representation is not valid, '
               'starting at byte ' u '.'
       Compute v = function ULENGTH(UTF-8-testStr(1:u - 1))
       Compute u = function UPOS(UTF-8-testStr v)
       Compute w = function UWIDTH(UTF-8-testStr v)
       Display ' The ' v 'th and last valid code point starts '
               'at byte ' u ' for ' w ' bytes.'
     End-if.
In the following string, the sequence that starts with x'F5' is not valid UTF-8 because no byte can have a value in the range x'F5' to x'FF':
 x'6162D0B0E4BA8CF5646364'
The output from checkUTF-8-validity for this string is as follows:
checkUTF-8-validity failure:
 The UTF-8 representation is not valid, starting at byte 08.
 The 04th and last valid code point starts at byte 05 for 03 bytes.
The following code fragment illustrates checking for the presence of a Unicode supplementary code point, requiring a 4-byte representation in UTF-8. ULENGTH counts the code points that precede the supplementary code point, so adding 1 gives the ordinal position of the supplementary code point itself:
 checkUTF-8-supp.
     Compute u = function USUPPLEMENTARY(UTF-8-testStr)
     If u not = 0
       Display ' checkUTF-8-supp hit:'
       Compute v = function ULENGTH(UTF-8-testStr(1:u - 1)) + 1
       Compute w = function UWIDTH(UTF-8-testStr v)
       Display ' The ' v 'th code point of the string'
               ', starting at byte ' u ','
       Display ' is a Unicode supplementary code point, '
               'width ' w ' bytes.'
     End-if.
In the following string, the sequence x'F0908C82' is a supplementary character (as is any valid UTF-8 sequence beginning with a byte in the range x'F0' to x'F4'):
 x'6162D0B0E4BA8CF0908C826364'
The output from checkUTF-8-supp for this string is as follows:
 checkUTF-8-supp hit:
 The 05th code point of the string, starting at byte 08,
 is a Unicode supplementary code point, width 04 bytes.
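The check for supplementary code points is most useful just before a conversion to national. The following sketch shows one way to guard that conversion. The program name, the sample string, the data name natStr, and the messages are assumptions made for this example. NATIONAL-OF is called with CCSID 1208, which identifies UTF-8.

```cobol
       Identification division.
       Program-id. U8TONAT.
       Data division.
       Working-storage section.
      * Sample data (an assumption for this sketch): valid UTF-8 with
      * six code points, none of them supplementary.
       01 UTF-8-testStr  pic X(9) value x'6162D0B0E4BA8C6364'.
      * One national (16-bit) character position per code point.
       01 natStr         pic N(6) usage national.
       Procedure division.
      * USUPPLEMENTARY expects valid UTF-8, so check validity first.
           If function UVALID(UTF-8-testStr) not = 0
             Display 'Not valid UTF-8; not converted.'
           Else
             If function USUPPLEMENTARY(UTF-8-testStr) not = 0
               Display 'Supplementary code point; not converted.'
             Else
      * No code point above U+FFFF: each one converts to exactly
      * one 16-bit encoding unit.
               Move function NATIONAL-OF(UTF-8-testStr 1208)
                 to natStr
               Display 'Converted to national.'
             End-if
           End-if
           Goback.
```

Rejecting the data is only one possible policy. A program that can handle surrogate pairs might instead convert the string and size the national receiving field to allow two encoding units for each supplementary code point.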