Using intrinsic functions to process UTF-8 encoded data

If it is more convenient to keep your data encoded in UTF-8, use the Unicode intrinsic functions to facilitate testing and processing the UTF-8 data.

You can use the following intrinsic functions:
UVALID
To verify that the UTF-8 character data is well-formed
USUPPLEMENTARY
If the data is to be converted to national, and it is important that every character can be represented by a single 16-bit encoding unit, use the USUPPLEMENTARY function to determine whether a valid UTF-8 character string contains a Unicode supplementary code point; that is, a code point with a Unicode scalar value above U+FFFF, requiring a 4-byte representation in UTF-8.
USUBSTR
To refer to substrings of the UTF-8 character string as a convenient alternative to reference modification. USUBSTR takes a character position and a length in characters as arguments, rather than the computed byte locations and byte counts that reference modification requires.
Auxiliary functions can provide additional information about a valid UTF-8 character string:
ULENGTH
To determine the total number of Unicode code points in the string
UPOS
To determine the byte position in the string of the nth Unicode code point
UWIDTH
To determine the width in bytes of the nth Unicode code point in the string
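As a minimal sketch of the difference between USUBSTR and reference modification, the following program extracts the third and fourth characters of a short UTF-8 string both ways. The program name USUBDEMO, the data names bytePos and byteLen, and the sample string are assumptions made for illustration only.

```cobol
       Identification division.
       Program-id. USUBDEMO.
       Data division.
       Working-storage section.
      * 'ab' + U+0430 (2 bytes) + U+4E8C (3 bytes) + 'cd' = 9 bytes
       01 UTF-8-testStr  pic x(9) value x'6162D0B0E4BA8C6364'.
       01 bytePos        pic 99.
       01 byteLen        pic 99.
       Procedure division.
      * Characters 3 and 4, given as a character position and count
           Display function USUBSTR(UTF-8-testStr 3 2)
      * The same substring by reference modification needs a byte
      * offset and a byte count, computed here with UPOS and UWIDTH
           Compute bytePos = function UPOS(UTF-8-testStr 3)
           Compute byteLen = function UWIDTH(UTF-8-testStr 3)
                           + function UWIDTH(UTF-8-testStr 4)
           Display UTF-8-testStr(bytePos:byteLen)
           Goback.
```

For this sample string, both Display statements refer to the same five bytes (bytes 3 through 7), but only the reference-modification form depends on knowing the byte width of each character.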
The following code fragment illustrates UTF-8 validity checking, and the use of the auxiliary functions:
checkUTF-8-validity.
       Compute u = function UVALID(UTF-8-testStr)
       If u not = 0
         Display 'checkUTF-8-validity failure:'
         Display '  The UTF-8 representation is not valid, '
             'starting at byte ' u '.'
         Compute v = function ULENGTH(UTF-8-testStr(1:u - 1))
         Compute u = function UPOS(UTF-8-testStr v)
         Compute w = function UWIDTH(UTF-8-testStr v)
         Display '  The ' v 'th and last valid code point starts '
             'at byte ' u ' for ' w ' bytes.'
       End-if.
In the following string, the sequence that starts with x'F5' is not valid UTF-8, because no byte in well-formed UTF-8 can have a value in the range x'F5' through x'FF':
x'6162D0B0E4BA8CF5646364'
The output from checkUTF-8-validity for this string is as follows:
checkUTF-8-validity failure:
  The UTF-8 representation is not valid, starting at byte 08.
  The 04th and last valid code point starts at byte 05 for 03 bytes.
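The fragment does not show its data declarations. The following sketch shows one way the items might be declared so that the numeric fields display as two digits, as in the output above. The PICTURE clauses, the program name UVALDEMO, and the shortened paragraph body are assumptions, not part of the original example.

```cobol
       Identification division.
       Program-id. UVALDEMO.
       Data division.
       Working-storage section.
      * Sample string from the text; byte 8 (x'F5') is not valid
       01 UTF-8-testStr  pic x(11)
                         value x'6162D0B0E4BA8CF5646364'.
      * Two-digit fields (assumed), so that 8 displays as 08
       01 u              pic 99.
       01 v              pic 99.
       01 w              pic 99.
       Procedure division.
           Perform checkUTF-8-validity
           Goback.
       checkUTF-8-validity.
      * Shortened body; the full paragraph is shown earlier
           Compute u = function UVALID(UTF-8-testStr)
           If u not = 0
             Display 'UTF-8 not valid at byte ' u
           End-if.
```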
The following code fragment illustrates checking for the presence of a Unicode supplementary code point, requiring a 4-byte representation in UTF-8:
checkUTF-8-supp.
       Compute u = function USUPPLEMENTARY(UTF-8-testStr)
       If u not = 0
         Display 'checkUTF-8-supp hit:'
         Compute v = function ULENGTH(UTF-8-testStr(1:u - 1)) + 1
         Compute w = function UWIDTH(UTF-8-testStr v)
         Display '  The ' v 'th code point of the string'
             ', starting at byte ' u ','
         Display '  is a Unicode supplementary code point, '
             'width ' w ' bytes.'
       End-if.
In the following string, the sequence x'F0908C82' encodes a supplementary character (as does any valid UTF-8 sequence that begins with a byte in the range x'F0' to x'F4'):
x'6162D0B0E4BA8CF0908C826364'
Four code points (seven bytes) precede that sequence, so it is the fifth code point of the string. The output from checkUTF-8-supp for this string is as follows:
checkUTF-8-supp hit:
  The 05th code point of the string, starting at byte 08,
  is a Unicode supplementary code point, width 04 bytes.
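When the data is to be converted to national, the check can be used as a guard in front of the conversion. The following is a minimal sketch under stated assumptions: the program name USUPDEMO, the receiving item natStr, its size, and the message texts are hypothetical, and the conversion uses the NATIONAL-OF intrinsic function with CCSID 1208 (UTF-8).

```cobol
       Identification division.
       Program-id. USUPDEMO.
       Data division.
       Working-storage section.
      * Sample string with no supplementary code points
       01 UTF-8-testStr  pic x(9) value x'6162D0B0E4BA8C6364'.
      * Hypothetical receiving item, one 16-bit unit per character
       01 natStr         pic N(9) usage national.
       01 u              pic 99.
       Procedure division.
      * Check validity first, then look for supplementary characters
           If function UVALID(UTF-8-testStr) not = 0
             Display 'Not valid UTF-8; no conversion attempted.'
           Else
             Compute u = function USUPPLEMENTARY(UTF-8-testStr)
             If u = 0
      * Every character fits in a single 16-bit encoding unit
               Move function NATIONAL-OF(UTF-8-testStr 1208)
                 to natStr
             Else
               Display 'Supplementary code point at byte ' u
             End-if
           End-if
           Goback.
```

Checking UVALID before USUPPLEMENTARY follows the description of USUPPLEMENTARY as operating on a valid UTF-8 character string.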

related references  
CODEPAGE