For the system to store and process large attribute data for scoring, the data must be converted to Universal Message Format (UMF) and stored in the appropriate columns.
Use the ATTR_VALUE and ATTR_LARGE_DATA columns to store large or unstructured attribute data for custom attribute and scoring applications.
| Column and UMF tag name | Data type and size | Required | Explanation |
| --- | --- | --- | --- |
| ATTR_VALUE | varchar(255) by default; resizable up to 8 KB. Maximum column size is database dependent. | Yes | Data used as one of the attributes in an ETL process with the base scoring plugins. When the data is larger than 8 KB and in binary format, store the data in the ATTR_LARGE_DATA column and create a unique identifier for that data in the ATTR_VALUE column; that identifier is then used for comparison and scoring. For example, create an MD5 (Message-Digest algorithm 5) one-way hash, which can be compared and displayed in the visualizer and in reports. To store any binary data larger than 255/3 in ATTR_VALUE, the column must be resized; if you resize it, consider re-tuning the database cache, because far fewer rows are likely to fit in the cache. |
| ATTR_LARGE_DATA | CLOB (character large object). Use for data larger than 8 KB. | No | Store the data as character data, for example a Base64 encoding of binary data. Use this column for attribute data that is too large for the ATTR_VALUE column; as a CLOB, it can handle data of effectively unlimited size. The data is available to entity resolution, but its structure must be known to the author of the customized comparison plugin. The visualizer does not display this data, because the format is non-standard and differs between systems. A CLOB does not perform as well as a varchar column: a CLOB cannot be cached and requires a disk read, which is why ATTR_VALUE is preferable. However, if increasing the size of ATTR_VALUE would leave very little attribute data cached, it may be better to use ATTR_LARGE_DATA even for data smaller than 8 KB, so that other non-large attributes, such as gender and date of birth, remain well cached. This decision is left to the architect's discretion; consider consulting your database administrator. When ATTR_LARGE_DATA is used, ATTR_VALUE must still be populated. If a meaningful search key that fits in ATTR_VALUE can be derived from the data, create it and put it in ATTR_VALUE; otherwise, put some other value that is unique to the data in ATTR_VALUE, or the pipeline will not function properly and will likely fail with DQM errors. A unique key can be generated automatically by setting up a DQM rule that creates an MD5 hash of the data (600 rule) or a custom hash based on configured rules (615 rule). This value must be reasonably unique, especially if the attribute type will be set up for persistent searches, because ATTR_VALUE is used in the determination of generic values. Note: The shipped 'binaryAttributeScoring' plugin does not compare ATTR_VALUE at all; it examines and scores only the ATTR_LARGE_DATA segment. |
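The preparation that the table describes can be sketched in a few lines. The sketch below is illustrative only: `build_attribute_fragment` is a hypothetical helper, not a shipped API. It derives the ATTR_VALUE key as an MD5 hash of the raw binary (the kind of key a 600 DQM rule would generate) and Base64-encodes the binary for ATTR_LARGE_DATA.

```python
import base64
import hashlib

def build_attribute_fragment(attr_type: str, raw: bytes) -> str:
    """Hypothetical helper (not a shipped API): build one UMF ATTRIBUTE
    segment from raw binary attribute data."""
    # ATTR_VALUE: MD5 hex digest of the raw binary -- a short, reasonably
    # unique key (32 characters, well under varchar(255)) that the pipeline
    # can compare, score, and display.
    attr_value = hashlib.md5(raw).hexdigest()
    # ATTR_LARGE_DATA: Base64 text encoding of the binary, wrapped in CDATA
    # so the payload cannot break the surrounding XML.
    attr_large_data = base64.b64encode(raw).decode("ascii")
    return (
        "<ATTRIBUTE>"
        f"<ATTR_TYPE>{attr_type}</ATTR_TYPE>"
        f"<ATTR_VALUE>{attr_value}</ATTR_VALUE>"
        f"<ATTR_LARGE_DATA><![CDATA[{attr_large_data}]]></ATTR_LARGE_DATA>"
        "</ATTRIBUTE>"
    )

print(build_attribute_fragment("BIOMETRIC-1", b"example binary payload"))
```

Hashing the raw bytes rather than the Base64 text keeps the ATTR_VALUE key stable regardless of how the payload is later re-encoded.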
A UMF ATTRIBUTE segment that uses both columns looks like this:

```xml
<ATTRIBUTE>
  <ATTR_TYPE>BIOMETRIC-1</ATTR_TYPE>
  <ATTR_VALUE>214b21fc3e040f844a07710b1bb451a0</ATTR_VALUE>
  <ATTR_LARGE_DATA><![CDATA[H4sICBRTqkgAA2Zvby50eHQAK0ktLuHlAgDkTqoPBgAAAA==]]></ATTR_LARGE_DATA>
</ATTRIBUTE>
```

Actual ATTR_LARGE_DATA values are likely to be much larger than this example.
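On the consuming side, a customized comparison plugin must know the payload's structure and decode it itself. The shipped plugins are not written in Python, so the following is only a sketch of the decode-and-verify step, assuming the same convention as above (ATTR_VALUE holds the MD5 hex digest of the raw binary):

```python
import base64
import hashlib

def decode_large_data(attr_value: str, attr_large_data: str) -> bytes:
    """Hypothetical consumer-side step: recover the raw binary from an
    ATTR_LARGE_DATA payload and confirm that it matches the MD5 key
    stored in ATTR_VALUE."""
    raw = base64.b64decode(attr_large_data)
    if hashlib.md5(raw).hexdigest() != attr_value.strip():
        raise ValueError("ATTR_VALUE key does not match ATTR_LARGE_DATA payload")
    return raw
```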