Pipelines are the components that perform name and address hygiene,
standardization, data quality management, and entity resolution. The pipelines
also perform relationship resolution and generate alerts, based on the system
configuration.
Pipelines perform three core processes:
- Recognize, which involves optimizing incoming data by performing data
standardization, hygiene, enhancement, and quality checks
- Resolve, which involves resolving entities
- Relate, which involves detecting relationships and generating alerts
Pipelines are hosted by pipeline nodes.
You can configure pipelines for parallel processing, so that one pipeline
command spawns multiple parallel pipeline processing threads, which enables
the system to concurrently process multiple data requests. This feature can
help improve system performance, reduce data processing time, and mitigate
hardware memory constraints.
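As a rough illustration of the idea, the following Python sketch shows one start call fanning out to several worker threads. It is not the product's implementation: the function names process_request and run_pipeline, and the use of a thread pool, are assumptions made only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_request(record):
    # Hypothetical stand-in for the work a pipeline does on one data request.
    return record.strip().upper()


def run_pipeline(records, concurrency=1):
    # One "start" call spawns `concurrency` worker threads (assumed model).
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() returns results in the same order as the input records.
        return list(pool.map(process_request, records))


results = run_pipeline([" alice ", " bob ", " carol "], concurrency=3)
```

With a concurrency of 1, the same call processes the requests one at a time on a single worker thread.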
The parallel pipeline processing feature is configured in two places:
- The global concurrency setting is controlled by the DEFAULT_CONCURRENCY system
parameter on the System Parameters tab in the Configuration
Console. The value here determines the number of parallel processing threads
started from a pipeline start command. The default value of the
DEFAULT_CONCURRENCY system parameter is 1, so unless you edit this parameter,
only one pipeline processing thread starts.
- A local concurrency setting (by pipeline node) can be configured in the
pipeline configuration file. If you specify a concurrency parameter and value
in the pipeline configuration file by pipeline node, that value overrides
the global system parameter. When you issue a pipeline start command on that
pipeline node, you start the same number of concurrent pipeline processing
threads as specified in the pipeline configuration file.
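The precedence between the two settings can be sketched as follows. This is a minimal Python illustration of the override rule described above, not the product's code: the effective_concurrency function and the dictionary standing in for per-node entries in the pipeline configuration file are hypothetical, and the actual syntax of that file is not shown here.

```python
# Global default of the DEFAULT_CONCURRENCY system parameter, as documented.
DEFAULT_CONCURRENCY = 1


def effective_concurrency(node, local_settings, global_value=DEFAULT_CONCURRENCY):
    # A local value for this pipeline node overrides the global parameter.
    local_value = local_settings.get(node)
    if local_value is not None:
        return int(local_value)
    # Otherwise fall back to the global system parameter.
    return int(global_value)


# Hypothetical per-node concurrency values read from the configuration file.
local_settings = {"node_a": 4}

threads_a = effective_concurrency("node_a", local_settings)
threads_b = effective_concurrency("node_b", local_settings)
```

Here node_a starts four threads because it has a local value, while node_b has no local value and falls back to the global setting of 1.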