Automated data processing is not just about how many machines you can throw at the problem. Poor data quality is a threat to an operation of any size. Integrating a data quality engine into your process is important if you need long-term credibility, so below is a list of high-level requirements that support the problem statements in my last post.
Business Requirements:
- The automated ETL system should always produce trusted and robust outputs, even under conditions of variable file, record and data item quality.
- Since quality failures can occur at many levels of granularity, quality monitoring must operate at the same levels: file, message, record and data item.
- To establish trust, all system inputs and outputs require a minimum level of data quality assessment to ensure that expectations set around data quality are met. These can be characterized by the principle of “Entry and Exit criteria checking” (see the first sketch after this list).
- If the input data quality changes through “drift”, error, or by design, the system should be able to handle these ongoing situations in an agile, adaptable manner.
- Data points failing to meet expectations should be tagged as such, and should be made available for inspection and reporting.
- Enough information should be collected about quality failures to describe, trace, and resolve the source of the issue.
- Where data cleansing rules alter inputs or create data, these changes must be tracked and reported.
- All data moving through the system must be traceable, proving that no data was unknowingly lost or introduced and that the provenance of all outputs is clear.
- All file transfers in and out of the system must be wholly resilient in the face of machine, network and process failure
- Duplicate data files created through error must be mitigated against in all cases.
- Error codes should be directly interpretable, rather than requiring lookup in a dictionary.
- Rules to resolve data quality issues should be applied at the same level as the tagging of data quality problems
- The system should attempt to arrange a data resupply without human intervention when files arrive that fail entry criteria checks.
- Data quality checks of many kinds are needed on individual and composite raw data fields (see the second sketch after this list):
  - uniqueness checks
  - format and data type conformance
  - range constraints
  - dictionary conformance
  - character filters
  - null constraints
  - check digit verification
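
To make the entry criteria idea concrete, here is a minimal sketch of a file-level entry check in Python. The column names, row-count threshold, and the quarantine/resupply helper are hypothetical illustrations, not part of any particular implementation.

```python
import csv
from dataclasses import dataclass, field

@dataclass
class EntryCheckResult:
    passed: bool = True
    failures: list = field(default_factory=list)

def check_entry_criteria(path, expected_columns, min_rows=1):
    """Reject a file before it enters the pipeline if it misses basic expectations."""
    result = EntryCheckResult()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        try:
            header = next(reader)
        except StopIteration:
            result.failures.append("file is empty")
            result.passed = False
            return result
        missing = [c for c in expected_columns if c not in header]
        if missing:
            result.failures.append(f"missing columns: {missing}")
        row_count = sum(1 for _ in reader)
        if row_count < min_rows:
            result.failures.append(f"only {row_count} data rows, expected at least {min_rows}")
    result.passed = not result.failures
    return result

# A failed check would quarantine the file and trigger an automated resupply request:
# result = check_entry_criteria("inbound/customers.csv", ["id", "email", "country"])
# if not result.passed:
#     quarantine_and_request_resupply("inbound/customers.csv", result.failures)  # hypothetical helper
```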
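
And here is a sketch of record-level checks that tag failing data items rather than silently dropping them, covering the check types listed above. The field names, reference dictionary, error codes, and ranges are illustrative assumptions only.

```python
import re

def luhn_ok(value: str) -> bool:
    """Check digit verification using the Luhn algorithm."""
    digits = [int(d) for d in value if d.isdigit()]
    if len(digits) < 2:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

COUNTRY_CODES = {"GB", "IE", "FR", "DE"}              # dictionary conformance
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # format conformance

def check_record(rec: dict, seen_ids: set) -> list:
    """Return (field, error_code, message) tags for one record; an empty list means clean."""
    tags = []
    # null constraint and uniqueness on the key field
    if not rec.get("id"):
        tags.append(("id", "NULL_CONSTRAINT", "id is required"))
    elif rec["id"] in seen_ids:
        tags.append(("id", "UNIQUENESS", "duplicate id"))
    else:
        seen_ids.add(rec["id"])
    # format conformance
    if rec.get("email") and not EMAIL_RE.match(rec["email"]):
        tags.append(("email", "FORMAT", "email does not match expected pattern"))
    # dictionary conformance
    if rec.get("country") and rec["country"] not in COUNTRY_CODES:
        tags.append(("country", "DICTIONARY", "unknown country code"))
    # range constraint
    age = rec.get("age")
    if age is not None:
        try:
            in_range = 0 <= int(age) <= 130
        except (TypeError, ValueError):
            in_range = False
        if not in_range:
            tags.append(("age", "RANGE", "age is non-numeric or outside 0-130"))
    # character filter
    if rec.get("name") and any(ord(c) < 32 for c in rec["name"]):
        tags.append(("name", "CHARACTER_FILTER", "control characters in name"))
    # check digit verification
    if rec.get("card_number") and not luhn_ok(rec["card_number"]):
        tags.append(("card_number", "CHECK_DIGIT", "check digit (Luhn) verification failed"))
    return tags
```

The error codes are spelled out so they can be read directly, in line with the requirement that error codes should not need a lookup dictionary.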
[The index for this series of articles is my data quality page.]