
Generating Profiled Data

If you saw my very simple data profiling rules, you might want to try profiling some of your own data.

For the impatient, here’s a one-liner in awk that will profile your comma-delimited data. I suggest you load the output into Excel, where you can put an autofilter on all your columns to play with the profiled data and start to zero in on rows with suspect quality.

awk -F, '{p = $0;gsub(/[[:lower:]]/,"a",p);gsub(/[[:upper:]]/,"A",p);gsub(/[[:digit:]]/,"9",p); o = $0 FS p; print o}' yourfile.csv
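
To make the output concrete, here is a hypothetical row from yourfile.csv and the line the one-liner produces for it. The profiled copy of the row is appended after the raw values, so every raw column gains a matching profiled column to its right, which is why the column count doubles when you open the result in Excel:

Smith,John,1 (416) 123-5432

becomes

Smith,John,1 (416) 123-5432,Aaaaa,Aaaa,9 (999) 999-9999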

As you can see, it’s not all that hard to transform data into the simple data profiler specification using a little regex. What is hard is running a continuous data profiling process, and thus a continuous data quality operation, within the context of existing ETL systems.

Why is this hard? Because most of the expensive ETL tools people use at work do not support implementing that one line of regex very well, and once it’s implemented, typically as a user-defined function at the field level rather than at the row level, it’s often slow. Worse, this regex just doubled the number of columns under management in your ETL system, and doubling the number of columns roughly doubles the price tag in terms of configuring your ETL system.

So, this is a good time to outline what exactly we need our continuous data profiling flows to do, written up as a list of problem statements and requirements.

That is the topic of my next post.

[The index for this series of articles is my data quality page.]


Continuous Data Quality: Problem Statements

To develop an architecture and solution around continuous data quality, a great way to get started is to catalog the “Problem Statements.” These are descriptions of the business issues you are setting out to address.

From these problems, you can develop

  • more detailed business and system requirements
  • a solution architecture
  • the more granular system features you will need before setting about building the software.

As you read through the list below, you may wonder if these are representative of what people in the real world need. I have put these together based on years of experience, but it’s always good to validate.

Later, as a separate exercise, I will map the final solution against the features Gartner uses to assess the completeness of a data quality software offering, and discuss the overlaps and omissions.

Problem Statements:

Below are high-level problem statements describing the issues that arise from the lack of a continuous data quality function working effectively within the context of an ETL system.

From broad to specific:

  1. The automated data processing system does not produce 100% trusted outputs due to data quality issues. People are concerned about downstream impacts, and confidence in the system suffers.
    1. ETL systems generally assume that their data inputs are of high quality and that attributes will be stable over time. Both assumptions are largely incorrect.
      1. In many cases outliers are known in advance that invalidate the general ETL rules, and a facility is needed to pre-process these selectively (cleansing or rejecting them) prior to general processing.
      2. Ideally, arriving data values should be monitored over time so that “drift” in the content of the data can be handled as new outliers appear.
    2. The practice of silently rejecting and discarding poor quality data fails to take proper account of the value this data has to the data supplier in guiding them on how to improve their data and processes.
    3. The error codes reported when data quality problems are found are not intuitive or instructive to the casual user or administrator of the system. Often when codes are finally found, researched and then understood, the investigator discovers they are not granular enough to resolve the problem without inspecting the raw data items directly.
    4. When investigating data-item-level errors, the quality system doesn’t generate the information needed to quickly create and deploy a good data cleanse rule. The information generated about errors needs to better characterise and generalise the problems so they are more manageable.
    5. Existing strategies around data cleansing do not generally provide the metrics needed to account for the value added in this step, or to determine whether a cleanse rule was cost effective, i.e. whether the effort of creating the rule was offset by the cost savings produced by circumventing the quality problems. The return on investment (ROI) of the data cleanse effort is therefore often largely unknown, and thus largely undervalued. Clear evidence of it would be very useful in encouraging the development of effective cleansing strategies.
  2. File-based data failures have resulted in system downtime. These are expensive and embarrassing. While hardware resiliency approaches are largely addressed, failures due to data problems (both content and configuration) are not. Several scenarios are outlined below; a minimal sketch of a file-level entry check that guards against them follows this list.
    1. The presentation of a duplicated input file (possibly uniquely named) for bulk loading into most database and data warehouse loading programs has one of two effects, both resulting in outages:
      1. Either the file is bulk loaded and no errors are immediately reported, despite the duplication,
      2. Or the load process halts catastrophically as unexpected primary key violations throw errors that general testing never exercised for this unusual case.
      3. Both of the above scenarios are bad for business:
        1. The first case implies downstream reporting systems are delivering wrong counts or wrong processing for the period prior to discovery. This typically results in a system restore from backup, or bespoke delete work, being needed.
        2. The second case means an incomplete load has possibly corrupted the database, or has halted all data loading until the blocking data is resolved. Again, a system restore from backup, or bespoke delete work, is typically needed.
    2. The creation of an erroneously null file (a file with correct structure but containing no records) can affect all file types, including XML files. Where the process is to receive this data and derive a “delta”, i.e. a set of updates, the business impact can be very high: the comparison process will invariably determine that every record held in the current system needs deleting.
      1. The impact of this type of error cannot be overstated. While deleting all of your data is bad enough, the resolution is often to obtain a resupply of the data for reloading, and that reload can generate transactions on the destination system at a volume far higher than normally expected. The impact of this can be catastrophic on transactional downstream systems. I have often seen this “foul weather” scenario knock over ESB and messaging-based systems, or worse, grind them to a halt as they struggle with volumes that create backlogs taking days to clear.
    3. A problem similar to null files is file truncation, another hard-to-spot source of data loss, with consequences just as difficult to resolve.
      1. File transfers of flat delimited data are a common way to move large volumes of structured data, but file truncations and corruptions are a recurring problem. One source of this problem is that file generation and file transfer processes need to be assessed together from a continuous quality perspective rather than separately: a failed data generation process creates a truncated file, an independent monitoring process picks the partial file up, considering it completely written, and the file transfer process then distributes the truncated file.
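
As mentioned above, here is a minimal shell sketch of the kind of file-level entry check these scenarios call for. It is an illustration of the idea rather than part of any ETL product: the script name, the checksum history file and the minimum record count are assumptions chosen for the example.

#!/bin/sh
# Sketch of a pre-load entry check: reject duplicate, empty or truncated
# delimited files before they reach the bulk loader.
# Assumed usage: ./entry_check.sh extract.csv
FILE="$1"
HISTORY="loaded_checksums.txt"   # checksums of files already loaded (assumed)
MIN_RECORDS=1000                 # assumed lower bound for a plausible delivery

# Duplicate content arriving under a new name: compare checksums, not names.
SUM=$(md5sum "$FILE" | cut -d' ' -f1)
if grep -q "$SUM" "$HISTORY" 2>/dev/null; then
  echo "REJECT: $FILE matches a previously loaded file" >&2; exit 1
fi

# Erroneously null file: correct structure, but no records at all.
RECORDS=$(wc -l < "$FILE")
if [ "$RECORDS" -eq 0 ]; then
  echo "REJECT: $FILE contains no records" >&2; exit 1
fi

# Possible truncation: far fewer records than expected, or the final line
# lost its terminating newline because the writer was cut off mid-file.
if [ "$RECORDS" -lt "$MIN_RECORDS" ] || [ -n "$(tail -c 1 "$FILE")" ]; then
  echo "REJECT: $FILE looks truncated ($RECORDS records)" >&2; exit 1
fi

echo "$SUM" >> "$HISTORY"
echo "OK: $FILE passed entry criteria checks"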

These are challenging problems, and I am sure there are more. If you have one to add please do so using the comments.

[The index for this series of articles is my data quality page.]


How to Profile Data

There is a little secret to data profiling that is ages old but that no one seems to get taught anywhere. While there are lots of overcomplicated alternatives, let me explain a dead simple but extremely effective profiling rule:

  • Swap all capital letters in your data for the letter “A”
  • Swap all lower case letters for “a”
  • Swap all numbers for “9”
  • Leave all other characters in the raw data unchanged

This simple transformation will convert a list of phone numbers as follows:

  • “1 (416) 123-5432” becomes “9 (999) 999-9999”
  • “+44(0)7970 123123” becomes “+99(9)999 999999”
  • “+33 2 31 11 22 33” becomes “+99 9 99 99 99 99”
  • “1-416-123-9876 ext. 123” becomes “9-999-999-9999 aaa. 999”

You can see that in different countries (Canada, the UK and France in these examples) the presentation of a phone number is quite different.

If you summarize the millions of phone number records you have, you could expect to find a thousand or so phone number formats in your data, some with very high frequencies, and others that are outliers – meaning they have very low frequency counts.

The last phone number example shows what most likely will be seen as a common outlier in your customer data: the use of an extension number in the telephone data.
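
As a rough illustration of how such a summary can be produced, the substitution rules above, applied in awk and piped through sort and uniq, give a format frequency count. The column number (3) and the file name customers.csv are assumptions for the example, not a reference to any particular system:

awk -F, '{p = $3; gsub(/[[:lower:]]/,"a",p); gsub(/[[:upper:]]/,"A",p); gsub(/[[:digit:]]/,"9",p); print p}' customers.csv | sort | uniq -c | sort -rn | head -n 250

The most common formats rise to the top of the output and the outliers sink to the bottom, ready to be inspected.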

How about we look at some results for a less structured field of data? Something messy?

Below I present a data profile frequency report for the Given_Name field in a test system, where we can demonstrate that the data quality controls are poor and need correction before the system goes live. First comes the report, then, much further down, the interpretation:

Top 250 Format Frequency Analysis of Column Given_Name

=======================================================

max string length:  25
raw cardinality:  9057
format cardinality:  146

count format
===== ======

3300 aaaa_999_9
1258 aaaaaa9aaaaa99999
810 aaaa_99_9
757 Aaaaaa
724 Aaaa
720 aaaaa_99_9
651 Aaaaa
568 Aaaaaaa
561 AAAA
447 Aaaaaaaa
382 aaaaaa9aaaaa9999
270 Aaaaaaaaa
138 Aaa
120 AA_AAA
120 AAAA_AAA_AAA
98 aaaaaaa
93 aaaaaaaaaaa99999
92 AA
90 aaaaaaaaaaa9999
89 aaaaaaaa
89 aaaa_9_9
86 AAaaa9
83 AAA
80 aaaaa_9_9
80 Aaaaaaaaaa
77 AAAAaaa9
72 Aaaa Aaaaaaaaaaa 9
72 Aaaa Aaaaaa 9
69 aaaaaa
69 AaaaaAA99
67 AaaaAaaa
60
58 aaaaaaaaa
58 aaaaa
56 aa9
52 AaaaAaa
51 aaaa
45 aaaaaa9aaaaa999
44 aa99
44 AAaaa99
44 AAAA99
42 aaaaaa9aaaaaaa999
40 AAAA-9999
40 AAA999
36 aaaaaa9aaaaaaa999 aaaaaa
36 Aaaa99
36 AAAAAAA-99
27 aaaaaa9aaaaaaaa999
23 Aaaa9
20 Aaaaaa99
19 Aaaaaaaaaaa
19 AAA99
18 aaaa99 aaaaa
18 AaaaaAA9
16 aaaaaaaaaa
16 aaaaaa99 aaaaa
16 AaaaaAaaa99
16 AAAAAAA
14 aaaaa99 aaaaa
14 aaa
14 Aaaaaaaaaaaa
13 aaaaaaaaaaa
12 aaaa_aaaaaaaa_aaaaa
11 Aaaaaaaaaaaaaa
11 AAAAAAAA99
10 aaa99
9 aaaaaaaaaaa999
9 aaaa99
9 AA9
8 aaaaAa
8 Aaaa Aaaaaaaaaaa 99
8 Aaaa Aaaaaa 99
8 AAAAaaa99
8 AAAAAAAA
7 Aaaaaaaaaaaaaaaa
7 Aaaaaaaaaaaaa
7 Aaaaa9
7 AAA9
6 aa.aaaaaaaa99
6 AAAAA
5 AaaaaaAaaa
5 Aaaaaa9
5 AAAAAAA9
4 aaaaaa9aaaaaaa9999 aaaaaa
4 aaaaaa9aaaaaaa9999
4 aaaa_aaaaaaaa_aaa
4 aa-aaaa
4 Aaaaaaaa.Aaaaaaaaaaaaaaaa
4 Aaaa Aaaaaaaa
4 Aaaa Aaa
4 AAAAAAAAAAAAA
4 AAAAAAAAA
4 AAAAAAA99
4 AA 9.9 AA Aaaa9 AA
4 A
3 aaaaaaaaaaaa99
3 aaaaaa9aaaaaaaa9999
3 aaaaaAaaaa99_99
3 aaaaaAaaaa99_9
3 aaaa_aaaa_aaaaa
3 Aaaa-Aaaaaaaa
3 Aaaa-Aaaaa
3 Aa
3 AAAAAA
3 AA99
3 AA 9.9 AA Aaaa99 AA
2 aaaa_aaaaaaaa_aaaaaa
2 AaaaaAaaaA
2 Aaaaa999
2 Aaaaa99
2 AaaaAaaaaa
2 AaaaAaaaa
2 Aaaa-aaaa
2 Aaaa-Aaaa
2 Aaaa-Aaa
2 Aaaa Aaaaa
2 AAAAAAAAAA
2 AAAAAAAAA99
2 AAAAAAAAA9
2 AAAAAAAA9
1 aaaaa-aaaaaaa
1 aaaa_aaaaa_aaa
1 aaaa_aaaa_aaa
1 aaaa-aaaaaaaa
1 aaaa-9-9
1 AaaaaaaaaAaaaAaa
1 Aaaaaaaaa,
1 AaaaaaaAaaaa
1 AaaaaaaA
1 Aaaaaaa-aaaa
1 Aaaaaaa-Aaaa
1 Aaaaaaa A
1 AaaaaaAaaaaaaa
1 Aaaaaa Aaaa
1 Aaaaa.
1 AaaaAa
1 Aaaa-Aaaaaaaaaa
1 Aaaa-Aaaaaaa
1 Aaaa Aaaaaa
1 Aaaa Aaa Aaaaaaaaa
1 Aaa9
1 Aaa-aaaaa
1 Aa-aaa
1 AAAA9
1 AA AAaA Aaaaaaa9
1 AA 9.9.999 AA Aaaa9 AA

Interpretation:

1) Please realize you just eyeballed the data quality profile for almost 10,000 items of raw data in a field called Given_Name. The report has summarized this into just 146 formats. That’s a great compression ratio for an unstructured field of data.

2) Do you know anyone with a number, a comma, or an underscore in their given name? I don’t. It’s clear we’re seeing a lot of test users in this system, with names like tester01 and tester02. Are there no data quality controls on the collection of this data? Just looking down this list lowers my expectations for how this data can be consumed. Interestingly, I’ve heard people expected to use this field to personalize email campaigns. Sounds like it’s time to reset some expectations with the marketing team.

3) Why are there 60 null values in this data? Maybe we should advise someone that given name should be a mandatory field of data?

4) There are 92 names with the format AA, and another 3 with Aa. This is a little tricky. I do have a friend called Ng, and another called AJ. I’m also friends with a PJ and a JP. Here we need to examine the specific cases to build up either a white list of acceptable two-character names or, conversely, a black list of strings we won’t accept, like XX.

5) There’s an example of “Aaaaaaaaa,” where a trailing comma is in the data. That seems easy to do if you are copying and pasting the data from somewhere else… but don’t you generally see “surname, givenname” rather than “givenname, surname”? We should have a look at the raw data in these cases to see whether that really is a given name in the field, and not a surname.

I think you’re getting the idea.

In the next in this series of articles I’ll explain exactly how you can generate these format reports.

[The index for this series of articles is my data quality page.]


Continuous Data Quality: Business Requirements

Automated data processing is not all about how many machines you can throw at the problem. Data quality is a threat to an operation of any size. Integrating a data quality engine into your process is important if you need long-term credibility, so below is a list of high-level requirements that support the problem statements in my last post.

Business Requirements:

  1. The automated ETL system should always produce trusted and robust outputs, even under conditions of variable file, record and data item quality.
  2. As quality failures can be detected at many levels of granularity, so too must the quality monitoring operate at the file, message, record and data item levels.
  3. To establish trust, all system inputs and outputs require a minimum level of data quality assessment to ensure that expectations set around data quality are met. These can be characterized by the principle of “Entry and Exit criteria checking.”
  4. If the input data quality changes through “drift”, error, or by design, the system should be able to handle these ongoing situations in an agile, adaptable manner.
  5. Data points failing to meet expectations should be tagged as such, and should be made available for inspection and reporting.
  6. Enough information should be collected about quality failures to describe, trace, and resolve the source of the issue.
  7. Where data cleanse rules alter inputs or create data, these must be tracked and reported
  8. All data moving through the system must be traceable, proving that no data was unknowingly lost or introduced and that the provenance of all outputs is clear.
  9. All file transfers in and out of the system must be wholly resilient in the face of machine, network and process failure
  10. Duplicate data files created through error must be mitigated against in all cases.
  11. Error codes should be directly interpretable, rather than requiring lookup in a dictionary.
  12. Rules to resolve data quality issues should be applied at the same level as the tagging of data quality problems
  13. The system should attempt to arrange a data resupply, unaided by humans, when files arrive that fail entry criteria checks.
  14. Data quality checks of many kinds are needed on individual and composite raw data fields (a minimal sketch of a few such checks follows this list):
    1. uniqueness checks
    2. format and data type conformance
    3. range constraints
    4. dictionary conformance
    5. character filters
    6. null constraints
    7. check digit verification
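
To make the last requirement concrete, here is a minimal awk sketch of three of those checks applied to a hypothetical comma-delimited feed. The column positions, the expected code format and the country code dictionary are illustrative assumptions, and failing rows are tagged rather than silently discarded, in line with the requirements above.

awk -F, '{
  tags = ""
  # null constraint: column 2 (assumed to hold the given name) must not be empty
  if ($2 == "") tags = tags "NULL_GIVEN_NAME "
  # format conformance: column 3 assumed to be a code shaped like AAA999
  if ($3 !~ /^[A-Z][A-Z][A-Z][0-9][0-9][0-9]$/) tags = tags "BAD_CODE_FORMAT "
  # dictionary conformance: column 4 assumed to be a known country code
  if ($4 != "CA" && $4 != "GB" && $4 != "FR") tags = tags "UNKNOWN_COUNTRY "
  sub(/ $/, "", tags)
  # emit the raw row plus its quality tags so failures stay inspectable
  print $0 FS (tags == "" ? "OK" : tags)
}' feed.csv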

[The index for this series of articles is my data quality page.]
