How to Profile Data

There is a little secret to data profiling that is ages old but that no one seems to get taught anywhere. While there are lots of overcomplicated alternatives, let me explain a dead simple but extremely effective profiling rule:

Swap all capital letters in your data for the letter “A”
Swap all lower case letters for “a”
Swap all numbers for “9”
Leave all other characters in the raw data unchanged

This simple transformation will convert a list of phone numbers as follows:

“1 (416) 123-5432” becomes “9 (999) 999-9999”
“+44(0)7970 123123” becomes “+99(9)999 999999”
“+33 2 31 11 22 33” becomes “+99 9 99 99 99 99”
“1-416-123-9876 ext. 123” becomes “9-999-999-9999 aaa. 999”

You can see that in different countries, Canada, the UK and France, the presentation of a phone number is quite different.

If you summarize the millions of phone number records you have, you could expect to find a thousand or so phone number formats in your data, some with very high frequencies, and others that are outliers – meaning they have very low frequency counts.

The last phone number example shows what most likely will be seen as a common outlier in your customer data: the use of an extension number in the telephone data.

What about we look at some results for a less structured field of data? Something messy?

Below I present a data profile frequency report for the Given_Name field in a test system, where we can demonstrate the data quality controls are poor and need correction prior to the system going live. First is the report, then far down below the interpretation:

Top 250 Format Frequency Analysis of Column Given_Name

=======================================================

max string length: 25
raw cardinality: 9057
format cardinality: 146

count format
===== ======

3300 aaaa_999_9
1258 aaaaaa9aaaaa99999
810 aaaa_99_9
757 Aaaaaa
724 Aaaa
720 aaaaa_99_9
651 Aaaaa
568 Aaaaaaa
561 AAAA
447 Aaaaaaaa
382 aaaaaa9aaaaa9999
270 Aaaaaaaaa
138 Aaa
120 AA_AAA
120 AAAA_AAA_AAA
98 aaaaaaa
93 aaaaaaaaaaa99999
92 AA
90 aaaaaaaaaaa9999
89 aaaaaaaa
89 aaaa_9_9
86 AAaaa9
83 AAA
80 aaaaa_9_9
80 Aaaaaaaaaa
77 AAAAaaa9
72 Aaaa Aaaaaaaaaaa 9
72 Aaaa Aaaaaa 9
69 aaaaaa
69 AaaaaAA99
67 AaaaAaaa
60
58 aaaaaaaaa
58 aaaaa
56 aa9
52 AaaaAaa
51 aaaa
45 aaaaaa9aaaaa999
44 aa99
44 AAaaa99
44 AAAA99
42 aaaaaa9aaaaaaa999
40 AAAA-9999
40 AAA999
36 aaaaaa9aaaaaaa999 aaaaaa
36 Aaaa99
36 AAAAAAA-99
27 aaaaaa9aaaaaaaa999
23 Aaaa9
20 Aaaaaa99
19 Aaaaaaaaaaa
19 AAA99
18 aaaa99 aaaaa
18 AaaaaAA9
16 aaaaaaaaaa
16 aaaaaa99 aaaaa
16 AaaaaAaaa99
16 AAAAAAA
14 aaaaa99 aaaaa
14 aaa
14 Aaaaaaaaaaaa
13 aaaaaaaaaaa
12 aaaa_aaaaaaaa_aaaaa
11 Aaaaaaaaaaaaaa
11 AAAAAAAA99
10 aaa99
9 aaaaaaaaaaa999
9 aaaa99
9 AA9
8 aaaaAa
8 Aaaa Aaaaaaaaaaa 99
8 Aaaa Aaaaaa 99
8 AAAAaaa99
8 AAAAAAAA
7 Aaaaaaaaaaaaaaaa
7 Aaaaaaaaaaaaa
7 Aaaaa9
7 AAA9
6 aa.aaaaaaaa99
6 AAAAA
5 AaaaaaAaaa
5 Aaaaaa9
5 AAAAAAA9
4 aaaaaa9aaaaaaa9999 aaaaaa
4 aaaaaa9aaaaaaa9999
4 aaaa_aaaaaaaa_aaa
4 aa-aaaa
4 Aaaaaaaa.Aaaaaaaaaaaaaaaa
4 Aaaa Aaaaaaaa
4 Aaaa Aaa
4 AAAAAAAAAAAAA
4 AAAAAAAAA
4 AAAAAAA99
4 AA 9.9 AA Aaaa9 AA
4 A
3 aaaaaaaaaaaa99
3 aaaaaa9aaaaaaaa9999
3 aaaaaAaaaa99_99
3 aaaaaAaaaa99_9
3 aaaa_aaaa_aaaaa
3 Aaaa-Aaaaaaaa
3 Aaaa-Aaaaa
3 Aa
3 AAAAAA
3 AA99
3 AA 9.9 AA Aaaa99 AA
2 aaaa_aaaaaaaa_aaaaaa
2 AaaaaAaaaA
2 Aaaaa999
2 Aaaaa99
2 AaaaAaaaaa
2 AaaaAaaaa
2 Aaaa-aaaa
2 Aaaa-Aaaa
2 Aaaa-Aaa
2 Aaaa Aaaaa
2 AAAAAAAAAA
2 AAAAAAAAA99
2 AAAAAAAAA9
2 AAAAAAAA9
1 aaaaa-aaaaaaa
1 aaaa_aaaaa_aaa
1 aaaa_aaaa_aaa
1 aaaa-aaaaaaaa
1 aaaa-9-9
1 AaaaaaaaaAaaaAaa
1 Aaaaaaaaa,
1 AaaaaaaAaaaa
1 AaaaaaaA
1 Aaaaaaa-aaaa
1 Aaaaaaa-Aaaa
1 Aaaaaaa A
1 AaaaaaAaaaaaaa
1 Aaaaaa Aaaa
1 Aaaaa.
1 AaaaAa
1 Aaaa-Aaaaaaaaaa
1 Aaaa-Aaaaaaa
1 Aaaa Aaaaaa
1 Aaaa Aaa Aaaaaaaaa
1 Aaa9
1 Aaa-aaaaa
1 Aa-aaa
1 AAAA9
1 AA AAaA Aaaaaaa9
1 AA 9.9.999 AA Aaaa9 AA

Interpretation:

1) please realize you just eyeballed the data quality profile for almost 10,000 items of raw data in a field called given_name. The report has summarized this into just 146 formats. That’s a great compression ratio for an unstructured field of data.

2) Do you know anyone with a number, a comma, or an underscore in their given name? I don’t. It’s clear we’re seeing a lot of test users in this system with names like tester01 tester02. Are there no data quality controls on the collection of this data? Just looking down this list lowers my expectations for the how this data can be consumed. Interestingly I’ve heard people had expectations they would use this field to personalize email campaigns. Sounds like it’s time to reset some expectations with the marketing team.

3) Why are there 60 null values in this data? Maybe we should advise someone that given name should be a mandatory field of data?

4) There are 92 names with the format AA, and another 3 with Aa. This is a little tricky. I do have a friend called Ng, and another called AJ. I’m also friends with a PJ and a JP. Here we need to examine the specific cases to build up either a white list of acceptable two-character names, or conversely, a black list of strings we won’t accept like, XX.

5) There’s an example of “Aaaaaaaaa,” where a trailing comma is in the data. Seems easy to do if you are copying and pasting the data from somewhere else… but don’t generally you see “surname, givenname” rather than “givenname, surname”? We should have a look at the raw data in these cases to see whether that really is a given name in the field, and not a surname.

I think you’re getting the idea.

In the next in this series of articles I’ll explain exactly how you can generate these format reports.

[The index for this series of articles is my data quality page.]

bytefreq

Data. Information. Technology. Architecture.