
Generating Profiled Data

If you saw my very simple data profiling rules, you might want to try profiling some of your own data.

For the impatient, here’s a one-liner in awk that will profile your comma-delimited data. I suggest you load the output into Excel, where you can put an autofilter on all your columns to play with the profiled data and start to zero in on rows with suspect quality.

awk -F, '{p = $0;gsub(/[[:lower:]]/,"a",p);gsub(/[[:upper:]]/,"A",p);gsub(/[[:digit:]]/,"9",p); o = $0 FS p; print o}' yourfile.csv
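To illustrate with a made-up sample row, a line like

John Smith,42,SW1A 1AA

comes out as

John Smith,42,SW1A 1AA,Aaaa Aaaaa,99,AA9A 9AA

with the original columns on the left and their masked versions (lowercase letters as “a”, uppercase as “A”, digits as “9”) appended on the right.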

As you can see, it’s not all that hard to transform data into the simple data profiler specification using a little regex. What is hard is running a continuous data profiling process, and thus a continuous data quality operation, within the context of existing ETL systems.

Why is this hard? Because most of the expensive ETL tools people use at work do not support implementing that one line of regex very well, and when they do, it is typically as a user-defined function at the field level rather than at the row level, which is often slow. Worse, this regex just doubled the number of columns under management in your ETL system, and doubling the number of columns means doubling the price tag in terms of configuring your ETL system.
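To make the field-level point concrete, here is roughly what that per-field masking logic looks like when pulled out into its own function. This is only a sketch in awk (the profile() function name and the per-field loop are mine, not any ETL vendor’s UDF syntax): a real ETL tool would call something like profile() once per column value, which is exactly where the slowness and the extra columns come from.

awk -F, -v OFS=, '
# Sketch only: mask a single field value, the way a field-level UDF would.
function profile(v) {
    gsub(/[[:lower:]]/, "a", v)
    gsub(/[[:upper:]]/, "A", v)
    gsub(/[[:digit:]]/, "9", v)
    return v
}
{
    # Append one profiled column per original column.
    n = NF
    for (i = 1; i <= n; i++) $(n + i) = profile($i)
    print
}' yourfile.csv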

So, this is a good time to outline what exactly we need our continuous data profiling flows to do, written up as a list of problem statements and requirements.

That is the topic of my next post.

[The index for this series of articles is my data quality page.]
