
Generating Profiled Data

If you saw my very simple data profiling rules then you might want to try profiling some of your own data.

For the impatient, here’s a one-liner in awk that will profile your comma-delimited data. I suggest you load the output into Excel, where you can put an AutoFilter on all your columns to play with the profiled data and start to zone in on rows of suspect quality.

awk -F, '{p = $0;gsub(/[[:lower:]]/,"a",p);gsub(/[[:upper:]]/,"A",p);gsub(/[[:digit:]]/,"9",p); o = $0 FS p; print o}' yourfile.csv
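For illustration, with a made-up input row like

Smith,42,NY10001

the output line becomes

Smith,42,NY10001,Aaaaa,99,AA99999

i.e. the original columns followed by their profiled equivalents, with lower case letters collapsed to “a”, upper case to “A” and digits to “9”.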

As you can see, it’s not all that hard to transform data into the simple data profiler specification using a little regex. What is hard is running a continuous data profiling process, and thus a continuous data quality operation, within the context of existing ETL systems.

Why is this hard? Because most of the expensive ETL tools people use at work do not support implementing that one line of regex very well, and once it is executed, typically as a user defined function at the field level rather than at the row level, it’s often slow. Worse, this regex just doubled the number of columns under management in your ETL system. Doubling the number of columns = doubling the price tag in terms of configuring your ETL system.
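If it helps to see the same transform outside awk, here is a minimal Python sketch written as a single row-level function, the sort of thing you would want your ETL tool to call once per record rather than once per field. The script name (profile_rows.py) and the ASCII-only character classes are my own simplifications of the one-liner above, not a prescription.

import csv
import re
import sys

def profile_value(value):
    # Collapse characters into the simple profile alphabet:
    # lower case -> "a", upper case -> "A", digits -> "9".
    value = re.sub(r"[a-z]", "a", value)
    value = re.sub(r"[A-Z]", "A", value)
    value = re.sub(r"[0-9]", "9", value)
    return value

def profile_row(row):
    # Original columns followed by their profiled equivalents,
    # which doubles the column count, just like the awk version.
    return row + [profile_value(field) for field in row]

if __name__ == "__main__":
    writer = csv.writer(sys.stdout)
    for row in csv.reader(sys.stdin):
        writer.writerow(profile_row(row))

You could run it as python profile_rows.py < yourfile.csv > profiled.csv and load the result into Excel as before.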

So, this is a good time to outline what exactly we need our continuous data profiling flows to do, written up as a list of problem statements and requirements.

That is the topic of my next post.

[The index for this series of articles is my data quality page.]


Continuous Data Quality: Problem Statements

To develop an architecture and solution around continuous data quality, a great way to get started is to catalog the “Problem Statements.” These are descriptions of the business issues you are setting out to address.

From these problems, you can develop

  • more detailed business and system requirements
  • a solution architecture
  • the more granular system features you will need before setting about building software.

As you read through the list below, you may wonder if these are representative of what people in the real world need. I have put these together based on years of experience, but it’s always good to validate.

Later, as a separate exercise I will map the final solution against the features used by Gartner in assessing the completeness of a Data Quality Software offering, and discuss the overlaps and omissions.

Problem Statements:

Below are high level problem statements describing the issues that arise from the lack of a continuous data quality function that works effectively within the context of an ETL system.

From broad to specific:

  1. The automated data processing system does not produce 100% trusted outputs due to data quality issues. People are concerned about downstream impacts, and confidence in the system suffers.
    1. ETL systems generally assume that their data inputs are of high quality and that attributes will be stable over time. These assumptions are largely incorrect.
      1. In many cases pre-known outliers exist that invalidate the general ETL rules, and a facility is needed that allows for these to be pre-processed (cleansed or rejected) selectively prior to general processing.
      2. Ideally the data values arriving need monitoring over time so the “drift” in the content of the data can be handled as new outliers are found.
    2. The practice of silently rejecting and discarding poor quality data fails to take proper account of the value this data has to the data supplier in guiding them on how to improve their data and processes.
    3. The error codes reported when data quality problems are found are not intuitive or instructive to the casual user or administrator of the system. Often when codes are finally found, researched and then understood, the investigator discovers they are not granular enough to resolve the problem without inspecting the raw data items directly.
    4. When investigating data item level errors, the quality system doesn’t generate the information needed to quickly create and deploy a good data cleanse rule. The information generated about errors needs to better characterise and generalise the problems so they are more manageable.
    5. Existing strategies around data cleansing do not generally provide the metrics needed to account for the value added in this step, or to determine whether a cleanse rule was cost effective, i.e. whether the effort of creating the rule was offset by the cost savings produced by circumventing the quality problems. The return on investment (ROI) of the data cleanse effort, versus the cost savings or value added, is often largely unknown, and thus largely undervalued. Clear evidence of this ROI would be very useful in encouraging the development of effective cleansing strategies.
  2. File-based data failures have resulted in system downtime. These are expensive and embarrassing. While hardware resiliency approaches are largely addressed, failures due to data problems (both content and configuration) are not. Several scenarios are outlined below, and a sketch of a pre-flight file check that guards against them follows this list.
    1. The presentation of a duplicated input file (possibly uniquely named) for bulk loading into most database and data warehouse loading programs will have one of two effects, both resulting in outages:
      1. Either it is bulk loaded and no errors are immediately reported, despite the duplication,
      2. Or the load process halts catastrophically as unexpected primary key violations throw errors that were never encountered in testing, because this unusual case is rarely tested.
      3. Both the above scenarios are bad for business:
        1. The first case implies downstream reporting systems are delivering wrong counts or wrong processing for the period prior to discovery. This typically results in a system restore from backup, or bespoke delete work, being needed.
        2. The second case means an incomplete load has possibly corrupted the database, or has halted all data loading until the blocking data is resolved. Again, this typically results in a system restore from backup, or bespoke delete work, being needed.
    2. The creation of an erroneously null file (a file with the correct structure but containing no records) can affect all file types, including XML files. Where the process is to receive this data and derive a “delta”, i.e. a set of updates, the business impact can be very high. The comparison process will invariably determine that every record held in the current system needs deleting.
      1. The impact of this type of error cannot be overstated. While deleting all of your data is bad enough, the resolution is often to obtain a resupply of the data for reloading. Unfortunately this reload can generate transactions on the destination system at a volume far higher than normally expected, which can be catastrophic for transactional downstream systems. I have often seen this “foul weather” scenario knock over ESB and messaging based systems, or worse, grind them to a halt as they struggle with volumes that create backlogs taking days to clear.
    3. Another problem, similar to null files, is file truncation: a hard-to-spot source of data loss with consequences that are difficult to resolve, much like those associated with null files.
      1. File transfers of flat delimited data are a common way to move large volumes of structured data, but file truncations and corruptions are a problem. A root cause is that file generation and file transfer processes are assessed separately rather than together from a continuous quality perspective: a failed data generation process creates truncated files, independent monitoring processes then pick these partial files up, considering them completely written, and the file transfer processes distribute the truncated files onward.
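To make the kind of guard these scenarios call for more concrete, below is a minimal Python sketch of a pre-flight file check, run before any load or onward transfer: it flags exact duplicates by content hash, erroneously null files by record count, and suspected truncations by comparing the record count with a declared expectation. The function names, the header-line assumption and the ten percent tolerance are all illustrative choices of mine, not a prescription.

import hashlib
from pathlib import Path

# In practice the set of previously loaded file hashes would be persisted,
# e.g. in a control table; a module-level set keeps the sketch simple.
SEEN_HASHES = set()

def sha256_of(path):
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def preflight_check(path, expected_records=None, tolerance=0.1):
    """Return a list of reasons to quarantine the file; an empty list means OK to load."""
    problems = []

    # 1. Duplicate content presented under a (possibly new) file name.
    fingerprint = sha256_of(path)
    if fingerprint in SEEN_HASHES:
        problems.append("duplicate content: an identical file has already been presented")
    SEEN_HASHES.add(fingerprint)

    # 2. Erroneously null file: correct structure but no data records
    #    (assuming one header line, as in a typical delimited extract).
    with Path(path).open("r", encoding="utf-8") as handle:
        records = sum(1 for _ in handle) - 1
    if records <= 0:
        problems.append("null file: header only, no data records")

    # 3. Suspected truncation: record count well below what the supplier declared,
    #    e.g. in a manifest or trailer record.
    if expected_records is not None and records < expected_records * (1 - tolerance):
        problems.append(
            "possible truncation: %d records received, roughly %d expected"
            % (records, expected_records)
        )

    return problems

A file that fails the check would then be quarantined and reported back to the supplier rather than silently discarded, which also speaks to the earlier point about the value rejected data has to the data supplier.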

These are challenging problems, and I am sure there are more. If you have one to add please do so using the comments.

[The index for this series of articles is my data quality page.]


Data State Transition Diagrams

The problem with data flow diagrams is they only convey how data flows, not how well data flows. To fix that, I’d like to present a new type of diagram called a Data State Transition Diagram.

This is a diagram that helps you simplify and optimize data flows. The idea is based on an analysis technique conceived by Chris Wensel of Cascading that he described in an article called “The distance between simple and complex”. It is one of the best articles about simplifying data processing I’ve ever read.

Chris suggests data goes through 4 logical types of transitions as it flows in any system, and these state transitions can each be scored.  The core data transitions he proposes are:

  • Model Transitions – the conversion of data from one meta model to another (e.g. from XML to CSV, or normalised to denormalised)
  • Value Transitions – the creation of new data via value added business rules
  • Phase Transitions – the reading or writing of data between dynamic memory and static persistent storage
  • Location Transitions – the transfer of data between locations.

Through assigning costs to data transitions, he calculates the complexity cost of a data flow. Reducing the cost = reducing the complexity. He expertly observes that overly complex flows have early redundant steps which generate downstream compensation actions. In these cases the simplification is to remove the early redundant action, as well as the compensatory ones correcting for it downstream.
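To make the scoring idea concrete, here is a minimal Python sketch in which a flow is just an ordered list of steps, each tagged with its transition type, and its complexity is the sum of the transition costs. The unit costs and the sample flow are illustrative placeholders of mine, not Chris’s actual weightings.

# Illustrative unit costs per transition type; real weightings would be
# calibrated to your own environment.
TRANSITION_COSTS = {
    "model": 2,     # converting between meta models, e.g. XML -> CSV
    "value": 1,     # creating new data via business rules
    "phase": 3,     # moving data between memory and persistent storage
    "location": 4,  # transferring data between locations
}

def flow_complexity(steps):
    # Sum the cost of each (description, transition_type) step in the flow.
    return sum(TRANSITION_COSTS[transition] for _, transition in steps)

flow = [
    ("parse supplier XML into internal model", "model"),
    ("write staging copy to disk", "phase"),
    ("read staging copy back", "phase"),
    ("derive customer segment", "value"),
    ("transfer result to warehouse server", "location"),
]

print(flow_complexity(flow))  # 13 with the illustrative costs above

Removing a redundant early step, say the staging write and the compensating read that follows it, drops the score from 13 to 7 while the flow still delivers the same value, which is exactly the kind of simplification the analysis is meant to surface.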

This idea is really big: Chris has produced a framework for scoring data flow complexity as a way to find simplifications. Chris says it perfectly:

Simplicity is about subtracting the obvious, and adding the meaningful.

My own contribution has been to turn this analysis into a drawing. I propose it because knowing how to simplify a data flow is not enough. You need to sell the idea, and to do that you need to communicate the simplification ideas to decision makers effectively.

A data state transition diagram has four swimlanes representing the four types of transition.  Each step in the flow is a numbered box describing the transition going on, and these are joined by arrows denoting the sequence of states the data goes through.

This diagramming approach results in a visual representation that shows where your system is spending its energies, and how you can simplify and reduce the costs of the flows while achieving the same value.
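Since I cannot embed the drawing itself here, the sketch below gives a rough text-only approximation of the layout: one column per swimlane, one numbered row per step, with each step printed in the lane of its transition type. The sample flow is the same illustrative one used in the scoring sketch above.

LANES = ["model", "value", "phase", "location"]

def text_swimlanes(steps, width=44):
    # Header row: one column per swimlane.
    print("".join(lane.upper().ljust(width) for lane in LANES))
    # One row per step, in sequence order, placed in its lane's column.
    for number, (description, lane) in enumerate(steps, start=1):
        row = ["" for _ in LANES]
        row[LANES.index(lane)] = "%d. %s" % (number, description)
        print("".join(cell.ljust(width) for cell in row))

flow = [
    ("parse supplier XML into internal model", "model"),
    ("write staging copy to disk", "phase"),
    ("read staging copy back", "phase"),
    ("derive customer segment", "value"),
    ("transfer result to warehouse server", "location"),
]

text_swimlanes(flow)

In the real drawing the arrows between the numbered boxes carry the sequence; in this text form the row order does the same job.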

Below is an example diagram clearly highlighting a poor data flow design. The original PDF version is available through this link: DataStateTransitionAnalysisExample

[Figure: Simplify your data flows using Data State Transition Diagrams]

Assessing the usefulness of the diagram:

In my workplace, the people who fund simplification programs want to “see” the opportunities for simplification before committing funds to them.

These are mostly programme managers who ultimately decide how to invest resources to achieve a business goal. As such, they aren’t necessarily all that technical, and they don’t have a lot of time or desire to study hundreds of pages of detailed technical analysis.

But they may have time to glance at a good diagram. And after all, it’s pretty pictures that sell ideas, right?
