
What’s a bytefreq?

In large data analysis agencies you may hear people mention “freqs”.

These are frequency reports, or data profiling reports, that show the frequency counts of the specific values held in a field. They are really handy things. Want to create a database table with the right field lengths, for example? Just run a format freq on the input data to discover the cardinality of each field and its characteristics, such as max length, along with a histogram of the typical values so you can identify outliers.
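To make that concrete, here is a minimal Python sketch of a field freq. The function name `field_freq`, the report keys and the sample values are all made up for illustration; real profiling tools report a lot more than this.

```python
from collections import Counter

def field_freq(values):
    """Profile one field: cardinality, max length and the most common values."""
    counts = Counter(values)
    return {
        "cardinality": len(counts),                               # distinct values
        "max_length": max((len(v) for v in counts), default=0),   # longest value seen
        "top_values": counts.most_common(5),                      # head of the histogram
    }

# Hypothetical column of country codes, with one outlier ("FRA").
sample = ["GB", "GB", "FR", "GB", "DE", "FRA"]
report = field_freq(sample)
print(report)
```

In this toy column the max length of 3 comes from a single value, which is exactly the kind of outlier a freq is meant to surface before you fix the field width.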

A more specialised type of freq is the bytefreq. You won’t hear about these often, and hardly anyone has code to generate them except people who roll their own (like me).

The idea of the bytefreq is this. Imagine receiving two or three thousand files a month that you load into a data warehouse, but the data suppliers never bother to send you the file specification details, and they aren’t sending you XML. So you need ways to discover the file spec yourself. How many fields are in the file? What code page is the file in? What field separators are in use?

How do you discover these answers without picking up the phone? You write some code that reads the file byte by byte and builds a histogram counting how often each of the 256 possible byte values occurs. You then print this frequency data laid out as a matrix, with some common code pages presented alongside the hex values so it becomes human readable.
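A rough sketch of that idea in Python follows. This is not the code behind the reports described here, just an illustration: the names `bytefreq`, `bytefreq_file` and `format_report` are invented, and the three code pages shown (ASCII, Latin-1 and the EBCDIC code page cp037) are an arbitrary choice. It prints one row per byte value rather than a full matrix.

```python
from collections import Counter

def bytefreq(data: bytes) -> Counter:
    """Count how often each byte value (0-255) occurs."""
    return Counter(data)  # iterating over bytes yields integers

def bytefreq_file(path, chunk_size=1 << 20):
    """Same count, but reading a file in chunks so large files fit in memory."""
    counts = Counter()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            counts.update(chunk)
    return counts

def format_report(counts, codecs=("ascii", "latin-1", "cp037")):
    """One row per byte seen: hex value, count, then the byte in each code page."""
    lines = []
    for value, count in sorted(counts.items()):
        cells = []
        for codec in codecs:
            ch = bytes([value]).decode(codec, errors="replace")
            # Show a dot for control characters and undecodable bytes.
            cells.append(ch if ch.isprintable() and ch != "\ufffd" else ".")
        lines.append(f"0x{value:02X}  {count:>8}  " + "  ".join(cells))
    return "\n".join(lines)

counts = bytefreq(b"a,b\r\nc,d\r\n")
report = format_report(counts)
print(report)
```

For a real file you would call `bytefreq_file` with its path and pass the result to `format_report` in the same way.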

But how does a byte-by-byte histogram of the file help you?

Well, the point of the bytefreq is simple. If someone sent you a file with both carriage returns and line feeds, you’ll see a high count for both of these bytes, and the two counts should be roughly equal. If the file came with only line feeds, however, you’ll see high counts for line feeds but not for carriage returns. If the file is comma delimited, you’ll see a high count for that byte compared with other candidate characters like pipes or tabs. The code page of the file is also very easy to see in these reports. A large spread of counts below 128, and the code page is probably standard ASCII. A smattering of diacritics in the bytes above 127, and it’s probably extended ASCII. A file where almost all the counts sit in the high byte values is almost invariably some form of EBCDIC data.
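Those rules of thumb can be turned into a few lines of Python. Treat this as a sketch under assumptions: the function names are invented, and the thresholds (counts within 10% of each other for CRLF, half the bytes at or above 0x80 for EBCDIC) are arbitrary values picked for illustration. The output is a guess to check by eye against the full report, not a verdict.

```python
from collections import Counter

def guess_line_ending(counts):
    """CRLF if CR and LF counts are roughly equal, otherwise whichever dominates."""
    cr, lf = counts[0x0D], counts[0x0A]
    if cr == 0 and lf == 0:
        return "unknown"
    if min(cr, lf) >= 0.9 * max(cr, lf):  # assumed tolerance
        return "CRLF"
    return "LF" if lf > cr else "CR"

def guess_delimiter(counts, candidates=b",|\t;"):
    """Pick the most frequent candidate separator, or None if none occur."""
    best = max(candidates, key=lambda b: counts[b])
    return chr(best) if counts[best] else None

def guess_code_page(counts):
    """Classify by the share of bytes at or above 0x80."""
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    high = sum(n for b, n in counts.items() if b >= 0x80)
    if high == 0:
        return "ascii"
    if high / total < 0.5:  # assumed cut-off
        return "extended ascii"
    return "possibly ebcdic"

windows_file = Counter(b"id,name\r\n1,Zo\xeb\r\n2,Bob\r\n")
unix_file = Counter(b"a|b\nc|d\n")
mainframe_file = Counter("HELLO WORLD".encode("cp037"))

print(guess_line_ending(windows_file), guess_delimiter(windows_file), guess_code_page(windows_file))
print(guess_line_ending(unix_file), guess_delimiter(unix_file), guess_code_page(unix_file))
print(guess_code_page(mainframe_file))
```

Note that a delimiter guessed this way can be fooled by free text full of commas, which is one reason to look at the whole histogram rather than trust a single number.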

Anyway, the bytefreq is a highly specialised data file analysis technique, and I’ve named my blog after it because bytefreqs represent innovative analysis of data. And it sounds cool too.
