Characterizing table completeness

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

Check out http://setosa.io/blog/2014/08/03/csv-fingerprints/

What we will cover

Let's get to it!

$ java edu.rpi.tw.data.csv.util.BinaryTable
usage: BinaryTable <file> [--comment-character char] [--header-line headerLineNumber] [--delimiter delimiter]
                          [--column-stop colNumber]
see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Characterizing-table-completeness

Column numbers run along the top.
Each row shows a sparseness pattern: a period (.) marks a cell that has a value, and a space marks a cell that is missing one.
The number of rows that share each pattern appears on the right.
Column completeness is indicated along the bottom:
- | indicates that this column has a value in every row;
- _ indicates that some cells in this column are missing values.
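To make the layout concrete, here is a minimal Python sketch that builds the same kind of report from a list of rows. It is not BinaryTable's implementation: the function name `completeness_report`, the choice to treat whitespace-only cells as missing, and the ordering of patterns by frequency are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def completeness_report(rows, width):
    """Return fingerprint lines for `rows`, each a list of cell strings."""
    patterns = Counter()
    for row in rows:
        # Pad short rows so that absent trailing cells count as missing.
        cells = list(row[:width]) + [""] * (width - len(row))
        # Assumption: a cell holding only whitespace counts as missing.
        patterns["".join("." if c.strip() else " " for c in cells)] += 1

    # Header: last digit of each 1-based column number.
    lines = ["".join(str(i % 10) for i in range(1, width + 1))]
    # Assumption: most frequent pattern first (the real tool may order differently).
    for pattern, count in patterns.most_common():
        lines.append(f"{pattern}| {count}")
    # Footer: '|' if no pattern has a gap in this column, '_' otherwise.
    lines.append("".join(
        "|" if all(p[i] == "." for p in patterns) else "_"
        for i in range(width)
    ))
    return lines


# A made-up, tab-delimited table with three columns and four rows.
sample = "a\tb\tc\nd\t\tf\ng\t\ti\nj\tk\t\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
for line in completeness_report(rows, width=3):
    print(line)
```

For the made-up table above, only column 1 is complete, so the footer line is `|__`, and the pattern `. .` (column 2 missing) is reported twice.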

The following sample output is produced when BinaryTable is applied to the GeoNames US zip codes. Among other things, it says that 41,940 rows have values for all cells except cells 8, 9, and 12, and that three rows have values for all cells except cells 6, 7, 8, and 9.

bash-3.2$ java edu.rpi.tw.data.csv.util.BinaryTable source/US.txt --header-line 0 --delimiter '\t'
123456789012
.....    ...| 3
.......  .. | 41940
..... .  .. | 1
... .    .. | 4
.......  ...| 147
... ...  .. | 10
.....    .. | 1408
......   .. | 84
... . .  .. | 5
|||_|____||_
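If counting positions by eye gets tedious, a pattern line can be turned back into column numbers. The helper below is a hypothetical sketch (the name `describe_pattern` is not part of csv2rdf4lod-automation); it assumes each line has the form `<pattern>| <count>` as in the sample output above.

```python
def describe_pattern(line):
    """Split a '<pattern>| <count>' line into (missing_columns, row_count)."""
    # Everything before the last '|' is the pattern; the rest is the count.
    pattern, _, count = line.rpartition("|")
    # Columns are numbered from 1; anything that is not a period is a gap.
    missing = [i for i, mark in enumerate(pattern, start=1) if mark != "."]
    return missing, int(count)


# Two lines copied from the sample output above.
print(describe_pattern(".......  .. | 41940"))
print(describe_pattern(".....    ...| 3"))
```

For the two lines shown, this reports missing columns 8, 9, and 12 for the 41,940-row pattern, and missing columns 6 through 9 for the 3-row pattern.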

What is next

Generating enhancement parameters and FAQ for how to use edu.rpi.tw.data.csv.impl.CSVHeaders
Script: cr-test-conversion.sh uses this to gray out columns that are missing values, which helps when designing the enhancement parameters for GeoNames.
Example: White House Visitor Access Records
Generating enhancement parameters
