
Conversion process phase: csv ify

Tim L edited this page Aug 30, 2013 · 41 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

[up](Conversion process phases)

(The name "csv ify" was a bit too specific; this topic should be renamed to "Conversion process phase: prepare source data")

What's first?

The conversion cockpit's manual/ directory should be used to store the results of any manual tweaks of original source/ data. If no tweaks need to be made, the data can be converted directly from source/.

source/ vs. manual/

As discussed in Conversion process phase: retrieve, a conversion cockpit's source/ directory holds an unmodified copy of the data that you received from the source (e.g. rpi-edu-lebot, whitehouse-gov). If you used pcurl.sh, then you also captured the provenance justifying the file on disk by citing the authoritative URL from which it came. You SHOULD NOT modify any of the files you get from your source. If you need to, make a modified copy in manual/.
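
One way to guard against accidental edits is to drop write permission on the files in source/ and do any editing on a copy in manual/. The sketch below is only an illustration under assumptions: the throwaway cockpit directory, the file name some.tsv, and its contents are placeholders, and nothing here is part of csv2rdf4lod-automation itself.

```shell
#!/bin/bash
# Sketch only: a throwaway directory stands in for a real conversion cockpit.
cockpit=$(mktemp -d)
mkdir -p "$cockpit/source" "$cockpit/manual"

# Pretend this tab-delimited file was retrieved from the source organization.
printf 'id\tname\n1\talpha\n' > "$cockpit/source/some.tsv"

# Remove write permission from the originals so they are not changed by accident.
chmod a-w "$cockpit"/source/*

# Any tweak happens on a copy in manual/, never on the original.
cp "$cockpit/source/some.tsv" "$cockpit/manual/some.tsv"
chmod u+w "$cockpit/manual/some.tsv"

ls -l "$cockpit/source" "$cockpit/manual"
```

The file in source/ stays byte-for-byte what was retrieved, while the copy in manual/ is free to change.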

Do we have to tweak?

If the file that you obtained is already in a tabular format, then you do NOT need to duplicate the file into manual/. You can create the conversion trigger directly from the files in source/. See the next phase, Conversion process phase: create conversion trigger, for more about that.

PLEASE resist the temptation to "clean up" the structure of the table. Do NOT worry if the headers are not on the first line, or if the headers should be "called something else". Do not worry if it is tab-delimited instead of comma-delimited. Do not worry if there are values that don't make sense. Virtually ALL of these situations can be handled using the enhancement parameters that you can encode as RDF and give to the converter. Before you tweak a tabular file to handle these kinds of mundane issues, check out the enhancements, especially the structural enhancements.

Associating tweaked manual/ files back to their originals in source/

If we manually tweak any of the original data files that we retrieved, we add another step in the chain from the final results back to their originals. Associating the files we create with their originals keeps that chain transparent.

A simple example of this is converting a pipe-delimited file into CSV (for a tab-delimited file, the middle sed expression would match a tab instead of |):

bash-3.2$ sed -e 's/^/"/' -e 's/|/","/g' -e 's/$/"/' source/some.tsv > manual/some.tsv.csv

NOTE: Changing tabular cell delimiters can be done with enhancement parameters and should NOT be done manually. The command above is shown only to illustrate the naming and provenance conventions.

Here, we are making a new file in manual/ that parallels the original file in source/. When naming the new file, we use a convention of appending the new file extension (.csv) to the entire file name of the original file (some.tsv). This helps others trace the lineage using only file names, without the overhead of digging through metadata. To be explicit, also generate the metadata by running:

bash-3.2$ justify.sh source/some.tsv manual/some.tsv.csv redelimit

This creates manual/some.tsv.csv.pml.ttl and records that manual/some.tsv.csv came from source/some.tsv using a method known as redelimit (and that you did it).
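
The naming convention described above can be sketched as a small shell function. Note that manual_name is a hypothetical helper written for this illustration and is not a script shipped with csv2rdf4lod-automation. It assumes the original lives somewhere under source/ and that the tweaked file goes directly into manual/.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration): derive the manual/ file
# name by appending the new extension to the ENTIRE original file name.
manual_name() {
   local original="$1"     # e.g. source/some.tsv
   local extension="$2"    # e.g. csv
   echo "manual/$(basename "$original").$extension"
}

manual_name source/some.tsv csv    # prints manual/some.tsv.csv
```

Because the original name survives intact inside the new one, the result could be passed as the second argument to justify.sh in the same way as the hand-typed name shown earlier.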

What's next?
