Conversion process phase: retrieve

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

[up](Conversion process phases)

What's first

What we'll cover here

Background

csv2rdf4lod-automation provides a set of command line tools to facilitate the invocation of the csv2rdf4lod converter (the converter itself is a Java jar). Although one could "just toss" a CSV at the converter, things would become disorganized rather quickly. And since, as described in [Design Objective: Capturing and Exposing Provenance](Design Objective: Capturing and Exposing Provenance), it is important to stay organized, the command line tools assume (i.e., require) a particular directory structure. We think that this extra effort up front will make us better stewards of the data we accumulate and share with the world.

Automation vs. Data

First, there is NO required relation between where you install csv2rdf4lod-automation and where you maintain your data. As long as the scripts in csv2rdf4lod-automation are on your path, you can perform operations on your data. A consequence of this design is that you can maintain multiple data directories.
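For example, one way to put the scripts on your path (a minimal sketch, assuming you cloned the automation into your home directory; the project's install notes may do this for you via a source-me script):

bash-3.2$ export CSV2RDF4LOD_HOME="$HOME/csv2rdf4lod-automation"
bash-3.2$ export PATH="$PATH:$CSV2RDF4LOD_HOME/bin:$CSV2RDF4LOD_HOME/bin/util"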

Starting a Data Directory (source/)

Data directories are rooted at the directory source/. So, let's start a data directory:

bash-3.2$ cd ~/Desktop/
bash-3.2$ mkdir source

The directories in source/ are named after the organizations/people/agents from which you obtained your data. For example, census-gov, rpi-edu-lebot, and wind-vane-23 could each identify a source of data. These directories, in turn, hold directories naming the datasets that the source provides. For example, census-2011, exercise-jogging-statistics, and datafeed could identify a dataset within the scope of its source. The final level of the directory structure organizes the versions of a dataset. For example, release-1, week-4, and 2011-Jan-24 could be used to distinguish among the many versions one may encounter when aggregating data.

Once you start your data directory, you can always feel free to drag that source/ directory anywhere on your disk and you won't break anything -- everything is rooted at that source/ directory. This naturally lends itself to your favorite version control system.
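For example, to put a data directory under git (just a sketch; any version control system works just as well):

bash-3.2$ cd ~/Desktop/source
bash-3.2$ git init
bash-3.2$ git add .
bash-3.2$ git commit -m "initial snapshot of data directory"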

Example: "Please find attached my jogging stats..."

Naming the source

Let's say I gave you a data file. That makes me a source, so you choose an identifier for me and make a place for everything I send you:

bash-3.2$ cd ~/Desktop/source
bash-3.2$ mkdir rpi-edu-lebot

Naming the dataset

Let's say I emailed you a CSV file, saying "here's my jogging stats from last week." That (implicitly) makes the data I gave you a dataset. As a data curator, you need to choose an identifier that reflects my dataset and make a place for the current version and any subsequent versions that might follow:

bash-3.2$ cd ~/Desktop/source/
bash-3.2$ mkdir -p rpi-edu-lebot/exercise-jogging-statistics

Naming the version

The final thing to consider before finishing up the directory structure is the version. It seems like I might send you weekly updates of my jogging stats, so you could choose week-of-2011-Jan-16, but you aren't sure, so you just stick with the date that I sent it to you (2011-Jan-24).

bash-3.2$ cd ~/Desktop/source/
bash-3.2$ mkdir -p rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
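The -p flag creates the whole chain at once; listing the directories shows the source/dataset/version structure (plus the interposed version/ directory discussed below):

bash-3.2$ find . -type d
.
./rpi-edu-lebot
./rpi-edu-lebot/exercise-jogging-statistics
./rpi-edu-lebot/exercise-jogging-statistics/version
./rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24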

Putting it all together: source, dataset, and version

You now have a home for the data file I sent you:

bash-3.2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24

While explaining this to folks, I have found it useful to refer to this type of directory as the conversion cockpit, because it is THE place to go when you want to convert data and THE place you stay throughout the conversion process.

(If you have an eye for symmetry and consistency, you might notice that source/rpi-edu-lebot and version/2011-Jan-24 are part of the path, but dataset/ is missing before exercise-jogging-statistics. If this bothers you, please take solace in the fact that it bothers us, too. If it really bothers you, please vote for the issue and we'll see what we can do.)

Setting up the conversion cockpit

Once we've [named](Conversion process phase: name) a dataset version and have its conversion cockpit directory, we can hop in and set up shop:

bash-3.2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
bash-3.2$ mkdir source
bash-3.2$ mkdir manual

conversion cockpit: what goes in source/?

The source/ directory should contain all materials that you obtained from, well, your source (rpi-edu-lebot, whitehouse-gov, etc). Ideally, you will use pcurl.sh from within source/ to retrieve the data so that we can capture the provenance between the file now on your disk and the source organization's URL. This provides a critical link from "some data you have sitting around" to the (more) authoritative source that provided it to you.

The materials in source/ should stay exactly as they arrived, and you should NEVER delete anything in the source/ directory. Preserving these materials is your ticket to accountability. If anybody doubts your data, you can point to this directory and say "but this is what I got from them."

conversion cockpit: what goes in manual/?

Unfortunately (or fortunately?), computers need humans to do stuff for them. Inevitably, we need to get our hands dirty and do something to "tidy up" the data we get from our source organization. This could be something as simple as changing a field delimiter (pipes or tabs to commas), or something a little less mindless. In either case, the results of our labor go into manual/. Because of the un-reproducible nature of this activity, you should NEVER delete anything in the manual/ directory. The files in manual/ are likely to parallel the files in source/, since they will be modified and pristine analogues of one another, respectively.
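For example, if the file I emailed you were pipe-delimited, a minimal tidy-up might look like this (jogging-stats.txt is a hypothetical file name, and tr is only safe when no field contains a quoted delimiter; anything trickier deserves a proper CSV tool):

bash-3.2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
bash-3.2$ tr '|' ',' < source/jogging-stats.txt > manual/jogging-stats.csv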

conversion cockpit: what goes in automatic/?

(We do not need to explicitly create automatic/, since the conversion automation will create it along the way.)

The automation takes care of creating everything in automatic/. It starts by creating a Turtle file for every CSV you have hanging around in manual/ (if you needed to tweak it) or source/ (if you didn't). Note that the automation doesn't "go looking for" CSVs; it only converts what you specified when creating the [conversion trigger](Conversion process phase: create conversion trigger).

You can ALWAYS delete the automatic/ directory without fear of permanently losing work.
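For example, if a conversion goes sideways, you can blow the directory away and rerun your [conversion trigger](Conversion process phase: create conversion trigger) (the trigger's file name below is an assumption; use whatever trigger you created):

bash-3.2$ rm -rf automatic/
bash-3.2$ ./convert-exercise-jogging-statistics.sh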

Files typically in the automatic/ directory:

  • automatic/menu.csv.raw.params.ttl

  • automatic/menu.csv.raw.sample.ttl

  • automatic/menu.csv.raw.void.ttl

  • automatic/menu.csv.raw.ttl

  • automatic/menu.csv.e1.sample.ttl

  • automatic/menu.csv.e1.void.ttl

  • automatic/menu.csv.e1.ttl

  • _CSV2RDF4LOD_file_list.txt

conversion cockpit: what goes in publish/?

(We do not need to explicitly create publish/, since the conversion automation will create it along the way.)

Among other things, the automation creates publishing scripts in publish/bin/.

You can ALWAYS delete the publish/ directory without fear of permanently losing work.

How do I retrieve data?

bash-3.2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
bash-3.2$ ls
source/
manual/

The ONE time we leave the conversion cockpit is when we retrieve data. Once we do, we get back in as soon as possible.

bash-3.2$ cd source/
bash-3.2$ pcurl.sh http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip
bash-3.2$ punzip.sh WhiteHouse-WAVES-Released-0111.zip
bash-3.2$ cd ..
bash-3.2$ ls source/
WhiteHouse-WAVES-Released-0111.zip
WhiteHouse-WAVES-Released-0111.zip.pml.ttl
WhiteHouse-WAVES-Released-0111.csv
WhiteHouse-WAVES-Released-0111.csv.pml.ttl

NOTE: The script pcurl.sh is an essential part of adding accountability to our workflow. See [pcurl.sh](pcurl.sh) for a description.

If the data isn't available from a URL (like our via-email jogging example), just place what you got into the source/ directory.
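For example (the download path is hypothetical):

bash-3.2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
bash-3.2$ cp ~/Downloads/jogging-stats.csv source/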

What's next?

Now that you have retrieved the data, you need to make sure it is CSV. If it is, you can press on to [create the conversion trigger](Conversion process phase: create conversion trigger).
