Conversion process phase: retrieve
[up](Conversion process phases)
- Naming a dataset should follow SDV organization.
- Installing csv2rdf4lod automation
- Conversion process phase: name
- Script: pcurl.sh can and should be used during the retrieval phase.
- Creating a data directory (`source/`, done once)
- Creating the conversion cockpit (by naming the `source`, `dataset`, and `version`)
- Setting up the conversion cockpit (by creating `source/` and `manual/`)
- Retrieving the data into the conversion cockpit's `source/` (using [pcurl.sh](https://github.com/timrdf/csv2rdf4lod-automation/wiki/Script: pcurl.sh))
`csv2rdf4lod-automation` provides a set of command line tools to facilitate the invocation of the csv2rdf4lod converter (the converter itself is a Java jar). Although one could "just toss" a CSV at the converter, things will become disorganized rather quickly. And since, as described in Design Objective: Capturing and Exposing Provenance, it is important to stay organized, the command line tools assume (i.e., require) a particular directory structure. We think that this extra effort up front will make us better stewards of the data we accumulate and share with the world.
Automation vs. Data
First, there is NO required relation between where you install `csv2rdf4lod-automation` and where you maintain your data. As long as the scripts in `csv2rdf4lod-automation` are on your path, you can perform operations on your data. A consequence of this design is that you can maintain multiple data directories.
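For example, a minimal sketch of putting the scripts on your path, assuming you cloned the automation to `~/csv2rdf4lod-automation` (the clone location and the exact directories to add are assumptions here; see Installing csv2rdf4lod automation for the real setup):
bash-3.2$ export CSV2RDF4LOD_HOME=~/csv2rdf4lod-automation
bash-3.2$ export PATH="$PATH:$CSV2RDF4LOD_HOME/bin:$CSV2RDF4LOD_HOME/bin/util"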
Data directories are rooted at the directory `source/`. So, let's start a data directory:
bash-3.2$ cd ~/Desktop/
bash-3.2$ mkdir source
The directories in `source/` are named after the organizations/people/agents from which you obtained your data. For example, `census-gov`, `edu-rpi-lebot`, and `wind-vane-23` could identify a source of data. These directories, in turn, hold directories naming the `dataset` that the `source` provides. For example, `census-2011`, `exercise-running-statistics`, and `datafeed` could identify a dataset within the scope of its `source`. The final level of the directory structure organizes the `version` of a `dataset`. For example, `release-1`, `week-4`, and `2011-Jan-24` could be used to distinguish among the many possible versions one may encounter when aggregating data.
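Putting those example names together, a populated data directory could look something like the following (the specific source, dataset, and version identifiers are just the hypothetical examples above):
source/census-gov/census-2011/version/release-1/
source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24/
source/wind-vane-23/datafeed/version/week-4/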
Once you start your data directory, you can always feel free to drag that `source/` directory anywhere on your disk and you won't break anything -- everything is rooted at that `source/` directory. This naturally lends itself to your favorite version control system.
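For instance, if git happens to be your version control system of choice (any VCS works here, and this is only a sketch), you could track the whole data directory from its root once it has some content in it:
bash-3.2$ cd ~/Desktop/source
bash-3.2$ git init
bash-3.2$ git add .
bash-3.2$ git commit -m "snapshot of the data directory"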
Naming the source
Let's say I gave you a data file. That makes me a `source`, so you choose an identifier for me and make a place for everything I send you:
bash-3.2$ cd ~/Desktop/source
bash-3.2$ mkdir rpi-edu-lebot
Naming the dataset
Let's say I emailed you a CSV file, saying "here's my jogging stats from last week." That (implicitly) makes the data I gave you a `dataset`. As a data curator, you need to choose an identifier that reflects my dataset and make a place for the current version and any subsequent versions that might happen:
bash-3.2$ cd ~/Desktop/source/
bash-3.2$ mkdir -p rpi-edu-lebot/exercise-jogging-statistics
Naming the version
The final thing to consider before finishing up the directory structure: the `version`. It seems like I might send you weekly updates of my running stats, so you could choose `week-of-2011-Jan-16`, but you aren't sure, so you just stick with the date that I sent it to you (`2011-Jan-24`).
bash-3.2$ cd ~/Desktop/source/
bash-3.2$ mkdir -p rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
Putting it all together: source, dataset, and version
You now have a home for the data file I sent you:
bash-3-2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
While explaining this to folks, I have found it useful to refer to this type of directory as the conversion cockpit, because it is THE place to go when you want to convert data and it is THE place you stay throughout the conversion process.
(If you have an eye for symmetry and consistency, you might notice that `source/rpi-edu-lebot` and `version/2011-Jan-24` are part of the path, but a `dataset/` level is missing before `exercise-jogging-statistics`. If this bothers you, please take solace in the fact that it bothers us, too. If it really bothers you, please vote for the issue and we'll see what we can do.)
Setting up the conversion cockpit
Once we've [named](Conversion process phase: name) a dataset version and have its conversion cockpit directory, we can hop in and set up shop:
bash-3-2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
bash-3-2$ mkdir source
bash-3-2$ mkdir manual
conversion cockpit: what goes in `source/`?
The `source/` directory should contain all materials that you obtained from, well, your source (`rpi-edu-lebot`, `whitehouse-gov`, etc.). Ideally, you will use `pcurl.sh` from within `source/` to retrieve the data so that we can capture the provenance between the file now on your disk and the source organization's URL. This provides a critical link from "some data you have sitting around" to the (more) authoritative source that provided it to you.
The materials in `source/` should stay as is. You should also NEVER delete anything in the `source/` directory. Preserving these materials is your ticket to accountability. If anybody doubts your data, you can point to this directory and say "but this is what I got from them."
conversion cockpit: what goes in `manual/`?
Unfortunately (or fortunately?), computers need humans to do stuff for them. Inevitably, we need to get our hands dirty and do something to "tidy up" the data we get from our `source` organization. This could be something as simple as changing a field delimiter (pipes or tabs to commas), or something a little less mindless. In either case, the results of our labor go into `manual/`. Because of the un-reproducible nature of this activity, you should NEVER delete anything in the `manual/` directory. The files in `manual/` are likely to parallel the files in `source/`, since they will be modified and pristine analogues of one another, respectively.
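As a concrete (and entirely hypothetical) sketch: if the file I sent you had been pipe-delimited, getting a comma-delimited copy into `manual/` could be as simple as the following, assuming the field values themselves contain no pipes, commas, or quoting (the filename is made up for illustration):
bash-3.2$ sed 's/|/,/g' source/jogging-stats.txt > manual/jogging-stats.csv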
conversion cockpit: what goes in `automatic/`?
(We do not need to create `automatic/` explicitly; the conversion automation will do it along the way.)
The automation takes care of creating everything in `automatic/`. It starts with creating a Turtle file for every CSV you have hanging around in `manual/` (if you needed to tweak) or `source/` (if you didn't need to tweak). Note that the automation doesn't "go looking for" CSVs; it only converts what you specified when creating the [conversion trigger](Conversion process phase: create conversion trigger).
You can ALWAYS delete the `automatic/` directory without fear of permanently losing work.
Files typically found in the `automatic/` directory:
- `automatic/menu.csv.raw.params.ttl`
- `automatic/menu.csv.raw.sample.ttl`
- `automatic/menu.csv.raw.void.ttl`
- `automatic/menu.csv.raw.ttl`
- `automatic/menu.csv.e1.sample.ttl`
- `automatic/menu.csv.e1.void.ttl`
- `automatic/menu.csv.e1.ttl`
- `_CSV2RDF4LOD_file_list.txt`
conversion cockpit: what goes in `publish/`?
(We do not need to create `publish/` explicitly; the conversion automation will do it along the way.)
- `publish/bin`
You can ALWAYS delete the `publish/` directory without fear of permanently losing work.
bash-3-2$ cd ~/Desktop/source/rpi-edu-lebot/exercise-jogging-statistics/version/2011-Jan-24
bash-3-2$ ls
source/
manual/
The ONE time we leave the conversion cockpit is when we retrieve data. Once we do, we get back in as soon as possible.
bash-3-2$ cd source/
bash-3-2$ pcurl.sh http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip
bash-3-2$ punzip.sh WhiteHouse-WAVES-Released-0111.zip
bash-3-2$ cd ..
bash-3-2$ ls source/
WhiteHouse-WAVES-Released-0111.zip
WhiteHouse-WAVES-Released-0111.zip.pml.ttl
WhiteHouse-WAVES-Released-0111.csv
WhiteHouse-WAVES-Released-0111.csv.pml.ttl
NOTE: The script pcurl.sh is an essential part of adding accountability to our workflow. See its page for its description.
If the data isn't available from a URL (like our via-email jogging example), just place what you got into the `source/` directory.
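For example (with a made-up filename and download location), that could be as simple as:
bash-3.2$ cp ~/Downloads/jogging-stats.csv source/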
Now that you've retrieved the data, you need to make sure it is CSV. If so, you can press on to create the conversion trigger.
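A quick way to sanity-check that is to peek at the first few lines of what you retrieved, using the White House example file from above:
bash-3.2$ head -3 source/WhiteHouse-WAVES-Released-0111.csv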
- Conversion cockpit - the canonical working directory to convert a dataset.
- Script: pcurl.sh can and should be used during the retrieval phase.
- Conversion process phase: csv-ify
- Conversion process phase: create conversion trigger
- Conversion process phase: pull conversion trigger
- Conversion process phase: tweak enhancement parameters
- Conversion process phase: pull conversion trigger
- Conversion process phase: tweak enhancement parameters
- ... (rinse and repeat; flavor to taste) ...
- Conversion process phase: publish