
Example: Network and IT R and D Budget from NITRD.gov

Timothy Lebo edited this page Feb 14, 2012 · 68 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

See Examples.

In January 2011, we spent a day showing NITRD how to promote some of their data to RDF using csv2rdf4lod so we could integrate it with other existing datasets. During their visit, we worked with some row-based data from NSF, but the other dataset that they were interested in is available at http://nitrd.gov/open/, which requires cell-based conversion because it represents 4-ary statistical data.

diagram comparing row and cell based interpretations
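
To make the row versus cell distinction concrete, here is a minimal Python sketch. It is not csv2rdf4lod's implementation, and the table labels and numbers are placeholders rather than NITRD figures. A row-based reading yields one record per row, while a cell-based reading yields one record per data cell, with the row label, the column label, and the fiscal year attached.

```python
# Placeholder table: labels and numbers are illustrative, not NITRD data.
header = ["Agency", "PCA-A", "PCA-B"]
rows = [
    ["Agency X", "1.0", "2.0"],
    ["Agency Y", "3.0", "4.0"],
]
fiscal_year = "2010"  # assumed to be known from the file the table came from


def row_based(header, rows):
    """One record per row: each column becomes a property of that row."""
    return [dict(zip(header, row)) for row in rows]


def cell_based(header, rows, fiscal_year):
    """One record per data cell, carrying its row and column labels."""
    observations = []
    for row in rows:
        agency = row[0]
        for column_label, cell in zip(header[1:], row[1:]):
            observations.append({
                "fiscal_year": fiscal_year,
                "agency": agency,
                "program_component_area": column_label,
                "value": float(cell),
            })
    return observations


row_records = row_based(header, rows)
cell_records = cell_based(header, rows, fiscal_year)
```

Two rows with two data columns give two row-based records but four cell-based observations, which is the shape the query at the end of this page asks for.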

I'm pretty excited about this dataset because it cites fiscal years, which are a common subject in government data and a good use case to demonstrate how RDF enhanced by csv2rdf4lod can be explicitly linked after only a little thought and care. That little bit of thought and care adds up each time, making questions like "What other data would this connect to?" trivial to answer.

Poking into the http://nitrd.gov/open/ dataset, we see the total R&D budgets for a variety of agencies over the past twenty years. Since I've been staring at data.gov for a while, I know that US AID reports foreign aid to a variety of countries for a variety of reasons (and released it as dataset 1554 on data.gov). The connection? Fiscal Year. I know that, but how would you? Or even worse, how would your computer? Well, we could just ask the computer and it would tell us -- if we used csv2rdf4lod with some care.

The first hiccup in grabbing http://nitrd.gov/open/ was the manual work to obtain the data files: 20 combo-box pull-downs and a save-as for the spreadsheet, the csv, and the pdf (that's 60; don't miss one!). Or you could dig up the man page for wget to crawl to a depth of 2. I ended up grepping through a tidy'd version of the HTML source. Encoding a quick description of the files in that page using RDFa would have helped find and access the data files more quickly. Alvaro helped me mock up the RDFa and we'll be forwarding it along when we send them the results of the raw conversion, initial enhancement, and cell-based enhancement so they can host it at http://nitrd.gov/open/.
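
As an alternative to grepping, the file links could be collected with a few lines of Python. This is only a sketch: the sample HTML and the file names in it are made up for illustration, and it assumes the page exposes the files as ordinary `<a href>` links, which may not match the real markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataLinkCollector(HTMLParser):
    """Collects absolute URLs of links ending in a data-file extension."""

    def __init__(self, base_url, extensions=(".xls", ".csv", ".pdf")):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.extensions):
            self.links.append(urljoin(self.base_url, href))


# Hypothetical page fragment; the real page's structure is an assumption.
sample_html = """
<select><option value="fy2010">FY 2010</option></select>
<a href="files/example_table.xls">spreadsheet</a>
<a href="files/example_table.csv">csv</a>
<a href="files/example_table.pdf">pdf</a>
<a href="about.html">about</a>
"""

collector = DataLinkCollector("http://nitrd.gov/open/")
collector.feed(sample_html)
```

After `feed`, `collector.links` holds the three data-file URLs resolved against the base URL, ready to hand to whatever does the downloading.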

After [pcurl'ing](Script: pcurl.sh) the data files into the source/ directory of my conversion cockpit, exporting the xls to csv into manual/ (capturing the provenance with [justify.sh](Script: justify.sh)), and creating my conversion trigger, I'm ready to pull it to produce the initial raw conversion into automatic/. Since I'm a proponent of quality RDF, my CSV2RDF4LOD_PUBLISH_DELAY_UNTIL_ENHANCED is set to true to prevent anything getting packaged up into publish/. Once I perform an enhancement, the packaging will happen regardless of CSV2RDF4LOD_PUBLISH_DELAY_UNTIL_ENHANCED's value.

Second hiccup: NITRD encoded two statistical dimensions into the same cell: FY 2010 Estimate for one row and FY 2010 Request in an accompanying row. This required a tweak within manual/ for each csv using a few sed commands and a few more minutes to verify the changes. The extra tweaking was captured using [justify.sh](Script: justify.sh) again using the field_parse inference rule.
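
The sed commands themselves are not recorded here, so the following Python sketch only illustrates the idea of separating the two dimensions. It assumes the fused cells look like `FY 2010 Estimate` or `FY 2010 Request`; any other budget-type label would need to be added to the pattern.

```python
import re

# Assumed cell shape: "FY <4-digit year> <Estimate|Request>".
FUSED_CELL = re.compile(r"^FY\s*(\d{4})\s+(Estimate|Request)$")


def split_fused_cell(cell):
    """Return (fiscal_year, budget_type), or None if the cell does not match."""
    match = FUSED_CELL.match(cell.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


estimate = split_fused_cell("FY 2010 Estimate")
request = split_fused_cell("FY 2010 Request")
other = split_fused_cell("Agency")
```

Returning `None` for cells that do not match makes it easy to spot rows the tweak did not touch, which is the part that took the few extra minutes to verify by hand.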

I needed conversion:subject_discriminator to remain the default to differentiate the row names among all 20 table files. I omitted columns 13-16 because they do not contain data, and set column 2 to be a conversion:Only_if_column to skip the non-data in the row after the budget types (it contains their acronyms). I added a conversion:multiplier of 1,000,000, promoted Fiscal Year to a typed resource (templating its URI to LOGD's instance-hub), and typed the row to AgencyBudgetDistribution. I added codebook enhancements to eliminate footnote references within agency names (e.g., NIH 2 -> NIH) and included them in the parameters for all 20 tables. I promoted Agency to a URI and typed it to Agency, asserting owl:sameAs via lod-links to DBpedia using a lod-link file derived from a Google spreadsheet. I cited the [unit of measure](Enhancement pattern: adding units to measurements) as http://dbpedia.org/resource/United_States_dollar, added the last data row for all 20 tables, and subproperty'd fiscal_year and agency to LOGD's predicates.
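
Two of these enhancements are easy to picture outside the converter. The Python sketch below is not how csv2rdf4lod implements them; it only shows the effect of a codebook lookup on agency labels and of the 1,000,000 multiplier on a cell value. The codebook holds just the one mapping mentioned above, and the helper names are hypothetical.

```python
from decimal import Decimal

MULTIPLIER = Decimal(1_000_000)  # mirrors the conversion:multiplier value

# Codebook: raw label -> cleaned label. Only the example from the text is listed.
CODEBOOK = {"NIH 2": "NIH"}


def clean_agency(label, codebook=CODEBOOK):
    """Replace a footnoted agency label with its codebook entry, if any."""
    label = label.strip()
    return codebook.get(label, label)


def scale(cell, multiplier=MULTIPLIER):
    """Parse a numeric cell (commas allowed) and apply the multiplier."""
    return Decimal(cell.replace(",", "")) * multiplier


agency = clean_agency("NIH 2")
dollars = scale("1.5")  # placeholder value, not a NITRD figure
```

Using `Decimal` rather than `float` keeps budget amounts exact after scaling, which matters if the values are later compared or summed across tables.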

In FY07, I cheated and manually edited the input csv to avoid: [] e2:program_component_area <....Cyber_Security_Information_Assurance_1>. The right way to handle it would be a codebook enhancement on that particular column, but I'm not sure if that is implemented for cell-based conversion (hence the cheat).

Queries

```sparql
PREFIX rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ov:         <http://open.vocab.org/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
PREFIX vocab: <http://logd.tw.rpi.edu/source/nitrd-gov/dataset/federal_research_and_development_budget_for_networking_and_information_technology/vocab/>
PREFIX e2:    <http://logd.tw.rpi.edu/source/nitrd-gov/dataset/federal_research_and_development_budget_for_networking_and_information_technology/vocab/enhancement/2/>
PREFIX muo:   <http://purl.oclc.org/NET/muo/muo#>
PREFIX logd:  <http://logd.tw.rpi.edu/vocab/>

SELECT distinct ?fy ?agency ?pac ?in ?value
WHERE {
  GRAPH <http://logd.tw.rpi.edu/source/nitrd-gov/dataset/federal_research_and_development_budget_for_networking_and_information_technology/version/2011-Jan-27>  {
    ?b a vocab:AgencyBudget;
       logd:fiscal_year ?fy;
       logd:agency      ?agency;
       e2:program_component_area ?pac;
       rdf:value        ?value;
       muo:measuredIn   ?in
  }
}
```

results
