Conversion process phase: name
What we will cover:
- Naming a dataset should follow SDV organization.

What is first:
- Conversion process phases
- Installing csv2rdf4lod automation
- Know of a dataset that you want to convert to RDF.
This page describes the "SDV" naming convention that csv2rdf4lod-automation uses to organize datasets.
Consistent naming conventions make working with others' data easier. Establishing identifiers for source, dataset, and version affects the naming of the directories used to organize all of the aggregated data (see Directory Conventions), the Enhancement parameters given to the converter, the URI naming of the resulting RDF datasets, and the URI naming of instances within the RDF datasets. Therefore, thought, care, and consideration should be taken when establishing these identifiers. Keep in mind that any single URI you create could end up in someone else's hands in isolation; it is incredibly useful to humans if they can make a good guess at what it is before dereferencing it and starting to crawl it as linked data.
For all identifiers, we highly recommend that you:
- Use lower case
- Replace spaces and underscores with dashes
- Avoid acronyms; try to expand them
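These rules are easy to apply mechanically. As a minimal sketch (the helper name is our own, not part of csv2rdf4lod; acronym expansion still has to be done by hand):

```sh
# slugify: lower-case a name and replace spaces/underscores with dashes
# (hypothetical helper illustrating the recommendations above)
slugify() {
  echo "$1" | tr '[:upper:]' '[:lower:]' | tr ' _' '--'
}

slugify "White House"   # -> white-house
```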
The Base URI is the web domain that you plan to deploy your Linked Data to. By default, every URI created by csv2rdf4lod (e.g., for datasets, entities, classes, and predicates) is formed by appending to the Base URI; in effect, the Base URI is your namespace for the data you create. At some point, something tangible should respond to HTTP requests of the Base URI you choose, but until you are ready to deploy you can convert data without having a server ready. The converter uses the Base URI specified by the `conversion:base_uri` property in the enhancement parameters, which are [created automatically](Generating enhancement parameters) if they do not already exist. The shell variable `CSV2RDF4LOD_BASE_URI` determines the value of `conversion:base_uri` when the enhancement parameters are generated, so make sure that it is set in your source-me.
```
conversion:base_uri "http://sparql.tw.rpi.edu/ontowiki"^^xsd:anyURI;
```
NOTE: do not include a slash at the end of this; we'll add it for you.
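For example, a minimal sketch of the corresponding source-me line (the value is illustrative; use your own domain):

```sh
# In your source-me file, before generating enhancement parameters
# (no trailing slash; the converter adds it for you):
export CSV2RDF4LOD_BASE_URI="http://sparql.tw.rpi.edu/ontowiki"
```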
Here, `source` indicates a person, organization, or other agent providing you the data that you want to convert. The intent is a living or social entity, not something rote like a web service or an external hard drive. If you grabbed the White House visitors list, identify your source as `whitehouse-gov`. If you got the data from a new acquaintance, identify them as the source using something like `hotmail-com-joey`. If you've got an inside scoop and someone from the White House handed you next week's visitors list on a thumb drive, identify them as the source using something like `whitehouse-gov-potus`. Make like an investigative reporter and mind your source. For several examples, see the list of source identifiers that LOGD has used. Keep in mind that these identifiers are scoped by your base URI, so you control their meaning.
Given a source identifier such as `whitehouse-gov` (and using the example base URI above):
- The directory holding all datasets from this source will be: `source/whitehouse-gov/`
- The URI of the source will become: `http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov`
- The source identifier will be encoded in the conversion parameter: `conversion:source_identifier`
- The web page describing the source will be the document served when that source URI is dereferenced.
- Reuse the DNS name of the organization, ignoring all non-organization-identifying fragments such as "www", "www2", "ftp", "data", etc. (see the sketch below).
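A sketch of that rule (the helper is hypothetical, not part of csv2rdf4lod; GNU sed syntax):

```sh
# Derive a source identifier from a DNS name: drop non-identifying
# prefixes, then map the remaining dots to dashes.
dns_to_source_id() {
  echo "$1" | sed -e 's/^\(www[0-9]*\|ftp\|data\)\.//' -e 's/\./-/g'
}

dns_to_source_id "www.whitehouse.gov"   # -> whitehouse-gov
dns_to_source_id "www.epa-echo.gov"     # -> epa-echo-gov
```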
(Note that this perspective of `source` does not align with dcterms:source, because our dataset is not derived from the source that we are citing. Dublin Core's dcterms:publisher is closer to what we are referencing, though our `source` may be an intermediary that was not the original publisher -- as in hand-me-down data sharing such as scraperwiki.com (which scrapes government sites and rehosts the data as CSV), impacteen.org (which aggregates statistics from many federal surveys that are not readily accessible), and Xian's company financial earnings (which states that the reports are from the government -- but one cannot be sure -- when in reality the reports were submitted to the government by the individual companies).)
(see also Considerations for choosing an identifier for conversion:source_identifier)
The dataset identifier will be encoded in the conversion parameter: `conversion:dataset_identifier`
- Reuse the source organization's identifier for the dataset whenever possible.
- If the dataset has an acronym, use the acronym's expansion and follow it with the acronym (e.g., `enforcement-and-compliance-history-online-echo` from http://www.epa-echo.gov/echo/).
- If none is given, construct a clear, descriptive name based on the web pages' descriptions of the dataset.
For several examples, see the list of dataset identifiers that LOGD has used. Note that most of the identifiers at the bottom are reused from data.gov's numeric convention. Also, keep in mind that these identifiers are scoped by your base URI and your source identifier, so different source organizations can name their datasets similarly without clashing with other organizations.
Very often, a dataset that you retrieved from another organization has been updated since the last time you grabbed it. For example, http://www.uniprot.org/downloads releases every four weeks. So, when you tell a colleague that your analysis showed X, they might want to know which data you analyzed. The version identifier handles this situation.
The version identifier will be encoded in the conversion parameter: `conversion:version_identifier`
- It is highly recommended to REUSE the source organization's name for the version.
- If that is not available, consider using the Last Modified date as reported by HTTP HEAD (see the sketch after this list).
- If that is not available either, consider using the current day's date (i.e., the date of retrieval). This is a very good default in the absence of other version information.
- Optionally, a curator's tag could be used (e.g., we used "mashathon" during a mashathon).
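A minimal sketch of checking the Last Modified date (assumes `curl` is available; the URL is the uniprot example from above):

```sh
# Issue an HTTP HEAD request and look for a Last-Modified header
# to base the version identifier on.
curl -sI http://www.uniprot.org/downloads | grep -i '^Last-Modified:'
```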
When using a date, we suggest the form `2010-Dec-31`, i.e. "year-mon-day" (`date +%Y-%b-%d` or `date +%Y-%b-%d_%H_%M_%S` can be used on unix). This follows the "larger to smaller" convention of the URI decomposition. Also, `Dec` instead of `12` increases readability and avoids confusion for less technical folks and for those used to international conventions -- THESE DATE-LOOKING STRINGS ARE NOT INTENDED FOR PARSING, ONLY AS HUMAN AIDS. (Note that the `2010-Dec-31` convention will not provide chronological order when sorted lexicographically. This is a tradeoff that should be overcome by the following point.) If date modeling is desired, augment the `conversion:VersionedDataset` URI with additional RDF descriptions, using appropriate RDF vocabularies and properly formatted values such as `xsd:date` or `xsd:dateTime`.
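For instance, a minimal sketch of such an augmentation (the dataset URI and the choice of dcterms:modified are illustrative, not mandated by csv2rdf4lod):

```
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

<http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov/dataset/visitor-records/version/2010-Dec-31>
   dcterms:modified "2010-12-31"^^xsd:date .
```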
For several examples, see the list of version identifiers that LOGD has used. Keep in mind that these identifiers are scoped by your base URI, your source identifier, and your dataset identifier.
There are a lot of `conversion:version_identifier`s that look like `2010-Dec-09`, and dataset URIs that have `version/2010-Dec-09`. What does that date mean? Does it have to be a date? What methodology should a curator use to name the version?
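Once chosen, the three identifiers are recorded in the enhancement parameters alongside the base URI. A sketch of what they end up looking like (the values are the illustrative examples used on this page; `visitor-records` is our own made-up dataset identifier):

```
conversion:base_uri           "http://sparql.tw.rpi.edu/ontowiki"^^xsd:anyURI;
conversion:source_identifier  "whitehouse-gov";
conversion:dataset_identifier "visitor-records";
conversion:version_identifier "2010-Dec-31";
```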
After establishing identifiers for source, dataset, and version, they can be used to construct the conversion cockpit -- the place to be when converting a dataset.
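As a sketch of how the identifiers compose that directory (layout per the Directory Conventions page; identifier values remain our illustrative examples):

```sh
# The conversion cockpit lives at source/SSS/DDD/version/VVV/
mkdir -p source/whitehouse-gov/visitor-records/version/2010-Dec-31
cd source/whitehouse-gov/visitor-records/version/2010-Dec-31   # the cockpit
```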
- Conversion process phase: retrieve
- Conversion process phase: csv-ify
- Conversion process phase: create conversion trigger
- Conversion process phase: pull conversion trigger
- ... (rinse and repeat; flavor to taste) ...
- Conversion process phase: tweak enhancement parameters
- Conversion process phase: pull conversion trigger
- Conversion process phase: publish
This page aggregates and replaces: