Alternative Tabular to RDF converters
csv2rdf4lod is a tool by some folks in the Tetherless World Constellation at RPI. It is currently being used as part of the infrastructure for their Linking Open Government Data and Linking Open Biomedical Data projects. Tim Lebo wrote it with some invaluable design guidance from Greg Williams.
The number of utilities available to convert tabular data to RDF suggests a large and diverse set of requirements. To help you find the right match for your needs, this page collects pointers to other utilities that can convert tabular data to RDF.
If you know of yet another, feel free to email Tim or jot (and save) suggestions on this piratepad.
Please note that csv2rdf4lod was NOT used to produce the RDF available at http://www.data.gov, such as http://www.data.gov/semantic/data/alpha/92/dataset-92.rdf.gz. That was from some code somewhere in Google.
Special thanks to Jim McCusker, Paola, Li, Greg, Christoph, and Alvaro for their help in developing this list.
- W3C's wiki: ConverterToRdf
- MIT SIMILE listing: RDFizers
- LOD2 deliverable: Report on Knowledge Extraction from Structured Sources
- Michael Bergman's Sweet Tools listing
- LATC project's Data Publication & Consumption Tools Library
- The Open Data Institute's open data tech review
- Linked University's Converting Legacy Data to RDF
- (broken) http://www.opendataday.org/wiki/Tools
- homepage: http://www.isi.edu/integration/karma/
- Available under Apache 2 License on GitHub
- GUI based
- Uses Conditional Random Field (CRF) to propose mappings to classes and properties.
- Uses relational database and views.
- Does not itself perform data preparation; it is required as a preprocessing step.
- Provides entity matching based on Song and Heflin's entity coreference approach (Silk did not work for them)
- Permits manual curation of sameAs links. Uses PROV-O to distinguish different sets of links.
Publications:
- Used in best in-use paper ESWC 2013: Connecting the Smithsonian American Art Museum to the Linked Data Cloud
- Earliest paper 2007 in IUI
"Tabular to Linked Data"
- UMBC's Varish Mulwad
- ISWC CEUR paper
- Varish Mulwad/UMBC T2LD MS Thesis
- Arbitrary target structures
- Mulwad et al. Automatically Generating Government Linked Data from Tables (AAAI 2011) use ontologies and existing linked data to drive suggestions for enhancements.
Representing multi-dimensional statistical data as RDF using the RDF Data Cube Vocabulary
(csv2rdf4lod handles n-ary relations in spreadsheets, including multi-dimensional statistics; see Converting with cell based subjects)
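The Data Cube pattern can be sketched without any RDF library: each row of a statistical table becomes one qb:Observation, with one triple per dimension and per measure. In the Python sketch below, the example.org URIs and the dimension/measure names (refArea, refPeriod, population) are invented placeholders, not part of any real dataset:

```python
# Toy sketch of the RDF Data Cube pattern: each statistical table row
# becomes one qb:Observation with a triple per dimension and measure.
# All example.org URIs and the dimension/measure names are invented.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
EX = "http://example.org/stats/"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

def row_to_observation(row_id, dimensions, measures):
    """Emit N-Triples lines for one row of a statistical table."""
    obs = f"<{EX}obs/{row_id}>"
    lines = [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}dataset> .",
    ]
    for dim, code in dimensions.items():
        lines.append(f"{obs} <{EX}dimension/{dim}> <{EX}code/{dim}/{code}> .")
    for meas, value in measures.items():
        lines.append(f'{obs} <{EX}measure/{meas}> "{value}"^^<{XSD_DECIMAL}> .')
    return lines

print("\n".join(row_to_observation(
    "r1", {"refArea": "US", "refPeriod": "2010"}, {"population": "308745538"})))
```

A real qb: dataset also needs a qb:DataStructureDefinition declaring the dimensions and measures; the sketch only shows the per-row observation shape.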
- Browser based
- Faceted browsing
- Concurrent editing for efficient manual data cleaning
- Reconciliation with Freebase
- Programmatic control of values
- DERI offers module that exports to RDF: http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/
- https://groups.google.com/forum/#!topic/google-refine/5O-jSE0NBTU/discussion
- http://dataist.wordpress.com/2012/04/10/tutorial-using-google-refine-to-clean-mortgage-data/
- http://ckan.org/2011/07/05/google-refine-extension-for-ckan/
- screencast
- For one-off conversions, Google Refine is quite easy to get started with. It has a great deal of data-cleaning facilities for noisy or illogical data. With its RDF extension you get automated data reconciliation against outside linked data sources of your choice, such as DBpedia. (Rafael)
As of October 2nd, 2012, Google is no longer actively supporting Refine, which has been rebranded as OpenRefine.
RDBToOnto is a tool that automatically generates fine-tuned, populated ontologies (in RDFS/OWL) from relational databases.
A major feature of this tool is the ability to produce highly structured ontologies by exploiting both the database schema and structuring patterns hidden in the data (see publications for details on the RTAXON learning method, including its formal description).
Though automated to a large extent, the process can be constrained in many ways through a friendly user interface. The tool also provides a framework that eases the development and integration of new learning methods and database readers, and a database optimization module that enhances the input database before ontology generation.
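The idea of exploiting structuring patterns hidden in the data (the basis of the RTAXON method cited above) can be illustrated with a toy heuristic: a column whose values repeat heavily is a candidate categorizing attribute, and its distinct values become subclasses. The sketch below is only a loose illustration of that general idea, not the RTAXON method itself; the table, column names, and threshold are invented:

```python
def candidate_subclasses(rows, column, max_distinct_ratio=0.2):
    """Propose one subclass per distinct value of `column` when the
    column looks like a categorizing attribute (few distinct values
    relative to the number of rows). Toy heuristic only."""
    values = [row[column] for row in rows]
    if len(set(values)) / len(values) > max_distinct_ratio:
        return []  # too many distinct values to be a category
    return sorted(set(values))

def subclass_axioms(rows, table, column):
    """Render the proposed subclasses as simple RDFS axioms."""
    return [f"ex:{v.capitalize()} rdfs:subClassOf ex:{table.capitalize()} ."
            for v in candidate_subclasses(rows, column)]

# Invented example: a "product" table whose "category" column
# partitions the rows into a small number of groups.
products = ([{"name": f"fruit{i}", "category": "fruit"} for i in range(6)]
            + [{"name": f"veg{i}", "category": "vegetable"} for i in range(4)])
for axiom in subclass_axioms(products, "product", "category"):
    print(axiom)
```

RTAXON itself is considerably more sophisticated (see the RDBToOnto publications for its formal description); the point here is only that the data, not just the schema, can drive the class hierarchy.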
- Ivan Ermilov, Sören Auer, Claus Stadler: Crowd-Sourcing the Large-Scale Semantic Mapping of Tabular Data at WebSci 2013 (wiki intro)
More notes and comments at Ermilov's wiki.publicdata.eu CSV2RDF Application
Datalift brings raw structured data coming from various formats (relational databases, CSV, XML, ...) to semantic data interlinked on the Web of Data.
- homepage: http://datalift.org
- tutorial
- Cambridge Semantics Anzo
- http://www.cambridgesemantics.com/products/anzo_for_excel - designed to keep large numbers (potentially hundreds) of spreadsheets continuously integrated and in sync across an enterprise, each independently curated.
Anzo (in particular Anzo for Excel) is designed for enterprises to curate large numbers of spreadsheets, map them to ontologies & to existing RDF instance data, and maintain them as changes are made to the spreadsheets or to the data in the spreadsheets. It can be used for CSV-style "tabular" spreadsheets and also for arbitrarily "human-oriented" spreadsheets. It can be used both in interactive modes (where people are opening up and interacting with spreadsheets) and also in automated batch modes.
Anzo stores the RDF data from spreadsheets in an RDF database. Anzo includes both authenticated and unauthenticated SPARQL endpoints for this data; Anzo can also directly publish the data as Linked Data. Finally, Anzo gives you several ways to export RDF data from the database.
Anzo is available in several editions:
- Anzo Express Starter -- includes Anzo for Excel as above for limited numbers of users; freely available
- Anzo Express -- includes Anzo for Excel and Anzo on the Web, a user-friendly browser-based dashboard tool for visualizing, searching, and analyzing RDF data
- Anzo Enterprise -- includes the above, plus tools to connect to data in relational databases, to integrate unstructured data from documents, web pages, etc., to run rules, reasoning, and workflow processes, and various server-side and client-side APIs, etc.
We also make Anzo available for free for academic use. (Lee)
Michel Dumontier's php-lib library is what Bio2RDF has been using for converting TSV, CSV files (and other file formats) to RDF [1]. It contains some aspects that are Bio2RDF specific, namely its support for prefixed URIs, but any Pull Requests on GitHub would be appreciated to generalise that. OSX has PHP installed by default as far as I know so you can use it on the command line without any other dependencies.
You can find examples of scripts using php-lib in the bio2rdf-scripts repository on GitHub [2]. A fairly simple example would be the HGNC converter, which is Tab separated, but quite similar [3].
Cheers,
[1] https://github.com/micheldumontier/php-lib [2] https://github.com/bio2rdf/bio2rdf-scripts [3] https://github.com/bio2rdf/bio2rdf-scripts/blob/master/hgnc/hgnc.php#L129
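The general pattern these Bio2RDF scripts follow — read a delimited file, expand prefixed URIs, and emit a set of triples per row — is easy to sketch. The version below is in Python purely for illustration (php-lib itself is PHP), and the prefixes, column names, and sample rows are invented, not taken from the real HGNC converter:

```python
import csv
import io

# Hypothetical prefix table; real converters would register many more.
PREFIXES = {"ex": "http://example.org/",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#"}

def expand(curie):
    """Expand a prefixed URI like ex:hgnc5 into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

def tsv_to_ntriples(text, subject_col, label_col):
    """Yield one rdfs:label N-Triple per TSV row."""
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        subject = expand(row[subject_col])
        yield f'<{subject}> <{expand("rdfs:label")}> "{row[label_col]}" .'

tsv = "id\tsymbol\nex:hgnc5\tA1BG\nex:hgnc37133\tA1BG-AS1\n"
for triple in tsv_to_ntriples(tsv, "id", "symbol"):
    print(triple)
```

A production converter would of course also escape literals, type them, and emit many triples per row, but the row-loop-plus-prefix-expansion shape is the same.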
- https://github.com/cgutteridge/Grinder
- Christopher Gutteridge's http://graphite.ecs.soton.ac.uk/stuff2rdf/
- IO Informatics' Knowledge Explorer, a good tool. I used Google Refine + RDF plugin and faced some problems with large datasets, but KE worked perfectly well. -Abdul Mateen Rajput
- IO Informatics’ Knowledge Explorer Professional Edition also provides an automated way to import into, and update, a triplestore backend of your choice via monitored folders, which map and import incoming spreadsheets as RDF. You can set up multiple monitored folders with different data mappings; these run as background processes to continuously update one or more connected triplestores (or different graphs in a single triplestore).
Knowledge Explorer also provides scripting within the import mapping, application of thesauri, and other mechanisms for data transformation to clean, consolidate, and harmonize data during import.
You can find out more about this tool here: http://www.io-informatics.com/products/sentient-KE.html -Erich Gombocz
- © 2007; status "Past project"; active April-December 2007
- google group last non-spam entry was 19 Sept 2007
- RDF123 by UMBC's Lushan Han; see ebiquity's RDF123 page. pdf ontology.
The idea behind "spreadsheet" work in .bib is to enrich spreadsheets with an ontology that makes the semantics of the spreadsheet cells, particularly of derived/computed values, more explicit, and to use that information to provide user assistance. -Christoph
- homepage: http://tiree.snipit.org/talis/tables
- download: http://tiree.snipit.org/talis/tables/downloads/csv_mapper_remote.zip
- http://code.google.com/p/data-gov-wiki/source/browse/#svn
- Chunks output into multiple files to suit Tabulator's memory constraints.
- Uses hash-based URIs for quick and easy Linked Data deployment.
S3DB stands for Simple Sloppy Semantic Database. It is a way to represent information on the Semantic Web without the rigidity of relational/XML schema while avoiding the "spaghetti" of unconstrained RDF stores. The critical feature of S3DB is a core datamodel that makes an explicit distinction between the domain of discourse and its instantiation. The motivation and basic design is introduced in our publications [Nature Biotechnology - 24, 1070 - 1071 (2006)], [PLoS ONE 3(8) 2008] and [BMC Bioinformatics 11:387 (2010)]. For a shortcut to the syntax of the REST protocol used to expose S3DB's API click here. For the sprawling list of documents and media describing installation and usage see the documentation page.
- https://sites.google.com/a/s3db.org/s3db/
- http://www.biomedcentral.com/content/pdf/1471-2105-11-387.pdf
- http://ibl.mdanderson.org/~jsalmeida/
- page: http://code.google.com/p/lod-apps/wiki/phpLod#phpCsv2Rdf
- last code update: Aug 4, 2011
- version 2011-02-08 is in "testing status"
- "tons of connectors to get your data from any sources"
- "nice data cleaning and transformation components to massage your data"
- "fuzzymatch option (using levenshtein and metaphone) for reconciliation"
- "job can be exported in a shell script and included in a cron job."
- "Talend is more complex than Refine and the learning curve a bit longer"
Command line version of Mindswap Convert To RDF Tool
GUI version of Michael Grove's ConvertToRDF
- Windows exe circa 2002
- Can convert up to 26 columns and 100 rows.
- Excel2RDF by University of Maryland's Mindswap.
- Defers to Michael Grove's ConvertToRDF
- http://dataincubator.org (google group active April 2009 to August 2011)
- http://www.w3.org/2011/Talks/0223-cshals-egp/#%2843%29, http://www.w3.org/2011/Talks/0223-cshals-egp/#%2847%29
- VIVO slurps their csvs into a relational database and uses the JDBC or d2r widgets to produce RDF.
- Hibernate to map SPARQL to object-relational model.
- RDB-RDF
- C. Bizer, D2R MAP – A Database to RDF Mapping Language, Proceedings of the 12th International World Wide Web Conference, 2003.
- Assem et al. mention their own in ISWC 2010
- Interactively Mapping Data Sources into the Semantic Web (presented at ISWC) http://ceur-ws.org/Vol-783/paper2.pdf
- W3C's A Direct Mapping of Relational Data to RDF First Public Working Draft
- Stanford's M2: a Language for Mapping Spreadsheets to OWL from OWLED 2010
- Auer's Triplify / http://aksw.org/Projects/ReDDObservatory/OntoWiki_sCSV2RDFPlugin
- Sören Auer, Sebastian Dietzold, Jens Lehmann, Sebastian Hellmann, and David Aumueller. Triplify: Light-weight linked data publication from relational databases. In Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl, editors, Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, pages 621–630. ACM, 2009. (eprint)
- Leigh Dodds describing his Gridworks reconciliation api hack: http://www.ldodds.com/blog/2010/08/gridworks-reconciliation-api-implementation/
- H2R (Michael Krauthammer)
- Mapping Master http://data.semanticweb.org/conference/iswc/2010/paper/414/html
- EasyRdf (php): Homepage: http://www.aelius.com/njh/easyrdf/ Download: http://github.com/downloads/njh/easyrdf/easyrdf-0.6.0.tar.gz API Docs: http://www.aelius.com/njh/easyrdf/docs/
- Linked Data Integration Framework (uses R2R Mapping and SILK)
- Pentaho Data Integration suite (http://kettle.pentaho.com/) for converting from relational DBs to RDF. They also used it to translate from XML to RDF.
- Populous (http://populous.org.uk) uses the ontology pre-processing language (OPPL) to convert spreadsheet data into OWL/RDF. It also supports validating spreadsheet content against existing ontologies. Populous is a spawn of RightField (http://rightfield.org.uk). RightField allows the creation of spreadsheets that have ontology terms embedded within them for data validation. -Simon Jupp
- RightField (http://www.rightfield.org.uk) allows you to embed ontology term selection into spreadsheets, and to extract these selections as RDF. It is designed more for assisting in the data collection process (i.e., when users fill in a spreadsheet that has been marked up using RightField, they are automatically collecting semantically enriched data). Our paper from last year's eScience conference describes the RDF extraction in more detail: Wolstencroft, Katherine; Owen, Stuart; Goble, Carole; Nguyen, Quyen; Krebs, Olga; Muller, Wolfgang, "RightField: Semantic enrichment of Systems Biology data using spreadsheets," E-Science (e-Science), 2012 IEEE 8th International Conference on, pp. 1-8, 8-12 Oct. 2012, doi: 10.1109/eScience.2012.6404412 (Katy)
- ALOE - Assisted Linked Data Consumption framework http://aksw.org/projects/aloe
- Information Workbench [5], developed by fluid Operations. Haase, P., Schmidt, M., Schwarte, A.: The Information Workbench as a Self-Service Platform for Linked Data Applications. 2nd International Workshop on Consuming Linked Data (COLD 2011), Bonn, October 2011. http://www.fluidops.com/information-workbench/
- Stanford’s DataWrangler app – a tool for visually creating a script to reformat/clean data
- Tabels is a tool by CTIC to bridge the gap between tabular formats and linked data. Tabels is able to process spreadsheets and CSV files, but also other tabular formats such as statistics-specific ones, analysis tool formats, and so on. Moreover, Tabels is more than a transformation tool: it is geared up with data-sensitive front-end widgets that make it easier for end users to exploit the data. Regarding multidimensional information, Tabels programs can produce DataCube-compliant datasets, which can be dynamically explored using the chart view, an HTML5-based visualization component that automatically generates an interactive interface for exploring the data. An example of how to transform a Eurostat PX file to Data Cube with a generic Tabels program is found at http://idi.fundacionctic.org/tabels/project/eurostat/. Ermilov et al. claim that it is the most advanced because it features "Tables Language": "This language is similar to Sparqlify-ML in the sense that it re-uses syntactic constructs already known from SPARQL. However, it introduces additional features specifically for CSV-RDF transformations, such as loops for iterating over CSV files in ZIP archives and workbooks and pages in Excel spreadsheets."
- http://www.data2semantics.org/2012/11/09/update-tablinker-untablinker/
- Tomas Knap presented a poster on ODCleanStore at ISWC 2012. Some more documentation is here.
- Revelytix makes a tool called Spyder, which is not open-source but is free - http://www.revelytix.com/content/spyder. It will let you use R2RML over a CSV file directly to convert to RDF (or query with SPARQL without converting).
- R2RML tutorial (circa April 2013) http://rdb2rdf.org/
- OpenRefine (formerly Google Refine): http://github.com/OpenRefine/OpenRefine/wiki
- Data Shapes and data transformations http://www.slideshare.net/boricles/data-shapes-and-data-transformations http://arxiv.org/abs/1211.1565
- VIVO Harvester: http://vivo-project.github.com/
- http://code.google.com/p/court/wiki/COIN
- http://tools.ietf.org/html/draft-gregorio-uritemplate-04
- http://www.vivoweb.org/blog/2011/02/vivo-release-12-announcement
- http://www.applications.sciverse.com/action/appDetail/292651
- http://www.niemanlab.org/2013/04/a-new-tool-for-extracting-tables-of-data-from-pdfs/
- https://github.com/coolwanglu/pdf2htmlEX#readme
- Spreadsheet horror stories
Conversion to RDF is reported in the triple store evaluation literature, where queries are proposed as well. Hexastore was used in evaluation, but the authors didn't mention how they converted. LibraryThing is a dataset (LUBM?). The Rdf4x folks have a non-public dataset. This work did not describe the considerations made during the conversion process. (was some of this work from MIT?)
reuse comparison table from http://www.toodledo.com/info/compare.php?
PCI 2013 - Special Session on the Web of Data (DATAWEB) Production and deployment of Open, Linked and Big Data