Using SPARQL CONSTRUCT for Mapping RDF Data #268

tobiasschweizer · 2022-06-20T13:32:25Z

@justin2004 in kg-construct/rml-questions#15 (reply in thread)

I'd definitely like to hear about your experiences with SPARQL constructs for mapping! You could start a discussion here if you like.

Hi there,

I use SPARQL CONSTRUCT queries to map RDF data from a given ontology to schema.org. There are different use cases. Sometimes, we get JSON-LD files from an API. Each file contains a small graph. Using rdflib, these small graphs can be turned into schema.org-conformant graphs (the WHERE clause matches the originally given data and the CONSTRUCT clause rewrites it to schema.org using the bound variables). I think this is very similar (if not logically identical) to what you do with non-RDF sources.

Advantages:

Standard technology SPARQL (different implementations available)
Declarative -> documentation, transparency
Works well for relatively small graphs (straightforward automation with rdflib)

Pain points:

larger graphs and/or complex queries can get slow
SPARQL CONSTRUCT queries can get complex (nested UNION clauses)
transformations sometimes hard to debug
be careful with OPTIONAL, it slows down the transformation
post-processing is needed sometimes: normalisation of dates, splitting of strings (full name -> two props: first and family name)
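The post-processing mentioned in the last point could look roughly like the sketch below. It uses only the Python standard library. The accepted date formats and the "last token is the family name" heuristic are assumptions for illustration, not a general solution for personal names.

```python
from datetime import datetime
from typing import Tuple

# Assumed input formats; extend this tuple to match the real data.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def normalise_date(raw: str) -> str:
    """Return the date as an ISO 8601 string (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognised date format: {raw!r}")


def split_full_name(full_name: str) -> Tuple[str, str]:
    """Split a full name into (given name, family name).

    Naive heuristic: the last whitespace-separated token is treated as
    the family name, everything before it as the given name.
    """
    parts = full_name.split()
    if len(parts) < 2:
        return (full_name.strip(), "")
    return (" ".join(parts[:-1]), parts[-1])


print(normalise_date("20.06.2022"))
print(split_full_name("Ada Lovelace"))
```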

I figured that there are most likely some best practices when it comes to writing such SPARQL CONSTRUCT queries.

My first approach: I started with what I expected to be the "main" resource type and then defined all possible relations all in one scope. So for example:

WHERE {
    ?project a ex:Project ;
        ex:writtenBy ?author .

    ?author a foaf:Person ;
        foaf:firstName ?firstName .
    ...
}

Then I found it better (query performance, handling of complexity) to write one UNION clause per resource type:

WHERE {
    {
        ?project a ex:Project ;
            ex:writtenBy ?projAuthor .
    } UNION {
        ?author a foaf:Person ;
            foaf:firstName ?firstName .
    }
}

The thing is that then in the CONSTRUCT clause you cannot rewrite one graph from the "main" resource to all referred resources since variables from different UNION clauses are not related.

However, you can just handle UNION by UNION:

CONSTRUCT {
    ?project a schema:Project ;
        schema:author ?projAuthor . 

    ?author a schema:Person ;
        schema:givenName ?firstName .

}

At first, this did not seem very intuitive to me, but it greatly reduced the query's complexity because the query could be split into separate parts. ?projAuthor and ?author are independent variables, but if the original graph contains the data as expected, both will be bound to the same IRI (once as an object, once as a subject). Hence the relation between project and person (author) will exist in the schema.org-compliant RDF.

However, sometimes you need restrictions that affect several UNION clauses, e.g. only map persons that are authors. This is where it gets a bit redundant.

Please feel free to give me some critical feedback. I admit that this is a bit trial and error rather than based strictly on the specs (scoping and UNION) ...

justin2004 · 2022-06-21T21:41:01Z

I've also felt most of the pain points.

normalisation of dates, splitting of strings (full name -> two props: first and family name)

Though for these I've been using SPARQL fruitfully. For date normalization I wrote an Apache Jena SPARQL value function which I describe here.
And for splitting of strings I usually find SPARQL's built-in string functions to be sufficient, specifically replace.

tobiasschweizer Jun 22, 2022
Author

In general, I try to use as few functions as possible. It somehow weakens the clarity of the declarative approach and is also very limited compared to what one can do in programming languages. This is also consistent with my experience with other technologies like XSLT. I am rather inclined to add a post-processing step.

So far I have avoided creating my own SPARQL functions because the SPARQL CONSTRUCT queries should work with several implementations, i.e. rdflib or an actual triplestore. This helps mitigate risks tied to specific applications (vendor lock-in etc.). What is your experience or opinion here?

justin2004 Jun 27, 2022

It somehow weakens the clarity of the declarative approach and is also very limited compared to what one can do in programming languages.

Given that I often work with non-software engineers, I think I prefer using a declarative approach with the occasional procedural invocation. I think the cost is small, especially for SPARQL, since there are really just 3 ways to invoke an arbitrary procedure:
(1) with a service clause
(2) with a value function (inside a bind())
(3) with a magic property

All the non-software engineers that I've worked with that know SPARQL understood those 3 within minutes after they encountered them.
(1) returns bindings
(2) returns 1 binding
(3) returns bindings

I don't think that much uniformity (that all arbitrary function invocations can only return bindings) weakens the clarity of declarative SPARQL enough to cause trouble.

SPARQL CONSTRUCT queries should work with several implementations

I think all of the mature SPARQL engines I've read the docs for allow one to define value functions (2) and magic properties (3). And for service clauses (1) you could deploy any SPARQL engine you want.

tobiasschweizer Jun 29, 2022
Author

Ok, fair enough. Still I would try to avoid writing my own SPARQL functions but I understand that this can be useful. Speaking of magic properties, I found that Lucene index properties work quite differently depending on the triplestore used (sometimes you get the exact literal that matched, sometimes just the resource it's connected with).

What's your experience with SPARQL optimisation? Sometimes I find it quite tricky to make sure that the actual semantics did not change, e.g., when working with nested UNION clauses.

justin2004 Jul 12, 2022

Yeah, I think Neptune and Stardog each handle the full text search (Lucene syntax) a little differently but I think it is fairly easy to port queries back and forth.

As for SPARQL optimization, I don't feel I am an expert in that domain. But I do find query plans to be useful. Stardog's query plans are helpful with their cardinality estimation. With Apache Jena I do look at the SPARQL algebra, and I've had to look at the TDB2 verbose logging to figure out which parts of a query are expensive.
I still think Jena could improve there. I'd like to see something like a heatmap (number of TDB2 lookups or something similar) overlaid on the SPARQL query. I think that might be a nice open source contribution if anyone has time!

Also I found this presentation to be helpful.

I don't think I've written many nested UNIONs. I feel like I might be reaching for nested subqueries instead perhaps?

justin2004 · 2022-07-30T10:48:00Z
