Improve SPARQL query optimization by using selectivity #70

Open
jindrichmynarz opened this issue Aug 7, 2019 · 0 comments

SPARQL query optimization in Halyard is based on cardinalities pre-computed by Halyard Stats. SPARQL queries frequently contain triple patterns with similar cardinality but drastically different selectivity (see, e.g., here for a description of selectivity). Selectivity can be considered the ratio of a predicate's distinct values to all its values, i.e., COUNT(DISTINCT ?val) / COUNT(?val).
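
To make the definition concrete, here is a minimal Python sketch of that ratio. The function name `selectivity` and the toy values are assumptions for illustration only; nothing here is Halyard code.

```python
def selectivity(values):
    """Ratio of distinct values to all values,
    i.e. COUNT(DISTINCT ?val) / COUNT(?val)."""
    values = list(values)
    if not values:
        return 0.0  # convention chosen for this sketch: empty input has selectivity 0
    return len(set(values)) / len(values)


# Toy data (made up): objects of two properties with the same cardinality (4).
about_objects = ["wd:Q1", "wd:Q2", "wd:Q3", "wd:Q4"]  # every object is unique
language_objects = ["en", "en", "de", "en"]            # only two distinct objects

print(selectivity(about_objects))     # 1.0
print(selectivity(language_objects))  # 0.5
```

Both lists have the same cardinality, yet a lookup by a known object narrows the first far more than the second.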

In some cases the optimizer can prioritize triple patterns that have low cardinality but are unselective. For example, in Wikidata the properties that link Wikidata entities with Wikipedia pages have similar cardinality. See the Halyard Stats for schema:about, schema:inLanguage, and schema:isPartOf:

```sparql
PREFIX halyard: <http://merck.github.io/Halyard/ns#>
PREFIX schema:  <http://schema.org/>
PREFIX void:    <http://rdfs.org/ns/void#>

SELECT ?property ?cardinality
WHERE {
  GRAPH halyard:statsContext {
    VALUES ?property {
      schema:about
      schema:inLanguage
      schema:isPartOf
    }
    [] void:propertyPartition [
        void:property ?property ;
        void:triples ?cardinality
      ] .
  }
}
```

| property | cardinality |
| --- | --- |
| schema:about | "126270886"^^xsd:long |
| schema:inLanguage | "67986056"^^xsd:long |
| schema:isPartOf | "67986056"^^xsd:long |

schema:about has the highest cardinality, but it has vastly better object selectivity than schema:inLanguage or schema:isPartOf, which have only a few distinct values. If the object of a schema:about triple pattern is known, the query optimizer would produce a better plan by giving that pattern priority. The object selectivity of schema:about is high because most of its objects are unique (i.e., Wikidata entities), whereas the object selectivity of schema:inLanguage is low because it has very few unique objects (i.e., languages of Wikipedias).
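
The effect on pattern ordering can be sketched in Python. The cardinalities below are the ones reported above; the distinct-object counts are made-up placeholders, not measurements, and the estimate (expected matches for a pattern with a bound object is roughly 1 / object selectivity) is one simple heuristic, not Halyard's actual optimizer logic. For illustration, the sketch assumes the object is bound in all three patterns.

```python
from dataclasses import dataclass

# Cardinalities as reported by Halyard Stats in this issue.
CARDINALITY = {
    "schema:about": 126270886,
    "schema:inLanguage": 67986056,
    "schema:isPartOf": 67986056,
}

# PLACEHOLDER distinct-object counts: invented for illustration, NOT measured.
ASSUMED_DISTINCT_OBJECTS = {
    "schema:about": 60_000_000,
    "schema:inLanguage": 300,
    "schema:isPartOf": 900,
}


@dataclass
class PropertyStats:
    prop: str
    triples: int               # cardinality (void:triples)
    object_selectivity: float  # distinct objects / triples (proposed statistic)


def estimated_matches_bound_object(stats):
    """Expected triples matching a pattern whose object is bound:
    triples / distinct objects, which equals 1 / object selectivity."""
    return 1.0 / max(stats.object_selectivity, 1.0 / stats.triples)


all_stats = [
    PropertyStats(p, n, ASSUMED_DISTINCT_OBJECTS[p] / n)
    for p, n in CARDINALITY.items()
]

# Ordering by cardinality alone puts schema:about last.
by_cardinality = [s.prop for s in sorted(all_stats, key=lambda s: s.triples)]

# Ordering by the selectivity-based estimate puts schema:about first.
by_estimate = [
    s.prop for s in sorted(all_stats, key=estimated_matches_bound_object)
]

print(by_cardinality)
print(by_estimate)
```

With these placeholder numbers the cardinality-only ordering evaluates schema:about last, while the selectivity-aware ordering evaluates it first.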

Adding selectivity to Halyard Stats could mean adding two numbers to each partition. For property partitions, for example, these would be selectivity with respect to objects and selectivity with respect to subjects. The query optimizer can then use them to produce better query plans.
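
As a rough sketch of what the extended statistics might contain, the Python below computes the triple count plus subject and object selectivity per property from an in-memory list of triples. The function name `property_partition_stats`, the dictionary keys, and the toy triples are all hypothetical; Halyard Stats itself runs over HBase and would need a scalable way to count distinct values, for which exact sets like these are only a stand-in.

```python
from collections import defaultdict


def property_partition_stats(triples):
    """Per property: triple count, subject selectivity, object selectivity."""
    counts = defaultdict(int)
    subjects = defaultdict(set)
    objects = defaultdict(set)
    for s, p, o in triples:
        counts[p] += 1
        subjects[p].add(s)
        objects[p].add(o)
    return {
        p: {
            "triples": n,
            "subject_selectivity": len(subjects[p]) / n,
            "object_selectivity": len(objects[p]) / n,
        }
        for p, n in counts.items()
    }


# Toy triples (made up) mimicking the Wikipedia page links discussed above.
toy_triples = [
    ("page:1", "schema:about", "wd:Q1"),
    ("page:2", "schema:about", "wd:Q2"),
    ("page:1", "schema:inLanguage", "en"),
    ("page:2", "schema:inLanguage", "en"),
]

stats = property_partition_stats(toy_triples)
for prop, numbers in stats.items():
    print(prop, numbers)
```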
