Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BASE declarations are now handled correctly #1660

Merged
merged 11 commits into from
Dec 10, 2024
Merged

BASE declarations are now handled correctly #1660

merged 11 commits into from
Dec 10, 2024

Conversation

hannahbast
Copy link
Member

@hannahbast hannahbast commented Dec 5, 2024

When a BASE IRI is defined, IRIs without a scheme (like <http://...>) should be resolved. For example, with BASE <http://purl.uniprot.org/uniprot/>, the relative IRI <UPI001AF4585D> should be resolved to <http://purl.uniprot.org/uniprot/UPI001AF4585D>, and the absolute IRI </prosite/PS51927> should be resolved to <http://purl.uniprot.org/prosite/PS51927>.

NOTE: Without BASE declaration, relative IRIs like <a> are accepted and left unchanged. This is used extensively in our tests. This is not 100% compliant with the SPARQL standard, but we consider being slightly more accepting than the standard harmless in this case.

When a BASE IRI is defined, relative IRIs should be resolved against it.
For example, with `BASE <http://example.org/>`, the relative IRI `<a>`
should be resolved to `<http://example.org/a>`.

This is a fist step in this direction. So far, most of the test fail
because they use relative IRIs extensively, assuming that they are
left untouched.

PS: UniProt uses BASE extensively, so this is a blocker for parsing
UniProt via Turtle.
@hannahbast hannahbast marked this pull request as draft December 5, 2024 04:11
Copy link

codecov bot commented Dec 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.68%. Comparing base (5b27eeb) to head (0d14ac8).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1660      +/-   ##
==========================================
+ Coverage   89.64%   89.68%   +0.04%     
==========================================
  Files         383      383              
  Lines       36889    36942      +53     
  Branches     4167     4174       +7     
==========================================
+ Hits        33068    33132      +64     
+ Misses       2522     2510      -12     
- Partials     1299     1300       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hannahbast hannahbast marked this pull request as ready for review December 8, 2024 20:46
@hannahbast hannahbast changed the title First step towards a correct implementation of BASE BASE declarations are now handled correctly Dec 8, 2024
Hannah Bast added 2 commits December 8, 2024 21:47
TODO: I still don't understand how the `prefixMap_` can become empty
again, despite the default initialization in the `TurtleParser` class.
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An initial and thorough round of reviews.

Comment on lines 204 to 217
// TODO<hannah> I would prefer to just call `prefixMap_.at(...)`, but then
// some of the tests fails because the keys are not in the map (despite the
// initialization above).
const TripleComponent::Iri& baseForRelativeIri() {
// return prefixMap_.at(baseForRelativeIriKey_);
return prefixMap_
.try_emplace(baseForRelativeIriKey_, TripleComponent::Iri{})
.first->second;
}
const TripleComponent::Iri& baseForAbsoluteIri() {
// return prefixMap_.at(baseForAbsoluteIriKey_);
return prefixMap_
.try_emplace(baseForAbsoluteIriKey_, TripleComponent::Iri{})
.first->second;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we should definitely fix this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell me, which tests those are (or I can see myself after the next iteration).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect, that these test are FRIENDS and explicitly set the prefix map. This can be reimplemented by a setter function which inserts the prefixes if necessary.

src/parser/Iri.h Show resolved Hide resolved
@@ -1332,7 +1332,7 @@ auto CompressedRelationWriter::createPermutationPair(
// TODO<joka921> Use `CALL_FIXED_SIZE`.
ad_utility::CompressedExternalIdTableSorter<decltype(compare), 0>
twinRelationSorter(basename + ".twin-twinRelationSorter", numColumns,
4_GB, alloc);
30_GB, alloc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely part of a different PR.

src/parser/Iri.cpp Outdated Show resolved Hide resolved
pos = iri_.find('/', pos + 3);
}
// If there is no `/` after the scheme or the first such `/` is the last
// character, the base IRI is the IRI itself.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This also is the return for the case no scheme found . Do you want to assert that we either have an absolute IRI as the base, or the empty base (for backwards compatibility) but nothing ill-formed either ?
    (might be that you have this check elsewhere).

src/parser/RdfParser.cpp Outdated Show resolved Hide resolved
src/parser/RdfParser.cpp Outdated Show resolved Hide resolved
forAllParsers(testWithParser, "<missing> <object> .");
// Make sure that the `.` is not the final character because we (need to)
// handle that as a special case in `RdfStreamParser<T>::getLineImpl`.
forAllParsers(testWithParser, "<missing> <object> . ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be working again with the discussed end regex fix

  • it should be possible to add a test that checks, that we have fixed the uniprot bug with dots in the middle of an IRIref correctly (e.g. in the StreamAndParallelparser test, where we manually fiddle with blocks anyway.

test/parser/LiteralOrIriTest.cpp Outdated Show resolved Hide resolved
test/parser/LiteralOrIriTest.cpp Outdated Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hjhhh

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!

Co-authored-by: Johannes Kalmbach <[email protected]>
@sparql-conformance
Copy link

@hannahbast hannahbast merged commit 0400f90 into master Dec 10, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants