Speedup NTriples parser #1558

white-gecko · 2021-04-28T09:32:57Z

white-gecko
Apr 28, 2021
Maintainer

After a successful speedup of the Turtle parser (#1266), we should see how we can speedup the the n-triples parser as well.

rchateauneu · 2021-04-29T08:35:57Z

rchateauneu
Apr 29, 2021

Is there a specific test data file on which we could focus, as a benchmark, please ?

0 replies

white-gecko · 2021-04-30T13:03:24Z

white-gecko
Apr 30, 2021
Maintainer Author

Sure you can use the orkg.nt from: https://github.com/AKSW/orkg-dump and you can use the wordnet file https://en-word.net/static/english-wordnet-2020.ttl.gz and convert it with rapper to n-triples. You can also use this one: http://dbpedia-mappings.tib.eu/databus-repo/kurzum/cleaned-data/geonames/2018.03.11/geonames_all.nt.bz2 (from https://databus.dbpedia.org/kurzum/cleaned-data/geonames/)

I came to the following run times on my system with the rdflib and the orkg file and as comparison with the raptor utils (which are compiled c code ;-) )

$ time python import.py orkg.nt nt
python import.py orkg.nt nt  22,37s user 0,46s system 99% cpu 22,870 total
$ time python import.py orkg.nt ttl
python import.py orkg.nt ttl  33,10s user 0,42s system 99% cpu 33,524 total
$ time rapper -i ntriples -o ntriples orkg.nt > /dev/null
rapper: Parsing URI file:///tmp/play/orkg.nt with parser ntriples
rapper: Serializing with serializer ntriples
rapper: Parsing returned 380647 triples
rapper -i ntriples -o ntriples orkg.nt > /dev/null  1,35s user 0,04s system 99% cpu 1,390 total
$ time rapper -i ntriples -o turtle orkg.nt > /dev/null
rapper: Parsing URI file:///tmp/play/orkg.nt with parser ntriples
rapper: Serializing with serializer turtle
rapper: Parsing returned 380647 triples
rapper -i ntriples -o turtle orkg.nt > /dev/null  3,80s user 0,05s system 99% cpu 3,848 total

0 replies

rchateauneu · 2021-05-02T06:45:57Z

rchateauneu
May 2, 2021

There are possible speedups, for example by using a single regular expression to parse the input triples or quads into their three and four individual components respectively.

However, before a pull-request, I would like to confirm two points about opening and reading text files.

(1) This function opens a file in binary mode:

def _create_input_source_from_location(file, format, input_source, location):
...
    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "rb")

Later on, this implies a conversion from _io.BufferedReader with codecs.getreader("utf-8"). This conversion is needed because all regular expressions and further processing are on Unicode, not bytes. In a simple test which reads a nt files with Graph.parse(), codecs.readline() costs at least 10% of the whole time. I have tested replacing "rb" by "r" and there is a visible speedup.

(2) At the moment, W3CNTriplesParser.readline() does not use a plain readline() apparently because of terminator "\n\r", "\n", "r". But Python documentation (Py3 at least) says something which is what we want : https://docs.python.org/3/library/io.html

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

To conclude, my intention is these two changes, on top of using less regular expressions:

In _create_input_source_from_location(), open text files in text mode "r" instead of "rb" (A little tweak is needed because of BOM, see test/test_n3.py::TestN3Case::testIssue156 ).
Use a simple file readline(), eliminating the need to internally store the file buffer content, instead of several reads() and an internal buffer to store the content.

The impact might be that some tests could be broken, and there might be Python 2 incompatibilities. But the code is more natural and faster I think. Please note that these two changes can be done separately, in other branches.

What is please your opinion about these changes ?

Many thanks in advance.

0 replies

white-gecko · 2021-05-06T17:11:15Z

white-gecko
May 6, 2021
Maintainer Author

That sounds very good. The change towards plain open should also be related to #1222 . Would you recommend using StringIO or should it be something different?

0 replies

rchateauneu · 2021-05-06T18:22:10Z

rchateauneu
May 6, 2021

Your suggested change simplifies a lot the code and notably removes a lot of conversions. Does it run the tests and do you have an idea of the performance speedup ?
Because it does not change the overall structure of the code, it would be great to merge it soon, if possible.

0 replies

rchateauneu · 2021-05-21T07:29:21Z

rchateauneu
May 21, 2021

Indeed, there might be interferences with #1222 and also #1276

However, there are some optimizations in these independent PRs , ready for merging:

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speedup NTriples parser #1558

{{title}}

Replies: 6 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Speedup NTriples parser #1558

white-gecko Apr 28, 2021 Maintainer

Replies: 6 comments

rchateauneu Apr 29, 2021

white-gecko Apr 30, 2021 Maintainer Author

rchateauneu May 2, 2021

white-gecko May 6, 2021 Maintainer Author

rchateauneu May 6, 2021

rchateauneu May 21, 2021

white-gecko
Apr 28, 2021
Maintainer

rchateauneu
Apr 29, 2021

white-gecko
Apr 30, 2021
Maintainer Author

rchateauneu
May 2, 2021

white-gecko
May 6, 2021
Maintainer Author

rchateauneu
May 6, 2021

rchateauneu
May 21, 2021