Speedup NTriples parser #1558
Replies: 6 comments
-
Is there a specific test data file we could focus on as a benchmark?
-
Sure, you can use orkg.nt from https://github.com/AKSW/orkg-dump, or the WordNet file https://en-word.net/static/english-wordnet-2020.ttl.gz converted to N-Triples with rapper. You can also use this one: http://dbpedia-mappings.tib.eu/databus-repo/kurzum/cleaned-data/geonames/2018.03.11/geonames_all.nt.bz2 (from https://databus.dbpedia.org/kurzum/cleaned-data/geonames/). On my system I got the following run times with rdflib on the orkg file, compared with the Raptor utilities (which are compiled C code ;-) )
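For anyone who wants to repeat such a measurement, here is a minimal timing sketch. The helper names `time_parse` and `count_lines` are hypothetical and not part of rdflib; `count_lines` is only a stand-in parser so that the snippet runs without rdflib installed.

```python
import time


def time_parse(parse, path, repeat=3):
    """Return the best wall-clock time in seconds of parse(path) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        parse(path)
        best = min(best, time.perf_counter() - start)
    return best


def count_lines(path):
    """Stand-in 'parser' (hypothetical): counts non-empty lines of a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

With rdflib installed, the stand-in could be swapped for something like `lambda p: rdflib.Graph().parse(p, format="nt")` and pointed at one of the files above.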
-
There are possible speedups, for example by using a single regular expression to parse the input triples or quads into their three or four individual components, respectively. However, before opening a pull request, I would like to confirm two points about opening and reading text files. (1) This function opens a file in binary mode:
Later on, this requires wrapping the _io.BufferedReader with codecs.getreader("utf-8"). This conversion is needed because all regular expressions and further processing operate on Unicode strings, not bytes. In a simple test that reads an .nt file with Graph.parse(), the codecs reader's readline() accounts for at least 10% of the total time. I tested replacing "rb" with "r" and there is a visible speedup. (2) At the moment, W3CNTriplesParser.readline() does not use a plain readline(), apparently because of the line terminators "\r\n", "\n" and "\r". But the Python documentation (Python 3 at least) describes exactly the behaviour we want: https://docs.python.org/3/library/io.html
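To illustrate both points, here is a small standard-library sketch (not rdflib code; the function names and the sample data are made up for the example). It reads the same file once in binary mode through a codecs reader and once in text mode, where `newline=None` turns on universal newlines so that all three terminators arrive as "\n".

```python
import codecs
import os
import tempfile

# Illustrative N-Triples data using three different line terminators.
DATA = (
    b'<http://example.org/s1> <http://example.org/p> "a" .\r\n'
    b'<http://example.org/s2> <http://example.org/p> "b" .\r'
    b'<http://example.org/s3> <http://example.org/p> "c" .\n'
)


def read_binary_then_decode(path):
    """Open in binary mode, then decode through a codecs stream reader."""
    with open(path, "rb") as raw:
        reader = codecs.getreader("utf-8")(raw)
        return reader.read().splitlines()


def read_text_mode(path):
    """Open in text mode; newline=None enables universal newlines."""
    with open(path, "r", encoding="utf-8", newline=None) as f:
        return [line.rstrip("\n") for line in f]


def demo():
    """Write DATA to a temporary file and read it back both ways."""
    fd, path = tempfile.mkstemp(suffix=".nt")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(DATA)
        return read_binary_then_decode(path), read_text_mode(path)
    finally:
        os.remove(path)
```

Both functions should return the same three statements, which suggests a plain text-mode readline() can handle the terminators without a custom line splitter.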
To conclude, I intend to make these two changes, on top of using fewer regular expressions:
The impact might be that some tests break, and there might be Python 2 incompatibilities. But I think the code is more natural and faster. Please note that these two changes can be done separately, in separate branches. What is your opinion about these changes? Many thanks in advance.
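As a rough idea of what the single regular expression mentioned above could look like, here is a simplified sketch. It is not the rdflib implementation and does not cover the full N-Triples/N-Quads grammar (for example, it does not validate escapes or IRI characters); the names `STATEMENT` and `split_statement` are hypothetical.

```python
import re

# Simplified building blocks; assumptions for illustration, not the full grammar.
IRI = r"<[^<>\s]*>"
BNODE = r"_:[A-Za-z0-9_][A-Za-z0-9_.-]*"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:\^\^' + IRI + r"|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?"

# One pattern that captures subject, predicate, object and an optional graph.
STATEMENT = re.compile(
    r"\s*(?P<s>{iri}|{bnode})"
    r"\s*(?P<p>{iri})"
    r"\s*(?P<o>{iri}|{bnode}|{lit})"
    r"(?:\s*(?P<g>{iri}|{bnode}))?"
    r"\s*\.\s*(?:#.*)?$".format(iri=IRI, bnode=BNODE, lit=LITERAL)
)


def split_statement(line):
    """Split one triple or quad line into (s, p, o, g); g is None for triples."""
    m = STATEMENT.match(line)
    if m is None:
        raise ValueError("not a valid statement: %r" % line)
    return m.group("s", "p", "o", "g")
```

A single match per line replaces several successive regex calls, which is where the saving would come from; the terms are returned as raw strings and would still need to be converted to RDF terms.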
-
That sounds very good. The change towards plain open should also be related to #1222. Would you recommend using StringIO, or should it be something different?
-
Your suggested change simplifies the code a lot and notably removes many conversions. Does it pass the tests, and do you have an idea of the performance gain?
-
Indeed, there might be interference with #1222 and also #1276. However, there are some optimizations in these independent PRs, ready for merging:
-
After a successful speedup of the Turtle parser (#1266), we should see how we can speed up the N-Triples parser as well.