Help parsing large file #44
Hi @h4ck3rm1k3, thanks for the report!

> 1. will attoparsec help?

If you use the git repository for this library, then you can try the experimental attoparsec support provided by @axman6 in November. Try something like:

    parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"

Does that improve the memory performance?

> 2. can we convert the rdf into an equivalent haskell source program that would be compiled and strongly typed?

Interesting idea. What exactly would you want to convert to Haskell types? You might mean:

1. The schema for each ontology used in a Turtle file? E.g. if the friend-of-a-friend ontology is used, then the foaf:homepage predicate would be turned into a Haskell type? For this, have you looked at type providers? It's that sort of thing, i.e. turning a closed-world schema into types. F# has them; Haskell doesn't.
2. Turning Turtle data into types? I'm not sure how that would work, or why turning *instances* (triples) into Haskell types would be a useful thing to do.

I'm interested to know if attoparsec (above) gives you better results.
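For reference, a complete driver around that call might look like this. It's a minimal sketch, assuming rdf4h 3.x with its `TList` graph representation; the error handling is illustrative only:

```haskell
import Data.RDF

main :: IO ()
main = do
  -- Parse with the experimental attoparsec backend rather than Parsec.
  result <- parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
              :: IO (Either ParseFailure (RDF TList))
  case result of
    Left err    -> print err
    -- Counting the triples forces the whole parse, which makes the
    -- memory behaviour of the two backends easy to compare.
    Right graph -> print (length (triplesOf graph))
```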
Yes, I have downloaded the git repo and am looking. I am interested in converting the data into types based on a schema or ontology I provide; for now I will create a custom one, but basically I want to call constructors of different forms based on the data in the RDF.
Thinking about this, what I would really like is some mechanism to register a function that is applied to each statement as it is read, before the file is finished, like the SAX model in XML parsing. Then I could do my processing before the file is completely parsed.
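The shape being asked for is roughly this (purely hypothetical, not an existing rdf4h API; the function name is made up):

```haskell
import Data.RDF (Triple)

-- Hypothetical SAX-style entry point: the handler fires for each
-- statement as it is parsed, before the whole file has been read,
-- so no complete graph is ever held in memory.
parseFileWith :: (Triple -> IO ()) -> FilePath -> IO ()
parseFileWith handler path = error "no streaming parser exists yet"
```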
Testing normal vs attoparsec parsing on 30k lines with the latest version of rdf4h, we are still hovering around 0.5 seconds per 1k lines.
Agree that this would be a good feature, moving towards generating on-the-fly streams of RDF triples whilst parsing, rather than parsing a file/string in its entirety. For example, following the API of the io-streams library, I can imagine that to read an RDF source we'd have a new type class:
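Something along these lines, perhaps (a sketch; the class and method names are illustrative, assuming rdf4h's `Triple` type):

```haskell
import Data.RDF (Triple)
import Data.Text (Text)
import System.IO.Streams (InputStream)

-- A streaming parser consumes a stream of input text and produces a
-- stream of triples, yielding each triple as soon as it parses.
class RdfStreamParser p where
  parseStream :: p -> InputStream Text -> IO (InputStream Triple)
```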
Then these triple streams could be connected to an output stream, e.g. a file output stream, using the io-streams API:
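For example (again a sketch; the show-based serialisation is a placeholder for a proper N-Triples serialiser):

```haskell
import qualified Data.ByteString.Char8 as B
import Data.RDF (Triple)
import qualified System.IO.Streams as Streams

-- Drain a stream of triples into a file, one serialised triple per line.
sinkTriplesToFile :: Streams.InputStream Triple -> FilePath -> IO ()
sinkTriplesToFile triples path =
  Streams.withFileAsOutput path $ \out -> do
    bytes <- Streams.map (\t -> B.pack (show t ++ "\n")) triples
    Streams.connect bytes out
```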
The big question I have for RDF and Haskell is how to create instances of types from RDF data. Is there any easy way to map RDF data via some ontology into Haskell types?
@h4ck3rm1k3 sadly not, although that would be very cool. There is some work in this area for other languages, including F# and Idris, and also in Scala, where they have support for type providers from RDF data: https://github.com/travisbrown/type-provider-examples
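Until something like type providers exists for Haskell, the mapping has to be written by hand against the rdf4h query API. A minimal sketch, where the `Person` type and the foaf handling are illustrative assumptions:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.RDF
import Data.Text (Text)

-- A hand-written target type for foaf data.
data Person = Person { personName :: Text, personHomepage :: Text }
  deriving Show

-- Build a Person from a graph by querying for two foaf predicates.
personFromGraph :: Rdf a => RDF a -> Node -> Maybe Person
personFromGraph graph subj = do
  name <- objectOf <$> firstTriple "http://xmlns.com/foaf/0.1/name"
  home <- objectOf <$> firstTriple "http://xmlns.com/foaf/0.1/homepage"
  Person <$> literalText name <*> iriText home
  where
    firstTriple p = case query graph (Just subj) (Just (unode p)) Nothing of
      (t:_) -> Just t
      []    -> Nothing
    literalText (LNode (PlainL t)) = Just t
    literalText _                  = Nothing
    iriText (UNode t) = Just t
    iriText _         = Nothing
```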
Hi there,
I am working on parsing a large Turtle file; ideally I would like to turn it into an equivalent Haskell program.
I have been profiling the read function and see growth over time in memory, among other things. For 30k lines of the file I got these stats from the rdf4h-3.0.1 release from Stack. We can see that a large amount of memory and time is spent in Parsec. I am wondering the following:

1. Will attoparsec help?
2. Can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?
Examples of the files are here:
https://gist.github.com/h4ck3rm1k3/e1b4cfa58c4dcdcfc18cecab013cc6c9