
Help parsing large file #44

Open

h4ck3rm1k3 opened this issue Sep 19, 2017 · 7 comments

Comments

@h4ck3rm1k3

Hi there,
I am working on parsing a large Turtle file; ideally I would like to turn it into an equivalent Haskell program.
I have been profiling the read function, and I see memory usage and other metrics growing over time.

For 30k lines of the file, I got these stats from the rdf4h-3.0.1 release via Stack.

        total alloc = 29,235,026,136 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                        SRC                                                    %time %alloc

>>=              Text.Parsec.Prim              Text/Parsec/Prim.hs:202:5-29                            17.4    7.1
satisfy          Text.Parsec.Char              Text/Parsec/Char.hs:(140,1)-(142,71)                    16.2   32.7
noneOf.\         Text.Parsec.Char              Text/Parsec/Char.hs:40:38-52                            14.3    0.0

We can see that a large amount of memory and time is spent in Parsec. I am wondering the following:

  1. Can we parse this data incrementally? Would it make sense to read the file in line by line and feed that to the parser, or something similar?
  2. Can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?
  3. Will attoparsec help?

Examples of the files are here:
https://gist.github.com/h4ck3rm1k3/e1b4cfa58c4dcdcfc18cecab013cc6c9

@robstewart57
Owner

robstewart57 commented Sep 19, 2017

Hi @h4ck3rm1k3 ,

Thanks for the report!

> Will attoparsec help?

If you use the git repository for this library, then you can try the experimental attoparsec support contributed by @axman6 in November.

Try something like:

parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
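Spelled out as a complete program, that could look like the following. A minimal sketch, assuming the git version exports TurtleParserCustom and the Attoparsec selector from Data.RDF; the TList graph representation is just one choice to fix the return type.

module Main where

import Data.RDF

main :: IO ()
main = do
  result <- parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
              :: IO (Either ParseFailure (RDF TList))
  case result of
    Left err    -> print err
    Right graph -> print (length (triplesOf graph))  -- count the parsed triples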

Does that improve the memory performance?

> Can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?

Interesting idea. What exactly would you want to convert to Haskell types? You might mean:

  1. The schema for each ontology used in a Turtle file? E.g. if the friend-of-a-friend ontology is used, then the foaf:homepage predicate would be turned into a Haskell type? For this, have you looked at type providers? It's that sort of thing, i.e. turning a closed-world schema into types. F# has them; Haskell doesn't.

  2. Turning Turtle data into types? I'm not sure how that'd work, why turning ontological instances (data as triples) into Haskell types would be a useful thing to do, or what it'd look like.

I'm interested to know if attoparsec (above) gives you better results.

@h4ck3rm1k3
Author

h4ck3rm1k3 commented Sep 20, 2017 via email

@h4ck3rm1k3
Author

Thinking about this, what I would really like is some mechanism to supply a function that is applied to each statement as it is read, before the file is finished. Like the SAX model in XML parsing: then I could do my processing before the whole file has been parsed.
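Sketching what that driver could look like: this is purely hypothetical, since rdf4h does not currently expose a per-statement parser, so it is written as a generic loop over attoparsec's incremental interface, parameterised by whatever statement parser one had (the Parser t argument).

import           Data.Attoparsec.Text
import qualified Data.Text            as T
import qualified Data.Text.IO         as TIO
import           System.IO

-- feed a file chunk by chunk, invoking the callback per parsed statement
streamStatements :: Parser t -> (t -> IO ()) -> FilePath -> IO ()
streamStatements p callback path = withFile path ReadMode $ \h -> do
    let refill = TIO.hGetChunk h        -- returns "" at end of file
        start chunk
          | T.null chunk = pure ()      -- clean end of input
          | otherwise    = go (parse p chunk)
        go (Partial k)   = refill >>= go . k   -- parser wants more input
        go (Done rest t)
          | T.null rest  = callback t >> refill >>= start
          | otherwise    = callback t >> start rest
        go (Fail _ _ e)  = hPutStrLn stderr ("parse error: " ++ e)
    refill >>= start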

@h4ck3rm1k3
Author

h4ck3rm1k3 commented Sep 21, 2017

Testing normal vs attoparsec on 30k lines, we are still hovering around 0.5 seconds per 1k lines.
The memory usage has gone down, but that is still not very fast. I think next I want to look into some callback function.
These runs are both with NTriplesParserCustom.

    Thu Sep 21 06:51 2017 Time and Allocation Profiling Report  (Final)

           gcc-haskell-exe +RTS -N -p -h -RTS

        total time  =       14.89 secs   (14886 ticks @ 1000 us, 1 processor)
        total alloc = 28,746,934,240 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                         SRC                                                    %time %alloc

satisfy          Text.Parsec.Char               Text/Parsec/Char.hs:(140,1)-(142,71)                    13.1   21.4
>>=              Text.Parsec.Prim               Text/Parsec/Prim.hs:202:5-29                            11.4   13.5
mplus            Text.Parsec.Prim               Text/Parsec/Prim.hs:289:5-34                             6.5    9.7
parsecMap.\      Text.Parsec.Prim               Text/Parsec/Prim.hs:190:7-48                             6.5   11.4
isSubDelims      Network.URI                    Network/URI.hs:355:1-38                                  4.4    0.0
fmap.\           Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(171,7)-(172,42)       4.1    3.1
isGenDelims      Network.URI                    Network/URI.hs:352:1-34                                  3.7    0.0
>>=.\.succ'      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76              3.5    1.1
encodeChar       Codec.Binary.UTF8.String       Codec/Binary/UTF8/String.hs:(50,1)-(67,25)               3.1    4.6
encodeString     Codec.Binary.UTF8.String       Codec/Binary/UTF8/String.hs:37:1-53                      2.3    4.0
concat.ts'       Data.Text                      Data/Text.hs:902:5-34                                    2.0    2.6

Testing with the latest version of rdf4h:

        Thu Sep 21 06:34 2017 Time and Allocation Profiling Report  (Final)

           gcc-haskell-exe +RTS -N -p -h -RTS

        total time  =       15.28 secs   (15282 ticks @ 1000 us, 1 processor)
        total alloc = 33,815,423,648 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                        SRC                                                    %time %alloc

satisfy          Text.Parsec.Char              Text/Parsec/Char.hs:(140,1)-(142,71)                    17.2   27.6
>>=              Text.Parsec.Prim              Text/Parsec/Prim.hs:202:5-29                            16.5   22.8
parsecMap.\      Text.Parsec.Prim              Text/Parsec/Prim.hs:190:7-48                             9.2    8.4
mplus            Text.Parsec.Prim              Text/Parsec/Prim.hs:289:5-34                             7.7    9.5
isSubDelims      Network.URI                   Network/URI.hs:355:1-38                                  3.9    0.0
isGenDelims      Network.URI                   Network/URI.hs:352:1-34                                  3.4    0.0
encodeChar       Codec.Binary.UTF8.String      Codec/Binary/UTF8/String.hs:(50,1)-(67,25)               2.9    3.9
encodeString     Codec.Binary.UTF8.String      Codec/Binary/UTF8/String.hs:37:1-53                      2.2    3.4
parserReturn.\   Text.Parsec.Prim              Text/Parsec/Prim.hs:234:7-30                             2.0    3.1

@robstewart57
Owner

> Thinking about this, what I would really like is some mechanism to supply a function that is applied to each statement as it is read, before the file is finished.

Agreed, that would be a good feature: moving towards generating on-the-fly streams of RDF triples whilst parsing, rather than parsing a file/string in its entirety.

For example, going by the API of the io-streams library, I can imagine that to read an RDF source we'd have a new type class:

class RdfParserStream p where
  parseStringStream
      :: (Rdf a)
      => p
      -> Text
      -> Either ParseFailure (InputStream (RDF a))
  parseFileStream
      :: (Rdf a)
      => p
      -> String
      -> IO (Either ParseFailure (InputStream (RDF a)))
  parseURLStream
      :: (Rdf a)
      => p
      -> String
      -> IO (Either ParseFailure (InputStream (RDF a)))

Then these triple streams could be connected to an output stream, e.g. a file output stream, using the io-streams API:

connect :: InputStream a -> OutputStream a -> IO () 
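Hypothetical usage of the proposed API (parseFileStream does not exist yet, and this assumes the stream elements have a Show instance):

import qualified System.IO.Streams as Streams

main :: IO ()
main = do
  r <- parseFileStream (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
  case r of
    Left err  -> print err
    Right ins -> do
      -- print each element as it arrives, rather than
      -- accumulating the whole graph in memory
      outs <- Streams.makeOutputStream (mapM_ print)
      Streams.connect ins outs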

@h4ck3rm1k3
Author

The big question I have for RDF and Haskell is how to create instances of types from RDF data. Is there an easy way to map RDF data, via some ontology, into Haskell types?

@robstewart57
Owner

@h4ck3rm1k3 sadly not, although that would be very cool.

There is some work in this area for other languages, including F# and Idris.

And also in Scala, where they have support for type providers from RDF data: https://github.com/travisbrown/type-provider-examples
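In the meantime, a mapping can be hand-rolled with the query function from Data.RDF, projecting known predicates into a record. A rough sketch, assuming the graph stores foaf:name objects as plain literals:

{-# LANGUAGE OverloadedStrings #-}

import Data.RDF
import qualified Data.Text as T

data Person = Person { personUri :: T.Text, personName :: T.Text }
  deriving Show

foafName :: Node
foafName = unode "http://xmlns.com/foaf/0.1/name"

-- look up one foaf:name triple for the subject and build a record
personFrom :: Rdf a => RDF a -> T.Text -> Maybe Person
personFrom g uri =
  case query g (Just (unode uri)) (Just foafName) Nothing of
    (Triple _ _ (LNode (PlainL name)) : _) -> Just (Person uri name)
    _                                      -> Nothing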
