Handle multiple input files with rdf2hdt #233

Open
donpellegrino opened this issue Feb 26, 2022 · 7 comments

@donpellegrino
Contributor

The current rdf2hdt command-line interface only accepts one RDF input file and overwrites the output file. Therefore, it is not possible to use the rdf2hdt CLI to build up a single HDT file from many RDF input files. A workaround is to concatenate all the RDF files into a single input and then pass that to rdf2hdt. However, that can be inefficient when working with many files and only works for RDF formats that can be concatenated; RDF/XML, for example, would require another step to convert to N-Triples.
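
For example, a minimal sketch of that workaround for gzipped N-Triples inputs (the directory and file names here are only illustrative):

# Hypothetical example: decompress and merge gzipped N-Triples into one file,
# then compress that single file. This materializes a large intermediate all.nt on disk.
gunzip --to-stdout ~/data/rdf/*.nt.gz > all.nt
rdf2hdt all.nt all.hdt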

It would be useful for rdf2hdt to accept multiple input files in one run.

It would also be useful if rdf2hdt could optionally append to an existing HDT file instead of replacing it.

If there is already another usage pattern for bulk loading multiple RDF inputs into one HDT file, please let me know.

@D063520
Contributor

D063520 commented Feb 26, 2022

Hi,

the input of rdf2hdt is a single RDF file, not multiple files.

As you say, depending on the serialization, different things might be possible. I don't think it is the job of this HDT library to design the best way this can be done; this is not an issue of HDT but an issue of the respective RDF serialization.

What can help you, though, is the fact that you can concatenate two HDTs. This is the job of hdtCat. So you can compress two RDF files into HDT and then cat them together (this functionality is only available in the hdt-java repo, though).
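
For example (a rough sketch; it assumes hdt-java's hdtCat tool is invoked with the two input HDT files followed by the output path, and the file names are hypothetical):

# compress each RDF file to its own HDT with hdt-cpp's rdf2hdt
rdf2hdt part1.nt part1.hdt
rdf2hdt part2.nt part2.hdt
# concatenate the two HDTs with hdtCat from hdt-java
hdtCat part1.hdt part2.hdt merged.hdt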

WARNING: it is more efficient in time to

  • first join the RDF files and then compress to HDT
  • than to compress to two HDT files and then cat them together.

On the other hand, it is more efficient in memory to do it the other way around.

Hope it helps
D063520

@donpellegrino
Contributor Author

Thanks for the clarification. For first joining the RDF files and then compressing to HDT, I am trying to avoid creating the very large joined RDF file on disk. I tried piping the output of cat over the files, but it seems rdf2hdt won't accept stdin as an input, and passing "/dev/stdin" gives a resulting HDT file with 0 triples. If there were some way to get HDT to take a pipe as input, that would enable pipelines that join multiple files into one input without also filling the disk with a full duplicate of all those triples in a single file.

@D063520
Contributor

D063520 commented Feb 26, 2022

How big is your file and how big are your resources? You need more or less a 1:1 ratio between the size of the RDF file and the amount of RAM to do the compression. HDT is very resource efficient except at indexing time, where it is very memory hungry ...

@donpellegrino
Contributor Author

"How big is your file" - A use case I am exploring is putting all of PubChemRDF (https://pubchemdocs.ncbi.nlm.nih.gov/rdf) into an HDT file. The full collection is 14,378,495,402 triples as per https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/void.ttl. I am exploring a subset that is ~ 40 GB across ~ 700 N-Triples gzipped files. I don't know how big that would be as a single N-Triples uncompressed text file. The creation of that intermediate file on disk is what I am trying to avoid.

"How big are your resources" - I am exploring the techniques on Intel DevCloud (https://devcloud.intel.com/oneapi/home/). It is a heterogeneous cluster, but the 12 nodes having the most RAM have 384 GB each. I also have a Cray EX series with some nodes having 1 TB of RAM, but I would prefer to identify a technique that was not limited by the RAM on a single node.

Distributed Memory Candidate Approach

Perhaps one approach would be switching to a distributed memory model for scaling beyond the memory limitations of a single node. I see that Celerity (https://celerity.github.io/) could be a pathway to building in distributed memory utilization and that it is advertised as compatible with the Intel Data Parallel C++ (DPC++) Compiler on the Intel DevCloud. If someone could point me to the right bit of code, I could explore the feasibility of that approach.

A distributed memory approach may address memory utilization limitations, but it would still leave open the problems of slow I/O and the storage space needed to write out a single massive concatenated RDF file as an intermediate bulk-load artifact.

Streaming Candidate Approach

A bit of Bash scripting would seem to allow streaming triples from multiple compressed files in multiple formats, letting rdf2hdt simply consume N-Triples. I tried the following, but have not figured out why it doesn't work:

time find ~/data/pubchemrdf -name "*.ttl.gz" -exec gunzip --to-stdout --keep {} \; | rdf2hdt /dev/stdin pubchemrdf.hdt

My current suspicion is that rdf2hdt reads all of /dev/stdin, closes it, and then attempts to reopen it and finds it empty. But that is just a guess. I would love any suggestion on where to look to investigate what is happening there and see if a streaming approach could be made to work.

Laboratory

I am working in a sandbox with the following:

  • Up-to-date Git clone of hdt-cpp "develop" branch.
  • Intel DevCloud compute node (192 GB RAM, "Intel(R) Xeon(R) Gold 6128 CPU")
  • hdt-cpp compiled with "Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)"
  • base OS distribution: Ubuntu 20.04.4 LTS

Please let me know if that is the right branch and codebase to work from.

@D063520
Contributor

D063520 commented Feb 28, 2022

Hi,

nice task : )

We have compressed Wikidata (16 billion triples) on a single node (120 GB of RAM) with the following approach. We converted Wikidata to N-Triples and chunked it into pieces (of more or less 100 GB) so that we could compress them to HDT. Then we used hdtCat to cat them together.

You say that you do not want to uncompress everything into one file. What you can do is uncompress some chunks, convert them to HDT, and then cat them together.

A bzipped or gzipped chunk and the corresponding HDT file occupy more or less the same amount of space. So if you have 380 GB of RAM, I would advise uncompressing more or less 300 GB of N-Triples and compressing these to HDT1. Continue with the next 300 GB chunk and compress these to HDT2. Then you cat them together, and continue like this .... With your description this seems feasible ....
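
A rough Bash sketch of that chunk-and-cat loop (the chunk file names and the hdtCat invocation are assumptions, not tested commands):

# Hypothetical sketch: compress each pre-split N-Triples chunk to HDT,
# then fold it into a running result with hdtCat (provided by hdt-java).
rdf2hdt chunk-000.nt current.hdt
for chunk in chunk-001.nt chunk-002.nt chunk-003.nt; do
    rdf2hdt "$chunk" next.hdt
    hdtCat current.hdt next.hdt merged.hdt
    mv merged.hdt current.hdt
    rm next.hdt
done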

Does this make sense to you?

Salut
D063520

@mielvds
Member

mielvds commented Feb 28, 2022

My current suspicion is that rdf2hdt reads all of /dev/stdin, closes it, and then attempts to reopen it and finds it empty. But that is just a guess. I would love any suggestion on where to look to investigate what is happening there and see if a streaming approach could be made to work.

Note that this requires creating an HDT in one pass, which is a feature hdt-java has but hdt-cpp doesn't (check out #47). Short summary: it needs a dictionary first in order to assign IDs to values when compressing the triples, and therefore hdt-cpp reads the input file twice. A while ago, I made this PR to include one-pass ingestion in hdt-cpp, but since I don't know C++, I was hoping somebody would step up ;)
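
This also explains the 0-triple result with /dev/stdin above: the first pass drains the stream, so the second pass sees nothing. A quick way to observe the effect on Linux (illustrative only):

# the first read consumes the piped data; the second read finds an empty stream
printf 'a\nb\nc\n' | sh -c 'wc -l /dev/stdin; wc -l /dev/stdin'
# reports 3 lines on the first read and 0 on the second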

@ate47

ate47 commented May 11, 2022

If there were some way to get HDT to take a pipe as input, that would enable pipelines that join multiple files into one input without also filling the disk with a full duplicate of all those triples in a single file.

On a POSIX machine, you can actually create a named pipe and send the cat result into it twice for the two-pass parsing; it won't cost the price of a new file:

mkfifo mypipe.nt
# send the cat result twice to the pipe, once for each rdf2hdt pass
(cat myfile1.nt myfile2.nt > mypipe.nt ; cat myfile1.nt myfile2.nt > mypipe.nt) &
rdf2hdt mypipe.nt myhdt.hdt
# don't forget to remove it ;)
rm mypipe.nt
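
For the gzipped PubChemRDF files mentioned above, the same named-pipe idea could look roughly like this (untested sketch; it assumes the decompressed content is N-Triples and that find enumerates the files in the same order on both passes):

mkfifo pubchem.nt
# feed the decompressed stream into the pipe twice, once per rdf2hdt pass
(for pass in 1 2; do
    find ~/data/pubchemrdf -name "*.ttl.gz" -exec gunzip --to-stdout --keep {} \; > pubchem.nt
done) &
rdf2hdt pubchem.nt pubchemrdf.hdt
rm pubchem.nt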

Otherwise, the Java version comes with a one-pass parser and a directory parser, which can be better for parsing multiple RDF files.
