Try to import all of GeoNames #17
Comments
What is the bottleneck? If it is memory consumption by Fuseki, maybe tdb2.tdbloader should be included after mapping. |
The first bottleneck is an OOM when running
So why not try CARML, which is supposed to be streaming: 😄
So unfortunately it isn’t completely compatible with https://github.com/RMLio/rmlmapper-java for our mapping file. To be honest, I don’t know much about RML. @pmaria Any ideas? |
Yeah, CARML does not yet support CSVW, because it was never officially part of the RML spec. In the new specs it is incorporated, but I'm still in the process of implementing those. You could try the mapping without using CSVW, so something like:
:GeonamesSource
  a rml:LogicalSource ;
  rml:source "geonamesplus.txt" ;
  rml:referenceFormulation ql:CSV .
for all the logical sources should work. Let me know if I can help. |
Thanks @pmaria. That throws:
Even with an absolute file path as the |
@pmaria Using https://github.com/carml/carml-jar/tree/nde throws:
That’s probably due to the data containing null values. Do you prefer to handle that in your implementation or should I filter out (how?) null values in the config? |
@ddeboer I pushed a fix |
Thanks, that helps! Got 12 GB of .nt output without any OOMs. |
@pmaria However, the output for the predicate in question (
@pmaria Any ideas? |
Yes, this is another problem of non-specification, which will be fixed with the new spec. CARML defaults to generating IRIs for functions, while RML Mapper defaults to literals. You can add rr:termType rr:Literal, so in this case:
:AlternateNamesSplit
  rr:termType rr:Literal ;
  fnml:functionValue [
    rml:logicalSource :LogicalSource ;
    rr:predicateObjectMap [
      rr:predicate fno:executes ;
      rr:objectMap [ rr:constant grel:string_split ] ;
    ] ;
    rr:predicateObjectMap [
      rr:predicate grel:valueParameter ;
      rr:objectMap [ rml:reference "alternatenames" ] ;
    ] ;
    rr:predicateObjectMap [
      rr:predicate grel:p_string_sep ;
      rr:objectMap [ rr:constant "," ] ;
    ] ;
  ] . |
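For readers less familiar with FnO functions, the effect of the mapping above can be pictured with a small Python sketch. This is not CARML or RML Mapper code: `split_alternate_names` and `to_literal` are hypothetical helpers, the sample row is made up, and the handling of an empty cell (no names at all) is an assumption rather than documented behaviour of `grel:string_split`.

```python
def split_alternate_names(value, sep=","):
    """Split one alternatenames cell into a list of names.

    Assumption: an empty cell yields no names at all.
    """
    if not value:
        return []
    return value.split(sep)


def to_literal(name):
    """Render a name as a plain N-Triples literal (basic escaping only)."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Made-up sample row; only the column name comes from the mapping above.
row = {"geonameid": "0", "alternatenames": "Name One,Name Two,Name Three"}
literals = [to_literal(n) for n in split_alternate_names(row["alternatenames"])]
```

With the term type set to a literal, each split value becomes one object of this kind instead of an IRI.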
Thanks, I get literals now. As I said before, I was able to generate 12 GB of N-Triples. However, when I run the mapping now, it starts slowing down at ~4.1 GB (at 800% CPU, which is perhaps to be expected). |
Interesting, I will see if I can reproduce locally. |
I am able to reproduce this. The problem of high CPU is a side-effect of the heap space being used up, and that is caused by the joins in the mapping. Intermediary results for joins with conditions are still stored in-memory in CARML. In theory this could be handled using some intermediate persistence, but currently this is not implemented. It does surprise me however that it costs as much memory as it does. I will see if I can investigate that further. It is interesting that you were able to run it before though. Did anything change in the mapping? |
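The idea of intermediate persistence mentioned above can be sketched in Python with the standard library `sqlite3` module. This is only an illustration of the general technique, not how CARML works or plans to work; the table name, the function names and the sample values are all hypothetical. The parent side of the join is written to an indexed table, and the child side is streamed against it, so the join results do not have to be held on the heap.

```python
import sqlite3


def build_parent_index(conn, parent_rows):
    """Store (join_key, value) pairs of the parent source in an indexed table."""
    conn.execute("CREATE TABLE IF NOT EXISTS parent (join_key TEXT, value TEXT)")
    conn.executemany("INSERT INTO parent VALUES (?, ?)", parent_rows)
    conn.execute("CREATE INDEX IF NOT EXISTS parent_key ON parent (join_key)")
    conn.commit()


def join_children(conn, child_rows):
    """Stream child rows and look up matching parents one key at a time."""
    for child_id, join_key in child_rows:
        cursor = conn.execute(
            "SELECT value FROM parent WHERE join_key = ?", (join_key,)
        )
        for (value,) in cursor:
            yield child_id, value


# ":memory:" keeps the example self-contained; a file path would keep the index on disk.
conn = sqlite3.connect(":memory:")
build_parent_index(conn, [("K1", "Parent One"), ("K2", "Parent Two")])
pairs = list(join_children(conn, [("a", "K1"), ("b", "K9"), ("c", "K2")]))
```

Child rows without a matching parent (here `"b"`) simply produce no output, which mirrors an inner join.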
Perhaps when I had disabled the joins. I can confirm that doing so makes the process run with ~100% CPU and more reasonable memory consumption, outputting ~11 GB of data. Do you see any possible solutions to the join problem? I could of course offload the joins to some (shell) script, but that rather defeats the purpose of using RML. |
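For comparison, offloading the join to a script could look roughly like the Python sketch below. It assumes plain tab-separated input without quoting, a lookup file small enough to hold in memory, and a known column position for the join key; `denormalize`, the column layout and the sample data are hypothetical. The larger file is streamed line by line and gets the looked-up name appended, so the mapping afterwards would not need a join.

```python
import io


def denormalize(places, admin_units, out, key_index):
    """Append the matching admin unit name to every place line."""
    # The smaller file is held in memory as a code -> name lookup table.
    lookup = {}
    for line in admin_units:
        code, name = line.rstrip("\n").split("\t")[:2]
        lookup[code] = name
    # The larger file is streamed; unmatched keys get an empty name.
    for line in places:
        fields = line.rstrip("\n").split("\t")
        fields.append(lookup.get(fields[key_index], ""))
        out.write("\t".join(fields) + "\n")


# Made-up sample data standing in for the real files.
places = io.StringIO("1\tPlace A\tX1\n2\tPlace B\tX9\n")
admin_units = io.StringIO("X1\tUnit One\nX2\tUnit Two\n")
out = io.StringIO()
denormalize(places, admin_units, out, key_index=2)
result = out.getvalue()
```

The trade-off is the one raised above: the join logic then lives outside the RML mapping.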
Keeping the current selection of administrative units. Is this feasible?