-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write vocabulary files to separate directory #1237
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1237 +/- ##
==========================================
+ Coverage 87.46% 87.48% +0.02%
==========================================
Files 307 307
Lines 27771 27785 +14
Branches 3116 3116
==========================================
+ Hits 24291 24309 +18
+ Misses 2346 2340 -6
- Partials 1134 1136 +2 ☔ View full report in Codecov by Sentry. |
@joka921 I have tried this with olympics and dblp and it worked fine. I am currently building an index with uniprot, using SSD for the index files and HDD for the vocabulary files. That works fine so far as well. I think this is rather straightforward, in particular, since the default behavior (when the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions to this very useful feature.
Server server(port, numSimultaneousQueries, memoryMaxSize, | ||
std::move(accessToken), !noPatternTrick); | ||
server.run(indexBasename, text, !noPatterns, !onlyPsoAndPosPermutations); | ||
server.run(baseNameIndex, baseNameVocabulary, text, !noPatterns, | ||
!onlyPsoAndPosPermutations); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one thing you could do (the outer interface that interacts with the control script is mostly you area)
is to set up a simple struct ServerConfig
that is then passed to the constructor as well as the run()
function where they grap their respectively needed arguments. That makes it much easier to add additional arguments.
(Probably we need a similar struct IndexConfig
that then becomes part of the server config for exactly the same reason).
Then it will become much easier to add additional arguments.
Are you interested in setting this up as a separate PR, or should I do this?
"The basename of the index files (required)."); | ||
add("vocabulary-basename,v", po::value(&baseNameVocabulary), | ||
"The basename of the vocabulary files" | ||
"(default: same as basename of the index fles)."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"(default: same as basename of the index fles)."); | |
"(default: same as basename of the index files)."); |
if (baseNameVocabulary.empty()) { | ||
baseNameVocabulary = baseNameIndex; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the baseNameVocabulary
is an std::optional<string>
then you can pass it as is all the way down, and only in a single place in the index (in setOnDiskBase
) you have to resolve the default (or not even there but only in the functions that interact with the vocabulary).
@@ -107,14 +107,14 @@ void IndexImpl::addTextFromContextFile(const string& contextFile, | |||
LOG(DEBUG) << "Reloading the RDF vocabulary ..." << std::endl; | |||
vocab_ = RdfsVocabulary{}; | |||
readConfiguration(); | |||
vocab_.readFromFile(onDiskBase_ + INTERNAL_VOCAB_SUFFIX, | |||
onDiskBase_ + EXTERNAL_VOCAB_SUFFIX); | |||
vocab_.readFromFile(onDiskBaseVocabulary_ + INTERNAL_VOCAB_SUFFIX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make the onDiskBaseVocabulary_
an optional
(see my suggestion above), and then instead introduce the function onDiskBaseVocabulary()
that returns something like onDiskBaseVocabulary_.value_or(onDiskBaseIndex_)
, then everything is clean nice and typesafe:)
@@ -124,7 +123,7 @@ Index makeTestIndex(const std::string& indexBasename, | |||
// these tests. | |||
static std::ostringstream ignoreLogStream; | |||
ad_utility::setGlobalLoggingStream(&ignoreLogStream); | |||
std::string inputFilename = indexBasename + ".ttl"; | |||
std::string inputFilename = baseName + ".ttl"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you here change the baseName vocabulary to something that is different but derived from the baseName
, then you immediately have a test for your new feature.
(make sure to also pass then then vocabularyBasename to the getAllIndexFilenames
function above).
The vocabulary files can now be written to (and read from) a separate directory than the other index files. This directory can be specified in both `IndexBuilderMain` and `ServerMain` via the new command-line option `--vocabulary-basename` or `-v`. The directory for the index files is specified like before via `--index-basename` or `-i`. If no directory for the vocabulary files is specified, the same directory as for the index files is taken (that was the status quo before this PR). This is useful for datasets with a huge vocabulary. For example, the vocabulary files for UniProt have a total size of over 3 TB (mostly due to `.vocabulary.external` and `.vocabulary.external.idsAndOffsets`).
Reason: The merge was very SLOW when these were in the vocabulary directory, which for our UniProt index builds is on HDD (because the external vocabulary is so larger). I first tried to only have the `.tmp.partial-vocabulary.words` files in the index directory, but that was still slow. Now also the `.tmp.partial-vocabulary.ids` files are in the index directory. Explanations concerning SLOW: The merging of the first few 100M triples is fast (30 seconds per 100M triples). Then it becomes slow and then very slow (half an hour from 700M triples to 800M triples). Not only is it slow, but doing other stuff on the machine (like wrting something in an editor with autosave on) becomes very slow to respond to, which is a clear sign that the random accesses to HDD are the problem. NOTE: With the partial solution, where `.tmp.partial-vocabulary.words` are on SSD and `.tmp.partial-vocabulary.ids` are on HDD, it is not as bad. There was a very significant slow-down from 700M to 1100M triples, but after that merging was as fast again (though not as fast as in the beginning). At the time of this writing, I only observed until 1700M, stay tuned for more information.
6799a37
to
81f2e53
Compare
Quality Gate passedThe SonarCloud Quality Gate passed, but some issues were introduced. 3 New issues |
The vocabulary files can now be written to (and read from) a separate directory than the other index files. This directory can be specified in both
IndexBuilderMain
andServerMain
via the new command-line option--vocabulary-basename
or-v
. The directory for the index files is specified like before via--index-basename
or-i
. If no directory for the vocabulary files is specified, the same directory as for the index files is taken (that was the status quo before this PR).This is useful for datasets with a huge vocabulary. For example, the vocabulary files for UniProt have a total size of over 3 TB (mostly due to
.vocabulary.external
and.vocabulary.external.idsAndOffsets
).