Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write vocabulary files to separate directory #1237

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

hannahbast
Copy link
Member

The vocabulary files can now be written to (and read from) a separate directory than the other index files. This directory can be specified in both IndexBuilderMain and ServerMain via the new command-line option --vocabulary-basename or -v. The directory for the index files is specified like before via --index-basename or -i. If no directory for the vocabulary files is specified, the same directory as for the index files is taken (that was the status quo before this PR).

This is useful for datasets with a huge vocabulary. For example, the vocabulary files for UniProt have a total size of over 3 TB (mostly due to .vocabulary.external and .vocabulary.external.idsAndOffsets).

Copy link

codecov bot commented Jan 21, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (ea905cc) 87.46% compared to head (81f2e53) 87.48%.

Files Patch % Lines
src/index/IndexImpl.Text.cpp 91.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1237      +/-   ##
==========================================
+ Coverage   87.46%   87.48%   +0.02%     
==========================================
  Files         307      307              
  Lines       27771    27785      +14     
  Branches     3116     3116              
==========================================
+ Hits        24291    24309      +18     
+ Misses       2346     2340       -6     
- Partials     1134     1136       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hannahbast hannahbast requested a review from joka921 January 22, 2024 10:52
@hannahbast
Copy link
Member Author

@joka921 I have tried this with olympics and dblp and it worked fine. I am currently building an index with uniprot, using SSD for the index files and HDD for the vocabulary files. That works fine so far as well.

I think this is rather straightforward, in particular, since the default behavior (when the new --vocabulary-basename option is not used) is exactly like before.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions to this very useful feature.

Comment on lines 139 to +142
Server server(port, numSimultaneousQueries, memoryMaxSize,
std::move(accessToken), !noPatternTrick);
server.run(indexBasename, text, !noPatterns, !onlyPsoAndPosPermutations);
server.run(baseNameIndex, baseNameVocabulary, text, !noPatterns,
!onlyPsoAndPosPermutations);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one thing you could do (the outer interface that interacts with the control script is mostly you area)
is to set up a simple struct ServerConfig that is then passed to the constructor as well as the run() function where they grap their respectively needed arguments. That makes it much easier to add additional arguments.
(Probably we need a similar struct IndexConfig that then becomes part of the server config for exactly the same reason).
Then it will become much easier to add additional arguments.
Are you interested in setting this up as a separate PR, or should I do this?

"The basename of the index files (required).");
add("vocabulary-basename,v", po::value(&baseNameVocabulary),
"The basename of the vocabulary files"
"(default: same as basename of the index fles).");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"(default: same as basename of the index fles).");
"(default: same as basename of the index files).");

Comment on lines +161 to +163
if (baseNameVocabulary.empty()) {
baseNameVocabulary = baseNameIndex;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the baseNameVocabulary is an std::optional<string> then you can pass it as is all the way down, and only in a single place in the index (in setOnDiskBase) you have to resolve the default (or not even there but only in the functions that interact with the vocabulary).

@@ -107,14 +107,14 @@ void IndexImpl::addTextFromContextFile(const string& contextFile,
LOG(DEBUG) << "Reloading the RDF vocabulary ..." << std::endl;
vocab_ = RdfsVocabulary{};
readConfiguration();
vocab_.readFromFile(onDiskBase_ + INTERNAL_VOCAB_SUFFIX,
onDiskBase_ + EXTERNAL_VOCAB_SUFFIX);
vocab_.readFromFile(onDiskBaseVocabulary_ + INTERNAL_VOCAB_SUFFIX,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make the onDiskBaseVocabulary_ an optional (see my suggestion above), and then instead introduce the function onDiskBaseVocabulary() that returns something like onDiskBaseVocabulary_.value_or(onDiskBaseIndex_), then everything is clean nice and typesafe:)

@@ -124,7 +123,7 @@ Index makeTestIndex(const std::string& indexBasename,
// these tests.
static std::ostringstream ignoreLogStream;
ad_utility::setGlobalLoggingStream(&ignoreLogStream);
std::string inputFilename = indexBasename + ".ttl";
std::string inputFilename = baseName + ".ttl";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you here change the baseName vocabulary to something that is different but derived from the baseName, then you immediately have a test for your new feature.
(make sure to also pass then then vocabularyBasename to the getAllIndexFilenames function above).

Hannah Bast added 3 commits February 2, 2024 05:23
The vocabulary files can now be written to (and read from) a separate
directory than the other index files. This directory can be specified in
both `IndexBuilderMain` and `ServerMain` via the new command-line option
`--vocabulary-basename` or `-v`. The directory for the index files is
specified like before via `--index-basename` or `-i`. If no directory
for the vocabulary files is specified, the same directory as for the
index files is taken (that was the status quo before this PR).

This is useful for datasets with a huge vocabulary. For example, the
vocabulary files for UniProt have a total size of over 3 TB (mostly due
to `.vocabulary.external` and `.vocabulary.external.idsAndOffsets`).
Reason: The merge was very SLOW when these were in the vocabulary
directory, which for our UniProt index builds is on HDD (because the
external vocabulary is so larger). I first tried to only have the
`.tmp.partial-vocabulary.words` files in the index directory, but that
was still slow. Now also the `.tmp.partial-vocabulary.ids` files are in
the index directory.

Explanations concerning SLOW: The merging of the first few 100M triples
is fast (30 seconds per 100M triples). Then it becomes slow and then
very slow (half an hour from 700M triples to 800M triples). Not only is
it slow, but doing other stuff on the machine (like wrting something in
an editor with autosave on) becomes very slow to respond to, which is a
clear sign that the random accesses to HDD are the problem.

NOTE: With the partial solution, where `.tmp.partial-vocabulary.words`
are on SSD and `.tmp.partial-vocabulary.ids` are on HDD, it is not as
bad. There was a very significant slow-down from 700M to 1100M triples,
but after that merging was as fast again (though not as fast as in the
beginning). At the time of this writing, I only observed until 1700M,
stay tuned for more information.
@hannahbast hannahbast force-pushed the write-vocabulary-files-to-separate-directory branch from 6799a37 to 81f2e53 Compare February 2, 2024 04:23
Copy link

sonarqubecloud bot commented Feb 2, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

3 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants