Do not duplicate the repository or store across branches in the metadata #1390

Closed
wlandau opened this issue Dec 3, 2024 · 5 comments

Member

wlandau commented Dec 3, 2024

The current metadata duplicates long strings, which takes up a lot of space. We could add special rows for repositories and formats and reference them by hash throughout the metadata file. Alternatively, the formats and stores could live in separate metadata files that _targets/meta/meta refers to.
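
A minimal sketch of the lookup idea, under stated assumptions: the column names, the `|`-free toy strings, and the integer keys (standing in for hashes) are all hypothetical, and this is not how targets lays out its metadata. It only shows that storing each distinct string once and referencing it by a short key can be reversed without loss.

```r
# Hypothetical metadata: every branch row repeats the same long repository string.
long_repo <- paste(rep("repository settings", 20), collapse = " ")
meta <- data.frame(
  name = paste0("y_", seq_len(1000)),
  repository = long_repo,
  stringsAsFactors = FALSE
)

# Store each distinct string once in a lookup table under a short key.
# Integer keys stand in for hashes here (an assumption for illustration).
values <- unique(meta$repository)
lookup <- data.frame(
  key = seq_along(values),
  repository = values,
  stringsAsFactors = FALSE
)

# Replace the long strings in the main table with the short keys.
compact <- data.frame(
  name = meta$name,
  repository_key = match(meta$repository, lookup$repository),
  stringsAsFactors = FALSE
)

# Recover the original metadata by looking each key up again.
restored <- data.frame(
  name = compact$name,
  repository = lookup$repository[compact$repository_key],
  stringsAsFactors = FALSE
)

# Compare the number of characters spent on repository information.
size_before <- sum(nchar(meta$repository))
size_after <- sum(nchar(lookup$repository)) + sum(nchar(compact$repository_key))
print(c(before = size_before, after = size_after))
```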

wlandau self-assigned this Dec 3, 2024

wlandau commented Jan 10, 2025

A specialized solution for just repositories and formats would not be ideal: it would be complicated to integrate with the current design of targets, and it would not solve the general problem of file size. For example, dynamic branch names are repeated throughout the metadata and contribute to large text files.

I wonder if we can leverage qs2 to read and write compressed text. This route could be fast, it could reduce storage consumption, and it would be neatly encapsulated in the existing "database" class in targets. I will investigate.
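
A rough sketch of what encapsulated reading and writing of compressed text could look like. This uses base R `gzfile()` connections as a stand-in for qs2, and the helper names `write_compressed_lines()` and `read_compressed_lines()` are hypothetical; nothing here reflects the actual "database" class in targets.

```r
# Hypothetical helper: write a character vector to a gzip-compressed text file.
write_compressed_lines <- function(lines, path) {
  con <- gzfile(path, open = "wt")
  on.exit(close(con))
  writeLines(lines, con)
  invisible(path)
}

# Hypothetical helper: read the lines back from the compressed file.
read_compressed_lines <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  readLines(con)
}

# Toy metadata lines with a header row and many similar branch rows.
lines <- c("name|type|repository", paste0("y_", seq_len(1000), "|branch|local"))

path <- tempfile(fileext = ".gz")
write_compressed_lines(lines, path)
round_trip <- read_compressed_lines(path)

# Write the same lines uncompressed to compare file sizes.
plain_path <- tempfile()
writeLines(lines, plain_path)
print(c(plain = file.size(plain_path), compressed = file.size(path)))
```

Callers only see character vectors going in and coming out, so the compression stays hidden behind the two helpers.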

wlandau commented Jan 10, 2025

I experimented with a simple pipeline with vs without local CAS:

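library(targets)
# Store every target in a local content-addressable storage (CAS) repository.
# Omit this line for the comparison without CAS.
tar_option_set(repository = tar_repository_cas_local())
list(
  # One upstream target with 10,000 elements.
  tar_target(x, seq_len(1e4)),
  # One dynamic branch per element of x.
  tar_target(y, x, pattern = map(x))
)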

With tar_repository_cas_local():

  • First tar_make(): 69.1 seconds
  • Second tar_make(): 4.063 seconds
  • _targets/meta/meta uncompressed file size: 5.13 MB
  • _targets/meta/meta with individual lines independently compressed with qs2::encode_source(): 4.94 MB
  • _targets/meta/meta file size after compression with qs2::qs_save(): 541.69 kB

Without tar_repository_cas_local():

  • First tar_make(): 63.960 seconds
  • Second tar_make(): 4.582 seconds
  • _targets/meta/meta uncompressed file size: 1.78 MB
  • _targets/meta/meta with individual lines independently compressed with qs2::encode_source(): 2.53 MB
  • _targets/meta/meta file size after compression with qs2::qs_save(): 520.79 kB

Takeaways:

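  • Independently compressing lines saves little space: with CAS the file only shrinks from 5.13 MB to 4.94 MB, and without CAS it grows from 1.78 MB to 2.53 MB.
  • If we compress the whole file, qs2 detects repeated strings, and the result is nearly the same size with vs without CAS (541.69 kB vs 520.79 kB).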
  • _targets/meta/meta does not get prohibitively large in either case. This might not actually be worth optimizing too aggressively, especially since users can gain a lot by compressing _targets/meta/meta if needed.
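
The first two takeaways can be illustrated with a small sketch. It uses base R `memCompress(type = "gzip")` as a stand-in for qs2 and made-up metadata lines, so the sizes it prints are not comparable to the benchmark above. The point is only that a compressor that sees one line at a time cannot exploit strings repeated across lines, while a compressor that sees the whole text can.

```r
# Hypothetical metadata lines: a unique name plus a long string shared by all rows.
set.seed(1)
shared <- paste(sample(c(letters, LETTERS, 0:9), 200, replace = TRUE), collapse = "")
lines <- paste0("y_", seq_len(1000), "|", shared)

# Size of the uncompressed text in bytes (one newline per line).
size_plain <- sum(nchar(lines, type = "bytes")) + length(lines)

# Compress each line on its own: repeats across lines are invisible.
size_per_line <- sum(vapply(
  lines,
  function(line) length(memCompress(charToRaw(line), type = "gzip")),
  integer(1)
))

# Compress the whole text at once: repeats across lines are shared.
whole <- charToRaw(paste(lines, collapse = "\n"))
size_whole <- length(memCompress(whole, type = "gzip"))

print(c(plain = size_plain, per_line = size_per_line, whole = size_whole))
```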

Other ideas for optimization:

  • Remove the function signatures from fields in tar_repository_cas() and tar_format() because we already know what they should be.
  • Potentially allow an upstream target to store the repository or format.


wlandau commented Jan 10, 2025

"Potentially allow an upstream target to store the repository or format."

On second thought, this would be an unclean mixture of data and metadata.

wlandau commented Jan 10, 2025

Removing known signatures from the tar_repository_cas() string seems to shave a whole megabyte off the stored size of the file (4.11 MB now vs 5.13 MB before vs 2.53 MB without CAS).
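
A sketch of the general idea, not the actual targets implementation: if the argument list of a user-supplied method is known in advance, only the body needs to be stored as text, and the function can be rebuilt by prepending the known signature. The `upload` function, its `function(key, path)` signature, and the name `known_signature` are assumptions for illustration.

```r
# Hypothetical user-supplied method whose signature is known in advance.
upload <- function(key, path) {
  paste("copy", path, "to", key)
}

# Text of the whole function versus text of the body alone.
full_text <- paste(deparse(upload), collapse = "\n")
body_text <- paste(deparse(body(upload)), collapse = "\n")

# Rebuild the function later by prepending the known signature to the stored body.
known_signature <- "function(key, path)"
rebuilt <- eval(parse(text = paste(known_signature, body_text)))

# The stored text is shorter, and the rebuilt function behaves the same.
print(c(full = nchar(full_text), body_only = nchar(body_text)))
print(rebuilt("abc123", "file.rds"))
```

The saving per string is small, but it applies to every row of metadata that carries the string.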

wlandau-lilly pushed a commit that referenced this issue Jan 10, 2025
wlandau-lilly pushed a commit that referenced this issue Jan 10, 2025

wlandau commented Jan 10, 2025

Just finished condensing strings from both tar_repository_cas() and tar_format() this way. I think the best way to optimize further is to convert the metadata to DuckDB.
