Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support cloud storage in scan_csv #16674

Merged
merged 20 commits into from
Jun 6, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Jun 3, 2024

This adds support for cloud storage in scan_csv, using a local cache to avoid duplicates downloads. Cached files are removed if they are not accessed within an hour (configurable with POLARS_FILE_CACHE_TTL).

@github-actions github-actions bot added the internal An internal refactor or improvement label Jun 3, 2024
Copy link

codspeed-hq bot commented Jun 3, 2024

CodSpeed Performance Report

Merging #16674 will not alter performance

Comparing nameexhaustion:file-cache (b131a2b) with main (616b2e0)

Summary

✅ 37 untouched benchmarks

@nameexhaustion nameexhaustion force-pushed the file-cache branch 2 times, most recently from 7491915 to aa6f347 Compare June 4, 2024 06:27
Copy link

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 68.42601% with 335 lines in your changes missing coverage. Please review.

Project coverage is 81.36%. Comparing base (ddf8126) to head (b131a2b).
Report is 6 commits behind head on main.

Files Patch % Lines
crates/polars-io/src/file_cache/eviction.rs 40.00% 87 Missing ⚠️
crates/polars-io/src/file_cache/entry.rs 74.61% 66 Missing ⚠️
crates/polars-io/src/file_cache/cache_lock.rs 72.13% 34 Missing ⚠️
crates/polars-io/src/file_cache/utils.rs 64.70% 30 Missing ⚠️
crates/polars-io/src/file_cache/file_fetcher.rs 56.92% 28 Missing ⚠️
crates/polars-io/src/cloud/polars_object_store.rs 0.00% 25 Missing ⚠️
crates/polars-io/src/file_cache/metadata.rs 54.34% 21 Missing ⚠️
crates/polars-io/src/file_cache/cache.rs 81.48% 20 Missing ⚠️
py-polars/src/lazyframe/mod.rs 70.83% 7 Missing ⚠️
crates/polars-io/src/pl_async.rs 0.00% 6 Missing ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16674      +/-   ##
==========================================
- Coverage   81.45%   81.36%   -0.10%     
==========================================
  Files        1413     1421       +8     
  Lines      186096   187205    +1109     
  Branches     2776     2777       +1     
==========================================
+ Hits       151588   152321     +733     
- Misses      33988    34363     +375     
- Partials      520      521       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion nameexhaustion changed the title test file cache feat: Support cloud storage in scan_csv Jun 5, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jun 5, 2024
@nameexhaustion nameexhaustion marked this pull request as ready for review June 5, 2024 05:35
@nameexhaustion nameexhaustion removed the internal An internal refactor or improvement label Jun 5, 2024
@@ -20,6 +20,7 @@ ahash = { workspace = true }
arrow = { workspace = true }
async-trait = { version = "0.1.59", optional = true }
atoi_simd = { workspace = true, optional = true }
blake3 = "1.5.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this an optional dependency? Only activated when we activate the caching?

Copy link
Collaborator Author

@nameexhaustion nameexhaustion Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put them behind a new file_cache feature flag (although maybe we could also just use the cloud feature flag if you want, let me know)
*edit: nvm, I think I prefer the new feature flag because it also specifies "dep:blake3", "dep:fs4"

crates/polars-io/Cargo.toml Outdated Show resolved Hide resolved
crates/polars-io/src/file_cache/file_fetcher.rs Outdated Show resolved Hide resolved
crates/polars-io/src/file_cache/cache_lock.rs Outdated Show resolved Hide resolved
pub(super) type GlobalFileCacheGuardExclusive<'a> = RwLockWriteGuard<'a, GlobalLockData>;

#[derive(Clone)]
struct AccessTracker(Arc<AtomicBool>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment explaining what this boolean means?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a very good point - I've left a comment summarizing its purpose, but how it achieves its purpose would probably take a few paragraphs 😂

&& verbose
{
eprintln!("unlocked global file cache lockfile");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before sleeping, I think we can also set a notify: https://docs.rs/tokio/latest/tokio/sync/struct.Notify.htm;

As I understand it this will not schedule this task until we notify that we have set the locked file first. If that works I think we can set the sleep shorter to 200 ms or something like that.

Copy link
Collaborator Author

@nameexhaustion nameexhaustion Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip - I've added the notify layer for the background task to await on, which should make it more efficient.

For the delay I'd prefer to avoid making it too little - the background task gets notified the moment we acquire a lock, so it runs during the entire duration for which we hold the lock (i.e. while we are downloading files).

self.uri_hash.as_bytes(),
remote_metadata.last_modified,
);
let _ = std::fs::remove_file(data_file_path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will block until a new file is downloaded right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not exactly..- this was for in case there was a partial download from an aborted process. I've also left a comment above it.

impl EvictionManager {
pub(super) fn run_in_background(mut self) {
let verbose = config::verbose();
let sleep_interval = std::cmp::max(self.limit_since_last_access.as_secs() / 4, 7);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set this higher to 60 seconds or so? Doesn't have to run often I think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to 60 secs

@ritchie46
Copy link
Member

Cool stuff @nameexhaustion. I have left some comments/ questions. But really fun to review this.

@nameexhaustion nameexhaustion marked this pull request as draft June 5, 2024 09:12
@nameexhaustion nameexhaustion marked this pull request as ready for review June 5, 2024 11:47
@s-banach
Copy link
Contributor

s-banach commented Jun 5, 2024

Closes #13115.

@s-banach
Copy link
Contributor

s-banach commented Jun 5, 2024

Can you explain POLARS_FILE_CACHE_TTL?
I'm assuming that files are only cached within the context of a single query.
The name "TTL" makes it sound like, if I rerun scan_csv, I'll keep getting an hour-old version of the file I'm trying to scan.

@nameexhaustion
Copy link
Collaborator Author

nameexhaustion commented Jun 6, 2024

Closes #13115.

Thanks for the triage!

Can you explain POLARS_FILE_CACHE_TTL? I'm assuming that files are only cached within the context of a single query. The name "TTL" makes it sound like, if I rerun scan_csv, I'll keep getting an hour-old version of the file I'm trying to scan.

A query will always check to ensure it has the latest file at the start of execution (i.e. collect()). POLARS_FILE_CACHE_TTL is for configuring how long a cached file is kept on disk based on the time it was last accessed (in seconds).

@ritchie46 ritchie46 merged commit a7f9c8d into pola-rs:main Jun 6, 2024
27 checks passed
@nameexhaustion nameexhaustion deleted the file-cache branch June 6, 2024 08:35
@c-peters c-peters added the accepted Ready for implementation label Jun 9, 2024
Wouittone pushed a commit to Wouittone/polars that referenced this pull request Jun 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants