current use of datamon within Argo workflows is via `bundle upload`, so all the bytes in a bundle have to be shuttled into datamon at once. this issue is about decreasing the time between an Argo DAG node deciding to commit a bundle and getting a response indicating that its data has been safely stored. if the Argo workflows use `bundle mount new` to create new bundles, datamon can take a more eager approach to storing data.
an implementation detail of note here is the use of a secondary, blob-store-like "staging area": files are hashed and stored to GCS as soon as their handles are released by the OS. because files may be appended to or otherwise modified before the final commit, and because the blob store is treated as immutable, data cannot be uploaded directly to the blob store on file descriptor close without winding up with some hashes (and the corresponding storage costs) that aren't reachable via any metadata: i.e. we'd wind up paying to store data that the application has no access to.
so while closing a file at the application level always results in a new entry in the staging area, only the most recent entry for a given path is recorded in the bundle metadata.
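a rough sketch of that write path in Go, under the assumptions above: a content-addressed staging store fed on file close, plus a per-path index that keeps only the latest hash for commit. the `StagingStore` / `StagedIndex` names and the use of sha256 are illustrative stand-ins, not datamon's actual internals.

```go
package staging

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
	"sync"
)

// StagingStore abstracts the GCS-backed staging bucket: content-addressed,
// write-once blobs keyed by hash.
type StagingStore interface {
	Put(hash string, r io.Reader) error
}

// StagedIndex maps a logical file path to the hash of its most recently
// closed contents. earlier entries for the same path stay in the staging
// area but are never referenced by bundle metadata.
type StagedIndex struct {
	mu     sync.Mutex
	latest map[string]string // path -> content hash
}

func NewStagedIndex() *StagedIndex {
	return &StagedIndex{latest: make(map[string]string)}
}

// OnFileClose is invoked when the OS releases a file handle: hash the file,
// push its bytes to the staging store, and record the hash as the path's
// latest entry.
func (ix *StagedIndex) OnFileClose(store StagingStore, path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	hash := hex.EncodeToString(h.Sum(nil))

	// rewind and copy the same bytes into the staging store
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return "", err
	}
	if err := store.Put(hash, f); err != nil {
		return "", err
	}

	ix.mu.Lock()
	ix.latest[path] = hash // a later close of the same path overwrites this
	ix.mu.Unlock()
	return hash, nil
}

// Commit snapshots the latest entry per path; this is what would go into the
// bundle's file-list metadata.
func (ix *StagedIndex) Commit() map[string]string {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	out := make(map[string]string, len(ix.latest))
	for p, h := range ix.latest {
		out[p] = h
	}
	return out
}
```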
of course, the final destination for all file contents is the blob store. yet to make data available between Argo DAG nodes as quickly as possible, datamon may read bytes directly from the staging area when accessing a bundle whose blobs haven't all been transferred out of it: it always looks for a blob in the blob store first, then falls back to the staging area (if a staging area is specified).
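the read-side fallback could look something like this hedged sketch, assuming both stores expose a content-addressed `Get`; `BlobGetter`, `LayeredStore`, and `ErrBlobNotFound` are made-up names for illustration, not datamon types.

```go
package staging

import (
	"errors"
	"io"
)

// ErrBlobNotFound is returned by a store that does not hold the given hash.
var ErrBlobNotFound = errors.New("blob not found")

// BlobGetter is a content-addressed, read-only view of a store.
type BlobGetter interface {
	Get(hash string) (io.ReadCloser, error)
}

// LayeredStore reads from the immutable blob store first and, if the blob
// hasn't yet been transferred out of the staging area, falls back to staging.
// Staging may be nil when no staging area is configured.
type LayeredStore struct {
	Blob    BlobGetter
	Staging BlobGetter
}

func (s LayeredStore) Get(hash string) (io.ReadCloser, error) {
	rc, err := s.Blob.Get(hash)
	if err == nil {
		return rc, nil
	}
	if !errors.Is(err, ErrBlobNotFound) || s.Staging == nil {
		return nil, err
	}
	return s.Staging.Get(hash)
}
```

preferring the blob-store copy whenever it exists keeps the fallback cheap to retire: once every blob has been promoted, the staging layer can simply be dropped.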
given this behavior in the end-user binary (i.e. the command-line interface), the only remaining component of the overall incremental upload design is some way to garbage collect the staging area: transfer the blobs referenced by committed bundles into the blob store and discard the unreferenced ones.
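a hedged sketch of what that garbage collection pass might look like, assuming the collector is handed the set of hashes referenced by committed bundle metadata; `PromotableStore`, `BlobSink`, and `CollectStaging` are hypothetical names, not an existing datamon API.

```go
package staging

import "io"

// PromotableStore is the staging area as seen by the collector.
type PromotableStore interface {
	List() ([]string, error) // all blob hashes currently staged
	Get(hash string) (io.ReadCloser, error)
	Delete(hash string) error
}

// BlobSink is the write side of the immutable blob store.
type BlobSink interface {
	Has(hash string) (bool, error)
	Put(hash string, r io.Reader) error
}

// CollectStaging copies every staged blob referenced by committed bundles
// into the blob store, then discards whatever remains in staging (entries
// superseded by later closes of the same file, or never committed).
func CollectStaging(staging PromotableStore, blob BlobSink, referenced map[string]bool) error {
	hashes, err := staging.List()
	if err != nil {
		return err
	}
	for _, h := range hashes {
		if referenced[h] {
			ok, err := blob.Has(h)
			if err != nil {
				return err
			}
			if !ok {
				rc, err := staging.Get(h)
				if err != nil {
					return err
				}
				if err := blob.Put(h, rc); err != nil {
					rc.Close()
					return err
				}
				rc.Close()
			}
		}
		// either promoted or unreferenced: safe to drop from staging
		if err := staging.Delete(h); err != nil {
			return err
		}
	}
	return nil
}
```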