drop cloud versioned remotes (#5190)
dberenbaum authored Mar 20, 2024
1 parent e0ced1e commit 7273a83
Showing 8 changed files with 6 additions and 233 deletions.
7 changes: 0 additions & 7 deletions content/docs/command-reference/exp/push.md
@@ -5,13 +5,6 @@ to [remote storage].

[remote storage]: /doc/user-guide/data-management/remote-storage

<admon type="warn">

`dvc exp push` is not supported with
[`version_aware` DVC remotes](/doc/user-guide/data-management/cloud-versioning).

</admon>

## Synopsis

```usage
9 changes: 0 additions & 9 deletions content/docs/command-reference/install.md
@@ -3,15 +3,6 @@
Install Git hooks into the <abbr>DVC repository</abbr> to automate certain
common actions.

<admon type="warn">

Do not use these Git hooks if you are using a
[version-aware remote](/doc/user-guide/data-management/cloud-versioning#version-aware-remotes).
Version-aware remotes require running `dvc push` before `git commit`, which is
not supported by the included hooks.

</admon>

## Synopsis

```usage
89 changes: 1 addition & 88 deletions content/docs/command-reference/push.md
@@ -1,8 +1,7 @@
# push

Upload tracked files or directories to [remote storage] based on the current
<abbr>dvc files</abbr> files (and update the cloud info in those files if
pushing to a [version-aware] remote).
<abbr>dvc files</abbr> files.

[remote storage]: /doc/user-guide/data-management/remote-storage

@@ -276,89 +275,3 @@ Cache and remote 'r1' are in sync.

And running `dvc status --cloud`, DVC verifies that indeed there are no more
files to push to remote storage.

## Example: Version-aware remote for readable storage

Let's set up a [version-aware] remote, which uses cloud versioning to organize
the remote storage.

[version-aware]:
/doc/user-guide/data-management/cloud-versioning#version-aware-remotes

```cli
$ dvc remote add -d versioned_store s3://mybucket
$ dvc remote modify versioned_store version_aware true
$ dvc push
```

> See also `dvc remote add` and `dvc remote modify`.

Now let's look at what was pushed to the remote. Unlike the [example above], the
version-aware remote looks similar to the data in your workspace and is easy to
read.

[example above]: #example-what-happens-in-the-cache

```cli
# Show the current versions.
$ aws s3 ls --recursive s3://mybucket/
2023-02-01 15:24:09 1708591 data/prepared/test.tsv
2023-02-01 15:24:10 6728772 data/prepared/train.tsv
# Show all object versions.
$ aws s3api list-object-versions --bucket mybucket
{
    "Versions": [
        {
            "ETag": "\"b656f1a8273d0c541340cb129fd5d5a9\"",
            "Size": 1708591,
            "StorageClass": "STANDARD",
            "Key": "data/prepared/test.tsv",
            "VersionId": "T6rFr7NSHkL3v9tGStO7GTwsVaIFl42T",
            "IsLatest": true,
            "LastModified": "2023-02-01T20:24:09.000Z",
            ...
        },
        {
            "ETag": "\"9ca281786366acca17632c27c5c5cc75\"",
            "Size": 6728772,
            "StorageClass": "STANDARD",
            "Key": "data/prepared/train.tsv",
            "VersionId": "XaYsHQHWK219n5MoCRe.Rr7LeNbbder_",
            "IsLatest": true,
            "LastModified": "2023-02-01T20:24:10.000Z",
            ...
        }
    ]
}
```

With `version_aware` enabled, `dvc push` will also modify <abbr>dvc files</abbr>
to capture the version information:

```cli
...
outs:
  - path: data/prepared
    hash: md5
    files:
      - relpath: test.tsv
        md5: b656f1a8273d0c541340cb129fd5d5a9
        size: 1708591
        cloud:
          versioned_store:
            etag: b656f1a8273d0c541340cb129fd5d5a9
            version_id: T6rFr7NSHkL3v9tGStO7GTwsVaIFl42T
      - relpath: train.tsv
        md5: 9ca281786366acca17632c27c5c5cc75
        size: 6728772
        cloud:
          versioned_store:
            etag: 9ca281786366acca17632c27c5c5cc75
            version_id: XaYsHQHWK219n5MoCRe.Rr7LeNbbder_
...
```

Always `dvc push` before `git commit` so that the updated cloud version info is
available in Git.
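
For readers of this diff, here is a minimal Python sketch of how the version info recorded in the removed example could be read back. It assumes the `outs` entry has already been parsed into a plain dict (any YAML loader would do); the function name `recorded_versions` is hypothetical and is not part of DVC.

```python
# Dict mirroring the `outs` entry from the removed example above.
# Assumption: a YAML loader produced this structure; no DVC API is used.
dvc_file = {
    "outs": [
        {
            "path": "data/prepared",
            "hash": "md5",
            "files": [
                {
                    "relpath": "test.tsv",
                    "md5": "b656f1a8273d0c541340cb129fd5d5a9",
                    "size": 1708591,
                    "cloud": {
                        "versioned_store": {
                            "etag": "b656f1a8273d0c541340cb129fd5d5a9",
                            "version_id": "T6rFr7NSHkL3v9tGStO7GTwsVaIFl42T",
                        }
                    },
                },
                {
                    "relpath": "train.tsv",
                    "md5": "9ca281786366acca17632c27c5c5cc75",
                    "size": 6728772,
                    "cloud": {
                        "versioned_store": {
                            "etag": "9ca281786366acca17632c27c5c5cc75",
                            "version_id": "XaYsHQHWK219n5MoCRe.Rr7LeNbbder_",
                        }
                    },
                },
            ],
        }
    ]
}


def recorded_versions(dvc_file: dict, remote: str) -> dict[str, str]:
    """Map each tracked file path to the version ID recorded for `remote`."""
    result = {}
    for out in dvc_file.get("outs", []):
        for entry in out.get("files", []):
            info = entry.get("cloud", {}).get(remote)
            if info is not None:  # skip files with no info for this remote
                path = f"{out['path']}/{entry['relpath']}"
                result[path] = info["version_id"]
    return result


versions = recorded_versions(dvc_file, "versioned_store")
for path, version_id in versions.items():
    print(path, version_id)
```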
70 changes: 5 additions & 65 deletions content/docs/user-guide/data-management/cloud-versioning.md
@@ -1,45 +1,15 @@
# Cloud Versioning

When cloud versioning is enabled, DVC will store files in the remote according
to their original directory location and filenames. Different versions of a file
will then be stored as separate versions of the corresponding object in cloud
storage. This is useful for cases where users prefer to retain their original
filenames and directory hierarchy in remote storage (instead of using DVC's
usual
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
format).

<admon type="warn">

Note that not all DVC functionality is supported when using cloud versioned
remotes, and using cloud versioning comes with the tradeoff of losing certain
benefits of content-addressable storage.

</admon>

<details>

### Expand for more details on the differences between cloud versioned and content-addressable storage

`dvc remote` storage normally uses
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
to organize versioned data. Different versions of files are stored in the remote
according to hash of their data content instead of according to their original
filenames and directory location. This allows DVC to optimize certain remote
storage lookup and data sync operations, and provides data de-duplication at the
file level. However, this comes with the drawback of losing human-readable
filenames without the use of the DVC CLI (`dvc get --show-url`) or API
(`dvc.api.get_url()`).

When using cloud versioning, DVC does not provide de-duplication, and certain
remote storage performance optimizations will be unavailable.
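
The de-duplication tradeoff described above can be illustrated with a small Python sketch. It is not DVC's implementation: the `files/md5/<2 chars>/<rest>` key layout and the sample file contents are assumptions made for illustration only.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hash-derived storage key for the given file content.

    The "files/md5/<2 chars>/<rest>" layout is an assumption made for
    illustration; it is not taken from the text above.
    """
    digest = hashlib.md5(data).hexdigest()
    return f"files/md5/{digest[:2]}/{digest[2:]}"


# Hypothetical workspace: two paths share identical content.
workspace = {
    "data/train.tsv": b"a\tb\n1\t2\n",
    "backup/train_copy.tsv": b"a\tb\n1\t2\n",
    "data/test.tsv": b"a\tb\n3\t4\n",
}

# Content-addressable storage: keyed by content hash, so duplicates collapse.
cas_keys = {content_address(data) for data in workspace.values()}

# Cloud-versioned storage: keyed by original path, so every path is stored.
versioned_keys = set(workspace)

print(len(cas_keys))        # 2 objects stored
print(len(versioned_keys))  # 3 objects stored
```

The hash-derived keys are what make the stored names unreadable without a lookup, while the path-derived keys stay readable but store the duplicate twice.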
## Importing versioned data

</details>
DVC supports importing cloud-versioned data from supported storage providers.
Refer to `dvc import-url` (`--version-aware`) and `dvc update --rev` for more
information.

## Supported storage providers

Cloud versioning features are only available for certain storage providers.
Currently, it is supported on the following `dvc remote` types:
Currently, it is supported on the following storage types:

- [Amazon S3] (requires [S3 Versioning] enabled buckets)
- Microsoft [Azure Blob Storage] (requires [Blob versioning] enabled storage
@@ -70,33 +40,3 @@ management, see:
[azure blob storage]:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-policy-configure
[google cloud storage]: https://cloud.google.com/storage/docs/lifecycle

## Version-aware remotes

When the `version_aware` option is enabled on a `dvc remote`:

- `dvc push` will utilize cloud versioning when storing data in the remote. Data
will retain its original directory structure and filenames, and each version
of a file tracked by DVC will be stored as a new version of the corresponding
object in cloud storage.
- `dvc fetch` and `dvc pull` will download the corresponding version of an
object from cloud storage.

With `version_aware` enabled, `dvc push` will modify <abbr>dvc files</abbr>.
Always `dvc push` before `git commit` so that the updated cloud version info is
available in Git.

<admon type="warn">

Note that when `version_aware` is in use, DVC does not delete current versions
or restore noncurrent versions of objects in cloud storage. So the current
version of an object in cloud storage may not match the version of a file in
your DVC repository.

</admon>

## Importing versioned data

DVC supports importing cloud-versioned data from supported storage providers.
Refer to `dvc import-url` (`--version-aware`) and `dvc update --rev` for more
information.
@@ -36,28 +36,6 @@ The AWS user needs the following permissions: `s3:ListBucket`, `s3:GetObject`,
To use [custom auth](#custom-authentication) or further configure your DVC
remote, set any supported config param with `dvc remote modify`.

## Cloud versioning

<admon type="info">

Requires [S3 Versioning] enabled on the bucket and the following AWS user
permissions: `s3:ListBucketVersions`, `s3:GetObjectVersion`,
`s3:DeleteObjectVersion`.

</admon>

```cli
$ dvc remote modify myremote version_aware true
```

`version_aware` (`true` or `false`) enables [cloud versioning] features for this
remote. This lets you explore the bucket files under the same structure you see
in your project directory locally.

[s3 versioning]:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html
[cloud versioning]: /docs/user-guide/data-management/cloud-versioning

## Custom authentication

Use these configuration options if you don't have the AWS CLI setup in your
@@ -24,26 +24,6 @@ $ dvc remote add -d myremote azure://<mycontainer>/<path>
To set up authentication or other configuration, set any supported config param
with `dvc remote modify`.

## Cloud versioning

<admon type="info">

Requires [Blob versioning] enabled on the storage account and container.

</admon>

```cli
$ dvc remote modify myremote version_aware true
```

`version_aware` (`true` or `false`) enables [cloud versioning] features for this
remote. This lets you explore the bucket files under the same structure you see
in your project directory locally.

[blob versioning]:
https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview
[cloud versioning]: /docs/user-guide/data-management/cloud-versioning

## Authentication

<admon type="info">
@@ -36,25 +36,6 @@ service account or other ways to authenticate ([more info]).
To use [custom auth](#custom-authentication) or further configure your DVC
remote, set any supported config param with `dvc remote modify`.

## Cloud versioning

<admon type="info">

Requires [Object versioning] enabled on the bucket.

</admon>

```cli
$ dvc remote modify myremote version_aware true
```

`version_aware` (`true` or `false`) enables [cloud versioning] features for this
remote. This lets you explore the bucket files under the same structure you see
in your project directory locally.

[object versioning]: https://cloud.google.com/storage/docs/object-versioning
[cloud versioning]: /docs/user-guide/data-management/cloud-versioning

## Custom authentication

For [service accounts] (a Google account associated to your GCP project instead
@@ -188,9 +188,6 @@ change, but not saved in the <abbr>cache</abbr> for

Saving external outputs to an external cache has been deprecated in DVC 3.0.

Stay tuned as we work on versioning external outputs using
[cloud versioning](/doc/user-guide/data-management/cloud-versioning).

</admon>

To define files or directories in an external location as <stage> outputs, give
