Commit

Merge pull request #21 from IBM/fdedup-docs
put readme cross-reference
daw3rd authored Apr 29, 2024
2 parents 917b184 + 1b7ccc6 commit 3943ce2
Showing 2 changed files with 7 additions and 0 deletions.
4 changes: 4 additions & 0 deletions transforms/universal/doc_id/Readme.md
@@ -18,6 +18,10 @@ is unique across all rows in all tables provided to the `transform()` method.
To enable this annotation, set `int_id_column` to the name of the column where you want
to store it.

Document IDs are generally useful for tracking annotations to specific documents. Additionally,
[fuzzy deduping](../fdedup) relies on integer IDs being present. If your dataset does not have
document ID column(s), you can use this transform to create them.
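
As a rough illustration of what "unique across all rows in all tables" means, here is a minimal plain-Python sketch. It is not the transform's actual implementation: the function `add_int_ids`, the column name `int_doc_id`, and the list-of-dicts table representation are all assumptions made for this example.

```python
from itertools import count


def add_int_ids(tables, int_id_column="int_doc_id", start=0):
    """Return copies of `tables` with a unique integer ID added to every row.

    Hypothetical helper: `tables` is a list of tables, each a list of row dicts.
    A single counter is shared by all tables, so no ID is ever reused.
    """
    next_id = count(start)
    return [
        [{**row, int_id_column: next(next_id)} for row in table]
        for table in tables
    ]


tables = [
    [{"contents": "first doc"}, {"contents": "second doc"}],
    [{"contents": "third doc"}],
]
with_ids = add_int_ids(tables)
ids = [row["int_doc_id"] for table in with_ids for row in table]
```

The key design point is that the counter lives outside any single table, so IDs keep increasing from one table to the next instead of restarting at zero.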

## Building

A [docker file](Dockerfile) that can be used for building docker image. You can use
3 changes: 3 additions & 0 deletions transforms/universal/fdedup/Readme.md
@@ -114,6 +114,9 @@ Above you see both parameters and their values for small runs (tens of files). W

## Running

To run this transform, every row of data must have an integer document ID. The [test data](test-data/input)
already has this column, but if you run on your own dataset, please use the [doc_id transform](../doc_id) to create one.
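
Before running on your own data, it may help to confirm that the requirement above holds. The following is a minimal sketch of such a check, not part of the transform itself: the function `rows_missing_int_id`, the column name `int_doc_id`, and the row-dict representation are assumptions made for illustration.

```python
def rows_missing_int_id(rows, id_column="int_doc_id"):
    """Return the indexes of rows whose ID is absent or not an integer.

    Hypothetical helper: `rows` is a list of row dicts.
    """
    missing = []
    for index, row in enumerate(rows):
        value = row.get(id_column)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            missing.append(index)
    return missing


rows = [
    {"contents": "a", "int_doc_id": 0},
    {"contents": "b"},                      # no ID column at all
    {"contents": "c", "int_doc_id": "2"},   # ID stored as a string
    {"contents": "d", "int_doc_id": 3},
]
bad_rows = rows_missing_int_id(rows)
```

If the returned list is not empty, the dataset likely needs to go through the doc_id transform first.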

We also provide several demos of the transform usage for different data storage options, including
[local file system](src/fdedup_local_ray.py) and [S3](src/fdedup_s3_ray.py).
