Fuzzy dedup #699

Open
wants to merge 86 commits into
base: dev

Commits on Oct 10, 2024

  1. added folder_transform

    blublinsky committed Oct 10, 2024
    47f4526
  2. added folder_transform

    blublinsky committed Oct 10, 2024
    5fd20a1
  3. added folder_transform

    blublinsky committed Oct 10, 2024
    38b4725

Commits on Oct 11, 2024

  1. added folder_transform

    blublinsky committed Oct 11, 2024
    a3abf21
  2. d93a06c
  3. Fuzzy dedup pure python implementation

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 11, 2024
    af8475d
  4. Fuzzy dedup spark implementation

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 11, 2024
    7f9b503
  5. added folder_transform

    blublinsky committed Oct 11, 2024
    3349521
  6. added folder_transform

    blublinsky committed Oct 11, 2024
    0553edf
  7. added folder_transform

    blublinsky committed Oct 11, 2024
    a53412e
  8. added folder_transform

    blublinsky committed Oct 11, 2024
    9c3ace7
  9. added noop testing

    blublinsky committed Oct 11, 2024
    7091a2e
  10. Fuzzy dedup ray implementation

    Signed-off-by: nelson <[email protected]>
    Kibnelson committed Oct 11, 2024
    680c78a
  11. 0c31dc0
  12. Merge with updated folder_transform branch

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 11, 2024
    47d8fdf

Commits on Oct 13, 2024

  1. added folder_transform

    blublinsky committed Oct 13, 2024
    6ee6695
  2. added folder_transform

    blublinsky committed Oct 13, 2024
    e7260ba
  3. added folder_transform

    blublinsky committed Oct 13, 2024
    5856f3f
  4. added folder_transform

    blublinsky committed Oct 13, 2024
    6519686
  5. added noop testing

    blublinsky committed Oct 13, 2024
    c728224
  6. added noop Ray testing

    blublinsky committed Oct 13, 2024
    6e2863a
  7. added noop Spark testing

    blublinsky committed Oct 13, 2024
    3c9be57
  8. 371a712

Commits on Oct 14, 2024

  1. Renamed/refactored fuzzy dedup python orchestrator

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    680f313
  2. Rewrote cluster_analysis_transform as a folder_transform

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    c29d3bf
  3. Wrote get_duplicate_list_transform as a folder_transform

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    aada59e
  4. Added text preprocessing

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    2019d56
  5. Added python test data

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    9362803
  6. Added project admin tools

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    ddbd602
  7. Bug fix

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    4dac838
  8. Add op modes for data cleaning: filter (non)dupl and annotate

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    fbc2b58
  9. Python and spark transforms for cluster analysis

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    828ec41
  10. Merge folder_transform

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    a20fe76
  11. Sync spark Makefile with dpk

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    bc6b81c
  12. Spark orchestration for fuzzy dedup

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    4d486d3
  13. Bug fix

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    19e0844
  14. Added spark test data

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    2ce3d8c
  15. Setting input test data for ray

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    5e4022c
  16. Bug fix

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    c14bdaa
  17. Ray orchestration for fuzzy dedup

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 14, 2024
    1215ac5

Commits on Oct 17, 2024

  1. Merge with the latest dev branch

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 17, 2024
    5966972

Commits on Oct 18, 2024

  1. Added python test with expected data files

    Signed-off-by: nelson <[email protected]>
    Kibnelson committed Oct 18, 2024
    caf79a3
  2. Added python tests and expected outputs for the tests

    Signed-off-by: nelson <[email protected]>
    Kibnelson committed Oct 18, 2024
    8fd9676
  3. Update versions in pyproject.toml

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 18, 2024
    d07a23a
  4. Updated ray test data

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 18, 2024
    ec2168c
  5. Updated ray tests

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 18, 2024
    fd0f52c
  6. Spark test data and tests

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 18, 2024
    954dffd
  7. Adjust to file naming changes

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 18, 2024
    77d85fd
  8. Create python Dockerfile

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 18, 2024
    310d813

Commits on Oct 19, 2024

  1. Ray bug fixes

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 19, 2024
    7d97cef
  2. Fix spark image to support testing

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 19, 2024
    87902ac

Commits on Oct 25, 2024

  1. Removed file copy utils

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    c847924
  2. Add fdedup to kfp black list until we get kfp integration

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    ba9b07c
  3. Freeze polars version to 1.9.0 for now

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    f187948
  4. Fixed duplicate_list_location bug

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    84b9104
  5. Allow input of s3 credentials on command line

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    08ff006
  6. Added license

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    d0c6f8a
  7. Use str2bool for use_s3 argument

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 25, 2024
    63e11eb

Commits on Oct 29, 2024

  1. Add overwrite output path argument

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 29, 2024
    bf550fd

Commits on Oct 30, 2024

  1. Add separate data access objects for reading and writing files

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 30, 2024
    272be36

Commits on Oct 31, 2024

  1. Define 2 data access objects for data and duplicate list

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Oct 31, 2024
    ee411e1

Commits on Nov 1, 2024

  1. 3a30501

Commits on Nov 8, 2024

  1. Added an option to run either word or char shingle

    Signed-off-by: nelson <[email protected]>
    Kibnelson committed Nov 8, 2024
    80ae8df
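
The commit above adds a choice between word and character shingles. As a rough illustration only (this is not code from the PR, and the function name `shingles` and its parameters are hypothetical), the two modes differ only in the unit that is windowed:

```python
def shingles(text: str, size: int, kind: str = "word") -> set[str]:
    """Return the set of contiguous shingles of `size` units.

    Hypothetical helper for illustration; `kind` is "word" or "char".
    """
    if size < 1:
        raise ValueError("size must be >= 1")
    if kind == "word":
        units, joiner = text.split(), " "   # window over whitespace-separated words
    elif kind == "char":
        units, joiner = list(text), ""      # window over individual characters
    else:
        raise ValueError(f"unknown shingle kind: {kind!r}")
    if not units:
        return set()
    if len(units) < size:
        # Text shorter than one window: keep it as a single shingle.
        return {joiner.join(units)}
    return {joiner.join(units[i:i + size]) for i in range(len(units) - size + 1)}
```

Character shingles tend to be more tolerant of small spelling differences, while word shingles produce fewer, more semantically meaningful units; which one suits a dataset is an empirical question.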

Commits on Nov 10, 2024

  1. Use captured_arg_keys to list the arguments of each transform

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 10, 2024
    c531809
  2. Ray implementation for get_duplicate_list_transform

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 10, 2024
    fe43110
  3. Bug fix: jaccard threshold type must be float

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 10, 2024
    82a1860
  4. Get fuzzy dedup ray image ready for kfp

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 10, 2024
    61ed40f
  5. kfp implementation for fuzzy dedup

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 10, 2024
    a8ede00
  6. Merge word/char shingles

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 10, 2024
    524236d

Commits on Nov 11, 2024

  1. Added params to captured_arg_keys

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 11, 2024
    96edea4
  2. Add shingle type option (word or char) to kfp

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 11, 2024
    24163af

Commits on Nov 13, 2024

  1. Utility to calculate number of bands and length of a band

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 13, 2024
    3a43c3d
  2. Merge branch 'dev' into fuzzy-dedup

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 13, 2024
    83c05f9
  3. Set correct version for pyproject

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 13, 2024
    2f61be7
  4. Change the name of the utils Makefile

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 13, 2024
    cd5eb05
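
The first commit of Nov 13 mentions a utility that calculates the number of bands and the length of a band. The PR's actual method is not shown on this page; the sketch below uses one common heuristic for MinHash LSH, assuming a fixed number of permutations and a target Jaccard threshold. The names `candidate_probability` and `choose_bands` are hypothetical.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    agree on every row of at least one band (standard LSH S-curve)."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def choose_bands(num_permutations: int, threshold: float) -> tuple[int, int]:
    """Pick (bands, rows) with bands * rows <= num_permutations so that the
    S-curve's steepest point, approximately (1/bands) ** (1/rows), lies as
    close as possible to the threshold. Illustrative heuristic only."""
    if num_permutations < 1 or not 0.0 < threshold < 1.0:
        raise ValueError("need num_permutations >= 1 and 0 < threshold < 1")
    best = None
    for rows in range(1, num_permutations + 1):
        bands = num_permutations // rows          # use as many full bands as fit
        approx = (1.0 / bands) ** (1.0 / rows)    # approximate S-curve midpoint
        error = abs(approx - threshold)
        if best is None or error < best[0]:
            best = (error, bands, rows)
    return best[1], best[2]
```

More bands with fewer rows each lowers the effective threshold (more candidate pairs, more false positives); fewer bands with more rows raises it.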

Commits on Nov 14, 2024

  1. Copy whl file to the context folder

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 14, 2024
    6cc18cd
  2. Use keyword args in compute_common_params

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 14, 2024
    9f33620
  3. Use dynamic dependencies

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 14, 2024
    528457c
  4. Add FIXME for kubeflow/pipelines#10914

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 14, 2024
    fffb630
  5. Add FIXME for kubeflow/pipelines#10914

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 14, 2024
    5547d7f
  6. Remove pyproject.toml dependencies

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 14, 2024
    09e56e0

Commits on Nov 15, 2024

  1. Fix bug in number of actors calculation

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 15, 2024
    d3eac50
  2. Cleanup main entry point and local implementation of python transforms

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 15, 2024
    fa5959b
  3. Cleanup main entry point and local implementation of ray transforms

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 15, 2024
    c4f889b
  4. Cleanup main entry point and local implementation of spark transforms

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 15, 2024
    f3c5be0
  5. Cleanup main entry point and local implementation of spark transforms

    Signed-off-by: Constantin M Adam <[email protected]>
    cmadam committed Nov 15, 2024
    4941d5b