Sparkdeduplication #429

Open
wants to merge 21 commits into base: master


Commits on Mar 17, 2017

  1. Initial version of spark-based document deduplication. It contains a new version of the clustering mechanism with auto-sizing clusters.

    axnow committed Mar 17, 2017
    eafc85d

Commits on Apr 4, 2017

  1. Work on the algorithm which splits large clusters among the cluster to obtain better scalability.

    axnow committed Apr 4, 2017
    e95a8cc

Commits on Apr 7, 2017

  1. Complete version with tiled comparison task.

    Added programmatic logging to stdout for easier log reading.
    axnow committed Apr 7, 2017
    efb3fc2

Commits on Apr 14, 2017

  1. Work on tiled optimization.

    Added reshuffle for better work balance.
    axnow committed Apr 14, 2017
    fab0915

Commits on Jun 23, 2017

  1. Stable version; completes the job within 2.5h on the full data set.

    Needs code cleanup and quality assurance.
    axnow committed Jun 23, 2017
    6180bba

Commits on Jun 26, 2017

  1. Added options parsing from command line to control app behaviour.

    Version used for performance testing.
    axnow committed Jun 26, 2017
    28834d2

Commits on Jul 11, 2017

  1. Added dependency on scopt.

    axnow committed Jul 11, 2017
    3a749d4

Commits on Jul 14, 2017

  1. 0cfdcbb

Commits on Jul 23, 2017

  1. Scala version.

    axnow committed Jul 23, 2017
    2e0c205

Commits on Jul 24, 2017

  1. Fixed oozie workflow building.

    axnow committed Jul 24, 2017
    1a2c5dd
  2. Cleaning up project files

    axnow committed Jul 24, 2017
    ca4592b

Commits on Jul 25, 2017

  1. Initial version of spark-based document deduplication. It contains a new version of the clustering mechanism with auto-sizing clusters.

    axnow committed Jul 25, 2017
    ece39dc
  2. Work on the algorithm which splits large clusters among the cluster to obtain better scalability.

    axnow committed Jul 25, 2017
    c056a0b
  3. Complete version with tiled comparison task.

    Added programmatic logging to stdout for easier log reading.
    axnow committed Jul 25, 2017
    e7ad7aa
  4. Work on tiled optimization.

    Added reshuffle for better work balance.
    axnow committed Jul 25, 2017
    0cf6672
  5. Stable version; completes the job within 2.5h on the full data set.

    Needs code cleanup and quality assurance.
    axnow committed Jul 25, 2017
    013b53c
  6. Added options parsing from command line to control app behaviour.

    Version used for performance testing.
    axnow committed Jul 25, 2017
    ac56042
  7. Added dependency on scopt.

    Task tiling class rewritten in Scala, with tests.
    axnow committed Jul 25, 2017
    81f6509
  8. Scala version.

    axnow committed Jul 25, 2017
    cd1014c
  9. Fixed oozie workflow building.

    Cleaning up project files. Fixed workflow building for oozie.
    axnow committed Jul 25, 2017
    221cd52
  10. Merge branch 'sparkdeduplication' of https://github.com/axnow/CoAnSys into sparkdeduplication

    axnow committed Jul 25, 2017
    20da6f8