ReleaseNotes

Contains the release notes for older Duke versions from before the move to Github.

Duke 1.2

See 1.2 Release Notes.

Duke 1.1

The main new feature in this version is the GeneticAlgorithm, which can be used to automatically tune a configuration.

Other new features:

Support for geosearch.

API changes:

Moved Column class to .datasources.
Property is now an interface, and no longer a class.
Added setDoInference to !InMemoryLinkDatabase.
Added !ConfigWriter, which can export configurations to XML.
Added !LinkFileWriter.
Added Processor.setProfiling and Processor.getProfiler to allow performance profiling via API.
Added Processor.removeMatchListener
Added Configuration.copy() and Property.copy().

Changes to command-line tools:

Added the RecordSearch tool.
Added --reindex option to DebugCompare.
Added --lookups to .Duke

Bug fixes:

SPARQL data source was broken. Now fixed, and test cases added.
Issue 124: --interactive fails in Eclipse
Issue 114: --noreindex causes matches to disappear

Duke 1.0

This is a major new release of Duke. After 18 months of work it is now feature-complete enough to deserve a 1.0 release.

Performance improvements:

Support for multi-threading added
Using NIOFSDirectory on all platforms except Windows
New in-memory backend, faster than Lucene (experimental)

Changes to [Comparator]s:

Geo-coordinate comparator added.
Q-grams comparator added.
Levenshtein implementation is now faster
Weighted Levenshtein weight estimator now knows position in string (issue 81)

Changes to [Cleaner]s:

Added PhoneNumberCleaner
Extended and generalized regexp cleaner
Removed sub-cleaner concept, added support for multiple cleaners

Other improvements:

Implemented user control over lookup props
Upgraded to Lucene 4.0
Added MatchListener.startProcessing() callback
Removed some MatchListener callback methods (weren't thread-safe)
InMemoryLinkDatabase now complete and tested
LinkDatabaseMatchListener bug fixes
Better validation of configurations
JDBCEquivalenceClassDatabase added
RDBMSLinkDatabase performance improvement

Changes to command-line client:

Added data debug mode
Fixed bug with reusing link file as test file
Added pretty-printing of records
Better interactive debugging behaviour
Improvements to DebugCompare tool
Added performance profiling to command-line client

Bugs fixed:

Issue 83: Look up record by ID when ID is a URI.
Issue 90: Bug in command-line option parser.
Bug in CSV data source fixed

Duke 0.6

The following changes have been made in this version:

A change to the calculation of property probabilities when values do not match exactly. This means that you may need to adjust the probabilities and thresholds in your applications.
Upgraded to Lucene 3.6.1.
Improvements to NorwegianCompanyNameCleaner and NorwegianAddressCleaner.

This version adds the following features:

A weighted Levenshtein comparator.
A Metaphone comparator.
A Jaccard index comparator.
A prototype of a comparator using a Norwegian version of Metaphone.
A generic value cleaner.
Support for setting objects as parameters of other objects.

Duke 0.5

This version has seen the internals of the code refactored and cleaned up. In addition, the following has been added:

Support for pluggable backends in addition to the Lucene backend
A naïve in-memory backend that matches all pairs of records
Performance tuning options in the configuration
Data sources can now split a single value into several properties
New cleaners: RegexpCleaner, MappingFileCleaner
The NTriples data source now supports incremental loading
Properties can now be declared as type="ignore"
Command-line tool has an --interactive option
Command-line tool has an --noreindex option
Levenshtein comparator has been substantially optimized
Several bug fixes
Many more tests
In-memory link database implementation
inferLink() method on link database interface

Duke 0.4

This version of Duke introduces:

Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
In-memory data source added (thanks to FMitzlaff).
Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
Record linkage API refactored slightly to be more flexible (with FMitzlaff).
Added utilities for building equivalence classes from Duke output.
Made the XML config loader more robust.
Added a special cleaner for English person names.
Fixed bug in NumericComparator (issue 66)
Uses own Lucene query parser to avoid issues with search strings.
Upgraded to Lucene 3.5.0.
Added many more tests.
Many small bug fixes to core, NTriples reader, ec.

Duke 0.3

This version of Duke introduces:

Refactored API to make it much more user-friendly.
Documented API to same end.
New comparators: !NumericComparator, !DiceCoefficientComparator, !SoundexComparator
A new record linkage mode which can be used to link records from different data sets.
Numerous bug fixes and new test cases.
Performance improvements in the Levenshtein comparator.
Default cleaner now strips accents.
Upgraded to Lucene 3.3.0.
Version stamping in manifest file, API, and command-line client.

Duke 0.2

This version of Duke fixes a number of bugs, and adds a number of improvements.

Example data and setup now included in distribution.
New comparators: !JaroWinklerTokenized, !DifferentComparator
A new !DebugCompare command (see TuningGuide)
More flexibility in the CSV data source
Better reporting of configuration errors
A --verbose option

Provide feedback

Saved searches

Use saved searches to filter your results more quickly