Maintain Term Frequencies
This release introduces several major changes to the file formats as well as additional options for context signature generation and signature robustification.
Term Index and Equivalence Classes Files
D4 now maintains frequency counts for each term (equivalence class) for each column that the term (equivalence class) occurs in. For terms and equivalence classes, the list of columns is now a comma-separated list of column-id:frequency pairs.
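A minimal sketch of reading such a column list, assuming that column identifiers and frequencies are both integers and that the field contains only the comma-separated pairs. The function name `parse_column_frequencies` is hypothetical and not part of D4.

```python
def parse_column_frequencies(field: str) -> dict[int, int]:
    """Parse a comma-separated list of column-id:frequency pairs.

    Assumption: ids and frequencies are integers; an empty field
    yields an empty mapping.
    """
    counts: dict[int, int] = {}
    for pair in field.strip().split(","):
        if not pair:
            continue  # tolerate an empty field or a trailing comma
        column_id, frequency = pair.split(":")
        counts[int(column_id)] = int(frequency)
    return counts


# A term that occurs 10 times in column 3 and twice in column 7.
print(parse_column_frequencies("3:10,7:2"))  # {3: 10, 7: 2}
```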
Signature Files
In the robust signature files, D4 now maintains the size of each block (the number of terms across all equivalence classes in the block) as the first value of the comma-separated list. The following elements are eq-identifier:overlap pairs.
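The block layout described here could be read roughly as follows. This is only a sketch: it handles a single block's comma-separated list, keeps equivalence class identifiers as strings, and assumes the overlap value parses as a number. The function name `parse_robust_block` is hypothetical.

```python
def parse_robust_block(field: str) -> tuple[int, list[tuple[str, float]]]:
    """Split one block into its size and its eq-identifier:overlap pairs.

    Assumption: the first value is an integer block size and every
    following value has the form eq-identifier:overlap.
    """
    values = field.strip().split(",")
    block_size = int(values[0])
    members: list[tuple[str, float]] = []
    for pair in values[1:]:
        eq_id, overlap = pair.split(":")
        members.append((eq_id, float(overlap)))
    return block_size, members


# A block of 12 terms that contains two equivalence classes.
size, members = parse_robust_block("12,4:0.75,9:0.5")
print(size, members)  # 12 [('4', 0.75), ('9', 0.5)]
```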
Robust Signatures
D4 contains a new similarity measure for equivalence classes that is based on tf-idf (option --sim=TF-ICF when creating signatures).
For minor drops, D4 now also includes a new robustifier (--robustifier=IGNORE-LAST) that ignores the last block (instead of the largest block, as LIBERAL does).
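The difference between the two robustifiers can be illustrated with a small sketch that only picks which block is ignored. It assumes the blocks of a signature are given as a list of block sizes in signature order, and that ties for the largest block resolve to the first one. Neither assumption is confirmed by D4, and the function name `ignored_block_index` is hypothetical. What D4 does with the remaining blocks is not shown.

```python
def ignored_block_index(block_sizes: list[int], robustifier: str) -> int:
    """Return the index of the block that the given robustifier ignores.

    Assumption: block_sizes lists the blocks in signature order.
    """
    if not block_sizes:
        raise ValueError("signature has no blocks")
    if robustifier == "IGNORE-LAST":
        # Ignore the final block, regardless of its size.
        return len(block_sizes) - 1
    if robustifier == "LIBERAL":
        # Ignore the largest block (first one on ties, by assumption).
        return max(range(len(block_sizes)), key=block_sizes.__getitem__)
    raise ValueError(f"unsupported robustifier: {robustifier}")


sizes = [5, 20, 3]
print(ignored_block_index(sizes, "IGNORE-LAST"))  # 2
print(ignored_block_index(sizes, "LIBERAL"))      # 1
```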