Maintain Term Frequencies

@heikomuller heikomuller released this 28 Dec 22:11
· 12 commits to master since this release
42805c3

This release introduces several major changes to the file formats as well as additional options for context signature generation and signature robustification.

Term Index and Equivalence Classes Files

D4 now maintains a frequency count for each term (equivalence class) in each column that the term (equivalence class) occurs in. For terms and equivalence classes, the list of columns is now a comma-separated list of column-id:frequency pairs.
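
As an illustration of the new column list, the following is a minimal Python sketch that parses a comma-separated list of column-id:frequency pairs. The function name `parse_column_frequencies`, the sample value, and the assumption that column identifiers and frequencies are integers are illustrative only. The sketch covers just the column list, not the remaining fields of a term index or equivalence class file line.

```python
from typing import Dict


def parse_column_frequencies(field: str) -> Dict[int, int]:
    """Parse 'column-id:frequency' pairs separated by commas.

    Assumption: column identifiers and frequencies are integers.
    """
    frequencies: Dict[int, int] = {}
    for pair in field.split(","):
        column_id, frequency = pair.split(":")
        frequencies[int(column_id)] = int(frequency)
    return frequencies


# Hypothetical sample value, not taken from a real D4 output file.
columns = parse_column_frequencies("0:12,4:3,17:1")
print(columns)
```

With frequencies available per column, a consumer can, for example, sum the values to get the total number of occurrences of a term across all columns.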

Signature Files

In the robust signature files, D4 now stores the size of each block (the total number of terms across all equivalence classes in the block) as the first value of the comma-separated list. The remaining elements are eq-identifier:overlap pairs.
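
The block layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function name `parse_signature_block` and the sample block are hypothetical, equivalence class identifiers are assumed to be integers, and the overlap is parsed as a floating-point number. How blocks are delimited within a line of the signature file is not covered here.

```python
from typing import List, Tuple


def parse_signature_block(block: str) -> Tuple[int, List[Tuple[int, float]]]:
    """Split one block into its size and its (eq-identifier, overlap) pairs.

    Assumption: the first comma-separated value is the block size and every
    following value has the form 'eq-identifier:overlap'.
    """
    values = block.split(",")
    size = int(values[0])
    members: List[Tuple[int, float]] = []
    for pair in values[1:]:
        eq_id, overlap = pair.split(":")
        members.append((int(eq_id), float(overlap)))
    return size, members


# Hypothetical sample block, not taken from a real D4 output file.
size, members = parse_signature_block("42,3:0.75,8:0.5")
print(size, members)
```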

Robust Signatures

D4 contains a new similarity measure for equivalence classes that is based on TF-IDF (option --sim=TF-ICF when creating signatures).

For minor drops, D4 now also includes a new robustifier (--robustifier=IGNORE-LAST) that ignores the last block, whereas LIBERAL ignores the largest block.