dolma is a wonderful tool, and I'm successfully using it for many steps of my pipeline.
I can get it working for paragraph-level deduplication. When I apply it in a similar setting for decontamination, however, it never assigns any attributes:
What is the problem?
Compared to the "normal" paragraph deduplication, when I try to just apply an existing bloom filter, there are no dedupe attributes in the resulting attribute files. I have already experimented with the desired_false_positive_rate and overlap_threshold parameters, but without any success.
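For anyone who wants to reproduce the check, here is a minimal sketch of how the attribute files can be inspected for non-empty spans. It assumes each attribute file is gzip-compressed JSON Lines, and that every record carries an "attributes" object keyed by the configured attribute_name (paragraphs_bff_duplicates here) holding a list of spans. That layout and the helper names count_flagged and count_flagged_in_dir are assumptions for illustration, not part of dolma's API.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Tuple


def count_flagged(lines: Iterable[str], attribute_name: str) -> Tuple[int, int]:
    """Return (records seen, records with at least one span for attribute_name)."""
    total = 0
    flagged = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        total += 1
        # Assumed layout: {"id": ..., "attributes": {attribute_name: [[start, end, score], ...]}}
        spans = (record.get("attributes") or {}).get(attribute_name) or []
        if spans:
            flagged += 1
    return total, flagged


def count_flagged_in_dir(attribute_dir: str, attribute_name: str) -> Tuple[int, int]:
    """Sum the counts over every *.gz file directly inside attribute_dir."""
    total = 0
    flagged = 0
    for path in sorted(Path(attribute_dir).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            seen, hit = count_flagged(handle, attribute_name)
        total += seen
        flagged += hit
    return total, flagged


# Example call, using the paths from the runs in this issue:
# count_flagged_in_dir("tmp/v0/attributes/decontaminate", "paragraphs_bff_duplicates")
```

If the second number comes back as 0 while the first is non-zero, the attribute files were written but no paragraph was ever matched against the bloom filter.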
Here is the output:
dolma -c create-bloomfilter.yaml dedupe
bloom_filter:
  desired_false_positive_rate: 0.001
  estimated_doc_count: 73543
  file: decontamination_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '
      '
  skip_empty: true
documents:
- benchmarks.jsonl.gz
processes: 4
work_dir:
  input: /tmp/dolma-input-1rmq0gbx
  output: /tmp/dolma-output-ky8van2k
[2024-06-27T12:34:26Z INFO dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO dolma::deduper] Skipping "/disk/cschroeder/workspaces/dolma/benchmarks.jsonl.gz" because it already exists
[2024-06-27T12:34:26Z INFO dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO dolma::deduper] Bloom filter written.
[2024-06-27T12:34:26Z INFO dolma::deduper] Done!
dolma -c decontaminate.yaml dedupe
bloom_filter:
  desired_false_positive_rate: 0.1
  estimated_doc_count: 288347
  file: decontamination_bloom_filter.bin
  read_only: true
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '
      '
  skip_empty: true
documents:
- tmp/v0/documents/*.gz
processes: 3
work_dir:
  input: work/para/input
  output: work/para/output
[2024-06-27T12:38:17Z INFO dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0000.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0001.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0002.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0003.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0004.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:22Z INFO dolma::deduper] Bloom filter written.
[2024-06-27T12:38:22Z INFO dolma::deduper] Done!
Am I missing something?
Info about my setup:
I am using the latest dolma 1.0.3 release. My latest minimum working example is based on configs/dolma-v1_5/decontamination.
My config files are create-bloomfilter.yaml and decontaminate.yaml.