This is the code repository containing all resources used to build the Webis-STEREO-21 corpus on scientific text reuse in open access publications.
It consists of general-purpose Spark jobs for scalable text reuse detection in large document collections.
Each stage of the pipeline is defined as a separate file in the `jobs` directory. Alongside, each job has an associated submit script in the `scripts` directory that handles resource allocation on a Spark cluster.
The `tools` directory contains the source code for the standalone alignment component (written in Go) and a standalone converter for Grobid output.
The `analysis-example.ipynb` notebook features an exemplary analysis of the Webis-STEREO-21 corpus as a starting point to facilitate data reuse.
Each job can be invoked to run either locally or on the cluster. See the makefile for available predefined targets.
Resource allocation is handled by the submit script associated with each job.
`make preprocess-cluster` (cluster mode using the corresponding submit script) or `make preprocess-local` (local mode)
Reads the STEREO document collection from S3 and converts each document to a standardized (id, content) format.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | <YOUR INPUT GROBID DUMP HERE> |
output_path | Path to write data to | stereo-grobid-preprocessed.parquet |
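Conceptually, the job only projects each Grobid record onto an (id, content) pair. A minimal PySpark sketch of that shape, assuming the dump is readable as Parquet and uses the hypothetical column names `doi` and `text` (the actual job and the Grobid converter in `tools` may differ):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

# Assumed input: a Grobid dump readable as parquet with columns `doi` and `text`.
raw = spark.read.parquet("<YOUR INPUT GROBID DUMP HERE>")

preprocessed = raw.select(
    F.col("doi").alias("id"),        # document identifier
    F.col("text").alias("content"),  # extracted full text
)

preprocessed.write.mode("overwrite").parquet("stereo-grobid-preprocessed.parquet")
```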
`make filter-cluster` (cluster mode using the corresponding submit script) or `make filter-local` (local mode)
Reads the preprocessed stereo document collection and filters the subset specified by the supplied list of DOIs.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-grobid-preprocessed.parquet |
output_path | Path to write data to | stereo-filtered.parquet |
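Filtering boils down to a semi-join against the DOI list. A minimal sketch, assuming the list is a plain text file with one DOI per line (file name and column names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-sketch").getOrCreate()

docs = spark.read.parquet("stereo-grobid-preprocessed.parquet")

# Hypothetical DOI list: one DOI per line in a plain text file.
dois = spark.read.text("dois.txt").withColumnRenamed("value", "id")

# Keep only documents whose id appears in the DOI list.
filtered = docs.join(F.broadcast(dois), on="id", how="left_semi")

filtered.write.mode("overwrite").parquet("stereo-filtered.parquet")
```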
`make vectorize-cluster` (cluster mode using the corresponding submit script) or `make vectorize-local` (local mode)
Splits each document into sequential fixed-size chunks and represents each chunk as a binary term vector.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-filtered.parquet/* |
output_path | Path to write data to | stereo-vectorized.parquet |
ngram_length | Length of chunks documents are split into | 50 |
num_features | Dimension of word feature vector for each chunk | 2**18 |
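In essence, each document is tokenized, cut into consecutive 50-token chunks, and each chunk is hashed into a binary term vector. A sketch of that idea using Spark's HashingTF; the tokenization and column names are assumptions and may differ from the actual job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import HashingTF

spark = SparkSession.builder.appName("vectorize-sketch").getOrCreate()

docs = spark.read.parquet("stereo-filtered.parquet")

NGRAM_LENGTH = 50  # tokens per chunk

@F.udf(ArrayType(ArrayType(StringType())))
def chunk(text):
    tokens = (text or "").split()
    return [tokens[i:i + NGRAM_LENGTH] for i in range(0, len(tokens), NGRAM_LENGTH)]

# One row per (document, chunk), each chunk being a list of tokens.
chunks = docs.withColumn("chunk", F.explode(chunk("content"))).select("id", "chunk")

# binary=True yields a 0/1 term vector instead of term counts.
tf = HashingTF(inputCol="chunk", outputCol="vector", numFeatures=2**18, binary=True)
vectorized = tf.transform(chunks)

vectorized.write.mode("overwrite").parquet("stereo-vectorized.parquet")
```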
`make hash-cluster` (cluster mode using the corresponding submit script) or `make hash-local` (local mode)
Calculates a set of hashes for each feature vector to enable MinHash similarity detection.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-vectorized.parquet/* |
output_path | Path to write data to | stereo-hashed.parquet |
num_hashes | Number of hashes for the MinHash calculation. Allows for 1/n Jaccard distance precision with n hashes | 5 |
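One way to picture this step is Spark's built-in MinHashLSH, which derives num_hashes MinHash values per chunk vector; the actual job may compute the signatures differently, so treat this as a sketch:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH

spark = SparkSession.builder.appName("hash-sketch").getOrCreate()

vectorized = spark.read.parquet("stereo-vectorized.parquet")

# Five hash tables approximate Jaccard similarity with 1/5 precision.
# Note: MinHashLSH requires each input vector to have at least one non-zero entry.
mh = MinHashLSH(inputCol="vector", outputCol="hashes", numHashTables=5)
hashed = mh.fit(vectorized).transform(vectorized)

hashed.write.mode("overwrite").parquet("stereo-hashed.parquet")
```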
`make reduce-cluster` (cluster mode using the corresponding submit script) or `make reduce-local` (local mode)
Encodes the hash set of each chunk as a one-hot binary vector and reduces all chunks of a document into a single vector using logical OR on the binary vectors.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-hashed.parquet/* |
output_path | Path to write data to | stereo-reduced.parquet |
num_features | Number of dimensions for the binary document vector | 2**18 |
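The reduction amounts to taking, per document, the union of all chunk hashes and setting one bit per hash in a num_features-dimensional vector. A sketch of that logic, assuming each chunk row carries its hashes as an integer array column `hashes` (the actual encoding may differ):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

spark = SparkSession.builder.appName("reduce-sketch").getOrCreate()

NUM_FEATURES = 2**18

# Assumption: one row per chunk with columns `id` and `hashes` (array of ints).
hashed = spark.read.parquet("stereo-hashed.parquet")

@F.udf(VectorUDT())
def one_hot_or(hash_lists):
    # Union over all chunk hashes == logical OR of the chunks' one-hot vectors.
    indices = sorted({h % NUM_FEATURES for hashes in hash_lists for h in hashes})
    return SparseVector(NUM_FEATURES, indices, [1.0] * len(indices))

reduced = (hashed
           .groupBy("id")
           .agg(F.collect_list("hashes").alias("hash_lists"))
           .withColumn("vector", one_hot_or("hash_lists"))
           .select("id", "vector"))

reduced.write.mode("overwrite").parquet("stereo-reduced.parquet")
```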
`make partition-cluster` (cluster mode using the corresponding submit script) or `make partition-local` (local mode)
Builds an inverted list of hash->doi pairs and partitions it by hash. Allows for efficient batching of the pairing job.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-reduced.parquet/* |
output_path | Path to write data to | stereo-partitioned.parquet |
num_partitions | Number of partitions to split the index into | 5000 |
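Since the set bits of a document's binary vector are exactly its hashes, the inverted list can be built by exploding those indices and repartitioning by hash. A sketch, assuming the document vector is stored as a sparse vector column named `vector`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()

reduced = spark.read.parquet("stereo-reduced.parquet")

# The non-zero indices of the binary document vector are the document's hashes.
@F.udf(ArrayType(IntegerType()))
def set_bits(vector):
    return [int(i) for i in vector.indices]

inverted = reduced.select(F.col("id").alias("doi"),
                          F.explode(set_bits("vector")).alias("hash"))

# Hash-partition the inverted list so each part-* file covers a disjoint hash slice.
inverted.repartition(5000, "hash").write.mode("overwrite").parquet("stereo-partitioned.parquet")
```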
`make pair-cluster` (cluster mode using the corresponding submit script) or `make pair-local` (local mode)
Transforms document vectors into document pairs. Each pair denotes two documents that share at least one hash across all chunks. Works by filtering the Cartesian product of documents down to all pairs that have at least one "1" in the logical AND of their binary document vectors.
Operates in batches; using at least 100 batches ("00" to "99") is recommended.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-partitioned.parquet/* |
output_path | Path to write data to | stereo-paired.parquet |
batch | Unix file wildcard to identify the part-* files used in this batch | "00" |
`make join-cluster` (cluster mode using the corresponding submit script) or `make join-local` (local mode)
Joins the pair dataframe with the corresponding texts in each row.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | stereo-paired.parquet/* |
output_path | Path to write data to | stereo-joined.parquet |
batch | Batch to join on (recommended 100 batches) | 00 |
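A minimal sketch of the join, assuming the pair columns are named `doi_a` and `doi_b` as in the pairing sketch above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-sketch").getOrCreate()

pairs = spark.read.parquet("stereo-paired.parquet")
texts = spark.read.parquet("stereo-filtered.parquet")

# Attach both documents' texts so every row is self-contained for the aligner.
joined = (pairs
          .join(texts.select(F.col("id").alias("doi_a"),
                             F.col("content").alias("text_a")), on="doi_a")
          .join(texts.select(F.col("id").alias("doi_b"),
                             F.col("content").alias("text_b")), on="doi_b"))

joined.write.mode("overwrite").parquet("stereo-joined.parquet")
```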
`make align-cluster` (cluster mode using the corresponding submit script) or `make align-local` (local mode)
Produces the exact alignment of all given document pairs. Operates in batches similar to the join job.
Parameter | Description | Default |
---|---|---|
pair_path | Path to read data from | stereo-paired.parquet/* |
text_path | Path to read data from | stereo-filtered.parquet/* |
output_path | Path to write data to | stereo-aligned.parquet |
batch | Batch prefix from the join job | "00" |
NGRAM_LENGTH | Length of the word n-grams used by the aligner | 8 |
NGRAM_OVERLAP | Overlap between consecutive n-grams | 7 |
THETA | Threshold parameter of the aligner | 250 |
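The alignment itself is computed by the standalone Go component in the `tools` directory. To illustrate what the NGRAM_LENGTH and NGRAM_OVERLAP defaults (8 and 7) mean, here is a small Python illustration of overlapping word 8-gram seeds that slide by one token; this is not the Go implementation:

```python
def seed_ngrams(text, length=8, overlap=7):
    """Overlapping word n-grams; with overlap 7 the window advances one token at a time."""
    tokens = text.split()
    step = length - overlap
    return [tuple(tokens[i:i + length]) for i in range(0, len(tokens) - length + 1, step)]

a = seed_ngrams("the quick brown fox jumps over the lazy dog near the river bank")
b = seed_ngrams("a quick brown fox jumps over the lazy dog near a river")

# 8-gram seeds shared by both texts mark candidate reuse spans for the aligner.
print(set(a) & set(b))
```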
`make metadata-cluster` (cluster mode using the corresponding submit script) or `make metadata-local` (local mode)
Extracts metadata from the Microsoft Open Academic Graph dataset and maps it to the DFG classification of scientific disciplines.
Parameter | Description | Default |
---|---|---|
input_path | Path to read data from | file:/mnt/ceph/storage/corpora/corpora-thirdparty/corpus-microsoft-open-academic-graph-v1/*.txt |
output_path | Path to write data to | stereo-oag.parquet |
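Conceptually, the job reads the OAG records, keeps the DOI and the fields of study, and maps those onto DFG subject areas. A rough sketch; the OAG field names (`doi`, `fos`) and the mapping table are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metadata-sketch").getOrCreate()

# OAG dumps are line-delimited JSON records.
oag = spark.read.json("file:/mnt/ceph/storage/corpora/corpora-thirdparty/"
                      "corpus-microsoft-open-academic-graph-v1/*.txt")

# Hypothetical lookup table from an OAG field of study to a DFG discipline code.
fos_to_dfg = spark.read.csv("fos-to-dfg.csv", header=True)  # columns: fos, dfg_code

metadata = (oag
            .select("doi", F.explode("fos").alias("fos"))
            .join(fos_to_dfg, on="fos")
            .groupBy("doi")
            .agg(F.collect_set("dfg_code").alias("disciplines")))

metadata.write.mode("overwrite").parquet("stereo-oag.parquet")
```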
`make unify-cluster` (cluster mode using the corresponding submit script) or `make unify-local` (local mode)
Joins metadata and reuse cases.
Parameter | Description | Default |
---|---|---|
case_path | Path to read case data from | stereo-core-aligned.parquet/*/* |
metadata_path | Path to read metadata from | stereo-metadata.parquet/* |
output_path | Path to write data to | stereo-corpus.jsonl |
`make finalize-cluster` (cluster mode using the corresponding submit script) or `make finalize-local` (local mode)
Transforms each data record into its final form, filters publication metadata, and assigns unique IDs to each case.
Parameter | Description | Default |
---|---|---|
case_path | Path to read case data from | stereo-core-aligned.parquet/*/* |
text_path | Path to read publication text data from | stereo-core-aligned.parquet/*/* |
metadata_path | Path to read metadata from | stereo-metadata.parquet/* |
output_cases_full | Path to write case data to | webis-stereo21/cases-full |
output_cases_metadata_only | Path to write metadata-only case data to | webis-stereo21/cases-metadata-only |
output_publications_full | Path to write publication data to | webis-stereo21/publications-full |
output_publications_metadata_only | Path to write metadata-only publication data to | webis-stereo21/publications-metadata-only |
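A sketch of the final ID assignment and the split into full and metadata-only views; the column names dropped for the metadata-only output are assumptions:

```python
import uuid

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("finalize-sketch").getOrCreate()

cases = spark.read.parquet("stereo-core-aligned.parquet/*/*")

# Assign a unique identifier to every reuse case.
case_id = F.udf(lambda: str(uuid.uuid4()), StringType())
final_cases = cases.withColumn("case_id", case_id())

# Full records and a metadata-only view with the (assumed) text columns dropped.
final_cases.write.mode("overwrite").parquet("webis-stereo21/cases-full")
final_cases.drop("text_a", "text_b").write.mode("overwrite").parquet("webis-stereo21/cases-metadata-only")
```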