This repository provides access to the S2AND dataset and S2AND reference model described in the paper S2AND: A Benchmark and Evaluation System for Author Name Disambiguation by Shivashankar Subramanian, Daniel King, Doug Downey, Sergey Feldman.
The reference model is live on semanticscholar.org, and the trained model is available now as part of the data download (see below).
To install this package, run the following:
git clone https://github.com/allenai/S2AND.git
cd S2AND
conda create -y --name s2and python==3.7
conda activate s2and
pip install -r requirements.in
pip install -e .
If you run into cryptic errors about GCC on macOS while installing the requirments, try this instead:
CFLAGS='-stdlib=libc++' pip install -r requirements.in
To obtain the S2AND dataset, run the following command after the package is installed (from inside the S2AND
directory):
[Expected download size is: 50.4 GiB]
aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/
Note that this software package comes with tools specifically designed to access and model the dataset.
Modify the config file at data/path_config.json
. This file should look like this
{
"main_data_dir": "absolute path to wherever you downloaded the data to",
"internal_data_dir": "ignore this one unless you work at AI2"
}
As the dummy file says, main_data_dir
should be set to the location of wherever you downloaded the data to, and
internal_data_dir
can be ignored, as it is used for some scripts that rely on unreleased data, internal to Semantic Scholar.
Once you have downloaded the datasets, you can go ahead and load up one of them:
from os.path import join
from s2and.data import ANDData
dataset_name = "pubmed"
parent_dir = "data/pubmed/
dataset = ANDData(
signatures=join(parent_dir, f"{dataset_name}_signatures.json"),
papers=join(parent_dir, f"{dataset_name}_papers.json"),
mode="train",
specter_embeddings=join(parent_dir, f"{dataset_name}_specter.pickle"),
clusters=join(parent_dir, f"{dataset_name}_clusters.json"),
block_type="s2",
train_pairs_size=100000,
val_pairs_size=10000,
test_pairs_size=10000,
name=dataset_name,
n_jobs=8,
)
This may take a few minutes - there is a lot of text pre-processing to do.
The first step in the S2AND pipeline is to specify a featurizer and then train a binary classifier that tries to guess whether two signatures are referring to the same person.
We'll do hyperparameter selection with the validation set and then get the test area under ROC curve.
Here's how to do all that:
from s2and.model import PairwiseModeler
from s2and.featurizer import FeaturizationInfo
from s2and.eval import pairwise_eval
featurization_info = FeaturizationInfo()
# the cache will make it faster to train multiple times - it stores the features on disk for you
train, val, test = featurize(dataset, featurization_info, n_jobs=8, use_cache=True)
X_train, y_train = train
X_val, y_val = val
X_test, y_test = test
# calibration fits isotonic regression after the binary classifier is fit
# monotone constraints help the LightGBM classifier behave sensibly
pairwise_model = PairwiseModeler(
n_iter=25, calibrate=True, monotone_constraints=featurization_info.lightgbm_monotone_constraints
)
# this does hyperparameter selection, which is why we need to pass in the validation set.
pairwise_model.fit(X_train, y_train, X_val, y_val)
# this will also dump a lot of useful plots (ROC, PR, SHAP) to the figs_path
pairwise_metrics = pairwise_eval(X_test, y_test, pairwise_model.classifier, figs_path='figs/', title='example')
print(pairwise_metrics)
The second stage in the S2AND pipeline is to tune hyperparameters for the clusterer on the validation data and then evaluate the full clustering pipeline on the test blocks.
We use agglomerative clustering as implemented in fastcluster
with average linkage.
There is only one hyperparameter to tune.
from s2and.model import Clusterer, FastCluster
from hyperopt import hp
clusterer = Clusterer(
featurization_info,
pairwise_model,
cluster_model=FastCluster(linkage="average"),
search_space={"eps": hp.uniform("eps", 0, 1)},
n_iter=25,
n_jobs=8,
)
clusterer.fit(dataset)
# the metrics_per_signature are there so we can break out the facets if needed
metrics, metrics_per_signature = cluster_eval(dataset, clusterer)
print(metrics)
For a fuller example, please see the transfer script: scripts/transfer_experiment.py
.
Assuming you have a clusterer already fit, you can dump the model to disk like so
import pickle
with open("saved_model.pkl", "wb") as _pkl_file:
pickle.dump(clusterer, _pkl_file)
You can then reload it, load a new dataset, and run prediction
import pickle
with open("saved_model.pkl", "rb") as _pkl_file:
clusterer = pickle.load(_pkl_file)
anddata = ANDData(
signatures=signatures,
papers=papers,
specter_embeddings=paper_embeddings,
name="your_name_here",
mode="inference",
block_type="s2",
)
pred_clusters, pred_distance_matrices = clusterer.predict(anddata.get_blocks(), anddata)
Our released models are in the s3
folder referenced above, and are called production_model.pickle
and full_union_seed_*.pickle
. They can be loaded the same way, except that the pickled object is a dictionary, with a clusterer
key.
There is a also a predict_incremental
function on the Clusterer
, that allows prediction for just a small set of new signatures. When instantiating ANDData
, you can pass in cluster_seeds
, which will be used instead of model predictions for those signatures. If you call predict_incremental
, the full distance matrix will not be created, and the new signatures will simply be assigned to the cluster they have the lowest average distance to, as long as it is below the model's eps
, or separately reclustered with the other unassigned signatures, if not within eps
of any existing cluster.
The experiments in the paper were run with the python (3.7.9) package versions in paper_experiments_env.txt
. You can install these packages exactly by running pip install pip==21.0.0
and then pip install -r paper_experiments_env.txt --use-feature=fast-deps --use-deprecated=legacy-resolver
. Rerunning scripts/paper_experiments.sh
on the branch s2and_paper
should produce the same numbers as in the paper (we will udpate here if this becomes not true).
Note that by default we are using the --use_cache
flag, which will cache all the features so future reruns are faster. There are two things to be aware of: (a) the cache is stored in RAM and can be huge (100gb+) and (b) if you intend to change the features and rerun, you'll have to turn off the cache or the new features won't be used.
The code in this repo is released under the Apache 2.0 license (license included in the repo. The dataset is released under ODC-BY (included in S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).
If you use S2AND in your research, please cite S2AND: A Benchmark and Evaluation System for Author Name Disambiguation.
@inproceedings{subramanian2021s2and,
title={{S}2{AND}: {A} {B}enchmark and {E}valuation {S}ystem for {A}uthor {N}ame {D}isambiguation},
author={Subramanian, Shivashankar and King, Daniel and Downey, Doug and Feldman, Sergey},
year={2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {{JCDL} '21: Proceedings of the {ACM/IEEE} Joint Conference on Digital Libraries in 2021},
series = {JCDL '21}
}
S2AND is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.