This repository attempts to analyze the gender of first authors of papers at various conferences. There are several caveats here. Inferring gender from a name is never exact, and the accuracy of this method has not been tested, so any results should be considered suspect. Aside from manually labelling the gender of each author (itself a difficult and potentially error-prone task), there are several approaches that could improve the accuracy of this method. For example, using the country of the author's affiliation could make predictions more accurate, since the gender associated with a given name can vary by country.
We use the genderComputer library for gender inference, which is included as a Git submodule. To fetch the submodules in this repository, run:
git submodule update --init
We also use Pipenv to manage dependencies, so it must be installed first. To install the remaining dependencies, run:
pipenv install
The downloaded files can be analyzed by running the following command:
pipenv run python analyze_genders.py
This will print a CSV file with inferred counts of first authors by gender. You can also use this notebook for further analysis.
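As a rough illustration of the kind of output described above, the sketch below tallies already-inferred labels per conference and renders them as CSV. It is not taken from analyze_genders.py: the function name gender_counts_csv, the input shape, the column names, and the label strings are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def gender_counts_csv(labels_by_conference):
    """Render per-conference gender counts as CSV text.

    labels_by_conference is a hypothetical input: it maps a conference
    name to a list of inferred label strings, one per first author.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Column names are assumed for illustration only.
    writer.writerow(["conference", "gender", "count"])
    for conference in sorted(labels_by_conference):
        counts = Counter(labels_by_conference[conference])
        for gender in sorted(counts):
            writer.writerow([conference, gender, counts[gender]])
    return out.getvalue()


if __name__ == "__main__":
    sample = {"SIGMOD-19": ["female", "male", "male", "unknown"]}
    print(gender_counts_csv(sample), end="")
```

Keeping an explicit label such as "unknown" for names that cannot be resolved makes the uncertainty of the inference visible in the counts instead of silently dropping those authors.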
To add a new conference, edit fetch-papers.sh to retrieve the new JSON data files. The files should be named CONF-xx.json, where CONF is the name of the conference and xx is the year.
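The naming convention can be checked with a small helper such as the sketch below. It assumes that xx is a two-digit year and that the conference name may itself contain hyphens; the function name parse_data_filename and the regular expression are illustrative and are not part of this repository.

```python
import re
from pathlib import Path

# Assumption: "xx" is a two-digit year, e.g. SIGMOD-19.json.
FILENAME_PATTERN = re.compile(r"^(?P<conf>.+)-(?P<year>\d{2})\.json$")


def parse_data_filename(path):
    """Split a CONF-xx.json file name into (conference, year).

    Returns None when the name does not follow the convention.
    """
    match = FILENAME_PATTERN.match(Path(path).name)
    if match is None:
        return None
    return match.group("conf"), match.group("year")


if __name__ == "__main__":
    print(parse_data_filename("data/SIGMOD-19.json"))
    print(parse_data_filename("scopus.json"))
```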
The link to the JSON files can be obtained by looking at the table of contents for the proceedings in DBLP and selecting the JSON export link.
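To give an idea of what such a file contains, the sketch below pulls the first author of each paper out of a loaded export. It assumes the export follows the layout of DBLP's search API (result, hits, hit, info, authors, author), where a single-author paper may store one object instead of a list; check this against an actual downloaded file. The function name first_authors is hypothetical.

```python
def first_authors(dblp_export):
    """Return the first author's name for each paper in a DBLP export.

    Assumes the DBLP search API layout; papers without author
    information are skipped.
    """
    hits = dblp_export.get("result", {}).get("hits", {}).get("hit", [])
    names = []
    for hit in hits:
        authors = hit.get("info", {}).get("authors", {}).get("author", [])
        # A single author may appear as one object rather than a list.
        if isinstance(authors, (dict, str)):
            authors = [authors]
        if authors:
            first = authors[0]
            names.append(first["text"] if isinstance(first, dict) else first)
    return names


if __name__ == "__main__":
    # Made-up sample data shaped like the assumed layout.
    sample = {"result": {"hits": {"hit": [
        {"info": {"authors": {"author": [{"text": "Ada Example"},
                                         {"text": "Bob Example"}]}}},
        {"info": {"authors": {"author": {"text": "Solo Example"}}}},
        {"info": {"title": "Front Matter"}},
    ]}}}
    print(first_authors(sample))
```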
Since data coming from DBLP is CC0 and can be freely shared, any new data files should be committed to this repository.
To fetch data from Scopus, you will need an API key. Set this key as SCOPUS_API_KEY in the .env file.
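Pipenv loads the .env file into the environment for commands started with pipenv run, so a Python script launched that way could read the key as in the sketch below. This is only an illustration: the helper name scopus_api_key is hypothetical, and how fetch-scopus.sh itself reads the key is not shown here.

```python
import os


def scopus_api_key(environ=None):
    """Return the Scopus API key from the environment.

    Assumes the process was started via `pipenv run`, which loads .env.
    Raises RuntimeError when the key is missing or empty.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("SCOPUS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SCOPUS_API_KEY is not set; add it to the .env file")
    return key


if __name__ == "__main__":
    try:
        scopus_api_key()
        print("SCOPUS_API_KEY found")
    except RuntimeError as error:
        print(error)
```

Failing early with a clear message avoids sending unauthenticated requests and getting a less obvious error back from the API.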
Data from Scopus can then be fetched by running fetch-scopus.sh. This fetches Scopus data for all papers at DB conferences for which a DOI is available from DBLP and saves it to scopus.json.
Note that this requires the installation of jq to process the JSON from DBLP.