NSIDES project data release v0.1

Data are available at http://tatonettilab.org/resources/nsides/.

This release includes data for drug side effects (OFFSIDES) and drug-drug pair side effects (TWOSIDES).
These data represent an update from the data released with the publication of [1], to include adverse events reported to the FDA through FAERS up to and including 2014.
We are releasing source code (notebooks and scripts) on GitHub, as well as data in the form both of database dumps and compressed .csv files.

[1] Tatonetti, Nicholas P., P. Ye Patrick, Roxana Daneshjou, and Russ B. Altman. "Data-driven prediction of drug effects and interactions." Science translational medicine 4, no. 125 (2012): 125ra31-125ra31. doi:10.1126/scitranslmed.3003377

Summary information

Number of	Value
Drugs (≥ 1 exposure)	3,394
Adverse events types (≥ 1 occurrence)	17,552
Drug-event pairs	9,505,200
Significant* drug-event pairs	125,647
Drug-drug-event triplets†	222,155,888
Significant* drug-drug-event triplets†	5,729,992

* Significant determined by LOG(PRR) - 1.96 * PRR_error > LOG(2)

† This is not filtered by OFFSIDES, meaning a drug-drug-event triplet can be significant even if one of the drugs is more significantly associated with the event by itself.

Notes on methods used to compute data

Signal detection methods

A contingency table can be drawn using exposed and unexposed cohorts produced by propensity score matching.

	Had outcome	Didn't have outcome
Drug exposed	A	B
Not drug exposed	C	D

Using these definitions,

$PRR = \frac{\frac{A}{A+B}}{\frac{C}{C+D}}$

and the error is

$PRR_s = \sqrt{\frac{1}{A} + \frac{1}{C} - \frac{1}{A+B} - \frac{1}{C+D}}$

Several consequences of these definitions should be taken into account when inspecting the data.

PRR is NaN when both A and C are zero.
PRR is Inf when C is zero but A is greater than zero.
PRR is zero when A is zero and C is not zero.
PRR_s is Inf when A or C or both are zero.

Computational notes

Propensity score matching was used to account for covariates.
OFFSIDES and TWOSIDES were handled in the same way.

The (drug exposure) propensity scores for were computed using 10 bootstrap iterations.
For each iteration, only a fraction of reports were used to fit a logistic regression model that predicted drug exposure.
Specifically, non-exposed reports were sampled in equal number to exposed reports.
If fewer than 100 reports were exposed, then exposed reports (however many available) were used with 100 unexposed reports for propensity score calculation.
Final propensity scores were calculated using an average across the 10 bootstrap iterations.

Propensity score matching used the following bins: [0, 0.2, 0.4, 0.6, 0.8, 1].
These bins were used to create case and control sets with similar compositions.
For each bin, 10 times the number of cases were sampled from control (unexposed) reports.
We only added cases or controls from bins that had at least one case and at least one control.

While some filtering was conducted for both OFFSIDES and TWOSIDES (A > 0 for both flat files and database dumps and PRR > 0.1 for flat files), TWOSIDES contains pairwise associations even for pairs where one of the drugs alone has a higher association score than the pair.

Flat files

Two flat files are made available for download: OFFSIDES.csv.xz and TWOSIDES.csv.xz.
The files have the following schemata:

`OFFSIDES.csv.xz`

column name	data type	description
drug_rxnorn_id	integer	RxNorm CUI (RxCUI) of each drug
drug_concept_name	string	String name of each drug
condition_meddra_id	integer	MedDRA code of each (adverse event) condition
condition_concept_name	string	String name of each condition
A	integer	Number of reports prescribed the drug that had the condition
B	integer	Number of reports prescribed the drug that did not have the condition
C	integer	Number of reports not prescribed the drug† that had the condition
D	integer	Number of reports not prescribed the drug† that did not have the condition
PRR	float	Proportional reporting ratio*
PRR_error	float	Proportional reporting ratio error*
mean_reporting_frequency	float	A / (A + B)

† Number of controls is determined using propensity-score matching with controls sampled at 10-to-1 relative to cases.

* See signal detection methods

Note that this table does not include any drug-condition combinations for which both A and C were zero.
In other words, no information is presented for conditions that did not occur or for drugs that caused no adverse events.

`TWOSIDES.csv.xz`

Each drug pair in the TWOSIDES file occur each only once, since drug pair ordering is not meaningful.
Each pair's order is determined by OMOP CDM IDs in the TWOSIDES database table, though this means there is no particular order between RxNorm IDs in this flat file.

column name	data type	description
drug_1_rxnorm_id	integer	RxNorm CUI (RxCUI) of the first drug
drug_1_concept_name	string	String name of the first drug
drug_2_rxnorm_id	integer	RxNorm CUI (RxCUI) of the second drug
drug_2_concept_name	string	String name of the second drug
condition_meddra_id	integer	MedDRA code of each (adverse event) condition
condition_concept_name	string	String name of each condition
A	integer	Number of reports prescribed the drug that had the condition
B	integer	Number of reports prescribed the drug that did not have the condition
C	integer	Number of reports not prescribed the drug† that had the condition
D	integer	Number of reports not prescribed the drug† that did not have the condition
PRR	float	Proportional reporting ratio*
PRR_error	float	Proportional reporting ratio error*
mean_reporting_frequency	float	A / (A + B)

† Number of controls is determined using propensity-score matching with controls sampled at 10-to-1 relative to cases.

* See signal detection methods

Note that this table does not include any drug-drug-condition triplets for which both A and C were zero.
In other words, no information is presented for conditions that did not occur or for drug pairs that caused no adverse events.

Database dumps

The data contained in this release is a stored in a MySQL database called effect_nsides (MySQL Ver 14.14 Distrib 5.7.26, for Linux (x86_64)).
The code used to build and populate the tables in this database are located in nb/4.insert_tables/.

The dump was created using

usr/bin/mysqldump -p$PASSWORD effect_nsides | gzip > effect_nsides-2019-11-13.sql.gz

To load the database dumps into a local MySQL database, use the following commands:

mysqladmin create effect_nsides

gunzip < effect_nsides-2019-11-13.sql.gz | mysql effect_nsides

For more information about loading from these dump files, see the MySQL documentation on this topic.

Table schemata

effect_nsides has the following tables: REPORT, CONDITION_CONCEPT, CONDITION_OCCURRENCE, DRUG_CONCEPT, DRUG_EXPOSURE, OFFSIDES, and TWOSIDES.
These tables have the following schemata:

`REPORT`

column name	data type	description
report_id	int	Unique ID for a report
report_year	int	Year in which the report was received
person_age	int	Age of the affected person
person_sex	char(1)	Sex of the affected person

Note that both person_age and person_sex columns contain NULL values.
person_sex has been coded to be one of three values: 'M', 'F', or 'U'.
person_age is normalized to units of years, though some unreasonable values are clearly erroneous reports (eg. 1054, 869, 5200).

`CONDITION_CONCEPT`

column name	data type	description
condition_concept_id	int	OMOP CDM concept_id of each (adverse event) condition
condition_concept_name	varchar(255)	String name of each (adverse event) condition
condition_meddra_id	int	MedDRA code of each (adverse event) condition
condition_snomed_id	int	SNOMED-CT code of each (adverse event) condition

Note that the condition_snomed_id column does contain NULL values, as the map between MedDRA and SNOMED-CT is imperfect.

`CONDITION_OCCURRENCE`

column name	data type	description
report_id	int	Unique ID for a report
condition_concept_id	int	OMOP CDM concept_id of an (adverse event) condition

This table simply connects rows from REPORT with rows in CONDITION_CONCEPT.

`DRUG_CONCEPT`

column name	data type	description
drug_concept_id	int	OMOP CDM concept_id of each drug
drug_concept_name	varchar(255)	String name of each drug
rxnorm_concept_id	int	RxNorm CUI (RxCUI) of each drug
drugbank_concept_id	varchar(255)	DrugBank Accession Number (DB*) of each drug
chebi_concept_id	int	ChEBI ID of each drug

Note that drugbank_concept_id and chebi_concept_id both contain NULL values, as the maps among these terminologies are imperfect.

`DRUG_EXPOSURE`

column name	data type	description
report_id	int	Unique ID for a report
drug_concept_id	int	OMOP CDM concept_id of a drug

This table simply connects rows from REPORT with rows in DRUG_CONCEPT.

`OFFSIDES`

column name	data type	description
drug_concept_id	int	OMOP CDM concept_id of a drug
condition_concept_id	int	OMOP CDM concept_id of an (adverse event) condition
A	integer	Number of reports prescribed the drug that had the condition
B	integer	Number of reports prescribed the drug that did not have the condition
C	integer	Number of reports not prescribed the drug† that had the condition
D	integer	Number of reports not prescribed the drug† that did not have the condition
PRR	float	Proportional reporting ratio*
PRR_error	float	Proportional reporting ratio error*
mean_reporting_frequency	float	A / (A + B)

† Number of controls is determined using propensity-score matching with controls sampled at 10-to-1 relative to cases.

* See signal detection methods

`TWOSIDES`

column name	data type	description
drug_concept_id_1	int	OMOP CDM concept_id of first drug
drug_concept_id_2	int	OMOP CDM concept_id of second drug
condition_concept_id	int	OMOP CDM concept_id of an (adverse event) condition
A	integer	Number of reports prescribed the drug that had the condition
B	integer	Number of reports prescribed the drug that did not have the condition
C	integer	Number of reports not prescribed the drug† that had the condition
D	integer	Number of reports not prescribed the drug† that did not have the condition
PRR	float	Proportional reporting ratio*
PRR_error	float	Proportional reporting ratio error*
mean_reporting_frequency	float	A / (A + B)

† Number of controls is determined using propensity-score matching with controls sampled at 10-to-1 relative to cases.

* See signal detection methods

Questions

This work is the result of efforts by a number of people.
Questions or comments are best made by opening an issue on the GitHub repository.
Additionally, as this work is the result of a rotation project in the Tatonetti Lab at Columbia University, questions can also be sent to Michael Zietz (zietzm) or Nicholas Tatonetti directly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NSIDES pre-release v0.1

NSIDES project data release v0.1

Summary information

Notes on methods used to compute data

Signal detection methods

Computational notes

Flat files

`OFFSIDES.csv.xz`

`TWOSIDES.csv.xz`

Database dumps

Table schemata

`REPORT`

`CONDITION_CONCEPT`

`CONDITION_OCCURRENCE`

`DRUG_CONCEPT`

`DRUG_EXPOSURE`

`OFFSIDES`

`TWOSIDES`

Questions