python create_dataset.py [--config path/to/config.json --seed your_seed]
A template of `config.json` is provided in `configs/`.
Rx1 | Rx2 | Rx3 | Rx4 | ... | RxN | RR |
---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | ... | 0 | 3.00 |
0 | 1 | 0 | 1 | ... | 1 | 2.67 |
0 | 1 | 1 | 1 | ... | 1 | 3.14 |
⋮ | ⋮ | ⋮ | ⋮ | ... | ⋮ | ⋮ |
1 | 1 | 1 | 1 | ... | 0 | 1.85 |
- RR = Relative risk
- `file_identifier`: File identifier for the output data
- `output_dir`: Directory identifier for the output data
- `seed`: Random seed
- `n_combi`: Number of unique drug combinations to produce
- `n_rx`: Number of individual drugs (equals the number of columns in the generated dataset)
- `mean_rx`: Mean number of drugs per combination
- `use_gpu`: Indicates whether to use the GPU for data generation, if available
- `patterns`: Sub-configuration for the dangerous patterns
  - `n_patterns`: Number of dangerous patterns to generate
  - `min_rr`: Minimal RR for patterns
  - `max_rr`: Maximal RR for patterns
  - `mean_rx`: Mean number of drugs per dangerous pattern
- `disjoint_combinations`: Sub-configuration for drug combinations disjoint from the dangerous patterns
  - `mean_rr`: Gaussian mean for the RR of these combinations
  - `std_rr`: Gaussian standard deviation for the RR of these combinations
- `inter_combinations`: Sub-configuration for drug combinations which intersect with dangerous patterns
  - `std_rr`: Gaussian standard deviation for the RR of these combinations
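As a rough illustration of how these parameters might fit together, the sketch below builds a configuration as a Python dictionary and serializes it to JSON. The nesting follows the sub-configurations listed above, but every value is a hypothetical placeholder rather than a project default; the template in `configs/` remains the authoritative reference.

```python
import json

# All values are hypothetical placeholders chosen for illustration only;
# they are NOT the defaults shipped with the project.
config = {
    "file_identifier": "example",
    "output_dir": "datasets",
    "seed": 42,
    "n_combi": 1000,   # number of unique drug combinations
    "n_rx": 50,        # number of individual drugs (dataset columns)
    "mean_rx": 5,      # mean number of drugs per combination
    "use_gpu": False,
    "patterns": {
        "n_patterns": 10,
        "min_rr": 1.1,
        "max_rr": 4.0,
        "mean_rx": 3,  # mean number of drugs per dangerous pattern
    },
    "disjoint_combinations": {"mean_rr": 1.0, "std_rr": 0.1},
    "inter_combinations": {"std_rr": 0.1},
}

# Serialize to a JSON string that could be saved as a config file.
config_json = json.dumps(config, indent=4)
print(config_json)
```

The resulting JSON text could then be saved to a file and passed to the script through `--config`.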
For the dangerous patterns, RRs are drawn from a uniform distribution over the interval [`patterns:min_rr`, `patterns:max_rr`], which facilitates the creation of datasets of varying difficulty levels.
For combinations that intersect with dangerous patterns, RRs are drawn from a normal distribution with a standard deviation of `inter_combinations:std_rr` and a mean calculated from the similarity between the combination and the dangerous patterns.
For combinations disjoint from the dangerous patterns, RRs are drawn from a normal distribution with a mean of `disjoint_combinations:mean_rr` and a standard deviation of `disjoint_combinations:std_rr`. Combinations related to a pattern will thus be closer to an RR predetermined by the configuration.
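The three sampling schemes described above can be sketched with the standard library as follows. The function names are hypothetical, and the way the mean for intersecting combinations is derived from similarity (here, the pattern's RR scaled by a similarity in [0, 1]) is an assumption made purely for illustration; the actual calculation is described in the paper.

```python
import random


def sample_pattern_rr(rng, min_rr, max_rr):
    """RR of a dangerous pattern: uniform over [min_rr, max_rr]."""
    return rng.uniform(min_rr, max_rr)


def sample_inter_rr(rng, pattern_rr, similarity, std_rr):
    """RR of a combination intersecting a pattern: normal distribution.

    ASSUMPTION: the mean is the pattern's RR scaled by a similarity
    in [0, 1]. This is an illustrative choice, not the project's formula.
    """
    mean = similarity * pattern_rr
    return rng.gauss(mean, std_rr)


def sample_disjoint_rr(rng, mean_rr, std_rr):
    """RR of a combination disjoint from all patterns: normal distribution."""
    return rng.gauss(mean_rr, std_rr)


rng = random.Random(0)  # fixed seed for reproducibility
pattern_rr = sample_pattern_rr(rng, min_rr=1.1, max_rr=4.0)
inter_rr = sample_inter_rr(rng, pattern_rr, similarity=0.8, std_rr=0.1)
disjoint_rr = sample_disjoint_rr(rng, mean_rr=1.0, std_rr=0.1)
print(pattern_rr, inter_rr, disjoint_rr)
```

Narrowing or widening the interval and the standard deviations changes how clearly pattern-related combinations stand out from the rest, which is one way to read "varying difficulty levels".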
- Generate dangerous patterns and associated risks randomly.
- Generate combinations.
- Generate risks based on the similarity between combinations and patterns.
This can be seen as a cut that overflows into other cuts, or as a tree. Each pattern is a root from which several combinations stem. A combination is associated with a pattern if the pattern is its nearest neighbor according to the Hamming distance. However, a combination can be placed in a separate set if no medication is shared between the combination and the nearest pattern.
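The assignment rule described above can be sketched as follows, with combinations and patterns represented as binary vectors like the rows of the dataset. The function names are hypothetical, and breaking ties by taking the first nearest pattern is an assumption of this sketch.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_to_pattern(combination, patterns):
    """Return the index of the nearest pattern by Hamming distance,
    or None if the combination shares no drug with that pattern.

    ASSUMPTION: ties are broken by taking the first nearest pattern.
    """
    nearest = min(range(len(patterns)),
                  key=lambda i: hamming(combination, patterns[i]))
    # A drug is shared when both vectors have a 1 in the same column.
    shares_drug = any(c and p for c, p in zip(combination, patterns[nearest]))
    return nearest if shares_drug else None


patterns = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
# Shares drugs 1 and 2 with the first pattern: assigned to pattern 0.
print(assign_to_pattern([1, 1, 1, 0, 0, 0], patterns))
# Shares no drug with its nearest pattern: placed in the separate set (None).
print(assign_to_pattern([0, 0, 1, 1, 0, 0], patterns))
```

Returning `None` here stands in for the separate set of combinations that are disjoint from the dangerous patterns.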
See our paper for more details.
- If generation gets stuck on "Regenerating bad combinations...", the average number of "possible" combinations may be smaller than the number of combinations being generated. In that case, increase the average number of Rx per combination; otherwise, generation will remain stuck in an infinite loop. To ensure a finite loop, it suffices to have:
where