Merge branch 'main' of https://github.com/Adibvafa/CodonTransformer

Adibvafa · Oct 29, 2024 · 023a905
2 parents 6fe99b2 + e01d63a
Showing 1 changed file with 18 additions and 0 deletions.
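
Note on the sampling arguments documented in this change: the table describes `deterministic`, `temperature`, and `top_p` only in prose. The sketch below shows one common way these three settings interact when picking a single token. It is a generic illustration, not CodonTransformer's code: the function name `sample_token`, its signature, and the choice to keep the token that crosses the `top_p` threshold are all assumptions made for this example.

```python
import math
import random


def sample_token(logits, temperature=0.2, top_p=0.95, deterministic=True, rng=None):
    """Pick one token index from raw logits.

    Hypothetical helper for illustration only; not part of CodonTransformer.
    """
    if deterministic:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    rng = rng or random.Random()

    # Temperature-scaled softmax; subtracting the max keeps exp() stable.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: keep the most likely tokens until their mass reaches top_p
    # (assumption: the token that crosses the threshold is kept).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

In this sketch a low temperature sharpens the distribution so that few tokens survive the `top_p` cut, while a higher temperature flattens it and lets more candidates through, which matches the "conservative" versus "more diversity" wording in the table.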
diff --git a/README.md b/README.md
@@ -108,6 +108,24 @@ A sample dataset is provided under [\demo](https://github.com/Adibvafa/CodonTran
 
 <br>
 
+### Arguments of predict_dna_sequence
+| Argument | Type | Description | Default |
+|----------|------|-------------|---------|
+| `protein` | `str` | Input protein sequence | Required |
+| `organism` | `Union[int, str]` | Organism ID (integer) or name (string) (e.g., "Escherichia coli general") | Required |
+| `device` | `torch.device` | PyTorch device object specifying whether to run on CPU or GPU | Required |
+| `tokenizer` | `Union[str, PreTrainedTokenizerFast, None]` | Either a file path to load tokenizer from, a pre-loaded tokenizer object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | `None` |
+| `model` | `Union[str, torch.nn.Module, None]` | Either a file path to load model from, a pre-loaded model object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | `None` |
+| `attention_type` | `str` | Type of attention mechanism to use in the model: 'block_sparse' for memory-efficient attention or 'original_full' for standard attention | `"original_full"` |
+| `deterministic` | `bool` | If True, uses deterministic decoding (picks most likely tokens). If False, samples tokens based on probabilities adjusted by temperature | `True` |
+| `temperature` | `float` | Controls randomness in non-deterministic mode. Lower values (e.g., 0.2) are conservative and favor high-probability tokens, while higher values (e.g., 0.8) allow more diversity. Must be positive | `0.2` |
+| `top_p` | `float` | Nucleus sampling threshold - only tokens with cumulative probability up to this value are considered. Balances diversity and quality of predictions. Must be between 0 and 1 | `0.95` |
+| `num_sequences` | `int` | Number of different DNA sequences to generate. Generating more than one sequence requires `deterministic=False`. Each sequence is sampled using the `temperature` and `top_p` parameters. Must be positive | `1` |
+| `match_protein` | `bool` | Constrains predictions to codons that translate back to the exact input protein sequence. Recommended only when using high temperatures or error-prone input proteins (e.g., sequences that do not start with methionine or that contain numerous repetitions) | `False` |
+
+Returns: `Union[DNASequencePrediction, List[DNASequencePrediction]]` containing predicted DNA sequence(s) and metadata.
+<br>
+
 ## Installation
 Install CodonTransformer via pip: