From 904ce6d44c6014616af642e4d22b8daff4e83608 Mon Sep 17 00:00:00 2001 From: Adibvafa Fallahpour <90617686+Adibvafa@users.noreply.github.com> Date: Tue, 29 Oct 2024 19:13:52 -0400 Subject: [PATCH 1/3] Update README.md --- README.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/README.md b/README.md index baebd02..c2b9977 100644 --- a/README.md +++ b/README.md @@ -108,6 +108,24 @@ A sample dataset is provided under [\demo](https://github.com/Adibvafa/CodonTran
+### Arguments of predict_dna_sequence +| Argument | Type | Description | Default | +|----------|------|-------------|---------| +| `protein` | `str` | Input protein sequence | Required | +| `organism` | `Union[int, str]` | Organism ID (integer) or name (string) (e.g., "Escherichia coli general") | Required | +| `device` | `torch.device` | PyTorch device object specifying whether to run on CPU or GPU | Required | +| `tokenizer` | `Union[str, PreTrainedTokenizerFast, None]` | Either a file path to load tokenizer from, a pre-loaded tokenizer object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | `None` | +| `model` | `Union[str, torch.nn.Module, None]` | Either a file path to load model from, a pre-loaded model object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | `None` | +| `attention_type` | `str` | Type of attention mechanism to use in model - 'block_sparse' for memory efficient or 'original_full' for standard attention | `"original_full"` | +| `deterministic` | `bool` | If True, uses deterministic decoding (picks most likely tokens). If False, samples tokens based on probabilities adjusted by temperature | `True` | +| `temperature` | `float` | Controls randomness in non-deterministic mode. Lower values (0.2) are conservative and pick high probability tokens, while higher values (0.8) allow more diversity. Must be positive | `0.2` | +| `top_p` | `float` | Nucleus sampling threshold - only tokens with cumulative probability up to this value are considered. Balances diversity and quality of predictions. Must be between 0 and 1 | `0.95` | +| `num_sequences` | `int` | Number of different DNA sequences to generate. Only works when deterministic=False. Each sequence will be sampled based on the temperature and top_p parameters. Must be positive | `1` | +| `match_protein` | `bool` | Constrains predictions to only use codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error prone input proteins (e.g. not starting with methionine or having numerous repitions) | `False` | + +Returns: `Union[DNASequencePrediction, List[DNASequencePrediction]]` containing predicted DNA sequence(s) and metadata +
+ ## Installation Install CodonTransformer via pip: From 54c64cfec5eddde60fc30914b0749314810f94b6 Mon Sep 17 00:00:00 2001 From: Adibvafa Fallahpour <90617686+Adibvafa@users.noreply.github.com> Date: Tue, 29 Oct 2024 19:14:29 -0400 Subject: [PATCH 2/3] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c2b9977..64a799a 100644 --- a/README.md +++ b/README.md @@ -121,7 +121,7 @@ A sample dataset is provided under [\demo](https://github.com/Adibvafa/CodonTran | `temperature` | `float` | Controls randomness in non-deterministic mode. Lower values (0.2) are conservative and pick high probability tokens, while higher values (0.8) allow more diversity. Must be positive | `0.2` | | `top_p` | `float` | Nucleus sampling threshold - only tokens with cumulative probability up to this value are considered. Balances diversity and quality of predictions. Must be between 0 and 1 | `0.95` | | `num_sequences` | `int` | Number of different DNA sequences to generate. Only works when deterministic=False. Each sequence will be sampled based on the temperature and top_p parameters. Must be positive | `1` | -| `match_protein` | `bool` | Constrains predictions to only use codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error prone input proteins (e.g. not starting with methionine or having numerous repitions) | `False` | +| `match_protein` | `bool` | Constrains predictions to only use codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error prone input proteins (e.g. not starting with methionine or having numerous repetitions) | `False` | Returns: `Union[DNASequencePrediction, List[DNASequencePrediction]]` containing predicted DNA sequence(s) and metadata
From e01d63a523729b03478701a42bbe03e54da5db9e Mon Sep 17 00:00:00 2001 From: Adibvafa Fallahpour <90617686+Adibvafa@users.noreply.github.com> Date: Tue, 29 Oct 2024 19:14:51 -0400 Subject: [PATCH 3/3] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 64a799a..bdf7bc5 100644 --- a/README.md +++ b/README.md @@ -123,7 +123,7 @@ A sample dataset is provided under [\demo](https://github.com/Adibvafa/CodonTran | `num_sequences` | `int` | Number of different DNA sequences to generate. Only works when deterministic=False. Each sequence will be sampled based on the temperature and top_p parameters. Must be positive | `1` | | `match_protein` | `bool` | Constrains predictions to only use codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error prone input proteins (e.g. not starting with methionine or having numerous repetitions) | `False` | -Returns: `Union[DNASequencePrediction, List[DNASequencePrediction]]` containing predicted DNA sequence(s) and metadata +Returns: `Union[DNASequencePrediction, List[DNASequencePrediction]]` containing predicted DNA sequence(s) and metadata.
## Installation