All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Add installation instructions using Pixi.
- Write the `min_number_genes` value to the parameters JSON file of the `summary` module.
- Set maximum `tensorflow` version to below `2.16`.
- Set the `break_on_hyphens` parameter of the `textwrap.fill` function to `False` to prevent line breaks at `-` characters. This ensures that sequences with gaps in FASTA files generated using `Sequence.__str__()` maintain a consistent line width.
- Compare `Enum` by identity in the `open_file` function.
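
The effect of `break_on_hyphens=False` can be illustrated with a minimal sketch. The helper name `wrap_sequence` and the example sequence are assumptions made for illustration; they are not taken from geNomad's code.

```python
import textwrap


def wrap_sequence(seq: str, width: int = 60) -> str:
    """Wrap a sequence to a fixed width without preferring breaks at '-'."""
    # With break_on_hyphens=False, a long unbroken string is cut strictly
    # every `width` characters, so gap characters do not shorten lines.
    return textwrap.fill(seq, width=width, break_on_hyphens=False)


# Hypothetical gapped sequence, as found in an alignment-derived FASTA record.
gapped = "ACGT-ACGT--ACGTACGT-AC"
fixed_lines = wrap_sequence(gapped, width=10).splitlines()

# With the default (break_on_hyphens=True), textwrap may break right after a
# hyphen instead, which can yield lines shorter than `width`.
default_lines = textwrap.fill(gapped, width=10).splitlines()
```

Every line in `fixed_lines` except the last has exactly `width` characters, which is the property the changelog entry describes.
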
- Added the `--min-number-genes` parameter to the `summary` module. This parameter allows users to set the minimum number of genes a sequence must encode to be considered for classification as a plasmid or virus. The default value is `1`. When `--conservative` is used, this parameter is set to `1`. When `--relaxed` is used, this parameter is set to `0`. This filter has no effect if the `annotate` module is not executed.
- Added a hyperlink to the official documentation in the help dialogue.
- The virus taxonomic lineage is presented using a fixed number of fields separated by semicolons (`;`). As a result, for genomes that could not be assigned to the family level (the most specific taxonomic rank), there will be trailing semicolons at the end of the lineage string.
- Do not apply the gene-based post-classification filters when the `annotate` module is not executed.
- Set the default value of `--min-plasmid-marker-enrichment` to `0.1`.
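
Downstream scripts that read the fixed-width lineage string may want to drop the empty trailing ranks. The following is a minimal sketch; the function name `parse_lineage` and the example lineage (including its number of fields) are illustrative assumptions, not output copied from geNomad.

```python
def parse_lineage(lineage: str) -> list[str]:
    """Split a semicolon-delimited lineage and drop empty trailing ranks."""
    fields = lineage.split(";")
    # Unassigned lower ranks show up as empty strings at the end.
    while fields and fields[-1] == "":
        fields.pop()
    return fields


# Hypothetical lineage with two unassigned trailing ranks.
example = "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;"
assigned = parse_lineage(example)
n_fields = len(example.split(";"))
```

Splitting without trimming keeps the field count constant across rows, which is convenient when loading the column into a table with one column per rank.
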
- Set maximum `keras` version to below `3.0`. This prevents errors due to incompatibility with `keras >=3.0`, such as the `shape` parameter not accepting an integer as input.
- Set the `CUDA_VISIBLE_DEVICES` environment variable to `-1` in `nn_classification`. This fixes a bug where the `nn_classification` module would fail to run when a GPU was available and the input had a single sequence.
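
As a rough sketch of the technique (not necessarily how geNomad applies it), hiding GPUs from a CUDA-aware library comes down to setting the environment variable before that library initializes. The helper name `hide_gpus` is an assumption made for this example.

```python
import os


def hide_gpus() -> None:
    """Hide CUDA devices from libraries that honor CUDA_VISIBLE_DEVICES."""
    # "-1" matches no device ID, so CUDA-aware libraries fall back to the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
# Import TensorFlow (or another CUDA-aware library) only after this point,
# because the variable is read when the CUDA runtime is initialized.
```
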
- Fixed the parsing of MMseqs2 integrase output to extract only the gene accession, rather than the entire header. This addresses a bug introduced in version 1.5.2, where the integrase gene accession was not accurately parsed because the entire header was extracted. As a result, the `find-proviruses` module can now properly add integrases to gene tables and extend boundaries using integrase coordinates.
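
The difference between a full header and a gene accession can be sketched as follows. This assumes the accession is the first whitespace-delimited token of the header, which is a common FASTA convention; the function name and the example header are hypothetical.

```python
def accession_from_header(header: str) -> str:
    """Return the first whitespace-delimited token of a sequence header."""
    # Assumption: everything after the first whitespace is a free-text description.
    return header.split(maxsplit=1)[0]


# Hypothetical header as it might appear in a search output column.
full_header = "contig_1_12 # 10450 # 11630 # -1"
accession = accession_from_header(full_header)
```
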
- Replace ambiguous variable name in `read_fasta`.
- Define name `current_contig` at the beginning of `_append_aragorn_tsv`.
- Set minimum `pyrodigal-gv` version to `0.3.1`. This fixes a bug introduced in `0.3.0` that led to the identification of RBS motifs not reported by Prodigal.
- Remove the `CCGGGG` RBS motif from the list of motifs.
- Add the `CCGGGG` RBS motif to the list of motifs.
- Do not include the stop codon (`*`) at the end of protein sequences.
- Set minimum `pyrodigal-gv` version to `0.2.0`.
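
Protein files produced by other tools or by older versions may still carry the trailing `*`. A minimal sketch for normalizing them is shown below; the function name `strip_stop_codon` is an assumption for this example.

```python
def strip_stop_codon(protein: str) -> str:
    """Remove a single trailing '*' (stop codon symbol), if present."""
    # str.removesuffix requires Python 3.9 or newer.
    return protein.removesuffix("*")


with_stop = "MKTAYIAKQR*"
without_stop = strip_stop_codon(with_stop)
```
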
- Replace `prodigal-gv` with `pyrodigal-gv`.
- The `mmseqs search` command has been replaced by a two-step alignment workflow. In the first alignment step, `--alignment-mode 1` and `--max-rejected` are used, while the second step uses `--alignment-mode 2` and `-c 0.2`. This change reduces the number of alignments that are rejected due to not meeting the minimum coverage cutoff and mitigates the issue where the annotation results change when the input sequence order is altered.
- The `--min-ungapped-score` parameter of `mmseqs prefilter` was increased from `20` to `25`.
- The `--max-rejected` parameter of the first `mmseqs align` step was increased from `225` to `280`.
- Replace `np.warnings` with `warnings` to add compatibility with `numpy >= 1.24`.
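
The replacement pattern relies only on the standard library `warnings` module. The sketch below uses a hypothetical `noisy_computation` function in place of a NumPy call so that it stays self-contained.

```python
import warnings


def noisy_computation() -> int:
    """Stand-in for a numerical routine that emits a RuntimeWarning."""
    warnings.warn("invalid value encountered", RuntimeWarning)
    return 1


def quiet_call() -> int:
    """Run the computation with warnings suppressed via the stdlib module."""
    # Code that previously reached this module through `np.warnings` can
    # import `warnings` directly and use it the same way.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return noisy_computation()


value = quiet_call()
```
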
- Update `numba` (`>=0.57`) and `numpy` (`>=1.21`) version requirements.
- Use `casefold` for sequence comparison within the `Sequence` class.
- Remove type annotations of methods of the `Sequence` class that return an instance of `Sequence`.
- Use `console.status` to log the deletion of the `.tar.gz` file during the execution of `download-database`.
- Make the conservative assignment at the family level optional via the `--conservative-taxonomy` parameter. This increases the number of viral genomes assigned to a family when executing geNomad with default parameters.
- Fix parameter names in the error message of `--conservative` and `--relaxed` (e.g. `--min_score` → `--min-score`).
- Display a progress bar showing the progress of the classification process in `nn-classification`.
- Update `README.md` to the database version 1.3.0.
- Make `mmseqs convertalis` output the whole sequence header instead of gene accessions. This prevents parsing conflicts with geNomad's other components in cases where MMseqs2 uses its built-in special parsers for specific header formats (e.g. RefSeq).
- Add the `--threads` parameter to the `nn-classification` module, which allows controlling the number of threads used for classifying sequences using the neural network model.
- Mention the post-classification filters in the `summary` module description.
- Given that geNomad applies a minimum score filter (since version 1.4.0), the help dialogue of the `--min-score` parameter was modified to remove the following sentence: "By default, the sequence is classified as virus/plasmid if its virus/plasmid score is higher than its chromosome score, regardless of the value".
- The following parameters were added to the MMseqs2 search command: `--max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225`. As a result, changing `--splits` won't affect the search results anymore.
- Mention Docker and the NMDC EDGE implementation in the `README.md`.
- Add the `--min-plasmid-hallmarks-short-seqs` and `--min-virus-hallmarks-short-seqs` parameters. These options allow filtering out short sequences (less than 2,500 bp) that don't encode a minimum number of hallmark genes. By default, short sequences need to encode at least one hallmark to be classified as a virus or a plasmid.
- Add the `--conservative` and `--relaxed` presets that control post-classification filters. The `--conservative` option makes those filters even more aggressive, resulting in more restricted sets of plasmids and viruses, containing only sequences whose classification is strongly supported. The `--relaxed` preset disables all post-classification filters.
- Windows with more than 4,000 Ns are ignored when encoding sequences for the neural network classification. The first window is always processed, regardless of the number of Ns.
- Changed the default value of `--min-score` from 0.0 to 0.7.
- Changed the default search sensitivity from 4.0 to 4.2.
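
The N-window rule described in this group of changes can be sketched roughly as follows. The function name, the non-overlapping tiling, and the default `window_size` are assumptions made for illustration; only the 4,000 N threshold and the "first window is always kept" rule come from the entry above.

```python
def windows_to_encode(seq: str, window_size: int = 6000, max_n: int = 4000) -> list[str]:
    """Return the windows that would be kept for encoding.

    Hypothetical sketch: windows are non-overlapping slices of `window_size`.
    """
    kept = []
    for index, start in enumerate(range(0, len(seq), window_size)):
        window = seq[start:start + window_size]
        n_count = window.upper().count("N")
        # The first window is always kept; later windows are skipped when
        # they contain more than `max_n` Ns.
        if index == 0 or n_count <= max_n:
            kept.append(window)
    return kept


# Small-scale example: 10 bp windows, at most 4 Ns per window.
toy = "N" * 10 + "N" * 10 + "ACGTACGTAC"
kept_windows = windows_to_encode(toy, window_size=10, max_n=4)
```
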
- Update `README.md` to version 1.4.0. This includes mentions of the `--conservative` and `--relaxed` flags and a warning about how changes in `--splits` can affect geNomad's output.
- Fix a bug in `score-calibration` that happened when `find-proviruses` was executed but no provirus was detected. The module now checks if proviruses were detected (using `utils.check_provirus_execution`) before counting the total number of sequences.
- Require `numpy <1.24`. Fixes #7, which occurs because `numba` doesn't support `numpy >=1.24` yet.
- Check if `find-proviruses` was executed when counting the number of sequences in the `score-calibration` module.
- Add support for AMR annotation.
- Update database parsing to allow BUSCO-based USCGs.
- Sequences with no terminal repeats will be flagged with `No terminal repeats`, as `Linear` can be misleading.
- Print the number of plasmids and viruses in the summary module.
- Set `click.rich_click.MAX_WIDTH` to `None`.
- Reduce the default `--sensitivity` to `4.0`.
- Update `README.md` to version 1.3.0.
- Set `prog_name` in `click.version_option`.
- Mention the Zenodo upload of geNomad's database in `README.md`.
- Add the following sentence to the help dialogue of the `--min-plasmid-marker-enrichment`, `--min-virus-marker-enrichment`, `--min-plasmid-hallmarks`, and `--min-virus-hallmarks` parameters: "This option will be ignored if the annotation module was not executed".
- Apply a uniform prior to the empirical sample composition in `score_batch_correction`. This will shrink the effect of calibration when the empirical composition distribution is very skewed.
- Reduce the `--min-score` in the `README.md` example to 0.7.
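
One common way to apply a uniform prior to an empirical composition is additive smoothing. The sketch below illustrates that general idea only; the function name, the `prior_weight` parameter, and the example counts are assumptions, and the formula used inside `score_batch_correction` may differ.

```python
def smoothed_composition(counts: list[float], prior_weight: float = 1.0) -> list[float]:
    """Blend empirical class counts with a uniform prior (additive smoothing)."""
    n_classes = len(counts)
    total = sum(counts) + prior_weight * n_classes
    # Each class receives `prior_weight` pseudo-observations, which pulls
    # extreme proportions toward 1 / n_classes.
    return [(count + prior_weight) / total for count in counts]


# Hypothetical, heavily skewed sample (e.g. chromosome, plasmid, virus counts).
skewed = [98, 1, 1]
raw = [count / sum(skewed) for count in skewed]
smoothed = smoothed_composition(skewed, prior_weight=10.0)
```

The dominant class proportion drops and the rare classes rise, so any correction derived from the composition becomes less extreme for skewed samples.
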
- Fix a bug in the score calibration module where the sample size was set to a constant value and the "Your sample has less than 1,000 sequences…" warning would always appear.
- Dockerfile for version 1.0.0.
- `Sequence` class: add support for `str` in `__eq__`.
- `Sequence` class: add a `__hash__` method.
- Compute marker enrichment in the `marker-classification` module.
- Add columns for plasmid and virus marker enrichment to the `_plasmid_summary.tsv` and `_virus_summary.tsv` files.
- Set `--min-plasmid-marker-enrichment` and `--min-virus-marker-enrichment` to `0` as default. This will alter the results when using default parameters.
- Add support for plasmid and virus hallmarks. Requires geNomad database v1.1.
- Add CONJscan annotations to `_plasmid_summary.tsv`. Requires geNomad database v1.1.
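
The `__eq__` and `__hash__` additions to the `Sequence` class can be pictured with a simplified stand-in. The class below is a hypothetical sketch, not geNomad's implementation; in particular, comparing and hashing by the sequence string alone is an assumption.

```python
class Sequence:
    """Simplified stand-in for a sequence record (hypothetical sketch)."""

    def __init__(self, header: str, seq: str):
        self.header = header
        self.seq = seq

    def __eq__(self, other: object) -> bool:
        # Accept both Sequence instances and plain strings.
        if isinstance(other, Sequence):
            return self.seq == other.seq
        if isinstance(other, str):
            return self.seq == other
        return NotImplemented

    def __hash__(self) -> int:
        # Hash on the same data used by __eq__ so that equal objects
        # (including equal plain strings) share a hash value.
        return hash(self.seq)


a = Sequence("seq_a", "ACGT")
b = Sequence("seq_b", "ACGT")
unique = {a, b}
```

Defining `__hash__` alongside `__eq__` matters because a Python class that overrides `__eq__` without defining `__hash__` becomes unhashable and cannot be stored in sets or used as a dictionary key.
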
- `Sequence` class: simplify `has_dtr` return statement.
- `Sequence` class: make `__repr__` more friendly for long sequences.
- `Sequence` class: rename the `id` property to `accession`.
- Amino acids are now written to `_provirus_aragorn.tsv`.
- Update the XGBoost model file to the `.ubj` format.
- Require `xgboost >=1.6`.
- The taxonomic lineage in `_taxonomy.tsv` and `_virus_summary.tsv` will use `Viruses` as the highest rank, instead of `root`.
- Change order of the columns in `_plasmid_summary.tsv` and `_virus_summary.tsv`.
- Explicitly set `fraction` to `0.5` in `taxopy.find_majority_vote`.
- tRNA coordinates are now 1-indexed.
- Write `summary_execution_info`.
- Fix a problem in `DatabaseDownloader.get_version` where only the major version was compared.
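
The kind of problem described for version comparison can be illustrated with a short sketch that compares all numeric components instead of only the first one. The function names are assumptions and do not correspond to methods of `DatabaseDownloader`.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Convert a dotted version string such as '1.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_outdated(local: str, remote: str) -> bool:
    """Return True when the local version is older than the remote one."""
    # Tuples compare element by element, so minor versions are considered.
    return parse_version(local) < parse_version(remote)


# Comparing only the major component would treat these as the same version.
same_major = parse_version("1.0")[0] == parse_version("1.1")[0]
needs_update = is_outdated("1.0", "1.1")
```

Comparing integer tuples also avoids the string-comparison pitfall where `"1.10"` sorts before `"1.9"`.
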
- First release.