Skip to content

Latest commit

 

History

History
128 lines (93 loc) · 5.34 KB

README.md

File metadata and controls

128 lines (93 loc) · 5.34 KB

AntPack

AntPack is a toolkit for antibody numbering, data processing, statistical inference and machine learning for antibody sequences. For usage, see the docs. It is currently in active development, so we are adding new features and making more improvements periodically. We will try to avoid breaking changes but nonetheless recommend checking the docs after you install a new version to be sure the component you are using is unaffected.

What's new in v0.3.6

v0.3.6 incorporates a couple of bug fixes and some improvements to the command line tool. It's also about 2x faster than the previous version. On an Intel-i7-13700K, the time taken to number 3500 sequences for different tools was as follows:

Tool Time (s)
ANARCI 45 +/- 1
AntPack 0.18 +/- 0.01
AbRSA 13.02 +-/ 0.01

Previous versions of AntPack achieved about 0.35 seconds. Finally, in previous versions you had to know whether the sequences you were using were single or paired. AntPack still has the SingleChainAnnotator and PairedChainAnnotator but PairedChainAnnotator has been made more flexible and better at dealing with single chains, so if your input data contains both paired and single chains, it's ok to default to using PairedChainAnnotator. (SingleChain will be much faster, so if you know you're dealing with single chains only, always use SingleChainAnnotator instead.) You can also use SingleChainAnnotator on chains that may be single or paired; if you set it to look for a specific chain type (e.g. heavy), it will extract that chain from paired sequences.

Coming soon

We're adding a GUI that can be launched with a single command from the terminal / Windows command line so that AntPack can be used either as a command line tool, a Python library or a GUI, depending on your needs and what you're comfortable with.

We're also planning to link AntPack to a searchable online database of human antibody sequences and add tools for fast sequence clustering and stability prediction -- coming soon.

Installation

pip install antpack

AntPack is distributed as a wheel precompiled for most platforms and CPython >= 3.8, so installation should be very straightforward. A source distribution is also available (C++17) in case there is any need to compile from source.

The only required dependency is numpy. In future, to use the GUI there will be a couple of additional optional dependencies (although this is not added yet).

Capabilities

Antibody numbering

Numbering antibody sequences is an important precursor for many statistical inference / machine learning applications. AntPack is orders of magnitude faster for numbering antibody sequences than existing tools in the literature (e.g. ANARCI, AbRSA), while providing >= reliability. AntPack also provides tools for merging a list of numbered sequences into an MSA and for easy extraction of specific CDRs and framework regions.

V / J genes

Identifying the most similar human V / J gene sequences is useful for a variety of purposes. AntPack provides tools for determining which human V and J gene sequences are most similar to the variable region chain provided as input.

Humanness and developability

Minimizing the risk of immunogenicity is important for selecting clinical candidates. In AntPack v0.1.0, we introduce a simple, fully interpretable generative model for human heavy and light chains that outperforms all comparators in the literature on a large held-out test set for distinguishing human sequences from those of other species. This scoring tool can be used to score sequences for humanness, suggest modifications to make them more human, identify liabilities, and generate highly human sequences that contain selected motifs.

Finding developability liabilities

Some sequence motifs are known to be associated with developability issues -- certain motifs are known, for example, to be prone to N-glycosylation or deamidation. AntPack provides a tool for finding these "liability" motifs in an input sequence. Note that that identifying liabilities through finding motifs in this way is known to be prone to false positives (an N-glycosylation motif, for example, will not always be glycosylated). Still, these kinds of alerts can be useful for making yourself aware of potential developability issues.

Licensing

AntPack is licensed under a GPL license, which means that you are free to use it for your own data analysis irrespective of whether you work on academic research or industrial R&D / QC.

If you are writing software intended for sale / distribution, however, any software you create that uses AntPack must also be open-source, must use the GPL license and must acknowledge AntPack appropriately. If you are interested in using AntPack in a closed-source software product, please contact us to obtain a version of AntPack licensed for this use.

Citing this work

If using AntPack in research intended for publication, please cite either the preprint:

Jonathan Parkinson and Wei Wang. 2024. For antibody sequence generative modeling, mixture models may be all you need. bioRxiv: https://doi.org/10.1101/2024.01.27.577555

or the final paper in Bioinformatics.