Byte-Pair Encoding (BPE) is a subword tokenization method widely used in natural language processing. This Python implementation is inspired by the paper Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016) and guided by Lei Mao's educational tutorial.
- Tokenization: Efficient tokenization using Byte-Pair Encoding.
- Vocabulary Management: Tools for managing and analyzing vocabulary.
- Token Pair Frequency: Calculate token pair frequencies for subword units.
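These features revolve around the classic BPE training loop: count the frequency of every adjacent token pair, merge the most frequent pair into a new subword unit, and repeat. The sketch below illustrates that loop on a toy corpus; the function names and data layout are illustrative only and are not this project's API.

```python
from collections import Counter

def get_pair_frequencies(vocab: dict) -> Counter:
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair: tuple, vocab: dict) -> dict:
    """Rewrite the vocabulary with every occurrence of `pair` merged into one symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Toy word-frequency table; symbols are space-separated, "</w>" marks word ends.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
for _ in range(3):  # three merges, analogous to --n_merges
    pairs = get_pair_frequencies(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, "->", vocab)
```

A greedy pair-frequency and merge loop of this kind is what the Sennrich et al. paper describes for learning subword units.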
To get started with Byte-Pair Encoder, follow these simple steps:
- Clone the Repository

  ```sh
  git clone https://github.com/teleprint-me/byte-pair.git
  ```

- Install Dependencies

  ```sh
  virtualenv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Run the Code

  ```sh
  python -m byte_pair.encode --input_file samples/taming_shrew.md --output_file local/vocab.json --n_merges 5000
  ```
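The run above writes its vocabulary to `local/vocab.json` (the `--output_file` argument). The exact schema of that file depends on the implementation, but since it is JSON, a quick, non-authoritative way to peek at the result is:

```python
import json

# Peek at the vocabulary file produced by byte_pair.encode.
# NOTE: the schema of vocab.json is implementation-defined; this only assumes
# that the file contains valid JSON.
with open("local/vocab.json", "r", encoding="utf-8") as file:
    vocab = json.load(file)

print(type(vocab).__name__)  # top-level JSON type
if isinstance(vocab, dict):
    print(len(vocab), "entries")
    for key in list(vocab)[:10]:  # first few entries
        print(repr(key), "->", vocab[key])
```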
For comprehensive usage instructions and options, run the built-in help:

```sh
python -m byte_pair.encode --help
```
Detailed information on how to use and contribute to the project is available in the documentation.
Contributions are welcome! If you have suggestions, bug reports, or improvements, please don't hesitate to submit issues or pull requests.
This project is licensed under the AGPL (GNU Affero General Public License). For detailed information, see the LICENSE file.
Special thanks to Lei Mao for the blog tutorial that inspired this implementation.
- Original Paper: A New Algorithm for Data Compression Optimization
- Johns Hopkins Paper: A Formal Perspective on Byte-Pair Encoding
- Amazon Research: A Statistical Extension of Byte-Pair Encoding