Byte-Pair Encoding (BPE) is a subword tokenization method widely used in natural language processing. This Python implementation is inspired by the paper Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016) and guided by Lei Mao's educational tutorial.
- Tokenization: Efficient tokenization using Byte-Pair Encoding.
- Vocabulary Management: Tools for managing and analyzing vocabulary.
- Token Pair Frequency: Calculate token pair frequencies for subword units.
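These features revolve around the classic BPE training loop: count the frequency of every adjacent token pair, merge the most frequent pair into a new subword unit, and repeat. The sketch below illustrates that loop on a toy corpus; the function names and data layout are illustrative only and are not this project's API.

```python
from collections import Counter

def get_pair_frequencies(vocab: dict) -> Counter:
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair: tuple, vocab: dict) -> dict:
    """Rewrite the vocabulary with every occurrence of `pair` merged into one symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Toy word-frequency table; symbols are space-separated, "</w>" marks word ends.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
for _ in range(3):  # three merges, analogous to --n_merges
    pairs = get_pair_frequencies(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, "->", vocab)
```

A greedy pair-frequency and merge loop of this kind is what the Sennrich et al. paper describes for learning subword units.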
To get started with Byte-Pair Encoder, follow these simple steps:
- Clone the Repository

  ```sh
  git clone https://github.com/teleprint-me/byte-pair.git
  ```

- Install Dependencies

  ```sh
  virtualenv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Run the Code

  ```sh
  python -m byte_pair.encode --input_file samples/taming_shrew.md --output_file local/vocab.json --n_merges 5000
  ```
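The run above writes its vocabulary to `local/vocab.json` (the `--output_file` argument). The exact schema of that file depends on the implementation, but since it is JSON, a quick, non-authoritative way to peek at the result is:

```python
import json

# Peek at the vocabulary file produced by byte_pair.encode.
# NOTE: the schema of vocab.json is implementation-defined; this only assumes
# that the file contains valid JSON.
with open("local/vocab.json", "r", encoding="utf-8") as file:
    vocab = json.load(file)

print(type(vocab).__name__)  # top-level JSON type
if isinstance(vocab, dict):
    print(len(vocab), "entries")
    for key in list(vocab)[:10]:  # first few entries
        print(repr(key), "->", vocab[key])
```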
For comprehensive usage instructions and options, run the built-in help:

```sh
python -m byte_pair.encode --help
```
Detailed information on how to use and contribute to the project is available in the documentation.
Contributions are welcome! If you have suggestions, bug reports, or improvements, please don't hesitate to submit issues or pull requests.
This project is licensed under the AGPL (GNU Affero General Public License). For detailed information, see the LICENSE file.
Special thanks to Lei Mao for the blog tutorial that inspired this implementation.
- Original Paper: A New Algorithm for Data Compression Optimization
- Johns Hopkins Paper: A Formal Perspective on Byte-Pair Encoding
- Amazon Research: A Statistical Extension of Byte-Pair Encoding