Commit

Merge pull request #42 from iscgar/support_bytes_and_bytearray
Add support for working with `bytes` and `bytearray`
itamarst authored Jan 5, 2024
2 parents 7a4c56b + 410d55f commit 58545e9
Showing 17 changed files with 573 additions and 83 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/main.yml
@@ -43,9 +43,9 @@ jobs:
pip install .
- name: "Tests"
run: |
-flake8 tests
-mypy --strict tests # indirect type annotation checking
-black --check tests
+flake8 pysrc tests
+mypy --strict pysrc tests
+black --check pysrc tests
pytest tests
- name: "Enable universal2 on Python >= 3.9 on macOS"
if: ${{ startsWith(matrix.os, 'macos') && matrix.python-version != '3.8' }}
4 changes: 4 additions & 0 deletions .gitignore
@@ -10,6 +10,10 @@

# Python bytecode files:
*.pyc
*.pyd

# Generated by Pytest
/.pytest_cache/

# Emacs junk:
*~
Binary file not shown.
Binary file modified .hypothesis/unicode_data/13.0.0/codec-utf-8.json.gz
Binary file not shown.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,9 @@
# Changelog

## 0.21.0

* Added support for searching `bytes`, `bytearray`, `memoryview`, and similar objects using the `BytesAhoCorasick` class.

## 0.20.0

* Added support for Python 3.12.
68 changes: 34 additions & 34 deletions Cargo.lock

Some generated files are not rendered by default.

36 changes: 29 additions & 7 deletions README.md
@@ -14,14 +14,16 @@ Found any problems or have any questions? [File an issue on the GitHub project](

## Quickstart <a name="quickstart"></a>

-The `ahocorasick_rs` library allows you to search for multiple strings ("patterns") within a haystack.
+The `ahocorasick_rs` library allows you to search for multiple strings ("patterns") within a haystack, or alternatively search multiple bytes.
For example, let's install the library:

```shell-session
$ pip install ahocorasick-rs
```

-Then, we can construct a `AhoCorasick` object:
+### Searching strings
+
+We can construct a `AhoCorasick` object:

```python
>>> import ahocorasick_rs
@@ -58,11 +60,29 @@ You can construct a `AhoCorasick` object from any iterable (including generators
['hello', 'world', 'hello']
```

### Searching `bytes` and other similar objects

You can also search `bytes`, `bytearray`, `memoryview`, and other objects supporting the Python buffer API.

```python
>>> patterns = [b"hello", b"world"]
>>> ac = ahocorasick_rs.BytesAhoCorasick(patterns)
>>> haystack = b"hello world"
>>> ac.find_matches_as_indexes(b"hello world")
[(0, 0, 5), (1, 6, 11)]
>>> patterns[0], patterns[1]
(b'hello', b'world')
>>> haystack[0:5], haystack[6:11]
(b'hello', b'world')
```

The `find_matches_as_strings()` API is not supported by `BytesAhoCorasick`.
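
Because only index tuples come back for bytes, the matched bytes have to be sliced out of the haystack by the caller. The following is a minimal sketch of that step and is not part of this change: the helper name `matched_slices` is hypothetical, and the index tuples are copied from the example above rather than computed, so the snippet runs without the library installed.

```python
from typing import List, Tuple, Union

# (pattern_index, start, end), with `end` exclusive, as in the example above.
Match = Tuple[int, int, int]
BytesLike = Union[bytes, bytearray, memoryview]


def matched_slices(haystack: BytesLike, matches: List[Match]) -> List[bytes]:
    """Return a copy of each matched byte range, sliced out of the haystack."""
    return [bytes(haystack[start:end]) for _, start, end in matches]


# Index tuples copied from the README example; not computed here.
haystack = b"hello world"
matches = [(0, 0, 5), (1, 6, 11)]
print(matched_slices(haystack, matches))
```

Slicing the haystack, rather than looking up `patterns[index]`, keeps the helper independent of how the patterns were stored.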

## Choosing the matching algorithm <a name="matching"></a>

### Match kind

-There are three ways you can configure matching in cases where multiple patterns overlap.
+There are three ways you can configure matching in cases where multiple patterns overlap, supported by both `AhoCorasick` and `BytesAhoCorasick` objects.
For a more in-depth explanation, see the [underlying Rust library's documentation of matching](https://docs.rs/aho-corasick/latest/aho_corasick/enum.MatchKind.html).

Assume we have this starting point:
@@ -127,7 +147,8 @@ This returns the leftmost-in-the-haystack matching pattern that is longest:

### Overlapping matches

-You can get all overlapping matches, instead of just one of them, but only if you stick to the default matchkind, `MatchKind.Standard`:
+You can get all overlapping matches, instead of just one of them, but only if you stick to the default matchkind, `MatchKind.Standard`.
+Again, this is supported by both `AhoCorasick` and `BytesAhoCorasick`.

```python
>>> from ahocorasick_rs import AhoCorasick
@@ -139,7 +160,7 @@ You can get all overlapping matches, instead of just one of them, but only if yo

## Additional configuration: speed and memory usage tradeoffs <a name="configuration2"></a>

-### Algorithm implementations: trading construction speed, memory, and performance
+### Algorithm implementations: trading construction speed, memory, and performance (`AhoCorasick` and `BytesAhoCorasick`)

You can choose the type of underlying automaton to use, with different performance tradeoffs.
The short version: if you want maximum matching speed, and you don't have too many patterns, try the `Implementation.DFA` implementation and see if it helps.
@@ -157,7 +178,7 @@ The underlying Rust library supports [four choices](https://docs.rs/aho-corasick
>>> ac = AhoCorasick(["disco", "disc"], implementation=Implementation.DFA)
```

-### Trading memory for speed
+### Trading memory for speed (`AhoCorasick` only)

If you use ``find_matches_as_strings()``, there are two ways strings can be constructed: from the haystack, or by caching the patterns on the object.
The former takes more work, the latter uses more memory if the patterns would otherwise have been garbage-collected.
@@ -171,7 +192,8 @@ You can control the behavior by using the `store_patterns` keyword argument to `

## Implementation details <a name="implementation"></a>

-* Matching releases the GIL, to enable concurrency.
+* Matching on strings releases the GIL, to enable concurrency.
+  Matching on bytes does not currently release the GIL, but see https://github.com/G-Research/ahocorasick_rs/issues/94 for a case where it could.
* Not all features from the underlying library are exposed; if you would like additional features, please [file an issue](https://github.com/g-research/ahocorasick_rs/issues/new) or submit a PR.
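
To illustrate why releasing the GIL matters, here is a hedged sketch (not part of this change) of fanning searches out over a thread pool. The helper `search_all` and the stand-in matcher `fake_find` are hypothetical names; in real use you would presumably pass a bound method such as `ac.find_matches_as_indexes` from an `AhoCorasick` object. Threads only yield a speedup when the callable releases the GIL, which per the notes above applies to string matching but not currently to bytes matching.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Match = Tuple[int, int, int]  # (pattern_index, start, end)


def search_all(
    find: Callable[[str], List[Match]],
    haystacks: Sequence[str],
    workers: int = 4,
) -> List[List[Match]]:
    """Apply `find` to every haystack on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(find, haystacks))


def fake_find(haystack: str) -> List[Match]:
    """Stand-in for a real matcher: reports the first "x" as pattern 0."""
    i = haystack.find("x")
    return [(0, i, i + 1)] if i >= 0 else []


results = search_all(fake_find, ["ax", "b", "x"])
print(results)
```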

## Benchmarks <a name="benchmarks"></a>
6 changes: 3 additions & 3 deletions justfile
@@ -20,9 +20,9 @@ install-dev-dependencies:
setup: venv install-dev-dependencies

lint:
-flake8 tests/
-black --check tests/
-mypy --strict tests
+flake8 pysrc tests/
+black --check pysrc tests/
+mypy --strict pysrc tests

test:
pytest tests/
8 changes: 6 additions & 2 deletions pyproject.toml
@@ -1,10 +1,14 @@
[build-system]
-requires = ["maturin>=0.14,<0.15"]
+requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[project]
name = "ahocorasick_rs"
-requires-python = ">=3.7"
+requires-python = ">=3.8"
+dependencies = [
+# Technically not necessary to run, only needed for type checking...
+"typing_extensions >= 4.6.0 ; python_version < '3.12'"
+]

[tool.maturin]
python-source = "pysrc/"
2 changes: 2 additions & 0 deletions pysrc/ahocorasick_rs/__init__.py
@@ -1,6 +1,7 @@
# Expose the Rust code:
from .ahocorasick_rs import (
AhoCorasick,
BytesAhoCorasick,
MatchKind,
Implementation,
)
@@ -12,6 +13,7 @@

__all__ = [
"AhoCorasick",
"BytesAhoCorasick",
"MatchKind",
"Implementation",
# Deprecated:
19 changes: 19 additions & 0 deletions pysrc/ahocorasick_rs/ahocorasick_rs.pyi
@@ -1,4 +1,12 @@
from __future__ import annotations

from typing import Optional, Iterable
import sys

if sys.version_info >= (3, 12):
from collections.abc import Buffer
else:
from typing_extensions import Buffer

class Implementation:
NoncontiguousNFA: Implementation
@@ -24,3 +32,14 @@ class AhoCorasick:
def find_matches_as_strings(
self, haystack: str, overlapping: bool = False
) -> list[str]: ...

class BytesAhoCorasick:
def __init__(
self,
patterns: Iterable[Buffer],
matchkind: MatchKind = MatchKind.Standard,
implementation: Optional[Implementation] = None,
) -> None: ...
def find_matches_as_indexes(
self, haystack: Buffer, overlapping: bool = False
) -> list[tuple[int, int, int]]: ...
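
The stub above types both patterns and haystacks as `Buffer`, that is, any object implementing the Python buffer protocol. As a rough illustration of which objects qualify, the sketch below probes for buffer support with the built-in `memoryview`. The helper `supports_buffer` is a hypothetical name and is not part of `ahocorasick_rs`; it only shows the protocol check, not how the extension consumes the buffer.

```python
import array
from typing import Any


def supports_buffer(obj: Any) -> bool:
    """Return True if `obj` exposes the buffer protocol."""
    try:
        # memoryview() raises TypeError for objects without buffer support;
        # the `with` block releases the view immediately afterwards.
        with memoryview(obj):
            return True
    except TypeError:
        return False


candidates = [b"hello", bytearray(b"hello"), memoryview(b"hello"),
              array.array("B", [104, 105]), "hello", [104, 105]]
for candidate in candidates:
    print(type(candidate).__name__, supports_buffer(candidate))
```

Note that `str` does not implement the buffer protocol, which is consistent with strings and bytes being handled by separate classes.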
2 changes: 1 addition & 1 deletion rust-toolchain.toml
@@ -1,3 +1,3 @@
[toolchain]
-channel = "1.73"
+channel = "1.75"
components = ["rustfmt", "clippy"]