Add C accelerator for `fnmatch` #121445

picnixz · 2024-07-06T19:26:57Z

Enhancement

By adding a C accelerator, it is possible to improve the running time of fnmatch.filter and fnmatch.translate (the previous proposal is no longer correct since fnmatch(3) does not produce the same results as Python) by 1.1x and 6.5x respectively.

Machine specs:

OS: openSUSE Leap 15.5 x86_64
Kernel: 5.14.21-150500.55.68-default
CPU: 13th Gen Intel i9-13980HX (32) @ 5.400GHz
gcc (SUSE Linux) 7.5.0

`fnmatch.translate`

Without --with-pydebug

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 1.26 us: 4.84x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.40 us               | 1.29 us: 4.97x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.17 us               | 303 ns: 7.16x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.00 us               | 286 ns: 7.00x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.04 us               | 459 ns: 6.63x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.14 us               | 685 ns: 7.50x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.26x faster          |
+------------------------------------------+-----------------------+-----------------------+

With --enable-optimizations

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.15 us               | 1.21 us: 5.07x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.45 us               | 1.24 us: 5.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.15 us               | 287 ns: 7.49x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.94 us               | 269 ns: 7.21x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 427 ns: 7.03x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.07 us               | 636 ns: 7.98x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.56x faster          |
+------------------------------------------+-----------------------+-----------------------+

Benchmark script

import pyperf

import _fnmatch
from fnmatch import _translate, _join_translated_parts

def translate_py(pattern):
    STAR = object()
    parts = _translate(pattern, STAR, '.')
    return _join_translated_parts(parts, STAR)

def translate_ref(loops, pattern):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        translated = translate_py(pattern)
    return pyperf.perf_counter() - t0

def translate_c(loops, pattern):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        res = _fnmatch.translate(pattern)
    return pyperf.perf_counter() - t0

PATTERNS = [
    'abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]',
    '!abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]',
    'a**?**cd**?**??k***',
    'a/**/b/**/c',
    'man/man1/bash.1',
    'a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**',
]

def bench(runner, func):
    for pattern in PATTERNS:
        runner.bench_time_func(pattern, func, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['ref', 'c'])
    args = runner.parse_args()

    if args.impl == 'ref':
        bench(runner, translate_ref)
    elif args.impl == 'c':
        bench(runner, translate_c)

`fnmatch.filter` (POSIX)

Here are the true timings for using fnmatch.filter. I think we can change the Python implementation as well but I did not bother doing it in the PR (too lazy...). By the way, the timings are for POSIX platforms. On non-posix platforms os.path.normcase is called on every name before performing the match, but if the POSIX implementation is faster, I doubt the non-posix one would be slower (the code is essentially the same). I did not include the DEBUG builds since they are not really of interest to us.

Without --with-pydebug

+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark         | fnmatch-filter-ref | fnmatch-filter-py     | fnmatch-filter-c      |
+===================+====================+=======================+=======================+
| small_match_none  | 353 ns             | 432 ns: 1.22x slower  | 383 ns: 1.08x slower  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some  | 429 ns             | 480 ns: 1.12x slower  | 440 ns: 1.03x slower  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all   | 485 ns             | 525 ns: 1.08x slower  | not significant       |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 2.46 us            | 2.36 us: 1.04x faster | 2.28 us: 1.08x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 3.22 us            | 2.90 us: 1.11x faster | 2.76 us: 1.16x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all  | 3.89 us            | 3.51 us: 1.11x faster | 3.33 us: 1.17x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none  | 25.5 us            | 22.6 us: 1.13x faster | 22.1 us: 1.16x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some  | 32.3 us            | 27.5 us: 1.18x faster | 26.8 us: 1.20x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all   | 38.8 us            | 33.0 us: 1.18x faster | 32.1 us: 1.21x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean    | (ref)              | 1.03x faster          | 1.09x faster          |
+-------------------+--------------------+-----------------------+-----------------------+

With --enable-optimizations

+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark         | fnmatch-filter-ref | fnmatch-filter-py     | fnmatch-filter-c      |
+===================+====================+=======================+=======================+
| small_match_none  | 355 ns             | 427 ns: 1.20x slower  | 364 ns: 1.03x slower  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some  | 437 ns             | 475 ns: 1.09x slower  | 427 ns: 1.02x faster  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all   | 476 ns             | 520 ns: 1.09x slower  | 468 ns: 1.02x faster  |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 2.50 us            | 2.37 us: 1.05x faster | 2.20 us: 1.14x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 3.17 us            | 2.79 us: 1.14x faster | 2.71 us: 1.17x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all  | 3.68 us            | 3.29 us: 1.12x faster | 3.11 us: 1.18x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none  | 24.9 us            | 21.5 us: 1.15x faster | 20.8 us: 1.20x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some  | 30.6 us            | 24.8 us: 1.23x faster | 24.5 us: 1.25x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all   | 35.8 us            | 29.6 us: 1.21x faster | 27.6 us: 1.30x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean    | (ref)              | 1.06x faster          | 1.13x faster          |
+-------------------+--------------------+-----------------------+-----------------------+

Benchmark script

import pyperf

import _fnmatch
import os
from fnmatch import _compile_pattern

def fnmatch_ref(loops, names, pat0):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        result = []
        pat = os.path.normcase(pat0)
        match = _compile_pattern(pat)
        for name in names:
            if match(name):
                result.append(name)
    return pyperf.perf_counter() - t0

def fnmatch_py(loops, names, pat0):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        pat = os.path.normcase(pat0)
        match = _compile_pattern(pat)
        res = list(filter(match, names))
    return pyperf.perf_counter() - t0

def fnmatch_c(loops, names, pat0):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        res = _fnmatch.filter(names, pat0)
    return pyperf.perf_counter() - t0

def bench(runner, func):
    base_data = ['Python', 'Ruby', 'Perl', 'Tcl']
    multipliers = [('small', 1), ('medium', 10), ('large', 100)]
    patterns = [('none', 'A*'), ('some', 'P*'), ('all', '*')]
    for data_label, data_size in multipliers:
        for pat_label, pattern in patterns:
            name = f'{data_label}_match_{pat_label}'
            data = base_data * data_size
            _compile_pattern.cache_clear()
            runner.bench_time_func(name, func, data, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['ref', 'py', 'c'])
    args = runner.parse_args()

    if args.impl == 'ref':
        bench(runner, fnmatch_ref)
    elif args.impl == 'py':
        bench(runner, fnmatch_py)
    elif args.impl == 'c':
        bench(runner, fnmatch_c)

Linked PRs

The text was updated successfully, but these errors were encountered:

dg-pb · 2024-07-10T08:37:13Z

Could you run benchmarks for more optimized python version?

normcase = os.path.normcase
os_path = os.path

def filter(names, pat):
    pat = normcase(pat)
    match = _compile_pattern(pat)
    if os_path is posixpath:
        return list(filter(match, names))
    else:
        ncased = map(match, map(normcase, names))
        return list(itl.compress(names, ncased))

Also, don't run list on return value of filter as it already returns a list:

res = list(filter(match, names))

picnixz · 2024-07-10T09:27:48Z

res = list(filter(match, names))

The 'filter' here is the built-in filter, not the fnmatch's filter. So I must use list. Your suggestion is the same as the fnmatch_py benchmark by the way. Note that the timings I shown are without any calls to normcase because I don't know how to use them in C (for now).

Also, for non-posixpath paths, I will not use them since the implementation is only available on Windows platforms. I'll first try to make the C implementation compile on the CI/CD (I can run it fine on my laptop but I didn't know how to correctly update the entire ecosystem, but I think I managed to today).

picnixz · 2024-07-11T12:22:31Z

What a blunder. I did not thought that:

#include <fnmatch.h>
#include <stdio.h>

void main() {
    int res = fnmatch("[a-z]", "a", 0);
    printf("match: %s\n", res == FNM_NOMATCH ? "no" : "yes");  // 'yes'
    res = fnmatch("[^-`]", "~", 0);
    printf("match: %s\n", res == FNM_NOMATCH ? "no" : "yes"); // 'yes'
}

but in Python... ~ would not match [^-`]. By looking at the fnmatch(3) implementation, I think the character classes are only allowed to be alpha numerics.

EDIT: let's forget about fnmatch(3) because the latter actually supports [[:alpha:]] classes which the Python implementation does not.

dg-pb · 2024-07-14T20:47:14Z

Also, for non-posixpath paths, I will not use them since the implementation is only available on Windows platforms.

It can be run on Unix too, but is only executed on windows. You can test it on Unix and it will be a good proxy for it.

picnixz · 2024-07-15T06:54:16Z

In the end everything is the same as in Python. I'll regenerate the benchmarks as well.

picnixz · 2024-07-28T14:12:32Z

Implementation of ideas from #122289 at the C level gave even better performances in a release (non-PGO) build:

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 1.26 us: 4.84x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.40 us               | 1.29 us: 4.97x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.17 us               | 303 ns: 7.16x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.00 us               | 286 ns: 7.00x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.04 us               | 459 ns: 6.63x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.14 us               | 685 ns: 7.50x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.26x faster          |
+------------------------------------------+-----------------------+-----------------------+

picnixz · 2024-08-20T14:03:47Z

And by implementing more ideas, I get in PGO builds

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.15 us               | 1.21 us: 5.07x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.45 us               | 1.24 us: 5.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.15 us               | 287 ns: 7.49x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.94 us               | 269 ns: 7.21x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 427 ns: 7.03x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.07 us               | 636 ns: 7.98x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.56x faster          |
+------------------------------------------+-----------------------+-----------------------+

We are going very fast now!

dg-pb · 2024-08-20T14:05:34Z

One question. Is the implementation actually 2K lines long?

picnixz · 2024-08-20T14:07:26Z

Yes, but there are a lot of comments. So it's probably 1k long I think (in those 2k, there are ~400 that are clinic, 150 that are tests).

dg-pb · 2024-08-20T14:15:10Z

Performance results look good. Hopefully code can be simplified though.

picnixz · 2024-08-20T14:20:33Z

The code could be simplified by removing the implementation of filter, fnmatch and fnmatchcase and only keep translate in C. The speedy implementation is only 500 lines long (translate.c). And there are some parts that I needed to C/C from unicodeobject.c since the latter does not export unicode_char.

dg-pb · 2024-08-20T14:35:19Z

That sounds sensible. It doesn't seem there is going to be any significant performance improvements from having any other functions written in C.

picnixz · 2024-08-20T14:37:14Z

Yes, I'm working on an alternate branch that removes those bits. We can always re-add them later but yes, the performances are really not that impressive (10% gain is still good but not that good).

picnixz · 2024-08-23T08:38:30Z

Following #123181 (comment) and #123181 (comment), I decided to close this feature. While the improvements are significant, they will not necessarily be that significant in production. I suggest we rather use #122289 which already bring an almost 2x improvement and is a lot easier to maintain.

picnixz added the type-feature A feature request or enhancement label Jul 6, 2024

bedevere-app bot mentioned this issue Jul 6, 2024

gh-121445: C implementation for fnmatch #121446

Closed

13 tasks

picnixz changed the title ~~Add C accelerator for fnmatch.filter & co.~~ Add C accelerator for fnmatch & co. Jul 11, 2024

picnixz changed the title ~~Add C accelerator for fnmatch & co.~~ Add C accelerator for fnmatch Jul 11, 2024

AlexWaygood added the performance Performance or resource usage label Jul 11, 2024

picnixz mentioned this issue Jul 25, 2024

Improve performances of fnmatch.translate #122288

Open

bedevere-app bot mentioned this issue Aug 20, 2024

gh-121445: C implementation for fnmatch.translate #123181

Closed

picnixz closed this as not planned Won't fix, can't repro, duplicate, stale Aug 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add C accelerator for `fnmatch` #121445

Add C accelerator for `fnmatch` #121445

picnixz commented Jul 6, 2024 •

edited by bedevere-app bot

Loading

dg-pb commented Jul 10, 2024 •

edited

Loading

picnixz commented Jul 10, 2024

picnixz commented Jul 11, 2024 •

edited

Loading

dg-pb commented Jul 14, 2024 •

edited

Loading

picnixz commented Jul 15, 2024

picnixz commented Jul 28, 2024 •

edited

Loading

picnixz commented Aug 20, 2024

dg-pb commented Aug 20, 2024

picnixz commented Aug 20, 2024 •

edited

Loading

dg-pb commented Aug 20, 2024

picnixz commented Aug 20, 2024

dg-pb commented Aug 20, 2024

picnixz commented Aug 20, 2024

picnixz commented Aug 23, 2024

Add C accelerator for fnmatch #121445

Add C accelerator for fnmatch #121445

Comments

picnixz commented Jul 6, 2024 • edited by bedevere-app bot Loading

Enhancement

fnmatch.translate

fnmatch.filter (POSIX)

Linked PRs

dg-pb commented Jul 10, 2024 • edited Loading

picnixz commented Jul 10, 2024

picnixz commented Jul 11, 2024 • edited Loading

dg-pb commented Jul 14, 2024 • edited Loading

picnixz commented Jul 15, 2024

picnixz commented Jul 28, 2024 • edited Loading

picnixz commented Aug 20, 2024

dg-pb commented Aug 20, 2024

picnixz commented Aug 20, 2024 • edited Loading

dg-pb commented Aug 20, 2024

picnixz commented Aug 20, 2024

dg-pb commented Aug 20, 2024

picnixz commented Aug 20, 2024

picnixz commented Aug 23, 2024

Add C accelerator for `fnmatch` #121445

Add C accelerator for `fnmatch` #121445

picnixz commented Jul 6, 2024 •

edited by bedevere-app bot

Loading

`fnmatch.translate`

`fnmatch.filter` (POSIX)

dg-pb commented Jul 10, 2024 •

edited

Loading

picnixz commented Jul 11, 2024 •

edited

Loading

dg-pb commented Jul 14, 2024 •

edited

Loading

picnixz commented Jul 28, 2024 •

edited

Loading

picnixz commented Aug 20, 2024 •

edited

Loading