Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add C accelerator for fnmatch #121445

Closed
picnixz opened this issue Jul 6, 2024 · 14 comments
Closed

Add C accelerator for fnmatch #121445

picnixz opened this issue Jul 6, 2024 · 14 comments
Labels
performance Performance or resource usage type-feature A feature request or enhancement

Comments

@picnixz
Copy link
Member

picnixz commented Jul 6, 2024

Enhancement

By adding a C accelerator, it is possible to improve the running time of fnmatch.filter and fnmatch.translate (the previous proposal is no longer correct since fnmatch(3) does not produce the same results as Python) by 1.1x and 6.5x respectively.

Machine specs:

OS: openSUSE Leap 15.5 x86_64
Kernel: 5.14.21-150500.55.68-default
CPU: 13th Gen Intel i9-13980HX (32) @ 5.400GHz
gcc (SUSE Linux) 7.5.0

fnmatch.translate

Without --with-pydebug
+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 1.26 us: 4.84x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.40 us               | 1.29 us: 4.97x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.17 us               | 303 ns: 7.16x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.00 us               | 286 ns: 7.00x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.04 us               | 459 ns: 6.63x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.14 us               | 685 ns: 7.50x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.26x faster          |
+------------------------------------------+-----------------------+-----------------------+
With --enable-optimizations
+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.15 us               | 1.21 us: 5.07x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.45 us               | 1.24 us: 5.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.15 us               | 287 ns: 7.49x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.94 us               | 269 ns: 7.21x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 427 ns: 7.03x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.07 us               | 636 ns: 7.98x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.56x faster          |
+------------------------------------------+-----------------------+-----------------------+
Benchmark script
import pyperf

import _fnmatch
from fnmatch import _translate, _join_translated_parts

def translate_py(pattern):
    STAR = object()
    parts = _translate(pattern, STAR, '.')
    return _join_translated_parts(parts, STAR)

def translate_ref(loops, pattern):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        translated = translate_py(pattern)
    return pyperf.perf_counter() - t0

def translate_c(loops, pattern):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        res = _fnmatch.translate(pattern)
    return pyperf.perf_counter() - t0

PATTERNS = [
    'abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]',
    '!abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]',
    'a**?**cd**?**??k***',
    'a/**/b/**/c',
    'man/man1/bash.1',
    'a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**',
]

def bench(runner, func):
    for pattern in PATTERNS:
        runner.bench_time_func(pattern, func, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['ref', 'c'])
    args = runner.parse_args()

    if args.impl == 'ref':
        bench(runner, translate_ref)
    elif args.impl == 'c':
        bench(runner, translate_c)

fnmatch.filter (POSIX)

Here are the true timings for using fnmatch.filter. I think we can change the Python implementation as well but I did not bother doing it in the PR (too lazy...). By the way, the timings are for POSIX platforms. On non-posix platforms os.path.normcase is called on every name before performing the match, but if the POSIX implementation is faster, I doubt the non-posix one would be slower (the code is essentially the same). I did not include the DEBUG builds since they are not really of interest to us.

Without --with-pydebug
+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark         | fnmatch-filter-ref | fnmatch-filter-py     | fnmatch-filter-c      |
+===================+====================+=======================+=======================+
| small_match_none  | 353 ns             | 432 ns: 1.22x slower  | 383 ns: 1.08x slower  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some  | 429 ns             | 480 ns: 1.12x slower  | 440 ns: 1.03x slower  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all   | 485 ns             | 525 ns: 1.08x slower  | not significant       |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 2.46 us            | 2.36 us: 1.04x faster | 2.28 us: 1.08x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 3.22 us            | 2.90 us: 1.11x faster | 2.76 us: 1.16x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all  | 3.89 us            | 3.51 us: 1.11x faster | 3.33 us: 1.17x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none  | 25.5 us            | 22.6 us: 1.13x faster | 22.1 us: 1.16x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some  | 32.3 us            | 27.5 us: 1.18x faster | 26.8 us: 1.20x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all   | 38.8 us            | 33.0 us: 1.18x faster | 32.1 us: 1.21x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean    | (ref)              | 1.03x faster          | 1.09x faster          |
+-------------------+--------------------+-----------------------+-----------------------+
With --enable-optimizations
+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark         | fnmatch-filter-ref | fnmatch-filter-py     | fnmatch-filter-c      |
+===================+====================+=======================+=======================+
| small_match_none  | 355 ns             | 427 ns: 1.20x slower  | 364 ns: 1.03x slower  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some  | 437 ns             | 475 ns: 1.09x slower  | 427 ns: 1.02x faster  |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all   | 476 ns             | 520 ns: 1.09x slower  | 468 ns: 1.02x faster  |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 2.50 us            | 2.37 us: 1.05x faster | 2.20 us: 1.14x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 3.17 us            | 2.79 us: 1.14x faster | 2.71 us: 1.17x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all  | 3.68 us            | 3.29 us: 1.12x faster | 3.11 us: 1.18x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none  | 24.9 us            | 21.5 us: 1.15x faster | 20.8 us: 1.20x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some  | 30.6 us            | 24.8 us: 1.23x faster | 24.5 us: 1.25x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all   | 35.8 us            | 29.6 us: 1.21x faster | 27.6 us: 1.30x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean    | (ref)              | 1.06x faster          | 1.13x faster          |
+-------------------+--------------------+-----------------------+-----------------------+
Benchmark script
import pyperf

import _fnmatch
import os
from fnmatch import _compile_pattern

def fnmatch_ref(loops, names, pat0):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        result = []
        pat = os.path.normcase(pat0)
        match = _compile_pattern(pat)
        for name in names:
            if match(name):
                result.append(name)
    return pyperf.perf_counter() - t0

def fnmatch_py(loops, names, pat0):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        pat = os.path.normcase(pat0)
        match = _compile_pattern(pat)
        res = list(filter(match, names))
    return pyperf.perf_counter() - t0

def fnmatch_c(loops, names, pat0):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        res = _fnmatch.filter(names, pat0)
    return pyperf.perf_counter() - t0

def bench(runner, func):
    base_data = ['Python', 'Ruby', 'Perl', 'Tcl']
    multipliers = [('small', 1), ('medium', 10), ('large', 100)]
    patterns = [('none', 'A*'), ('some', 'P*'), ('all', '*')]
    for data_label, data_size in multipliers:
        for pat_label, pattern in patterns:
            name = f'{data_label}_match_{pat_label}'
            data = base_data * data_size
            _compile_pattern.cache_clear()
            runner.bench_time_func(name, func, data, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['ref', 'py', 'c'])
    args = runner.parse_args()

    if args.impl == 'ref':
        bench(runner, fnmatch_ref)
    elif args.impl == 'py':
        bench(runner, fnmatch_py)
    elif args.impl == 'c':
        bench(runner, fnmatch_c)

Linked PRs

@picnixz picnixz added the type-feature A feature request or enhancement label Jul 6, 2024
@dg-pb
Copy link
Contributor

dg-pb commented Jul 10, 2024

Could you run benchmarks for more optimized python version?

normcase = os.path.normcase
os_path = os.path

def filter(names, pat):
    pat = normcase(pat)
    match = _compile_pattern(pat)
    if os_path is posixpath:
        return list(filter(match, names))
    else:
        ncased = map(match, map(normcase, names))
        return list(itl.compress(names, ncased))

Also, don't run list on return value of filter as it already returns a list:

res = list(filter(match, names))

@picnixz
Copy link
Member Author

picnixz commented Jul 10, 2024

res = list(filter(match, names))

The 'filter' here is the built-in filter, not the fnmatch's filter. So I must use list. Your suggestion is the same as the fnmatch_py benchmark by the way. Note that the timings I shown are without any calls to normcase because I don't know how to use them in C (for now).

Also, for non-posixpath paths, I will not use them since the implementation is only available on Windows platforms. I'll first try to make the C implementation compile on the CI/CD (I can run it fine on my laptop but I didn't know how to correctly update the entire ecosystem, but I think I managed to today).

@picnixz
Copy link
Member Author

picnixz commented Jul 11, 2024

What a blunder. I did not thought that:

#include <fnmatch.h>
#include <stdio.h>

void main() {
    int res = fnmatch("[a-z]", "a", 0);
    printf("match: %s\n", res == FNM_NOMATCH ? "no" : "yes");  // 'yes'
    res = fnmatch("[^-`]", "~", 0);
    printf("match: %s\n", res == FNM_NOMATCH ? "no" : "yes"); // 'yes'
}

but in Python... ~ would not match [^-`]. By looking at the fnmatch(3) implementation, I think the character classes are only allowed to be alpha numerics.

EDIT: let's forget about fnmatch(3) because the latter actually supports [[:alpha:]] classes which the Python implementation does not.

@picnixz picnixz changed the title Add C accelerator for fnmatch.filter & co. Add C accelerator for fnmatch & co. Jul 11, 2024
@picnixz picnixz changed the title Add C accelerator for fnmatch & co. Add C accelerator for fnmatch Jul 11, 2024
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Jul 11, 2024
@dg-pb
Copy link
Contributor

dg-pb commented Jul 14, 2024

Also, for non-posixpath paths, I will not use them since the implementation is only available on Windows platforms.

It can be run on Unix too, but is only executed on windows. You can test it on Unix and it will be a good proxy for it.

@picnixz
Copy link
Member Author

picnixz commented Jul 15, 2024

In the end everything is the same as in Python. I'll regenerate the benchmarks as well.

@picnixz
Copy link
Member Author

picnixz commented Jul 28, 2024

Implementation of ideas from #122289 at the C level gave even better performances in a release (non-PGO) build:

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 1.26 us: 4.84x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.40 us               | 1.29 us: 4.97x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.17 us               | 303 ns: 7.16x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.00 us               | 286 ns: 7.00x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.04 us               | 459 ns: 6.63x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.14 us               | 685 ns: 7.50x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.26x faster          |
+------------------------------------------+-----------------------+-----------------------+

@picnixz
Copy link
Member Author

picnixz commented Aug 20, 2024

And by implementing more ideas, I get in PGO builds

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-c   |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.15 us               | 1.21 us: 5.07x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.45 us               | 1.24 us: 5.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.15 us               | 287 ns: 7.49x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.94 us               | 269 ns: 7.21x faster  |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 427 ns: 7.03x faster  |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.07 us               | 636 ns: 7.98x faster  |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 6.56x faster          |
+------------------------------------------+-----------------------+-----------------------+

We are going very fast now!

@dg-pb
Copy link
Contributor

dg-pb commented Aug 20, 2024

One question. Is the implementation actually 2K lines long?

@picnixz
Copy link
Member Author

picnixz commented Aug 20, 2024

Yes, but there are a lot of comments. So it's probably 1k long I think (in those 2k, there are ~400 that are clinic, 150 that are tests).

@dg-pb
Copy link
Contributor

dg-pb commented Aug 20, 2024

Performance results look good. Hopefully code can be simplified though.

@picnixz
Copy link
Member Author

picnixz commented Aug 20, 2024

The code could be simplified by removing the implementation of filter, fnmatch and fnmatchcase and only keep translate in C. The speedy implementation is only 500 lines long (translate.c). And there are some parts that I needed to C/C from unicodeobject.c since the latter does not export unicode_char.

@dg-pb
Copy link
Contributor

dg-pb commented Aug 20, 2024

That sounds sensible. It doesn't seem there is going to be any significant performance improvements from having any other functions written in C.

@picnixz
Copy link
Member Author

picnixz commented Aug 20, 2024

Yes, I'm working on an alternate branch that removes those bits. We can always re-add them later but yes, the performances are really not that impressive (10% gain is still good but not that good).

@picnixz
Copy link
Member Author

picnixz commented Aug 23, 2024

Following #123181 (comment) and #123181 (comment), I decided to close this feature. While the improvements are significant, they will not necessarily be that significant in production. I suggest we rather use #122289 which already bring an almost 2x improvement and is a lot easier to maintain.

@picnixz picnixz closed this as not planned Won't fix, can't repro, duplicate, stale Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants