Injecting custom regex implementations #1142

Open
jemand771 opened this issue Jul 25, 2023 · 11 comments
Labels
Dialects v2: Issues which will likely be addressed as part of reworked dialect support
Enhancement: Some new desired functionality

Comments

@jemand771

Hi there,

this is a kind of mix between question and feature request.

I'm interested in a regex feature called possessive quantifiers. In short, these allow you to use the quantifiers *+, ++ and ?+. These act like their counterparts (*, +, ?), except that once they have matched, they will not backtrack. As an example, the pattern a?+a will not match the string a, because the regex engine doesn't backtrack "out of" the first a?. Kind of like a super-greedy match.

Python supports these starting with 3.11, but some of us are stuck on lower versions (3.9 in my case).
Is there an easy way to make the pattern property (or the keys of patternProperties) use the regex library instead of the builtin re? It's compatible with the builtin, but has some additional backported features like possessive quantifiers.

Ideally, I'd like some kind of optional argument where I can enable the 3rd party module and make python-jsonschema use that instead of the builtin. I don't think that should be the default, as you probably don't want "useless" extra dependencies.
Alternatively, patching this myself at runtime is probably possible, but it's not going to be pretty. If that's your recommendation, any hints about where to patch it?
As a final option, there are ways to mimic the behavior of possessive quantifiers with existing regex features, but that's not pretty either.
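
For reference, one common way to mimic a possessive quantifier on older Pythons is a lookahead wrapping a capture group, followed by a backreference to that group. Once a lookahead has succeeded, the engine does not backtrack into it, so the captured text is "locked in". A minimal sketch using only the builtin re (the helper name `matches` is made up for illustration):

```python
import re

# Emulates the possessive pattern `a?+a` without possessive syntax:
# (?=(a?)) greedily captures the optional "a" inside a lookahead,
# \1 then consumes exactly what was captured, with no way to give it back.
EMULATED = re.compile(r"(?=(a?))\1a")


def matches(text):
    """Return True if the emulated `a?+a` pattern is found anywhere in text."""
    return EMULATED.search(text) is not None


print(matches("a"))   # False: the lone "a" is taken by a? and not released
print(matches("aa"))  # True: a? takes the first "a", the literal takes the second
```

This works, but it gets unreadable quickly for anything larger than a toy pattern.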

@jemand771
Author

sorry if I'm getting lost in super-fine details, here's a minimal example:

import jsonschema
jsonschema.validate(
    dict(
        foo="aa"
    ),
    dict(
        type="object",
        properties=dict(
            foo=dict(
                type="string",
                pattern=r"a?+a"
            )
        )
    )
)

Works in Python 3.11 (returns None as expected), but crashes in 3.9. Stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/jsonschema/_format.py", line 137, in check
    result = func(instance)
  File "/usr/local/lib/python3.9/site-packages/jsonschema/_format.py", line 388, in is_regex
    return bool(re.compile(instance))
  File "/usr/local/lib/python3.9/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/usr/local/lib/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/lib/python3.9/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/lib/python3.9/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/lib/python3.9/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/local/lib/python3.9/sre_parse.py", line 672, in _parse
    raise source.error("multiple repeat",
re.error: multiple repeat at position 2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/jsonschema/validators.py", line 1298, in validate
    cls.check_schema(schema)
  File "/usr/local/lib/python3.9/site-packages/jsonschema/validators.py", line 297, in check_schema
    raise exceptions.SchemaError.create_from(error)
jsonschema.exceptions.SchemaError: 'a?+a' is not a 'regex'

Failed validating 'format' in metaschema['allOf'][1]['properties']['properties']['additionalProperties']['$dynamicRef']['allOf'][3]['properties']['pattern']:
    {'format': 'regex', 'type': 'string'}

On schema['properties']['foo']['pattern']:
    'a?+a'

Again, this isn't a library issue: Python's re module just doesn't support possessive quantifiers before 3.11.
re.compile("a?+a") in 3.9 gives a similar error, but regex.compile("a?+a") works.

@Julian
Member

Julian commented Jul 25, 2023

Hi there -- today no, but in the future yes (so happy to leave this open). The spec says that implementations SHOULD use a dialect of JavaScript regexes -- which we've never been able to do because no implementation of them was available to Python, but now there is one, so yes it's definitely planned to allow you to inject your own regular expression implementation (at which point sure you'd be able to inject this one too).

Though to do so will involve implementing a protocol most likely, since the libraries all have subtly different APIs.
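
To illustrate what such a protocol might look like, here is a rough sketch. Everything in it is hypothetical: `RegexEngine`, its two method names, and `StdlibEngine` are invented for this example and are not part of jsonschema.

```python
import re
from typing import Protocol, runtime_checkable


@runtime_checkable
class RegexEngine(Protocol):
    """Hypothetical interface a pluggable regex implementation could satisfy."""

    def is_valid_pattern(self, pattern: str) -> bool:
        """Return True if the pattern compiles under this engine."""
        ...

    def search(self, pattern: str, string: str) -> bool:
        """Return True if the pattern matches anywhere in the string."""
        ...


class StdlibEngine:
    """Adapter exposing the builtin re module through the interface above."""

    def is_valid_pattern(self, pattern: str) -> bool:
        try:
            re.compile(pattern)
        except re.error:
            return False
        return True

    def search(self, pattern: str, string: str) -> bool:
        return re.search(pattern, string) is not None


engine = StdlibEngine()
print(isinstance(engine, RegexEngine))  # structural check, no inheritance needed
```

An adapter for another library would implement the same two methods and hide that library's own API differences behind them.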

Julian added the Enhancement label on Jul 25, 2023
@jemand771
Author

oh, that sounds amazing! let's just leave this open for now then, I'll look for a different short-term workaround instead. thanks :)

jemand771 changed the title from "extra regex features using regex instead of re" to "Injecting custom regex implementations" on Jul 25, 2023
@sirosen
Member

sirosen commented Aug 15, 2023

@Julian, I just saw this and wanted to let you know that I'd be super-interested in trying out the python-ized regress in check-jsonschema at some point.

Right now there's a super gross hack I put in to let some common JS regexes pass the format check. That seems like a reasonable place to slot this in and start kicking the tires. I'll queue that up, though I'm not sure when I'll get to it.

patternProperties is the harder thing to manage. I'm not sure how best to slot that in.
Although it would be nice to have a single slot where you give jsonschema your regex implementation, it would be manageable to have format remain its own piece since it's already pretty pluggable.

On that last note! For the OP:

If you construct your own format checker, you should be able to slot in a customized regex check by applying the checks decorator to your desired method.
Here's the usage site in check-jsonschema where I use this exact approach.

@Julian
Member

Julian commented Aug 15, 2023

(As usual, helpful comments, and as usual I'm responding to just a bit to start, hah, but...)

  1. yes if you do try this out 1000% helpful
  2. yeah exactly, really this is needed for pattern (and patternProperties)! So at some point indeed I think one will give such an implementation in one place and the validators will use that for all their regex matching.

More comments of course welcome!

Julian added the Dialects v2 label on Nov 16, 2023
@sirosen
Member

sirosen commented Nov 29, 2023

Just a quick little update on this:

I've been using regress in the CLI for a while now for format validation and it's working great. No complaints from users and no issues across my own usages.

I have one outstanding issue which would potentially be handled by being able to use regress for pattern. I haven't taken a look at plugging it in yet, but if it's possible today, I'd be interested in giving it a try. I'm not sure if there's a way to do it by overriding the validator for pattern, but it seems possible in theory?

@Julian
Member

Julian commented Nov 29, 2023

Yeah essentially swapping out the whole definition of pattern via jsonschema.validators.extend would be one way with no internal changes. Obviously it could be made nicer to do so.

@choeppler

I've got a need for changing the regex implementation to be able to support unicode categories such as \p{L} in a json schema. An easy fix that works today would be to drop-in replace the standard library's re with https://pypi.org/project/regex/ module if present. Would changing the imports in https://github.com/python-jsonschema/jsonschema/blob/main/jsonschema/_format.py#L7 and

to

try:
    import regex as re
except ModuleNotFoundError:
    import re

an option? It works for my use case as tested with current versions of both libraries on Python 3.12. The conditional import may have to be changed for earlier Python versions...

@sirosen
Member

sirosen commented Nov 8, 2024

That's coincidentally related to some work going on with regress, where we're also looking at unicode mode regexes!

It would be bad and surprising if two different regex implementations were used by default internally. So we need to consider the format checker too.
Changing the regex library changes what matches and does not match, and the signal you get from format checking. So it would need to be a major release.

In aggregate, that makes me think that simply swapping things out would be a bad idea.

Once the regress work is done, I have follow up work in check-jsonschema to finish re-instrumenting the two different regex modes. I'll put a bit more polish on it and ideally be able to come back to this topic with links to a working example for what regex customization could look like.

I think my favorite idea for how this eventually looks is that you could do...

from jsonschema import regex_variants
...
validator = MyValidatorClass(regex_variant=regex_variants.REGRESS)

(i.e. we push at least a couple of implementations down into jsonschema)

Then, regress could become the default in a major release and -- here's a bit I like about that -- the path to upgrade but retain old behavior is open. You'd just have to start passing the variant explicitly.

I think you have to pass the variant implementation to the format checker as well? So that's not a great interface but it works.


Some of these ideas have been in my head for a few days but haven't been put to paper until just now.

@choeppler

choeppler commented Nov 8, 2024

@sirosen, thank you. After a first look, I now see it is not quite as easy as I thought.

My current issue is similar to python-jsonschema/check-jsonschema#353 (comment) and would be solved if I could replace the regex implementation used for pattern. To start with I do not have the need to replace the regex implementation everywhere in jsonschema including all dependent (and maybe custom) format checkers. So I am looking very much forward to the first step you describe in python-jsonschema/check-jsonschema#353 (comment)

@sirosen
Member

sirosen commented Jan 8, 2025

I've just done a check-jsonschema release (v0.31.0) which uses regress with flags="u" by default (unicode mode), and we'll see how it fares in the wild.

Regarding implementation, I have some notes which should hopefully be useful for jsonschema dev:

  • Defining the various regex variants themselves is quite simple. I went with a version where each variant defines three methods: check_format, pattern_keyword, and patternProperties_keyword. It may have some redundancy, but it's very easy to reason about. There are lots of other reasonable ways to do this too.
  • Plugging the regex variant into the validator class and format checker is pretty straightforward, but it's notable that Validator.FORMAT_CHECKER seems like a potential issue. To play it safe, I made a copy of FORMAT_CHECKER, but it seems like there's room for either (a) me to learn how to improve my usage or (b) some new API around building format checkers?
  • Validator.check_schema cannot be used if you have a custom regex engine (and want consistent behaviors). You need the metaschema validator to be customized with your regex variant, and its format checker customized as well. I ended up reimplementing check_schema, as it's quite small, but this looks to me like an issue to address.
  • It works! I think we can refine it, but we shouldn't forget this part of the message. Plugging regress into jsonschema is now fully demonstrated as a thing you can do!

EDIT: To clarify, when I said that FORMAT_CHECKER may be an issue, I meant that in reference to the fact that out of the box, if you try to get "the appropriate format checker for this validator", you're starting from an instance. And the FormatChecker.checks API for modification modifies that instance. I would like to see an API around FormatChecker similar to Validator.extend(), which would let users safely evolve a checker. I'm aware that there are some comments in _format.py and existing issues around this.

4 participants