Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use regress to implement JS regex usage for pattern and patternProperties + use unicode mode regexes by default #511

Merged
merged 12 commits into from
Jan 8, 2025

Conversation

sirosen
Copy link
Member

@sirosen sirosen commented Jan 1, 2025

Changelog

  • Regular expression interpretation in "pattern", "patternProperties", and "format": "regex" usages now uses unicode-mode JS regular expressions by default. (resolves Support ECMAScript unicode-mode RegExp usage for 'pattern' and 'patternProperties' #353)

    • Use --regex-variant nonunicode to get non-unicode JS regular expressions, the default behavior from previous versions.
    • Custom validators may be impacted by the new regular expression features. Validators are now always modified with the jsonschema library's extend() API to control the pattern and patternProperties keywords.

Notes

  • Right now, this is still using Validator.check_schema to do metaschema checks, which means the regex engine for metaschema checking is always python.
    • This has been addressed, adding some extra phrasing around metaschema checks to ensure that the correct regex engine is always used, including as a format checker.
    • When format checking is disabled, we currently fall-back to using the python-based format checker. This is a subtle flaw, in that it would be better to use the desired regex engine even when format checking is disabled. This issue is addressed as well.
  • There's no central source of truth for whether or not unicode mode regexes should be the default or not for implementations. Based on observation of the ecosystem, this change is guessing that "opt out of unicode mode" is better for users than "opt into unicode mode".
    • It's rare that schemas use unicode mode regex patterns by mistake. The major counterexample is the Azure Pipelines schema, which has many problems.
    • The VSCode language server is using unicode mode by default without it being a major issue. Other checkers are probably mostly using their language-native engines, so it's hard to baseline this against non-JS implementations for comparison.

Docs cover the expected usage, which is not implemented in this
initial change. The rename and alias is all that's covered here.
When an alternate RegexImplementation is selected, it will be passed
down to the code which builds a Validator. The Validator is extended
with the keyword validator provided by the RegexImplementation.

Because this uses the `extend()` interface, a test which subclassed a
validator broke -- this is documented in `jsonschema` as unsupported
usage, so the test simply had to be updated to use supported
interfaces.
Change RegexImplementation to dispatch between variants at init time,
so that each call to the relevant methods is faster and does not
re-check the variant.

Minor tool config fixes:
- flake8 ignore vs extend-ignore (whoops)
- tox.ini depends needed to run in parallel
This allows the `check_schema` call to succeed when it requires that
the regex implementation is regress. Patterns which can compile as
regress patterns are thereby supported.

Add initial acceptance tests for regress 'pattern' support, which
revealed this issue.
This is a variant of the regress engine usage in which the `u` flag is
not set. It is needed for usages which rely on unicode escapes *not*
being interpreted. In particular, the azure pipelines schema won't
pass metaschema validation (let alone apply correctly) if `pattern`
interpretation uses unicode-mode regexes.
Primarily, make the regress forms inherit and share most of their
code.
@sirosen sirosen force-pushed the use-regress-for-patterns branch from bac71eb to e9aa64e Compare January 1, 2025 19:53
Rather than relying directly on `validator_cls.check_schema`,
reimplement this functionality in-tree here to allow for
customizations to the validator class before it is run on the user's
schema.

A new test uses an invalid regex under the regress unicode engine
which is valid in the python engine to ensure that consistent checking
is applied.

Manual testing revealed that the `_fail()` message production for
SchemaError was showing error information twice, which is fixed
without a new test to guarantee the new behavior.
@sirosen sirosen force-pushed the use-regress-for-patterns branch from 62aedbd to cc12d98 Compare January 8, 2025 03:24
In the implementation prior to this change, passing
`--disable-formats` would impact not only the "actual" schema
validator, but also the validator built to evaluate the schema against
its metaschema. As a result, `--disable-formats *` and similar would
enable schemas to run which previously should have been caught as
invalid.

Furthermore, the customized format checker which had extensions for
date and time evaluation added was used, and any other customizations
to format checking would implicitly be shared with the metaschema
check.

To resolve, refactor format checker building to allow it to be used
more directly for the metaschema check, and add test cases to confirm
that a bad regex in a `pattern` is always rejected, even when
`--disable-formats regex` or similar is used.
@sirosen sirosen merged commit 8da1fef into main Jan 8, 2025
45 checks passed
@sirosen sirosen deleted the use-regress-for-patterns branch January 8, 2025 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support ECMAScript unicode-mode RegExp usage for 'pattern' and 'patternProperties'
1 participant