re matching looks broken for unicode #6860

Closed
kevinjwalters opened this issue Sep 3, 2022 · 4 comments

@kevinjwalters

CircuitPython version

Adafruit CircuitPython 7.3.3 on 2022-08-29; Adafruit MagTag with ESP32S2

Code/REPL

import re

text1 = "CircuitPython loves regular expressions"
text2 = "CircuitPython \u2764 regular expressions"

regex1 = re.compile("re")

for text in (text1, text2):
    match = regex1.search(text)
    print(text)
    print("Match:", text[match.start():match.end()])
    print()

Behavior

match.start() and match.end() return the wrong offsets when non-ASCII characters precede the match, probably because bytes are counted instead of characters in the variable-length UTF-8 encoding.
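A minimal sketch of that hypothesis, runnable on desktop CPython (where re reports character offsets for str). It assumes the buggy offsets are byte positions in the UTF-8 encoding; the three-byte ❤ then shifts them by two relative to the character indices, which would explain slicing out "gu" instead of "re". The variable names are illustrative only.

```python
import re

text = "CircuitPython \u2764 regular expressions"
data = text.encode("utf-8")

# Character offsets of the first "re" (what CPython's re reports for a str).
char_match = re.search("re", text)
char_span = (char_match.start(), char_match.end())

# Byte offsets of the same match within the UTF-8 encoded data.
byte_match = re.search(b"re", data)
byte_span = (byte_match.start(), byte_match.end())

print(char_span, byte_span)

# Slicing the str with byte offsets lands two characters too far right.
wrong = text[byte_span[0]:byte_span[1]]
print(wrong)
```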

Description

No response

Additional information

Output from a MagTag:

Adafruit CircuitPython 7.3.3 on 2022-08-29; Adafruit MagTag with ESP32S2
>>>
soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
CircuitPython loves regular expressions
Match: re

CircuitPython ❤ regular expressions
Match: gu


Code done running.
@jepler
Member

jepler commented Sep 3, 2022

I've reproduced and reported the bug in MicroPython. Thank you for this report.

@dhalbert dhalbert added this to the Long term milestone Sep 5, 2022
@michalpokusa

I also encountered this issue while working on a proof of concept of adafruit_template_engine.

Until this is fixed, one workaround is to slice the UTF-8 encoded bytes with the match offsets and decode the result:

text.encode("utf-8")[match.start():match.end()].decode("utf-8")

This is not an elegant solution, but might be the only way of doing it right now.
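The workaround could be wrapped in a small helper so call sites stay readable. This is only a sketch: matched_text and its offsets_are_bytes flag are hypothetical names, not part of any library. The usage lines simulate byte offsets on desktop CPython by matching against the encoded bytes.

```python
import re

def matched_text(text, match, offsets_are_bytes=True):
    """Return the part of `text` covered by `match` (hypothetical helper)."""
    if offsets_are_bytes:
        # Affected builds: offsets index into the UTF-8 bytes, so slice those.
        return text.encode("utf-8")[match.start():match.end()].decode("utf-8")
    # Fixed builds and CPython: offsets are character indices.
    return text[match.start():match.end()]

text = "CircuitPython \u2764 regular expressions"

# Simulate byte offsets on CPython by searching the encoded bytes.
m = re.search(b"re", text.encode("utf-8"))
result = matched_text(text, m)
print(result)
```

Once the firmware is updated to a fixed release, the flag can be set to False or the helper dropped in favor of plain slicing.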

@jepler
Member

jepler commented Sep 15, 2023

It's worth noting that we should get MicroPython's bug fix for this when we merge MicroPython 1.20, which is planned for CircuitPython 9.

@dhalbert
Collaborator

Re-tested. This is fixed as of 9.0.0-alpha.2
