\w not helpful for non-Roman scripts #44795

Closed

nathanlmiles mannequin opened this issue Apr 2, 2007 · 15 comments
nathanlmiles mannequin commented Apr 2, 2007

BPO 1693050
Nosy @malemburg, @loewis, @terryjreedy, @vstinner, @ezio-melotti
Superseder
  • bpo-24194: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-03-15.00:32:39.376>
    created_at = <Date 2007-04-02.15:27:11.000>
    labels = ['expert-regex']
    title = '\\w not helpful for non-Roman scripts'
    updated_at = <Date 2018-03-15.00:32:39.375>
    user = 'https://bugs.python.org/nathanlmiles'

    bugs.python.org fields:

    activity = <Date 2018-03-15.00:32:39.375>
    actor = 'terry.reedy'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-03-15.00:32:39.376>
    closer = 'terry.reedy'
    components = ['Regular Expressions']
    creation = <Date 2007-04-02.15:27:11.000>
    creator = 'nathanlmiles'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 1693050
    keywords = []
    message_count = 15.0
    messages = ['31688', '31689', '76556', '76557', '81221', '190075', '190100', '190219', '190226', '190268', '190322', '190323', '190324', '190326', '313849']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'vstinner', 'nathanlmiles', 'rsc', 'timehorse', 'ezio.melotti', 'mrabarnett', 'l0nwlf']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '24194'
    type = None
    url = 'https://bugs.python.org/issue1693050'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']


    nathanlmiles mannequin commented Apr 2, 2007

    When I try to use r"\w+(?u)" to find words in Unicode Devanagari text, bad things happen: words get chopped into small pieces. I think this is likely because vowel signs such as U+093E are not considered to match \w.

    I think that if you wish \w to be useful for Indic
    scripts, \w will need to be expanded to include the Unicode character categories Mc, Mn, and Me.

    I am using Python 2.4.4 on Windows XP SP2.

    I ran the following script to list the characters that I think ought to match \w but don't:

    import re
    import unicodedata

    text = ""
    for i in range(0x901, 0x939): text += unichr(i)
    for i in range(0x93c, 0x93d): text += unichr(i)
    for i in range(0x93e, 0x94d): text += unichr(i)
    for i in range(0x950, 0x954): text += unichr(i)
    for i in range(0x958, 0x963): text += unichr(i)

    parts = re.findall("\W(?u)", text)
    for ch in parts:
        print "%04x" % ord(ch), unicodedata.category(ch)

    The odd character here is U+0904. Its categorization seems to imply that you are using the Unicode 3.0 database, but perhaps later versions of Python use the current 5.0 database.
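
    For readers on current Python, a rough Python 3 port of the same check (a sketch; unichr and the print statement above are Python 2 syntax, and str patterns are Unicode-aware by default, so (?u) is unnecessary):

    import re
    import unicodedata

    text = "".join(chr(i) for i in range(0x901, 0x939))
    text += "".join(chr(i) for i in range(0x93c, 0x93d))
    text += "".join(chr(i) for i in range(0x93e, 0x94d))
    text += "".join(chr(i) for i in range(0x950, 0x954))
    text += "".join(chr(i) for i in range(0x958, 0x963))

    # Every codepoint reported here is one the author expected \w to match.
    for ch in re.findall(r"\W", text):
        print("%04x" % ord(ch), unicodedata.category(ch))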

    nathanlmiles mannequin added the topic-regex label Apr 2, 2007
    @malemburg (Member) commented:

    Python 2.4 is using Unicode 3.2. Python 2.5 ships with Unicode 4.1.

    We're likely to ship Unicode 5.x with Python 2.6 or a later release.

    Regarding the char classes: I don't think Mc, Mn and Me should be considered parts of a word. Those are marks which usually separate words.

    @terryjreedy (Member) commented:

    Vowel 'marks' are condensed vowel characters; they are very much part of
    words and do not separate words. Python 3 properly includes Mn and Mc as
    identifier characters.

    http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords

    For instance, the word 'hindi' has 3 consonants 'h', 'n', 'd', 2 vowels
    'i' and 'ii' (long i) following 'h' and 'd', and a null vowel (virama)
    after 'n'. [The null vowel is needed because the absence of a vowel mark
    indicates the default vowel, short a; without it, the word would read 'hinadii'.]
    The difference between the Devanagari vowel characters, used at the
    beginning of words, and the vowel marks, used thereafter, is purely
    graphical, not phonological. In short, in the Sanskrit family:
    word = syllable+
    syllable = vowel | consonant + vowel mark

    From a comp.lang.python post asking why re does not see the word for Hindi as a word:

    हिन्दी
    ह DEVANAGARI LETTER HA (Lo)
    ि DEVANAGARI VOWEL SIGN I (Mc)
    न DEVANAGARI LETTER NA (Lo)
    ् DEVANAGARI SIGN VIRAMA (Mn)
    द DEVANAGARI LETTER DA (Lo)
    ी DEVANAGARI VOWEL SIGN II (Mc)

    .isalpha() and possibly other Unicode methods need fixing also:
    >>> 'हिन्दी'.isalpha()  # 2.x and 3.0
    False


    loewis mannequin commented Nov 28, 2008

    Unicode TR#18 defines \w as a shorthand for

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

    which would include all marks. We should recursively check whether we
    follow the recommendation (e.g. \p{alpha} refers to all characters having
    the Alphabetic derived core property, which is Lu+Ll+Lt+Lm+Lo+Nl +
    Other_Alphabetic, where Other_Alphabetic is a selected list of
    additional characters, all from Mn/Mc).
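
    A minimal sketch of that definition using only the stdlib (an illustration, not the re module's implementation; it approximates \p{alpha} with the L* and Nl categories and ignores Other_Alphabetic):

    import re
    import sys
    import unicodedata

    # Lu/Ll/Lt/Lm/Lo/Nl approximate \p{alpha}; Mn/Mc/Me are gc=Mark;
    # Nd is \p{digit}; Pc is gc=Connector_Punctuation.
    WORD_CATS = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl',
                 'Mn', 'Mc', 'Me', 'Nd', 'Pc'}

    def word_class():
        # Collapse consecutive matching codepoints into ranges so the
        # resulting character class stays a manageable size.
        ranges, start, prev = [], None, None
        for cp in range(sys.maxunicode + 1):
            if unicodedata.category(chr(cp)) in WORD_CATS:
                if start is None:
                    start = cp
                prev = cp
            elif start is not None:
                ranges.append((start, prev))
                start = None
        if start is not None:
            ranges.append((start, prev))
        return '[' + ''.join('\\U%08x-\\U%08x' % r for r in ranges) + ']'

    TR18_WORD = re.compile(word_class() + '+')
    print(TR18_WORD.match('हिन्दी').group())  # matches all six codepoints, unlike \w+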


    mrabarnett mannequin commented Feb 5, 2009

    In issue bpo-2636 I'm using the following:

    Alpha is Ll, Lo, Lt, Lu.
    Digit is Nd.
    Word is Ll, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc.

    These are what are specified at
    http://www.regular-expressions.info/posixbrackets.html
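
    For the Devanagari example above, every codepoint does fall into that Word set; a quick check with unicodedata (a sketch) confirms it:

    import unicodedata

    WORD_CATS = {'Ll', 'Lo', 'Lt', 'Lu', 'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc'}
    print(all(unicodedata.category(ch) in WORD_CATS for ch in 'हिन्दी'))  # True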


    BreamoreBoy mannequin commented May 26, 2013

    Am I correct in saying that this must stay open because it targets the re module, even though, as shown in msg81221, this is fixed in the new regex module?


    mrabarnett mannequin commented May 26, 2013

    I had to check what re does in Python 3.3:

    >>> print(len(re.match(r'\w+', 'हिन्दी').group()))
    1

    The regex module does this:

    >>> print(len(regex.match(r'\w+', 'हिन्दी').group()))
    6
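
    (regex here is the third-party module from PyPI, installable with "pip install regex"; re is the stdlib module.)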


    timehorse mannequin commented May 28, 2013

    Matthew, I think that is considered a single word in Sanskrit or Thai, so Python 3.x is correct. In this case you've written the Sanskrit word for Hindi.


    mrabarnett mannequin commented May 28, 2013

    I'm not sure what you're saying.

    The re module in Python 3.3 matches only the first codepoint, treating the second codepoint as not part of a word, whereas the regex module matches all 6 codepoints, treating them all as part of a single word.


    timehorse mannequin commented May 29, 2013

    Maybe you could show us the byte-for-byte hex of the string you're testing, so we can examine whether it really contains a codepoint intended as a word boundary or just a codepoint that begins a new character.


    mrabarnett mannequin commented May 29, 2013

    You could've obtained it from msg76556 or msg190100:

    >>> print(ascii('हिन्दी'))
    '\u0939\u093f\u0928\u094d\u0926\u0940'
    >>> import re, regex
    >>> print(ascii(re.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
    '\u0939'
    >>> print(ascii(regex.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
    '\u0939\u093f\u0928\u094d\u0926\u0940'


    timehorse mannequin commented May 29, 2013

    Thanks Matthew, and sorry to put you through more work; I just wanted to verify exactly which Unicode codepoints (UTF-16, I take it) were being used, to check whether the Unicode standard expects them to be treated as separate words or as single letters within a word. Sanskrit is written in an alphabet, not an ideographic script, so each symbol is considered a letter. So I believe your implementation is correct and yes, you are right, re is at fault. That sequence contains only letters and accenting characters, so it should be interpreted as a single word of 6 letters, as you determined, not just the first letter. Mind you, I misinterpreted msg190100: I thought you were using findall, in which case the answer should be 1; but as the length of the extracted match, yes, 6, I totally agree. Sorry for the misunderstanding. http://www.unicode.org/charts/PDF/U0900.pdf contains the code chart for Devanagari (used for Hindi).


    mrabarnett mannequin commented May 29, 2013

    UTF-16 has nothing to do with it; that's just an encoding (a pair of them, actually: UTF-16LE and UTF-16BE).

    And I don't know why you thought I was using findall in msg190100 when the examples were using match! :-)

    @vstinner (Member) commented:

    Let's look at Modules/_sre.c:

    #define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
    #define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')

    >>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
    [True, False, True, False, True, False]
    >>> import unicodedata
    >>> [unicodedata.category(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
    ['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']

    So the match ends at U+093F because its category is "spacing combining" (Mc), which is part of the Mark category, whereas the re module expects an alphanumeric character.
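
    A rough Python-level equivalent of those macros (a sketch based on CPython's definition of Py_UNICODE_ISALNUM as alpha/decimal/digit/numeric; not the actual C code path):

    def sre_uni_is_word(ch):
        # Combining marks are neither alphabetic nor numeric under these
        # predicates, which is why the match stops at U+093F.
        return (ch.isalpha() or ch.isdecimal() or ch.isdigit()
                or ch.isnumeric() or ch == '_')

    print([sre_uni_is_word(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940'])
    # [True, False, True, False, True, False]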

    msg76557:

    """
    Unicode TR#18 defines \w as a shorthand for

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}
    """

    So if we want to respect this standard, the re module needs to be modified to accept other Unicode categories.

    @terryjreedy (Member) commented:

    Whatever I may have said before, I favor supporting the Unicode standard for \w, which is related to the standard for identifiers.

    This is one of 2 issues about \w being defined too narrowly. I am somewhat arbitrarily closing this as a duplicate of bpo-12731 (fewer digits ;-).

    There are 3 issues about tokenize.tokenize failing on valid identifiers, defined as \w sequences whose first char is itself an identifier (and therefore a start char). In msg313814 of bpo-32987, Serhiy indicates which start and continue identifier characters are matched by \W for re and regex. I am leaving bpo-24194 open as the tokenizer name issue.
