Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

There is no underscore in the character class in the regular expression capture for charset detection in URL previews #10307

Closed
srividyut opened this issue Jul 3, 2021 · 6 comments · Fixed by #10410
Labels
good first issue Good for newcomers S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@srividyut
Copy link
Contributor

srividyut commented Jul 3, 2021

There is no underscore in the character class in the regular expression capture for charset detection >> There is no underscore in the character class in the regular expression capture for charset detection in URL previews

line61

_charset_match = re.compile(br'<\s*meta[^>]*charset\s*=\s*"?([a-z0-9-]+)"?', flags=re.I)

line63

br'\s*<\s*\?\s*xml[^>]*encoding="([a-z0-9-]+)"', flags=re.I

  • When used in countries other than Europe and the United States, garbled characters are awkward on web pages that use certain character codes. As you know the names of these character codes, we also use underscores.
  • The fix is 2 lines but only 2 character correction.

([a-z0-9-]+) -> ([a-z0-9-_]+)
(21/07/16 02:00) ([a-z0-9-]+) -> ([a-z0-9_-]+)

(21/07/16 02:00) Hyphens need to be escaped unless they are at the beginning or end. The "Source Editor Screenshot" is also incorrect, so I deleted it.

  • When I added an underscore and sent a message including a URL from the client, the content containing the underscore in the name of the character code such as Shift_JIS was displayed without garbled characters.
  • Pull request with the same content / Sorry for being a beginner in python.
    Ignore this as it seems to be excluded in the test
  • Or I read deeply that there may be a deep reason why there is no underscore.
@richvdh
Copy link
Member

richvdh commented Jul 5, 2021

please could you explain what the user-visible symptoms of this issue are?

@babolivier babolivier changed the title There is no underscore in the character class in the regular expression capture for charset detection There is no underscore in the character class in the regular expression capture for charset detection in URL previews Jul 5, 2021
@richvdh richvdh added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 7, 2021
@srividyut
Copy link
Contributor Author

babolivier, thank you for fixing the wrong grammar.

@clokep
Copy link
Member

clokep commented Jul 12, 2021

I'm assuming this is to match additional charsets, e.g. both Shift-JIS and Shift_JIS? This should probably be fine. Do you get mojibake without this change?

@srividyut
Copy link
Contributor Author

srividyut commented Jul 15, 2021

Sorry for the late reply.
Mr. clokep, in the case of Shift_JIS, mojibake occurred without the changes described above.

Steps to reproduce

I referred to the following article to find the "sample URL for verification".

https://w3techs.com/technologies/overview/character_encoding
( The usage rate of Shift_JIS is 0.1%, but it also includes ITMedia, a media conglomerate. Kakaku.com is also often used. )
Below is a screenshot of the WEB client version of "Element" connecting to "Synapse", creating a room and sending some URLs. All the pasted URLs are WEB pages using Shift_JIS.
(To avoid past cache hits, I chose a different path, the content site is the same)

mxs-mojibake1

When I changed the regex character class, restarted the server and did the same, it looked like the screenshot below.

mxs-fixed1

I checked the response header using firefox's web development tool

The "upper two pasted URLs" in the validation image did not contain the character set definition in the content-type line of the HTTP response header.

For the third www.jalan.net, even if the HTTP response header contains "charset = Windows-31J", it is not output properly as a result.

I haven't tracked how the "the retrieved character set in variable" are processed, but they are clearly defined in the hash variables in webencodings.labels.
( ... /matrix-synapse/lib/python3.8/site-packages/webencodings/labels.py *Equivalent to here )

There is a recognition that in the past situation in Japan, in order to avoid the occurrence of problems, the WEB server side tended not to clearly set a specific character set for sending response headers. At least in the era when "individual blog operation including servers" became popular, it was seen in many "server setting articles" introduced by individuals. My personal server has the same settings as those.

I tried my best with machine translation and wrote it desperately, I'm sorry if there is something rude

@clokep
Copy link
Member

clokep commented Jul 15, 2021

@srividyut Thanks for including the screenshots! That makes it clear what's happening. I think the original PR you had put up (#10306) was correct. You'll just need to:

  • Add a newsfile
  • Sign-off on your commit
  • Run the linting scripts

https://github.com/matrix-org/synapse/blob/master/CONTRIBUTING.md#9-submit-your-patch has a bit more info about this.

@clokep clokep added good first issue Good for newcomers S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed X-Needs-Info This issue is blocked awaiting information from the reporter labels Jul 15, 2021
@srividyut
Copy link
Contributor Author

srividyut commented Jul 15, 2021

I made a mistake not only in the grammar but also in the fixed code.
It should have been inserted before the hyphen at the end.
Of course I will fix it.
Thank you for your kind guidance to the pull request !!
I'll try to do my best,

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
good first issue Good for newcomers S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants