reverse_adoc: Clean Unicode whitespace in headers and paragraphs #80

Merged: 3 commits, Jun 4, 2024

@hmdne
Contributor

hmdne commented Jun 1, 2024

This fixes #65 and fixes #67.

I don't necessarily agree with this. A full-width space is semantically similar to an NBSP, i.e. web browsers do not trim it. If anything, I think this should not be a generic feature: in this particular use case the full-width space has no meaning other than formatting, but in other documents it may be crucial.

The character still persists in table cells, lists and sections (which are mapped from DIVs):

  • Removing whitespace from lists will cause certain tests to fail, as they expect a list item to end with " ".
  • Removing them from sections will cause a document reflow (in this particular document, they are used to ensure there's a deeper line break).
  • Removing them from table cells - I have not tested the impact yet.
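
As a rough illustration of the kind of cleanup discussed here, the following is a minimal sketch and not the code merged in this PR. The helper name `clean_edges` and the constant `EDGE_SPACE` are hypothetical. It removes Unicode whitespace only at the start and end of a string, so whitespace inside the text is left alone.

```ruby
# Hypothetical helper; not the implementation from this PR.
# In Ruby, [[:space:]] is Unicode-aware for UTF-8 strings, so it matches
# U+3000 (ideographic space) and U+00A0 (NBSP). String#strip only removes
# ASCII whitespace, so it would leave both characters in place.
EDGE_SPACE = /\A[[:space:]]+|[[:space:]]+\z/

# Remove leading and trailing Unicode whitespace; interior spaces are kept.
def clean_edges(text)
  text.gsub(EDGE_SPACE, "")
end

heading = "\u3000\u3000第1章 概要\u3000"
cleaned = clean_edges(heading)
```

Whether the same treatment is safe for list items, sections and table cells is exactly the open question in the list above, so a helper like this would presumably be applied per element type rather than globally.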

Metanorma PR checklist

@hmdne force-pushed the clean-headers-paragraphs branch from 7438c2c to 99cdd4d on June 1, 2024 22:54

@ronaldtse
Contributor

@hmdne I understand your concern with regard to the full-width space, but the question is really about the compatibility of "AsciiDoc" (which uses ASCII sequences as control/markup sequences) with CJK in general.

AsciiDoc syntax heavily depends on these control symbols that are not easily accessed/used in CJK:

  • ASCII space
    • delimiter after the clause title marker (==)
    • text lines not separated by an empty line are joined into one paragraph, with each newline replaced by a single space
  • empty line
    • an empty line starts a new paragraph
    • without an empty line between them, consecutive text lines are joined into one paragraph, with each newline replaced by a single space
  • asterisk, underscore, backtick
    • for inline formatting
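
The paragraph rule in the list above can be sketched as follows. This is a simplified model for illustration only, not how any AsciiDoc processor is implemented, and the method name `paragraphs` is hypothetical. It assumes only ASCII whitespace counts as "empty", which is precisely the assumption that breaks down for CJK input.

```ruby
# Simplified model of the rule: an empty line separates paragraphs, and
# lines inside one paragraph are joined with a single ASCII space.
# Assumption: only ASCII whitespace makes a line "empty" (\s is ASCII-only
# in Ruby), so a line holding just a full-width space is treated as text.
def paragraphs(source)
  source
    .split(/\n\s*\n/)                                   # break on empty lines
    .map { |block| block.split("\n").map(&:strip).join(" ") }
    .reject(&:empty?)
end

result = paragraphs("first line\nsecond line\n\nnew paragraph")
# result is ["first line second line", "new paragraph"]
```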

Retracing our steps, notice that AsciiDoc was designed for "ASCII" encoding, which is really made to allow easy and predictable entry on an English keyboard and, to an extent, on Latin-based keyboards. CJK cannot be written in ASCII, so an "easy-to-enter textual semantic syntax" means something different for CJK than it does for AsciiDoc. We need defined rules on what "AsciiDoc" means for "non-ASCII CJK", with the principle that it should be easy to type on a CJK keyboard.

The comments about "a full width space means something" describe unintended consequences of AsciiDoc's compatibility problems with CJK:

> Removing whitespace from lists will cause certain tests to fail, as they expect a list item to end with " ".

That should not be the case. This is simply a CJK compatibility issue with AsciiDoc.

> Removing them from sections will cause a document reflow (in this particular document, they are used to ensure there's a deeper line break).

In CJK, the initial "full width spaces" (one or more) are a formatting concern. Paragraph first-line indentation should be determined by the rendering template; it plays no part in the textual meaning.

> Removing them from table cells - I have not tested the impact yet.

They should be stripped from the table cells.

@ronaldtse
Contributor

If I use the Japanese keyboard and retain the semantics of the equal sign, hyphens, spaces, open/close brackets, comma, I get this. This means I won't need to swap between Japanese/English when entering. Wondering if this is something we should support... "(Ascii)Doc for CJK"

= 日本語

「ソース、ruby」
ーーーー
ソースコード
ーーーー
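
To make the idea concrete, here is a minimal sketch of how such input could be mapped back to ASCII AsciiDoc markup. Everything in it is an assumption for illustration: the method names, the `KEYWORDS` table, and the line patterns are hypothetical, and nothing like this is known to exist in reverse_adoc or in any AsciiDoc processor. A line-based mapping like this is also naive, since it would rewrite a body line that happens to look like markup.

```ruby
# Hypothetical mapping from CJK-keyboard input to ASCII AsciiDoc markup.
# Assumed keyword translation; only the one used in the sample is listed.
KEYWORDS = { "ソース" => "source" }.freeze

def normalize_line(line)
  case line
  when /\A([=\uFF1D]+)[ \u3000]+(.*)\z/   # title: ASCII or full-width "="
    marks, title = $1, $2
    "#{'=' * marks.length} #{title}"
  when /\Aー{4,}\z/                       # run of U+30FC as block delimiter
    "-" * line.length
  when /\A「(.*)」\z/                      # corner brackets as attribute list
    attrs = $1.split(/[、,]/).map { |a| KEYWORDS.fetch(a.strip, a.strip) }
    "[#{attrs.join(',')}]"
  else
    line                                  # ordinary text is left untouched
  end
end

def normalize(source)
  source.lines(chomp: true).map { |l| normalize_line(l) }.join("\n")
end

sample = "= 日本語\n\n「ソース、ruby」\nーーーー\nソースコード\nーーーー"
ascii_form = normalize(sample)
```

With the sample above, the attribute line becomes `[source,ruby]` and the two delimiter lines become `----`, while the body line stays as typed.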

@ReesePlews

is there status on these updates? would a work-around be to fill any empty cells in a table with a single character?

@hmdne
Contributor Author

hmdne commented Jun 4, 2024

@ReesePlews @ronaldtse

I have pushed an updated version that deals with almost all of the leading CJK whitespace in the document while trying to preserve compatibility. The only remaining issue is with sections: as mentioned above, empty paragraphs are collapsed. This is specific to this particular document and may not really be a problem; if it is, please let me know. I have also found another problem, with generation, and I will try to fix it shortly.

@hmdne force-pushed the clean-headers-paragraphs branch from 5fa95dc to 875dfb4 on June 4, 2024 01:18

@hmdne
Contributor Author

hmdne commented Jun 4, 2024

This is ready for merge now.

@ronaldtse
Contributor

Thanks @hmdne !

@ronaldtse merged commit 8e7538c into metanorma:main on Jun 4, 2024
10 of 14 checks passed