Soft hyphenation in Weasyprint #176

jdus · 2014-03-27T19:28:26Z

Hello,

I have been experimenting with hyphenation recently. Automatic hyphenation works like a charm. It's simply amazing.

However, while manual hyphenation works, weasyprint never appends the actual hyphen character "-". It breaks correctly at a soft hyphen (&shy;) when needed, but the trailing hyphen character is omitted.

I have tested it using this simple HTML file:

<html>
<head>
</head>
<body>
ffffffffffffffffffffffffffffffffffffffffffffffffffffffg&shy;fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffg&shy;ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff&shy;fffffffffffffffffffffff&nbsp;fffffffffffffffffffffffffffffffffffff
</body>
</html>

Now, I have investigated and browsed the relevant parts in text.py, but it appears as if weasyprint is only concerned with automatic hyphenation and does not do the manual hyphenation part at all.

Could anybody confirm this behaviour? Or give me a hint what I am doing wrong? If this is indeed a bug maybe somebody could point out the relevant parts of the code?

Thanks for any comments :)


SimonSapin · 2014-03-27T20:30:21Z

I believe that breaking on U+00AD (a.k.a. &shy;) is Pango’s doing, so we do not have direct control over it. Alternatively, we could handle it ourselves and remove it from the input we give to Pango. @liZe ?
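
To make the second option concrete, here is a minimal sketch of stripping soft hyphens from a string while remembering where they stood, so that a later layout step could offer those offsets as break opportunities. The function name `strip_soft_hyphens` is hypothetical and is not part of WeasyPrint or Pango.

```python
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Return (clean_text, break_offsets).

    break_offsets are indexes into clean_text where a soft hyphen stood,
    i.e. positions where a break followed by a visible hyphen is allowed.
    Hypothetical helper for illustration only.
    """
    clean = []
    offsets = []
    for char in text:
        if char == SOFT_HYPHEN:
            # Record the position in the cleaned text, not in the input.
            offsets.append(len(clean))
        else:
            clean.append(char)
    return "".join(clean), offsets


clean, offsets = strip_soft_hyphens("hy\u00adphen\u00adation")
print(clean)    # hyphenation
print(offsets)  # [2, 6]
```

The cleaned string is what would be handed to Pango under this approach. The offsets would stay on the WeasyPrint side.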

jdus · 2014-03-27T20:37:22Z

Hey Simon,

thanks for your fast reply. So you can confirm this? Would it make sense to handle soft hyphens in text.py, where automatic hyphenation is handled as well?

Greetings,
jdus

SimonSapin · 2014-03-27T21:13:21Z

I think it would make sense (though @liZe might have an opinion), it just needs someone to do the work.

liZe · 2014-03-27T23:16:41Z

As far as I can remember, WeasyPrint is only concerned about automatic hyphenation, and relies on Pango to do the manual part of hyphenation.

It would make sense to make WeasyPrint handle both parts, because 1) the rules to know where text can be split are different between HTML+CSS and Pango, and 2) text is already split by WeasyPrint between tags before letting Pango split it, so text can be broken inside a word when the letters are put in tags one by one without spaces between them (and that's really bad).
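
A small illustration of point 2, using only Python's standard `html.parser` as a stand-in for WeasyPrint's own box tree (this is an assumption for illustration, not how WeasyPrint actually parses documents): when every letter sits in its own tag, the word arrives as several independent text runs, and each boundary between runs becomes a place where a line may be cut.

```python
from html.parser import HTMLParser


class TextRunCollector(HTMLParser):
    """Collect the text found between tags as separate runs."""

    def __init__(self):
        super().__init__()
        self.runs = []

    def handle_data(self, data):
        # Called once for each piece of text between two tags.
        self.runs.append(data)


collector = TextRunCollector()
collector.feed("<b>w</b><i>o</i><b>r</b><i>d</i>")
collector.close()
print(collector.runs)  # ['w', 'o', 'r', 'd']
```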

Handling soft hyphens is a really small part of the work. The rules for breaking lines depend, for example, on the language of the text and on some CSS properties. These rules are not handled by Pango, so we have to do the whole work in WeasyPrint, without relying on Pango at all. The code added for automatic hyphenation was a first step towards solving this problem. It was quite hard to add and it is hard to understand how it works now, yet it is a really simple piece of code compared to the complex rules we need to respect if we want to handle line breaks correctly.

By the way, before adding support for more complex rules to this part of the code, we should really add support for right-to-left languages. The specification takes this feature into account, and we can't seriously break lines without handling text direction first. The code will be closer to the specification and thus easier to understand once we have rtl support (yes, that's my pessimistic point of view about text :p).

SimonSapin · 2014-03-28T07:06:57Z

As long as we don’t have someone working on RTL support I don’t think we should block any text-related work on it.

jdus · 2014-03-28T07:37:15Z

Hello Simon, hello liZe,

I understand your considerations here. I will take a look at weasyprint's automatic hyphenation code and try to add support for soft hyphenation. As far as I am concerned this should all happen in text.py.

Greetings,
jdus

SimonSapin · 2014-03-28T07:45:56Z

> soft hyphenation […] should all happen in text.py.

Yes, that sounds right. Thanks a lot for volunteering for this!

jdus · 2014-03-28T08:41:55Z

Hey,

concerning line breaks within tags, as liZe mentioned: is this also the reason why weasyprint does not respect &nbsp; or white-space: nowrap? It would already help a lot and improve the situation if it did.

liZe · 2014-03-28T10:18:03Z

WeasyPrint respects white-space: nowrap when there's only text inside the tag, but allows lines to be cut at each opening or closing tag of a nested element.

jdus · 2014-05-19T08:37:27Z

I have finally found some time to look into this. As far as I understand, weasyprint retrieves the laid-out lines from Pango. Since Pango is not capable of automatic hyphenation, weasyprint tries to put the first word from the second line on the first line. If it does not fit and hyphenation is set to auto, it will try to split the word and try again. So far so good.
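
The procedure described above can be sketched roughly as follows. This is a simplified illustration, not the code in text.py: `pull_up_next_word`, `split_points` and `measure` are hypothetical names, character counts stand in for real text measurement, and the split offsets are assumed to come from a hyphenation dictionary.

```python
def pull_up_next_word(first_line, next_word, max_width, split_points,
                      measure=len):
    """Try to move all or part of next_word onto first_line.

    Returns (new_first_line, remainder_for_second_line).
    split_points: offsets in next_word where a break is allowed
    (assumed to be supplied by a hyphenation dictionary).
    measure: stand-in for real text measurement; len() counts characters.
    """
    # First try to fit the whole word.
    candidate = f"{first_line} {next_word}"
    if measure(candidate) <= max_width:
        return candidate, ""
    # Otherwise try the longest prefix that fits together with a hyphen.
    for offset in sorted(split_points, reverse=True):
        candidate = f"{first_line} {next_word[:offset]}-"
        if measure(candidate) <= max_width:
            return candidate, next_word[offset:]
    # Nothing fits: keep the line as it was laid out.
    return first_line, next_word


print(pull_up_next_word("soft", "hyphenation", 12, [2, 6]))
# ('soft hyphen-', 'ation')
```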
Unlike automatic hyphenation, manual/soft hyphenation seems to be supported by Pango, which simply omits the hyphen character at the end of the line. So when weasyprint retrieves the laid-out lines of text from Pango (and hyphenation is set to manual), the text may already be hyphenated based on soft hyphens. But we do not know whether the last word in the first line is complete or has been divided between the first and second line.
I assume a quick fix would be to determine whether the last element in the first line is indeed a soft hyphen ("\u00AD") and insert a hyphen character accordingly. I have already played with this and it works, but it's ugly and error-prone: what would happen, for example, if someone put a shy at the end of a word?
I'm not sure how to proceed now, since there are more bugs concerning line breaks that should be taken care of, and maybe this would mean getting rid of Pango altogether?
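
For illustration, the quick fix could look something like the sketch below. It assumes the soft hyphen is still present as the last character of the first line, as described above. The function name is hypothetical, and the check on the start of the second line is one possible guard against a shy placed at the end of a word, not something taken from WeasyPrint.

```python
SOFT_HYPHEN = "\u00ad"


def show_hyphen_at_break(first_line, second_line, hyphen="-"):
    """Turn a trailing soft hyphen on first_line into a visible hyphen.

    The hyphen is only shown when second_line continues the word
    (it exists and does not start with whitespace); otherwise the
    soft hyphen is dropped. Hypothetical helper for illustration.
    """
    if not first_line.endswith(SOFT_HYPHEN):
        return first_line
    continues_word = bool(second_line) and not second_line[0].isspace()
    if continues_word:
        return first_line[:-1] + hyphen
    return first_line[:-1]


print(show_hyphen_at_break("hy\u00ad", "phenation"))  # hy-
```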

[EDIT: Reposted because I have accidentally used an old account for posting this.]

liZe · 2016-03-08T19:17:07Z

Grouped in #301.

liZe · 2017-12-26T14:10:22Z

For the record: fixed before closing #301 by a random WeasyPrint version.

Smylers mentioned this issue May 1, 2014

Nested elements subvert white-space: nowrap #190

Closed

liZe added the feature label Mar 8, 2016

liZe closed this as completed Mar 8, 2016
