Soft hyphenation in Weasyprint #176

Closed
jdus opened this issue Mar 27, 2014 · 12 comments
Labels
feature New feature that should be supported

Comments

@jdus
Contributor

jdus commented Mar 27, 2014

Hello,

I have been experimenting with hyphenation recently. Automatic hyphenation works like a charm. It's simply amazing.

However, while manual hyphenation works, weasyprint never appends the actual hyphen character "-". When it encounters a &shy; it breaks the line correctly if needed, but the trailing hyphen character is omitted.
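For readers unfamiliar with the entity: &shy; is the HTML named character reference for the soft hyphen, U+00AD. A small standard-library check, independent of WeasyPrint and Pango, shows what actually ends up in the text after HTML parsing. The sample word is only an illustration.

```python
import html
import unicodedata

# "&shy;" decodes to a single invisible character, U+00AD.
soft_hyphen = html.unescape("&shy;")
soft_hyphen_name = unicodedata.name(soft_hyphen)

# The decoded text keeps the soft hyphen as a real character,
# so it counts towards the string length even though it is not displayed.
decoded = html.unescape("co&shy;operate")
decoded_length = len(decoded)
```

The character only marks a permitted break point. Displaying a visible hyphen when the break is actually taken is up to the layout engine.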

I have tested it using this simple HTML file

<html>
<head>
</head>
<body>
ffffffffffffffffffffffffffffffffffffffffffffffffffffffg&shy;fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffg&shy;ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff&shy;fffffffffffffffffffffff&nbsp;fffffffffffffffffffffffffffffffffffff
</body>
</html>

Now, I have investigated and browsed the relevant parts in text.py, but it appears as if weasyprint is only concerned with automatic hyphenation and does not do the manual hyphenation part at all.

Could anybody confirm this behaviour? Or give me a hint what I am doing wrong? If this is indeed a bug maybe somebody could point out the relevant parts of the code?

Thanks for any comments :)

@SimonSapin
Member

I believe that breaking on U+00AD (a.k.a. &shy;) is Pango’s doing, so we do not have direct control over it. Alternatively, we could handle it ourselves and remove it from the input we give to Pango. @liZe ?
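The alternative mentioned above (handle U+00AD ourselves and strip it before the text reaches Pango) could look roughly like the following. This is only a sketch in plain Python: the function name `strip_soft_hyphens` and its return format are hypothetical and are not taken from WeasyPrint's text.py.

```python
SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to


def strip_soft_hyphens(text):
    """Return (clean_text, break_offsets).

    clean_text is the input without soft hyphens. break_offsets lists
    the indexes into clean_text where a manual break would be allowed.
    """
    clean = []
    offsets = []
    for char in text:
        if char == SOFT_HYPHEN:
            # Remember where the break opportunity was, but drop the character.
            offsets.append(len(clean))
        else:
            clean.append(char)
    return "".join(clean), offsets


clean_text, break_offsets = strip_soft_hyphens("hy\u00adphen\u00adation")
```

With the offsets recorded separately, the layout code would decide where to break and could add the visible hyphen itself, instead of leaving the decision to the text shaping library.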

@jdus
Contributor Author

jdus commented Mar 27, 2014

Hey Simon,

thanks for your fast reply. So you can confirm this? Would it make sense to handle soft hyphens in text.py, where automatic hyphenation is handled as well?

Greetings,
jdus

@SimonSapin
Member

I think it would make sense (though @liZe might have an opinion), it just needs someone to do the work.

@liZe
Member

liZe commented Mar 27, 2014

As far as I can remember, WeasyPrint is only concerned with automatic hyphenation, and relies on Pango to do the manual part of hyphenation.

It would make sense to make WeasyPrint handle both parts, because 1) the rules to know where text can be split are different between HTML+CSS and Pango, and 2) text is already split by WeasyPrint between tags before letting Pango split it, so text can be broken inside a word when the letters are put in tags one by one without spaces between them (and that's really bad).

Handling soft hyphens is a really small part of the work. The rules for breaking lines depend, for example, on the language of the text and on some CSS properties. Pango does not handle these rules, so we have to do the whole job in WeasyPrint, without relying on Pango at all. The code added for automatic hyphenation was a first step towards solving this problem. It was quite hard to add, and it is hard to understand how it works now, yet it is a really simple piece of code compared to the complex rules we need to respect if we want to handle line breaks correctly.

By the way, before adding support for more complex rules to this part of the code, we should really add support for right-to-left languages. The specification takes care of this feature, and we can't seriously break lines without handling text direction first. The code will be closer to the specification and thus easier to understand once we have rtl support (yes, that's my pessimistic point of view about text :p).

@SimonSapin
Member

As long as we don’t have someone working on RTL support I don’t think we should block any text-related work on it.

@jdus
Contributor Author

jdus commented Mar 28, 2014

Hello Simon, hello liZe,

I understand your considerations here. I will take a look at weasyprint's automatic hyphenation code and try to add support for soft hyphenation. As far as I am concerned, this should all happen in text.py.

Greetings,
jdus

@SimonSapin
Member

soft hyphenation […] should all happen in text.py.

Yes, that sounds right. Thanks a lot for volunteering for this!

@jdus
Contributor Author

jdus commented Mar 28, 2014

Hey,

concerning line breaks within tags, as liZe mentioned: is this also the reason why weasyprint does not respect &nbsp; or white-space: nowrap? It would already help a lot and improve the situation if it did.

@liZe
Member

liZe commented Mar 28, 2014

WeasyPrint respects white-space: nowrap when there's only text inside the tag, but allows lines to be broken at each opening or closing nested tag.

@jdus
Contributor Author

jdus commented May 19, 2014

I have finally found some time to look into this. As far as I understand, weasyprint retrieves the laid-out lines from Pango. Since Pango is not capable of automatic hyphenation, weasyprint tries to put the first word from the second line on the first line. If it does not fit and hyphenation is set to auto, it will try to split the word and try again. So far so good.
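The retry loop described above can be sketched as follows. This is a simplified illustration, not WeasyPrint's code: the function `fill_first_line` is hypothetical, widths are measured in characters via `len` instead of real glyph metrics, and the hyphenation points are passed in as a plain list rather than coming from a hyphenation dictionary.

```python
def fill_first_line(first_line, next_word, break_points, max_width,
                    measure=len, hyphen="-"):
    """Try to pull next_word, or a hyphenated prefix of it, onto first_line.

    break_points are indexes into next_word where hyphenation is allowed.
    Returns (new_first_line, remainder_of_word).
    """
    prefix = first_line + " " if first_line else ""

    # First attempt: does the whole word fit?
    candidate = prefix + next_word
    if measure(candidate) <= max_width:
        return candidate, ""

    # Otherwise split the word and try again, longest prefix first.
    for point in sorted(break_points, reverse=True):
        candidate = prefix + next_word[:point] + hyphen
        if measure(candidate) <= max_width:
            return candidate, next_word[point:]

    # Nothing fits: leave the first line as it was.
    return first_line, next_word


wide = fill_first_line("soft", "hyphenation", [2, 6], 12)
narrow = fill_first_line("soft", "hyphenation", [2, 6], 8)
too_narrow = fill_first_line("soft", "hyphenation", [2, 6], 5)
```

Trying the longest prefix first keeps as much of the word as possible on the first line, which is what one would normally want from a hyphenated break.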
Unlike automatic hyphenation, manual/soft hyphenation seems to be supported by Pango, which simply omits the hyphenation character at the end of the line. So when weasyprint retrieves the laid-out lines of text from Pango (and hyphenation is set to manual), the text may already be hyphenated based on soft hyphens. But we do not know whether the last word in the first line is complete or has been divided between the first and second line.
I assume a quick fix would be to determine whether the last element in the first line is indeed a soft hyphen ("\u00AD") and insert a hyphenation character accordingly. I have already played with this and it works, but it's ugly and error-prone: for example, what would happen if someone put a &shy; at the end of a word?
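A rough sketch of that quick fix, under stated assumptions: it assumes the soft hyphen is still present as the last character of the first line's text, and it works on plain strings rather than on Pango layout objects. The function name `show_hyphen_at_break` and the handling of the edge case are hypothetical.

```python
SOFT_HYPHEN = "\u00ad"


def show_hyphen_at_break(first_line, second_line, hyphen="-"):
    """Return first_line with a visible hyphen if it was broken at a soft hyphen."""
    if not first_line.endswith(SOFT_HYPHEN):
        # The line was not broken at a soft hyphen: nothing to do.
        return first_line

    stripped = first_line.rstrip(SOFT_HYPHEN)

    # Edge case: a soft hyphen at the end of a word does not split a word,
    # so no visible hyphen should be added.
    word_continues = bool(second_line) and not second_line[0].isspace()
    word_started = bool(stripped) and not stripped[-1].isspace()
    if word_continues and word_started:
        return stripped + hyphen
    return stripped


broken_word = show_hyphen_at_break("manu\u00ad", "ally")
untouched = show_hyphen_at_break("manually", "done")
end_of_text = show_hyphen_at_break("word\u00ad", "")
```

This only covers the case where the break falls between two lines of the same text run. It says nothing about soft hyphens that sit next to tag boundaries.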
I'm not sure how to proceed now, since there are more bugs concerning line breaks that should be taken care of, and maybe this would mean getting rid of Pango altogether?

[EDIT: Reposted because I accidentally used an old account to post this.]

@liZe liZe added the feature New feature that should be supported label Mar 8, 2016
@liZe
Member

liZe commented Mar 8, 2016

Grouped in #301.

@liZe liZe closed this as completed Mar 8, 2016
@liZe
Member

liZe commented Dec 26, 2017

For the record: fixed before closing #301 by a random WeasyPrint version.
