-
-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GetInnerText Paragraphs do add too much linebreaks before and after the tag #155
Comments
That is, unfortunately, a wrong assumption. This is fully depending on the styling - in most settings it will render a space, but if its literal it might be a line break. In block display it will not render anything. I don't see a clear reproducible (external sites are not regarded as a reproducible as their content might change, they have too much noise, and its unclear if the source is IP / user dependent). Please provide a short and concise MWE to illustrate and show the problem. Look at other bug reports on how to do it. Thanks! |
Just as an example consider: <!DOCTYPE html>
<html>
<style>
.a * {
display: block;
}
.b * {
display: inline-block;
}
.c * {
white-space: pre;
}
</style>
<body>
<div class="a">
<p>a</p><span>
</span><p>b</p>
</div>
<div class="b">
<p>a</p><span>
</span>
<p>b</p>
</div>
<div class="c">
<p>a</p><span>
</span>
<p>b</p>
</div>
<script>
console.log(document.querySelector('.a').innerText);
console.log(document.querySelector('.b').innerText);
console.log(document.querySelector('.c').innerText);
</script>
</body>
</html> The outcome is quite different: And, for good reasons, So long story short: The text node has to be respected, but its interpretation is depending on the sheet. We need a clear reproducible to see if its a bug and if so where the bug originates. |
No - my answer is independent of the possible solution section. My answer had two parts; one to show how a reproducible should be done (thanks for the update, that qualifies) and one part to illustrate that your assumption that text nodes should be removed does not apply here / in general. It might be effectively the case in the given scenario, but in general text nodes contribute to the rendering and (almost) every line break in an HTML source code ends up as either part of or an individual text node. That's just how HTML parsing works. So the "Possible Solution" that you provided does not work - see my example. You cannot just remove text nodes - the way that |
Do you might have an CSS where this case is considered ? And do you perhaps already have an idea on how to approach this bug? |
I think we should compare the implementation against the spec and see where it goes wrong; |
So this fix basically involves the requiredLineBreakCounts because based on what i read in the Specs, every white space has to be considered and because we handled the required line breaks at the end, all the whitespace css rules couldn't be applied to these breaks [TestCase("<div><p>test1 test1 test1 ! test1</p>\\n<p>test2 test2 test2 test2</p></div>", "test1 test1 test1 ! test1\n\n\\n\n\ntest2 test2 test2 test2")] [TestCase("<div style=\"white-space: normal;\"><p>Test</p>
<p>Test</p></div>", "Test\n\nTest")] [TestCase("<div style=\"white-space:pre;\"><p>Test</p>
 <p>Test</p></div>", "Test\n\n\n \n\nTest")] [TestCase("<div style=\"white-space: pre-wrap;\"><p>Test</p>
<p>Test</p></div>", "Test\n\n\n\n\nTest")] [TestCase("<div style=\"white-space: pre-line;\"><p>Test</p>\n<p>Test</p></div>", "Test\n\nTest")] [TestCase("<div style=\"white-space: break-spaces;\"><p>Test</p>\n<p>Test</p></div>", "Test\n\n\n\n\nTest")] diff --git a/src/AngleSharp.Css/Extensions/ElementExtensions.cs b/src/AngleSharp.Css/Extensions/ElementExtensions.cs
index 7e687f2..5eb65c3 100644
--- a/src/AngleSharp.Css/Extensions/ElementExtensions.cs
+++ b/src/AngleSharp.Css/Extensions/ElementExtensions.cs
@@ -196,7 +196,7 @@ private static void ItcInCssBox(ICssStyleDeclaration elementStyle, ICssStyleDecl
var lastLine = node.NextSibling is null ||
String.IsNullOrEmpty(node.NextSibling.TextContent) ||
node.NextSibling is IHtmlBreakRowElement;
- ProcessText(textElement.Data, sb, parentStyle, lastLine);
+ ProcessText(textElement.Data, sb, parentStyle, lastLine, requiredLineBreakCounts);
}
else if (node is IHtmlBreakRowElement)
{
@@ -390,12 +390,15 @@ private static Boolean IsBlockLevel(INode node)
}
}
- private static void ProcessText(String text, StringBuilder sb, ICssStyleDeclaration style, Boolean lastLine)
+ private static void ProcessText(String text, StringBuilder sb, ICssStyleDeclaration style, Boolean lastLine, Dictionary<Int32, Int32> requiredLineBreakCounts)
{
var startIndex = sb.Length;
var whiteSpace = style?.GetWhiteSpace();
var textTransform = style?.GetTextTransform();
- var isWhiteSpace = startIndex > 0 ? Char.IsWhiteSpace(sb[startIndex - 1]) && sb[startIndex - 1] != Symbols.NoBreakSpace : true;
+ bool IsWhiteSpace(Char c) { return Char.IsWhiteSpace(c) && c != Symbols.NoBreakSpace; }
+ var isWhiteSpace = startIndex <= 0
+ || (IsWhiteSpace(sb[startIndex - 1])
+ || (requiredLineBreakCounts.ContainsKey(startIndex) && IsWhiteSpace(text[0])));
for (var i = 0; i < text.Length; i++)
{
|
Fix for it with additional tests landed in |
Bug Report
Prerequisites
AngleSharp.Css
for CSS support)For more information, see the
CONTRIBUTING
guide.Description
GetInnerText of the latest AngleSharp.CSS version renders too many line breaks at the start and end of a paragraph
Steps to Reproduce
Expected behavior: Outputs
Pasting this code into the browser consoles also outputs the same
document.querySelector('div[itemprop=\"description\"]').innerText
Actual behavior: Outputs
Environment details: .NET 7
Possible Solution
Adjust the RequiredLineBreakCounts in ElementExtension.cs for the Start from 2 to 1 and for the end from 2 to 1Adding a IsEmpty check for Text Nodes did fix the issue, it was actually empty "\n" line break Text Nodes between the elements (nodes which aren't visible in the frontend anyway, so they shouldn't be rendered).
The current behaviour is actually adding 2 line breaks before the element and 2 line breaks after the element, but Paragraphs should actually render 2 line breaks in total, 1 at the start and 1 at the end.
The text was updated successfully, but these errors were encountered: