Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GetInnerText Paragraphs do add too much linebreaks before and after the tag #155

Closed
5 tasks done
Seyden opened this issue Jan 6, 2024 · 8 comments
Closed
5 tasks done
Milestone

Comments

@Seyden
Copy link

Seyden commented Jan 6, 2024

Bug Report

Prerequisites

  • Can you reproduce the problem in a MWE?
  • Are you running the latest version of AngleSharp?
  • Did you check the FAQs to see if that helps you?
  • Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
  • Did you perform a search in the issues?

For more information, see the CONTRIBUTING guide.

Description

GetInnerText of the latest AngleSharp.CSS version renders too many line breaks at the start and end of a paragraph

Steps to Reproduce

using AngleSharp;
using AngleSharp.Css;
using AngleSharp.Dom;
using System.Diagnostics;

var content = "<div class=\"entry-content entry-content-single\" itemprop=\"description\"><p><em><strong>[By the studio that brought you &lt;Solo Leveling&gt;, &lt;Reaper of the Drifting Moon&gt;, and many more!]</strong></em></p>\n<p>He was the hound of the Baskerville family: Vikir.</p>\n<p>Yet his loyalty was rewarded by the blade of a guillotine dirtied by slander.</p>\n<p>“I will never live the life of a hound slaughtered after the rabbit is caught.”</p>\n<p>In place of death, an unexpected opportunity awaits him.</p>\n<p>Vikir’s eyes glowed red as he sharpened his canines in the dark.</p>\n<p>“Just you wait, Hugo. I will rip out your throat this time.”</p>\n<p>It’s time for the hound to exact bloody revenge on his owner.</p>\n</div>";

var context = BrowsingContext.New(Configuration.Default
    .WithCss());

var doc = await context.OpenAsync(req => req.Content(content));
var description = doc.QuerySelector(("div[itemprop=\"description\"]"))?.GetInnerText().Trim();
Console.WriteLine(description);
Console.ReadKey();

Expected behavior: Outputs

"[By the studio that brought you <Solo Leveling>, <Reaper of the Drifting Moon>, and many more!]\n\nHe was the hound of the Baskerville family: Vikir.\n\nYet his loyalty was rewarded by the blade of a guillotine dirtied by slander.\n\n“I will never live the life of a hound slaughtered after the rabbit is caught.”\n\nIn place of death, an unexpected opportunity awaits him.\n\nVikir’s eyes glowed red as he sharpened his canines in the dark.\n\n“Just you wait, Hugo. I will rip out your throat this time.”\n\nIt’s time for the hound to exact bloody revenge on his owner."

Pasting this code into the browser consoles also outputs the same
document.querySelector('div[itemprop=\"description\"]').innerText

Actual behavior: Outputs

"[By the studio that brought you <Solo Leveling>, <Reaper of the Drifting Moon>, and many more!]\n\n \n\nHe was the hound of the Baskerville family: Vikir.\n\n \n\nYet his loyalty was rewarded by the blade of a guillotine dirtied by slander.\n\n \n\n“I will never live the life of a hound slaughtered after the rabbit is caught.”\n\n \n\nIn place of death, an unexpected opportunity awaits him.\n\n \n\nVikir’s eyes glowed red as he sharpened his canines in the dark.\n\n \n\n“Just you wait, Hugo. I will rip out your throat this time.”\n\n \n\nIt’s time for the hound to exact bloody revenge on his owner."

Environment details: .NET 7

Possible Solution

Adjust the RequiredLineBreakCounts in ElementExtension.cs for the Start from 2 to 1 and for the end from 2 to 1

Adding a IsEmpty check for Text Nodes did fix the issue, it was actually empty "\n" line break Text Nodes between the elements (nodes which aren't visible in the frontend anyway, so they shouldn't be rendered).

grafik

The current behaviour is actually adding 2 line breaks before the element and 2 line breaks after the element, but Paragraphs should actually render 2 line breaks in total, 1 at the start and 1 at the end.

@Seyden Seyden added the bug label Jan 6, 2024
@FlorianRappl
Copy link
Contributor

Adding a IsEmpty check for Text Nodes did fix the issue, it was actually empty "\n" line break Text Nodes between the elements (nodes which aren't visible in the frontend anyway, so they shouldn't be rendered).

That is, unfortunately, a wrong assumption. This is fully depending on the styling - in most settings it will render a space, but if its literal it might be a line break. In block display it will not render anything.

I don't see a clear reproducible (external sites are not regarded as a reproducible as their content might change, they have too much noise, and its unclear if the source is IP / user dependent).

Please provide a short and concise MWE to illustrate and show the problem. Look at other bug reports on how to do it. Thanks!

@FlorianRappl
Copy link
Contributor

Just as an example consider:

<!DOCTYPE html>
<html>
<style>
.a * {
  display: block;
}
.b * {
  display: inline-block;
}
.c * {
  white-space: pre;
}
</style>
<body>
<div class="a">
<p>a</p><span>
</span><p>b</p>
</div>
<div class="b">
<p>a</p><span>
</span>
<p>b</p>
</div>
<div class="c">
<p>a</p><span>
</span>
<p>b</p>
</div>
<script>
console.log(document.querySelector('.a').innerText);
console.log(document.querySelector('.b').innerText);
console.log(document.querySelector('.c').innerText);
</script>
</body>

</html>

The outcome is quite different:

image

And, for good reasons, innerText also follows here:

image

So long story short: The text node has to be respected, but its interpretation is depending on the sheet. We need a clear reproducible to see if its a bug and if so where the bug originates.

@Seyden
Copy link
Author

Seyden commented Jan 6, 2024

I updated the MWE, i think you also might have misunderstood what the possible solution code is supposed to do.

as an example, i added a test case

Working case:

[TestCase("<p>test1</p><p>test2</p>", "test1\n\ntest2")]

Broken case:

[TestCase("<p>test1</p>\n<p>test2</p>", "test1\n\ntest2")]

grafik

@FlorianRappl
Copy link
Contributor

I updated the MWE, i think you also might have misunderstood what the possible solution code is supposed to do.

No - my answer is independent of the possible solution section.

My answer had two parts; one to show how a reproducible should be done (thanks for the update, that qualifies) and one part to illustrate that your assumption that text nodes should be removed does not apply here / in general. It might be effectively the case in the given scenario, but in general text nodes contribute to the rendering and (almost) every line break in an HTML source code ends up as either part of or an individual text node. That's just how HTML parsing works.

So the "Possible Solution" that you provided does not work - see my example. You cannot just remove text nodes - the way that innerText works (and what makes it different from innerHTML) is that is respects the styling. Therefore, if you omit the information from a text node fully depends on the styling context.

@Seyden
Copy link
Author

Seyden commented Jan 6, 2024

Do you might have an CSS where this case is considered ?

And do you perhaps already have an idea on how to approach this bug?

@FlorianRappl
Copy link
Contributor

I think we should compare the implementation against the spec and see where it goes wrong;

@Seyden
Copy link
Author

Seyden commented Jan 7, 2024

So this fix basically involves the requiredLineBreakCounts because based on what i read in the Specs, every white space has to be considered and because we handled the required line breaks at the end, all the whitespace css rules couldn't be applied to these breaks

${{\color{Green}\Huge{\textsf{Passing\ tests:\ }}}}$

[TestCase("<div><p>test1 test1 test1 ! test1</p>\\n<p>test2 test2  test2 test2</p></div>", "test1 test1 test1 ! test1\n\n\\n\n\ntest2 test2 test2 test2")]
[TestCase("<div style=\"white-space: normal;\"><p>Test</p>&#010;<p>Test</p></div>", "Test\n\nTest")]
[TestCase("<div style=\"white-space:pre;\"><p>Test</p>&#010;                                          <p>Test</p></div>", "Test\n\n\n                                          \n\nTest")]
[TestCase("<div style=\"white-space: pre-wrap;\"><p>Test</p>&#010;<p>Test</p></div>", "Test\n\n\n\n\nTest")]

${{\color{Red}\Huge{\textsf{Broken\ tests:\ }}}}$

[TestCase("<div style=\"white-space: pre-line;\"><p>Test</p>\n<p>Test</p></div>", "Test\n\nTest")]
[TestCase("<div style=\"white-space: break-spaces;\"><p>Test</p>\n<p>Test</p></div>", "Test\n\n\n\n\nTest")]

${{\color{Aqua}\Huge{\textsf{Fix\ patch:\ }}}}$

diff --git a/src/AngleSharp.Css/Extensions/ElementExtensions.cs b/src/AngleSharp.Css/Extensions/ElementExtensions.cs
index 7e687f2..5eb65c3 100644
--- a/src/AngleSharp.Css/Extensions/ElementExtensions.cs
+++ b/src/AngleSharp.Css/Extensions/ElementExtensions.cs
@@ -196,7 +196,7 @@ private static void ItcInCssBox(ICssStyleDeclaration elementStyle, ICssStyleDecl
                     var lastLine = node.NextSibling is null ||
                         String.IsNullOrEmpty(node.NextSibling.TextContent) ||
                         node.NextSibling is IHtmlBreakRowElement;
-                    ProcessText(textElement.Data, sb, parentStyle, lastLine);
+                    ProcessText(textElement.Data, sb, parentStyle, lastLine, requiredLineBreakCounts);
                 }
                 else if (node is IHtmlBreakRowElement)
                 {
@@ -390,12 +390,15 @@ private static Boolean IsBlockLevel(INode node)
             }
         }
 
-        private static void ProcessText(String text, StringBuilder sb, ICssStyleDeclaration style, Boolean lastLine)
+        private static void ProcessText(String text, StringBuilder sb, ICssStyleDeclaration style, Boolean lastLine, Dictionary<Int32, Int32> requiredLineBreakCounts)
         {
             var startIndex = sb.Length;
             var whiteSpace = style?.GetWhiteSpace();
             var textTransform = style?.GetTextTransform();
-            var isWhiteSpace = startIndex > 0 ? Char.IsWhiteSpace(sb[startIndex - 1]) && sb[startIndex - 1] != Symbols.NoBreakSpace : true;
+            bool IsWhiteSpace(Char c) { return Char.IsWhiteSpace(c) && c != Symbols.NoBreakSpace; }
+            var isWhiteSpace = startIndex <= 0
+                || (IsWhiteSpace(sb[startIndex - 1])
+                || (requiredLineBreakCounts.ContainsKey(startIndex) && IsWhiteSpace(text[0])));
 
             for (var i = 0; i < text.Length; i++)
             {

FlorianRappl added a commit that referenced this issue Jan 16, 2024
@FlorianRappl
Copy link
Contributor

Fix for it with additional tests landed in devel.

@FlorianRappl FlorianRappl added this to the v1.0 milestone Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants