Upgrading from 3.0.1 to 3.0.2 un-escapes & #117

Open
benasher44 opened this issue Jan 25, 2022 · 15 comments
@benasher44

benasher44 commented Jan 25, 2022

Describe the bug
We're encoding some HTML into an XML document. The HTML contains an already-encoded & (written as &amp;). Going from 3.0.1 to 3.0.2, this sequence is now emitted as a bare & instead of being left as &amp;.

To Reproduce

const html = `<ul style="min-height:1.5em"><li><p style="min-height:1.5em">Peanut butter &amp; jelly.<br /></p></li></ul><p style="min-height:1.5em"><strong>Heading:</strong><br /></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">list item.</p></li><li><p style="min-height:1.5em">list item; more.</p></li><li><p style="min-height:1.5em">list item.</p></li><li><p style="min-height:1.5em">list item.</p></li><li><p style="min-height:1.5em">list item.</p></li><li><p style="min-height:1.5em">list item.<br /></p></li></ul>`;
const xml = create({ encoding: "utf-8", version: "1.0" }, { test: html }).end({
  prettyPrint: true,
});

Expected behavior
We expect the following XML (3.0.1):

<?xml version=\\"1.0\\" encoding=\\"utf-8\\"?>
      <test>&lt;ul style=\\"min-height:1.5em\\"&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;Peanut butter &amp; jelly.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p style=\\"min-height:1.5em\\"&gt;&lt;strong&gt;Heading:&lt;/strong&gt;&lt;br /&gt;&lt;/p&gt;&lt;ul style=\\"min-height:1.5em\\"&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;list item; more.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style=\\"min-height:1.5em\\"&gt;list item.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;</test>

We're getting this, instead (3.0.2):

<test>&lt;ul style="min-height:1.5em"&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;Peanut butter & jelly.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p style="min-height:1.5em"&gt;&lt;strong&gt;Heading:&lt;/strong&gt;&lt;br /&gt;&lt;/p&gt;&lt;ul style="min-height:1.5em"&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;list item; more.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;list item.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p style="min-height:1.5em"&gt;list item.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;</test>

The only difference is that the first &amp; is now a bare &, which makes the XML document invalid. A screenshot highlighting the difference was attached (Screen Shot 2022-01-25 at 10 43 10 AM).

Version:

  • node.js: 16.13.0
  • xmlbuilder2 3.0.2


@benasher44 benasher44 added the bug Something isn't working label Jan 25, 2022
@jasonkhanlar

jasonkhanlar commented Apr 6, 2022

This bug report seems related. I started using xmlbuilder2 with v3.0.2. I'm trying to convert XML to JSON and back to XML, and to verify that the output matches the source exactly. The source XML contains "&amp;", but the generated output XML keeps the "&" from the JSON as a bare "&".

I would like to find an option or way to ensure all "&" are converted back to html entity "&amp;"

My code at the moment:

      let xml = await fs.promises.readFile(file, { encoding: 'utf8' });
      let json = xmlbuilder2.convert(xml, { format: 'object' });
      let xmlverify = xmlbuilder2.convert(json, { format: 'xml', prettyPrint: true, spaceBeforeSlash: true });

These also seem relevant:


Also note that in my comparing to my jq/xq/yq codebase that I'm migrating away from, converting the XML source to JSON, the XML having "&amp;" piping to xq (https://github.com/kislyuk/xq) the returned JSON shows "&" (which may be fine or okay, since xq does it), but then when I use yq (https://github.com/kislyuk/yq) to convert the JSON back to XML, it converts the "&" in the JSON back to "&amp;" XML generated output. This is the expected behavior. And xmlbuilder2 seems to handle the XML -> JSON conversion alright (at quick comparison glance), but the JSON -> XML conversion of xmlbuilder2 doesn't seem to return the expected return results.
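For reference, the yq-style behavior described above amounts to escaping every "&" unconditionally when writing XML text, which is also what the serialization steps quoted further down (steps 3 to 5) describe. A minimal sketch of that, where escapeXmlText is a hypothetical helper of my own and not an xmlbuilder2 API:

```javascript
// Hypothetical helper (not part of xmlbuilder2): escape text content
// unconditionally, the way a JSON -> XML step is expected to behave here.
function escapeXmlText(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeXmlText('Peanut butter & jelly')); // Peanut butter &amp; jelly
console.log(escapeXmlText('<Foo & Bar>'));           // &lt;Foo &amp; Bar&gt;
```

Note that this variant would also turn an existing "&amp;" in the data into "&amp;amp;", so it only round-trips cleanly if the reader has decoded entities first.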


Edited to add some notes while glancing through the codebase:

Possibly relevant script files (and possibly relevant code-related matches):

  • lib/writers/BaseWriter.js

    • 926 * 3. Replace any occurrences of "&" in markup by "&amp;".
    • 927 * 4. Replace any occurrences of "<" in markup by "&lt;".
    • 928 * 5. Replace any occurrences of ">" in markup by "&gt;".
    • 931 var markup = node.data.replace(/(?!&([^&;]*);)&/g, '&amp;')
    • 932 .replace(/</g, '&lt;')
    • 933 .replace(/>/g, '&gt;');
    • 1582 * - "&" with "&amp;"
    • 1583 * - """ with "&quot;"
    • 1584 * - "<" with "&lt;"
    • 1585 * - ">" with "&gt;"
    • 1591 return value.replace(/(?!&([^&;]*);)&/g, '&amp;')
    • 1592 .replace(/</g, '&lt;')
    • 1593 .replace(/>/g, '&gt;')
    • 1594 .replace(/"/g, '&quot;');
  • lib/readers/BaseReader.js

    • 51 return text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {
    • 53 }).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {
    • 160 "amp": "&",
  • lib/builder/XMLBuilderCBImpl.js

    • 217 var markup = node.data.replace(/(?!&(lt|gt|amp|apos|quot);)&/g, '&amp;')
    • 218 .replace(/</g, '&lt;')
    • 219 .replace(/>/g, '&gt;');
    • 679 return value.replace(/(?!&(lt|gt|amp|apos|quot);)&/g, '&amp;')
    • 680 .replace(/</g, '&lt;')
    • 681 .replace(/>/g, '&gt;')
    • 682 .replace(/"/g, '&quot;');
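The two ampersand patterns listed above do not behave the same way. A small sketch comparing them on plain strings (only the two regular expressions come from the files above; the sample strings and the helper name are my own):

```javascript
// Both patterns are copied verbatim from the snippets listed above.
const looseAmp = /(?!&([^&;]*);)&/g;               // lib/writers/BaseWriter.js
const strictAmp = /(?!&(lt|gt|amp|apos|quot);)&/g; // lib/builder/XMLBuilderCBImpl.js

// Hypothetical helper for this comparison only.
function escapeAmp(pattern, text) {
  return text.replace(pattern, '&amp;');
}

const samples = [
  'a & b',                                            // bare ampersand, no semicolon after it
  'a &amp; b',                                        // already-encoded ampersand
  'a & b; c',                                         // bare ampersand followed later by a semicolon
  'Peanut butter & jelly.</p><p>list item; more.</p>' // shortened from the original report
];

for (const s of samples) {
  console.log(JSON.stringify(s));
  console.log('  loose :', escapeAmp(looseAmp, s));
  console.log('  strict:', escapeAmp(strictAmp, s));
}
```

The loose pattern treats anything of the form "&...;" as an existing entity, so a bare "&" followed later by any ";" (with no other "&" or ";" in between) is left unescaped. This may be relevant to the original report, whose HTML contains "list item; more." after the ampersand, but I have not confirmed that this is the actual cause.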

@jasonkhanlar

jasonkhanlar commented Apr 6, 2022

Debugging the code referenced in #117 (comment) particularly modifying lib/readers/BaseReader.js

    BaseReader.prototype._decodeText = function (text) {
        if (text == null) return text;
console.log('decoding', text, 'after', text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {
            return BaseReader._entityTable[tag];
        }).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {
            return String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10));
        })
); // end of console.log
        return text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {
            return BaseReader._entityTable[tag];
        }).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {
            return String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10));
        });
    };
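To try the decoding logic outside the library, here is a standalone copy of it. The entityTable object is my stand-in for BaseReader._entityTable, which I am assuming holds the five predefined XML entities; the sample string is made up.

```javascript
// Stand-in for BaseReader._entityTable (assumed contents: the five predefined XML entities).
const entityTable = { quot: '"', amp: '&', apos: "'", lt: '<', gt: '>' };

// Same two replace() calls as _decodeText above, without the debug logging.
function decodeText(text) {
  if (text == null) return text;
  return text
    .replace(/&(quot|amp|apos|lt|gt);/g, (_match, tag) => entityTable[tag])
    .replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, (_match, hexStr, numStr) =>
      String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10)));
}

console.log(decodeText('Foo &amp; Bar &#x0070; &#70;')); // Foo & Bar p F
```

Named entities and numeric character references (hex and decimal) are both replaced by the characters they stand for.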

I can definitely see when I call:

let json = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });

that the _decodeText() function is running and properly converting "&amp;" to "&" and...

I could have sworn this was working just a couple of hours ago and converting properly. Unfortunately, the actual return value passed into the json variable is "&amp;" (unchanged from the XML source). I think I remember what I was doing when it was working, so I reverted back to using

let json = xmlbuilder2.convert(xml, { format: 'object' });

Nope. That didn't work either. I am so confused right now, haha, cuz I wasn't having this issue earlier, if I remember correctly (yep), and I was preparing to investigate on a different concern, but got stuck at this point. Whatever, anyway, investigating what is not working properly in this situation, ...

lib/readers/XMLReader.js line 105 has

context = this.text(context, this._decodeText(this.sanitize(text.data))) || context;

and the context variable has the correct output that should be returned from the convert() function call, however after the while loop lastChild is returned, and I don't see how the value of the context variable is otherwise passed along to be included in the return value. I'm still looking at the code to try to understand if maybe I missed something. The case interfaces_1.TokenType.Element switch seems to handle this, but this is too complex for me to understand.

However, I think I possibly understand my confusion to a slight extent. I believe about 4 hours ago I was looking at the sourceXML -> JSON -> generatedXML, I was looking at the generatedXML output and seeing "&" instead of "&amp;" and assuming that the generated JSON object from xmlbuilder2 also had "&" but upon this stage of inspection, it appears that the JSON object does not, and instead preserves the "&amp;" value. Comparing this behavior to jq/xq/yq, this is different, and (and even though this may be acceptable in my use case scenario that I could potentially work with, I'm going to try to mirror/match the behavior I'm used to) it may be related to the bug described in various locations, including here.

Oh also, I should backtrack to the case interfaces_1.TokenType.Text and make sure that this.text(...) function is actually storing the value (or whatever this function does -- I haven't looked at it yet). So far, seems good. File lib/builder/XMLBuilderImpl.js, function XMLBuilderImpl.prototype.txt(), variable child is set with TextImpl object that has _data attribute with correct value "&" instead of "&amp;"


Aha! This seems interesting! In lib/readers/XMLReader.js, function XMLReader.prototype._parse(), immediately after the while (token.type !== interfaces_1.TokenType.EOF) { line, I added:

        while (token.type !== interfaces_1.TokenType.EOF) {
console.log('token', token);

and I ran my script and specifically for the line with "&amp;", console.log shows:

token { type: 4, data: 'Hi 1 &amp; 2 &#x0070; 3 &#70; ' }

That's it! Comparing that console.log output to the output for all the other lines, I see a bunch of token.type === 3 and token.type === 8, but for this line there is none of that. So I suspect this is where the problem lies (even though I still don't understand where exactly, lol). The way I understand it, when the while loop processes the lines containing "&amp;", the switch cases correctly create an XML child node and store it in the context variable, but then nothing else happens with that variable and it is simply lost.

Actually, I'm wrong! After the while loop, just before the return statement, I added console.log line and it appears that the value returned by the XMLReader.prototype._parse() function has the correct value.

        console.log('debugging', [...[...[...[...lastChild._domNode._children][1]._children][3]._children][6]._children][0]._data);
        return lastChild;
    };
    return XMLReader;
}(BaseReader_1.BaseReader));
exports.XMLReader = XMLReader;
//# sourceMappingURL=XMLReader.js.map

So then why is the variable in my code not showing this?

        console.log('debugging', [...[...[...[...lastChild._domNode._children][1]._children][3]._children][6]._children][0]._data);
// output is "debugging &"

// and in my code from main script:
      let json = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });
      console.log('debugging2', json.mediawiki.page.revision.text['#text']);
// output is "debugging2 &amp;"

And editing file lib/builder/BuilderFunctions.js, function create():

    ...
    console.log([...[...[...[...[...builder._domNode._children][0]._children][1]._children][3]._children][6]._children][0]._data);
    return builder;
}

the output is also "&" which is good! so somewhere from there to the actual output, the data is getting lost. I'm getting closer to finding out where! This is being called from the convert function in the same file! And the convert function tacks on a call to end() function after that, so I shall go digging in there to find the culprit!

Back in file lib/builder/XMLBuilderImpl.js, function XMLBuilderImpl.prototype.end, it is the _serialize() function call that converts the "&" back into "&amp;", so that is where I shall continue digging. Reflecting on this investigation: I started from noticing that "&amp;" is replaced with "&", which is perfect. That is what I want when converting the XML into JSON; once I have the JSON and have done some processing, converting the JSON back into XML can turn the "&" back into "&amp;" at that stage. It's unfortunate, then, that this process performs a double conversion and converts the converted value back into the unconverted one. I'm not sure whether this is intended, but it seems like an oversight.

Each of the writers have their own prototype.serialize function which then point to serializeNode function in lib/writers/BaseWriter.js which points to _serializeNodeNS which points to _serializeText which then explicitly shows this code:

        var markup = node.data.replace(/(?!&([^&;]*);)&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;');
        this.text(markup);

which appears to forcibly convert back to "&amp;" without any way to prevent the round-trip, all in a single function call:

let json = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });

So a summary recap of what I learned, starting with grep -r "&" --color=always|grep -v "min.js" I discovered three files that appear to relate to converting html entities for "&"

  • lib/writers/BaseWriter.js
  • lib/readers/BaseReader.js
  • lib/builder/XMLBuilderCBImpl.js

and I didn't know which one of those files to start with specific to converting from XML source to JavaScript object output, but apparently all... oh actually, lib/builder/XMLBuilderCBImpl.js did not cross my path, however lib/builder/XMLBuilderImpl.js did cross my path, except, that file does not contain any explicit "&" conversion-related code, whereas lib/builder/XMLBuilderCBImpl.js does. In any case, I don't know how to fix this since it's not my project, but maybe I learned enough to document my observation that any related bugs regarding this can be fixed.

  • lib/readers/BaseReader.js / BaseReader.prototype._decodeText
  • to lib/readers/XMLReader.js / XMLReader.prototype._parse
  • to lib/builder/XMLBuilderImpl.js / XMLBuilderImpl.prototype.txt
  • to lib/builder/BuilderFunctions.js / create
  • to lib/builder/XMLBuilderImpl.js / XMLBuilderImpl.prototype.end
  • to lib/writers/ObjectWriter.js / serialize
  • to lib/writers/BaseWriter.js / serializeNode
  • to lib/writers/BaseWriter.js / _serializeNodeNS
  • to lib/writers/BaseWriter.js / _serializeText

TL;DR: file lib/builder/BuilderFunctions.js function convert,

return create(builderOptions, contents) // where example contents contains XML with "&amp;" this process converts "&amp;" to "&"
.end(convertOptions); // and then the "&" that was previously converted from the create function, gets converted back into "&amp;" and if you expected "&" in JavaScript object or JSON or other format, too bad!
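The two stages from the TL;DR can be imitated on a plain string. This is only a sketch: readStage and writeStage are names I made up, their bodies are copied from _decodeText and _serializeText as quoted above, and the entity table contents are assumed.

```javascript
// Assumed contents of BaseReader._entityTable.
const entityTable = { quot: '"', amp: '&', apos: "'", lt: '<', gt: '>' };

// Stage 1 (create): decode entities, as in BaseReader.prototype._decodeText.
function readStage(text) {
  return text
    .replace(/&(quot|amp|apos|lt|gt);/g, (_match, tag) => entityTable[tag])
    .replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, (_match, hexStr, numStr) =>
      String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10)));
}

// Stage 2 (end): escape again, as in BaseWriter.prototype._serializeText.
function writeStage(data) {
  return data
    .replace(/(?!&([^&;]*);)&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const input = 'Hi 1 &amp; 2 &#x0070; 3 &#70; '; // token data from the log above
const inMemory = readStage(input);
const output = writeStage(inMemory);

console.log(JSON.stringify(inMemory)); // "Hi 1 & 2 p 3 F "
console.log(JSON.stringify(output));   // "Hi 1 &amp; 2 p 3 F "
```

The named entity comes back as "&amp;" after the second stage, while the numeric references stay decoded, so the output is close to the input but not identical.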

@jasonkhanlar

jasonkhanlar commented Apr 7, 2022

Everything I commented yesterday came from my own spontaneous investigation of the code base, looking at how "&amp;" is converted to "&" (and apparently back again to "&amp;") within a single convert() function call. Note that using the create function call instead does not yield the expected results either. I was not entirely certain whether I was on to something related to the other issues posted in this repository about escaping and unescaping &amp;, &lt;, &gt;, and I don't even know whether this should or should not happen. So I started searching for Tim Berners-Lee / World Wide Web pages that might serve as a better-sourced reference for this concept. Then I found some converters/generators with existing working conversion implementations, which seem like a better reference than the documentation I had intended to cite:

INPUT XML STRING -> <text>&lt;Foo &amp; Bar&gt;</text>

That variety of outputs (with default settings) seems useful information, especially that not everything is identical result or consistent across all generators. This may suggest that a generator library should have configuration options/settings to be able to produce all possible output results so that every use case scenario is made available to the end users. Really there are only two core outputs: preserving the "&amp;" and "&lt;" and "&gt;" as is, or converting it to "&" and "<" and ">"

So then when I call

let obj = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });

Currently this reads an input of "&amp;", converts it to "&", and converts it back to "&amp;" before the value is stored in the obj variable. As of this time, I do not see any configuration option, in either the BuilderOptions (argument position 1) or the ConvertOptions (argument position 3), that tells the convert function to: not convert HTML entities at all (zero times); convert only once (either during the building stage from the create function call or during the convert stage from the end function call); or convert two times (which is almost identical to zero times, except the &#x0123; and &#01; notation entities will not be restored the same way, so they get filtered out in this process, which appears not to be configurable/customizable).

edited to add:

"except the ģ and � notation entities will not be restored the same way"

Rereading this, I made a mistake with the two characters. They should appear as:

"except the &#x0123; and &#01; notation entities will not be restored the same way"

I fixed it in the body of the message, but also in case there is any confusion, the 0123 and 01 are just random numbers and the characters are not important, just the reference to the syntax

@jasonkhanlar

jasonkhanlar commented Apr 7, 2022

Also in #117 (comment) I went backwards, trying to wrap my head around understanding things out of order. I'll try again to make sense of understanding what is happening, but in sequence. Starting with:

let obj = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, "<text>&lt;Foo &amp; Bar&gt;</text>", { format: 'object' });
  • # 1: lib/builder/BuilderFunctions.js, convert function
    • This function is short and straightforward, easy to understand. Directly before the return statement:
      • builderOptions == { convert: { text: '#text' }, encoding: 'utf8' }
      • contents == "<text>&lt;Foo &amp; Bar&gt;</text>"
        • same as initial input source (0 conversions)
      • convertOptions == { format: 'object' }
    • return value depends on output of function chain -> create(builderOptions, contents).end(convertOptions);
      • create function, see # 2.
      • end function, see # 6.
  • # 2: lib/builder/BuilderFunctions.js, create function
    • Directly before the return statement:
      • options == {convert: { text: '#text', att: '@', ins: '?', cdata: '$', comment: '!' },encoding: 'utf8',version: '1.0',standalone: undefined,keepNullNodes: false,keepNullAttributes: false,ignoreConverters: false,defaultNamespace: { ele: undefined, att: undefined },namespaceAlias: {html: 'http://www.w3.org/1999/xhtml',xml: 'http://www.w3.org/XML/1998/namespace',xmlns: 'http://www.w3.org/2000/xmlns/',mathml: 'http://www.w3.org/1998/Math/MathML',svg: 'http://www.w3.org/2000/svg',xlink: 'http://www.w3.org/1999/xlink'},invalidCharReplacement: undefined,parser: undefined}
      • contents == "<text>&lt;Foo &amp; Bar&gt;</text>"
        • same as initial input source (0 conversions)
      • [...[...doc._children][0]._children][0]._data == "<Foo & Bar>"
      • [...[...builder._domNode._children][0]._children][0]._data == "<Foo & Bar>"
    • return value depends on the value of builder variable.
      • var builder = new _1.XMLBuilderImpl(doc);
        • At this position, the value of builder is insufficient, and does not contain any ties to the Foo Bar input:
      • if (contents !== undefined) { builder.ele(contents); }
        • This is where the builder variable has processed the "<text>&lt;Foo &amp; Bar&gt;</text>" input and produced a "<Foo & Bar>" child node _data value
          • different from initial input source (1 conversion)
    • return value depends on output of function -> builder.ele(contents)
      • ele function, see # 3.
  • # 3: lib/builder/XMLBuilderImpl.js, XMLBuilderImpl.prototype.ele function
    • Directly before the return statement:
    • return value depends on output of function chain -> new readers_1.XMLReader(this._options).parse(this, p1);
      • XMLReader function
        • At this position, the value of new readers_1.XMLReader(this._options) is insufficiently containing ties to the Foo Bar input. Nothing to see here.
      • parse function, see # 4.
        • At this position, the value of parse(this, p1) is sufficiently containing ties to the Foo Bar input:
          • [...new readers_1.XMLReader(this._options).parse(this, p1)._domNode._children][0]._data == "<Foo & Bar>"
            • different from initial input source (1 conversion)
            • note: the parse function appears to have iterations that execute this ele function again (in the case of the initial input for this run, this ele function is run only 1 other time), and the process involves the variable builder which equals a large blank XMLBuilderImpl _domNode template structure, nothing important. Only the initial call appears to be relevant to the Foo Bar concerns pertaining to "&amp;" and "&lt;" and "&gt;" conversions to "&" and "<" and ">" and back (I'm looking to find/document the redundancy, and to try to make sense of, or wrap my head around, whether this behavior is correct or not)
  • # 4: lib/readers/XMLReader.js, XMLReader.prototype._parse function
    • Directly before the return statement:
      • interfaces_1.TokenType == {'0': 'EOF','1': 'Declaration','2': 'DocType','3': 'Element','4': 'Text','5': 'CDATA','6': 'PI','7': 'Comment','8': 'ClosingTag',EOF: 0,Declaration: 1,DocType: 2,Element: 3,Text: 4,CDATA: 5,PI: 6,Comment: 7,ClosingTag: 8}
      • lastChild equals an important large XMLBuilderImpl _domNode template structure
        • [...lastChild._domNode._children][0]._data == "<Foo & Bar>"
          • different from initial input source (1 conversion)
    • return value depends on while loop -> while (token.type !== interfaces_1.TokenType.EOF) {}
      • i: token == { type: 3, name: 'text', attributes: [], selfClosing: false }
      • ii: token == { type: 4, data: '&lt;Foo &amp; Bar&gt;' }
        • case interfaces_1.TokenType.Text: var text = token; context = this.text(context, this._decodeText(this.sanitize(text.data))) || context; break;
          • text.data == "&lt;Foo &amp; Bar&gt;"
            • same as initial input source (0 conversions)
          • this.sanitize(text.data) == "&lt;Foo &amp; Bar&gt;"
            • same as initial input source (0 conversions)
          • this._decodeText(this.sanitize(text.data))) == "<Foo & Bar>"
            • different from initial input source (1 conversion)
              • _decodeText function, see # 5.
      • iii: token == { type: 8, name: 'text' }
      • iv: token == { type: 3, name: 'text', attributes: [], selfClosing: false }
      • v: token == { type: 0 }
      • note: related to # 3 ele function, parse function running iterations of the parse function in separate instance than this first initial one, this _parse function is also being executed more than once from processing scoped within this initial execution of _parse. For simple evaluation, all the iterated values relevant to all executions of this _parse function will be referenced here.
    • work in progress
  • # 5: lib/readers/BaseReader.js, _decodeText function
    • Directly before the return statement:
      • text == "&lt;Foo &amp; Bar&gt;"
        • same as initial input source (0 conversions)
    • Full code of function (no further processing in this thread beyond this function -- returns back to # 1 and proceeds to # 6)
    • BaseReader.prototype._decodeText = function (text) {
          if (text == null)
              return text;
          return text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {
              return BaseReader._entityTable[tag];
          }).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {
              return String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10));
          });
      };
      • text == "&lt;Foo &amp; Bar&gt;"
        • same as initial input source (0 conversions)
      • text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {return BaseReader._entityTable[tag];}) == "<Foo & Bar>"
        • different from initial input source (1 conversion)
      • text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {return BaseReader._entityTable[tag];}).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {return String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10));}) == "<Foo & Bar>"
        • different from initial input source (1 conversion)
  • # 6: lib/builder/XMLBuilderImpl.js, XMLBuilderImpl.prototype.end function
    • Directly before the return statement:
      • writerOptions == { format: 'object' }
    • return value depends on output of function chain -> this.doc()._serialize(writerOptions);
      • doc function
        • At this position, the value of this.doc() is sufficiently containing ties to the Foo Bar input:
          • [...[...this.doc()._domNode._children][0]._children][0]._data == "<Foo & Bar>"
            • different from initial input source (1 conversion)
              • but this is already redundant from the other thread. See # 5. Therefore, nothing to see here.
      • _serialize function, see # 7.
        • At this position, the value of this.doc()._serialize(writerOptions) is sufficiently containing ties to the Foo Bar input:
          • this.doc()._serialize(writerOptions) == { text: '&lt;Foo &amp; Bar&gt;' }
            • same as initial input source (2 conversions)
  • # 7: lib/builder/XMLBuilderImpl.js, XMLBuilderImpl.prototype._serialize function
    • Directly before the return statement:
      • this.node equals a large XMLDocumentImpl template structure representation of input source XML
      • this._options == {convert: { text: '#text', att: '@', ins: '?', cdata: '$', comment: '!' },encoding: 'utf8',version: '1.0',standalone: undefined,keepNullNodes: false,keepNullAttributes: false,ignoreConverters: false,defaultNamespace: { ele: undefined, att: undefined },namespaceAlias: {html: 'http://www.w3.org/1999/xhtml',xml: 'http://www.w3.org/XML/1998/namespace',xmlns: 'http://www.w3.org/2000/xmlns/',mathml: 'http://www.w3.org/1998/Math/MathML',svg: 'http://www.w3.org/2000/svg',xlink: 'http://www.w3.org/1999/xlink'},invalidCharReplacement: undefined,parser: undefined}
      • writerOptions == { format: 'object' }
    • return value depends on the value of serialize function.
      • serialize function, see # 8.
  • # 8: lib/writers/ObjectWriter.js, ObjectWriter.prototype.serialize function
    • Directly before the return statement:
      • this._currentList == [ { text: [ { '#': '&lt;Foo &amp; Bar&gt;' } ] } ]
        • same as initial input source (2 conversions)
      • this._writerOptions == { format: 'object', wellFormed: false, group: false, verbose: false }
    • return value depends on processing of -> this.serializeNode(node, this._writerOptions.wellFormed);
      • serializeNode function, see # 9.
  • # 9: lib/writers/BaseWriter.js, BaseWriter.prototype.serializeNode function
    • _serializeNode function, see # 10.
  • # 10: lib/writers/BaseWriter.js, BaseWriter.prototype._serializeNode function
    • At the end of the function:
      • interfaces_1.NodeType == {'1': 'Element','2': 'Attribute','3': 'Text','4': 'CData','5': 'EntityReference','6': 'Entity','7': 'ProcessingInstruction','8': 'Comment','9': 'Document','10': 'DocumentType','11': 'DocumentFragment','12': 'Notation',Element: 1,Attribute: 2,Text: 3,CData: 4,EntityReference: 5,Entity: 6,ProcessingInstruction: 7,Comment: 8,Document: 9,DocumentType: 10,DocumentFragment: 11,Notation: 12}
    • depends on processing of -> switch (node.nodeType) {}
      • case interfaces_1.NodeType.Element:
        • Nothing special here
      • case interfaces_1.NodeType.Text:
        • _serializeText function, see # 11
  • # 11: lib/writers/BaseWriter.js, BaseWriter.prototype._serializeText function
    • Full code of function (no further processing in this thread beyond this function -- returns back to # 1 and proceeds to return to initial xmlbuilder2.convert(...) call)
    • BaseWriter.prototype._serializeText = function (node, requireWellFormed) {
          /**
           * 1. If the require well-formed flag is set (its value is true), and
           * node's data contains characters that are not matched by the XML Char
           * production, then throw an exception; the serialization of this node's
           * data would not be well-formed.
           */
          if (requireWellFormed && !algorithm_1.xml_isLegalChar(node.data)) {
              throw new Error("Text data contains invalid characters (well-formed required).");
          }
          /**
           * 2. Let markup be the value of node's data.
           * 3. Replace any occurrences of "&" in markup by "&amp;".
           * 4. Replace any occurrences of "<" in markup by "&lt;".
           * 5. Replace any occurrences of ">" in markup by "&gt;".
           * 6. Return the value of markup.
           */
          var markup = node.data.replace(/(?!&([^&;]*);)&/g, '&amp;')
              .replace(/</g, '&lt;')
              .replace(/>/g, '&gt;');
          this.text(markup);
      };
      • node.data == "<Foo & Bar>"
        • different from initial input source (1 conversion)
          • but this is already redundant from the other thread. See # 5. Therefore, nothing to see here.
      • node.data.replace(/(?!&([^&;]*);)&/g, '&amp;') == "<Foo &amp; Bar>"
        • different from initial input source (1.5 conversions)
      • node.data.replace(/(?!&([^&;]*);)&/g, '&amp;').replace(/</g, '&lt;') == "&lt;Foo &amp; Bar>"
        • different from initial input source (1.75 conversions)
      • node.data.replace(/(?!&([^&;]*);)&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;') == "&lt;Foo &amp; Bar&gt;"
        • same as initial input source (2 conversions)

to be completed later (in edits)

@jasonkhanlar

jasonkhanlar commented Apr 7, 2022

@oozcitak I propose some kind of ConvertOptions boolean to stop # 11 (see #117 (comment)) from converting the characters back into HTML entities. When working with JavaScript objects (converting XML source to a JS object or JSON), it seems reasonable to expect "&" and "<" and ">" in the text. Many of the generators cited in #117 (comment), as well as jq/xq/yq which I used previously, convert the HTML entities to their corresponding characters for use within JavaScript objects/JSON, so I didn't have to write extra code to do the conversion myself. Since xmlbuilder2 could offer this type of output quite simply (by skipping the final _serializeText processing when configured to do so), I suggest making this a feature, perhaps as a boolean value in the ConvertOptions object.
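Until something like that exists, the "extra code" can live outside the library as a post-processing pass over the converted object. A rough sketch, assuming the object has the shape shown in # 6 (plain objects, arrays and strings); decodeEntities and decodeDeep are hypothetical helpers of mine, not xmlbuilder2 functions:

```javascript
// Assumed entity table: the five predefined XML entities.
const entityTable = { quot: '"', amp: '&', apos: "'", lt: '<', gt: '>' };

// Same decoding as BaseReader.prototype._decodeText, applied to one string.
function decodeEntities(text) {
  return text
    .replace(/&(quot|amp|apos|lt|gt);/g, (_match, tag) => entityTable[tag])
    .replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, (_match, hexStr, numStr) =>
      String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10)));
}

// Walk the converted object and decode every string value; keys are left untouched.
function decodeDeep(value) {
  if (typeof value === 'string') return decodeEntities(value);
  if (Array.isArray(value)) return value.map(decodeDeep);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [key, decodeDeep(child)]));
  }
  return value; // numbers, booleans, null, undefined
}

// Shape taken from the _serialize output in # 6.
const converted = { text: '&lt;Foo &amp; Bar&gt;' };
console.log(decodeDeep(converted)); // { text: '<Foo & Bar>' }
```

This does the decoding a second time rather than avoiding the re-encoding, so it is a workaround and not a fix.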


Oh what? Comparing:

https://github.com/oozcitak/xmlbuilder2/blob/a7ad0d5f8e117c97acee5e40ce8f0b3a2bb9d03b/src/writers/BaseWriter.ts

  private _serializeText(node: CharacterData, requireWellFormed: boolean): void {

to

https://github.com/oozcitak/xmlbuilder2/blob/b48b061a4dd437d552a064d3f2ec7275814e582f/src/writers/BaseWriter.ts

  private _serializeText(node: CharacterData, requireWellFormed: boolean, noDoubleEncoding: boolean): void {

it appears that previously there was a way to prevent this, but apparently it has been removed/deprecated.

e9d3f93 as a fix for #82


I have not tested or played around with this library outside of my initial intended use case. However, I suspect that whether HTML entities appear in the returned output should depend, by default, on the format being converted to. XML requires entities such as "&amp;", "&gt;" and "&lt;" and therefore must use them, whereas for JSON and JS objects, either the default or a ConvertOptions argument should allow the entities to be converted into their actual characters and passed through as such in the output returned to the developer using xmlbuilder2.
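To make the proposal concrete, here is a rough sketch of a format-dependent default with an explicit override. Everything in it is hypothetical: `textForFormat`, `escapeXmlText` and the `escapeText` option are names invented for illustration, not existing xmlbuilder2 APIs or ConvertOptions fields.

```javascript
// Hypothetical helper: plain XML text escaping for "&", "<" and ">".
function escapeXmlText(data) {
  return data
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical policy: escape by default only when the target format is XML.
// An explicit options.escapeText (an assumed option name) overrides the default.
function textForFormat(data, format, options = {}) {
  const escapeByDefault = format === 'xml';
  const shouldEscape = options.escapeText ?? escapeByDefault;
  return shouldEscape ? escapeXmlText(data) : data;
}

console.log(textForFormat('<Foo & Bar>', 'xml'));    // prints: &lt;Foo &amp; Bar&gt;
console.log(textForFormat('<Foo & Bar>', 'object')); // prints: <Foo & Bar>
```

With a policy like this, XML output stays well-formed by default while object and JSON output keep the literal characters, and a caller who wants the opposite can still ask for it.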

@jasonkhanlar

jasonkhanlar commented Apr 7, 2022

Oh, I just realized something that seems confusing. # 11 lib/writers/BaseWriter.js, BaseWriter.prototype._serializeText function

Note: This is BaseWriter. This is not XMLWriter. BaseWriter. And in this file in that _serializeText function, the code is:

if (requireWellFormed && !algorithm_1.xml_isLegalChar(node.data)) {
    throw new Error("Text data contains invalid characters (well-formed required).");
}

But what if I am trying to convert to JSON or a JavaScript object? I don't really know what that check does, but the markup changes after it either way. I'm trying to write my own workaround for my use case. I have no idea what I'm doing, but I'm going to keep at it, because I'm determined to get a working solution that preserves "&", "<" and ">" characters when converting to JS objects. I'm trying to figure out how to add a ConvertOption; if I manage it, I'll submit a pull request. Before I forget: my actual goal is to preserve the formatting of the source XML exactly as is, without any changes whatsoever, while just splitting the XML into smaller chunks. So it might be even more preferable to disable all conversions entirely.
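On the question of what the `requireWellFormed` check does: the following is a minimal sketch, assuming the check follows the `Char` production of XML 1.0. It is not the library's actual `xml_isLegalChar` implementation, and the name `isLegalXmlText` is made up for illustration.

```javascript
// Sketch under an assumption: "legal" means every code point matches the
// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isLegalXmlText(text) {
  // for...of walks the string by code point, so astral characters are
  // seen whole and lone surrogates show up in the D800-DFFF range.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const legal =
      cp === 0x9 || cp === 0xA || cp === 0xD ||
      (cp >= 0x20 && cp <= 0xD7FF) ||
      (cp >= 0xE000 && cp <= 0xFFFD) ||
      (cp >= 0x10000 && cp <= 0x10FFFF);
    if (!legal) return false;
  }
  return true;
}

console.log(isLegalXmlText('<Foo & Bar>')); // prints: true
console.log(isLegalXmlText('a\u0000b'));    // prints: false
```

Under that assumption the check only rejects characters such as NUL and other control characters. "&", "<" and ">" are legal characters, so the check is separate from the entity escaping that happens afterwards.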

@Sozialarchiv

Sozialarchiv commented Apr 21, 2022

I can confirm this issue with 3.0.2. With 3.0.1 everything works fine.

@universalhandle
Collaborator

This looks like a duplicate of various issues that have been opened related to ampersand encoding (see #105, #109, #110). I believe this to be fixed with the release of v3.1.0. Please reopen if you can still reproduce after upgrading.

@Sozialarchiv

Sozialarchiv commented May 15, 2023

I can still reproduce it.

With 3.0.1 it works; with 3.1.1 it does not.

@universalhandle I cannot reopen it.

@universalhandle
Collaborator

Okay, I see you've got code to reproduce this in the OP. I'll spend some time looking at this (and the lengthy thread that pre-dates my time as maintainer) soon. Thanks.

@Sozialarchiv

@universalhandle Thank you very much. If I can do something let me know.

@StephenRieger

StephenRieger commented Jul 31, 2023

I can also reproduce the problem. The result I get from:

convert("<test att=\"&lt; &gt; &amp; &quot; ' &apos;\">&lt; &gt; &amp; \" &quot; ' &apos;</test>", { format: "json" }) is:

'{"test":{"@att":"&lt; &gt; &amp; &quot; ' &apos;","#":"&lt; &gt; &amp; \" &quot; ' &apos;"}}'

where several entities are not decoded. The result should be:

'{"test":{"@att":"< > & \" ' '","#":"< > & \" \" ' '"}}'

(I actually want to convert to “object”, but the JSON result has the same problems and is clearer here because of the escaped quote and apostrophe characters.)

When converting the corrected object back to XML, the result I get from:

create({"test":{"@att":"< > & \" ' '","#":"< > & \" \" ' '"}}).end() is:

'<?xml version="1.0"?><test att="&lt; &gt; &amp; &quot; ' '">&lt; &gt; &amp; " " ' '</test>'

Which is correct.
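Until the conversion itself decodes these entities, one possible workaround is to post-process the converted object. The sketch below is an assumption-laden illustration, not part of xmlbuilder2: `decodeEntities` and `decodeDeep` are hypothetical helpers, and only the five predefined XML entities are handled (numeric character references are left alone). The input object mirrors the JSON result reported above.

```javascript
// The five predefined XML entities and the characters they stand for.
const XML_ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', apos: "'" };

// Hypothetical helper: decode predefined entities in a single pass,
// so "&amp;lt;" becomes "&lt;" rather than "<".
function decodeEntities(text) {
  return text.replace(/&(lt|gt|amp|quot|apos);/g, (match, name) => XML_ENTITIES[name]);
}

// Hypothetical helper: walk a converted object and decode every string value.
// Keys (such as "@att" and "#") are left untouched.
function decodeDeep(value) {
  if (typeof value === 'string') return decodeEntities(value);
  if (Array.isArray(value)) return value.map(decodeDeep);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, decodeDeep(inner)])
    );
  }
  return value;
}

// Same shape as the result reported above for the "json" format.
const converted = {
  test: {
    '@att': "&lt; &gt; &amp; &quot; ' &apos;",
    '#': "&lt; &gt; &amp; \" &quot; ' &apos;",
  },
};

console.log(JSON.stringify(decodeDeep(converted)));
// prints: {"test":{"@att":"< > & \" ' '","#":"< > & \" \" ' '"}}
```

Decoding in one pass matters here: running separate replacements for each entity could decode `&amp;lt;` twice and turn it into `<`.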

@StephenRieger

StephenRieger commented Aug 1, 2023

There's a pull request intended to fix this problem: #131.

@universalhandle
Collaborator

I'm on extended holiday so I won't be able to look at this for a few weeks, but thank you for the contribution. If another maintainer hasn't addressed it before I get back, I'll have a look about this time next month. Thanks for your patience.

@Sozialarchiv

Is there any news about this issue?
