Upgrading from 3.0.1 to 3.0.2 un-escapes & #117
Comments
This bug report seems related. I started using xmlbuilder2 with v3.0.2, and I'm trying to convert XML to JSON and back to XML and verify that the output identically matches the source, and source XML contains " I would like to find an option or way to ensure all "
My code at the moment:
let xml = await fs.promises.readFile(file, { encoding: 'utf8' });
let json = xmlbuilder2.convert(xml, { format: 'object' });
let xmlverify = xmlbuilder2.convert(json, { format: 'xml', prettyPrint: true, spaceBeforeSlash: true });
These also seem relevant:
Also note that, comparing with my jq/xq/yq codebase that I'm migrating away from, converting the XML source to JSON, the XML having "
Edited to add some notes while glancing through the codebase:
Possibly relevant script files (and possibly relevant code-related matches):
|
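For the "verify the output identically matches the source" step described above, a small comparison helper can show where the regenerated XML first diverges. This is only a sketch: `firstDifference` is a hypothetical name, not part of xmlbuilder2, and it compares plain strings without any XML awareness.

```javascript
// Hypothetical helper (not part of xmlbuilder2): report where two strings first diverge.
// Returns null when the strings are identical, otherwise the index of the first
// mismatch plus a short excerpt from each string starting at that index.
function firstDifference(expected, actual) {
  const limit = Math.min(expected.length, actual.length);
  for (let i = 0; i < limit; i++) {
    if (expected[i] !== actual[i]) {
      return { index: i, expected: expected.slice(i, i + 20), actual: actual.slice(i, i + 20) };
    }
  }
  if (expected.length !== actual.length) {
    // One string is a prefix of the other; the difference starts where the shorter one ends.
    return { index: limit, expected: expected.slice(limit, limit + 20), actual: actual.slice(limit, limit + 20) };
  }
  return null;
}
```

Called as `firstDifference(xml, xmlverify)` with the variables from the snippet above, it would point at the first character where the two documents differ, which is typically where an entity was decoded or re-encoded.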
Debugging the code referenced in #117 (comment), particularly modifying:
BaseReader.prototype._decodeText = function (text) {
if (text == null) return text;
console.log('decoding', text, 'after', text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {
return BaseReader._entityTable[tag];
}).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {
return String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10));
})
); // end of console.log
return text.replace(/&(quot|amp|apos|lt|gt);/g, function (_match, tag) {
return BaseReader._entityTable[tag];
}).replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, function (_match, hexStr, numStr) {
return String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10));
});
};
I can definitely see when I call:
let json = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });
that the _decodeText() function is running and properly converting " and I could have sworn that this was working just a couple of hours ago, converting properly. Unfortunately, however, the actual return value being passed into
let json = xmlbuilder2.convert(xml, { format: 'object' });
Nope. That didn't work either. I am so confused right now, haha, cuz I wasn't having this issue earlier, if I remember correctly (yep), and I was preparing to investigate a different concern, but got stuck at this point.
Whatever. Anyway, investigating what is not working properly in this situation: lib/readers/XMLReader.js line 105 has
context = this.text(context, this._decodeText(this.sanitize(text.data))) || context;
and the
However, I think I possibly understand my confusion to a slight extent. I believe about 4 hours ago I was looking at the sourceXML -> JSON -> generatedXML; I was looking at the generatedXML output and seeing "
Oh also, I should backtrack to the
while (token.type !== interfaces_1.TokenType.EOF) {
console.log('token', token);
Actually, I'm wrong! After the while loop, just before the return statement, I added console.log line and it appears that the value returned by the XMLReader.prototype._parse() function has the correct value.
So then why is the variable in my code not showing this?
And editing file lib/builder/BuilderFunctions.js, function create():
the output is also "
Back in file lib/builder/XMLBuilderImpl.js, function XMLBuilderImpl.prototype.end, it is the _serialize() function call that is converting the "
Each of the writers has its own prototype.serialize function, which points to the serializeNode function in lib/writers/BaseWriter.js, which points to _serializeNodeNS, which points to _serializeText, which then explicitly shows this code:
which appears to forcibly convert back to "
let json = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });
So a summary recap of what I learned, starting with
and I didn't know which one of those files to start with specific to converting from XML source to JavaScript object output, but apparently all... oh actually, lib/builder/XMLBuilderCBImpl.js did not cross my path, however lib/builder/XMLBuilderImpl.js did cross my path, except, that file does not contain any explicit "
TL;DR: file lib/builder/BuilderFunctions.js function convert,
|
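To make the read-then-write path traced in the comment above easier to follow, here is a standalone sketch of the two steps. `decodeText` mirrors the two-pass replace in the `_decodeText` code quoted earlier; the contents of `entityTable` are assumed to be the five predefined XML entities, and `encodeText` is a hypothetical stand-in for a writer's text serialization that escapes only `&`, `<` and `>` (the usual rule for XML text content, not confirmed here as what `_serializeText` does).

```javascript
// The five predefined XML entities. Assumption: BaseReader._entityTable holds the same mapping.
const entityTable = { quot: '"', amp: '&', apos: "'", lt: '<', gt: '>' };

// Same two-pass replace as the _decodeText code quoted above:
// named entities first, then decimal and hexadecimal character references.
function decodeText(text) {
  if (text == null) return text;
  return text
    .replace(/&(quot|amp|apos|lt|gt);/g, (_match, tag) => entityTable[tag])
    .replace(/&#(?:x([a-fA-F0-9]+)|([0-9]+));/g, (_match, hexStr, numStr) =>
      String.fromCodePoint(parseInt(hexStr || numStr, hexStr ? 16 : 10)));
}

// Hypothetical writer step: escape only the characters that XML text content requires.
function encodeText(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Reading then writing: &amp; and &lt; survive the round trip, &quot; does not.
const source = 'Foo &quot;Bar&quot; &amp; &lt;Baz&gt;';
const roundTripped = encodeText(decodeText(source));
console.log(roundTripped);
```

Under these assumptions, any entity that the reader decodes but the writer does not re-escape comes back as a literal character, which would explain output that no longer matches the source byte for byte.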
Since yesterday, everything that I commented about regarding my own spontaneous investigation of the code base pertaining to converting " INPUT XML STRING ->
That variety of outputs (with default settings) seems like useful information, especially since not every generator produces an identical or consistent result. This may suggest that a generator library should have configuration options to produce all possible outputs, so that every use case is available to end users. Really there are only two core outputs: preserving the "
So then when I call
let obj = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, xml, { format: 'object' });
currently this reads input of "
edited to add:
Rereading this, I made a mistake with the two characters. They should appear as:
I fixed it in the body of the message, but also in case there is any confusion, the 0123 and 01 are just random numbers and the characters are not important, just the reference to the syntax |
Also in #117 (comment) I went backwards, trying to wrap my head around things out of order. I'll try again to make sense of what is happening, but in sequence. Starting with:
let obj = xmlbuilder2.convert({ convert: { text: '#text' }, encoding: 'utf8' }, "<text><Foo & Bar></text>", { format: 'object' });
to be completed later (in edits) |
@oozcitak I propose some kind of ConvertOption boolean to disable # 11 (see #117 (comment)) from reverting the characters back into HTML entities. When working with JavaScript objects (converting XML source to JS Object or JSON), it seems reasonable to expect to find "
Oh what? Comparing:
private _serializeText(node: CharacterData, requireWellFormed: boolean): void {
to
private _serializeText(node: CharacterData, requireWellFormed: boolean, noDoubleEncoding: boolean): void {
it appears that previously there was a way to prevent this, but apparently it has been removed/deprecated.
I have not tested or played around with this library outside of my initial intended use case. However, I suspect that whether HTML entities appear in the returned output should depend (by default) on the type of format being converted to. For example, XML by default requires, and therefore must use, HTML entities as " |
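The `noDoubleEncoding` parameter in the older signature suggests a flag that leaves already-encoded entities alone. The sketch below shows one way such a flag could behave; `escapeText` is a hypothetical helper, and its logic is an assumption about what the name implies, not the library's actual implementation.

```javascript
// Hypothetical helper: escape text for XML output.
// With noDoubleEncoding set, an ampersand that already starts a named or numeric
// entity reference (for example &amp; or &#38;) is left untouched.
function escapeText(text, noDoubleEncoding) {
  const ampersand = noDoubleEncoding
    ? /&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)/g
    : /&/g;
  return text
    .replace(ampersand, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeText('Foo &amp; Bar & Baz', true));  // existing &amp; kept, bare & escaped
console.log(escapeText('Foo &amp; Bar & Baz', false)); // every & escaped, so &amp; becomes &amp;amp;
```

The trade-off is that with the flag on, text that is meant to contain the literal characters `&amp;` can no longer be represented, so a flag like this would have to be an explicit opt-in.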
Oh, I just realized something that seems confusing. # 11 lib/writers/BaseWriter.js, BaseWriter.prototype._serializeText function. Note: This is BaseWriter. This is not XMLWriter. BaseWriter. And in this file, in that _serializeText function, the code is:
if (requireWellFormed && !algorithm_1.xml_isLegalChar(node.data)) {
  throw new Error("Text data contains invalid characters (well-formed required).");
}
but what if I am trying to convert to JSON? Or a JavaScript object? I don't even know what that does really, but otherwise, the markup changes after that, ...... I'm trying to write my own code workaround to fix this for my use case, but I have absolutely no idea what I'm doing. Whatever it is that I'm doing, I'm gonna keep doing it, cuz I am determined to get a working solution so that I can use the library to preserve " |
I can confirm this issue with 3.0.2. With 3.0.1 everything works fine. |
I can still reproduce it. With 3.0.1 it works; with 3.1.1 it does not. @universalhandle I cannot reopen it. |
Okay, I see you've got code to reproduce this in the OP. I'll spend some time looking at this (and the lengthy thread that pre-dates my time as maintainer) soon. Thanks. |
@universalhandle Thank you very much. If I can do something let me know. |
I can also reproduce the problem. The result I get from:
where several entities are not decoded. The result should be:
(I actually want to convert to “object”, but the JSON result has the same problems and is clearer here because of the escaped quote and apostrophe characters.) When converting the corrected object back to XML, the result I get from:
Which is correct. |
There's a pull request intended to fix this problem: #131. |
I'm on extended holiday so I won't be able to look at this for a few weeks, but thank you for the contribution. If another maintainer hasn't addressed it before I get back, I'll have a look about this time next month. Thanks for your patience. |
Is there any news about this issue? |
Describe the bug
We're encoding some HTML into an XML document. The HTML has an already encoded &amp;. Going from 3.0.1 to 3.0.2, this sequence is converted back into a bare & instead of being left as &amp;.
To Reproduce
Expected behavior
We expect the following XML (3.0.1):
We're getting this, instead (3.0.2):
The only difference is that the first &amp; is now a bare &, making the XML document invalid.
Picture to highlight differences:
Version:
Additional context
Add any other context about the problem here.