-
-
Notifications
You must be signed in to change notification settings - Fork 384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix existing and add new options #74
Conversation
You want to dynamically enable |
I am just looking for more fine-grained control over the parser and no changes to the tokenizer are required. I updated the parser to allow the option to recognize self-closing tags and CDATA while still in "HTML mode". The issue was that "xmlMode" was a single option that was being used to control a lot of different parsing behaviors. In my case, I have HTML files that may include XML constructs such as self-closing tags and CDATA sections. The changes I made will not impact existing code at all because there will only be an impact on parsing if the new options are enabled. As a separate fix, I fixed the bug where the "lowerCaseTags" and "lowerCaseAttributeNames" options were not working as expected. I pushed to the same branch so that commit was merged into this Pull Request. Please let me know if you need more clarification. Thanks for looking into this Pull Request |
What I was saying is this: If you don't want dynamic |
I'll make the change and add a new commit. Thanks. |
I have updated my Pull Request with the suggested change. Please review and let me know if you find any issues. Thanks. |
@@ -89,6 +89,9 @@ function Parser(cbs, options){ | |||
this.startIndex = 0; | |||
this.endIndex = null; | |||
|
|||
this._lowerCaseTagNames = options.lowerCaseTags == null ? !options.xmlMode : options.lowerCaseTags === true; | |||
this._lowerCaseAttributeNames = options.lowerCaseAttributeNames == null ? !options.xmlMode : options.lowerCaseAttributeNames === true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails if options
is undefined (have a look at the failing test). Also, I would prefer a solution that accepts all truthy values, not only true
(the property should be a bool, though). The in
operator is probably the better choice here as well.
I think we are close. I updated the code to handle the case where options is null and verified that tests passed. I also updated the code to allow truthy values for the |
I reviewed the changes independently and they look good. I like the fine-grain control over XML-like parsing behavior instead of having only a single "xmlMode" flag. The only line that didn't need to be changed was "this._options = options || {};" (which was split into two lines). Are there any other changes that need to be made before this can be merged? |
I don't really understand why the |
The reason var parser = new htmlparser.Parser(handlers, {
lowerCaseTags: false
}); The user would expect the ("lowerCaseTags" in options) === true The following expression will result in the correct boolean value: (!!options.lowerCaseTags) === false For your second point, why would you want to avoid polymorphic |
@@ -89,6 +90,9 @@ function Parser(cbs, options){ | |||
this.startIndex = 0; | |||
this.endIndex = null; | |||
|
|||
this._lowerCaseTagNames = options.lowerCaseTags == null ? !options.xmlMode : !!options.lowerCaseTags; | |||
this._lowerCaseAttributeNames = options.lowerCaseAttributeNames == null ? !options.xmlMode : !!options.lowerCaseAttributeNames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this._lowerCaseAttributeNames = "lowerCaseAttributeNames" in this._options ? !!this._options.lowerCaseAttributeNames : !this._options.xmlMode;
I updated the code to use the Feel free to merge my change and make any additional syntactic changes if you think they're necessary. My end-goal is to have additional control over parsing behavior in HTML parsing mode via the new options. Thanks. |
@@ -76,7 +76,8 @@ var voidElements = { | |||
var re_nameEnd = /\s|\//; | |||
|
|||
function Parser(cbs, options){ | |||
this._options = options || {}; | |||
options = options || {}; | |||
this._options = options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just move this to a single line & I'll merge.
It would also be great if you could update the wiki page to reflect these changes. |
Please review the latest change with the options init code merged into a single line. |
Also, I think it would be beneficial to move the documentation on the Wiki into the README so that the documentation is versioned with the code. Thoughts? Would you like me to make that change? |
Merged and done! Thanks a lot, @patrick-steele-idem! |
It would probably be beneficial to move the documentation to the module, but eg. the cheerio docs link to the page & I doubt it's worth the effort. |
To allow a hybrid approach to parsing HTML code that may have certain XML constructs, a few more options have been added:
true
then self-closing tags will result in the tag being closed even ifxmlMode
is not set totrue
true
then CDATA text will result in thecontext
event being fired even ifxmlMode
is not set totrue
Also, the
lowerCaseTags
andlowerCaseAttributeNames
options have been fixed so that case conversion can be disabled in non-XML mode.