Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix existing and add new options #74

Merged
merged 6 commits into from
Feb 14, 2014
Merged

Conversation

patrick-steele-idem
Copy link
Contributor

To allow a hybrid approach to parsing HTML code that may have certain XML constructs, a few more options have been added:

  • recognizeSelfClosing: If set to true then self-closing tags will result in the tag being closed even if xmlMode is not set to true
  • recognizeCDATA: If set to true then CDATA text will result in the context event being fired even if xmlMode is not set to true

Also, the lowerCaseTags and lowerCaseAttributeNames options have been fixed so that case conversion can be disabled in non-XML mode.

@fb55
Copy link
Owner

fb55 commented Feb 9, 2014

You want to dynamically enable xmlMode? That will require further changes to the tokenizer. Otherwise, I'd prefer to just add another attribute to the instance (instead of the method).

@patrick-steele-idem
Copy link
Contributor Author

I am just looking for more fine-grained control over the parser and no changes to the tokenizer are required. I updated the parser to allow the option to recognize self-closing tags and CDATA while still in "HTML mode". The issue was that "xmlMode" was a single option that was being used to control a lot of different parsing behaviors. In my case, I have HTML files that may include XML constructs such as self-closing tags and CDATA sections. The changes I made will not impact existing code at all because there will only be an impact on parsing if the new options are enabled.

As a separate fix, I fixed the bug where the "lowerCaseTags" and "lowerCaseAttributeNames" options were not working as expected. I pushed to the same branch so that commit was merged into this Pull Request.

Please let me know if you need more clarification. Thanks for looking into this Pull Request

@fb55
Copy link
Owner

fb55 commented Feb 10, 2014

What I was saying is this: If you don't want dynamic xmlMode (which you apparently don't): Add a _lowerCaseTagNames attribute to the instance and get rid of the isLowerCaseTagsEnabled() method.

@patrick-steele-idem
Copy link
Contributor Author

I'll make the change and add a new commit. Thanks.

@patrick-steele-idem
Copy link
Contributor Author

I have updated my Pull Request with the suggested change. Please review and let me know if you find any issues. Thanks.

@@ -89,6 +89,9 @@ function Parser(cbs, options){
this.startIndex = 0;
this.endIndex = null;

this._lowerCaseTagNames = options.lowerCaseTags == null ? !options.xmlMode : options.lowerCaseTags === true;
this._lowerCaseAttributeNames = options.lowerCaseAttributeNames == null ? !options.xmlMode : options.lowerCaseAttributeNames === true;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails if options is undefined (have a look at the failing test). Also, I would prefer a solution that accepts all truthy values, not only true (the property should be a bool, though). The in operator is probably the better choice here as well.

@patrick-steele-idem
Copy link
Contributor Author

I think we are close. I updated the code to handle the case where options is null and verified that tests passed. I also updated the code to allow truthy values for the lowerCaseTags and lowerCaseAttributeNames options. However, I chose not to use "in" since that only checks for the property being in the options and would evaluate to true even if one of the option values was set to false. Please review the updated code and let me know if you would like me to make any additional updates. Thanks

@philidem
Copy link

I reviewed the changes independently and they look good. I like the fine-grain control over XML-like parsing behavior instead of having only a single "xmlMode" flag. The only line that didn't need to be changed was "this._options = options || {};" (which was split into two lines).

Are there any other changes that need to be made before this can be merged?

@fb55
Copy link
Owner

fb55 commented Feb 14, 2014

I don't really understand why the in operator shouldn't be used. == null does pretty much the same, but ignores properties with the values null and undefined. I would also prefer to avoid polymorphic variables (ie. options).

@patrick-steele-idem
Copy link
Contributor Author

The reason in is not a good choice is because the following code would not work as expected when using in:

var parser = new htmlparser.Parser(handlers, {
        lowerCaseTags: false
    });

The user would expect the lowerCaseTags option to be set to be false, but because that property is in the options it actually is set to true because:

("lowerCaseTags" in options) === true

The following expression will result in the correct boolean value:

(!!options.lowerCaseTags) === false

For your second point, why would you want to avoid polymorphic options argument? The parser already supported an options argument so I don't see the harm in extending it with new options.

@@ -89,6 +90,9 @@ function Parser(cbs, options){
this.startIndex = 0;
this.endIndex = null;

this._lowerCaseTagNames = options.lowerCaseTags == null ? !options.xmlMode : !!options.lowerCaseTags;
this._lowerCaseAttributeNames = options.lowerCaseAttributeNames == null ? !options.xmlMode : !!options.lowerCaseAttributeNames;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this._lowerCaseAttributeNames = "lowerCaseAttributeNames" in this._options ? !!this._options.lowerCaseAttributeNames : !this._options.xmlMode;

@patrick-steele-idem
Copy link
Contributor Author

I updated the code to use the in operator as suggested.

Feel free to merge my change and make any additional syntactic changes if you think they're necessary. My end-goal is to have additional control over parsing behavior in HTML parsing mode via the new options.

Thanks.

@@ -76,7 +76,8 @@ var voidElements = {
var re_nameEnd = /\s|\//;

function Parser(cbs, options){
this._options = options || {};
options = options || {};
this._options = options;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just move this to a single line & I'll merge.

@fb55
Copy link
Owner

fb55 commented Feb 14, 2014

It would also be great if you could update the wiki page to reflect these changes.

@patrick-steele-idem
Copy link
Contributor Author

Please review the latest change with the options init code merged into a single line.

fb55 added a commit that referenced this pull request Feb 14, 2014
Fix existing and add new options
@fb55 fb55 merged commit 4497ee4 into fb55:master Feb 14, 2014
@patrick-steele-idem
Copy link
Contributor Author

Also, I think it would be beneficial to move the documentation on the Wiki into the README so that the documentation is versioned with the code. Thoughts? Would you like me to make that change?

@fb55
Copy link
Owner

fb55 commented Feb 14, 2014

Merged and done! Thanks a lot, @patrick-steele-idem!

@fb55
Copy link
Owner

fb55 commented Feb 14, 2014

It would probably be beneficial to move the documentation to the module, but eg. the cheerio docs link to the page & I doubt it's worth the effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants