Create a built in HTML parser #37
Comments
Here is a test case with an example of an error case that Floki does not handle today: henrik/sipper@49a4c09 Thanks @henrik for the example! |
@philss creating an html parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as this one https://github.com/google/gumbo-parser? |
@gmile yeah, I thought about that, but I would prefer not to rely on an external dependency. That said, the option hasn't been ruled out. I also think Servo's HTML parser is a good option. |
@philss that said, are you specifically waiting for Servo's HTML implementation? Otherwise, I could play with gumbo-parser integration and see how it goes. |
@gmile I'm not looking into this right now. So, please go for it. 👍 |
I was wondering what the expected behavior of a native html parser would be. Right now mochiweb_html.parse always returns empty lists in either the middle or the end (depending on what level of nesting the html has). I'm not sure if this is a bug or feature but it was confusing when I first started using the library because I was hoping for some kind of "to_hash" like function in ruby.
Would a replacement function recreate this behavior for backwards compatibility or break the api? BTW, thanks for the awesome library! |
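For readers wondering about those empty lists: they are presumably the attribute and child slots of the `{tag, attributes, children}` tuples in the parsed tree. Below is a minimal sketch of a "to_hash"-style conversion into nested maps. The module `TreeToMap` and its `convert/1` function are hypothetical names made up for illustration and are not part of Floki or mochiweb.

```elixir
defmodule TreeToMap do
  # Hypothetical helper, not part of Floki: converts a tuple-based
  # HTML tree into nested maps, assuming the node shapes noted below.

  # Element node: {tag, attributes, children}
  def convert({tag, attrs, children}) when is_binary(tag) and is_list(children) do
    %{tag: tag, attrs: Map.new(attrs), children: Enum.map(children, &convert/1)}
  end

  # Comment node: {:comment, text}
  def convert({:comment, text}), do: %{comment: text}

  # Text node: a plain binary
  def convert(text) when is_binary(text), do: %{text: text}
end

tree = {"p", [{"class", "intro"}], ["Hello ", {"b", [], ["world"]}]}
IO.inspect(TreeToMap.convert(tree))
```

With this shape, an element without attributes or children simply ends up with an empty `attrs` map and an empty `children` list.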
It would be awesome to have something like this:
%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking of something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}
instead of:
{"p", [], []}
"content"
{comment: "content"}
I was also thinking about:
Floki.DocType.parse() # returns a struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs
Features:
Optional features:
<div style='fontt-color: white;'></div> |
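The struct idea above could be sketched roughly as follows. The `Sketch.Leaf.*` modules are hypothetical stand-ins for the proposed `Floki.Leaf.*` structs (the optional `events` and `styles` fields are left out), and the input is assumed to be a tuple tree of `{tag, attributes, children}` elements, `{:comment, text}` comments and plain-binary text nodes. None of this is Floki's actual API.

```elixir
defmodule Sketch.Leaf.Node do
  defstruct name: nil, attributes: [], children: []
end

defmodule Sketch.Leaf.TextNode do
  defstruct content: ""
end

defmodule Sketch.Leaf.Comment do
  defstruct content: ""
end

defmodule Sketch.Leaf do
  # Hypothetical conversion layer, for illustration only.
  alias Sketch.Leaf.{Node, TextNode, Comment}

  # Build structs from the tuple-based tree.
  def from_tuple({name, attributes, children}) when is_binary(name) do
    %Node{name: name, attributes: attributes, children: Enum.map(children, &from_tuple/1)}
  end

  def from_tuple({:comment, content}), do: %Comment{content: content}
  def from_tuple(text) when is_binary(text), do: %TextNode{content: text}

  # Collect the text of a subtree by matching on the struct type.
  def text(%TextNode{content: content}), do: content
  def text(%Comment{}), do: ""
  def text(%Node{children: children}), do: Enum.map_join(children, "", &text/1)
end

doc = Sketch.Leaf.from_tuple({"p", [], ["content", {:comment, "ignored"}]})
IO.puts(Sketch.Leaf.text(doc))
```

One argument for structs over bare tuples is visible in `text/1`: each clause names the node type it handles, so a text node and a comment can no longer be confused by shape alone.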
Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath paths for you to specifically grab tags which would save me a lot of pattern matching... As far as html5ever, check out https://github.com/hansihe/Rustler |
@mhsjlw I agree. Please follow this issue for more details: #94 (sorry for the delay 😅 ). |
@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki! Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser |
@philss wow, that's awesome! Thanks! |
Rust NIFs anyone? https://github.com/servo/html5ever ;) |
@liveresume this was mentioned, twice, see #37 (comment) and #37 (comment) |
Please have a look at:
@Overbryd gave a talk about it in Berlin |
@f34nk Happy to help on this one. I also wrote https://github.com/Overbryd/nodex that can be used to provide a safe execution (c-)node to give the best in performance/safety. I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package. |
I didn't know we had bindings for We could for sure write an adapter like we did for Thank you for letting us know, @f34nk! Can you open a new issue with the proposal? |
This is part of a bigger effort to write a compliant HTML parser in Elixir.
The implementation follows the WHATWG specification, which is the living standard of HTML, but parts of the tokenizer are still missing, such as the handling of parse errors and some states. Those missing parts are not essential for most documents. You can see details about the HTML specification here: https://html.spec.whatwg.org/multipage/
This commit contains a lot of files. The most important one is `lib/floki/html/tokenizer.ex`. We added a lot of test files that were generated according to html5lib-tests, a project that aims to provide test cases based on the WHATWG specs. See: https://github.com/html5lib/html5lib-tests
This tokenizer was written based on the specs as seen around September 2019. Most of the parser development progress is being tracked at https://github.com/philss/floki/projects/2
For now it will remain "private" and no other module is using it. This is related to #37 :)
Floki needs an HTML parser built in, in order to remove the mochiweb dependency. This will enable more flexibility and better control of the parsing step.
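To give a rough idea of what a state-machine tokenizer in the style of the WHATWG specification looks like, here is a heavily simplified sketch. It is not the code in `lib/floki/html/tokenizer.ex`: the module `MiniTokenizer` is hypothetical, it only loosely mirrors the spec's data and tag name states, and it ignores attributes, comments, character references and parse errors.

```elixir
defmodule MiniTokenizer do
  # Hypothetical, heavily simplified sketch; not Floki's tokenizer.
  # Each private function plays the role of one tokenizer state.

  def tokenize(html) when is_binary(html), do: data(html, [])

  # "Data" state: emit character tokens until a tag opens.
  defp data("", acc), do: Enum.reverse(acc)
  defp data("</" <> rest, acc), do: tag_name(rest, :end_tag, "", acc)
  defp data("<" <> rest, acc), do: tag_name(rest, :start_tag, "", acc)

  defp data(<<char::utf8, rest::binary>>, acc) do
    data(rest, [{:char, <<char::utf8>>} | acc])
  end

  # "Tag name" state: accumulate the lowercased name until ">".
  defp tag_name(">" <> rest, type, name, acc), do: data(rest, [{type, name} | acc])

  defp tag_name(<<char::utf8, rest::binary>>, type, name, acc) do
    tag_name(rest, type, name <> String.downcase(<<char::utf8>>), acc)
  end

  # End of input inside a tag: the unfinished tag is dropped here,
  # whereas the real spec also reports a parse error.
  defp tag_name("", _type, _name, acc), do: Enum.reverse(acc)
end

IO.inspect(MiniTokenizer.tokenize("<P>hi</p>"))
```

Pattern matching on binary prefixes keeps every state transition in a function head, which is one reason this style of tokenizer maps comfortably onto Elixir.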
The parser goals are: