-
-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Doctypes other than HTML5 aren't emitted correctly #37
Comments
This is because Markup's HTML serializer follows the spec here, which allows these doctypes only:
That's obviously too strict. I think this is best solved by actually stripping off the doctype if it is present, rather than sending it to Markup, and outputting it directly from Lambda Soup. If someone later wants legacy doctype support in HTML in Markup, we can move the support there and make Lambda Soup use that, as an internal change. |
Also note that XHTML should probably be parsed using the XML parser in Markup, and serialized with the XML serializer. That should have precluded most of these issues, which arise from parsing XML with an HTML5 parser and serializing it with an HTML5 serializer. I understand that the XML parser might not be as convenient, because you would have to use |
Given that, how would you like to proceed? Are you facing a mixture of documents, some of which are HTML5 and some XML, and need some kind of easy way of Markup/Lambda Soup automatically figuring out which parser and serializer to use? I haven't considered that before, but it might be the right thing to do. |
It's an interesting question. It's not a problem I'm having personally, so it's mostly a theoretical issue. I suppose maybe we should leave it as is until a person who actually has that problem comes along. |
The HTML5 spec requires outputting <!DOCTYPE html> as the doctype. However, Markup.ml already has a helper (Markup.html5) for replacing any doctype with the HTML5 doctype. So, it is better for the writer to output whatever doctype is present in the signal stream, trusting the user to call Markup.html5 if it is necessary for their purpose. Part of aantron/lambdasoup#37.
Can you test again with I modified Markup's HTML writer to output whatever doctype is there, instead of trying to output only HTML5 doctypes. This seems reasonable, given that there is I also added a test to Lambda Soup for round-tripping the doctype from this issue. It passes with Markup See the two commits linked above if curious about the code. If this works for you, I will release it in Markup 1.1.0. Lambda Soup won't need a release. Note that Markup |
I've just tested it and it seems to work fine. Thanks for the fix! I actually have both markup and lambdasoup dependencies specified explicitly in soupault.opam due to some previous situation, so I don't mind. |
I'll give it a day in case anything else comes to mind, as the doctype is proving to be a rich source of problems for Lambda Soup and Markup.ml :P |
Well, I'm working on some other improvements for soupault 2.1.0 too, I wasn't planning to release it right today either. ;) One visual oddity: traditionally people put a newline between the doctype and the |
Can you open an issue on Markup.ml about this and any other such things? I think we can fix this in the pretty printer (I am assuming based on the code at the beginning of this issue, that you are using it). |
And ping me if/when you would like a release. I'll wait a bit to hopefully include several changes, and then do it. |
I need to sit with the Markup.ml code to actually understand it so that I can contriubute in the future... |
@aantron I've completed my work on changes I want in soupault 2.1.0, so I'm ready to release it whenever the new Markup release is available. |
I plan to do aantron/markup.ml#59 and retag 1.0.0 and release it in the first half of the coming week. |
Yeah, I've seen that issue. Whenever the re-tagged markup 1.0.0 is available from the OPAM repos, I'll set the dependency to |
I'm sorry for not giving #32 proper testing!
Your implementation only works for the trivial HTML5 doctype. This makes it impossible to keep the real original doctype for any legacy pages.
Example:
The text was updated successfully, but these errors were encountered: