explore mediawiki parsers instead of parsing HTML directly #58
Comments
Hello Suyash, I may like to work on this if I have the time. Some questions:
|
Hi @sehwol, thank you for your interest in this! I was planning to use mwparserfromhell to parse the wikitext directly instead of the HTML, mainly for the following reasons:
The wikitext can be retrieved using Wiktionary's API. I'll accept a PR if the tests pass and the code looks good to me. The two pending ones aren't really complete, so I haven't merged them yet. |
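For anyone picking this up, here is a minimal sketch of that two-step idea: fetch the raw wikitext through the MediaWiki `action=parse` API, then hand it to mwparserfromhell. The function names, the User-Agent string, and the choice of `formatversion=2` are my own assumptions for illustration, not code from this repository.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_parse_url(title, oldid=None):
    """Build an action=parse URL that asks for raw wikitext only."""
    params = {
        "action": "parse",
        "format": "json",
        "formatversion": "2",  # returns the wikitext as a plain string
        "prop": "wikitext",
    }
    # A specific revision takes precedence over the page title.
    if oldid is not None:
        params["oldid"] = str(oldid)
    else:
        params["page"] = title
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(title, oldid=None):
    """Download the wikitext of a page or revision (needs network access)."""
    request = urllib.request.Request(
        build_parse_url(title, oldid),
        # Hypothetical User-Agent; use one that identifies your own tool.
        headers={"User-Agent": "wiktionary-parser-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["wikitext"]


def section_headings(wikitext):
    """Return (level, title) pairs for every heading in the wikitext."""
    import mwparserfromhell  # third-party: pip install mwparserfromhell

    code = mwparserfromhell.parse(wikitext)
    return [(h.level, str(h.title).strip()) for h in code.filter_headings()]
```

Something like `section_headings(fetch_wikitext("video", oldid=50291344))` would then list the language and part-of-speech headings without touching the rendered HTML.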
Hi Suyash, do all the tests pass on your machine? When I fetch words like "video" (Latin, oldid 50291344), I sometimes get output like this ...
"text": [
"Lua error in Module:la-verb at line 747: The parameter \"conj\" is not used by this template.",
"I see, perceive; look (at)",
...
Source: https://en.wiktionary.org/wiki/video?printable=yes&oldid=50291344#Verb_2
I'm not sure whether Wiktionary just developed a bug or if it's something else. Edit: |
Tbh I haven't worked on this project in a while, but I'll take a look at the tests right away. The exception looks like an error on Wiktionary's end that turns up when the wikitext is rendered. Adding |
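Separately from whatever fix ends up in the parser, a caller could screen out these rendered error strings before using the definitions. This is only a sketch under the assumption that such errors always start with "Lua error in Module:<name> at line <n>:", as in the sample above; the helper name and the regular expression are hypothetical.

```python
import re

# Assumed shape of a rendered Scribunto error, based on the sample in this thread.
LUA_ERROR = re.compile(r"^Lua error in Module:\S+ at line \d+:")


def split_render_errors(lines):
    """Separate real definition lines from rendered Lua error messages."""
    clean, errors = [], []
    for line in lines:
        if LUA_ERROR.match(line):
            errors.append(line)
        else:
            clean.append(line)
    return clean, errors


sample = [
    'Lua error in Module:la-verb at line 747: The parameter "conj" is not used by this template.',
    "I see, perceive; look (at)",
]
clean, errors = split_render_errors(sample)
```

Keeping the errors in a separate list, rather than silently dropping them, makes it easier to tell a Wiktionary-side breakage from a bug in the parser.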
I think this is one of the most comprehensive parsers which does that: https://github.com/tatuylonen/wiktextract |
@frankier this looks very promising, thanks for pointing it out! |
Instead of parsing the HTML, use existing MediaWiki parsers (like mwparserfromhell) as a second stage, since headings, content, tags, comments, etc. are clearly defined in wikitext and the wikitext content is more compact.