Support extending CodeMirror highlighting with plugins #12
Agreed, let's bike shed this ⚡(tomorrow for me!) |
I think CodeMirror overlay modes might be the right tool for this job. Each extension that wants to extend the syntax can just add an overlay. If this were the case, we wouldn't need to implement an interface for this; the existing CodeMirror interface should (I believe) suffice. |
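As a rough sketch of what a plugin-supplied overlay could look like: a CodeMirror 5 overlay is a plain object with a `token(stream)` method that returns a style name, or `null` to leave the text to the base mode. The `{role}` syntax, the `myst-role` style name and the `MiniStream` class are assumptions made for illustration. `MiniStream` only mimics the few `StringStream` members used here, so the snippet runs without CodeMirror installed.

```typescript
// Minimal stand-in for the subset of CodeMirror 5's StringStream used below.
class MiniStream {
  pos = 0;
  start = 0;
  constructor(public readonly string: string) {}

  eol(): boolean {
    return this.pos >= this.string.length;
  }

  next(): string | undefined {
    return this.pos < this.string.length ? this.string.charAt(this.pos++) : undefined;
  }

  // Match a ^-anchored pattern at the current position, consuming it on success.
  match(pattern: RegExp): RegExpMatchArray | null {
    const m = this.string.slice(this.pos).match(pattern);
    if (!m || m.index !== 0) return null;
    this.pos += m[0].length;
    return m;
  }
}

// Hypothetical overlay: style `{role}` markers and leave everything else to the base mode.
const roleOverlay = {
  token(stream: MiniStream): string | null {
    if (stream.match(/^\{[a-zA-Z][\w:-]*\}/)) return "myst-role";
    // Not a role: consume at least one character, then skip to the next "{" candidate.
    stream.next();
    while (!stream.eol() && stream.string.charAt(stream.pos) !== "{") stream.next();
    return null;
  },
};

// Drive the overlay across a single line, the way an editor would.
function tokenizeLine(line: string): Array<{ text: string; style: string | null }> {
  const stream = new MiniStream(line);
  const out: Array<{ text: string; style: string | null }> = [];
  while (!stream.eol()) {
    stream.start = stream.pos;
    const style = roleOverlay.token(stream);
    out.push({ text: line.slice(stream.start, stream.pos), style });
  }
  return out;
}
```

In a real editor the same object could presumably be passed to `cm.addOverlay(...)`, or combined with a base mode via `CodeMirror.overlayMode(...)` from the overlay addon. Whether overlays compose cleanly with the existing `ipythongfm` mode is untested here.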
I had a brief chat with the EBP people, and spent a little bit of time looking into the feasibility of this. AFAICT, with the current markdown-it + CM5 approach, each plugin will need to write two different tokenizers, one for markdown-it and one for a CM mode.
The Double-Implementation Problem
This doesn't sit hugely well with me - it seems crazy that we do effectively the same work twice. The simplest solution here is to use a Markdown library that does include position information, and fit it into a CM Mode. There would be some challenges here:
I mentioned Lezer - CM6 standardises language information around a concrete syntax tree, which can either be generated by Lezer's LR runtime, or by another parser that produces the same structures. The summary here is that CodeMirror (5 and 6) really needs an incremental parser, both wrt. performance and API-matching. So, if one wanted to re-use the parser for Markdown rendering and highlighting, then really we want to satisfy that. I think CM6 has a nicer API here: instead of feeding the parser line-by-line and requiring the formatting each time, CM6 wants the entire parse tree, but can later call in to reparse only a subset. I don't know CM5 well enough to be sure, but I suspect that we would have to handle parse-tree invalidation inside the CM mode ourselves. Relatedly, there is discussion about how to move beyond TextMate grammars for VSCode.
Wider Issues
Before I thought about the re-parse cost, I was going to suggest something radical - having spoken to the EBP team made me think about the fact that we have two separate implementations of markdown-it and the plugin ecosystem; one in Python, and one in JS. We could think-ahead and use a Rust/WASM base for markdown parsing, which could then be used by the Python tools too. I think this is where the space is probably headed, but it's a lot of work. Additionally, LSP markdown-support is something we've talked about, and being able to share some of the implementation here would also be nice too.
Conclusions
I think there are two separate issues now being discussed in this post:
I am just not familiar enough yet with the problem space to know what the best long-term solution is. If incremental parsing is viable for rendered markup, then it sounds like the best approach - it will also reduce our repaint times (although the DOM/VDOM is ultimately going to be the bottleneck I suspect). However, we would need a second pass IIRC to handle things like link validation which aren't possible in a single forward-pass. I've seen a few ideas here: The only WASM-friendly option is the last one. Toastmark extends commonmark.js to add enough information to be able to build an AST. It seems like they rely on being able to use contextual information to move back to a CST:
I think Toastmark is avoiding the CM mode API by instead using marks. Maybe that would be a good interim, because the Mode API only handles highlighting and indentation (i.e. not folding). Additionally, rendering / analysis tools might want more than the CST - an AST would be much easier to render. This would warrant a second pass. I am considering whether it's better to take a longer-term view of the solution here. Rather than investing time into getting highlighting working with CM5 Modes + markdown-it (and writing everything twice), maybe actually moving to lezer (or at least generating a lezer CST) would be a good thing™ in the long run? By dropping markdown-it we would immediately lose the entire ecosystem, which would not be ideal. However, the core Markdown extensions that make it worth using are not too complex. Maybe a community-wide effort here would be sufficient to keep things ticking over? I can't see a way that we can have our cake and eat it unless we make some bold decisions regarding the future plans here :/ Useful links / recap:
|
@agoose77 something I think you mentioned to me, that you don't mention here, is https://github.com/syntax-tree/mdast (based on https://github.com/syntax-tree/unist), which is basically a nice language agnostic (JSONable) and extensible AST format for Markdown (and also includes line/column source mapping). I'm not sure how this would fit in with the incremental parsing (and lezer etc), I'm thinking to write the MyST spec basically as an extension on mdast and then, in principle 😬, you can just use any parser, renderer, LSP that supports it |
Yeah: the balance between future architectural correctness and getting software into people's hands is elusive.
In the near term...
We probably need to continue making the most pragmatic choices such that we can ship software that folk can use, today, with other tools they like. So for now: we have to deal with in CM(5). That first step might be a new So, the really messy option today to make it possible would be maybe some kinda middleware junk:
export interface IMarkdownModeOpts {
  modes: {[key: string]: any} // initially, gfm, tex
  multiplexingModes: any[];
  config: CodeMirror.EditorConfiguration; // the runtime ones
  modeOptions?: any; // the runtime ones
}
export interface IPluginProvider {
  // ...
  syntaxExtension: (options: IMarkdownModeOpts) => IMarkdownModeOpts;
}
...and then we stack everything up when a mode is requested.
But longer term...
I am a pretty big proponent of WASM. Seems like an appropriate thing for a rendering engine, but feels overkill for "just" syntax highlighting. Indeed, we had to deploy some wasm for jupyterlab-simple-syntax because textMate highlighting bundles use a flavor of regex that is... non-trivial. Felt icky. But for full LSP-grade analysis, as has been noted in this thread and elsewhere (e.g. sync scrolling)... yeah, might as well get your syntax highlighting in the same parse. Moving outside of the text editing/rendering experience: the jupyterlite experiment has been great, showing that a (mostly) familiar interactive computing experience (pyolite on pyodide on emscripten) is workable... but as a new platform, we're still limited in many ways. I'd say stay tuned in 2022 for more composable stuff that people can plug into... I think WASM is going to be the bottom of a next-level version of reproducible, interactive computing, and jupyter is well suited to be a banner under which it gets into users' hands. I doubt the next generation of users will think so much about what language a particular function is implemented in, and whether code is being run in-loop in the browser per keystroke or being executed in a massively parallel HPC setting. Things like WASM Types, extended to work with Arrow, and wrapped with metadata like real SI units, will make doing real science pretty awesome. |
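One way the "stack everything up" step could work is a plain fold over the registered providers, each one receiving the options produced by the previous one. This is only a sketch under assumptions: `config` is typed as a loose record instead of `CodeMirror.EditorConfiguration` so that it runs standalone, and the `id` field, the optional `syntaxExtension`, and the two example providers are hypothetical.

```typescript
// Stand-in types: the real `config` would be CodeMirror.EditorConfiguration.
interface IMarkdownModeOpts {
  modes: { [key: string]: unknown }; // initially, gfm, tex
  multiplexingModes: unknown[];
  config: { [key: string]: unknown };
  modeOptions?: unknown;
}

interface IPluginProvider {
  id: string; // hypothetical, only used to tell providers apart
  syntaxExtension?: (options: IMarkdownModeOpts) => IMarkdownModeOpts;
}

// Fold the options through every provider, in registration order.
// Providers without a syntaxExtension pass the options through untouched.
function stackModeOptions(base: IMarkdownModeOpts, providers: IPluginProvider[]): IMarkdownModeOpts {
  return providers.reduce(
    (opts, provider) => (provider.syntaxExtension ? provider.syntaxExtension(opts) : opts),
    base
  );
}

const baseOpts: IMarkdownModeOpts = {
  modes: { gfm: {}, tex: {} },
  multiplexingModes: [],
  config: {},
};

// Hypothetical providers: each returns a new object rather than mutating its input.
const footnoteProvider: IPluginProvider = {
  id: "footnote",
  syntaxExtension: (opts) => ({ ...opts, modes: { ...opts.modes, footnote: {} } }),
};

const diagramProvider: IPluginProvider = {
  id: "diagrams",
  syntaxExtension: (opts) => ({
    ...opts,
    multiplexingModes: [...opts.multiplexingModes, { name: "mermaid" }],
  }),
};

const stackedOpts = stackModeOptions(baseOpts, [footnoteProvider, diagramProvider, { id: "noop" }]);
```

Keeping each `syntaxExtension` pure means a misbehaving plugin can be dropped from the list without corrupting the shared defaults. Ordering conflicts between plugins would still need a policy.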
Thanks @bollwyvl, is there any good reading on wasm? It's not something I've looked into much yet. |
Yes, my thoughts on this topic are motivated by the wider landscape of who is using jupyterlab-markup, and who needs markdown rendering more generally. I just don't like the fact that if I want to implement extensions to commonmark that support syntax highlighting + executable books, I'd have to write the same parser/lexer three times! With respect to solving "delivering solutions now", I am currently in favour of not using the CM5 Mode API, and instead relying solely on the Marks API. I think that is workable, and if so it would allow us to get started on using a high-granularity parser today. The common problem that we all have is generating a document-aware syntax tree. Whether that is a CST or AST is less important. If we could standardise the parsing of Markdown for "commonmark extensions", then LSP + EB + Jupyter would all get that for free. The rendering again could be shared between EB + Jupyter. I don't know how VSCode would fit into this w.r.t rendering - they seem currently reluctant to expose the Markdown renderer itself as an extension point. Maybe it wouldn't be so bad to add another editor, which seems to be what they recommend.
WASM & Rust (which can compile to WASM) are both accessible from Python. This means it is possible to write the implementation once, and re-use it in Python + JS. Of course, this would mean writing code in the common denominator language, e.g. Rust. One could do this in Python as Python can be compiled to WASM, but right now that involves a lot of work & bloat as @bollwyvl alludes to. |
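For the Marks-based route, the parser only needs to hand back source offsets. Turning those into the `{line, ch}` positions that CodeMirror 5's `markText` expects is mechanical. The sketch below assumes a hypothetical `SourceToken` shape (offsets into the whole document, end exclusive) and hypothetical class names. It stops short of calling the editor, so it runs without CodeMirror.

```typescript
interface SourceToken { start: number; end: number; className: string; } // end is exclusive
interface Pos { line: number; ch: number; }
interface MarkSpec { from: Pos; to: Pos; className: string; }

// Offsets at which each line of the document begins.
function lineStarts(text: string): number[] {
  const starts = [0];
  for (let i = 0; i < text.length; i++) {
    if (text.charAt(i) === "\n") starts.push(i + 1);
  }
  return starts;
}

// Binary search for the last line start that is <= offset.
function offsetToPos(starts: number[], offset: number): Pos {
  let lo = 0;
  let hi = starts.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= offset) lo = mid;
    else hi = mid - 1;
  }
  return { line: lo, ch: offset - starts[lo] };
}

// Convert parser tokens into the arguments one would pass to markText.
function toMarks(text: string, tokens: SourceToken[]): MarkSpec[] {
  const starts = lineStarts(text);
  return tokens.map((t) => ({
    from: offsetToPos(starts, t.start),
    to: offsetToPos(starts, t.end),
    className: t.className,
  }));
}

const exampleText = "# Title\nsome *em* text";
const exampleMarks = toMarks(exampleText, [
  { start: 0, end: 7, className: "md-heading" }, // "# Title"
  { start: 13, end: 17, className: "md-em" },    // "*em*"
]);
```

Each resulting entry would then be applied with something like `cm.markText(from, to, { className })`. With a live editor, `posFromIndex` does the offset conversion instead. Clearing and re-applying marks on every edit is the part that would need care for large documents.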
There is also the benefit of standardising the AST for existing tools: the ToC extension IIRC parses the Markdown to identify headings in notebooks / Markdown documents. Having a generated AST / being able to request the AST would mean:
Another benefit is prosemirror integration: I imagine that is a lot easier if you're working at the level of an AST. |
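To make the ToC point concrete: given an mdast-style tree with unist `position` data (as mentioned earlier in the thread), collecting headings is a short tree walk rather than a second ad-hoc Markdown parse. The node interface below is a hand-written subset of the mdast shape, and `TocEntry`, `collectHeadings` and the example tree are illustrative names, not an existing API.

```typescript
// Hand-written subset of the unist/mdast node shape.
interface Point { line: number; column: number; offset?: number; }
interface MdNode {
  type: string;
  depth?: number;       // present on "heading" nodes
  value?: string;       // present on literal nodes such as "text" and "inlineCode"
  children?: MdNode[];
  position?: { start: Point; end: Point };
}
interface TocEntry { depth: number; text: string; line: number | null; }

// Concatenate the literal text beneath a node.
function plainText(node: MdNode): string {
  if (typeof node.value === "string") return node.value;
  return (node.children ?? []).map(plainText).join("");
}

// Depth-first walk collecting every heading in document order.
function collectHeadings(node: MdNode, out: TocEntry[] = []): TocEntry[] {
  if (node.type === "heading" && typeof node.depth === "number") {
    out.push({
      depth: node.depth,
      text: plainText(node),
      line: node.position?.start.line ?? null,
    });
  }
  for (const child of node.children ?? []) collectHeadings(child, out);
  return out;
}

// Tree for: "# Intro", a paragraph, then "## Using `lezer`" on line 5.
const exampleTree: MdNode = {
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [{ type: "text", value: "Intro" }],
      position: { start: { line: 1, column: 1 }, end: { line: 1, column: 8 } },
    },
    { type: "paragraph", children: [{ type: "text", value: "Some prose." }] },
    {
      type: "heading",
      depth: 2,
      children: [
        { type: "text", value: "Using " },
        { type: "inlineCode", value: "lezer" },
      ],
      position: { start: { line: 5, column: 1 }, end: { line: 5, column: 17 } },
    },
  ],
};

const toc = collectHeadings(exampleTree);
```

The same walk would work on the output of any parser that emits this shape, which is the appeal of standardising on the tree rather than on a particular implementation.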
Well... part of that comes from tools hand-writing parser/lexers in an implementation language in the first place. But markdown is a crazy mess to parse properly, even before adding extensibility. But if starting over... rather than jumping straight to PARSER IN RUST NOW, at least taking a cursory look at a portable stack like antlr or lark, which focus effort on writing declarative specifications and then generating implementations, might be worthwhile. Indeed: Jupyter would really benefit from a declarative (preferably, JSON-compatible) way for e.g. kernels to describe their language grammars (especially dynamic deviances a la jupyter-lsp/jupyterlab-lsp#191). Briefly, on jupyter-lsp: despite its warts, for the larger code editing mission, we can't afford to lose what CM5 already represents to the community. We are excited to get our hands on CM6. Maybe I'll warm
Here's a high level site, some specs (including the forthcoming types) as well as some nuts-and-bolts blog posts, like asciinema, and some position pieces.
WASM is a target for a number of compiled languages, now: c, rust, erlang, go, haskell, etc. There are some higher-level languages, such as the typescript-like assemblyscript. Initially, this grew out of the corpus of tricks in asm.js, and was to enable reasonably performant in-browser execution of otherwise-opaque software: in 2022, it's not much of a stretch to say it's easier to run a lot of things in the browser than natively (and well) on windows. More recently, it is proving interesting as a non-browser technology due to its sandboxing: or even more weirdly, firefox will soon be shipping some vendored stuff compiled from C to WASM, and then back into C!
In JupyterLite, which only cares about (real) browsers, we're using pyodide to deliver the IPython/ipykernel stack, including ipywidgets. Most packages run unmodified! But the biggest win is that you can deploy certain interactive experience to, theoretically, millions of simultaneous users (willing to maybe download ~100mb of python to their browsers 🤣) with just a free/low cost static web host and a CDN.
Pyodide is basically a CPython distribution, and has a conda-like build chain, to get up a Linux-like system with numpy/pandas with emscripten. Unfortunately, its build chain is just conda-like... there's some work starting soon to see if this can actually be conda(-forge) so that we can start getting automated updates of thousands of packages, instead of one every pyodide release to update/add libraries. However: the ticket to get in the door for that python integration is ~20mb, per kernel. As such, we have been pushing back against using any python wasm as part of the "web server" that runs in the browser, instead re-implementing key parts of Meanwhile... On the "server" there are a number of standalone runtimes, such as wasmer and wasmtime, as well as things that are shooting for even greater security such as enarx. Wasmer, in particular, has many language-specific bindings, such as wasmer-python. The win here for jupyter-adjacent projects would be to not be chasing the moving target of python ABI complexity per-platform-per-python-per-wheel, and just be able to ship a single WASM blob that would execute anywhere, including the browser, but enjoy a performance profile closer (by order of magnitude) to C-level code than python code. |
I'm not sure it's even theoretically possible to parse CommonMark as a context-free grammar? (See e.g. https://roopc.net/posts/2014/markdown-cfg/). Let alone with any syntax extensions |
Cheers, will check it out! |
right, I'll grant that even "old high markdown" is basically the social media engrish of markup languages. But there are grammars and then there are grammars. for syntax highlighting, especially in a narrative language, it just needs to be good enough and fast enough and be really good at handling broken state. Indeed, having a lenient grammar with terminals like And there's no helping things like footnote-style markdown refs. But even something block level would be a fairly big step up for portability, especially for the case of embedding multiple syntax modes inside other syntaxes.
in lark, at least, one can make extensible grammars... but if that particular feature isn't portable to other implementations it wouldn't be as much fun. and i would not wish runtime antlr generation on anyone! |
Right, from the reading that I've done (given that I've not had time to look at it myself yet), writing a formal grammar for Markdown is a very difficult challenge. The author of this link makes a few other comments elsewhere, and essentially their argument is:
There are definitely a number of different concerns/priorities in this thread. As I see them, we have:
Maybe some of these concerns do not need to be solved any time soon. But, if we allow ourselves the opportunity to consider them, we have:
I did note that roopc implemented a Markdown specification for a modified Markdown. However, once you start having variations (plugins) on this specification, it would be difficult to resolve how ambiguities should be handled. The easiest and most robust solution that I can see to that is to just have a canonical implementation and decree that that is the right way to parse it. My gut feeling is that the best direction for Jupyter projects as a whole is to:
We are already doing this in part with EB + jupyterlab-markup: both use Markdown-it / markdown-it ports, and (assuming conformance) that is more consistency than the range of Markdown renderers in use by different platforms (JupyterLab/notebook with Markedjs, colab? kaggle? GitHub renderer?). If we don't consider an implementation-defined spec, then the next best thing is a big test suite defining implementation. |
That's a lot of stuff, and sorry for encouraging wandering off down the wasm path. I look forward to a future where a user-driven set of choices are documented and honored by the tools (a la #13) but feel like "conformance" is a very big word to use in this use case, and definitely out of scope of a PR that answers the title/description of this issue. Basically, after said PR was merged, installing a future 1.x release of this extension would extend the existing JupyterLab 3.x editing experience to highlight some of the new syntax it supports rendering e.g. mermaid (now part of GFM), without breaking the experience provided by other extension authors (e.g. LSP, modes from other languages, collaborative editing with presence). Ideally, this would be managed in a way that downstreams of this plugin could also add additional features... but maybe #40 would demand this anyway. Even having gross block-level modes, as supported by the existing cm5+ipythongfm, would be sufficient for today's notebook markdown cell editing experience and markdown documents of reasonable size, like a project's README, and whole (jupyter) books are again a whole other beast. |
Just as an additional point of reference, you also now have https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_semanticTokens I guess this is similar to overlays, in that it is not intended to provide the full highlighting just enhance it. |
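For reference, the semantic tokens response packs each token into five integers: line delta, start-character delta (relative to the previous token only when on the same line), length, token type index, and a modifier bitset. A minimal encoder might look like the sketch below. The `AbsToken` shape and the legend contents are assumptions chosen to mirror the worked example in the LSP specification.

```typescript
interface AbsToken {
  line: number;
  startChar: number;
  length: number;
  tokenType: string;
  modifiers?: string[];
}

interface Legend { tokenTypes: string[]; tokenModifiers: string[]; }

// Encode absolute token positions into the relative 5-integers-per-token form.
function encodeSemanticTokens(tokens: AbsToken[], legend: Legend): number[] {
  // Tokens must be emitted in document order for the deltas to make sense.
  const sorted = [...tokens].sort((a, b) => a.line - b.line || a.startChar - b.startChar);
  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const t of sorted) {
    const typeIndex = legend.tokenTypes.indexOf(t.tokenType);
    if (typeIndex < 0) throw new Error("unknown token type: " + t.tokenType);
    let bits = 0;
    for (const m of t.modifiers ?? []) {
      const i = legend.tokenModifiers.indexOf(m);
      if (i < 0) throw new Error("unknown token modifier: " + m);
      bits |= 1 << i; // each modifier occupies one bit, by legend index
    }
    const deltaLine = t.line - prevLine;
    // The start is relative to the previous token only when both share a line.
    const deltaStart = deltaLine === 0 ? t.startChar - prevChar : t.startChar;
    data.push(deltaLine, deltaStart, t.length, typeIndex, bits);
    prevLine = t.line;
    prevChar = t.startChar;
  }
  return data;
}

const exampleLegend: Legend = {
  tokenTypes: ["property", "type", "class"],
  tokenModifiers: ["private", "static"],
};

const encoded = encodeSemanticTokens(
  [
    { line: 2, startChar: 5, length: 3, tokenType: "property", modifiers: ["private", "static"] },
    { line: 2, startChar: 10, length: 4, tokenType: "type" },
    { line: 5, startChar: 2, length: 7, tokenType: "class" },
  ],
  exampleLegend
);
```

Because the format carries only ranges and type indices, a Markdown plugin could in principle feed it from the same token stream it uses for rendering. Whether any Jupyter-side client would consume it for Markdown cells is an open question in this thread.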
Markdown/CommonMark are hard to syntax highlight anyway, and a number of the plugins in #10 change the syntax in a way that is not covered by the existing ipythongfm mode. CodeMirror has been a sticky wicket in supporting Lab3 #11 as CodeMirror's approach doesn't
The diagrams modes like the ...and... syntaxes would be covered by new modes, as existing mode already defers fenced blocks, but other things like footnote and deflist would need a fair amount of massaging.
Generally, there should be a mechanism for a plugin to confidently add (and test) new syntax highlighting features. Hopefully this wouldn't mean rewriting the mode, but who knows!