# Support formatting jupyter notebooks #1218
Comments
I assume Ruff would work with nbQA, but I too would like it to support Jupyter out-of-the-box. Looking into it...
Ruff was actually integrated into nbQA in the latest release, so you can now run (e.g.):
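A typical invocation via nbQA looks like the following (the notebook path is illustrative):

```console
$ pip install -U nbqa ruff
$ nbqa ruff my_notebook.ipynb
```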
(Would still like to have a first-party integration at some point.)
See a parallel effort over in the Ruff VS Code extension, where LSP support for analyzing entire Jupyter notebooks seems blocked by an upstream issue. It would be nice to have native support in Ruff as well! But in the short run, it looks like LSP support may face less resistance.
The current implementation can lint Jupyter files, but it can't apply fixes or write them back to the file yet. I've written instructions in the top comment and can mentor anyone who wants to tackle this.
@konstin I could take this on
@sladyn98 - Sorry for the churn, I already chatted with @dhruvmanila about taking this one on!
@charliermarsh you can assign this to me to avoid any future confusion :)
## Summary

Add support for applying auto-fixes in Jupyter Notebooks.

### Solution

Cell offsets are the boundaries for each cell in the concatenated source code. They are represented using `TextSize` and include both the start and end offsets, thus forming a range for each cell. These offsets are updated using the `SourceMap` markers.

### SourceMap

`SourceMap` contains markers constructed from each edit, which track the original source code position against the transformed position. The following drawing might make this clearer:

![SourceMap visualization](https://github.com/astral-sh/ruff/assets/67177269/3c94e591-70a7-4b57-bd32-0baa91cc7858)

The center column, where the dotted lines are, represents the markers included in the `SourceMap`. The `Notebook` looks at these markers and updates the cell offsets after each linter loop. If you look closely, each destination takes into account all of the markers before it.

The index is constructed only when required, as it's only used to render the diagnostics, so a `OnceCell` is used for this purpose. The cell offsets, the cell content, and the index are updated after each iteration of linting, in that order. The order is important here: the content is updated according to the new offsets, and the index is updated according to the new content.

## Limitations

### 1

Styling rules such as the ones in `pycodestyle` will not be applicable everywhere in a Jupyter notebook, especially at the cell boundaries. Take an example where a rule suggests having 2 blank lines before a function and the cells contain the following code:

```python
import something
# ---
def first():
    pass


def second():
    pass
```

(The comment is only there to visualize the cell boundary.)

In the concatenated source code, the 2 blank lines will be added, but they shouldn't actually be added when viewed in terms of the Jupyter notebook: it's as if the function `first` were at the start of a file.
`nbqa` solves this by recording the newlines before running `autopep8`, then running the tool and restoring the newlines at the end (refer to nbQA-dev/nbQA#807).

## Test Plan

Three commands were run in order with common flags (`--select=ALL --no-cache --isolated`) to isolate the stage at which a problem occurs:

1. Diagnostics only
2. Fix with diff (`--fix --diff`)
3. Fix (`--fix`)

### https://github.com/facebookresearch/segment-anything

```
-------------------------------------------------------------------------------
 Jupyter Notebooks        3        0        0        0        0
 |- Markdown              3       98        0       94        4
 |- Python                3      513      468        4       41
 (Total)                         611      468       98       45
-------------------------------------------------------------------------------
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/segment-anything/**/*.ipynb --fix
...
Found 180 errors (89 fixed, 91 remaining).
```

### https://github.com/openai/openai-cookbook

```
-------------------------------------------------------------------------------
 Jupyter Notebooks       65        0        0        0        0
 |- Markdown             64     3475       12     2507      956
 |- Python               65     9700     7362     1101     1237
 (Total)                       13175     7374     3608     2193
===============================================================================
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/openai-cookbook/**/*.ipynb --fix
error: Failed to parse /path/to/openai-cookbook/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb:cell 4:29:18: unexpected token '-'
...
Found 4227 errors (2165 fixed, 2062 remaining).
```

### https://github.com/tensorflow/docs

```
-------------------------------------------------------------------------------
 Jupyter Notebooks      150        0        0        0        0
 |- Markdown              1       55        0       46        9
 |- Python                1      402      289       60       53
 (Total)                         457      289      106       62
-------------------------------------------------------------------------------
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/tensorflow-docs/**/*.ipynb --fix
error: Failed to parse /path/to/tensorflow-docs/site/en/guide/extension_type.ipynb:cell 80:1:1: unexpected token Indent
error: Failed to parse /path/to/tensorflow-docs/site/en/r1/tutorials/eager/custom_layers.ipynb:cell 20:1:1: unexpected token Indent
error: Failed to parse /path/to/tensorflow-docs/site/en/guide/data.ipynb:cell 175:5:14: unindent does not match any outer indentation level
error: Failed to parse /path/to/tensorflow-docs/site/en/r1/tutorials/representation/unicode.ipynb:cell 30:1:1: unexpected token Indent
...
Found 12726 errors (5140 fixed, 7586 remaining).
```

### https://github.com/tensorflow/models

```
-------------------------------------------------------------------------------
 Jupyter Notebooks       46        0        0        0        0
 |- Markdown              1       11        0        6        5
 |- Python                1      328      249       19       60
 (Total)                         339      249       25       65
-------------------------------------------------------------------------------
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/tensorflow-models/**/*.ipynb --fix
...
Found 4856 errors (2690 fixed, 2166 remaining).
```

resolves: #1218
fixes: #4556
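The offset bookkeeping described under "SourceMap" above can be sketched in Python. This is a hypothetical model, not Ruff's actual Rust implementation; the function name and the `(source, dest)` marker layout are invented for illustration:

```python
# Hypothetical sketch of SourceMap-style markers shifting cell offsets.
# Each marker maps an offset in the original concatenated source to its
# offset in the transformed source. A cell offset is shifted by the delta
# of the last marker at or before it, so each destination reflects all of
# the markers that precede it.

def update_cell_offsets(offsets: list[int], markers: list[tuple[int, int]]) -> list[int]:
    """offsets: sorted cell boundaries; markers: (source, dest) pairs sorted by source."""
    updated = []
    for offset in offsets:
        delta = 0
        for source, dest in markers:
            if source <= offset:
                delta = dest - source  # later markers supersede earlier ones
            else:
                break  # markers are sorted, so no further marker applies
        updated.append(offset + delta)
    return updated

# Two cells span [0, 10) and [10, 25). An edit removed 3 characters at
# offset 4, so every boundary after the edit moves back by 3.
print(update_cell_offsets([0, 10, 25], [(4, 4), (7, 4)]))  # prints [0, 7, 22]
```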
Sorry, not yet completed.
What's the outstanding work here? Testing?
## Summary

Add round-trip support for Jupyter notebooks:

1. Read the notebook
2. Extract the source code content
3. Use it to update the notebook itself (should be exactly the same [^1])
4. Serialize into JSON and print it to stdout

## Test Plan

`cargo run --all-features --bin ruff_dev --package ruff_dev -- round-trip <path/to/notebook.ipynb>`

<details><summary>Example output:</summary>
<p>

```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "f3c286e9-fa52-4440-816f-4449232f199a",
      "metadata": {},
      "source": ["# Ruff Test"]
    },
    {
      "cell_type": "markdown",
      "id": "a2b7bc6c-778a-4b07-86ae-dde5a2d9511e",
      "metadata": {},
      "source": ["Markdown block before the first import"]
    },
    {
      "cell_type": "code",
      "id": "5e3ef98e-224c-450a-80e6-be442ad50907",
      "metadata": { "tags": [] },
      "source": "",
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "code",
      "id": "6bced3f8-e0a4-450c-ae7c-f60ad5671ee9",
      "metadata": {},
      "source": "import contextlib\n\nwith contextlib.suppress(ValueError):\n    print()\n",
      "outputs": []
    },
    {
      "cell_type": "code",
      "id": "d7102cfd-5bb5-4f5b-a3b8-07a7b8cca34c",
      "metadata": {},
      "source": "import random\n\nrandom.randint(10, 20)",
      "outputs": []
    },
    {
      "cell_type": "code",
      "id": "88471d1c-7429-4967-898f-b0088fcb4c53",
      "metadata": {},
      "source": "foo = 1\nif foo < 2:\n    msg = f\"Invalid foo: {foo}\"\n    raise ValueError(msg)",
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python (ruff-playground)",
      "name": "ruff-playground",
      "language": "python"
    },
    "language_info": {
      "codemirror_mode": { "name": "ipython", "version": 3 },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "version": "3.11.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
```

</p>
</details>

[^1]: The type in JSON might be different (#4665 (comment))

Part of #1218
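The four steps above can be sketched in Python (Ruff's actual implementation is in Rust; the helper name here is invented for illustration): read the notebook JSON, extract each cell's source, write the same source back, and re-serialize. A faithful round-trip leaves the content unchanged.

```python
import json

def round_trip(notebook_json: str) -> str:
    nb = json.loads(notebook_json)          # step 1: read the notebook
    for cell in nb.get("cells", []):
        source = cell["source"]             # step 2: extract the source
        cell["source"] = source             # step 3: update the cell with it
    return json.dumps(nb, indent=1)         # step 4: serialize back to JSON

nb = '{"cells": [{"cell_type": "code", "source": "print(1)"}], "nbformat": 4}'
print(json.loads(round_trip(nb)) == json.loads(nb))  # prints True
```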
## Summary

Ability to perform integration tests on Jupyter notebooks.

Part of #1218

## Test Plan

`cargo test`
Meta issue: #5188
Black supports formatting Jupyter notebooks with the `jupyter` extra (`black[jupyter]`). If installed like this, it formats Jupyter notebooks just like Python files. It would be nice if Ruff could similarly lint and fix Jupyter notebooks.

Steps to implement this:

1. Parse notebooks as `serde_json::Value` (see also the Black implementation).
2. Add a `jupyter_notebook` feature and include `.ipynb` files in Ruff by default.
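The parsing step above (treating the notebook as untyped JSON, analogous to `serde_json::Value`) can be sketched in Python; the function name is invented for illustration:

```python
# Hedged sketch: Python's json module plays the role serde_json::Value would
# in Ruff. Parse the notebook as untyped JSON and collect only the code
# cells, which are the parts a linter would check.
import json

def code_cell_sources(notebook_json: str) -> list[str]:
    nb = json.loads(notebook_json)
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell["source"]
            # nbformat allows either a string or a list of lines here.
            sources.append(src if isinstance(src, str) else "".join(src))
    return sources

nb = '{"cells": [{"cell_type": "markdown", "source": ["# hi"]}, {"cell_type": "code", "source": ["x = 1\\n"]}]}'
print(code_cell_sources(nb))  # prints ['x = 1\n']
```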