
WIP: base64 encoding for numpy arrays #2943

Closed
wants to merge 6 commits

Conversation

@jonmmease
Contributor

@jonmmease commented Nov 29, 2020

Overview

This is a WIP PR to support encoding n-dimensional numpy arrays using a base64 encoding convention rather than as lists during JSON serialization.

This requires a corresponding plotly.js WIP PR at plotly/plotly.js#5230, but I've committed a plotly.min.js bundle from that PR so that folks can test this branch without building plotly.js.

Activation with future flag

In order to trigger base64 encoding, you need to import the b64_encoding future flag before importing plotly, e.g.

from _plotly_future_ import b64_encoding
import plotly.express as px

Encoding Format

When activated, numpy arrays will be serialized to JSON objects with dtype, shape, and bvals keys. dtype is a numpy-compatible dtype string, and shape can be a scalar integer (for a 1-d array) or a list of N integers (for an N-d array). bvals is a base64-encoded string representing the underlying binary array buffer.

Note that N-d arrays are arranged in row-major ordering (like C and Python), not column-major (like Fortran and MATLAB).

Here is an example of how the array np.arange(3, dtype="int16") is encoded:

{
  "bvals": "AAABAAIA",
  "dtype": "int16",
  "shape": [3]
}
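For reference, this convention can be reproduced with numpy and the standard-library base64 module. The encode_array helper below is hypothetical, a minimal sketch of the format rather than the PR's actual serialization code, and it assumes native little-endian byte order:

import base64
import numpy as np

def encode_array(arr):
    # Ensure a C-contiguous (row-major) buffer before encoding
    arr = np.ascontiguousarray(arr)
    return {
        "bvals": base64.b64encode(arr.tobytes()).decode("ascii"),  # base64 of the raw buffer
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }

print(encode_array(np.arange(3, dtype="int16")))
# {'bvals': 'AAABAAIA', 'dtype': 'int16', 'shape': [3]}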

When this new branch of plotly.js encounters this, it will decode the base64 string into a contiguous ArrayBuffer and wrap it in an Int16Array typed array with value new Int16Array([0, 1, 2]).

Multi-dimensional arrays are also supported. For example, np.arange(6, dtype="float32").reshape(2, 3) will be encoded into:

{
  "bvals": "AAAAAAAAgD8AAABAAABAQAAAgEAAAKBA",
  "dtype": "float32",
  "shape": [2, 3]
}

Plotly.js will decode this into a contiguous ArrayBuffer, then create two Float32Array instances as views onto this ArrayBuffer (no copy is performed here) and nest them inside a regular Array. The final decoded value will be equal to:

[new Float32Array([0, 1, 2]), new Float32Array([3, 4, 5])]
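As a sanity check of the row-major layout, the same 2x3 payload can be decoded back on the Python side with numpy; this is an illustrative round trip only, independent of the plotly.js decoder:

import base64
import numpy as np

spec = {
    "bvals": "AAAAAAAAgD8AAABAAABAQAAAgEAAAKBA",
    "dtype": "float32",
    "shape": [2, 3],
}

buf = base64.b64decode(spec["bvals"])
arr = np.frombuffer(buf, dtype=spec["dtype"]).reshape(spec["shape"])
print(arr)
# [[0. 1. 2.]
#  [3. 4. 5.]]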

Limitations

int64 and uint64 arrays are not supported (they will still be converted to lists). This is because there are no native Int64Array and Uint64Array typed arrays in JavaScript. An interesting side note: the reason for these omissions is that, under the hood, JavaScript represents all numbers as 64-bit floats, so it can only faithfully represent integers up to 2^53 - 1.
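If your data happens to be int64 (numpy's default integer dtype on most platforms) and fits in 32 bits, a hypothetical workaround is to downcast before building the figure so that the base64 path is used; this sketch assumes int32 is wide enough for the values involved:

from _plotly_future_ import b64_encoding  # must come before importing plotly
import numpy as np
import plotly.express as px

x = np.arange(100_000)       # int64 by default on most platforms -> would fall back to a list
y = np.random.rand(100_000)  # float64

# Downcast: int32 and float32 arrays take the base64 encoding path
fig = px.scatter(x=x.astype(np.int32), y=y.astype(np.float32))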

Call for help with QA

I've interactively played around with scatter, scattergl, scatter3d, heatmap, image, isosurface, and volume traces and I think most things are working there. But there are surely things that aren't currently working, so I'd appreciate any early QA help that folks can pitch in with.

In particular, we should check that all properties that support arrays work properly with these base64 encoded array specifications. Feel free to add comments to this PR with any issues you run into. Remember to enable the feature with the future flag as described above!

Call for help with benchmarking

I haven't done any rigorous benchmarking yet, but early notebook %timeit results look promising. For figures that involve arrays with tens of thousands to a couple of million elements, I've been seeing JSON encoding speedups of ~2x to ~20x. This measures the fig.to_json() method in isolation.

It would also be nice to measure whether there is a speedup on the JavaScript side, from the start of the decoding process to the display of a figure, but I don't know offhand how to measure this. Any help here would be greatly appreciated.

One thing to note is that, with this paradigm, there are substantial encoding-time and space savings from lowering the precision of arrays. For a lot of plots, dropping down to float32 or float16 won't change the appearance and will encode substantially faster than float64.

I would especially appreciate JSON encoding benchmarks on figures that folks have built for real-world use cases. And if there are performance regressions anywhere, it would be good to be aware of them.
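For anyone who wants to pitch in, a minimal benchmarking sketch along these lines should work; the trace type, array size, and dtypes here are arbitrary choices, and the flag import can be commented out to time the list-based encoding for comparison:

from _plotly_future_ import b64_encoding  # comment out to benchmark the list-based encoding
import timeit
import numpy as np
import plotly.graph_objects as go

n = 1_000_000
y64 = np.random.rand(n)       # float64
y32 = y64.astype(np.float32)  # lower precision -> smaller payload, faster encoding

for label, y in [("float64", y64), ("float32", y32)]:
    fig = go.Figure(go.Scattergl(y=y))
    t = timeit.timeit(fig.to_json, number=5) / 5
    print(f"{label}: {t * 1000:.1f} ms per fig.to_json()")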

Where rendering should work

After activating the future flag, this branch can be tested in these contexts:

  • classic notebook
  • standalone html files (fig.write_html)
  • Kaleido image export (fig.write_image(engine="kaleido"))

It will not work (yet) in JupyterLab, VS Code, or nteract because they supply their own version of plotly.js instead of using the one bundled with plotly.py.
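For example, a quick sketch that exercises the standalone HTML and Kaleido paths listed above (the file names are arbitrary, and the last line assumes the kaleido package is installed):

from _plotly_future_ import b64_encoding
import numpy as np
import plotly.express as px

fig = px.line(y=np.sin(np.linspace(0, 10, 10_000, dtype="float32")))
fig.write_html("b64_test.html")                    # standalone HTML
fig.write_image("b64_test.png", engine="kaleido")  # Kaleido image export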

Testing with Dash

To test this branch with Dash, copy the plotly.min.js file from packages/python/plotly/plotly/package_data/ to the assets folder of your Dash application. And again, make sure to activate the future flag described above before importing plotly or dash.

Note that there may be adverse impacts on non-Graph Dash components that currently accept numpy/pandas data structures. Please post a comment here if you run into anything in Dash that breaks when this feature is active.
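A minimal test app might look like the sketch below; the dash_core_components import style matches Dash 1.x (newer versions expose dcc directly from the dash package), and it assumes you've already copied plotly.min.js into assets/ as described above:

from _plotly_future_ import b64_encoding  # before importing plotly or dash
import numpy as np
import plotly.express as px
import dash
import dash_core_components as dcc

app = dash.Dash(__name__)  # files in assets/ (including plotly.min.js) are served automatically
fig = px.scatter(x=np.random.rand(50_000).astype("float32"),
                 y=np.random.rand(50_000).astype("float32"))
app.layout = dcc.Graph(figure=fig)

if __name__ == "__main__":
    app.run_server(debug=True)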

CCs

CCing folks who have expressed interest in this feature in the past:

@nicolaskruchten @emmanuelle @archmoj @almarklein @chriddyp @alexcjohnson @cboulay @Marc-Andre-Rivet

CCing folks involved with other plotly.js wrappers, as support for this base64 encoding could be added there as well once this plotly.js update is merged and released. If you want to play around with this in your own wrapper, you can grab the corresponding plotly.min.js file from the packages/python/plotly/plotly/package_data/ directory of this PR.

@rpkyle @sglyon @igiagkiozis @kMutagene @waralex

Thanks!

@emmanuelle
Contributor

Thanks @jonmmease this is so cool!

I started testing various things and noticed that the following code (after importing the future flag) results in a blank output in a Jupyter notebook in Firefox

import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", marginal_y="violin",
           marginal_x="box", trendline="ols")
fig.show()

with the following error message in the console

TypeError: tt.concat is not a function

@sglyon

sglyon commented Nov 30, 2020

Very cool!

I'm a little curious about the decision to use base64 instead of possible alternatives. Any insights you could share?

@jonmmease
Contributor Author

jonmmease commented Nov 30, 2020

Thanks for checking it out @sglyon

I'm a little curious about the decision to use base64 instead of possible alternatives

To be honest, base64 seemed like a natural way to do this and I didn't really investigate other approaches. What alternatives do you have in mind?

At the risk of stating the obvious, the requirements for the encoding are that it needs to encode into UTF-8, it needs to be safely embeddable in a JSON string, and there needs to be a pretty lightweight library/algorithm on the JavaScript side to decode it into an ArrayBuffer.

@archmoj
Contributor

archmoj commented Nov 30, 2020

Thanks @jonmmease this is so cool!

I started testing various things and noticed that the following code (after importing the future flag) results in a blank output in a Jupyter notebook in Firefox

import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", marginal_y="violin",
           marginal_x="box", trendline="ols")
fig.show()

with the following error message in the console

TypeError: tt.concat is not a function

@emmanuelle It is possible that typed arrays are not handled in some places/traces in the plotly.js code.
If you can replicate this in a CodePen, please open a plotly.js bug report.
Thank you!

@almarklein

almarklein commented Nov 30, 2020

It should also be considered that a user may use numpy arrays where you'd not expect them, e.g. RangeSlider.value. So in places where this can happen, the JS should ideally not assume Array.

edit: or require users to use lists in those cases

@jonmmease
Contributor Author

Thanks for the example @emmanuelle, I'll look into it.

it should also be considered that a user may use numpy arrays where you'd not expect them, e.g. RangeSlider.value. So in places where this can happen, the JS should ideally not assume Array.

This is a good point. The range slider example should be pretty easy to handle because it has a different plotly.js schema type than other data arrays. I think I'd like to try to handle this in plotly.js. Will make a note.

@nicolaskruchten
Contributor

Re Chart Studio, it's pretty important that in _fullData we continue to have access to the 'normal' array-looking things, otherwise react-chart-editor will break badly.

@nicolaskruchten
Contributor

Also, the chart-studio package will need to keep sending normal arrays I think.

@jonmmease
Contributor Author

Re Chart Studio, it's pretty important that in _fullData we continue to have access to the 'normal' array-looking things, otherwise react-chart-editor will break badly.

Does a typed array count as a "normal array-looking thing"? This is currently what is stored in _fullData. If that's not good enough, maybe we need a config option or something to tell plotly.js to convert the typed array specifications to regular arrays internally instead of typed arrays. Either way, will it cause an issue if .data stores a typed array specification object and ._fullData stores an array?

As I recall, chart_studio uploads shouldn't change because we're already traversing the figure, extracting the arrays into a Grid, updating Figure to reference data in the grid, and then uploading both.

@emmanuelle
Contributor

@mojtaba the codepen is https://codepen.io/emmanuelle-plotly/pen/dypovyL but you would need to pass it the correct URI of the plotly.js bundle.

For the Python folks, I simplified the example to

import numpy as np
import plotly.express as px
fig = px.scatter(x=np.array([1, 2., 3, 4, 5, 6]), y=np.array([1, 2., 1, 2, 1, 3]), marginal_x='box')
fig.show()

Interestingly, no problems with arrays of ints like

import numpy as np
import plotly.express as px
fig = px.scatter(x=np.array([1, 2, 3, 4, 5, 6]), y=np.array([1, 2, 1, 2, 1, 3]), marginal_x='box')
fig.show()

or with other types of marginals like

import numpy as np
import plotly.express as px
fig = px.scatter(x=np.array([1, 2., 3, 4, 5, 6]), y=np.array([1, 2, 1, 2, 1, 3]),
                 marginal_x='violin')
fig.show()

@jonmmease
Contributor Author

Thanks @emmanuelle

Interestingly, no problems with arrays of ints like

This is probably because int64 arrays aren't encoded this way (they are still converted to lists) due to limitations in JavaScript typed arrays (see limitations section in the overview).

or with other types of marginals like...

Cool, probably just a small fix needed in the box trace, which isn't one I had tried previously.

@almarklein

Maybe it's worth considering applying the conversion only in specific cases (opt-in) that are known to be handled correctly on the client. I'm not sure how, though ... and you'd probably want custom code to be able to make use of it as well.

@nicolaskruchten
Contributor

@jonmmease does the fact that we can use utf-8 instead of being restricted to ascii mean that we can use something more efficient than base64?

@jonmmease
Contributor Author

@jonmmease does the fact that we can use utf-8 instead of being restricted to ascii mean that we can use something more efficient than base64?

I haven't done much research here, but it looks like base85 may be an option; its overhead is ~25% versus base64's ~33%, so the encoded output would be roughly 6% smaller.

There's also base91 (http://base91.sourceforge.net/), but its alphabet includes the double quote character, so it wouldn't work easily inside a JSON string.

I'm not sure what else there might be, but the most important thing would be identifying fast libraries for the Python encoding and the JavaScript decoding.
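For a rough size comparison, Python's standard library already ships a base85 codec (base64.b85encode, which uses the RFC 1924 alphabet and avoids the double quote and backslash), so the space overhead is easy to check; decoding speed on the JavaScript side remains the open question:

import base64
import numpy as np

raw = np.random.rand(1_000_000).astype("float32").tobytes()  # ~4 MB of binary data
print(len(base64.b64encode(raw)) / len(raw))  # ~1.33 (4 chars per 3 bytes)
print(len(base64.b85encode(raw)) / len(raw))  # ~1.25 (5 chars per 4 bytes)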

@nicolaskruchten
Contributor

Base85 still targets 7-bit ASCII as far as I can tell... I wonder if there's not something out there that will use all the funky 4-byte unicode characters :)

@jonmmease
Contributor Author

jonmmease commented Dec 1, 2020

Ohh, yeah. I see. Like the fractions and latin letters from https://www.utf8-chartable.de/.

@alexcjohnson
Collaborator

If you look at the exact bit representation of UTF-8, multi-byte characters get less and less information-dense. Single-byte characters carry information in 7 of 8 bits, but all longer sequences use at least 2 bits per byte for structure, so I think anything beyond 7-bit ASCII will be a net loss compared to base64.
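To put rough numbers on that: base64 carries 6 payload bits per 8-bit character (75% density), while 2-, 3-, and 4-byte UTF-8 sequences carry 11, 16, and 21 payload bits respectively, which works out to roughly 69%, 67%, and 66%:

# Payload bits per encoded byte: base64 vs. UTF-8 sequences of 1-4 bytes
base64_density = 6 / 8  # 0.75
utf8_density = {nbytes: bits / (8 * nbytes) for nbytes, bits in [(1, 7), (2, 11), (3, 16), (4, 21)]}
print(base64_density, utf8_density)  # 0.75 {1: 0.875, 2: 0.6875, 3: ~0.667, 4: ~0.656}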

@alexcjohnson
Collaborator

Furthermore, given how simple the b64 encoding is (a bit-for-bit translation to the binary data once you've converted each character to its 6 bits), I'd worry about the decoding speed of any of the non-power-of-2 encodings (b85, b91), and we clearly can't use all 128 code points of 7-bit ASCII. So that makes me think b64 is the best option overall.

@sdementen

Maybe Arrow could be relevant as a b64 alternative: https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit

@jonmmease
Contributor Author

Maybe Arrow could be relevant as a b64 alternative: https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit

I think Arrow should definitely have a role in the plotly/dash serialization story, but for this particular project we're focusing on a more efficient representation of arrays within the current plain-text JSON figure specifications. I'm not aware of anything in Arrow that's focused on plain-text encoding of arrays, but if there is, that would be great to know!

@jonmmease
Contributor Author

@emmanuelle The box plot marginal issue should be fixed now.
