
WIP: base64 encoding for numpy arrays #2943

Closed
wants to merge 6 commits

Conversation

@jonmmease
Contributor

@jonmmease commented Nov 29, 2020

Overview

This is a WIP PR to support encoding n-dimensional numpy arrays using a base64 encoding convention rather than as lists during JSON serialization.

This requires a corresponding plotly.js WIP PR at plotly/plotly.js#5230, but I've committed a plotly.min.js bundle from that PR so that folks can test this branch without building plotly.js.

Activation with future flag

In order to trigger base64 encoding, you need to import the b64_encoding future flag before importing plotly, e.g.

from _plotly_future_ import b64_encoding
import plotly.express as px

Encoding Format

When activated, numpy arrays will be serialized to JSON objects with dtype, shape, and bvals keys. dtype is a numpy-compatible dtype string, and shape can be a scalar integer (for a 1-d array) or a list of N integers (for an N-d array). bvals is a base64-encoded string representing the underlying binary array buffer.

Note that N-d arrays are arranged in row-major ordering (like C and Python), not column-major (like Fortran and MATLAB).

Here is an example of how the array np.arange(3, dtype="int16") is encoded:

{
  "bvals": "AAABAAIA",
  "dtype": "int16",
  "shape": [3]
}
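For reference, this convention can be reproduced with numpy and the standard-library base64 module. The encode_array helper below is hypothetical, a minimal sketch of the format rather than the PR's actual serialization code, and it assumes native little-endian byte order:

import base64
import numpy as np

def encode_array(arr):
    # Ensure a C-contiguous (row-major) buffer before encoding
    arr = np.ascontiguousarray(arr)
    return {
        "bvals": base64.b64encode(arr.tobytes()).decode("ascii"),  # base64 of the raw buffer
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }

print(encode_array(np.arange(3, dtype="int16")))
# {'bvals': 'AAABAAIA', 'dtype': 'int16', 'shape': [3]}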

When this new branch of plotly.js encounters this, it will decode the base64 string into a contiguous ArrayBuffer and wrap it in an Int16Array typed array with value new Int16Array([0, 1, 2]).

Multi-dimensional arrays are also supported. For example, np.arange(6, dtype="float32").reshape(2, 3) will be encoded into:

{
  "bvals": "AAAAAAAAgD8AAABAAABAQAAAgEAAAKBA",
  "dtype": "float32",
  "shape": [2, 3]
}

Plotly.js will decode this into a contiguous ArrayBuffer, then create two Float32Array instances as views onto this ArrayBuffer (no copy is performed here) and nest them inside a regular Array. The final decoded value will be equal to:

[new Float32Array([0, 1, 2]), new Float32Array([3, 4, 5])]
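As a sanity check of the row-major layout, the same 2x3 payload can be decoded back on the Python side with numpy; this is an illustrative round trip only, independent of the plotly.js decoder:

import base64
import numpy as np

spec = {
    "bvals": "AAAAAAAAgD8AAABAAABAQAAAgEAAAKBA",
    "dtype": "float32",
    "shape": [2, 3],
}

buf = base64.b64decode(spec["bvals"])
arr = np.frombuffer(buf, dtype=spec["dtype"]).reshape(spec["shape"])
print(arr)
# [[0. 1. 2.]
#  [3. 4. 5.]]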

Limitations

int64 and uint64 arrays are not supported (they will still be converted to lists). This is because there are no native Int64Array and Uint64Array typed arrays in JavaScript. An interesting side note: the reason for these omissions is that, under the hood, JavaScript represents all numbers as 64-bit floats, so it can only faithfully represent integers up to 2^53 - 1.
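If your data happens to be int64 (numpy's default integer dtype on most platforms) and fits in 32 bits, a hypothetical workaround is to downcast before building the figure so that the base64 path is used; this sketch assumes int32 is wide enough for the values involved:

from _plotly_future_ import b64_encoding  # must come before importing plotly
import numpy as np
import plotly.express as px

x = np.arange(100_000)       # int64 by default on most platforms -> would fall back to a list
y = np.random.rand(100_000)  # float64

# Downcast: int32 and float32 arrays take the base64 encoding path
fig = px.scatter(x=x.astype(np.int32), y=y.astype(np.float32))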

Call for help with QA

I've interactively played around with scatter, scattergl, scatter3d, heatmap, image, isosurface, and volume traces and I think most things are working there. But there are surely things that aren't currently working, so I'd appreciate any early QA help that folks can pitch in with.

In particular, we should check that all properties that support arrays work properly with these base64 encoded array specifications. Feel free to add comments to this PR with any issues you run into. Remember to enable the feature with the future flag as described above!

Call for help with benchmarking

I haven't done any rigorous benchmarking yet, but early notebook %timeit results look promising. For figures that involve arrays with tens of thousands to a couple of million elements, I've been seeing JSON encoding speedups of ~2x to ~20x. This measures the fig.to_json() method in isolation.

It would also be nice to measure whether there is a speedup on the JavaScript side, from the start of the decoding process to the display of a figure, but I don't know offhand how to measure this. Any help here would be greatly appreciated.

One thing to note is that, with this paradigm, there are substantial encoding-time and space savings from lowering the precision of arrays. For a lot of plots, dropping down to float32 or float16 won't change the appearance and will encode substantially faster than float64.

I would especially appreciate JSON encoding benchmarks on figures that folks have built for real-world use cases. And if there are performance regressions anywhere, it would be good to be aware of them.
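For anyone who wants to pitch in, a minimal benchmarking sketch along these lines should work; the trace type, array size, and dtypes here are arbitrary choices, and the flag import can be commented out to time the list-based encoding for comparison:

from _plotly_future_ import b64_encoding  # comment out to benchmark the list-based encoding
import timeit
import numpy as np
import plotly.graph_objects as go

n = 1_000_000
y64 = np.random.rand(n)       # float64
y32 = y64.astype(np.float32)  # lower precision -> smaller payload, faster encoding

for label, y in [("float64", y64), ("float32", y32)]:
    fig = go.Figure(go.Scattergl(y=y))
    t = timeit.timeit(fig.to_json, number=5) / 5
    print(f"{label}: {t * 1000:.1f} ms per fig.to_json()")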

Where rendering should work

After activating the future flag, this branch can be tested in these contexts:

  • classic notebook
  • standalone html files (fig.write_html)
  • Kaleido image export (fig.write_image(engine="kaleido"))

It will not work (yet) in JupyterLab, VS Code, or nteract because they supply their own version of plotly.js instead of using the one bundled with plotly.py.
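For example, a quick sketch that exercises the standalone HTML and Kaleido paths listed above (the file names are arbitrary, and the last line assumes the kaleido package is installed):

from _plotly_future_ import b64_encoding
import numpy as np
import plotly.express as px

fig = px.line(y=np.sin(np.linspace(0, 10, 10_000, dtype="float32")))
fig.write_html("b64_test.html")                    # standalone HTML
fig.write_image("b64_test.png", engine="kaleido")  # Kaleido image export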

Testing with Dash

To test this branch with Dash, copy the plotly.min.js file from packages/python/plotly/plotly/package_data/ to the assets folder of your Dash application. And again, make sure to activate the future flag described above before importing plotly or dash.

Note that there may be adverse impacts on non-Graph Dash components that currently accept numpy/pandas data structures. Please post a comment here if you run into anything in Dash that breaks when this feature is active.
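A minimal test app might look like the sketch below; the dash_core_components import style matches Dash 1.x (newer versions expose dcc directly from the dash package), and it assumes you've already copied plotly.min.js into assets/ as described above:

from _plotly_future_ import b64_encoding  # before importing plotly or dash
import numpy as np
import plotly.express as px
import dash
import dash_core_components as dcc

app = dash.Dash(__name__)  # files in assets/ (including plotly.min.js) are served automatically
fig = px.scatter(x=np.random.rand(50_000).astype("float32"),
                 y=np.random.rand(50_000).astype("float32"))
app.layout = dcc.Graph(figure=fig)

if __name__ == "__main__":
    app.run_server(debug=True)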

CCs

CCing folks who have expressed interest in this feature in the past:

@nicolaskruchten @emmanuelle @archmoj @almarklein @chriddyp @alexcjohnson @cboulay @Marc-Andre-Rivet

CCing folks involved with other plotly.js wrappers, as support for this base64 encoding could be added there as well once this plotly.js update is merged and released. If you want to play around with this in your own wrapper, you can grab the corresponding plotly.min.js file from the packages/python/plotly/plotly/package_data/ directory of this PR.

@rpkyle @sglyon @igiagkiozis @kMutagene @waralex

Thanks!

@emmanuelle
Contributor

Thanks @jonmmease this is so cool!

I started testing various things and noticed that the following code (after importing the future flag) results in a blank output in a Jupyter notebook in Firefox

import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", marginal_y="violin",
           marginal_x="box", trendline="ols")
fig.show()

with the following error message in the console

TypeError: tt.concat is not a function

@sglyon

sglyon commented Nov 30, 2020

Very cool!

I'm a little curious about the decision to use base64 instead of possible alternatives. Any insights you could share?

@jonmmease
Contributor Author

jonmmease commented Nov 30, 2020

Thanks for checking it out @sglyon

I'm a little curious about the decision to use base64 instead of possible alternatives

To be honest, base64 seemed like a natural way to do this and I didn't really investigate other approaches. What alternatives do you have in mind?

At the risk of stating the obvious, the requirements for the encoding are that it needs to encode into UTF-8, it needs to be safely embeddable in a JSON string, and there needs to be a pretty lightweight library/algorithm on the JavaScript side to decode it into an ArrayBuffer.

@archmoj
Contributor

archmoj commented Nov 30, 2020

Thanks @jonmmease this is so cool!

I started testing various things and noticed that the following code (after importing the future flag) results in a blank output in a Jupyter notebook in Firefox

import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", marginal_y="violin",
           marginal_x="box", trendline="ols")
fig.show()

with the following error message in the console

TypeError: tt.concat is not a function

@emmanuelle It is possible that typed arrays are not handled in some places/traces in the plotly.js code.
If you can replicate this in a CodePen, please open a plotly.js bug report.
Thank you!

@almarklein

almarklein commented Nov 30, 2020

It should also be considered that a user may use numpy arrays where you'd not expect them, e.g. RangeSlider.value. So in places where this can happen, the JS should ideally not assume Array.

edit: or require users to use lists in those cases

@jonmmease
Contributor Author

Thanks for the example @emmanuelle, I'll look into it.

it should also be considered that a user may use numpy arrays where you'd not expect them, e.g. RangeSlider.value. So in places where this can happen, the JS should ideally not assume Array.

This is a good point. The range slider example should be pretty easy to handle because it has a different plotly.js schema type than other data arrays. I think I'd like to try to handle this in plotly.js. Will make a note.

@nicolaskruchten
Contributor

Re Chart Studio, it's pretty important that in _fullData we continue to have access to the 'normal' array-looking things, otherwise react-chart-editor will break badly.

@nicolaskruchten
Contributor

Also, the chart-studio package will need to keep sending normal arrays I think.

@jonmmease
Contributor Author

Re Chart Studio, it's pretty important that in _fullData we continue to have access to the 'normal' array-looking things, otherwise react-chart-editor will break badly.

Does a typed array count as a "normal array-looking thing"? This is currently what is stored in _fullData. If that's not good enough, maybe we need a config option or something to tell plotly.js to convert the typed array specifications to regular arrays internally instead of typed arrays. Either way, will it cause an issue if .data stores a typed array specification object and ._fullData stores an array?

As I recall, chart_studio uploads shouldn't change because we're already traversing the figure, extracting the arrays into a Grid, updating Figure to reference data in the grid, and then uploading both.

@emmanuelle
Contributor

@mojtaba the codepen is https://codepen.io/emmanuelle-plotly/pen/dypovyL but you would need to pass it the correct URI of the plotly.js bundle.

For the Python folks, I simplified the example to

import numpy as np
import plotly.express as px
fig = px.scatter(x=np.array([1, 2., 3, 4, 5, 6]), y=np.array([1, 2., 1, 2, 1, 3]), marginal_x='box')
fig.show()

Interestingly, no problems with arrays of ints like

import numpy as np
import plotly.express as px
fig = px.scatter(x=np.array([1, 2, 3, 4, 5, 6]), y=np.array([1, 2, 1, 2, 1, 3]), marginal_x='box')
fig.show()

or with other types of marginals like

import numpy as np
import plotly.express as px
fig = px.scatter(x=np.array([1, 2., 3, 4, 5, 6]), y=np.array([1, 2, 1, 2, 1, 3]),
                 marginal_x='violin')
fig.show()

@jonmmease
Contributor Author

Thanks @emmanuelle

Interestingly, no problems with arrays of ints like

This is probably because int64 arrays aren't encoded this way (they are still converted to lists) due to limitations in JavaScript typed arrays (see limitations section in the overview).

or with other types of marginals like...

Cool, probably just a small fix needed in the box trace, which isn't one I had tried previously.

@almarklein

Maybe it's worth considering applying the conversion only in specific cases (opt-in) that are known to be handled correctly on the client. I'm not sure how, though ... and you'd probably want custom code to be able to make use of it as well.

@nicolaskruchten
Contributor

@jonmmease does the fact that we can use utf-8 instead of being restricted to ascii mean that we can use something more efficient than base64?

@jonmmease
Contributor Author

@jonmmease does the fact that we can use utf-8 instead of being restricted to ascii mean that we can use something more efficient than base64?

I haven't done much research here, but it looks like base85 may be an option; its overhead is ~25% versus base64's ~33%, so the encoded output would be roughly 6% smaller.

There's also base91 (http://base91.sourceforge.net/), but its alphabet includes the double quote character, so it wouldn't work easily inside a JSON string.

I'm not sure what else there might be, but the most important thing would be identifying fast libraries for the Python encoding and the JavaScript decoding.
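For a rough size comparison, Python's standard library already ships a base85 codec (base64.b85encode, which uses the RFC 1924 alphabet and avoids the double quote and backslash), so the space overhead is easy to check; decoding speed on the JavaScript side remains the open question:

import base64
import numpy as np

raw = np.random.rand(1_000_000).astype("float32").tobytes()  # ~4 MB of binary data
print(len(base64.b64encode(raw)) / len(raw))  # ~1.33 (4 chars per 3 bytes)
print(len(base64.b85encode(raw)) / len(raw))  # ~1.25 (5 chars per 4 bytes)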

@nicolaskruchten
Contributor

Base85 still targets 7-bit ASCII as far as I can tell... I wonder if there's not something out there that will use all the funky 4-byte unicode characters :)

@jonmmease
Contributor Author

jonmmease commented Dec 1, 2020

Ohh, yeah. I see. Like the fractions and latin letters from https://www.utf8-chartable.de/.

@alexcjohnson
Collaborator

If you look at the exact bit representation of UTF-8, multi-byte characters get less and less information-dense. Single-byte characters carry information in 7 of 8 bits, but all longer sequences use at least 2 bits per byte for structure, so I think anything beyond 7-bit ASCII will be a net loss compared to base64.
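To put rough numbers on that: base64 carries 6 payload bits per 8-bit character (75% density), while 2-, 3-, and 4-byte UTF-8 sequences carry 11, 16, and 21 payload bits respectively, which works out to roughly 69%, 67%, and 66%:

# Payload bits per encoded byte: base64 vs. UTF-8 sequences of 1-4 bytes
base64_density = 6 / 8  # 0.75
utf8_density = {nbytes: bits / (8 * nbytes) for nbytes, bits in [(1, 7), (2, 11), (3, 16), (4, 21)]}
print(base64_density, utf8_density)  # 0.75 {1: 0.875, 2: 0.6875, 3: ~0.667, 4: ~0.656}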

@alexcjohnson
Collaborator

Furthermore, given how simple the b64 encoding is (a bit-for-bit translation to the binary data once you've converted each character to its 6 bits), I'd worry about the decoding speed of any of the non-power-of-2 encodings (b85, b91), and we clearly can't use all 128 code points of 7-bit ASCII. So that makes me think b64 is the best option overall.

@sdementen

Maybe Arrow could be relevant as a b64 alternative: https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit

@jonmmease
Contributor Author

Maybe Arrow could be relevant as a b64 alternative: https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit

I think Arrow should definitely have a role in the plotly/dash serialization story, but for this particular project we're focusing on a more efficient representation of arrays within the current plain-text JSON figure specifications. I'm not aware of anything in Arrow that's focused on plain-text encoding of arrays, but if there is, that would be great to know!

@jonmmease
Contributor Author

@emmanuelle The box plot marginal issue should be fixed now.
