WIP: base64 encoding for numpy arrays #2943
Thanks @jonmmease this is so cool! I started testing various things, I noticed that the following code (after importing the future flag) results in a blank output in a Jupyter notebook in Firefox
with the following error message in the console
Very cool! I'm a little curious about the decision to use base64 instead of possible alternatives. Any insights you could share?
Thanks for checking it out @sglyon
To be honest, base64 seemed like a natural way to do this and I didn't really investigate other approaches. What alternatives do you have in mind? At the risk of stating the obvious, the requirements for the encoding are that it needs to encode into utf-8 and there needs to be a pretty light-weight library/algorithm on the JavaScript side to decode it into an
@emmanuelle It is possible that
It should also be considered that a user may use numpy arrays where you'd not expect them, e.g. edit: or require users to use lists in those cases
Thanks for the example @emmanuelle, I'll look into it.
This is a good point. The range slider example should be pretty easy to handle because it has a different plotly.js schema type than other data arrays. I think I'd like to try to handle this in plotly.js. Will make a note.
Re Chart Studio, it's pretty important that in
Also, the
Does a typed array count as a "normal array looking thing?". This is currently what is stored in As I recall,
@mojtaba the codepen is https://codepen.io/emmanuelle-plotly/pen/dypovyL but you would need to pass it the correct URI of the plotly.js bundle. For the python folks I simplified the example to be
Interestingly, no problems with arrays of ints like
or with other types of marginals like
Thanks @emmanuelle
This is probably because
Cool, probably just a small fix needed in the box trace, which isn't one I had tried previously.
Maybe it's worth considering only applying the conversion to specific cases (opt-in) which are known to be handled correctly at the client. I'm not sure how though ... and you'd probably want custom code to be able to make use of it as well.
@jonmmease does the fact that we can use utf-8 instead of being restricted to ascii mean that we can use something more efficient than base64?
I haven't done much research here, but it looks like base85 may be an option: it has a 25% size overhead (5 characters per 4 bytes) versus 33% for base64 (4 characters per 3 bytes), so its output is about 6% smaller. There's also base91 (http://base91.sourceforge.net/), but that includes the double quote character so wouldn't work easily inside a JSON string. Not sure what else there might be, but the most important thing would be identifying fast libraries for the Python encoding and JavaScript decoding.
Base85 still targets 7-bit ASCII as far as I can tell... I wonder if there's not something out there that will use all the funky 4-byte unicode characters :)
Ohh, yeah. I see. Like the fractions and latin letters from https://www.utf8-chartable.de/.
If you look at the exact bit representation of UTF-8, multi-byte characters get less and less information dense. Single-byte characters have info in 7 of 8 bits, but all longer ones use at least 2 bits per byte for structure, so I think anything beyond 7-bit ASCII will be a net loss compared to base64.
Furthermore, given how simple the b64 encoding is (bit-for-bit translation to the binary data once you've converted each character to its 6 bits), I'd worry about the decoding speed of any of the non-power-of-2 encodings (b85, b91), and we clearly can't use all 128 options in the 7-bit base ASCII. So that makes me think b64 is the best option overall.
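For anyone who wants to check the arithmetic in the comments above, here is a rough sketch using only the Python standard library. It compares the fraction of payload bits in UTF-8 sequences of each length against base64's 6 bits per 8-bit character, and measures the encoded size of the same buffer under base64 and base85. The helper names `utf8_payload_fraction` and `encoded_sizes` are made up for this illustration and are not part of this PR.

```python
import base64

# Payload bits carried by a UTF-8 sequence of each byte length;
# the remaining bits are structural (leading/continuation markers).
UTF8_PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}

BASE64_FRACTION = 6 / 8  # each 8-bit ASCII character carries 6 bits of data


def utf8_payload_fraction(n_bytes):
    """Fraction of bits that carry code-point data in an n-byte UTF-8 sequence."""
    return UTF8_PAYLOAD_BITS[n_bytes] / (8 * n_bytes)


def encoded_sizes(raw):
    """Encoded length in characters of `raw` under base64 and base85."""
    return {
        "raw": len(raw),
        "base64": len(base64.b64encode(raw)),
        "base85": len(base64.b85encode(raw)),
    }


for n in (1, 2, 3, 4):
    print(n, "byte(s):", utf8_payload_fraction(n), "vs base64:", BASE64_FRACTION)

print(encoded_sizes(bytes(1200)))
```

For a 1200-byte buffer this reports 1600 characters for base64 and 1500 for base85. Only the single-byte case (7/8) beats base64's 6/8 on paper, and that ignores the characters that would need escaping inside a JSON string.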
Maybe Arrow could be relevant as a base64 alternative https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit
I think arrow should definitely have a role in the plotly/dash serialization story, but for this particular project we're focusing on more efficient representation of arrays within the current plain-text JSON figure specifications. I'm not aware of anything in Arrow that's focused on plain-text encoding of arrays, but if there is that would be great to know!
@emmanuelle The box plot marginal issue should be fixed now.
Overview
This is a WIP PR to support encoding n-dimensional numpy arrays using a base64 encoding convention rather than as lists during JSON serialization.
This requires a corresponding plotly.js WIP PR at plotly/plotly.js#5230, but I've committed a `plotly.min.js` bundle from that PR so that folks can test this branch without building plotly.js.

Activation with future flag
In order to trigger base64 encoding, you need to import the `b64_encoding` future flag before importing plotly.

Encoding Format
When activated, numpy arrays will be serialized to JSON objects with `dtype`, `shape`, and `bvals` keys. `dtype` is a numpy-compatible dtype string, `shape` can be a scalar integer (for a 1-d array) or a list of N integers (for an N-d array), and `bvals` is a base64 encoded string representing the underlying binary array buffer.

Note that N-d arrays are arranged in row-major ordering (like C and Python), not column-major (like Fortran and MATLAB).
Here is an example of how the array `np.arange(3, dtype="int16")` is encoded:

When this new branch of plotly.js encounters this, it will decode the base64 string into a contiguous `ArrayBuffer` and wrap that with an `Int16Array` typed array with value `new Int16Array([0, 1, 2])`.

Multi-dimensional arrays are also supported. For example, `np.arange(6, dtype="float32").reshape(2, 3)` will be encoded into:

Plotly.js will then decode this into a contiguous `ArrayBuffer`. Then two `Float32Array` instances will be created as views onto this `ArrayBuffer` (no copy is performed here), and these will be nested inside a regular `Array`. The final decoded value will be equal to:

`[new Float32Array([0, 1, 2]), new Float32Array([3, 4, 5])]`
Limitations
`int64` and `uint64` arrays are not supported (they will still be converted to lists). This is because there are no native `Int64Array` and `Uint64Array` instances in JavaScript. An interesting side note here is that the reason for these omissions is that, under the hood, JavaScript represents all numbers as 64-bit floats, so it can only faithfully represent integers up to `2^53 - 1`.
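The 64-bit float limit is easy to see from Python, since a Python `float` is the same IEEE 754 double that JavaScript uses for its numbers. The sketch below is illustrative only: `is_float64_safe` is a hypothetical helper (not part of this PR) that checks whether an integer array would survive conversion to 64-bit floats unchanged.

```python
import numpy as np

MAX_SAFE_INTEGER = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER


def is_float64_safe(arr):
    """Hypothetical helper: True if every integer in arr fits exactly in a float64."""
    arr = np.asarray(arr)
    if arr.size == 0:
        return True
    # Convert the extremes to Python ints to avoid integer overflow in numpy.
    return -MAX_SAFE_INTEGER <= int(arr.min()) and int(arr.max()) <= MAX_SAFE_INTEGER


big = np.array([2**53 + 1], dtype="int64")
as_float = big.astype("float64")

print(int(as_float[0]) == 2**53 + 1)  # False: the value was rounded to 2**53
print(is_float64_safe(big))  # False
print(is_float64_safe(np.arange(10, dtype="int64")))  # True
```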
Call for help with QA
I've interactively played around with `scatter`, `scattergl`, `scatter3d`, `heatmap`, `image`, `isosurface`, and `volume` traces and I think most everything is working there. But there are surely things that aren't currently working, so I'd appreciate any early QA help that folks could pitch in with.

In particular, we should check that all properties that support arrays work properly with these base64 encoded array specifications. Feel free to add comments to this PR with any issues you run into. Remember to enable the feature with the future flag as described above!
Call for help with benchmarking
I haven't done any rigorous benchmarking yet, but early notebook `%timeit` results look promising. For figures that involve arrays in the tens of thousands to a couple of millions of elements, I've been seeing json encoding speedups of ~2x to ~20x. This is measuring the `fig.to_json()` method in isolation.

It would also be nice to measure whether there is a speedup on the JavaScript side from the start of the decoding process to the display of a figure, but I don't know offhand how to measure this. So any help here would be greatly appreciated.
One thing to note is that with this paradigm, there is a substantial encoding time and space savings associated with lowering the precision of arrays. For a lot of plots, dropping down to `float32` or `float16` won't change the appearance and will encode substantially faster than `float64`.

I would especially appreciate json encoding benchmarks on figures that folks have built for real world use cases. And if there are performance regressions anywhere, it would be good to be aware of that.
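As a starting point for benchmarking, here is a minimal sketch that compares list-based JSON encoding with base64 encoding of the raw buffer at different precisions. It does not call `fig.to_json()` and is not the benchmark behind the numbers above; it only times the two array-encoding strategies in isolation, and the helper names `base64_chars` and `compare_encoders` are made up for this example.

```python
import base64
import json
import timeit

import numpy as np


def base64_chars(arr):
    """Number of characters in the base64 encoding of the array's buffer."""
    return len(base64.b64encode(arr.tobytes()))


def compare_encoders(arr, number=5):
    """Time list-based JSON encoding against base64 encoding (seconds for `number` runs)."""
    list_time = timeit.timeit(lambda: json.dumps(arr.tolist()), number=number)
    b64_time = timeit.timeit(
        lambda: base64.b64encode(arr.tobytes()).decode("ascii"), number=number
    )
    return {"list_json_s": list_time, "base64_s": b64_time}


values = np.linspace(0.0, 1.0, 30_000)
for dtype in ("float64", "float32", "float16"):
    arr = values.astype(dtype)
    print(dtype, base64_chars(arr), compare_encoders(arr))
```

The payload size halves with each step down in precision (320000, 160000, and 80000 characters for 30,000 elements); the timings will vary by machine, so treat them as indicative only.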
Where rendering should work
After activating the future flag, this branch can be tested in these contexts:
- `fig.write_html`
- `fig.write_image(engine="kaleido")`

It will not work (yet) in JupyterLab, vscode, or nteract because they supply their own version of plotly.js instead of using the one bundled with plotly.py.
Testing with Dash
To test this branch with Dash, copy the `plotly.min.js` file from `packages/python/plotly/plotly/package_data/` to the `assets` folder of your Dash application. And again, make sure to activate the future flag described above before importing `plotly` or `dash`.

Note, there may be adverse impacts on non-`Graph` dash components that currently accept numpy/pandas data structures. Please post a comment here if you run into anything in dash that breaks when this feature is active.

CCs
CCing folks who have expressed interest in this feature in the past:
@nicolaskruchten @emmanuelle @archmoj @almarklein @chriddyp @alexcjohnson @cboulay @Marc-Andre-Rivet
CCing folks involved with other plotly.js wrappers, as support for this base64 encoding could be added there as well once this plotly.js update is merged and released. If you want to play around with this in your own wrapper, you can grab the corresponding `plotly.min.js` file from the `packages/python/plotly/plotly/package_data/` directory of this PR.

@rpkyle @sglyon @igiagkiozis @kMutagene @waralex
Thanks!