I'm trying to read a parquet file in the browser, and it seems to take a lot longer than it does in Python. Testing with the largest parquet file in this repo, test/test-files/customer.impala.parquet, in Python:
#!/usr/bin/env python3
import pandas as pd
import time

start = time.time()
df=pd.read_parquet("test/test-files/customer.impala.parquet", engine='pyarrow')
print(df)
end=time.time()
print(f"Took {end-start}s to read with pyarrow")
start=time.time()
df=pd.read_parquet("test/test-files/customer.impala.parquet", engine='fastparquet')
end=time.time()
print(f"Took {end-start}s to read with fastparquet")
Whereas in the browser, using this test HTML/JS:
<html>
<head>
<script type="module">
const parquet = await import("https://unpkg.com/@dsnp/[email protected]/dist/browser/parquet.esm.js");
const buffer_library = await import("https://esm.sh/buffer");
console.log(buffer_library)
console.log(parquet)
const URL = "test/test-files/customer.impala.parquet";
let resp = await fetch(URL)
let buffer = await resp.arrayBuffer()
console.log(buffer)
buffer = buffer_library.Buffer.from(buffer);
const reader = await parquet.ParquetReader.openBuffer(buffer);
//const reader = await parquet.ParquetReader.openUrl(URL);
window.reader = reader
console.log(reader)
var startTime = performance.now()
let cursor = reader.getCursor();
await cursor.next()
console.log(`Time to read first row: ${(performance.now() - startTime) / 1000}s`)
let record = null;
while (record = await cursor.next()) {
  //console.log(record);
}
var endTime = performance.now()
console.log(`Took ${(endTime - startTime) / 1000}s to read ${URL}`)
</script>
</head>
</html>
The console outputs:
Time to read first row: 0.6747999997138977s
Took 1.0477999997138978s to read test/test-files/customer.impala.parquet
That is ~10x slower than Python.
Any ideas on how to improve browser read performance?
The bulk of the time seems to be spent reading the first row.
While running a version of your script, I found that we can save a bit by finally getting around to updating from buffer.slice to buffer.subarray (this mostly saves on the stack). I'll put up a PR for that.
Loading the first row requires loading a lot of pages, which then don't have to be loaded again for rows 2+. So it is doing some up-front work that I believe it mostly has to do. (There are likely some additional ways to optimize that.)
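To make the first-row cost concrete, here is a toy model of that behavior. It is only a sketch: `ToyCursor`, `loadPages`, and `pagesDecoded` are hypothetical names, it is synchronous for brevity, and it is not how parquetjs is actually implemented. It simply assumes all pages are decoded on the first `next()` call and reused afterwards.

```javascript
// Hypothetical toy model, not parquetjs code: rows live in "pages", and all
// pages are decoded the first time a row is requested.
class ToyCursor {
  constructor(pages) {
    this.pages = pages;     // array of pages, each an array of rows
    this.rows = null;       // decoded rows, filled lazily
    this.index = 0;
    this.pagesDecoded = 0;  // counter that makes the up-front work visible
  }

  // Decode every page once; stands in for decompression and decoding.
  loadPages() {
    this.rows = [];
    for (const page of this.pages) {
      this.pagesDecoded += 1;
      for (const row of page) this.rows.push(row);
    }
  }

  // Returns the next row, or null when the rows are exhausted.
  next() {
    if (this.rows === null) this.loadPages(); // only the first call pays
    if (this.index >= this.rows.length) return null;
    return this.rows[this.index++];
  }
}

const cursor = new ToyCursor([[{ id: 1 }, { id: 2 }], [{ id: 3 }]]);
cursor.next();
const decodedAfterFirstRow = cursor.pagesDecoded;
while (cursor.next() !== null) { /* drain the remaining rows */ }
const decodedAfterAllRows = cursor.pagesDecoded;
console.log(decodedAfterFirstRow, decodedAfterAllRows); // prints "2 2"
```

Both counters are equal, which mirrors the timings in the report: most of the work happens before the first row comes back, and later rows are cheap.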
The buffer shim could likely be replaced with the native JS ArrayBuffer or a typed array; however, that is a large refactor.
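As a rough idea of what shim-free reading could look like, the sketch below decodes little-endian 32-bit integers straight from a `Uint8Array` using `DataView`. `readInt32LE` is a hypothetical helper written for this illustration, not part of parquetjs, and a real refactor would touch far more than this.

```javascript
// Sketch only: read little-endian 32-bit integers from a Uint8Array with
// DataView, without the Buffer shim. readInt32LE is a hypothetical helper.
function readInt32LE(bytes, offset) {
  // Honor byteOffset so this also works on views created by subarray().
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getInt32(offset, true); // true = little-endian
}

// In the browser, the input would be new Uint8Array(await resp.arrayBuffer()).
const rawBytes = new Uint8Array([0x2a, 0, 0, 0, 0xff, 0xff, 0xff, 0xff]);
const first = readInt32LE(rawBytes, 0);              // 42
const second = readInt32LE(rawBytes.subarray(4), 0); // -1
console.log(first, second);
```

Passing `byteOffset` and `byteLength` to `DataView` matters here: without them, a view created by `subarray` would be read from the start of the underlying `ArrayBuffer` rather than from the start of the view.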
- Buffer.slice -> Buffer.subarray (and fix a test that wasn't using
buffers)
- new Buffer(array) -> Buffer.from(array)
- Fix issue with `npm run serve`
Found while looking into #117, as `subarray` is slightly faster in the browser
shim.
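The first two bullets can be illustrated with a minimal sketch. It runs in Node.js, where `Buffer` is a global; in the browser the same calls would go through the buffer shim, which is expected to behave the same way since `Buffer` is a `Uint8Array` subclass. The variable names are made up for the example.

```javascript
// Runs in Node.js, where Buffer is a global.
const values = [1, 2, 3, 4, 5, 6];

// Buffer.from(array) replaces the deprecated new Buffer(array) constructor.
const buf = Buffer.from(values);

// subarray returns a view over the same memory; no bytes are copied.
const window3 = buf.subarray(2, 5);
window3[0] = 99;

console.log(buf[2]);         // 99, because the view shares memory with buf
console.log(window3.length); // 3
```

Because the view shares memory with the original buffer, callers that need an independent copy still have to make one explicitly, for example with `Buffer.from(window3)`.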