
fix(hive): Use parquet rather than textfile when uploading CSV files to Hive #14240

Merged
merged 2 commits into from
Apr 24, 2021

Conversation

john-bodley
Member

@john-bodley john-bodley commented Apr 20, 2021

SUMMARY

Fixes #13879 via:

  1. For Hive, rather than storing the CSV file as a text file (which is problematic since multiline fields are not supported), the CSV file is converted to Parquet (a binary format), which circumvents the issue and should additionally provide a more performant mechanism for querying.
  2. Simplifying and streamlining the import flow. All engines now first read the CSV/Excel file into a Pandas dataframe (via streaming—removing the need to temporarily store a file), which is then materialized as a SQL table, so the DB engine specs only require a df_to_sql method. In the case of Hive this is done via a Parquet file. Using a Pandas dataframe as the interim type after importing (reading) and before exporting (writing) should also help with other issues ('Upload parquet' option #14020, [menu] combine upload CSV/Excel in menu #13834).
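The streaming read and the multiline-field problem described above can be sketched as follows. This is a minimal illustration using only pandas; the helper name `csv_stream_to_df` is hypothetical, not Superset's actual API:

```python
import io

import pandas as pd


def csv_stream_to_df(stream, chunksize=1000):
    """Read a CSV stream in chunks and concatenate into one DataFrame."""
    chunks = pd.read_csv(stream, chunksize=chunksize)
    return pd.concat(chunks)


# A CSV payload containing a quoted multiline field -- the case that broke
# the Hive textfile upload (#13879). Reading it into a DataFrame (and then
# writing it out as Parquet) preserves the embedded newline.
stream = io.StringIO('a,b\n1,x\n2,"y\nz"\n')
df = csv_stream_to_df(stream)
print(df.shape)  # (2, 2); the second row's "b" value is "y\nz"
```

The dataframe then becomes the interim type that each engine spec's `df_to_sql` materializes, regardless of whether the file arrived as CSV or Excel.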

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

CI.

ADDITIONAL INFORMATION

  • Has associated issue: [csv] allow csv cell with separate line #13879
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley changed the title fix(hive): Use parquet rather than textfile when uploading CSV files fix(hive): Use parquet rather than textfile when uploading CSV files to Hive Apr 20, 2021
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from 07afbef to 7083b3c Compare April 20, 2021 02:37
@john-bodley john-bodley marked this pull request as ready for review April 20, 2021 02:37
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from 7083b3c to 3bba2f6 Compare April 20, 2021 03:27
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch 7 times, most recently from 7c87666 to 6e024e9 Compare April 20, 2021 07:03
@codecov

codecov bot commented Apr 20, 2021

Codecov Report

Merging #14240 (9f9b26e) into master (852e840) will decrease coverage by 0.04%.
The diff coverage is 97.43%.


@@            Coverage Diff             @@
##           master   #14240      +/-   ##
==========================================
- Coverage   76.97%   76.93%   -0.05%     
==========================================
  Files         953      953              
  Lines       48061    48109      +48     
  Branches     5971     5971              
==========================================
+ Hits        36997    37012      +15     
- Misses      10862    10895      +33     
  Partials      202      202              
Flag Coverage Δ
hive ?
mysql 81.02% <97.43%> (+0.29%) ⬆️
postgres 81.06% <97.43%> (?)
presto ?
python 81.14% <97.43%> (-0.10%) ⬇️
sqlite 80.66% <94.87%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/db_engine_specs/hive.py 70.80% <94.44%> (-20.04%) ⬇️
superset/db_engine_specs/base.py 87.68% <100.00%> (+0.18%) ⬆️
superset/db_engine_specs/bigquery.py 95.87% <100.00%> (-0.05%) ⬇️
superset/views/database/views.py 89.47% <100.00%> (+1.90%) ⬆️
superset/db_engines/hive.py 0.00% <0.00%> (-82.15%) ⬇️
superset/db_engine_specs/__init__.py 65.90% <0.00%> (-15.91%) ⬇️
superset/db_engine_specs/presto.py 83.36% <0.00%> (-6.95%) ⬇️
superset/datasets/commands/importers/v1/utils.py 57.14% <0.00%> (-1.77%) ⬇️
superset/views/database/mixins.py 81.03% <0.00%> (-1.73%) ⬇️
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 852e840...9f9b26e.

@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch 4 times, most recently from 63bd6cc to b5501c5 Compare April 20, 2021 23:45
@@ -279,23 +280,13 @@ def test_import_csv(setup_csv_upload, create_csv_files):
# make sure that john and empty string are replaced with None
engine = get_upload_db().get_sqla_engine()
data = engine.execute(f"SELECT * from {CSV_UPLOAD_TABLE}").fetchall()
if utils.backend() == "hive":
@john-bodley (Member Author) commented:
@bkyryliuk this special casing no longer seems necessary when using the Parquet file format.

A Member replied:
nice!

@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch 3 times, most recently from 8178586 to d345fd3 Compare April 21, 2021 09:05
@ktmud ktmud left a comment:
Nice refactor! Adding a couple of questions and suggestions..


df = pd.concat(chunk for chunk in chunks)

database.db_engine_spec.df_to_sql(
@ktmud ktmud commented Apr 21, 2021:
It seems passing database and table to db_engine_spec.df_to_sql would introduce circular dependencies. Could we maybe create a database level instance method instead?

database.fill_table(table, df)

class Database:
    def fill_table(self, table_name, df, schema=None, **to_sql_kwargs):
        self.db_engine_spec.fill_table(
            table_name, df, schema=schema, **to_sql_kwargs
        )

@john-bodley john-bodley commented Apr 21, 2021:

@ktmud there are no circular dependencies, and this signature is similar to create_table_from_csv, i.e., passing the database (model) and table. This pattern is used elsewhere in the database engine specs.

A Member replied:

There are no circular imports only because the Database type is referenced with the string "Database" instead of the imported class. Conceptually the dependency still seems circular: Database depends on DbEngineSpec, but DbEngineSpec also has class methods that expect a Database or Table as input.

Not sure whether this is a big deal, but if we are only using one or two props from an input argument, it's probably better to pass just those prop values, both to avoid tight coupling and to allow the function to be used elsewhere without initializing the dependent class.
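The loose-coupling alternative described in this thread—passing only the values the function needs rather than the whole Database model—could look roughly like this. This is a sketch; the function and parameter names are illustrative, not Superset's actual signatures:

```python
import sqlite3

import pandas as pd


def df_to_sql(df, con, table_name, schema=None, **to_sql_kwargs):
    """Materialize a DataFrame as a SQL table.

    Takes a bare connection/engine and a table name instead of the full
    Database model, so the function has no dependency on the model layer
    and can be reused anywhere a connection is available.
    """
    df.to_sql(table_name, con, schema=schema, **to_sql_kwargs)


# Usage with a plain sqlite3 connection -- no Database object required.
con = sqlite3.connect(":memory:")
df_to_sql(pd.DataFrame({"a": [1, 2]}), con, "t", index=False)
print(con.execute("SELECT a FROM t ORDER BY a").fetchall())  # [(1,), (2,)]
```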

@@ -156,48 +157,45 @@ def form_post(self, form: CsvToDatabaseForm) -> Response:
).name

try:
utils.ensure_path_exists(config["UPLOAD_FOLDER"])
utils.ensure_path_exists(app.config["UPLOAD_FOLDER"])
upload_stream_write(form.csv_file.data, uploaded_tmp_file_path)
@ktmud ktmud commented:
Not related to this PR, but I think it may be possible to let pandas read directly from the stream, without writing a tempfile:

chunks = pd.read_csv(form.csv_file.data.stream)

@john-bodley john-bodley commented Apr 21, 2021:

@ktmud I think this is a good idea. I'll add this logic in a separate PR if that's ok. I'll apply the same logic to the Excel flow as well.

@john-bodley (Member Author) commented:

Actually @ktmud I opted to add the streaming logic in this PR, given it was a fairly small additional change and recent iterations of this PR meant I needed to refactor some of the Excel flow anyway.

skiprows=form.skiprows.data,
)

df = pd.concat(chunk for chunk in chunks)
@ktmud ktmud commented:

This should also work?

df = pd.concat(chunks)

@john-bodley (Member Author) commented:

@ktmud if we had pylint fully enabled I believe it would catch these unnecessary statements.
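For reference, `pd.concat` accepts any iterable of DataFrames, including the chunked reader returned by `pd.read_csv(..., chunksize=...)`, so the generator expression is indeed redundant. A minimal demonstration:

```python
import io

import pandas as pd

reader = pd.read_csv(io.StringIO("a\n1\n2\n3\n"), chunksize=2)
# pd.concat consumes the iterator directly; wrapping it in a generator
# expression like (chunk for chunk in reader) adds nothing.
df = pd.concat(reader)
print(list(df["a"]))  # [1, 2, 3]
```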

@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from d345fd3 to 681fa71 Compare April 21, 2021 18:45
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from 681fa71 to 6b33a08 Compare April 21, 2021 18:49
@john-bodley
Member Author

@ktmud would you mind re-reviewing this? I've addressed your comments.

@ktmud ktmud left a comment:

LGTM. Thanks for the iterations!

@john-bodley john-bodley merged commit b0f8f6b into apache:master Apr 24, 2021
@john-bodley john-bodley deleted the john-bodley--fix-13879 branch April 24, 2021 06:17
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
…to Hive (apache#14240)

* fix(hive): Use parquet rather than textfile when uploading CSV files

* [csv/excel]: Use stream rather than temporary file

Co-authored-by: John Bodley <[email protected]>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Labels
🏷️ bot, size/XL, 🚢 1.2.0