
fix(hive): Use parquet rather than textfile when uploading CSV files to Hive #14240

Merged
merged 2 commits into from
Apr 24, 2021

Conversation

john-bodley
Member

@john-bodley john-bodley commented Apr 20, 2021

SUMMARY

Fixes #13879 via:

  1. For Hive, rather than storing the CSV file as a text file (which is problematic since multiline fields are not supported), the CSV file is converted to Parquet (a binary format), which circumvents the issue and should additionally provide a more performant mechanism for querying.
  2. Simplifying and streamlining the import flow. All engines now first read the CSV/Excel file into a Pandas dataframe (via streaming—removing the need to temporarily store a file), which is then materialized as a SQL table, so the DB engine specs only require a df_to_sql method. In the case of Hive this is done via a Parquet file. Using a Pandas dataframe as the interim type after importing (reading) and before exporting (writing) should also help with other issues ('Upload parquet' option #14020, [menu] combine upload CSV/Excel in menu #13834).
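The streaming read and the multiline-field problem described above can be sketched as follows. This is a minimal illustration using only pandas; the helper name `csv_stream_to_df` is hypothetical, not Superset's actual API:

```python
import io

import pandas as pd


def csv_stream_to_df(stream, chunksize=1000):
    """Read a CSV stream in chunks and concatenate into one DataFrame."""
    chunks = pd.read_csv(stream, chunksize=chunksize)
    return pd.concat(chunks)


# A CSV payload containing a quoted multiline field -- the case that broke
# the Hive textfile upload (#13879). Reading it into a DataFrame (and then
# writing it out as Parquet) preserves the embedded newline.
stream = io.StringIO('a,b\n1,x\n2,"y\nz"\n')
df = csv_stream_to_df(stream)
print(df.shape)  # (2, 2); the second row's "b" value is "y\nz"
```

The dataframe then becomes the interim type that each engine spec's `df_to_sql` materializes, regardless of whether the file arrived as CSV or Excel.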

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

CI.

ADDITIONAL INFORMATION

  • Has associated issue: [csv] allow csv cell with separate line #13879
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley changed the title fix(hive): Use parquet rather than textfile when uploading CSV files fix(hive): Use parquet rather than textfile when uploading CSV files to Hive Apr 20, 2021
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from 07afbef to 7083b3c Compare April 20, 2021 02:37
@john-bodley john-bodley marked this pull request as ready for review April 20, 2021 02:37
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from 7083b3c to 3bba2f6 Compare April 20, 2021 03:27
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch 7 times, most recently from 7c87666 to 6e024e9 Compare April 20, 2021 07:03
@codecov

codecov bot commented Apr 20, 2021

Codecov Report

Merging #14240 (9f9b26e) into master (852e840) will decrease coverage by 0.04%.
The diff coverage is 97.43%.


@@            Coverage Diff             @@
##           master   #14240      +/-   ##
==========================================
- Coverage   76.97%   76.93%   -0.05%     
==========================================
  Files         953      953              
  Lines       48061    48109      +48     
  Branches     5971     5971              
==========================================
+ Hits        36997    37012      +15     
- Misses      10862    10895      +33     
  Partials      202      202              
Flag Coverage Δ
hive ?
mysql 81.02% <97.43%> (+0.29%) ⬆️
postgres 81.06% <97.43%> (?)
presto ?
python 81.14% <97.43%> (-0.10%) ⬇️
sqlite 80.66% <94.87%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/db_engine_specs/hive.py 70.80% <94.44%> (-20.04%) ⬇️
superset/db_engine_specs/base.py 87.68% <100.00%> (+0.18%) ⬆️
superset/db_engine_specs/bigquery.py 95.87% <100.00%> (-0.05%) ⬇️
superset/views/database/views.py 89.47% <100.00%> (+1.90%) ⬆️
superset/db_engines/hive.py 0.00% <0.00%> (-82.15%) ⬇️
superset/db_engine_specs/__init__.py 65.90% <0.00%> (-15.91%) ⬇️
superset/db_engine_specs/presto.py 83.36% <0.00%> (-6.95%) ⬇️
superset/datasets/commands/importers/v1/utils.py 57.14% <0.00%> (-1.77%) ⬇️
superset/views/database/mixins.py 81.03% <0.00%> (-1.73%) ⬇️
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 852e840...9f9b26e.

@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch 4 times, most recently from 63bd6cc to b5501c5 Compare April 20, 2021 23:45
@@ -279,23 +280,13 @@ def test_import_csv(setup_csv_upload, create_csv_files):
# make sure that john and empty string are replaced with None
engine = get_upload_db().get_sqla_engine()
data = engine.execute(f"SELECT * from {CSV_UPLOAD_TABLE}").fetchall()
if utils.backend() == "hive":
@john-bodley (Member Author) commented:
@bkyryliuk this special casing no longer seems necessary when using the Parquet file format.

A Member replied:
nice!

@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch 3 times, most recently from 8178586 to d345fd3 Compare April 21, 2021 09:05
@ktmud ktmud left a comment:
Nice refactor! Adding a couple of questions and suggestions..


df = pd.concat(chunk for chunk in chunks)

database.db_engine_spec.df_to_sql(
@ktmud ktmud commented Apr 21, 2021:
It seems passing database and table to db_engine_spec.df_to_sql would introduce circular dependencies. Could we maybe create a database level instance method instead?

database.fill_table(table, df)

class Database:
    def fill_table(self, table_name, df, schema=None, **to_sql_kwargs):
        self.db_engine_spec.fill_table(
            table_name, df, schema=schema, **to_sql_kwargs
        )

@john-bodley john-bodley commented Apr 21, 2021:

@ktmud there are no circular dependencies, and this signature is similar to create_table_from_csv, i.e., passing the database (model) and table. This pattern is used elsewhere in the database engine specs.

A Member replied:

There are no circular imports only because the Database type is referenced with the string "Database" instead of the imported class. Conceptually the dependency still seems circular: Database depends on DbEngineSpec, but DbEngineSpec also has class methods that expect a Database or Table as input.

Not sure whether this is a big deal, but if we are only using one or two props from an input argument, it's probably better to pass just those prop values, both to avoid tight coupling and to allow the function to be used elsewhere without initializing the dependent class.
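The loose-coupling alternative described in this thread—passing only the values the function needs rather than the whole Database model—could look roughly like this. This is a sketch; the function and parameter names are illustrative, not Superset's actual signatures:

```python
import sqlite3

import pandas as pd


def df_to_sql(df, con, table_name, schema=None, **to_sql_kwargs):
    """Materialize a DataFrame as a SQL table.

    Takes a bare connection/engine and a table name instead of the full
    Database model, so the function has no dependency on the model layer
    and can be reused anywhere a connection is available.
    """
    df.to_sql(table_name, con, schema=schema, **to_sql_kwargs)


# Usage with a plain sqlite3 connection -- no Database object required.
con = sqlite3.connect(":memory:")
df_to_sql(pd.DataFrame({"a": [1, 2]}), con, "t", index=False)
print(con.execute("SELECT a FROM t ORDER BY a").fetchall())  # [(1,), (2,)]
```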

@@ -156,48 +157,45 @@ def form_post(self, form: CsvToDatabaseForm) -> Response:
).name

try:
utils.ensure_path_exists(config["UPLOAD_FOLDER"])
utils.ensure_path_exists(app.config["UPLOAD_FOLDER"])
upload_stream_write(form.csv_file.data, uploaded_tmp_file_path)
@ktmud ktmud commented:
Not related to this PR, but I think it may be possible to let pandas read directly from the stream, without writing a tempfile:

chunks = pd.read_csv(form.csv_file.data.stream)

@john-bodley john-bodley commented Apr 21, 2021:

@ktmud I think this is a good idea. I'll add this logic in a separate PR if that's ok. I'll apply the same logic to the Excel flow as well.

@john-bodley (Member Author) commented:

Actually @ktmud I opted to add the streaming logic in this PR, given it was a fairly small additional change and recent iterations of this PR meant I needed to refactor some of the Excel flow anyway.

skiprows=form.skiprows.data,
)

df = pd.concat(chunk for chunk in chunks)
@ktmud ktmud commented:

This should also work?

df = pd.concat(chunks)

@john-bodley (Member Author) commented:

@ktmud if we had pylint fully enabled I believe it would catch these unnecessary statements.
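For reference, `pd.concat` accepts any iterable of DataFrames, including the chunked reader returned by `pd.read_csv(..., chunksize=...)`, so the generator expression is indeed redundant. A minimal demonstration:

```python
import io

import pandas as pd

reader = pd.read_csv(io.StringIO("a\n1\n2\n3\n"), chunksize=2)
# pd.concat consumes the iterator directly; wrapping it in a generator
# expression like (chunk for chunk in reader) adds nothing.
df = pd.concat(reader)
print(list(df["a"]))  # [1, 2, 3]
```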

@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from d345fd3 to 681fa71 Compare April 21, 2021 18:45
@john-bodley john-bodley force-pushed the john-bodley--fix-13879 branch from 681fa71 to 6b33a08 Compare April 21, 2021 18:49
@john-bodley
Member Author

@ktmud would you mind re-reviewing this? I've addressed your comments.

@ktmud ktmud left a comment:

LGTM. Thanks for the iterations!

@john-bodley john-bodley merged commit b0f8f6b into apache:master Apr 24, 2021
@john-bodley john-bodley deleted the john-bodley--fix-13879 branch April 24, 2021 06:17
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
…to Hive (apache#14240)

* fix(hive): Use parquet rather than textfile when uploading CSV files

* [csv/excel]: Use stream rather than temporary file

Co-authored-by: John Bodley <[email protected]>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Labels
🏷️ bot, size/XL, 🚢 1.2.0