fix(hive): Use parquet rather than textfile when uploading CSV files to Hive #14240
Conversation
Force-pushed 07afbef to 7083b3c
Force-pushed 7083b3c to 3bba2f6
Force-pushed 7c87666 to 6e024e9
Codecov Report

```diff
@@            Coverage Diff             @@
##           master   #14240      +/-   ##
==========================================
- Coverage   76.97%   76.93%    -0.05%
==========================================
  Files         953      953
  Lines       48061    48109       +48
  Branches     5971     5971
==========================================
+ Hits        36997    37012       +15
- Misses      10862    10895       +33
  Partials      202      202
```
Force-pushed 63bd6cc to b5501c5
```diff
@@ -279,23 +280,13 @@ def test_import_csv(setup_csv_upload, create_csv_files):
     # make sure that john and empty string are replaced with None
     engine = get_upload_db().get_sqla_engine()
     data = engine.execute(f"SELECT * from {CSV_UPLOAD_TABLE}").fetchall()
-    if utils.backend() == "hive":
```
@bkyryliuk this special casing no longer seems necessary when using the Parquet file format.
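(Context, not from the PR itself: a minimal sketch of why Parquet removes the need for the special case. Parquet preserves the NULL vs. empty-string distinction that a textfile/CSV round trip loses for string columns. Assumes pyarrow is installed; the file path is illustrative.)

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", None, ""]})
df.to_parquet("/tmp/upload.parquet")  # requires pyarrow or fastparquet
roundtrip = pd.read_parquet("/tmp/upload.parquet")

assert pd.isna(roundtrip["name"][1])  # NULL preserved
assert roundtrip["name"][2] == ""     # empty string preserved
```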
nice!
Force-pushed 8178586 to d345fd3
Nice refactor! Adding a couple of questions and suggestions.
```python
df = pd.concat(chunk for chunk in chunks)

database.db_engine_spec.df_to_sql(
```
It seems passing database and table to db_engine_spec.df_to_sql would introduce circular dependencies. Could we maybe create a database-level instance method instead?

```python
database.fill_table(table, df)

class Database:
    def fill_table(self, table_name, df, schema=None, **to_sql_kwargs):
        self.db_engine_spec.fill_table(
            table_name, df, schema=schema, **to_sql_kwargs
        )
```
@ktmud there are no circular dependencies, and this follows a similar signature to create_table_from_csv, i.e., passing the database (model) and table. This pattern is used elsewhere in the database engine specs.
There are no circular imports only because the Database type was referenced as the string "Database" rather than the imported class. Conceptually the dependency still seems circular: Database depends on DbEngineSpec, but DbEngineSpec also has class methods that expect Database or Table as input.

Not sure whether this is a big deal, but if we are only using one or two props from an input argument, it's probably better to just pass the prop values. That avoids tight coupling and allows the function to be used elsewhere without initializing the dependent class.
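(Illustration, not from the PR: a hedged sketch of the looser coupling described above, passing only the values the function needs rather than the Database model. The name and signature are hypothetical.)

```python
from typing import Any, Optional

import pandas as pd
from sqlalchemy import create_engine


def df_to_sql(
    df: pd.DataFrame,
    *,
    sqlalchemy_uri: str,  # hypothetical: pass the URI, not the Database model
    table_name: str,
    schema: Optional[str] = None,
    **to_sql_kwargs: Any,
) -> None:
    # Depends only on primitive values, so it can be reused (and tested)
    # without constructing a Database instance.
    engine = create_engine(sqlalchemy_uri)
    df.to_sql(table_name, engine, schema=schema, **to_sql_kwargs)
```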
superset/views/database/views.py (outdated)
```diff
@@ -156,48 +157,45 @@ def form_post(self, form: CsvToDatabaseForm) -> Response:
             ).name

         try:
-            utils.ensure_path_exists(config["UPLOAD_FOLDER"])
+            utils.ensure_path_exists(app.config["UPLOAD_FOLDER"])
             upload_stream_write(form.csv_file.data, uploaded_tmp_file_path)
```
Not related to this PR, but I think it may be possible to let pandas read directly from the stream, without writing a tempfile:

```python
chunks = pd.read_csv(form.csv_file.data.stream)
```
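(Illustration, not from the PR: a hedged sketch of the stream-based read, assuming a werkzeug FileStorage as in the form handler above; chunksize keeps memory bounded for large uploads. The helper name is hypothetical.)

```python
import pandas as pd
from werkzeug.datastructures import FileStorage


def read_csv_upload(csv_file: FileStorage, chunksize: int = 10_000) -> pd.DataFrame:
    # Read directly from the upload stream in chunks, then combine into a
    # single DataFrame; no temporary file is written to disk.
    chunks = pd.read_csv(csv_file.stream, chunksize=chunksize)
    return pd.concat(chunks)
```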
@ktmud I think this is a good idea. I'll add this logic in a separate PR if that's ok. I'll apply the same logic to the Excel flow as well.
Actually @ktmud I opted to add the streaming logic in this PR, given it was a fairly small additional change and recent iterations of this PR meant I needed to refactor some of the Excel flow.
superset/views/database/views.py (outdated)
```python
    skiprows=form.skiprows.data,
)

df = pd.concat(chunk for chunk in chunks)
```
This should also work?

```python
df = pd.concat(chunks)
```
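(For context: pd.concat accepts any iterable of DataFrames, including the TextFileReader that pd.read_csv returns when chunksize is set, so the generator expression is redundant.)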
@ktmud if we had pylint fully enabled, I believe it would catch these unnecessary statements.
Force-pushed d345fd3 to 681fa71
Force-pushed 681fa71 to 6b33a08
@ktmud would you mind re-reviewing this? I've addressed your comments.
LGTM. Thanks for the iterations!
…to Hive (apache#14240)

* fix(hive): Use parquet rather than textfile when uploading CSV files
* [csv/excel]: Use stream rather than temporary file

Co-authored-by: John Bodley <[email protected]>
SUMMARY

Fixes #13879 by uploading the CSV data via the database engine spec's df_to_sql method; in the case of Hive this is via a Parquet file. This should help with other issues ("Upload parquet" option #14020, "[menu] combine upload CSV/Excel in menu" #13834) by using a pandas DataFrame as the interim type after importing (reading) and before exporting (writing).

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
CI.
ADDITIONAL INFORMATION
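(Context, not part of the PR: a hedged sketch of the Parquet-based Hive upload path the summary describes. The uploader callable and the DDL are simplified assumptions, not the actual Superset implementation.)

```python
import tempfile

import pandas as pd


def hive_df_to_sql(df: pd.DataFrame, table_name: str, upload_file) -> str:
    """upload_file is a hypothetical callable returning the remote location."""
    # Serialize the DataFrame to Parquet, push the file to the cluster's
    # storage, then point a Hive table at it.
    with tempfile.NamedTemporaryFile(suffix=".parquet") as tmp:
        df.to_parquet(tmp.name)  # requires pyarrow or fastparquet
        location = upload_file(tmp.name)
    # Column definitions elided for brevity; real DDL would need them.
    return (
        f"CREATE EXTERNAL TABLE {table_name} (...) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )
```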