feat(trino): Add functionality to upload data #29164

Merged

@john-bodley john-bodley commented Jun 10, 2024

SUMMARY

This PR adds the ability to upload data to Trino via a CSV file, logic that previously existed only as custom code at Airbnb. It is based on the logic in the HiveEngineSpec.

Note: I'm not certain whether this is best housed in the TrinoEngineSpec or the PrestoBaseEngineSpec. Airbnb currently supports only Trino, so I would not be able to provide end-to-end testing for Presto.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added tests that mimic the Hive logic.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

codecov bot commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.73%. Comparing base (76d897e) to head (97d011d).
Report is 1094 commits behind head on master.

Files with missing lines            Patch %   Lines
superset/db_engine_specs/trino.py   95.83%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #29164       +/-   ##
===========================================
+ Coverage   60.48%   83.73%   +23.24%     
===========================================
  Files        1931      518     -1413     
  Lines       76236    37547    -38689     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    31441    -14673     
+ Misses      28017     6106    -21911     
+ Partials     2105        0     -2105     
Flag         Coverage Δ
hive         48.93% <38.46%> (-0.23%) ⬇️
javascript   ?
mysql        77.22% <96.15%> (?)
postgres     77.35% <96.15%> (?)
presto       53.55% <38.46%> (-0.25%) ⬇️
python       83.73% <96.15%> (+20.24%) ⬆️
sqlite       76.80% <96.15%> (?)
unit         59.19% <38.46%> (+1.57%) ⬆️

@john-bodley john-bodley force-pushed the john-bodley--presto-df-to-sql branch from 2680b22 to 97d011d Compare June 10, 2024 22:08
@john-bodley john-bodley requested a review from justinpark June 10, 2024 22:19
@john-bodley john-bodley marked this pull request as ready for review June 10, 2024 22:41
@dosubot dosubot bot added data:connect:trino Related to Trino data:csv Related to import/export of CSVs labels Jun 10, 2024
@@ -132,6 +132,7 @@ gevent = ["gevent>=23.9.1"]
gsheets = ["shillelagh[gsheetsapi]>=1.2.18, <2"]
hana = ["hdbcli==2.4.162", "sqlalchemy_hana==0.4.0"]
hive = [
"boto3",

This is imported within the HiveEngineSpec and thus should likely be declared as a dependency.

@@ -154,7 +155,7 @@ pinot = ["pinotdb>=0.3.3, <0.4"]
playwright = ["playwright>=1.37.0, <2"]
postgres = ["psycopg2-binary==2.9.6"]
presto = ["pyhive[presto]>=0.6.5"]
-trino = ["trino>=0.328.0"]
+trino = ["boto3", "trino>=0.328.0"]

See above.

@@ -79,6 +79,12 @@ def upload_to_s3(filename: str, upload_prefix: str, table: Table) -> str:
)

s3 = boto3.client("s3")

# The location is merely an S3 prefix and thus we first need to ensure that there is
@john-bodley john-bodley Jun 10, 2024

I discovered this issue when testing locally: because the location field is a directory rather than a file, every file within that directory is slurped into the table. We should therefore first ensure that the location is empty.
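
For illustration only, here is a minimal sketch of emptying an S3 prefix before uploading a new file. It is not the code from this PR. The helper name `clear_s3_prefix` and its parameters are hypothetical, and `s3` is assumed to behave like `boto3.client("s3")`, of which only `list_objects_v2` and `delete_objects` are used.

```python
from typing import Any


def clear_s3_prefix(s3: Any, bucket: str, prefix: str) -> int:
    """Delete every object under `prefix` in `bucket`; return how many were deleted.

    Hypothetical helper: `s3` is assumed to be a boto3-style S3 client.
    """
    # Add a trailing slash so "uploads/t" does not also match "uploads/t_other/...".
    if not prefix.endswith("/"):
        prefix += "/"

    deleted = 0
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3.list_objects_v2(**kwargs)
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # A listing page holds at most 1,000 keys, which is also the
            # per-request limit of delete_objects.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        # Continue from where the previous page stopped.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With a real client this might be called as `clear_s3_prefix(boto3.client("s3"), bucket, prefix)` just before the upload, so that the external table's location contains only the newly uploaded file.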

@john-bodley john-bodley merged commit 53798c7 into apache:master Jun 13, 2024
65 of 66 checks passed
@john-bodley john-bodley deleted the john-bodley--presto-df-to-sql branch June 13, 2024 15:55
@michael-s-molina michael-s-molina mentioned this pull request Nov 25, 2024
9 tasks
michael-s-molina added a commit to michael-s-molina/superset that referenced this pull request Nov 25, 2024
michael-s-molina added a commit that referenced this pull request Nov 25, 2024
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 4.1.0 labels Nov 27, 2024
sadpandajoe pushed a commit that referenced this pull request Dec 4, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels data:connect:trino Related to Trino data:csv Related to import/export of CSVs size/L 🚢 4.1.0
3 participants