Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support querying files directly in posit connect with duckdb #197

Open
machow opened this issue Apr 12, 2023 · 1 comment
Open

Support querying files directly in posit connect with duckdb #197

machow opened this issue Apr 12, 2023 · 1 comment

Comments

@machow
Copy link
Collaborator

machow commented Apr 12, 2023

Since pins-python uses fsspec under the hood, users are able to query pins data directly using duckdb's fsspec integration.

While #193 allows duckdb to query CSV pins on posit connect, parquet files cannot be queried. This is likely because duckdb needs to scan parquet headers.

Below, I provide examples, but first--here is a snippet to enable logging to stdout:

import logging
import sys

root = logging.getLogger("pins")
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
root.addHandler(handler)

Querying parquet pins on s3 (for reference)

First, here is how you connect to a temporary s3 board, and return info on a file:

# create a temporary board, with the contents of the pins-compat test board ----
# note that my s3 credentials are in a .env file, see .env.dev
from dotenv import load_dotenv
load_dotenv()

bb = BoardBuilder("s3")
board = bb.create_tmp_board("pins/tests/pins-compat")

# display info for a csv file ----
board.fs.info(f"s3://{board.board}/df_csv/20220214T163718Z-eceac/df_csv.csv")
{'ETag': '"e6e2bc89538baa1ee31e3294efcb1d82"',
 'LastModified': datetime.datetime(2023, 4, 12, 15, 22, 41, tzinfo=tzutc()),
 'size': 20,
 'name': 'ci-pins/222afb60-8e19-4cc0-b1a5-80098d0f410d/df_csv/20220214T163718Z-eceac/df_csv.csv',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'text/csv'}

Next, we'll add a parquet pin

from pins.data import mtcars

board.pin_write(mtcars, "df_parquet", type="parquet")
board.pin_versions("df_parquet")
              created   hash                 version
0 2023-04-12 11:30:57  69d97  20230412T113057Z-69d97

Finally, we'll query directly in duckdb

import duckdb

duckdb.register_filesystem(board.fs)

# query via duckdb! ----
data_path = f"s3://{board.board}/df_parquet/20230412T113057Z-69d97/df_parquet.parquet"
duckdb.execute(f"SELECT mpg FROM read_parquet('{data_path}')").df()

Querying parquet in pins

import pins
import duckdb

# note that my connect credentials are in a .env file
from dotenv import load_dotenv
load_dotenv()

# connect to board, register fs to duckdb ----
board = pins.board_connect("https://colorado.posit.co/rsc")
duckdb.register_filesystem(board.fs)

# look up bundle id ----
board.pin_meta("michael.chow/mtcars3")

# query with duckdb ----
duckdb.execute(
     "SELECT * FROM read_parquet('rsc://michael.chow/mtcars3/72103/mtcars3.parquet')"
).df()
InvalidInputException: Invalid Input Error: No magic bytes found at end of file 'rsc://michael.chow/mtcars3/72103/mtcars3.parquet'

I think the issue has to do with how we're returning info on the file.

@machow
Copy link
Collaborator Author

machow commented Apr 12, 2023

So the good news is I'm able to query parquet files now. The bad news is that it seems to be making 1 fs.exists call, and 7 fs.info calls? Because the Posit API requires us to make many API calls to convert the human friendly paths (e.g. michael.chow/mtcars3/72103/mtcars3.parquet to guids), each fs.info call requires

  • /__api__/v1/users - times the number of pages needed to walk this endpoint
  • /__api__/v1/content - to look up the content guid
  • /__api__/v1/content/{guid}/bundles/{id}: to ensure the bundle exists
  • /content/{guid}/_rev{id}/{fname}: to get file name info

Note that because we have a cache wrapped around the filesystem, 2 of the calls also open the file (likely the first to read the parquet header, and the second to fetch relevant data?).

Here's the full log:

# fs.exists ----

2023-04-12 12:30:34,755 - pins.rsconnect.fs - DEBUG - exists
2023-04-12 12:30:34,756 - pins.rsconnect.fs - DEBUG - info: michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:34,756 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:34,910 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:35,063 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:35,241 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:35,424 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}

# fs.info ----

2023-04-12 12:30:35,597 - pins.rsconnect.fs - DEBUG - info: michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:35,598 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:35,768 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:35,927 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:36,119 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:36,321 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}
2023-04-12 12:30:36,500 - pins.cache - INFO - cache file: /Users/machow/Library/Caches/pins-py/rsc_0c1c9f784f62118a2f2361e8cff9105dfe5066d03ce45997d504c6950e8f5b6b/michael.chow+mtcars3/72103/mtcars3.parquet

# fs.info ----

2023-04-12 12:30:36,502 - pins.rsconnect.fs - DEBUG - info: rsc://michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:36,503 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:36,656 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:36,832 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:37,037 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:37,226 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}


# fs.info ----

2023-04-12 12:30:37,385 - pins.rsconnect.fs - DEBUG - info: rsc://michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:37,386 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:37,537 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:37,704 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:37,881 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:38,088 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}


# fs.info ----

2023-04-12 12:30:38,243 - pins.rsconnect.fs - DEBUG - info: rsc://michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:38,244 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:38,397 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:38,548 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:38,739 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:38,952 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}


# fs.info ----

2023-04-12 12:30:39,124 - pins.rsconnect.fs - DEBUG - info: michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:39,125 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:39,286 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:39,436 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:39,622 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:39,821 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}
2023-04-12 12:30:39,982 - pins.cache - INFO - cache file: /Users/machow/Library/Caches/pins-py/rsc_0c1c9f784f62118a2f2361e8cff9105dfe5066d03ce45997d504c6950e8f5b6b/michael.chow+mtcars3/72103/mtcars3.parquet


# fs.info ----

2023-04-12 12:30:40,048 - pins.rsconnect.fs - DEBUG - info: rsc://michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:40,048 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:40,193 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:40,344 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:40,532 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:40,736 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}


# fs.info ----

2023-04-12 12:30:40,909 - pins.rsconnect.fs - DEBUG - info: rsc://michael.chow/mtcars3/72103/mtcars3.parquet
2023-04-12 12:30:40,910 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True}}
2023-04-12 12:30:41,079 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/users -- {'params': {'prefix': 'michael.chow', 'walk_pages': True, 'page_number': 2}}
2023-04-12 12:30:41,232 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content -- {'params': {'owner_guid': 'c31bd134-4d4a-4275-92b1-3e7f8046c03a', 'name': 'mtcars3'}}
2023-04-12 12:30:41,423 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/__api__/v1/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/bundles/72103 -- {}
2023-04-12 12:30:41,615 - pins.rsconnect.api - DEBUG - RSConnect API GET: https://colorado.posit.co/rsc/content/6e1b8ea7-aafb-462a-b644-d4a62951ec85/_rev72103/mtcars3.parquet -- {}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant