Support `select .. from 'data.parquet'` files in SQL from any SessionContext (optionally) #4850

Comments
I'm happy to take on this issue if no one else is already working on it!
Enhancement: instead of relying only on the file extension (as in the current implementation), we could take some inspiration from DuckDB's data loading/importing (https://duckdb.org/docs/data/csv) and let the user indicate the format to use. E.g. `select ... from read_csv('filename', options)` instead of plain `select .. from 'filename'`. This would be helpful in cases where files are named "hitdata.tsv" (https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-overview.html?lang=en).
I would love to see a function like this.
IMHO, these are table functions. I wonder whether DataFusion supports table functions now?

I agree with @unconsolable.
@alamb Is there any progress here? I'd like to try this.
Hi @holicc -- no, there is no progress on this issue that I know of. Given the conversation above, it seems like there are two options. Have you given any thought to the approach to take?
@alamb Maybe both options are good, like DuckDB (see documentation: http://duckdb.org/docs/archive/0.8.1/data/csv/overview).
Having both makes sense to me. I would say making a table function is likely a large piece of design work, as there is no pre-existing model to follow for table functions (we would probably need some sort of registry, plus a pass that expands them into some sort of `TableProvider`).
I found out that we are using the ...
Yes, we could. I would recommend thinking a little more holistically about table functions: specifically, adding them to the registry (https://docs.rs/datafusion/latest/datafusion/execution/registry/trait.FunctionRegistry.html) and then teaching our SQL planner how to look up those functions.
IMHO, add a new trait:

```rust
#[async_trait]
pub trait TableFunction: Sync + Send {
    async fn execute(
        &self,
        table: impl Into<OwnedTableReference>,
        state: &SessionState,
    ) -> Result<Arc<dyn TableProvider>>;
}
```

and add a new method to `FunctionRegistry`:

```rust
pub trait FunctionRegistry {
    ...
    /// Returns a reference to the table function named `name`.
    fn table_function(&self, name: &str) -> Result<Arc<dyn TableFunction>>;
}
```
I think one big difference is that a table function can take some number of parameters. The SQL parser will give the argument as an expression, so perhaps we could make the trait accept that arbitrary argument as an `Expr`:

```rust
#[async_trait]
pub trait TableFunction: Sync + Send {
    async fn execute(
        &self,
        arg: Expr,
        state: &SessionState,
    ) -> Result<Arc<dyn TableProvider>>;
}
```
And what about multiple arguments? E.g. `read_csv('blah.csv', delimiter = ';', ... )`
Good point @timvw -- we would probably want to allow that in the API. I tried it out, and it does appear that the SQL parser supports that kind of syntax 👍

```
❯ select foo from read_parquet('foo.parquet', 'bar');
Error during planning: table 'datafusion.public.read_parquet' not found
```
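To make the multi-argument shape concrete, here is a small, hedged sketch in plain Rust (no DataFusion types): it splits an argument list like `('blah.csv', delimiter = ';')` into positional values and named options. In real DataFusion the planner would receive parsed expressions rather than strings; `parse_args` is a hypothetical helper for illustration only.

```rust
// Split table-function arguments into positional values and named
// key/value options, mimicking what a planner pass would do with the
// parsed argument expressions.
fn parse_args(args: &[&str]) -> (Vec<String>, Vec<(String, String)>) {
    let mut positional = Vec::new();
    let mut named = Vec::new();
    for a in args {
        match a.split_once('=') {
            // "delimiter = ';'" -> ("delimiter", "';'")
            Some((k, v)) => named.push((k.trim().to_string(), v.trim().to_string())),
            // "'blah.csv'" -> positional argument
            None => positional.push(a.trim().to_string()),
        }
    }
    (positional, named)
}

fn main() {
    let (pos, named) = parse_args(&["'blah.csv'", "delimiter = ';'"]);
    println!("positional: {pos:?}, named: {named:?}");
    assert_eq!(pos, vec!["'blah.csv'"]);
    assert_eq!(named, vec![("delimiter".to_string(), "';'".to_string())]);
}
```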
@alamb I'm having trouble inferring the schema because I can't get a `SessionState` from the `ContextProvider`. Can you help me?

```rust
let listing_options = CsvReadOptions::default().to_listing_options(&self.config);
let url = ListingTableUrl::parse(file_path)?;
let cfg = ListingTableConfig::new(url)
    .with_listing_options(listing_options)
    .with_schema(Arc::new(arrow_schema::Schema::empty()));
// FIXME How to get a SessionState?
cfg.infer_schema(state);
let table = ListingTable::try_new(cfg)?;
let source = Arc::new(DefaultTableSource::new(Arc::new(table)));
```
Hi @holicc -- I don't think it is possible to do this at the SQL parser level. Thus I think the resolution from a table name to a `ListingTable` may have to happen later in the process.
cc @goldmedal -- here is an issue that describes supporting reading file-style URLs as tables.

Probably a good first step would be to move the `DynamicFileCatalog` (datafusion-cli/src/catalog.rs, line 79 in 088ad01) into the core (so it could be used from any `SessionContext`).

Then a second step would be to add an option (like the `information_schema` tables) that would enable installing it.
Hi @alamb, I'm working on it. I think I'll create PRs for each step.

First PR: move to core. After roughly surveying, I found I need to move not only `DynamicFileCatalogProvider` but also some object_store-related things.

Second PR: add option. I haven't surveyed this yet; it seems I can refer to how `information_schema` is implemented.
@alamb Spark SQL syntax works like so:

What do you think?
@edmondop I think this would be a great thing to add as an example / thing to implement as an extension.
I wanted to confirm I understood correctly what options we are picking. It seems to me the following are viable:
We are taking option 1 right now -- is that right, @alamb?
I am not quite sure what to do here, to be honest.
Update here: @goldmedal made a PR for this issue: #10745. However, the initial PR brings many dependencies (like AWS crates) into DataFusion core, which is likely not great. I had some suggestions on how we could split up the code to keep the dynamic file provider in the core while keeping AWS etc. out: #10745 (comment)
Completed in #10986 |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Similarly to #4580, I think systems built with DataFusion would like to allow their users to quickly explore a parquet file with minimal typing. Today they have to type a verbose `CREATE EXTERNAL TABLE ...` command.

It is critical that this feature can be enabled/disabled so that DataFusion can provide read-only access (rather than access to the file system, as that would be a security hole).

I am marking this as a good first issue because I think all the code needed exists and there is a solution sketch below -- it should be a matter of coding that doesn't require huge existing knowledge of the DataFusion codebase and would be a good exercise in getting familiar.
@unconsolable added this ability into datafusion-cli as part of #4838 (❤️).

Describe the solution you'd like

I would like to be able to select directly from files (parquet, or other) from any DataFusion session context, controlled by a setting. For example, `select * from 'data.parquet'`.
Suggested Solution Sketch

1. Add a new config setting `files_as_tables`, similar to `information_schema`: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/common/src/config.rs#L167-L169
2. Add code to make a `ListingTable` in `resolve_table_ref`: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/src/execution/context.rs#L1551-L1560 (follow the model in https://github.com/apache/arrow-datafusion/pull/4838/files#diff-6353c2268d4d11abf8c1b8804a263db74a3b765a7302fc61caea3924256b52c7R142-R155)
3. Move the implementation from datafusion-cli; remove the provider added in "Support `select .. FROM 'parquet.file'` in datafusion-cli" (#4838) and use the new setting instead: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion-cli/src/main.rs#L100
4. Add slt tests, similar to existing ones (should be able to refer to existing .parquet / .csv files in testing directories): https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/tests/sqllogictests/test_files/information_schema.slt#L46
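The solution sketch above can be illustrated with a small standalone example: if a (hypothetical) `files_as_tables` setting is enabled and the table name ends with a known file extension, resolve it as a file-backed table; otherwise fall back to a normal catalog lookup. This is plain Rust for illustration only, not DataFusion's actual `resolve_table_ref` signature or behavior.

```rust
// Hedged sketch of the gated resolution step: a table reference is only
// treated as a file path when the setting is on and the extension is known.
fn resolve_table_ref(name: &str, files_as_tables: bool) -> String {
    let known_ext = [".parquet", ".csv", ".json"];
    if files_as_tables && known_ext.iter().any(|ext| name.ends_with(ext)) {
        // In DataFusion this branch would build a ListingTable from the path.
        format!("ListingTable({name})")
    } else {
        // Otherwise fall through to the normal catalog/schema lookup.
        format!("catalog lookup: {name}")
    }
}

fn main() {
    // With the setting on, file-like names become file-backed tables;
    // with it off (the read-only default), they are ordinary table names.
    println!("{}", resolve_table_ref("data.parquet", true));
    println!("{}", resolve_table_ref("data.parquet", false));
    assert_eq!(
        resolve_table_ref("data.parquet", true),
        "ListingTable(data.parquet)"
    );
    assert_eq!(
        resolve_table_ref("data.parquet", false),
        "catalog lookup: data.parquet"
    );
}
```

Keeping the check behind a single boolean mirrors the security requirement stated above: the file-system path is only reachable when explicitly enabled.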
Additional context
Here is how information schema works, for reference.