Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add casting from arbitrary arrow types #5016

Closed
Tracked by #4460
alamb opened this issue Jan 21, 2023 · 14 comments · Fixed by #5166
Closed
Tracked by #4460

Add casting from arbitrary arrow types #5016

alamb opened this issue Jan 21, 2023 · 14 comments · Fixed by #5166
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Jan 21, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While porting some tests to sqllogictest I want t ensure the same coverage for specific types.

For example

https://github.com/apache/arrow-datafusion/blob/350cb47289a76e579b221fe374e4cf09db332569/datafusion/core/tests/sql/mod.rs#L1233

It is easy to cast to some SQL datatype:

-- casts `x` to `bigint` which maps to DataType::Int64
select x::bigint from foo;
-- casts `x` to timestamp (DatType::Timestamp)
select x::timestamp from foo;

However, it is not possible to use SQL to name certain arrow types (e.g TimestampMicrosecondasTIMESTAMP` maps to TimestampNanos

I would like a way to convert an expression to an arbitrary arrow type.

Describe the solution you'd like
I would love to do something like

arrow_cast(source, target_type)

Where target_type is a string that describes an arrow tyoe

For example:

--  casts x to a Int8 (which I don't think is possible in sql)
select arrow_cast(x, 'Int8') from foo;
--  casts x to a LargeBinary
select arrow_cast(x, 'LargeBinary') from foo;

I would like the values accepted as 'target_datatype' to t same as returned by the arrow_typeof() function that goes the other direction (expression to string that represents Arrow type)

❯ select arrow_typeof('5');
+------------------------+
| arrowtypeof(Utf8("5")) |
+------------------------+
| Utf8                   |
+------------------------+
1 row in set. Query took 0.028 seconds.
❯ select arrow_typeof(5);
+-----------------------+
| arrowtypeof(Int64(5)) |
+-----------------------+
| Int64                 |
+-----------------------+

Describe alternatives you've considered

It would be nice to have some way to specify Arrow types in SQL datatype syntax. Perhaps like CAST X as CUSTOM TYPE 'Int64'. This would allow using functions such astry_cast, as well as creating tables with specific column types.

However, I am not really sure how to do this in the parser

Additional context
See #4460 for more details

@alamb alamb added the enhancement New feature or request label Jan 21, 2023
@alamb alamb mentioned this issue Jan 21, 2023
28 tasks
@alamb alamb added the good first issue Good for newcomers label Jan 21, 2023
@alamb
Copy link
Contributor Author

alamb commented Jan 21, 2023

I think this is apretty good first issue as it is well specified and you can follow the model of arrow_typeof to add a new function and there is an arrow cast kernel that does all the data type conversions

@comphead
Copy link
Contributor

@alamb I'd like taking this

@alamb
Copy link
Contributor Author

alamb commented Jan 23, 2023

Thanks @comphead !

@melgenek
Copy link
Contributor

It would be also interesting to support "custom" types in CREATE TABLE statements. This would help to migrate tests like this one https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/aggregates.rs#L913-L976 to sqllogictest.

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2023

It would be also interesting to support "custom" types in CREATE TABLE statements. This would help to migrate tests like this one https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/aggregates.rs#L913-L976 to sqllogictest.

I was thinking we could use CREATE TABLE ... AS SELECT ... which creates columns based on the types of the SELECT query

So I think we could make a int8 column in a table like

CREATE TABLE foo as select arrow_cast(c', 'Int8') as c_int8 from foo

@comphead
Copy link
Contributor

@alamb in parser we operate with expressions which works with arrow types.
I understand the necessity of having arrow_typeof which seems to be a synonym for PG pg_typeof function.

But for the potential arrow_cast function, we will return Expression as well with Arrow datatype inside, how it differs from usual cast? Perhaps you have the analogue in PG? I went through https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-cast/ but didn't find anything similar

@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2023

But for the potential arrow_cast function, we will return Expression as well with Arrow datatype inside, how it differs from usual cast?

It differs because normal casting CAST(x as target_type) or x::target_type can only cast to SQL types (not Arrow types). DataFusion has a mapping from SQL Type --> Arrow type but there are Arrow types without any corresponding SQL type.

Perhaps you have the analogue in PG? I went through https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-cast/ but didn't find anything similar

I don't think there is any analog in postgres; The difference is that you can refer to all types in postgres directly using SQL. Since posgres types are sql types there is no analog to this ticket's proposal in SQL

@comphead
Copy link
Contributor

@alamb please correct me if I'm wrong so

select arrow_cast(x, 'Int8') from foo;

will have the same output as

select x::tinyint from foo;

@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2023

@alamb please correct me if I'm wrong so

@comphead you are correct. Here is an example I don't think you can write with SQL:

select arrow_cast(x, 'TimestampMillisecond')

because

select x::timestamp

defaults to TimestampNanosecond

@alamb
Copy link
Contributor Author

alamb commented Feb 2, 2023

@comphead I wonder if you plan to work on this issue soon? It would be very helpful for writing sqllogictests for dictionary arrays (e.g. #5151). I may take a shot at writing it up this weekend if you haven't had a chance

@comphead
Copy link
Contributor

comphead commented Feb 2, 2023

@alamb sorry, haven't started it yet. made couple of drafts, but thats it. If this is a priority thing, please feel free to take this on

@alamb
Copy link
Contributor Author

alamb commented Feb 2, 2023

@alamb sorry, haven't started it yet. made couple of drafts, but thats it. If this is a priority thing, please feel free to take this on

Thanks @comphead -- I will post on this ticket if/when I start work

@alamb
Copy link
Contributor Author

alamb commented Feb 2, 2023

I am starting to hack on this

@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2023

#5166 is now ready for review (I really want this function so I can more easily write / port tests and reproduce bugs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants