
Timestamp mapping #95

Open
matasello opened this issue Jul 4, 2024 · 2 comments

Comments

@matasello

Not sure I am doing this right, but I am trying to convert a CSV containing some timestamps to a Parquet file.

Sample CSV

072e4a64-2ffb-437c-9458-4953abaa7a20,1,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,2,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,4,2023-01-18 23:05:10,104,-1,0
  1. First, the schema is generated with the csv2parquet --max-read-records 5 -p option. It correctly infers the timestamp field
    {
      "name": "ts",
      "data_type": {
        "Timestamp": [
          "Second",
          null
        ]
      },
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
  2. Then I do the actual conversion (a rough arrow-rs sketch of the equivalent pipeline is at the end of this comment)

csv2parquet --header false --schema-file mt_status.json /dev/stdin mt_status.parquet

  3. Then I try to open the table using DuckDB, and I can see all the records, but the timestamp field shows as Int64
┌──────────────────────────────────────┬───────┬────────────┬──────────┬────────┬───────────┐
│                 guid                 │  st   │     ts     │ tsmillis │ result │ synthetic │
│               varchar                │ int16 │   int64    │  int16   │ int16  │   int16   │
├──────────────────────────────────────┼───────┼────────────┼──────────┼────────┼───────────┤
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     1 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     2 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     4 │ 1674083110 │      104 │     -1 │         0 │
  4. And the Parquet schema also shows the field as an INT64

│ mt_status.parquet │ ts │ INT64 │ │ REQUIRED │ │ │ │ │ │ │

Any hint?
Thanks
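
For reference, here is a minimal arrow-rs sketch of the pipeline I would expect the conversion above to be equivalent to (this is my assumption only, not csv2parquet's actual code; the field names come from the sample above, the file names are placeholders, and the exact builder method names depend on the arrow/parquet crate versions):

    use std::{fs::File, sync::Arc};

    use arrow::csv::ReaderBuilder;
    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use parquet::arrow::ArrowWriter;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Explicit schema: `ts` is declared as Timestamp(Second, None),
        // matching the schema file generated in step 1.
        let schema = Arc::new(Schema::new(vec![
            Field::new("guid", DataType::Utf8, false),
            Field::new("st", DataType::Int16, false),
            Field::new("ts", DataType::Timestamp(TimeUnit::Second, None), false),
            Field::new("tsmillis", DataType::Int16, false),
            Field::new("result", DataType::Int16, false),
            Field::new("synthetic", DataType::Int16, false),
        ]));

        // Read the header-less CSV with the explicit schema (like --header false).
        let input = File::open("mt_status.csv")?;
        let reader = ReaderBuilder::new(schema.clone())
            .with_header(false)
            .build(input)?;

        // Write every record batch out to Parquet.
        let output = File::create("mt_status.parquet")?;
        let mut writer = ArrowWriter::try_new(output, schema, None)?;
        for batch in reader {
            writer.write(&batch?)?;
        }
        writer.close()?;
        Ok(())
    }

(One detail that may be relevant: Parquet's TIMESTAMP logical type only defines millisecond, microsecond, and nanosecond units, so a second-resolution Arrow timestamp may end up stored as a plain INT64 physical column.)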

@loicalleyne

@domoritz
I've run into the same thing with json2parquet: it uses arrow::json::reader::infer_json_schema_from_seekable.

It doesn't look like arrow-rs's arrow-json collect_field_types_from_object does any kind of timestamp inference at all.
https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/schema.rs#L88

The arrow-rs arrow-json crate has a low-level decoder that seems to have some support for coercing values to timestamps; however, I'm not sure how that would work, or whether adding timestamp detection to schema inference would need to be done in json2parquet or in arrow-rs.
https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/mod.rs
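
As a rough illustration of that decoder path (a sketch based on my reading of the arrow-rs docs, not something json2parquet does today; the sample record is made up): if you hand the JSON reader an explicit schema with a Timestamp field, the decoder appears to coerce the string values itself:

    use std::{io::Cursor, sync::Arc};

    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use arrow::json::ReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Schema inference would have typed this field as Utf8, but with an
        // explicit Timestamp(Second, None) field the decoder parses the strings.
        let schema = Arc::new(Schema::new(vec![Field::new(
            "ts",
            DataType::Timestamp(TimeUnit::Second, None),
            true,
        )]));

        let data = r#"{"ts": "2023-01-18 23:05:10"}"#;

        let mut reader = ReaderBuilder::new(schema).build(Cursor::new(data))?;
        let batch = reader.next().unwrap()?;
        println!("{:?}", batch.column(0)); // prints a TimestampSecondArray
        Ok(())
    }

So the coercion seems to sit in the decoder rather than in schema inference, which is why inference alone still reports a string type.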

@domoritz
Owner

Hmm, thanks for looking into this. I won't have time to look into this deeply anytime soon but I'd be more than happy to review a pull request.
