
Timestamp mapping #95

Open
matasello opened this issue Jul 4, 2024 · 2 comments

Comments

@matasello

Not sure I am doing this right, but I am trying to convert a CSV containing some timestamps to a Parquet file.

Sample CSV

072e4a64-2ffb-437c-9458-4953abaa7a20,1,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,2,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,4,2023-01-18 23:05:10,104,-1,0
  1. First, the schema is generated with the csv2parquet --max-read-records 5 -p option. It correctly infers the timestamp field
    {
      "name": "ts",
      "data_type": {
        "Timestamp": [
          "Second",
          null
        ]
      },
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
  2. Then I do the actual conversion (a rough arrow-rs sketch of the equivalent pipeline is at the end of this comment)

csv2parquet --header false --schema-file mt_status.json /dev/stdin mt_status.parquet

  3. Then I try to open the table using DuckDB, and I can see all the records, but the timestamp field shows as Int64
┌──────────────────────────────────────┬───────┬────────────┬──────────┬────────┬───────────┐
│                 guid                 │  st   │     ts     │ tsmillis │ result │ synthetic │
│               varchar                │ int16 │   int64    │  int16   │ int16  │   int16   │
├──────────────────────────────────────┼───────┼────────────┼──────────┼────────┼───────────┤
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     1 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     2 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     4 │ 1674083110 │      104 │     -1 │         0 │
  4. And the Parquet schema also shows the field as an INT64

│ mt_status.parquet │ ts │ INT64 │ │ REQUIRED │ │ │ │ │ │ │

Any hint?
Thanks
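
For reference, here is a minimal arrow-rs sketch of the pipeline I would expect the conversion above to be equivalent to (this is my assumption only, not csv2parquet's actual code; the field names come from the sample above, the file names are placeholders, and the exact builder method names depend on the arrow/parquet crate versions):

    use std::{fs::File, sync::Arc};

    use arrow::csv::ReaderBuilder;
    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use parquet::arrow::ArrowWriter;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Explicit schema: `ts` is declared as Timestamp(Second, None),
        // matching the schema file generated in step 1.
        let schema = Arc::new(Schema::new(vec![
            Field::new("guid", DataType::Utf8, false),
            Field::new("st", DataType::Int16, false),
            Field::new("ts", DataType::Timestamp(TimeUnit::Second, None), false),
            Field::new("tsmillis", DataType::Int16, false),
            Field::new("result", DataType::Int16, false),
            Field::new("synthetic", DataType::Int16, false),
        ]));

        // Read the header-less CSV with the explicit schema (like --header false).
        let input = File::open("mt_status.csv")?;
        let reader = ReaderBuilder::new(schema.clone())
            .with_header(false)
            .build(input)?;

        // Write every record batch out to Parquet.
        let output = File::create("mt_status.parquet")?;
        let mut writer = ArrowWriter::try_new(output, schema, None)?;
        for batch in reader {
            writer.write(&batch?)?;
        }
        writer.close()?;
        Ok(())
    }

(One detail that may be relevant: Parquet's TIMESTAMP logical type only defines millisecond, microsecond, and nanosecond units, so a second-resolution Arrow timestamp may end up stored as a plain INT64 physical column.)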

@loicalleyne

@domoritz
I've run into the same thing with json2parquet: it uses arrow::json::reader::infer_json_schema_from_seekable.

It doesn't look like arrow-rs's arrow-json collect_field_types_from_object does any kind of timestamp inference at all.
https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/schema.rs#L88

The arrow-rs arrow-json crate has a low-level decoder that seems to have some support for coercing values to timestamps; however, I'm not sure how that would work, or whether adding timestamp detection to schema inference would need to be done in json2parquet or in arrow-rs.
https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/mod.rs
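
As a rough illustration of that decoder path (a sketch based on my reading of the arrow-rs docs, not something json2parquet does today; the sample record is made up): if you hand the JSON reader an explicit schema with a Timestamp field, the decoder appears to coerce the string values itself:

    use std::{io::Cursor, sync::Arc};

    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use arrow::json::ReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Schema inference would have typed this field as Utf8, but with an
        // explicit Timestamp(Second, None) field the decoder parses the strings.
        let schema = Arc::new(Schema::new(vec![Field::new(
            "ts",
            DataType::Timestamp(TimeUnit::Second, None),
            true,
        )]));

        let data = r#"{"ts": "2023-01-18 23:05:10"}"#;

        let mut reader = ReaderBuilder::new(schema).build(Cursor::new(data))?;
        let batch = reader.next().unwrap()?;
        println!("{:?}", batch.column(0)); // prints a TimestampSecondArray
        Ok(())
    }

So the coercion seems to sit in the decoder rather than in schema inference, which is why inference alone still reports a string type.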

@domoritz
Owner

Hmm, thanks for looking into this. I won't have time to look into this deeply anytime soon but I'd be more than happy to review a pull request.
