align describe with py polars #819

Merged
merged 2 commits into from
Jan 13, 2024

@lkarthee commented Jan 13, 2024
Member

Use a nil expression to generate output similar to Py Polars' describe.

require Explorer.DataFrame, as: DF

df =
  DF.new(
    number: [1, 2, nil, 3],
    list: [[], [], [], []],
    null: [nil, nil, nil, nil],
    string: ["a", "b", "c", "nil"],
    date: [~D[2021-01-01], ~D[1999-12-31], nil, ~D[2023-01-01]],
    time: [~T[00:02:03.000212], ~T[00:05:04.000456], ~T[00:07:04.000776], nil],
    datetime: [nil, ~N[2021-01-01 00:00:00], ~N[1999-12-31 00:00:00], ~N[2023-12-13 17:38:00]],
    duration: [nil, ~N[2020-01-01 00:00:00], ~N[1999-11-30 00:00:00], ~N[2023-12-12 17:38:00]]
  )

df = DF.mutate(df, duration: datetime - duration)
DF.print(df)
+------------------------------------------------------------------------------------------------------------------------+
|                                       Explorer DataFrame: [rows: 4, columns: 8]                                        |
+--------+--------------+--------+----------+------------+-----------------+----------------------------+----------------+
| number |     list     |  null  |  string  |    date    |      time       |          datetime          |    duration    |
| <s64>  | <list[null]> | <null> | <string> |   <date>   |     <time>      |       <datetime[μs]>       | <duration[μs]> |
+========+==============+========+==========+============+=================+============================+================+
| 1      |              |        | a        | 2021-01-01 | 00:02:03.000212 |                            |                |
+--------+--------------+--------+----------+------------+-----------------+----------------------------+----------------+
| 2      |              |        | b        | 1999-12-31 | 00:05:04.000456 | 2021-01-01 00:00:00.000000 | 366d           |
+--------+--------------+--------+----------+------------+-----------------+----------------------------+----------------+
|        |              |        | c        |            | 00:07:04.000776 | 1999-12-31 00:00:00.000000 | 31d            |
+--------+--------------+--------+----------+------------+-----------------+----------------------------+----------------+
| 3      |              |        | nil      | 2023-01-01 |                 | 2023-12-13 17:38:00.000000 | 1d             |
+--------+--------------+--------+----------+------------+-----------------+----------------------------+----------------+


df |> DF.describe() |> DF.print(limit: 9)

+-------------------------------------------------------------------------------------------------------------------+
|                                     Explorer DataFrame: [rows: 9, columns: 9]                                     |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| describe  | number |   list   |   null   |  string  |   date   |   time   |          datetime          | duration |
| <string>  | <f64>  | <string> | <string> | <string> | <string> | <string> |          <string>          | <string> |
+===========+========+==========+==========+==========+==========+==========+============================+==========+
| count     | 3.0    | 4        | 0        | 4        | 3        | 3        | 3                          | 3        |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| nil_count | 1.0    | 0        | 4        | 0        | 1        | 1        | 1                          | 1        |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| mean      | 2.0    |          |          |          |          |          |                            |          |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| std       | 1.0    |          |          |          |          |          |                            |          |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| min       | 1.0    |          |          |          |          |          | 1999-12-31 00:00:00.000000 | 1d       |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| 25%       | 2.0    |          |          |          |          |          |                            |          |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| 50%       | 2.0    |          |          |          |          |          |                            |          |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| 75%       | 3.0    |          |          |          |          |          |                            |          |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+
| max       | 3.0    |          |          |          |          |          | 2023-12-13 17:38:00.000000 | 366d     |
+-----------+--------+----------+----------+----------+----------+----------+----------------------------+----------+

@josevalim
Member

One last question: should we try to preserve the type for all columns? For example, could a time column stay a time column? Or is that not possible?

@lkarthee
Member Author

I am afraid that is not possible in the vertical format, as count and nil_count are {:s, 64}, whereas min and max will have the type of the column. Can we try horizontal instead of vertical, i.e. the transpose of the current table? We add dtype as a column and only max and min will be strings.
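To make the type point concrete, here is a minimal sketch of a horizontal summary, not the implementation in this PR. It assumes Explorer can be fetched with Mix.install (the version requirement is a guess) and uses DF.summarise with hand-picked aggregates. The output column names (number_count and so on) are made up for illustration. Because every aggregate gets its own column, each one can keep its own dtype instead of being cast to string.

```elixir
# Assumption: Explorer is available via Mix.install; the version is a guess.
Mix.install([{:explorer, "~> 0.8"}])

require Explorer.DataFrame, as: DF

df =
  DF.new(
    number: [1, 2, nil, 3],
    datetime: [nil, ~N[2021-01-01 00:00:00], ~N[1999-12-31 00:00:00], ~N[2023-12-13 17:38:00]]
  )

# One row, one column per aggregate (hypothetical column names).
# Counts stay integers and min/max stay datetimes, so nothing is cast to string.
summary =
  DF.summarise(df,
    number_count: count(number),
    number_nil_count: nil_count(number),
    datetime_min: min(datetime),
    datetime_max: max(datetime)
  )

# Inspect the dtype of each aggregate column.
IO.inspect(DF.dtypes(summary), label: "dtypes")
```

In the vertical layout all of these values would share a single column per source column, which is why they end up as strings there.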

@josevalim
Member

Can we try horizontal instead of vertical - i.e. transpose of current table? We add dtype as a column and only max and min will be string.

That's a very good question. But if we transpose it will be pretty much a summarize? So how useful would that be?

I wonder if we should have a function called transpose_by(column) that performs a transposition, using the given string column as the names of the new columns. If we do that, then describe can be the summarize bits (it is horizontal, so we preserve data type information) and, if you want Pandas-style results, you can simply transpose it. Thoughts @billylanchantin @cigrainger?

@josevalim
Member

Anyway, this PR is good to me, so we can move the describe+transpose to a separate PR. :)

@billylanchantin
Member

  1. Yes let's merge this PR as is.
  2. I'm on board with the idea of transposing the result of describe. Though, as noted, even with a transpose most of the columns won't have a consistent dtype (mean, std, min, max and the percentiles are all dtype-specific).

That's a very good question. But if we transpose it will be pretty much a summarize? So how useful would that be?

TBH I view describe as summarise but with pre-selected aggregates. So this is totally fine with me.

I wonder if we should have a function called transpose_by(column) that performs a transposition, using the given string column as the name of the columns.

We might have that already with the pivot_wider/pivot_longer functions. Though I may have misunderstood what that function will do.
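As a rough sketch of how the existing pivot functions could stand in for a transpose_by: the frame below is a hypothetical horizontal summary with one row per source column (the counts are taken from the describe output above), not something describe returns today. It assumes Explorer is fetched with Mix.install (the version requirement is a guess). pivot_longer stacks the aggregate columns into name/value pairs, and pivot_wider then spreads the "column" values out as new column names.

```elixir
# Assumption: Explorer is available via Mix.install; the version is a guess.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.DataFrame, as: DF

# Hypothetical horizontal summary: one row per source column.
horizontal =
  DF.new(
    column: ["number", "string"],
    count: [3, 4],
    nil_count: [1, 0]
  )

# Stack the aggregate columns: one row per (column, aggregate) pair.
# The pivoted columns must share a dtype, which holds here (both are integers).
long =
  DF.pivot_longer(horizontal, ["count", "nil_count"],
    names_to: "describe",
    values_to: "value"
  )

# Spread the "column" values into column names, keeping "describe" as the id column.
vertical = DF.pivot_wider(long, "column", "value")

DF.print(vertical)
```

This only works directly when the pivoted columns share a dtype, which ties back to the point above that min, max and friends are dtype-specific.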

@josevalim merged commit f06700c into elixir-explorer:main Jan 13, 2024
3 checks passed
@josevalim
Member

💚 💙 💜 💛 ❤️

@josevalim
Member

Alright, let's try transposing it then. And I think you are right, transpose_by is one of the pivot functions.

@lkarthee deleted the describe_func_restore branch January 14, 2024 03:54