compute mode on grouped DataFrame #452

kylewhite21 · 2022-12-16T21:25:51Z

I have much more experience with R and Python and love the "dataframe" support in Elixir.

I'm trying to compute the mode in a grouped DataFrame and hit an error when calling frequencies/1 on a LazySeries. (I also can't call to_list/1 to compute the mode in native Elixir.) A minimal working example is below.

I have some ideas for workarounds outside of Explorer, but don't want to re-implement the grouping. Or perhaps there is a way to do it within Explorer currently.

Any ideas?

alias Explorer.DataFrame, as: DF
alias Explorer.Series, as: S
require Explorer.DataFrame

defmodule Stats do
  # Most frequent value of an eager series; ties go to the smallest value.
  def mode(series) do
    series
    |> S.frequencies()
    |> DF.arrange(desc: counts, asc: values)
    |> DF.head(1)
    |> DF.to_rows()
    |> Enum.at(0)
    |> Map.get("values")
  end
end

df = DF.new(gp: ["a", "a", "b", "b", "a", "b", "b"], val: [1, 2, 3, 4, 1, 4, 1])
df_group = DF.group_by(df, "gp")

# mode on series works
df["val"] |> Stats.mode()

# summarise works here
#   #Explorer.DataFrame<
#     Polars[2 x 3]
#     gp string ["a", "b"]
#     min integer [1, 1]
#     max integer [2, 4]
#   >
DF.summarise_with(df_group, fn x ->
  min = S.min(x["val"])
  max = S.max(x["val"])
  [min: min, max: max]
end)

# fails
#   ** (RuntimeError) cannot perform frequencies/1 operation on Explorer.Backend.LazySeries.
#   Query operations work on lazy series and those support only a subset of series operations
DF.summarise_with(df_group, fn x ->
  min = S.min(x["val"])
  max = S.max(x["val"])
  mode = Stats.mode(x["val"])
  [min: min, max: max, mode: mode]
end)


josevalim · 2022-12-17T00:16:16Z

Hi @kylewhite21! So inside summarize_with (and friends), we have a lazy series. This is used to build a query so we can aggregate very efficiently.

Because we are building a query, a handful of functions are unavailable, and unfortunately frequencies is one of them. Luckily, I believe we can implement mode ourselves and expose it in Series, so you get this functionality.

You could also summarize out of band:

df
|> DF.summarize(min: min(val), max: max(val))
|> DF.put(:mode, Stats.mode(df["val"]))

cigrainger · 2022-12-17T03:08:43Z

@josevalim is there a reason we're not using this? I just noticed we don't have it for eager either. Putting together a PR now. @kylewhite21 would a mode function solve the problem?

Edit: I think the challenge is that there can be multiple modes or none. I'm investigating how pandas and dplyr handle it, especially with groups.
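To illustrate the "multiple modes or none" point, here is a minimal sketch in plain Elixir, not Explorer code and not what #453 implements. `ModeSketch` and `modes/1` are hypothetical names; the sketch assumes the values are already in an ordinary list and returns every value tied for the highest count, or an empty list when there is no data.

```elixir
defmodule ModeSketch do
  # Hypothetical helper for illustration only; not part of Explorer.
  # Returns all values tied for the highest count, sorted ascending.
  # An empty input has no mode, so it returns an empty list.
  def modes([]), do: []

  def modes(values) do
    freqs = Enum.frequencies(values)
    max_count = freqs |> Map.values() |> Enum.max()

    freqs
    |> Enum.filter(fn {_value, count} -> count == max_count end)
    |> Enum.map(fn {value, _count} -> value end)
    |> Enum.sort()
  end
end

ModeSketch.modes([1, 2, 1])
# => [1]
ModeSketch.modes([3, 4, 4, 1, 3])
# => [3, 4]
ModeSketch.modes([])
# => []
```

Returning a list keeps the result shape the same whether there are zero, one, or several modes, at the cost of the caller having to unwrap the single-mode case.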

kylewhite21 · 2022-12-21T03:09:21Z

@josevalim makes sense. I'd be happy to compute the mode out of band, but I need it to work on the groupings (df_group from my example). I think your example would return the "grand" summary metrics?

@cigrainger a mode function would solve my problem if it works on grouped DataFrames. I saw the comments in #453 and the need for a list. 👍

Thanks for the quick replies!
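For the per-group case discussed above, one possible out-of-band workaround is sketched below in plain Elixir. It is an assumption-laden illustration, not Explorer's API: `GroupedModeSketch` and `mode_by_group/2` are hypothetical names, and the two input lists are assumed to have been pulled out of the eager (ungrouped) DataFrame first, for example with `S.to_list/1`. Ties go to the smallest value, mirroring the `desc: counts, asc: values` ordering in `Stats.mode/1`.

```elixir
defmodule GroupedModeSketch do
  # Hypothetical helper for illustration only; not part of Explorer.
  # Takes parallel lists of group keys and values and returns a map
  # of group key => mode of that group's values.
  def mode_by_group(groups, values) do
    groups
    |> Enum.zip(values)
    |> Enum.group_by(fn {group, _value} -> group end, fn {_group, value} -> value end)
    |> Map.new(fn {group, group_values} -> {group, mode(group_values)} end)
  end

  # Most frequent value; on a tie, the smallest value wins.
  defp mode(values) do
    values
    |> Enum.frequencies()
    |> Enum.min_by(fn {value, count} -> {-count, value} end)
    |> elem(0)
  end
end

# Same data as the example in the issue, as plain lists.
gp = ["a", "a", "b", "b", "a", "b", "b"]
val = [1, 2, 3, 4, 1, 4, 1]

GroupedModeSketch.mode_by_group(gp, val)
# => %{"a" => 1, "b" => 4}
```

This does re-implement the grouping outside Explorer, so it is only a stopgap until a mode function works on grouped DataFrames.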

cigrainger · 2023-11-12T13:52:45Z

Well it took the better part of a year, but #453 is about ready now that #725 is merged! I'll close this once #453 is merged.

cigrainger mentioned this issue Dec 17, 2022

Add mode #453

Merged

cigrainger closed this as completed in #453 Nov 12, 2023
