compute mode on grouped DataFrame #452

Closed
kylewhite21 opened this issue Dec 16, 2022 · 4 comments · Fixed by #453

Comments

@kylewhite21

I have much more experience with R and Python and love the "dataframe" support in Elixir.

I'm trying to compute the mode in a grouped DataFrame and hit an error when calling frequencies/1 on a LazySeries. (I also can't call to_list/1 to compute the mode in native Elixir.) A minimal working example is below.

I have some ideas for workarounds outside of Explorer, but don't want to re-implement the grouping. Or perhaps there is a way to do it within Explorer currently.

Any ideas?

alias Explorer.DataFrame, as: DF
alias Explorer.Series, as: S
require Explorer.DataFrame

defmodule Stats do
  # Returns the most frequent value; ties are broken by the smallest value.
  def mode(series) do
    series
    |> S.frequencies()
    |> DF.arrange(desc: counts, asc: values)
    |> DF.head(1)
    |> DF.to_rows()
    |> Enum.at(0)
    |> Map.get("values")
  end
end

df = DF.new(gp: ["a", "a", "b", "b", "a", "b", "b"], val: [1, 2, 3, 4, 1, 4, 1])
df_group = DF.group_by(df, "gp")

# mode on series works
df["val"] |> Stats.mode()

# summarise works here
#   #Explorer.DataFrame<
#     Polars[2 x 3]
#     gp string ["a", "b"]
#     min integer [1, 1]
#     max integer [2, 4]
#   >
DF.summarise_with(df_group, fn x ->
  min = S.min(x["val"])
  max = S.max(x["val"])
  [min: min, max: max]
end)

# fails
#   ** (RuntimeError) cannot perform frequencies/1 operation on Explorer.Backend.LazySeries.
#   Query operations work on lazy series and those support only a subset of series operations
DF.summarise_with(df_group, fn x ->
  min = S.min(x["val"])
  max = S.max(x["val"])
  mode = Stats.mode(x["val"])
  [min: min, max: max, mode: mode]
end)
@josevalim
Member

Hi @kylewhite21! So inside summarize_with (and friends), we have a lazy series. This is used to build a query so we can aggregate very efficiently.

Because we are building a query, a handful of functions are not available, and unfortunately that includes frequencies. Luckily, I believe we can implement mode ourselves and expose it in Series, so you get this functionality.

You could also summarize out of band:

df
|> DF.summarize(min: min(val), max: max(val))
|> DF.put(:mode, Stats.mode(df["val"]))

@cigrainger
Member

cigrainger commented Dec 17, 2022

@josevalim is there a reason we're not using this? I just noticed we don't have it for eager either. Putting together a PR now. @kylewhite21 would a mode function solve the problem?

Edit: I think the challenge is that there can be multiple modes or none. I'm investigating how pandas and dplyr handle it, especially with groups.

@cigrainger cigrainger mentioned this issue Dec 17, 2022
@kylewhite21
Author

@josevalim makes sense. I'd be happy to compute the mode out of band, but I need it to work on the groupings (df_group from my example). I think your example would return the "grand" summary metrics?

@cigrainger a mode function would solve my problem if it works on grouped DataFrames. I saw the comments in #453 and the need for a list. 👍

Thanks for the quick replies!
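
For anyone needing per-group modes before a Series-level mode lands, here is a minimal sketch of the out-of-band route in plain Elixir. It assumes the group and value columns have already been pulled out of an eager DataFrame as two parallel lists (for example with S.to_list/1), and it returns a list per group because there can be several modes or none. The GroupedMode module and its function names are hypothetical, not part of Explorer.

```elixir
defmodule GroupedMode do
  # Hypothetical helper module for illustration; it is not part of Explorer.

  # Returns every value tied for the highest count, sorted.
  # An empty input has no mode, so the result is [].
  def modes([]), do: []

  def modes(values) do
    freqs = Enum.frequencies(values)
    top = freqs |> Map.values() |> Enum.max()

    freqs
    |> Enum.filter(fn {_value, count} -> count == top end)
    |> Enum.map(fn {value, _count} -> value end)
    |> Enum.sort()
  end

  # Groups `values` by the parallel list `groups`, then computes the modes
  # of each group.
  def by_group(groups, values) do
    groups
    |> Enum.zip(values)
    |> Enum.group_by(fn {g, _v} -> g end, fn {_g, v} -> v end)
    |> Map.new(fn {g, vals} -> {g, modes(vals)} end)
  end
end

# Same data as the example in this issue. With a real DataFrame the lists
# could come from S.to_list(df["gp"]) and S.to_list(df["val"]).
gp = ["a", "a", "b", "b", "a", "b", "b"]
val = [1, 2, 3, 4, 1, 4, 1]

IO.inspect(GroupedMode.by_group(gp, val))
# prints %{"a" => [1], "b" => [4]}
```

A tie such as GroupedMode.modes([1, 1, 2, 2]) gives [1, 2], which is why a list is the natural return type here. This does re-implement the grouping outside Explorer, so it only suits data small enough to materialize as lists.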

@cigrainger
Member

Well it took the better part of a year, but #453 is about ready now that #725 is merged! I'll close this once #453 is merged.
