Combine datasets in Concise data frame #102

Closed
JoranAngevaare opened this issue Jul 31, 2023 · 0 comments · Fixed by #129

JoranAngevaare commented Jul 31, 2023

Concise data frame, combine extended dataframe

Currently a lot of bookkeeping is done per dataset, per mask, and so forth. A simpler and much more readable alternative is to concatenate the datasets for similar attributes.

Here is an MWE:

# Only pandas is needed for this MWE
import pandas as pd

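class ConciseDataFrame:
    """Merge rows that differ only in the ``group`` columns."""

    # Separator used when several values end up in one cell
    delimiter = ', '

    def __init__(self, df, group=None):
        self.df = df
        # Columns whose values may differ between rows that are merged
        self.group = group or ('method', 'cluster', 'variant_label', 'cluster_i', 'variable_id', 'version', 'figure')

    def concise(self):
        """Return a new dataframe with one row per set of matching rows."""
        rows = self.df.to_dict(orient='records')
        matched_rows = self.match_rows(rows)
        combined_rows = [self.combine_rows(r, self.delimiter) for r in matched_rows]
        df_ret = pd.DataFrame(combined_rows)
        return self.rename_s(df_ret)

    def rename_s(self, df):
        """Append '(s)' to the group columns, since they may now hold several values."""
        rename_dict = {k: f'{k}(s)' for k in self.group}
        return df.rename(columns=rename_dict)

    @staticmethod
    def combine_rows(rows, delimiter):
        """Collapse a list of matching rows into a single row."""
        ret = {}
        for k in rows[0].keys():
            # Unique values in this column; they must be mutually sortable
            val = sorted(set(r[k] for r in rows))
            if len(val) == 1:
                ret[k] = val[0]
            else:
                ret[k] = delimiter.join(str(v) for v in val)
        return ret

    def match_rows(self, rows):
        """Group rows that are equal in every column outside ``self.group``."""
        groups = []
        for row in rows:
            # Skip rows that were already assigned to a group
            if any(row in g for g in groups):
                continue
            groups.append([row])
            for other_row in rows:
                # Exact duplicates add nothing to the combined row
                if row == other_row:
                    continue
                for k, v in row.items():
                    if k in self.group:
                        continue
                    if other_row.get(k) != v:
                        break
                else:
                    # No mismatch outside the group columns: same group
                    groups[-1].append(other_row)
        return groups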
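
The `match_rows` step above compares every row against every other row. As a possible alternative, here is a minimal sketch that buckets rows by the values of their non-group columns in a single pass. The function name `match_rows_by_key` and the example rows are hypothetical and not part of the MWE. It assumes all cell values are hashable and contain no NaN, because NaN never compares equal in the MWE but may hash into the same bucket here.

```python
def match_rows_by_key(rows, group):
    """Bucket rows that agree on every column outside `group` (hypothetical helper)."""
    buckets = {}
    for row in rows:
        # Key on the (column, value) pairs of all non-group columns
        key = frozenset((k, v) for k, v in row.items() if k not in group)
        bucket = buckets.setdefault(key, [])
        # Skip exact duplicates, as the MWE's match_rows does
        if row not in bucket:
            bucket.append(row)
    # Dicts keep insertion order, so groups appear in order of first occurrence
    return list(buckets.values())


# Made-up rows, for illustration only
rows = [
    {'method': 'a', 'variable_id': 'tas', 'region': 'north'},
    {'method': 'b', 'variable_id': 'tas', 'region': 'north'},
    {'method': 'a', 'variable_id': 'pr', 'region': 'south'},
]
groups = match_rows_by_key(rows, group=('method', 'variable_id'))
```

Under these assumptions, the first two rows land in one group because they only differ in `method`, and the third row forms a group of its own. The returned list of groups has the same shape as the output of `match_rows`, so it could be passed to `combine_rows` unchanged.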