
Parallelize dython.nominal.associations #117

Closed
KonradHoeffner opened this issue Feb 19, 2022 · 10 comments · Fixed by #132
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone

Comments

@KonradHoeffner

I am trying out Dython for the first time and, as shown on the homepage, I am using:

from dython.nominal import associations
associations(data)

However, this has been running for several minutes now (`data` is around 300×1700), and htop shows that it only uses one of my 10/20 (physical/hyperthreaded) CPU cores. Would it be possible to parallelize this to increase the speed?

@KonradHoeffner KonradHoeffner added the enhancement New feature or request label Feb 19, 2022
@shakedzy
Owner

Hey @KonradHoeffner - parallelizing would be a great idea, and contributions are more than welcome!

@KonradHoeffner
Author

KonradHoeffner commented Feb 22, 2022

@shakedzy: Thanks for the invitation! Unfortunately, I have zero experience with both parallelism and library development in Python.

I still cloned the repository to see whether there is an easy change to make, and installed both the normal and the dev dependencies. However, the tests failed because dython itself was not installed. I then installed dython via pip, but I am not sure that works as intended: wouldn't the tests then run against the version installed from pip rather than the local one?
It would help to have some more documentation in README.md or CONTRIBUTING.md on how to set up and test the project as a developer, though maybe that is only needed for people like me who are not experienced with Python library development and probably can't contribute much anyway.

@shakedzy
Owner

Hey @KonradHoeffner - please open a separate bug report and add all of this there, so we don't mix different issues here.

Anyway, the best way to run and test the version you have locally is to clone the repo and then, from within the cloned dython directory, run:

pip install -e .
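For completeness, a typical editable-install workflow might look like the following. Note this is a sketch of a common setup, not the project's documented process: the repo URL is the upstream one, and pytest as the test runner is an assumption - check the repo for the actual instructions.

```shell
# clone the repo (or your fork) and enter it
git clone https://github.com/shakedzy/dython.git
cd dython

# editable install: `import dython` now resolves to this checkout,
# so the tests exercise your local changes, not a pip-released copy
pip install -e .

# run the test suite (assuming pytest; check the repo's dev docs)
python -m pytest
```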

@shakedzy shakedzy added the help wanted Extra attention is needed label Feb 22, 2022
@KonradHoeffner
Author

OK, I moved the bug report from here into two new issues #119 and #118.

@ThomasBury

ThomasBury commented Sep 2, 2022

I worked on this for one of my professional projects. I broke the association function down into smaller functions and parallelized them.
It is currently a bit less flexible (no option yet for user-defined functions, but that should be easy to fix), but it allows weighting all the association/correlation measures. On my machine (an old 8-core laptop), it ran 3x faster than the original function.
[benchmark results image]

Would you be interested in another PR, @shakedzy? My approach is different from @mahieyin-rahmun's:

  • the component functions (Theil's U, etc.) are the ones being parallelized (parallelization is not done inside the associations function)
  • weighted versions of the measures (these can be made vanilla, without weighting the observations)
  • using pandas throughout (pandas is fast ^^); the NA-handling strategies based on lists are a huge bottleneck
  • the overall code is closer to PEP standards and easier to read (flat is better than nested)
  • plotting the association matrix would be done by a separate function (following PEP guidance, smaller functions are preferred)
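For illustration, here is a minimal sketch of the first point - parallelizing one component function (Theil's U, implemented from scratch below) over column pairs instead of parallelizing the associations function itself. The function names and executor choice are mine, not the actual PR code; a thread pool is used only to keep the sketch self-contained and portable, while a process pool (e.g. joblib/loky) would be needed for real speedups on CPU-bound pure-Python work.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def entropy(values):
    """Shannon entropy (natural log) of a sequence of labels."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def theils_u(x, y):
    """Uncertainty coefficient U(x|y): share of H(x) explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(yv, []).append(xv)
    n = len(x)
    h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_x - h_x_given_y) / h_x


def parallel_theils_u(df, max_workers=4):
    """All pairwise U(col_a | col_b) values, evaluated on a worker pool."""
    cols = list(df.columns)
    pairs = [(a, b) for a in cols for b in cols if a != b]
    result = pd.DataFrame(1.0, index=cols, columns=cols)
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        us = ex.map(
            lambda p: theils_u(df[p[0]].tolist(), df[p[1]].tolist()), pairs)
        for (a, b), u in zip(pairs, us):
            result.loc[a, b] = u
    return result
```

Since each pair is independent, the work partitions cleanly; whether threads, processes, or vectorized pandas wins depends on where the time actually goes, which is what the benchmarks in this thread are measuring.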

@shakedzy
Owner

shakedzy commented Sep 3, 2022

Hey @ThomasBury - first of all, thanks for your contribution :) I'm wondering whether @mahieyin-rahmun's thumbs-up means he's into your code too :)
I think adding your parallelization of the component functions is a good idea no matter what, as these are often used directly rather than through associations.
If there's an easy fix for allowing user-defined functions, why not give it a shot? Perhaps we can test both your version and @mahieyin-rahmun's version to see which one works best. What do you think?
If you do decide to open a PR, please mind the comments I gave @mahieyin-rahmun regarding having only one associations method and enabling multiprocessing via a flag.
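To make the flag idea concrete, here is a toy sketch of a single entry point that switches between serial and parallel execution. Pearson's r stands in for the real mixed nominal/numerical measures; the parameter names `multiprocessing` and `max_cpu_cores` match those used in the benchmark script later in this thread, but everything else (function names, the thread-pool executor) is my own illustration, not dython's API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_measure(df, a, b):
    """Stand-in for the real per-pair measure dispatch (Pearson's r here)."""
    return a, b, float(df[a].corr(df[b]))


def associations_sketch(df, multiprocessing=False, max_cpu_cores=None):
    """One method; the flag changes only *how* the pairs are evaluated."""
    cols = list(df.columns)
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    pairs = list(combinations(cols, 2))
    if multiprocessing:
        # a thread pool keeps this sketch portable; a process pool would be
        # the realistic choice for CPU-bound pure-Python measures
        with ThreadPoolExecutor(max_workers=max_cpu_cores) as ex:
            results = list(ex.map(lambda p: pairwise_measure(df, *p), pairs))
    else:
        results = [pairwise_measure(df, a, b) for a, b in pairs]
    for a, b, v in results:
        out.loc[a, b] = out.loc[b, a] = v
    return out
```

Both paths return an identical matrix, so callers never branch on the flag themselves - which is exactly the "only one associations method" constraint.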

@mahieyin-rahmun
Contributor

mahieyin-rahmun commented Sep 4, 2022

Yeah, I think it would be interesting to see both versions, and if a combination of both versions (multiprocessing on top of parallelized component functions) can give an even better performance gain.

@shakedzy shakedzy added this to the Version 0.7.2 milestone Sep 8, 2022
@ThomasBury

Hey, I'm late to the ball. I'm currently refactoring the code whenever my other duties allow. I'll try to benchmark it the same way. The other features are also working fine. I'll provide an example as an illustration when I open the PR (beginning of next week, normally). That being said, great job 👍

@KonradHoeffner
Author

@ThomasBury: I adapted the benchmark code from @mahieyin-rahmun to the renamed parameters and a generic core count:

from dython.nominal import associations
import numpy as np
import pandas as pd
import time
import random
import matplotlib.pyplot as plt

random.seed(42)
np.random.seed(42)

MAX_CORES = 16
RUNS = 5
NUM_COLS = 86

def get_random_categorical_array():
    categorical_data_array = [
        ["A", "B", "C"],
        ["W", "X", "Y", "Z"],
        ["Yes", "No"],
        ["M", "N", "O", "P"]
    ]
    return random.choice(categorical_data_array)


def benchmark_function(func, dataframe, **kwargs):
    times = []

    for run_count in range(RUNS):
        print(f"Run: {run_count}")
        start = time.time()
        results = func(dataframe, compute_only=True, **kwargs)
        end = time.time()
        print(f"Time taken: {(end - start):.2f} seconds")
        times.append(end - start)

    print(
        f"Average time taken by {func}: {(sum(times) / len(times)):.2f} seconds")

    return times


def run_benchmark():
    raw_data = np.random.randn(32001, NUM_COLS)

    # use the first row/column of the random matrix as (numeric) labels
    dataframe = pd.DataFrame(
        data=raw_data[1:, 1:],
        index=raw_data[1:, 0],
        columns=raw_data[0, 1:]
    )

    # 75% of the columns are nominal
    for index in range(int(NUM_COLS * 0.75)):
        dataframe.iloc[:, index] = np.random.choice(
            get_random_categorical_array(), len(dataframe))

    times_vanilla = benchmark_function(
        associations, dataframe, multiprocessing=False)
    times_cores = dict()
    for i in range(2, MAX_CORES + 1, 2):
        times_cores[i] = benchmark_function(
            associations, dataframe, multiprocessing=True, max_cpu_cores=i)

    # plot the results
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), dpi=150)
    axes[0].plot(times_vanilla, label="Vanilla")
    for i in range(2, MAX_CORES + 1, 2):
        axes[0].plot(times_cores[i], label=f"{i} cores")
    axes[0].set_title(f"Time taken by associations function ({RUNS} runs)")
    axes[0].set_xlabel("Run count")
    axes[0].set_ylabel("Time taken (seconds)")
    axes[0].set_xticks(np.arange(len(times_vanilla)),
                       np.arange(1, len(times_vanilla) + 1))
    axes[0].legend()

    bar_chart_data = {
        "Vanilla": np.mean(times_vanilla),
    }
    for i in range(2, MAX_CORES + 1, 2):
        bar_chart_data[f"{i} cores"] = np.mean(times_cores[i])
    axes[1].bar(bar_chart_data.keys(), bar_chart_data.values(), width=0.5)
    axes[1].set_title(
        f"Average time taken by associations function (across {RUNS} runs)")
    axes[1].set_ylabel("Time taken (seconds)")
    plt.tight_layout()
    # save before show(): show() may clear the figure, leaving an empty file
    plt.savefig('benchmark.svg', bbox_inches='tight')
    plt.show()

if __name__ == "__main__":
    run_benchmark()

@shakedzy
Owner

@ThomasBury & @KonradHoeffner - no worries and no pressure :) whenever you have the time, simply open a new PR.
