
Parallelize dython.nominal.associations #117

Closed
KonradHoeffner opened this issue Feb 19, 2022 · 10 comments · Fixed by #132
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone

Comments

@KonradHoeffner

I am trying out Dython for the first time and, as shown on the homepage, I am using:

from dython.nominal import associations
associations(data)

However, this has been running for several minutes now (`data` is around 300×1700), and htop shows that it only uses one of my 10/20 (physical/hyperthreaded) CPU cores. Would it be possible to parallelize this to increase the speed?

@KonradHoeffner KonradHoeffner added the enhancement New feature or request label Feb 19, 2022
@shakedzy
Owner

Hey @KonradHoeffner - parallelizing would be a great idea, and contributions are more than welcome!

@KonradHoeffner
Author

KonradHoeffner commented Feb 22, 2022

@shakedzy: Thanks for the invitation! Unfortunately, I have zero experience with both parallelism and library development in Python.

I still cloned the repository to see whether there is an easy change to make, and installed both the normal and the dev dependencies. However, the tests failed because dython itself was not installed. I then installed dython via pip, but I am not sure that works as intended: wouldn't the tests then run against the version installed from pip rather than the local one?
It would help to have some more documentation in README.md or CONTRIBUTING.md on how to set up and test the project as a developer, though maybe that is only needed for people like me who are not experienced with Python library development and probably can't contribute much anyway.

@shakedzy
Owner

Hey @KonradHoeffner - please open a separate bug report and add all of this there, so we don't mix different issues here.

Anyway, the best way to run and test the version you have locally is to clone the repo and then, from within the cloned dython directory, run:

pip install -e .
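For completeness, a typical editable-install workflow might look like the following. Note this is a sketch of a common setup, not the project's documented process: the repo URL is the upstream one, and pytest as the test runner is an assumption - check the repo for the actual instructions.

```shell
# clone the repo (or your fork) and enter it
git clone https://github.com/shakedzy/dython.git
cd dython

# editable install: `import dython` now resolves to this checkout,
# so the tests exercise your local changes, not a pip-released copy
pip install -e .

# run the test suite (assuming pytest; check the repo's dev docs)
python -m pytest
```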

@shakedzy shakedzy added the help wanted Extra attention is needed label Feb 22, 2022
@KonradHoeffner
Author

OK, I moved the bug report from here into two new issues #119 and #118.

@ThomasBury

ThomasBury commented Sep 2, 2022

I worked on this for one of my professional projects. I broke the association function down into smaller functions and parallelized them.
It is currently a bit less flexible (no option yet for user-defined functions, but that should be easy to fix), but it allows weighting all the association/correlation measures. On my machine (an old 8-core laptop), it ran 3x faster than the original function.
[benchmark results image]

Would you be interested in another PR, @shakedzy? My approach is different from @mahieyin-rahmun's:

  • the component functions (Theil's U, etc.) are the ones being parallelized (parallelization is not done inside the associations function)
  • weighted versions of the measures (these can be made vanilla, without weighting the observations)
  • using pandas throughout (pandas is fast ^^); the NA-handling strategies based on lists are a huge bottleneck
  • the overall code is closer to PEP standards and easier to read (flat is better than nested)
  • plotting the association matrix would be done by a separate function (following PEP guidance, smaller functions are preferred)
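For illustration, here is a minimal sketch of the first point - parallelizing one component function (Theil's U, implemented from scratch below) over column pairs instead of parallelizing the associations function itself. The function names and executor choice are mine, not the actual PR code; a thread pool is used only to keep the sketch self-contained and portable, while a process pool (e.g. joblib/loky) would be needed for real speedups on CPU-bound pure-Python work.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def entropy(values):
    """Shannon entropy (natural log) of a sequence of labels."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def theils_u(x, y):
    """Uncertainty coefficient U(x|y): share of H(x) explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(yv, []).append(xv)
    n = len(x)
    h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_x - h_x_given_y) / h_x


def parallel_theils_u(df, max_workers=4):
    """All pairwise U(col_a | col_b) values, evaluated on a worker pool."""
    cols = list(df.columns)
    pairs = [(a, b) for a in cols for b in cols if a != b]
    result = pd.DataFrame(1.0, index=cols, columns=cols)
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        us = ex.map(
            lambda p: theils_u(df[p[0]].tolist(), df[p[1]].tolist()), pairs)
        for (a, b), u in zip(pairs, us):
            result.loc[a, b] = u
    return result
```

Since each pair is independent, the work partitions cleanly; whether threads, processes, or vectorized pandas wins depends on where the time actually goes, which is what the benchmarks in this thread are measuring.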

@shakedzy
Owner

shakedzy commented Sep 3, 2022

Hey @ThomasBury - first of all, thanks for your contribution :) I'm wondering whether @mahieyin-rahmun's thumbs-up means he's into your code too :)
I think adding your parallelization of the component functions is a good idea no matter what, as these are often used directly rather than through associations.
If there's an easy fix for allowing user-defined functions, why not give it a shot? Perhaps we can test both your version and @mahieyin-rahmun's version to see which one works best. What do you think?
If you do decide to open a PR, please mind the comments I gave @mahieyin-rahmun regarding having only one associations method and enabling multiprocessing via a flag.
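To make the flag idea concrete, here is a toy sketch of a single entry point that switches between serial and parallel execution. Pearson's r stands in for the real mixed nominal/numerical measures; the parameter names `multiprocessing` and `max_cpu_cores` match those used in the benchmark script later in this thread, but everything else (function names, the thread-pool executor) is my own illustration, not dython's API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_measure(df, a, b):
    """Stand-in for the real per-pair measure dispatch (Pearson's r here)."""
    return a, b, float(df[a].corr(df[b]))


def associations_sketch(df, multiprocessing=False, max_cpu_cores=None):
    """One method; the flag changes only *how* the pairs are evaluated."""
    cols = list(df.columns)
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    pairs = list(combinations(cols, 2))
    if multiprocessing:
        # a thread pool keeps this sketch portable; a process pool would be
        # the realistic choice for CPU-bound pure-Python measures
        with ThreadPoolExecutor(max_workers=max_cpu_cores) as ex:
            results = list(ex.map(lambda p: pairwise_measure(df, *p), pairs))
    else:
        results = [pairwise_measure(df, a, b) for a, b in pairs]
    for a, b, v in results:
        out.loc[a, b] = out.loc[b, a] = v
    return out
```

Both paths return an identical matrix, so callers never branch on the flag themselves - which is exactly the "only one associations method" constraint.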

@mahieyin-rahmun
Contributor

mahieyin-rahmun commented Sep 4, 2022

Yeah, I think it would be interesting to see both versions, and if a combination of both versions (multiprocessing on top of parallelized component functions) can give an even better performance gain.

@shakedzy shakedzy added this to the Version 0.7.2 milestone Sep 8, 2022
@ThomasBury

Hey, I'm late to the ball. I'm currently refactoring the code whenever my other duties allow. I'll try to benchmark it the same way. The other features are also working fine. I'll provide an example as an illustration when I open the PR (beginning of next week, normally). That being said, great job 👍

@KonradHoeffner
Author

@ThomasBury: I adapted the benchmark code from @mahieyin-rahmun to the renamed parameters and a generic core count:

from dython.nominal import associations
import numpy as np
import pandas as pd
import time
import random
import matplotlib.pyplot as plt

random.seed(42)
np.random.seed(42)

MAX_CORES = 16
RUNS = 5
NUM_COLS = 86

def get_random_categorical_array():
    categorical_data_array = [
        ["A", "B", "C"],
        ["W", "X", "Y", "Z"],
        ["Yes", "No"],
        ["M", "N", "O", "P"]
    ]
    return random.choice(categorical_data_array)


def benchmark_function(func, dataframe, **kwargs):
    times = []

    for run_count in range(RUNS):
        print(f"Run: {run_count}")
        start = time.time()
        results = func(dataframe, compute_only=True, **kwargs)
        end = time.time()
        print(f"Time taken: {(end - start):.2f} seconds")
        times.append(end - start)

    print(
        f"Average time taken by {func}: {(sum(times) / len(times)):.2f} seconds")

    return times


def run_benchmark():
    raw_data = np.random.randn(32001, NUM_COLS)

    # use the first row/column of the random matrix as (numeric) labels
    dataframe = pd.DataFrame(
        data=raw_data[1:, 1:],
        index=raw_data[1:, 0],
        columns=raw_data[0, 1:]
    )

    # 75% of the columns are nominal
    for index in range(int(NUM_COLS * 0.75)):
        dataframe.iloc[:, index] = np.random.choice(
            get_random_categorical_array(), len(dataframe))

    times_vanilla = benchmark_function(
        associations, dataframe, multiprocessing=False)
    times_cores = dict()
    for i in range(2, MAX_CORES + 1, 2):
        times_cores[i] = benchmark_function(
            associations, dataframe, multiprocessing=True, max_cpu_cores=i)

    # plot the results
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), dpi=150)
    axes[0].plot(times_vanilla, label="Vanilla")
    for i in range(2, MAX_CORES + 1, 2):
        axes[0].plot(times_cores[i], label=f"{i} cores")
    axes[0].set_title(f"Time taken by associations function ({RUNS} runs)")
    axes[0].set_xlabel("Run count")
    axes[0].set_ylabel("Time taken (seconds)")
    axes[0].set_xticks(np.arange(len(times_vanilla)),
                       np.arange(1, len(times_vanilla) + 1))
    axes[0].legend()

    bar_chart_data = {
        "Vanilla": np.mean(times_vanilla),
    }
    for i in range(2, MAX_CORES + 1, 2):
        bar_chart_data[f"{i} cores"] = np.mean(times_cores[i])
    axes[1].bar(bar_chart_data.keys(), bar_chart_data.values(), width=0.5)
    axes[1].set_title(
        f"Average time taken by associations function (across {RUNS} runs)")
    axes[1].set_ylabel("Time taken (seconds)")
    plt.tight_layout()
    # save before show(): show() may clear the figure, leaving an empty file
    plt.savefig('benchmark.svg', bbox_inches='tight')
    plt.show()

if __name__ == "__main__":
    run_benchmark()

@shakedzy
Owner

@ThomasBury & @KonradHoeffner - no worries and no pressure :) whenever you have the time, simply open a new PR.
