Parallelize dython.nominal.associations #117
Hey @KonradHoeffner - parallelizing would be a great idea, and contributions would be more than welcome!
@shakedzy: Thanks for the invitation! Unfortunately I have zero experience with both parallelism and library development in Python. I still cloned the repository to see if there is an easy change to make and installed both the normal and dev dependencies, but the tests failed because dython was not installed. I then installed dython via pip, but I am not sure that is working as intended: wouldn't the tests then test the version installed from pip rather than the local one?
Hey @KonradHoeffner - please open a bug report and add all this there, so we don't mix different issues here. Anyway, the best way to run and test the version you have installed is to clone the repo and then work from within the cloned directory.
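The thread doesn't spell out the exact commands, but a common workflow for testing a local clone (assuming the usual `pip`/`pytest` tooling; the dev-requirements file name is a guess) is an editable install:

```bash
# Hedged sketch, not quoted from the thread: install the local checkout in
# editable mode so the tests import it instead of the PyPI release.
git clone https://github.com/shakedzy/dython.git
cd dython
pip install -e .                       # editable install of the clone
pip install -r requirements-dev.txt    # dev dependencies (file name assumed)
pytest                                 # tests now run against the local sources
```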
I worked on this for one of my professional projects and have broken down the `associations` computation into separate component functions. Would you be interested in another PR, @shakedzy? My approach is different from @mahieyin-rahmun's.
Hey @ThomasBury - first of all, thanks for your contribution :) I'm wondering if @mahieyin-rahmun's thumbs-up means he's into your code too :)
Yeah, I think it would be interesting to see both versions, and whether a combination of the two (multiprocessing on top of parallelized component functions) can give an even better performance gain.
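A minimal sketch of that combined idea, not taken from either PR: distribute the pairwise computations over a process pool. The helper `pair_measure` and the use of plain Pearson correlation for every pair are placeholder assumptions; dython actually picks a different measure depending on whether each column is nominal or numerical.

```python
# Hedged sketch of multiprocessing over column pairs (not dython's code).
from itertools import combinations
from multiprocessing import Pool

import numpy as np
import pandas as pd


def pair_measure(args):
    """Compute one pairwise measure; stand-in for dython's per-pair logic."""
    name_a, name_b, col_a, col_b = args
    return name_a, name_b, float(np.corrcoef(col_a, col_b)[0, 1])


def parallel_pairwise(df, processes=4):
    """Fill a symmetric association matrix using a process pool."""
    cols = list(df.columns)
    jobs = [(a, b, df[a].to_numpy(), df[b].to_numpy())
            for a, b in combinations(cols, 2)]
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    with Pool(processes=processes) as pool:
        for a, b, value in pool.map(pair_measure, jobs):
            out.loc[a, b] = out.loc[b, a] = value
    return out


if __name__ == "__main__":
    demo = pd.DataFrame(np.random.randn(1000, 6), columns=list("abcdef"))
    print(parallel_pairwise(demo, processes=4))
```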
Hey, I'm a bit behind on this. I'm currently refactoring the code when my other duties allow me to do so. I'll try to benchmark in the same way. The other features are also working fine. I'll provide an example as an illustration when I open the PR (normally at the beginning of next week). That being said, great job 👍
@ThomasBury: I adapted the benchmark code from @mahieyin-rahmun to the renamed parameters and to a generic core count:

```python
from dython.nominal import associations
import numpy as np
import pandas as pd
import time
import random
import matplotlib.pyplot as plt

random.seed(42)
np.random.seed(42)

MAX_CORES = 16
RUNS = 5
NUM_COLS = 86


def get_random_categorical_array():
    categorical_data_array = [
        ["A", "B", "C"],
        ["W", "X", "Y", "Z"],
        ["Yes", "No"],
        ["M", "N", "O", "P"]
    ]
    return random.choice(categorical_data_array)


def benchmark_function(func, dataframe, **kwargs):
    times = []
    for run_count in range(RUNS):
        print(f"Run: {run_count}")
        start = time.time()
        func(dataframe, compute_only=True, **kwargs)
        end = time.time()
        print(f"Time taken: {(end - start):.2f} seconds")
        times.append(end - start)
    print(f"Average time taken by {func}: {(sum(times) / len(times)):.2f} seconds")
    return times


def run_benchmark():
    # random numeric data; the first row/column are consumed as column names/index
    raw_data = np.random.randn(32001, NUM_COLS)
    dataframe = pd.DataFrame(
        data=raw_data[1:, 1:],
        index=raw_data[1:, 0],
        columns=raw_data[0, 1:]
    )
    # 75% of the columns are nominal
    for index in range(int(NUM_COLS * 0.75)):
        dataframe.iloc[:, index] = np.random.choice(
            get_random_categorical_array(), len(dataframe))

    times_vanilla = benchmark_function(
        associations, dataframe, multiprocessing=False)
    times_cores = dict()
    for i in range(2, MAX_CORES + 1, 2):
        times_cores[i] = benchmark_function(
            associations, dataframe, multiprocessing=True, max_cpu_cores=i)

    # plot the results
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), dpi=150)
    axes[0].plot(times_vanilla, label="Vanilla")
    for i in range(2, MAX_CORES + 1, 2):
        axes[0].plot(times_cores[i], label=f"{i} cores")
    axes[0].set_title("Time taken by associations function (5 runs)")
    axes[0].set_xlabel("Run count")
    axes[0].set_ylabel("Time taken (seconds)")
    axes[0].set_xticks(np.arange(len(times_vanilla)),
                       np.arange(1, len(times_vanilla) + 1))
    axes[0].legend()

    bar_chart_data = {"Vanilla": np.mean(times_vanilla)}
    for i in range(2, MAX_CORES + 1, 2):
        bar_chart_data[f"{i} cores"] = np.mean(times_cores[i])
    axes[1].bar(bar_chart_data.keys(), bar_chart_data.values(), width=0.5)
    axes[1].set_title("Average time taken by associations function (across 5 runs)")
    axes[1].set_ylabel("Time taken (seconds)")

    plt.tight_layout()
    # save before show, otherwise the saved figure is blank
    plt.savefig('benchmark.svg', bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    run_benchmark()
```
@ThomasBury & @KonradHoeffner - no worries and no pressure :) Whenever you have the time, simply open a new PR.
I am trying out Dython for the first time and, as shown on the homepage, I am using:
However, this has been running for several minutes now (the data is around 300x1700), and from htop it looks like it only uses one of my 10/20 (physical/hyperthreaded) CPU cores. Is it possible to parallelize this to increase the speed?
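The exact snippet is not preserved above; for context, a minimal call in the style of the dython docs (assuming a pandas DataFrame `df`; the file name is hypothetical) looks roughly like this:

```python
# Hedged illustration, not the reporter's exact snippet.
import pandas as pd
from dython.nominal import associations

df = pd.read_csv("my_data.csv")  # hypothetical input; the report mentions ~300x1700 data
associations(df)                 # computes and plots the full associations matrix
```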