A Python package that simplifies the main EDA procedures: outlier identification, data visualization, correlation, and missing data imputation.
Ofer Mansour | Suvarna Moharir | Subing Cao | Manuel Maldonado |
---|
Data understanding and cleaning represent 60% of a data scientist's time in any given project (source). The goal of this package is to simplify this process and allow more efficient use of time on some of the main procedures in an exploratory data analysis (EDA): outlier identification, data visualization, correlation, and missing data imputation.
pip install -i https://test.pypi.org/simple/ pyedahelper
Function Name | Input | Output | Description |
---|---|---|---|
fast_outlier_id | 3 parameters: dataframe, a list of columns to be included in the analysis, method to be used to identify outliers ("Z-score algorithm" or "Interquartile Range") | dataframe with the included columns, the outlier values identified, and the % of counts considered outliers for each analyzed column | Given a dataframe, the listed columns are searched for outlier values; the function returns a dataframe summarizing the outlier values found and the % of counts they represent |
fast_plot | 4 parameters: dataframe, name of x column, name of y column, plot type | Plot object | Given a dataframe, the columns to be used as x and y respectively, and the desired plot type, the function computes and returns the specified plot |
fast_corr | 2 parameters: dataframe, list of columns to be analyzed | correlation plot object | Calculates the correlation of all specified columns and generates a plot visualizing the correlation coefficients. |
fast_missing_impute | 3 parameters: dataframe, a string specifying the missing data treatment method, list of columns to be treated | new dataframe without missing values in the specified columns | Given a dataframe and a list of columns in that dataframe, missing values are identified and treated as specified by the missing data treatment method |
The package can analyze the values of a given list of columns and identify outliers using either the Z-score algorithm or the interquartile range algorithm. You can find more references regarding these algorithms here: Z-score and Interquartile.
import pandas as pd
from pyedahelper import pyedahelper
sample = {"col_a": [5000, 50, 6, 8, float("nan"), 10, 5, 2, 3]}
sample_data = pd.DataFrame(sample)
pyedahelper.fast_outlier_id(sample_data, cols="All", method="z-score", threshold_low_freq=0.05)
Output:
| | column_name | type | no_nans | perc_nans | outlier_method | no_outliers | perc_outliers | outlier_values |
---|---|---|---|---|---|---|---|---|
0 | col_a | float64 | 1 | 0.11 | Z-Score | 1 | 0.12 | 5000 |
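To make the two methods concrete, here is a minimal sketch in plain Python using the standard-library statistics module. It is not pyedahelper's implementation: the helper names (drop_nans, zscore_outliers, iqr_outliers), the cutoff of 2 standard deviations, and the 1.5 × IQR fences are assumptions chosen for illustration, and statistics.quantiles requires Python 3.8 or later.

```python
import math
import statistics


def drop_nans(values):
    """Return the values with NaN entries removed."""
    return [v for v in values if not math.isnan(v)]


def zscore_outliers(values, cutoff=2.0):
    """Flag values whose absolute z-score exceeds `cutoff` (assumed default)."""
    clean = drop_nans(values)
    mean = statistics.mean(clean)
    std = statistics.pstdev(clean)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in clean if abs(v - mean) / std > cutoff]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed fences)."""
    clean = drop_nans(values)
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < low or v > high]


# Same sample column as in the example above.
col_a = [5000, 50, 6, 8, float("nan"), 10, 5, 2, 3]
z_out = zscore_outliers(col_a)
iqr_out = iqr_outliers(col_a)
print(z_out, iqr_out)
```

With this sample, both helpers flag only the value 5000. Note that with only eight non-missing values a z-score can never exceed about 2.65, so a stricter cutoff such as 3 would flag nothing here.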
pyedahelper can also quickly create scatter, line, or bar plots from a pandas dataframe, using the Altair library. As an example, using the iris dataset:
from pyedahelper import pyedahelper
import seaborn as sns
iris = sns.load_dataset('iris')
pyedahelper.fast_plot(df=iris, x='sepal_length', y='sepal_width', plot_type='scatter')
Output:
The package can also easily create a correlation matrix from a pandas dataframe and a list of desired columns. As an example, using the iris dataset:
from pyedahelper import pyedahelper
import seaborn as sns
iris = sns.load_dataset('iris')
pyedahelper.fast_corr(df=iris, col_name=['sepal_length', 'sepal_width', 'petal_length'])
Output:
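The idea behind a column-filtered correlation matrix can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code inside fast_corr: the helpers pearson and numeric_corr and the toy table are hypothetical, and the sketch returns a nested dictionary of coefficients rather than a plot.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_numeric(values):
    """True if every entry is an int or float (booleans excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def numeric_corr(table, cols):
    """Correlation matrix over the requested columns, skipping non-numeric ones."""
    numeric = [c for c in cols if is_numeric(table[c])]
    return {
        a: {b: round(pearson(table[a], table[b]), 2) for b in numeric}
        for a in numeric
    }


# Hypothetical toy data, not the iris dataset.
table = {
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [2.0, 4.0, 6.0, 8.0],
    "height": [4.0, 3.0, 2.0, 1.0],
    "species": ["a", "b", "a", "b"],
}
corr = numeric_corr(table, ["length", "width", "height", "species"])
print(corr)
```

The categorical species column is dropped before any coefficient is computed, which mirrors the numeric-only behavior described for fast_corr. With pandas, a comparable matrix would typically come from selecting the numeric columns with select_dtypes(include="number") and calling corr() on the result.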
Finally, pyedahelper can impute values for missing data, with a choice of methods: remove (removes all rows with missing data), mean, median, or mode imputation.
import pandas as pd
from pyedahelper import pyedahelper
sample = {"col_a": [50, 50, 6, 8, float("nan")],
"col_b": ["the", "quick", float("nan"), "quick", "fox"]
}
sample_data = pd.DataFrame(sample)
pyedahelper.fast_missing_impute(df=sample_data, method="mode", cols=["col_a", "col_b"])
Output:
| | col_a | col_b |
---|---|---|
0 | 50 | the |
1 | 50 | quick |
2 | 6 | quick |
3 | 8 | quick |
4 | 50 | fox |
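The four treatment methods can be sketched for a single column in plain Python as shown below. This is an illustrative sketch, not the package's implementation: the helpers is_missing and impute are hypothetical, and because the sketch works on one column at a time, its "remove" option drops entries from that column only, whereas the package describes removing whole rows.

```python
import math
import statistics


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute(column, method):
    """Return a copy of `column` with missing entries handled by `method`."""
    observed = [v for v in column if not is_missing(v)]
    if method == "remove":
        return observed
    fillers = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    if method not in fillers:
        raise ValueError(f"unknown method: {method!r}")
    fill = fillers[method](observed)  # computed from non-missing values only
    return [fill if is_missing(v) else v for v in column]


# Same sample columns as in the example above.
col_a = [50, 50, 6, 8, float("nan")]
col_b = ["the", "quick", float("nan"), "quick", "fox"]
print(impute(col_a, "mode"))
print(impute(col_b, "mode"))
```

For the sample columns, mode imputation fills the gaps with 50 and "quick", matching the table above. Mean and median only make sense for numeric columns, so a text column such as col_b would need mode or remove.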
Several packages with similar functionality are already used for EDA in both R and Python. However, most of them require multiple steps or produce results that could be simplified.
The focus of the pyedahelper package is to minimize the code a user needs to write to reach meaningful conclusions about outliers, missing data treatment, data visualization, and correlation computation and visualization.
The following table summarizes existing packages related to the procedures that the pyedahelper package simplifies.
Related EDA Procedure | Language | Existing Packages/Functions |
---|---|---|
Outlier identification | Python | Box Plot Visualization |
Outlier identification | Python | Z-Score |
Outlier identification | Python | Interquartile Range |
Missing Value Treatment | Python | Pandas Dropping NaN Values |
Missing Value Treatment | Python | Simple Imputer Values |
Missing Value Treatment | Python | Iterative Imputer |
Correlation Visualization | Python | Seaborn Heatmap |
Data Visualization | Python | Altair |
How will pyedahelper compare to the existing packages/functions?
The pyedahelper package aims to provide a user-friendly experience by reducing the code needed to conduct an exploratory data analysis, specifically for identifying outliers, imputing missing data, and generating visualizations of relations and correlations.
The fast_plot() function leverages the Altair library in Python, but improves on it by letting the user change the plot type through a single argument and by including error handling to ensure appropriate column types for certain plots. The seaborn Python package has similar functions for creating a correlation matrix. However, the fast_corr() function provides a more user-friendly experience (less coding) and makes it easier to select the columns (features) for the analysis: it filters out the categorical columns and performs the analysis only on the numeric columns. The Python packages sklearn.impute and autoimpute have similar functions for imputing missing data. However, the fast_missing_impute() function is likely more convenient, as it involves less coding and only requires the user to select the imputation method and the columns with missing data. Finally, for outlier identification, the fast_outlier_id() function provides an integrated solution by combining existing methods into a single function: it automates the use of the Z-score and interquartile range methods to identify outliers.
- python == 3.7
- pandas == 1.0.1
- altair == 4.0.1
- statistics == 1.0.3
- seaborn == 0.10.0
- matplotlib == 3.2.0
- numpy == 1.18.1
- scipy == 1.4.1
The official documentation is hosted on Read the Docs: https://pyedahelper.readthedocs.io/en/latest/
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.