GitHub - UBC-MDS/pyedahelper at 0.1.15

pyedahelper

A Python package that simplifies the main EDA procedures, such as outlier identification, data visualization, correlation, and missing data imputation.

Authors

Ofer Mansour	Suvarna Moharir	Subing Cao	Manuel Maldonado

Project Overview

Data understanding and cleaning represent 60% of a data scientist's time in any given project (source). The goal with this package is to simplify this process, and allow for more efficient use of time while working on some of the main procedures done in an exploratory data analysis (EDA) (outlier identification, data visualization, correlation, missing data imputation).

Installation:

pip install -i https://test.pypi.org/simple/ pyedahelper

Functions

Function Name	Input	Output	Description
fast_outlier_id	3 parameters: dataframe, a list of columns to include in the analysis, method used to identify outliers ("Z-score algorithm" or "Interquartile Range")	dataframe with the included columns, the outlier values identified, and the % of counts considered outliers for each analyzed column	Given a dataframe, analyzes the listed columns in search of outlier values and returns a dataframe summarizing the outlier values found and the % of counts affected by the outlier(s)
fast_plot	4 parameters: dataframe, name of X column, name of y column, plot name	Plot object	Given a dataframe, the columns to be used as X and Y respectively, and the desired plot type, the function computes and returns the specified plot
fast_corr	2 parameters: dataframe, list of columns to be analyzed	correlation plot object	Calculates the correlation of all specified columns and generates a plot visualizing the correlation coefficients.
fast_missing_impute	3 parameters: dataframe, a string specifying the missing data treatment method, list of columns to be treated	new dataframe without missing values in the specified columns	Given a dataframe and a list of columns in that dataframe, missing values are identified and treated as specified by the missing data treatment method

Usage

The package can analyze the values of a given column list and identify outliers using either the Z-score algorithm or the interquartile range algorithm. You can find more references regarding these algorithms here: Z-score and Interquartile.

import pandas as pd
from pyedahelper import pyedahelper

# One numeric column with an extreme value (5000) and one missing value
sample = {"col_a": [5000, 50, 6, 8, float("nan"), 10, 5, 2, 3]}

sample_data = pd.DataFrame(sample)
pyedahelper.fast_outlier_id(sample_data, cols="All", method="z-score", threshold_low_freq=0.05)

Output:

	column_name	type	no_nans	perc_nans	outlier_method	no_outliers	perc_outliers	outlier_values
0	col_a	float64	1	0.11	Z-Score	1	0.12	5000
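To make the two detection rules concrete, here is a minimal sketch in plain pandas. It is not the package's implementation: the helper names `zscore_outliers` and `iqr_outliers`, the Z-score cutoff of 2.0, the use of the population standard deviation, and the 1.5 IQR multiplier are all assumptions chosen for illustration.

```python
import pandas as pd


def zscore_outliers(series: pd.Series, threshold: float = 2.0) -> pd.Series:
    """Return values whose absolute Z-score exceeds `threshold` (assumed cutoff)."""
    values = series.dropna()
    if values.empty:
        return values
    std = values.std(ddof=0)  # population standard deviation (an assumption)
    if std == 0:
        return values.iloc[0:0]  # all values identical, so nothing is flagged
    z = (values - values.mean()) / std
    return values[z.abs() > threshold]


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values lying more than k * IQR outside the first or third quartile."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Same sample column as in the usage example above
col_a = pd.Series([5000, 50, 6, 8, float("nan"), 10, 5, 2, 3], name="col_a")
print(zscore_outliers(col_a).tolist())
print(iqr_outliers(col_a).tolist())
```

With these assumed settings, the Z-score rule flags only 5000, while the interquartile rule flags both 5000 and 50. Different cutoffs would give different results, which is why a single function exposing both methods is convenient.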

pyedahelper can also quickly create scatter, line or bar plots from a pandas dataframe, using the Altair library. As an example, using the iris dataset:

from pyedahelper import pyedahelper
import seaborn as sns

iris = sns.load_dataset('iris')
pyedahelper.fast_plot(df=iris, x='sepal_length', y='sepal_width', plot_type='scatter')

Output:

The package can also create a correlation matrix easily from a pandas dataframe and a list of desired columns. As an example, using the iris dataset:

from pyedahelper import pyedahelper
import seaborn as sns

iris = sns.load_dataset('iris')
pyedahelper.fast_corr(df=iris, col_name=['sepal_length', 'sepal_width', 'petal_length'])

Output:

Finally, pyedahelper can impute values to missing data, with method choices of either remove (removes all rows with missing data), mean, median, or mode imputation.

import pandas as pd
from pyedahelper import pyedahelper

# One numeric and one text column, each with a missing value
sample = {"col_a": [50, 50, 6, 8, float("nan")],
          "col_b": ["the", "quick", float("nan"), "quick", "fox"]
          }
sample_data = pd.DataFrame(sample)

pyedahelper.fast_missing_impute(df=sample_data, method="mode", cols=["col_a", "col_b"])

Output:

	col_a	col_b
0	50	the
1	50	quick
2	6	quick
3	8	quick
4	50	fox
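For comparison, the same treatment written directly against pandas might look like the sketch below. The helper name `impute_sketch` is hypothetical and this is not the package's code; it only illustrates the four method choices described above (remove, mean, median, mode).

```python
import pandas as pd


def impute_sketch(df: pd.DataFrame, method: str, cols: list) -> pd.DataFrame:
    """Illustrative only: treat missing values in `cols` using `method`."""
    out = df.copy()
    if method == "remove":
        # Drop every row that has a missing value in any of the listed columns
        return out.dropna(subset=cols)
    for col in cols:
        if method == "mean":
            fill = out[col].mean()    # only meaningful for numeric columns
        elif method == "median":
            fill = out[col].median()  # only meaningful for numeric columns
        elif method == "mode":
            fill = out[col].mode().iloc[0]  # most frequent non-missing value
        else:
            raise ValueError(f"unknown method: {method!r}")
        out[col] = out[col].fillna(fill)
    return out


sample_data = pd.DataFrame({
    "col_a": [50, 50, 6, 8, float("nan")],
    "col_b": ["the", "quick", float("nan"), "quick", "fox"],
})
print(impute_sketch(sample_data, method="mode", cols=["col_a", "col_b"]))
```

Mode imputation is the only one of the three fill methods that works for text columns such as col_b, which is why the example above uses it for both columns.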

Alignment with Python / R Ecosystems

At this time, there are multiple packages that are used during EDA with a similar functionality in both R and Python. Nevertheless, most of these existing packages require multiple steps or provide results that could be simplified.

The pyedahelper package focuses on minimizing the code a user needs to write to reach meaningful conclusions about outliers, missing data treatment, data visualization, and correlation computation and visualization.

In the following table we have summarized existing packages that are related to the procedures that are simplified in the pyedahelper package.

EDA Procedure related	Language	Existing Packages/Functions
Outlier identification	Python	Box Plot Visualization
Outlier identification	Python	Z-Score
Outlier identification	Python	Interquartile Range
Missing Value Treatment	Python	Pandas Dropping NaN Values
Missing Value Treatment	Python	Simple Imputer Values
Missing Value Treatment	Python	Iterative Imputer
Correlation Visualization	Python	Seaborn Heatmap
Data Visualization	Python	Altair

How will pyedahelper compare to the previous existing packages/functions?

The pyedahelper package aims to provide a user-friendly experience by reducing the code needed to conduct an exploratory data analysis, specifically for identifying outliers, imputing missing data, and generating visualizations for relations and correlations.

The fast_plot() function leverages the Altair library in Python, and improves on it by letting the user change the plot type through a single argument and by including error handling to ensure appropriate column types for certain plots. The seaborn Python package has similar functions for creating a correlation matrix. However, the fast_corr() function provides a more user-friendly (less coding) experience and makes it easier to select the columns (features) for the analysis: it filters out the categorical columns and performs the analysis only on the numeric columns. The Python packages sklearn.impute and autoimpute have similar functions for imputing missing data. However, the fast_missing_impute() function is likely more convenient for the user as it involves less coding, requiring the user only to select the method of imputation and the columns with missing data. Finally, for outlier identification, the fast_outlier_id() function provides an integrated solution by combining existing methods into a single function, automating the use of the Z-score and interquartile range methods to identify outliers.
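The numeric-only filtering described for the correlation analysis can be sketched with plain pandas as follows. The helper name `numeric_corr` and the small example dataframe are made up for illustration, and the sketch stops at the correlation matrix rather than reproducing the plot the package returns.

```python
import pandas as pd


def numeric_corr(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Illustrative only: correlate the numeric columns among `cols`."""
    # Keep the requested columns, then drop any that are not numeric
    numeric = df[cols].select_dtypes(include="number")
    # Pairwise Pearson correlation (the pandas default)
    return numeric.corr()


# Made-up data: two numeric columns and one categorical column
example = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [2.0, 4.0, 6.0, 8.0],
    "species": ["a", "b", "a", "b"],
})
print(numeric_corr(example, ["length", "width", "species"]))
```

Here the species column is silently dropped before the correlation is computed, so passing a mixed list of columns does not raise an error.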

Dependencies

Documentation

The official documentation is hosted on Read the Docs: https://pyedahelper.readthedocs.io/en/latest/

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.
