Applying Machine Learning to Understand Water Security and Water Access Inequality in Underserved Colonia Communities
This is the source code for reproducing the results shown in the Colonias paper.
The original dataset was obtained from the Rural Community Assistance Partnership (RCAP) through its GIS web platform, the Phase II Colonia Web Map.
- Colonias.gdb: original dataset from RCAP
- colonias_Y_norm.csv: preprocessed dataset for colonias with public water services, derived from the original dataset (Colonias.gdb)
- colonias_N_norm.csv: preprocessed dataset for colonias without public water services, derived from the original dataset (Colonias.gdb)
- parameters_Y.csv: clustering results (Silhouette Score) under different damping factors for colonias with public water services
- parameters_N.csv: clustering results (Silhouette Score) under different damping factors for colonias without public water services
- colonias_Y_norm_labeled.csv: colonias_Y_norm.csv with the optimal clustering labels attached, for colonias with public water services
- colonias_N_norm_labeled.csv: colonias_N_norm.csv with the optimal clustering labels attached, for colonias without public water services
colonias_N_norm.csv and colonias_Y_norm.csv are the inputs to the Affinity Propagation algorithm.
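As a rough illustration of how such an input can be fed to Affinity Propagation, here is a minimal sketch. It is not the notebooks' code: the data are synthetic stand-ins for the csv rows, and `gower_numeric` is a hypothetical NumPy helper that computes a Gower-style distance for numeric columns only. The notebooks presumably rely on the gower package listed in the requirements instead.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def gower_numeric(X):
    """Gower-style distance for numeric features only (hypothetical helper):
    the mean over features of the range-scaled absolute difference."""
    X = np.asarray(X, dtype=float)
    ranges = X.max(axis=0) - X.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    diffs = np.abs(X[:, None, :] - X[None, :, :]) / ranges
    return diffs.mean(axis=2)


# Synthetic stand-in for the rows of colonias_Y_norm.csv (assumed values).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.2, 0.03, size=(20, 3)),
    rng.normal(0.8, 0.03, size=(20, 3)),
])

distance = gower_numeric(X)
similarity = -distance  # Affinity Propagation works on similarities, so negate

# damping=0.9 is an arbitrary example value, not the paper's optimum.
model = AffinityPropagation(affinity="precomputed", damping=0.9)
labels = model.fit_predict(similarity)
print("clusters found:", len(set(labels)))
```

Passing a precomputed similarity matrix keeps the choice of distance separate from the clustering step, which is convenient when the attributes mix numeric and categorical columns.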
Selected attributes and corresponding descriptions are as follows:
- Python 2.7+
- scikit-learn == 0.21
- gower 0.1.2
- pandas
- numpy
- graphviz 0.20.1
- matplotlib
You can reproduce our workflow by following the steps below.
- ap_optimal_param.ipynb: compares clustering results (Silhouette Score) under different damping factors and iterations. Running it generates parameters_Y.csv and parameters_N.csv (these can also be found under the folder dataset/).
- params_SS.ipynb: plots Silhouette Score values under different damping factors and iterations for colonias with/without public water services.
- ap_get_labels.ipynb: takes as input the optimal parameters from step 1 (ap_optimal_param.ipynb), i.e. those with the highest Silhouette Score. It outputs the cluster labels obtained under the best parameters and saves them in csv files.
- Attach the labels to the original csv files (colonias_N_norm.csv, colonias_Y_norm.csv) to create new csv files (colonias_N_norm_labeled, colonias_Y_norm_labeled). (You can also update the label information in the same csv files.)
- decision_tree.ipynb: generates a decision tree for the clustering results of colonias with/without public water services.
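The workflow above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the notebooks' actual code: the data are synthetic, the default Euclidean affinity is used instead of a Gower-based one, and `sweep_damping`, the damping grid, and the feature names `f0`..`f2` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumed); the notebooks read colonias_*_norm.csv instead.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.2, 0.03, size=(20, 3)),
    rng.normal(0.8, 0.03, size=(20, 3)),
])


def sweep_damping(X, dampings, max_iter=500):
    """Return (damping, score, labels) for every run that yields a scorable clustering."""
    results = []
    for damping in dampings:
        model = AffinityPropagation(damping=damping, max_iter=max_iter)
        labels = model.fit_predict(X)
        n_clusters = len(set(labels))
        # The silhouette score needs 2 <= n_clusters <= n_samples - 1;
        # runs that collapse to a single label are skipped by the same check.
        if 2 <= n_clusters <= len(X) - 1:
            results.append((damping, silhouette_score(X, labels), labels))
    return results


# Compare damping factors and keep the run with the highest Silhouette Score.
dampings = [0.5, 0.6, 0.7, 0.8, 0.9]
results = sweep_damping(X, dampings)
best_damping, best_score, best_labels = max(results, key=lambda r: r[1])
print("best damping:", best_damping, "silhouette:", round(best_score, 3))

# Fit a shallow decision tree that explains the cluster labels from the features.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, best_labels)
print(export_text(tree, feature_names=["f0", "f1", "f2"]))
```

Keeping the tree shallow makes the printed rules easier to read as a description of what separates the clusters. The depth of 3 is only an example value.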
Under the folder maps/, you can visualize clustering results on the map to understand the geographical distribution of water security in colonias.
- maps/jsons: this folder contains water security information with clustering labels and geographical locations for colonias with/without public water services.
- maps/shapefiles: this folder contains shapefiles of country boundaries and county outlines for the 4 colonia states (Arizona, California, New Mexico, and Texas).