
This is the source code for reproducing results shown in the Colonias paper


ASUcicilab/Colonias_Water_ML


Applying Machine Learning to Understand Water Security and Water Access Inequality in Underserved Colonia Communities

This is the source code for reproducing results shown in the Colonias paper

Dataset

The original dataset was obtained from the Rural Community Assistance Partnership (RCAP) through its GIS web platform, the Phase II Colonia Web Map.


  1. Colonias.gdb: original dataset from RCAP
  2. colonias_Y_norm.csv: preprocessed dataset for colonias with public water services from the original dataset (Colonias.gdb)
  3. colonias_N_norm.csv: preprocessed dataset for colonias without public water services from the original dataset (Colonias.gdb)
  4. parameters_Y.csv: clustering results (Silhouette Score) under different damping factors for colonias with public water services
  5. parameters_N.csv: clustering results (Silhouette Score) under different damping factors for colonias without public water services
  6. colonias_Y_norm_labeled.csv: colonias_Y_norm.csv with the optimal clustering labels attached, for colonias with public water services
  7. colonias_N_norm_labeled.csv: colonias_N_norm.csv with the optimal clustering labels attached, for colonias without public water services

colonias_N_norm.csv and colonias_Y_norm.csv are the inputs to the Affinity Propagation algorithm. The selected attributes and their descriptions are shown in the following figure:

[image: selected attributes and their descriptions]
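As a rough illustration of how a normalized attribute table can be fed to Affinity Propagation and scored with the Silhouette Score under several damping factors, here is a minimal sketch. It is not the code from the notebooks: the data are synthetic, the helper names (`numeric_gower_distance`, `sweep_damping`) are hypothetical, and a simple numeric-only Gower distance written in numpy stands in for the `gower` package listed under Dependencies.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score


def numeric_gower_distance(X):
    """Gower distance for numeric features only: mean range-normalized absolute difference."""
    X = np.asarray(X, dtype=float)
    ranges = X.max(axis=0) - X.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    diffs = np.abs(X[:, None, :] - X[None, :, :]) / ranges
    return diffs.mean(axis=2)


def sweep_damping(distance, dampings, max_iter=500):
    """Return a list of (damping, n_clusters, silhouette) tuples, one per damping factor."""
    similarity = -distance  # Affinity Propagation expects similarities, so negate distances
    rows = []
    for damping in dampings:
        model = AffinityPropagation(affinity="precomputed", damping=damping, max_iter=max_iter)
        labels = model.fit_predict(similarity)
        n_labels = len(np.unique(labels))
        # The Silhouette Score is only defined for 2 <= n_labels <= n_samples - 1;
        # a non-converged run (all labels -1) is recorded as NaN instead.
        if 2 <= n_labels < len(labels):
            score = float(silhouette_score(distance, labels, metric="precomputed"))
        else:
            score = float("nan")
        rows.append((damping, n_labels, score))
    return rows


# Synthetic stand-in for a normalized attribute table (assumption, not the real data):
# three groups of 20 rows with 4 attributes each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.03, size=(20, 4)) for center in (0.1, 0.5, 0.9)])

distance = numeric_gower_distance(X)
results = sweep_damping(distance, dampings=[0.5, 0.7, 0.9])

# Keep the damping factor with the highest finite Silhouette Score, if any run converged.
finite = [row for row in results if not np.isnan(row[2])]
best = max(finite, key=lambda row: row[2]) if finite else None
print(results)
print("best:", best)
```

With the real files, `X` would be replaced by the attribute columns read from colonias_Y_norm.csv or colonias_N_norm.csv, and the distance matrix would presumably come from the `gower` package.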

Code Usage

Dependencies

  • Python 2.7+
  • scikit-learn == 0.21
  • gower 0.1.2
  • pandas
  • numpy
  • graphviz 0.20.1
  • matplotlib

You can reproduce our workflow by following the steps below.

  1. ap_optimal_param.ipynb: compares clustering results (Silhouette Score) under different damping factors and iterations. It generates parameters_Y.csv and parameters_N.csv (both can also be found under the folder dataset/).
  2. params_SS.ipynb: plots the Silhouette Score values under different damping factors and iterations for colonias with/without public water services.
  3. ap_get_labels.ipynb: takes the optimal parameters from step 1 (those with the highest Silhouette Score) as input and outputs the cluster labels obtained under those parameters, which are saved to csv files.
  4. Attach the labels to the original csv files (colonias_N_norm.csv, colonias_Y_norm.csv) and save the result as new csv files (colonias_N_norm_labeled.csv, colonias_Y_norm_labeled.csv). (You can also add the label information to the same csv files.)
  5. decision_tree.ipynb: generates a decision tree for the clustering results of colonias with/without public water services.
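Steps 4 and 5 can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the contents of decision_tree.ipynb: the attribute names (`attr_a`, `attr_b`), the synthetic values, the stand-in labels, and the tree depth are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

rng = np.random.default_rng(0)

# Hypothetical normalized attributes; the real column names come from colonias_*_norm.csv.
features = pd.DataFrame({
    "attr_a": np.concatenate([rng.uniform(0.0, 0.3, 30), rng.uniform(0.7, 1.0, 30)]),
    "attr_b": rng.uniform(0.0, 1.0, 60),
})
# Stand-in for the cluster labels produced in step 3 (assumption: two clusters).
labels = np.repeat([0, 1], 30)

# Step 4: attach the labels as a new column.
labeled = features.assign(label=labels)
# labeled.to_csv("colonias_Y_norm_labeled.csv", index=False)  # uncomment to write the file

# Step 5: fit a shallow decision tree that explains the cluster labels from the attributes.
feature_names = ["attr_a", "attr_b"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(labeled[feature_names], labeled["label"])
accuracy = tree.score(labeled[feature_names], labeled["label"])

# Export the tree as Graphviz DOT text; graphviz.Source(dot_source) could render it.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=feature_names,
    class_names=["cluster 0", "cluster 1"],
    filled=True,
)
print(accuracy)
```

The tree is fit on the cluster labels rather than on an external target, so its splits describe which attributes separate the clusters; the training accuracy only indicates how well the tree reproduces the clustering.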

Visualize clustering results on a map

Under the folder maps/, you can visualize clustering results on the map to understand the geographical distribution of water security in colonias.

  1. maps/jsons: this folder contains water security information with clustering labels and geographical locations for colonias with/without public water services.
  2. maps/shapefiles: this folder contains shapefiles of country boundaries and county outlines for the 4 colonia states (Arizona, California, New Mexico, and Texas).
