Machine Learning class work:
- First, we convert the XML file with the traffic incidents in Bizkaia to a CSV file.
Script: Sesion1/xml_to_csv.py
Input: Data/IncidenciasTDTHist.xml
Output: Data/IncidenciasTDTGeo.csv
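A minimal sketch of such a conversion, assuming the XML stores each incident as an element with attributes (the element and attribute names below are invented for illustration, not the real IncidenciasTDTHist.xml schema):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the incident XML (names are assumptions).
SAMPLE_XML = """<incidencias>
  <incidencia tipo="Accidente" provincia="Bizkaia"
              latitud="43.26" longitud="-2.93"/>
  <incidencia tipo="Obras" provincia="Bizkaia"
              latitud="43.30" longitud="-2.99"/>
</incidencias>"""

def xml_to_rows(xml_text):
    # One dict per <incidencia> element, built from its attributes.
    root = ET.fromstring(xml_text)
    return [inc.attrib for inc in root.iter("incidencia")]

def write_csv(rows, fh):
    # Stable column order so the CSV header is deterministic.
    writer = csv.DictWriter(fh, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

rows = xml_to_rows(SAMPLE_XML)
buf = io.StringIO()  # stands in for Data/IncidenciasTDTGeo.csv
write_csv(rows, buf)
```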
- We select the accident incidents from the CSV file that contains all the incidents.
Script: Sesion2/extract_accidents.py
Input: Data/IncidenciasTDTGeo.csv
Output: Data/Accidents.csv
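The selection amounts to a row filter; a sketch with pandas, using a hypothetical `tipo` column and inline data in place of the real CSV:

```python
import pandas as pd

# Toy stand-in for IncidenciasTDTGeo.csv (column/value names are assumptions).
incidences = pd.DataFrame({
    "tipo": ["Accidente", "Obras", "Accidente", "Retencion"],
    "latitud": [43.26, 43.30, 43.21, 43.25],
    "longitud": [-2.93, -2.99, -2.95, -2.90],
})

# Keep only the accident rows.
accidents = incidences[incidences["tipo"] == "Accidente"].reset_index(drop=True)
```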
- Next, we apply the DBSCAN algorithm to cluster the accidents. Each cluster defines an accident zone.
Script: Sesion2/DBSCAN_accidents.py
Input: Data/Accidents.csv
Output: Data/Accidents_with_zones_dbscan.csv
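The clustering step can be sketched with scikit-learn on synthetic coordinates; the `eps` and `min_samples` values are illustrative, not the ones tuned in the real script:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two synthetic accident hot spots plus one isolated point.
coords = np.array([
    [43.260, -2.930], [43.261, -2.931], [43.259, -2.929],  # hot spot A
    [43.300, -2.990], [43.301, -2.991], [43.299, -2.989],  # hot spot B
    [43.100, -2.500],                                       # isolated
])

# DBSCAN assigns a zone label per point; -1 marks noise (no zone).
labels = DBSCAN(eps=0.01, min_samples=3).fit_predict(coords)
```

A useful property here is that DBSCAN finds the number of zones on its own and leaves isolated accidents unassigned instead of forcing them into a zone.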
- We apply the Spectral Clustering and K-means algorithms to Accidents.csv. Each cluster defines an accident zone. We discard the Spectral Clustering results because they do not make sense.
Script: Sesion3/Spectral-kmeans.py
Input: Data/Accidents.csv
Output: Data/Accidents_with_zones_kmeans.csv
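Since only the K-means results are kept, a sketch of that half on synthetic coordinates (`n_clusters` is illustrative; unlike DBSCAN, K-means needs it fixed in advance):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic accident hot spots.
coords = np.array([
    [43.260, -2.930], [43.261, -2.931],
    [43.300, -2.990], [43.301, -2.991],
])

# Every point gets a zone; K-means has no notion of noise.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = km.labels_
```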
- We characterize the different zones based on the features of the accidents that have occurred in each zone.
Script: Sesion4/extract_features_from_zones.py
Input: Data/Accidents_with_zones_dbscan.csv, Data/Accidents_with_zones_kmeans.csv
Output: Data/Zonas_dbscan.csv, Data/Zonas_kmeans.csv
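Going from labelled accidents to one row per zone is a group-by aggregation; a sketch with hypothetical features (the real script derives its own set):

```python
import pandas as pd

# Toy stand-in for Accidents_with_zones_dbscan.csv.
accidents = pd.DataFrame({
    "zone": [0, 0, 1, 1, 1],
    "latitud": [43.26, 43.27, 43.30, 43.31, 43.29],
    "longitud": [-2.93, -2.94, -2.99, -3.00, -2.98],
})

# One row per zone: accident count plus mean position
# (illustrative features only).
zones = accidents.groupby("zone").agg(
    n_accidents=("zone", "size"),
    lat=("latitud", "mean"),
    lon=("longitud", "mean"),
).reset_index()
```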
- We apply PCA to the zones and perform hierarchical clustering on the results. Then we define zone groups with their features.
Script: Sesion4/PCA_hierarchical.py
Input: Data/Zonas_dbscan.csv
Output: Data/Grupos_zonas.csv
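The PCA-then-hierarchical pipeline can be sketched as follows, with a synthetic per-zone feature matrix standing in for Zonas_dbscan.csv:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Synthetic zone features (rows = zones, columns = features).
X = np.array([
    [10.0, 1.0, 0.2],
    [11.0, 1.1, 0.3],
    [ 2.0, 5.0, 0.9],
    [ 2.5, 5.2, 1.0],
])

# Reduce dimensionality first, then cluster zones hierarchically
# into groups (n_components/n_clusters are illustrative).
X2 = PCA(n_components=2).fit_transform(X)
groups = AgglomerativeClustering(n_clusters=2).fit_predict(X2)
```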
- We modify the Step 2 script to filter the accidents better, and then rerun everything a second time with the filtered accidents.
- We select the road works from the initial incidents CSV file in order to predict their zones.
Script: Sesion5/extract_works.py
Input: Data/IncidenciasTDTGeo.csv
Output: Data/Works.csv
- We create a KNN model, trained on the accidents and their zones, to predict the zone of each work.
Script: Sesion5/KNN-works.py
Input: Data/Works.csv
Output: Works_zones.csv
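A sketch of this step, with synthetic coordinates in place of the accident and works files (the real script also loads the labelled accidents alongside Works.csv):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Train on accident positions labelled with their zone.
accident_coords = np.array([
    [43.260, -2.930], [43.261, -2.931],
    [43.300, -2.990], [43.301, -2.991],
])
accident_zones = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1).fit(accident_coords, accident_zones)

# Assign each road work to the zone of its nearest accident(s).
work_coords = np.array([[43.259, -2.929], [43.302, -2.992]])
work_zones = knn.predict(work_coords)
```

`n_neighbors=1` is illustrative; a larger odd value would make the zone vote more robust to stray accidents.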
- We predict the cluster group of each zone with a decision tree and extract the feature importances. Because the decision tree implementation has a random factor, we run Random Forest several times in order to improve the accuracy of the feature importances.
Script: Sesion6/DecisionTree.py
Script: Sesion6/RandomForest.py
Input: Data/Zones_labels.csv
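The averaging idea can be sketched as follows, with a synthetic feature matrix standing in for Zones_labels.csv; here only the first feature determines the label, so it should dominate the averaged importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic zone features: the label depends only on column 0.
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Averaging importances over several seeds smooths out the
# randomness of any single forest.
importances = np.mean(
    [RandomForestClassifier(n_estimators=50, random_state=s)
     .fit(X, y).feature_importances_
     for s in range(5)],
    axis=0,
)
```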
- Next, we filter the 2007 works to remove the duplicated works.
Script: Sesion7/FilterWorks.py
Input: Data/Works2007.csv
Output: Data/Works2007_filtered.csv
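De-duplication is a one-liner in pandas; the key columns below are hypothetical (the real script may match on different fields):

```python
import pandas as pd

# Toy stand-in for Works2007.csv with one repeated work.
works = pd.DataFrame({
    "road": ["A-8", "A-8", "N-634"],
    "latitud": [43.26, 43.26, 43.30],
    "longitud": [-2.93, -2.93, -2.99],
})

# Keep the first occurrence of each (road, position) combination.
filtered = works.drop_duplicates(subset=["road", "latitud", "longitud"])
```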
- Next, we use the zones and the 2007 works to create a prediction model with the Decision Tree algorithm that predicts the number of works in each zone.
Script: Sesion7/WorksPrediction.py
Input: Data/Works2007_filtered.csv
Output: Data/Zones_with_number_works.csv, Data/Zones_with_discrete_works.csv
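Since the target is a count, this is a regression tree; a sketch with synthetic zone features and work counts in place of the real files:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic zone features and observed work counts
# (the real features come from the zone files built earlier).
X = np.array([[1.0, 0.2], [1.1, 0.3], [5.0, 0.9], [5.2, 1.0]])
n_works = np.array([2, 3, 10, 11])

# max_depth is illustrative; it caps the tree to avoid overfitting.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, n_works)

# Predicted number of works for an unseen zone.
pred = tree.predict([[5.1, 0.95]])
```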
- Each script has a Jupyter notebook with comments.