ARRBHack-2018

Data hacking for the ARRB Hack: solving transport- and infrastructure-related problems

Intro aka TLDR;

This hackathon project was built in about 24 hours for the 'Cycling' and 'Safety' challenges. It tackles cyclist safety through a route-planning web app. If you are not keen on all the technical details, I have made a presentation covering the key points. There is also a prototype running for the Brisbane area. Disclaimer: the safety calculations take about 90-120 seconds, so leave the page open while it loads.

Content

Problem and results
Tooling and solution process
Future improvements

Problem and results

I started researching cyclist safety statistics a few days before the hackathon. I also got familiar with the Queensland open data available on this topic and identified a problem that appealed to me.

More experienced cyclists have a higher chance of being involved in a fatal or serious incident.

From this problem, I identified that commuters who cycle every day could be at considerable risk, as both the demographics and the behavior of this group fit the problem. I decided to use machine learning to predict the potential risk a cyclist faces on a route at a given time, based on historical data.

So the question I proposed was: assuming that every point on the road will have an accident at some point in time, what is the likelihood that the accident is a severe incident and that a cyclist is involved?

Over a few hours, I iterated through 11 ML models that I built using AWS. The models had different feature sets and all used logistic regression to determine whether a portion of a route would have an accident or not. Below are the results that came out of these iterations:

Model	Number of features	Precision	Recall	Accuracy	F1
1	14	0.4286	0.003	0.982	0.006
2	14	0.5	0.019	0.9804	0.0038
3	14	1	0	0.9804	0
4	13	0.4706	0.0074	0.9803	0.0146
5	12	0.625	0.049	0.9813	0.0909
6	10	0.5	0.0019	0.9811	0.0038
7	8	0.5463	0.046	0.9535	0.0849
8	12	0.6713	0.668	0.6773	0.6696
9	8	0.82	0.8169	0.7389	0.8184
10	10	0.6605	0.6757	0.6754	0.668
11	10	0.6621	0.6608	0.6634	0.6614
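As a quick sanity check on the table, the F1 column can be recomputed from precision and recall as their harmonic mean. The snippet below is a minimal sketch, not part of the original notebooks; the function name `f1_score` and the `rows` dictionary are illustrative, and the numbers are copied from rows 8-11 above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from rows 8-11 of the table above
rows = {
    8: (0.6713, 0.668, 0.6696),
    9: (0.82, 0.8169, 0.8184),
    10: (0.6605, 0.6757, 0.668),
    11: (0.6621, 0.6608, 0.6614),
}

for model, (p, r, reported) in rows.items():
    # Print the recomputed value next to the reported one for comparison
    print(model, round(f1_score(p, r), 4), reported)
```

This also shows why accuracy alone is misleading here: models 1-6 report accuracy around 0.98 while recall stays below 0.05, so F1 is the more informative column. Running the same check on row 2 gives roughly 0.037 rather than the listed 0.0038, so one of that row's values may have been mistyped.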

About the process

This repo uses public / open source datasets and Jupyter notebooks to process and visualise them.

If you haven't used Jupyter before, I recommend this easy getting started guide: Jupyter Notebook Beginner Guide.

I used the prebuilt ML model deployments of AWS, as they were a quick and easy way to get started and to produce an endpoint that I could use for the prototype.

After finding a suitable ML model candidate, I exposed its endpoint within the environment set up for the project and used the Serverless Framework to run the data pipeline around it.

There are a few lambdas in the process:

calculate-route lambda function takes the parameters from the front-end request, submits them to the HERE API route planning endpoint, and returns the route to the front end. Based on the route returned by HERE, it also submits a batch request to get an address for each geolocation along the route.
here-get-batch lambda function is triggered when the HERE API sends an email update about the status of the batch process, with a link to download the results. This lambda downloads the contents to an S3 bucket.
prepare-prediction-request lambda function does the feature engineering for the chosen model, using the data collected from the front end and the batch address requests. It saves everything as a .csv file, which is then fed into the model.
get-final-json lambda function wakes up when the ML model has finished the prediction and has an output ready. It converts the prediction result into a format the front end can handle and saves it as a JSON file in an S3 bucket. Meanwhile, the front end has been polling for this JSON file the whole time, using an id originally returned by the first lambda.
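The polling step at the end of this pipeline can be sketched roughly as follows. This is an illustration only, written in Python rather than the front end's actual code: `poll_for_result`, `fake_fetch`, the interval and timeout values, and the shape of the returned JSON are all assumptions, with `fake_fetch` standing in for whatever request the front end makes against the S3 bucket.

```python
import time
from typing import Callable, Optional


def poll_for_result(
    fetch: Callable[[str], Optional[dict]],
    request_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(request_id) until it returns a result or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch(request_id)
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result for {request_id} after {timeout_s}s")
        sleep(interval_s)


# Hypothetical stand-in for the S3 request: the file "appears" on the third attempt.
attempts = {"count": 0}


def fake_fetch(request_id: str) -> Optional[dict]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        return None
    return {"id": request_id, "segments": []}


# Pass a no-op sleep so the demo finishes immediately.
result = poll_for_result(fake_fetch, "abc123", sleep=lambda _: None)
print(result)
```

The timeout matters because the safety calculation takes 90-120 seconds; without one, a failed batch job would leave the page polling forever.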

Future improvements

Speed is a major issue. I should really build and deploy my own model so that it's easier to keep track of timings. Some things about AWS ML are still a bit of a black box, and about 70% of the total time currently goes to the prediction step.
Improving the model. When I started this project, I didn't know much about other types of solutions, so today I would try an SVM instead of a simple logistic regression.
Better committing, commenting, and generally presenting the process. I found it challenging that data exploration is a chunky process, which makes it hard to commit often in a meaningful way.
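For the speed point above, one lightweight way to see where the time goes is to wrap each pipeline stage in a timer and compare the shares. The snippet below is a minimal sketch under that assumption; the `timed` helper, the stage names, and the `time.sleep` calls standing in for real work are all hypothetical and not taken from the project.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def share_of_total(stage: str) -> float:
    """Fraction of the total recorded time spent in one stage."""
    total = sum(timings.values())
    return timings[stage] / total if total else 0.0


# Placeholder workloads; real code would call the routing and prediction steps here.
with timed("routing"):
    time.sleep(0.01)
with timed("prediction"):
    time.sleep(0.05)

for stage in timings:
    print(f"{stage}: {timings[stage]:.3f}s ({share_of_total(stage):.0%})")
```

Logging these shares per request would make it possible to confirm or update the 70% estimate after any change to the model hosting.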

sandra-arato/ARRBHack-2018
