This repository mirrors the driving schools available at imt-ip.pt. It's a collection of scripts that scrape the website, together with the resulting school files.
The goal is to have a 1-1 mapping between each school URL and a file under /parsed-data.
For example, parsed-data/castelo-branco/fundao/035c3e16-a-cereja.edn looks like:
{:address
"Loteamento Quinta do Espírito Santo, lote 10 r/c Frente 6230-329 FUNDÃO",
:name "A Cereja",
:href-id #uuid "035c3e16-af4a-38d1-9233-2173e6767f72",
:distrito "Castelo Branco",
:imt-href
"https://www.imt-ip.pt/sites/IMTT/Portugues/EnsinoConducao/LocalizacaoEscolasConducao/Paginas/ACereja005.aspx",
:concelho "Fundão",
:nec 668,
:cp7 "6230-329",
:id #uuid "2ab49907-1192-3122-9793-687bc6136515",
:concelho-href
"https://www.imt-ip.pt/sites/IMTT/Portugues/EnsinoConducao/LocalizacaoEscolasConducao/Paginas/LocalizacaoEscolasConducao.aspx?Distrito=CasteloBranco&Concelho=Fund%C3%A3o"}
This mirrors the URL under :imt-href:
https://www.imt-ip.pt/sites/IMTT/Portugues/EnsinoConducao/LocalizacaoEscolasConducao/Paginas/ACereja005.aspx
The file path looks like this:
/parsed-data/[:district]/[:municipality]/[:digest]-[:name].edn
:district, :municipality, and :name are self-explanatory.
From the user's point of view, schools are organized geographically: district > municipality (concelho) > school. When you visit imt-ip.pt, you're shown a map of the country; you then click on a district and a municipality, and you're shown a list of schools.
The URL of each school has a unique portion; in the example above it's ACereja005. This is loosely associated with the school name and license number (nec); however, the rules are a bit fuzzy. Instead, we digest the URL (UUID v3, which is MD5-based) and take the first eight characters. In the example above, that is 035c3e16.
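The digest step can be sketched in Python as follows. This is only an illustration under assumptions: it assumes the UUID is built directly from the MD5 of the URL's UTF-8 bytes with no namespace (the way Java's UUID.nameUUIDFromBytes works), and the function name url_digest is hypothetical. The repository's scripts may derive the value differently.

```python
import hashlib
import uuid


def url_digest(url: str, length: int = 8) -> str:
    """Return the first `length` characters of a name-based (v3, MD5) UUID of `url`.

    Assumption: the UUID is the MD5 of the raw URL bytes, with no namespace
    prefix. A namespaced UUID (uuid.uuid3) would give a different value.
    """
    md5_bytes = hashlib.md5(url.encode("utf-8")).digest()
    # version=3 overwrites the version and variant bits, as a v3 UUID requires.
    name_uuid = uuid.UUID(bytes=md5_bytes, version=3)
    return str(name_uuid)[:length]


href = (
    "https://www.imt-ip.pt/sites/IMTT/Portugues/EnsinoConducao/"
    "LocalizacaoEscolasConducao/Paginas/ACereja005.aspx"
)
print(url_digest(href))
```

Because the digest depends only on the URL, the same school page always maps to the same eight-character prefix.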
There are no guarantees that each URL corresponds to one school. It's possible that the name and nec can change for the same URL. To keep our data model consistent, we've appended the name to the file path. Each name change effectively corresponds to a new school.
So, for instance, in the example above, if that school name is updated to :name "Mega School", the new filename becomes 035c3e16-mega-school.edn. All other attribute changes are considered updates rather than inserts.
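As a rough sketch of how such a path could be assembled, the Python below builds it from the district, municipality, digest, and name. The slug rule (lowercase, accents stripped, other characters collapsed into hyphens) is an assumption inferred from the example paths, and slugify and school_path are hypothetical names, not functions from this repository.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase, strip accents, and join the remaining words with hyphens.

    Assumption: this rule is inferred from paths such as
    castelo-branco/fundao; the real scripts may slugify differently.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def school_path(district: str, municipality: str, digest: str, name: str) -> str:
    """Build parsed-data/<district>/<municipality>/<digest>-<name>.edn."""
    filename = f"{digest}-{slugify(name)}.edn"
    return "/".join(["parsed-data", slugify(district), slugify(municipality), filename])


print(school_path("Castelo Branco", "Fundão", "035c3e16", "A Cereja"))
# parsed-data/castelo-branco/fundao/035c3e16-a-cereja.edn
print(school_path("Castelo Branco", "Fundão", "035c3e16", "Mega School"))
# parsed-data/castelo-branco/fundao/035c3e16-mega-school.edn
```

Since the name is part of the filename, a rename produces a new file next to the old one, while the digest prefix still ties both to the same URL.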
With each new data fetch, we check which files weren't updated and add a key representing the time of the last fetch. For example:
{:address "Rua Homem Cristo Filho, n.º 62 – B, 3804-501 Aveiro",
:archived-last-seen-at "2022-05-07T14:52:45Z", ;;<-----
:name "OK Condutor",
:href-id #uuid "abb64eb1-d043-3530-a3db-9edaecc18000",
:distrito "Aveiro",
:imt-href
"https://www.imt-ip.pt/sites/IMTT/Portugues/EnsinoConducao/LocalizacaoEscolasConducao/Paginas/HomemCristo.aspx",
:concelho "Aveiro",
:nec 1426,
:cp7 "3804-501",
:id #uuid "6dd53139-1797-3b3f-918a-57e7f2c695a5",
:concelho-href
"https://www.imt-ip.pt/sites/IMTT/Portugues/EnsinoConducao/LocalizacaoEscolasConducao/Paginas/LocalizacaoEscolasConducao.aspx?Distrito=Aveiro&Concelho=Aveiro"}
:archived-last-seen-at means that this school is "archived" and that it was last seen in the fetch of May 7th, 2022.
Each file in snapshots is a list of all schools online at that moment. We compare the new snapshot to the files in parsed-data to derive :archived-last-seen-at.
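The comparison could look roughly like the Python sketch below. It is a simplified illustration, not the repository's code: the in-memory shapes (a dict of records keyed by file path, and a set of keys present in the newest snapshot), the function name mark_archived, and the choice of which timestamp the caller passes in are all assumptions.

```python
def mark_archived(records: dict, snapshot_keys: set, last_seen_at: str) -> dict:
    """Return a copy of `records` where entries missing from the snapshot are archived.

    records       -- hypothetical mapping of file path -> school attributes
    snapshot_keys -- hypothetical set of file paths present in the newest snapshot
    last_seen_at  -- ISO timestamp to store under "archived-last-seen-at"
    """
    updated = {}
    for key, record in records.items():
        if key in snapshot_keys or "archived-last-seen-at" in record:
            # Still online, or already archived earlier: leave untouched.
            # (A school that reappears after being archived is not handled here.)
            updated[key] = record
        else:
            updated[key] = {**record, "archived-last-seen-at": last_seen_at}
    return updated


records = {
    "aveiro/aveiro/abb64eb1-ok-condutor.edn": {"name": "OK Condutor", "nec": 1426},
    "castelo-branco/fundao/035c3e16-a-cereja.edn": {"name": "A Cereja", "nec": 668},
}
online = {"castelo-branco/fundao/035c3e16-a-cereja.edn"}
print(mark_archived(records, online, "2022-05-07T14:52:45Z"))
```

Records already carrying the key are left alone so that an earlier last-seen time is not overwritten by later fetches.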
Check last-fetch.txt for the timestamp (epoch in milliseconds).
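To read such a value as a date, an epoch in milliseconds can be converted to the ISO format used by :archived-last-seen-at. The helper name epoch_ms_to_iso is hypothetical, and the sample value below was chosen to match the timestamp in the example record; it is not taken from the repository's last-fetch.txt.

```python
from datetime import datetime, timezone


def epoch_ms_to_iso(epoch_ms: int) -> str:
    """Convert milliseconds since the Unix epoch to a UTC ISO 8601 string."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(epoch_ms_to_iso(1651935165000))  # 2022-05-07T14:52:45Z
```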
With each fetch, we create a PR with the new data. The description has a summary of what's changing.
Schools have an associated integer value that is unique for the most part. This is the school license number provided by IMT, sometimes called the "school number", "nec", or alvará. However, the same number sometimes appears associated with more than one school, for several reasons. For instance, it might be that a school closed down and the same license was granted to a different school. Because of this, we added a UUID deterministically generated from:
UUID (school name + license nr. + address)
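A minimal Python sketch of a deterministic identifier of this kind is shown below. Several details are assumptions: the fields are joined with a "|" separator, the UUID is a name-based v3 (MD5) value without a namespace (the sample :id values do carry version digit 3, which is consistent with this), and school_uuid is a hypothetical name. The repository's scripts may combine the fields differently, so this is not expected to reproduce the :id values shown in the examples.

```python
import hashlib
import uuid


def school_uuid(name: str, nec: int, address: str) -> uuid.UUID:
    """Derive a stable UUID from the school name, license number, and address."""
    # Assumption: "|" separator; any fixed, unambiguous joining rule would work.
    key = "|".join([name, str(nec), address])
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # version=3 sets the version and variant bits of a name-based MD5 UUID.
    return uuid.UUID(bytes=digest, version=3)


first = school_uuid("A Cereja", 668, "Loteamento Quinta do Espírito Santo, lote 10")
again = school_uuid("A Cereja", 668, "Loteamento Quinta do Espírito Santo, lote 10")
print(first == again)  # True: same inputs always give the same UUID
```

Two schools that share a license number but differ in name or address get different identifiers, which is the point of including all three fields.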
In some cases, information on the IMT school profile might not be accurate; for instance, the address might be an old one. Manually fixing these instances is outside the scope of the project. However, in cases where the address has a glaring mistake, we manually fix it with overwrites.edn. Here are a few examples:
- Lago Azul zip code: 3620 should be 3260.
duplicates.txt is a list of active schools with the same license number, that is, duplicates in the last snapshot.
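One way to find such duplicates is to group the active schools by license number and keep the groups with more than one entry. The Python below is a sketch under assumptions: records are plain dicts with "name" and "nec" keys, archived schools are recognised by the "archived-last-seen-at" key, and find_duplicates is a hypothetical name. The sample records are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(schools: list) -> dict:
    """Map each license number shared by several active schools to their names."""
    by_nec = defaultdict(list)
    for school in schools:
        if "archived-last-seen-at" in school:
            continue  # archived schools are not part of the last snapshot
        by_nec[school["nec"]].append(school["name"])
    return {nec: names for nec, names in by_nec.items() if len(names) > 1}


# Made-up records, only to exercise the function.
sample = [
    {"name": "School A", "nec": 100},
    {"name": "School B", "nec": 100},
    {"name": "School C", "nec": 200},
    {"name": "School D", "nec": 200, "archived-last-seen-at": "2022-05-07T14:52:45Z"},
]
print(find_duplicates(sample))  # {100: ['School A', 'School B']}
```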
This was originally done for passaprimeira.xyz.
passaprimeira.xyz aims to bring transparency to the driving school business in Portugal by using publicly available data and presenting it in an accessible format. Originally, it was only a web app, but we decided to release some of the data and code so that others could benefit from it.
To reproduce the parsing and data processing, you need babashka and nbb.
npm i
Make sure the temp folder exists.
bb run-all
Check bb.edn for a breakdown of the process.
For a REPL: bb nrepl for .clj files and nbb nrepl-server :port 1337 for *.cljs files.
The code is MIT licensed - basically, you can do what you want with the code, just give this project credit for it.
The data being reproduced here is assumed to be in the public domain. Additionally, when I stated the purpose in the FOI request I filed with IMT, no objection was raised to it being made public.
According to the Open Data Directive, countries are encouraged to make public data accessible, regardless of the end use:
clearly obliged member states to ‘encourage public sector bodies and public undertakings to produce and make available documents [...] in accordance with the principle of “open by design and by default”’.
Here, "encouraging" is the key word. Given that no explicit consent or license was given, in principle, there's always a chance that this project is using data beyond the scope of its intended use. We hope to show that initiatives like this bring about positive changes and that they further encourage government bodies to release data with explicit Open Data licenses.