Add EPC data to Colouring London #896

Draft
wants to merge 64 commits into
base: master
94a2934
add requirements for parquet file handling
edwardchalstrey1 Aug 18, 2022
f73a201
add epc script for conversion and loading
edwardchalstrey1 Aug 18, 2022
3de8c3f
add documentation for adding EPC data
edwardchalstrey1 Aug 18, 2022
d8a3814
update requirements needed for EPC data conversion
edwardchalstrey1 Aug 18, 2022
164d6f3
add drop
edwardchalstrey1 Aug 18, 2022
3e4a2b5
use \copy
edwardchalstrey1 Aug 18, 2022
528f8e8
use COPY
edwardchalstrey1 Aug 18, 2022
4f80fba
add extra column
edwardchalstrey1 Aug 18, 2022
96d20e6
update instructions
edwardchalstrey1 Aug 18, 2022
070f1f0
convert to bash
edwardchalstrey1 Aug 18, 2022
8ad2479
fix script
edwardchalstrey1 Aug 18, 2022
f896397
create temp table
edwardchalstrey1 Aug 18, 2022
8f670fb
fix \copy
edwardchalstrey1 Aug 18, 2022
cc5833a
add column for index
edwardchalstrey1 Aug 18, 2022
5ffe08d
refactor/simplify
edwardchalstrey1 Aug 18, 2022
9dd3954
allow for 'INVALID!' energy rating
edwardchalstrey1 Aug 18, 2022
c034493
even longer
edwardchalstrey1 Aug 18, 2022
f7bf845
make sure there is a header
edwardchalstrey1 Aug 18, 2022
d91b843
try again last commit
edwardchalstrey1 Aug 18, 2022
2928c11
change uprn column type
edwardchalstrey1 Aug 18, 2022
9761d0f
change floor_level to varchar
edwardchalstrey1 Aug 18, 2022
43fd67b
fin last commit
edwardchalstrey1 Aug 18, 2022
9097f30
fin last commit
edwardchalstrey1 Aug 18, 2022
3d081ee
update readme instructions
edwardchalstrey1 Aug 18, 2022
5dddf6f
add todo list
edwardchalstrey1 Aug 18, 2022
54a40d3
move todos to PR
edwardchalstrey1 Aug 26, 2022
6c9ff8c
ignore current_energy_rating with INVALID
edwardchalstrey1 Aug 26, 2022
c28b670
begin floor level func
edwardchalstrey1 Aug 26, 2022
fd1f830
convert most values to int apart from basement and mid floor
edwardchalstrey1 Aug 26, 2022
9d7db0b
add comments
edwardchalstrey1 Aug 26, 2022
19f2c0a
rename
edwardchalstrey1 Aug 26, 2022
c365967
add test
edwardchalstrey1 Aug 26, 2022
8a1880e
rename
edwardchalstrey1 Aug 26, 2022
a0da424
add separate module for funcs
edwardchalstrey1 Aug 26, 2022
d879939
fix test import
edwardchalstrey1 Aug 26, 2022
1c71737
restore removed in error
edwardchalstrey1 Aug 26, 2022
5638524
test passes
edwardchalstrey1 Aug 26, 2022
dfa5c14
fix test
edwardchalstrey1 Aug 26, 2022
8055651
add function being tested
edwardchalstrey1 Aug 26, 2022
bce6b11
refactor
edwardchalstrey1 Aug 26, 2022
3fbfb03
add missing comma
edwardchalstrey1 Aug 26, 2022
59cfedf
update script to map func
edwardchalstrey1 Aug 26, 2022
80ede3f
update rule
edwardchalstrey1 Aug 26, 2022
a177c0d
basement rule
edwardchalstrey1 Aug 26, 2022
4e249d2
cover all floor level variations
edwardchalstrey1 Aug 26, 2022
12b78ab
update docstring
edwardchalstrey1 Aug 26, 2022
6d83ffa
assume floor level is 0 by default
edwardchalstrey1 Aug 26, 2022
c0ed161
refactor so int always returned
edwardchalstrey1 Aug 26, 2022
48707b6
include int and None in test
edwardchalstrey1 Aug 26, 2022
4d78dfd
pass tests
edwardchalstrey1 Aug 26, 2022
77c07c5
update data cleaning script
edwardchalstrey1 Aug 26, 2022
4b181f4
allow return of None for floor level
edwardchalstrey1 Aug 26, 2022
85a83ce
add test for construction_to_int
edwardchalstrey1 Aug 26, 2022
dbf5d46
add func for construction year
edwardchalstrey1 Aug 26, 2022
a61093d
add clean of CONSTRUCTION_AGE_BAND
edwardchalstrey1 Aug 26, 2022
3a83312
make uprn bigint
edwardchalstrey1 Aug 31, 2022
9cd8644
floor level integer
edwardchalstrey1 Aug 31, 2022
ed06d34
change construction_age_band to smallint to match date_year
edwardchalstrey1 Aug 31, 2022
244b918
add emoji
edwardchalstrey1 Aug 31, 2022
35feedc
add to contents
edwardchalstrey1 Aug 31, 2022
ac6fff0
permissions
edwardchalstrey1 Aug 31, 2022
bd75b89
Ensure CONSTRUCTION_AGE_BAND and FLOOR_LEVEL int not float
edwardchalstrey1 Aug 31, 2022
0506ef5
convert UPRN to int
edwardchalstrey1 Sep 1, 2022
c93358f
flake8
edwardchalstrey1 Sep 1, 2022
31 changes: 30 additions & 1 deletion etl/README.md
@@ -8,6 +8,7 @@ The scripts in this directory are used to extract, transform and load (ETL) the
- :penguin: [Making data available to Ubuntu](#penguin-making-data-available-to-ubuntu)
- :new_moon: [Creating a Colouring London database from scratch](#new_moon-creating-a-colouring-london-database-from-scratch)
- :full_moon: [Updating the Colouring London database with new OS data](#full_moon-updating-the-colouring-london-database-with-new-os-data)
- ⚡ [Adding EPC data](#-adding-epc-data)

# :arrow_down: Downloading Ordnance Survey data

@@ -175,4 +176,32 @@ Mark buildings with geometries not present in the update as demolished.

**TODO:** Update this after PR [#794](https://github.com/colouring-cities/colouring-london/pull/794)

Run the Colouring London [deployment scripts](https://github.com/colouring-cities/colouring-london-config#deployment).

# ⚡ Adding EPC data

Download the EPC data by cloning the repository that hosts it.

```
git clone https://github.com/iagw/colouring-cities
```

Copy `gla-epc-subset.zstd.parquet` into `colouring-london/etl`.

```
cp /path/to/gla-epc-subset.zstd.parquet /path/to/colouring-london/etl
```

Convert the data to CSV. Make sure you have an up-to-date Python 3 environment and pip installation, and run `pip install -r requirements.txt` first if you haven't already.

```
python clean_epc_data.py
```

This should create a CSV file called `gla-epc-subset.csv` in the `etl` directory.

Create a new table for the EPC data and load the CSV data into it. If you haven't already, make the script executable with `chmod +x *.sh`.

```
./load_epc.sh
```
3 changes: 2 additions & 1 deletion etl/__init__.py
@@ -1 +1,2 @@
from .filter_mastermap import filter_mastermap
from .epc_cleaning_functions import floor_level_to_int, construction_to_int
36 changes: 36 additions & 0 deletions etl/clean_epc_data.py
@@ -0,0 +1,36 @@
# # Instructions
#
# 1. Download the GLA EPC data from GitHub in parquet format:
# github.com/iagw/colouring-cities/blob/master/gla-epc-subset.zstd.parquet
# 2. Place the file in `colouring-london/etl`
# 3. Run this script to convert it to CSV for easy loading into Postgres

import pandas as pd
from epc_cleaning_functions import floor_level_to_int, construction_to_int

gla = pd.read_parquet('gla-epc-subset.zstd.parquet')

# Remove invalid CURRENT_ENERGY_RATING
gla = gla.replace('INVALID!', None)

# Clean the FLOOR_LEVEL column
gla['FLOOR_LEVEL'] = gla['FLOOR_LEVEL'].apply(floor_level_to_int)

# Clean the CONSTRUCTION_AGE_BAND column
gla['CONSTRUCTION_AGE_BAND'] = gla['CONSTRUCTION_AGE_BAND'].apply(construction_to_int) # noqa: E501

# Remove NaNs and non finite values
with pd.option_context('mode.use_inf_as_null', True):
    gla.dropna(inplace=True)

# Ensure int not float
gla['CONSTRUCTION_AGE_BAND'] = gla['CONSTRUCTION_AGE_BAND'].astype(int)

# Ensure int not float
gla['FLOOR_LEVEL'] = gla['FLOOR_LEVEL'].astype(int)

# Ensure int not float
gla['UPRN'] = gla['UPRN'].astype(int)

# Export to csv
gla.to_csv('gla-epc-subset.csv')
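
The order of the steps in this script matters: a missing value forces a numeric column to floats (pandas stores it as NaN), and neither NaN nor infinity can be cast to `int`, so incomplete rows are dropped before the `astype(int)` calls. A minimal, dependency-free sketch of the same filter-then-cast order follows. The helper name `rows_with_int_columns` and the sample values are hypothetical and not part of the PR.

```python
import math


def rows_with_int_columns(rows, int_columns):
    """Drop rows with missing or non-finite values, then cast to int.

    Hypothetical helper: mirrors the dropna-then-astype(int) order of
    clean_epc_data.py using plain dicts instead of a DataFrame.
    """
    cleaned = []
    for row in rows:
        values = [row.get(col) for col in int_columns]
        # None, NaN and +/-inf cannot be converted to int, so skip the row
        if any(v is None or not math.isfinite(v) for v in values):
            continue
        cleaned.append({**row, **{col: int(row[col]) for col in int_columns}})
    return cleaned


# Made-up sample rows for illustration only
sample = [
    {'UPRN': 100023336956.0, 'FLOOR_LEVEL': 2.0},
    {'UPRN': 100023336957.0, 'FLOOR_LEVEL': float('nan')},
    {'UPRN': float('inf'), 'FLOOR_LEVEL': 0.0},
]
print(rows_with_int_columns(sample, ['UPRN', 'FLOOR_LEVEL']))
# [{'UPRN': 100023336956, 'FLOOR_LEVEL': 2}]
```

Only the first row survives, and its values come out as integers rather than floats.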
52 changes: 52 additions & 0 deletions etl/epc_cleaning_functions.py
@@ -0,0 +1,52 @@
def floor_level_to_int(lvl):
    """Convert differently formatted floor level strings to ints.
    As you can see below, there are some assumptions made such as
    the 'top floor' being 2. This has been done so we can get an int value
    for the floor for each building automatically populated by EPC data.
    Incorrect assumptions can be updated later via the Colouring London
    interface.
    """
    if lvl is None:
        return None
    elif type(lvl) == int:
        return lvl
    # else assume we have a string
    ordinals = ['st', 'nd', 'rd', 'th']
    lvl = lvl.replace('or above', '')
    lvl = lvl.replace('+', '')
    try:
        return int(lvl)
    except ValueError:
        if 'Ground' in lvl or 'ground' in lvl:
            lvl = 0
        elif 'basement' in lvl or 'Basement' in lvl:
            lvl = -1
        elif lvl == 'mid floor':
            lvl = 1
        elif lvl == 'top floor':
            lvl = 2
        elif lvl[0] == '0' and lvl != '0':
            lvl = lvl[1]
        elif any(ordinal in lvl for ordinal in ordinals):
            for ordinal in ordinals:
                lvl = lvl.replace(ordinal, '')
        else:
            return None
    return int(lvl)


def construction_to_int(year):
    if year is None:
        return None
    elif type(year) == int:
        return year
    # else assume we have a string
    if 'before' in year:
        return int(year.split('before ')[-1])
    elif '-' in year:
        return round(sum(list(map(float, year.split(' ')[-1].split('-'))))/2)
    elif 'onwards' in year:
        return int(year.split(' onwards')[-2].split(' ')[-1])
    elif year == 'NO DATA!' or year == 'INVALID!':
        return None
    return int(year)
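
One detail of the midpoint calculation in `construction_to_int` is that Python 3's built-in `round` rounds halves to the nearest even integer, so bands whose midpoint ends in .5 are not all rounded in the same direction. The sketch below illustrates the effect. `band_midpoint` and `band_midpoint_half_up` are hypothetical helpers, and the `'1983-1990'` band is an assumed example rather than a value confirmed from the data.

```python
import math


def band_midpoint(band):
    """Midpoint of a 'YYYY-YYYY' band using built-in round (half to even)."""
    start, end = map(float, band.split('-'))
    return round((start + end) / 2)


def band_midpoint_half_up(band):
    """Alternative that always rounds a .5 midpoint upwards."""
    start, end = map(float, band.split('-'))
    return math.floor((start + end) / 2 + 0.5)


for band in ['1991-1996', '1983-1990']:
    print(band, band_midpoint(band), band_midpoint_half_up(band))
# 1991-1996 1994 1994
# 1983-1990 1986 1987
```

Either rule is only an approximation of an age band, so whether the difference matters depends on how the estimated year is used downstream.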
20 changes: 20 additions & 0 deletions etl/load_epc.sh
@@ -0,0 +1,20 @@
psql -c "DROP TABLE IF EXISTS epc;"

# Create EPC data table
## construction_age_band should match date_year in buildings table
## uprn and toid can also be linked to building table
psql -c "
CREATE TABLE epc (
    index integer,
    current_energy_rating char(1),
    lodgement_date timestamp,
    floor_level integer,
    construction_age_band smallint,
    uprn bigint,
    epc_data_from_file varchar,
    toid varchar
);
"

# Read in the EPC data
psql -c "\copy epc FROM 'gla-epc-subset.csv' DELIMITER ',' CSV HEADER;"
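
`\copy` without a column list fills the table columns in order from the CSV fields and simply skips the header line, so the CSV written by `clean_epc_data.py` needs the same number of columns in the same order as the `CREATE TABLE` statement. The unnamed first CSV column is the DataFrame index that `to_csv` writes by default, which is what the `index` column receives. A rough pre-flight check might look like the sketch below. It assumes each CSV header is the upper-case form of the table column name (only some of these headers appear in this PR), and `check_header` is a hypothetical helper.

```python
import csv
import io

# Column order taken from the CREATE TABLE statement in load_epc.sh
TABLE_COLUMNS = [
    'index', 'current_energy_rating', 'lodgement_date', 'floor_level',
    'construction_age_band', 'uprn', 'epc_data_from_file', 'toid',
]


def check_header(csv_file, table_columns=TABLE_COLUMNS):
    """Return True if the CSV header lines up with the table columns.

    Assumes (not confirmed by the PR) that each CSV header is the
    upper-case table column name, and that the first header is empty
    because pandas writes the unnamed index there.
    """
    header = next(csv.reader(csv_file))
    # Treat the unnamed index column as 'index'
    normalised = [name.lower() or 'index' for name in header]
    return normalised == list(table_columns)


# Made-up header line for illustration
example = io.StringIO(
    ',CURRENT_ENERGY_RATING,LODGEMENT_DATE,FLOOR_LEVEL,'
    'CONSTRUCTION_AGE_BAND,UPRN,EPC_DATA_FROM_FILE,TOID\n'
)
print(check_header(example))
# True
```

To check the real file, the same function could be called with `open('gla-epc-subset.csv', newline='')` before running `load_epc.sh`.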
4 changes: 4 additions & 0 deletions etl/requirements.txt
@@ -5,3 +5,7 @@ psycopg2==2.7.5
shapely==1.7
retrying==1.3.3
requests==2.23.0
pyarrow
fastparquet
cython
pandas
21 changes: 21 additions & 0 deletions tests/test_epc.py
@@ -0,0 +1,21 @@
import pytest
from etl import floor_level_to_int, construction_to_int


def test_floor_level_to_int():
    """Test that differently formatted floors can be correctly converted."""
    test_levels = ['01', '02', '1st', '2nd', '3rd', '4th', '1', '2', '0',
                   'Ground', 'NODATA!', 'mid floor', 'Basement',
                   'ground floor', '21st or above', 'top floor', '00', '20+',
                   None, 5]
    expected = [1, 2, 1, 2, 3, 4, 1, 2, 0, 0, None, 1, -1, 0, 21, 2, 0, 20,
                None, 5]
    for lvl, ex in zip(test_levels, expected):
        assert floor_level_to_int(lvl) == ex


def test_construction_to_int():
    """Test that differently formatted construction ages can be correctly converted."""  # noqa: E501
    test_dates = ['England and Wales: before 1900', None,
                  'England and Wales: 1991-1996', 'NO DATA!',
                  'England and Wales: 2007 onwards', 'INVALID!', '1950']
    expected = [1900, None, 1994, None, 2007, None, 1950]
    for date, ex in zip(test_dates, expected):
        assert construction_to_int(date) == ex