📊 energy: Get Eurostat data on energy prices (#3499)
* 📊 energy: Get Eurostat data on energy prices

* Add snapshot, and create data steps skeleton

* Fix missing dataset code

* Prepare meadow step

* Prepare garden step (WIP)

* Harmonize country names and other improvements of garden step

* Keep working on garden step, mostly mapping different fields

* Map energy price components

* Improve garden dataset

* Adapt code to ignore historical data

* Improve garden step

* Working on garden step (still WIP)

* Garden step (WIP)

* Improve garden step

* Work on garden step and start grapher step

* Prepare grapher step

* Improve metadata

* Impose that certain price components need to be informed

* Add data from prices datasets, include checks, and improve metadata

* Fix table name

* Create another grapher step for prices and improve metadata

* Improve grapher steps

* Remove repeated step in the dag

* Add documentation explaining findings about components

* Add sanity checks and remove TODOs

* Improve metadata

* Add key descriptions to price component variables

* Add key descriptions to prices variables

* Add short descriptions

* Add analysis comparing price components data and price data

* Improve checks

* Add total price to the components data

* Improve metadata

* Improve metadata

* Improve metadata

* Improve format

* Add monthly wholesale electricity prices from Ember

* Add IEA fossil fuel subsidies data (WIP)

* Adapt meadow step

* Adapt garden and grapher steps

* Adapt garden and grapher steps

* Include additional indicators from IEA

* Add other IEA indicators and improve metadata

* Fix old read_table

* Fix export steps ignored by PathFinder

* Create energy prices dataset and mdim explorer

* Various improvements

* Add missing pps data

* Add price components views and other improvements

* Add map tabs (not working properly, there might be a bug somewhere)

* Simplify Eurostat steps

* Delete Eurostat grapher steps and simplify

* Refactor mdim step and add general function to multidim module

* Complete to-do

* Remove unnecessary grapher step

* Homogenize prices

* Remove to-do

* Improve metadata

* Trim long variable names

* Simplify function that expands views, and add documentation

* Update the docs
pabloarosado authored Dec 4, 2024
1 parent 1018780 commit 3a36b22
Showing 29 changed files with 3,506 additions and 74 deletions.
51 changes: 51 additions & 0 deletions dag/energy.yml
@@ -228,3 +228,54 @@ steps:
  #
  data://grapher/energy/2024-11-15/photovoltaic_cost_and_capacity:
    - data://garden/energy/2024-11-15/photovoltaic_cost_and_capacity
  #
  # Eurostat - Energy statistics, prices of natural gas and electricity
  #
  data://meadow/eurostat/2024-11-05/gas_and_electricity_prices:
    - snapshot://eurostat/2024-11-05/gas_and_electricity_prices.zip
  #
  # Eurostat - Energy statistics, prices of natural gas and electricity
  #
  data://garden/eurostat/2024-11-05/gas_and_electricity_prices:
    - data://meadow/eurostat/2024-11-05/gas_and_electricity_prices
  #
  # Ember - European wholesale electricity prices
  #
  data://meadow/ember/2024-11-20/european_wholesale_electricity_prices:
    - snapshot://ember/2024-11-20/european_wholesale_electricity_prices.csv
  #
  # Ember - European wholesale electricity prices
  #
  data://garden/ember/2024-11-20/european_wholesale_electricity_prices:
    - data://meadow/ember/2024-11-20/european_wholesale_electricity_prices
  #
  # IEA - Fossil fuel subsidies
  #
  data://meadow/iea/2024-11-20/fossil_fuel_subsidies:
    - snapshot://iea/2024-11-20/fossil_fuel_subsidies.xlsx
  #
  # IEA - Fossil fuel subsidies
  #
  data://garden/iea/2024-11-20/fossil_fuel_subsidies:
    - data://meadow/iea/2024-11-20/fossil_fuel_subsidies
  #
  # IEA - Fossil fuel subsidies
  #
  data://grapher/iea/2024-11-20/fossil_fuel_subsidies:
    - data://garden/iea/2024-11-20/fossil_fuel_subsidies
  #
  # Energy prices
  #
  data://garden/energy/2024-11-20/energy_prices:
    - data://garden/eurostat/2024-11-05/gas_and_electricity_prices
    - data://garden/ember/2024-11-20/european_wholesale_electricity_prices
  #
  # Energy prices
  #
  data://grapher/energy/2024-11-20/energy_prices:
    - data://garden/energy/2024-11-20/energy_prices
  #
  # Energy prices explorer
  #
  export://multidim/energy/latest/energy_prices:
    - data://grapher/energy/2024-11-20/energy_prices
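
Reviewer note: to make the dependency chain behind the new export step easier to follow, here is a minimal sketch that resolves one valid execution order with Python's standard `graphlib`. The `dag` dictionary is copied by hand from the entries above (energy prices chain only, IEA steps omitted). It only illustrates the ordering and is not how the ETL loads or schedules steps.

```python
from graphlib import TopologicalSorter

# Hand-copied subset of the DAG entries above: step -> set of its dependencies.
EXPORT_STEP = "export://multidim/energy/latest/energy_prices"
dag = {
    "data://meadow/eurostat/2024-11-05/gas_and_electricity_prices": {
        "snapshot://eurostat/2024-11-05/gas_and_electricity_prices.zip",
    },
    "data://garden/eurostat/2024-11-05/gas_and_electricity_prices": {
        "data://meadow/eurostat/2024-11-05/gas_and_electricity_prices",
    },
    "data://meadow/ember/2024-11-20/european_wholesale_electricity_prices": {
        "snapshot://ember/2024-11-20/european_wholesale_electricity_prices.csv",
    },
    "data://garden/ember/2024-11-20/european_wholesale_electricity_prices": {
        "data://meadow/ember/2024-11-20/european_wholesale_electricity_prices",
    },
    "data://garden/energy/2024-11-20/energy_prices": {
        "data://garden/eurostat/2024-11-05/gas_and_electricity_prices",
        "data://garden/ember/2024-11-20/european_wholesale_electricity_prices",
    },
    "data://grapher/energy/2024-11-20/energy_prices": {
        "data://garden/energy/2024-11-20/energy_prices",
    },
    EXPORT_STEP: {
        "data://grapher/energy/2024-11-20/energy_prices",
    },
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
for step in order:
    print(step)
```

Because every other step in this subset is an ancestor of the export step, the export step always comes last, whichever valid order the sorter picks.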
138 changes: 65 additions & 73 deletions docs/guides/data-work/export-data.md
@@ -31,101 +31,93 @@ ds_explorer.save()

Multi-dimensional indicators are powered by a configuration that is typically created from a YAML file. The structure of the YAML file looks like this:

```yaml title="etl/steps/export/multidim/covid/latest/covid.deaths.yaml"
definitions:
table: {definitions.table}

```yaml title="etl/steps/export/multidim/energy/latest/energy_prices.yaml"
title:
title: COVID-19 deaths
titleVariant: by interval
title: "Energy prices"
titleVariant: "by energy source"
defaultSelection:
- World
- Europe
- Asia
- "European Union (27)"
topicTags:
- COVID-19

- "Energy"
dimensions:
- slug: interval
name: Interval
- slug: "frequency"
name: "Frequency"
choices:
- slug: weekly
name: Weekly
description: null
- slug: biweekly
name: Biweekly
description: null

- slug: metric
name: Metric
- slug: "annual"
name: "Annual"
description: "Annual data"
- slug: "monthly"
name: "Monthly"
description: "Monthly data"
- slug: "source"
name: "Energy source"
choices:
- slug: absolute
name: Absolute
description: null
- slug: per_capita
name: Per million people
description: null
- slug: change
name: Change from previous interval
description: null

- slug: "electricity"
name: "Electricity"
- slug: "gas"
name: "Gas"
- slug: "unit"
name: "Unit"
choices:
- slug: "euro"
name: "Euro"
description: "Price in euros"
- slug: "pps"
name: "PPS"
description: "Price in Purchasing Power Standard"
views:
- dimensions:
interval: weekly
metric: absolute
indicators:
y: "{definitions.table}#weekly_deaths"
- dimensions:
interval: weekly
metric: per_capita
indicators:
y: "{definitions.table}#weekly_deaths_per_million"
- dimensions:
interval: weekly
metric: change
indicators:
y: "{definitions.table}#weekly_pct_growth_deaths"

- dimensions:
interval: biweekly
metric: absolute
indicators:
y: "{definitions.table}#biweekly_deaths"
- dimensions:
interval: biweekly
metric: per_capita
indicators:
y: "{definitions.table}#biweekly_deaths_per_million"
- dimensions:
interval: biweekly
metric: change
indicators:
y: "{definitions.table}#biweekly_pct_growth_deaths"
# Views will be filled out programmatically.
[]

```

The `dimensions` field specifies selectors, and the `views` field defines views for the selection. Since there are numerous possible configurations, `views` are usually generated programmatically. However, it's a good idea to create a few of them manually to start.
The `dimensions` field specifies selectors, and the `views` field defines views for the selection. Since there are numerous possible configurations, `views` are usually generated programmatically (using function `etl.multidim.generate_views_for_dimensions`).

You can also combine manually defined views with generated ones. See the `etl.multidim` module for available helper functions or refer to examples from `etl/steps/export/multidim/`. Feel free to add or modify the helper functions as needed.
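
Reviewer note: one possible way to combine manual and generated views is sketched below. `merge_views` is a hypothetical helper and not part of `etl.multidim`; the indicator paths are shortened placeholders. It assumes each view is a dictionary with a `dimensions` mapping, as in the YAML above, and lets a manually defined view replace a generated one that has the same dimensions.

```python
from typing import Any, Dict, List, Tuple

View = Dict[str, Any]


def _dimensions_key(view: View) -> Tuple[Tuple[str, str], ...]:
    # Order-independent key identifying a view by its dimension choices.
    return tuple(sorted(view["dimensions"].items()))


def merge_views(manual_views: List[View], generated_views: List[View]) -> List[View]:
    """Hypothetical helper: manual views override generated views with the same dimensions."""
    merged = {_dimensions_key(view): view for view in generated_views}
    merged.update({_dimensions_key(view): view for view in manual_views})
    return list(merged.values())


# Placeholder views (indicator paths are illustrative, not real catalog paths).
generated = [
    {"dimensions": {"frequency": "annual", "source": "gas"}, "indicators": {"y": "table#annual_gas"}},
    {"dimensions": {"frequency": "monthly", "source": "gas"}, "indicators": {"y": "table#monthly_gas"}},
]
manual = [
    {
        "dimensions": {"source": "gas", "frequency": "annual"},
        "indicators": {"y": "table#annual_gas"},
        "config": {"hasMapTab": False},
    },
]

views = merge_views(manual, generated)
```

Keying on the sorted dimension items means the order in which dimensions are written in a manual view does not matter.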

The export step loads the YAML file, adds `views` to the config, and then calls the function.
The export step loads the data dependencies and the config YAML file, adds `views` to the config, and then pushes the configuration to the database.

```python title="etl/steps/export/multidim/covid/latest/covid.py"
```python title="etl/steps/export/multidim/energy/latest/energy_prices.py"
def run(dest_dir: str) -> None:
engine = get_engine()
# Load YAML file
config = paths.load_mdim_config("covid.deaths.yaml")
#
# Load inputs.
#
# Load data on energy prices.
ds_grapher = paths.load_dataset("energy_prices")

# Read table of prices in euros.
tb_annual = ds_grapher.read("energy_prices_annual")
tb_monthly = ds_grapher.read("energy_prices_monthly")

#
# Process data.
#
# Load configuration from adjacent yaml file.
config = paths.load_mdim_config()

# Create views.
config["views"] = multidim.generate_views_for_dimensions(
dimensions=config["dimensions"],
tables=[tb_annual, tb_monthly],
dimensions_order_in_slug=("frequency", "source", "unit"),
warn_on_missing_combinations=False,
additional_config={"chartTypes": ["LineChart"], "hasMapTab": True, "tab": "map"},
)

#
# Save outputs.
#
multidim.upsert_multidim_data_page(slug="mdd-energy-prices", config=config, engine=get_engine())

multidim.upsert_multidim_data_page("mdd-energy", config, engine)
```

To see the multi-dimensional indicator in Admin, run

```bash
etlr export://multidim/energy/latest/energy --export
etlr export://multidim/energy/latest/energy_prices --export
```

and check out the preview at http://staging-site-my-branch/admin/grapher/mdd-name.
and check out the preview at: http://staging-site-my-branch/admin/grapher/mdd-energy-prices


## Exporting data to GitHub
2 changes: 1 addition & 1 deletion etl/helpers.py
@@ -594,7 +594,7 @@ def _get_attributes_from_step_name(step_name: str) -> Dict[str, str]:
     if channel_type.startswith(("walden", "snapshot")):
         channel = channel_type
         namespace, version, short_name = path.split("/")
-    elif channel_type.startswith(("data",)):
+    elif channel_type.startswith(("data", "export")):
         channel, namespace, version, short_name = path.split("/")
     else:
         raise WrongStepName
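
Reviewer note: the effect of this one-line change can be illustrated with a simplified stand-in for the parsing logic. `parse_step_name` and the shape of its return value are assumptions made for this sketch. Only the branching shown in the hunk above comes from the source, so the real `_get_attributes_from_step_name` may return more or different fields.

```python
from typing import Dict


class WrongStepName(Exception):
    """Stand-in for the exception raised on unrecognised step names."""


def parse_step_name(step_name: str) -> Dict[str, str]:
    """Simplified sketch of the branching in the hunk above (not the real helper)."""
    channel_type, path = step_name.split("://", 1)
    if channel_type.startswith(("walden", "snapshot")):
        # Snapshot-like steps: the URI scheme itself is the channel.
        channel = channel_type
        namespace, version, short_name = path.split("/")
    elif channel_type.startswith(("data", "export")):
        # Data and export steps: the channel is the first path component.
        channel, namespace, version, short_name = path.split("/")
    else:
        raise WrongStepName(step_name)
    return {"channel": channel, "namespace": namespace, "version": version, "short_name": short_name}


export_attrs = parse_step_name("export://multidim/energy/latest/energy_prices")
snapshot_attrs = parse_step_name("snapshot://eurostat/2024-11-05/gas_and_electricity_prices.zip")
```

With `"export"` accepted, an `export://` step is split the same way as a `data://` step, so `multidim` is treated as the channel.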
106 changes: 106 additions & 0 deletions etl/multidim.py
@@ -1,14 +1,20 @@
import json
from itertools import product

import pandas as pd
import yaml
from sqlalchemy.engine import Engine
from structlog import get_logger

from apps.chart_sync.admin_api import AdminAPI
from etl.config import OWID_ENV
from etl.db import read_sql
from etl.grapher_io import trim_long_variable_name
from etl.helpers import map_indicator_path_to_id

# Initialize logger.
log = get_logger()


def upsert_multidim_data_page(slug: str, config: dict, engine: Engine) -> None:
    validate_multidim_config(config, engine)
@@ -162,3 +168,103 @@ def fetch_variables_from_table(table: str, engine: Engine) -> pd.DataFrame:
    df_dims = pd.DataFrame(dims, index=df.index)

    return df.join(df_dims)


def generate_views_for_dimensions(
    dimensions, tables, dimensions_order_in_slug=None, additional_config=None, warn_on_missing_combinations=True
):
    """Generate individual views for all possible combinations of dimensions in a list of flattened tables.

    Parameters
    ----------
    dimensions : List[Dict[str, Any]]
        Dimensions, as given in the configuration of the multidim step, e.g.
        [
            {'slug': 'frequency', 'name': 'Frequency', 'choices': [{'slug': 'annual','name': 'Annual'}, {'slug': 'monthly', 'name': 'Monthly'}]},
            {'slug': 'source', 'name': 'Energy source', 'choices': [{'slug': 'electricity', 'name': 'Electricity'}, {'slug': 'gas', 'name': 'Gas'}]},
            ...
        ]
    tables : List[Table]
        Tables whose indicator views will be generated.
    dimensions_order_in_slug : Tuple[str], optional
        Dimension names, as they appear in "dimensions", and in the order in which they are spelled out in indicator names. For example, if indicator names are, e.g. annual_electricity_euro, then dimensions_order_in_slug would be ("frequency", "source", "unit").
    additional_config : Dict[str, Any], optional
        Additional config fields to add to each view, e.g.
        {"chartTypes": ["LineChart"], "hasMapTab": True, "tab": "map"}
    warn_on_missing_combinations : bool, optional
        True to warn if any combination of dimensions is not found among the indicators in the given tables.

    Returns
    -------
    results : List[Dict[str, Any]]
        Views configuration, e.g.
        [
            {'dimensions': {'frequency': 'annual', 'source': 'electricity', 'unit': 'euro'}, 'indicators': {'y': 'grapher/energy/2024-11-20/energy_prices/energy_prices_annual#annual_electricity_household_total_price_including_taxes_euro'}},
            {'dimensions': {'frequency': 'annual', 'source': 'electricity', 'unit': 'pps'}, 'indicators': {'y': 'grapher/energy/2024-11-20/energy_prices/energy_prices_annual#annual_electricity_household_total_price_including_taxes_pps'}},
            ...
        ]

    """
    # Map each dimension slug to the list of its choice slugs.
    choices = {dim["slug"]: [choice["slug"] for choice in dim["choices"]] for dim in dimensions}
    dimension_slugs_in_config = set(choices.keys())

    # Sanity check for dimensions_order_in_slug.
    if dimensions_order_in_slug:
        dimension_slugs_in_order = set(dimensions_order_in_slug)

        # Check if any slug in the order is missing from the config.
        missing_slugs = dimension_slugs_in_order - dimension_slugs_in_config
        if missing_slugs:
            raise ValueError(
                f"The following dimensions are in 'dimensions_order_in_slug' but not in the config: {missing_slugs}"
            )

        # Check if any slug in the config is missing from the order.
        extra_slugs = dimension_slugs_in_config - dimension_slugs_in_order
        if extra_slugs:
            log.warning(
                f"The following dimensions are in the config but not in 'dimensions_order_in_slug': {extra_slugs}"
            )

        # Reorder choices to match the specified order.
        choices = {dim: choices[dim] for dim in dimensions_order_in_slug if dim in choices}

    # Generate all combinations of the choices.
    all_combinations = list(product(*choices.values()))

    # Create the views.
    results = []
    for combination in all_combinations:
        # Map dimension slugs to the chosen values.
        dimension_mapping = {dim_slug: choice for dim_slug, choice in zip(choices.keys(), combination)}
        slug_combination = "_".join(combination)

        # Find relevant tables for the current combination.
        relevant_table = []
        for table in tables:
            if slug_combination in table:
                relevant_table.append(table)

        # Handle missing or multiple table matches.
        if len(relevant_table) == 0:
            if warn_on_missing_combinations:
                log.warning(f"Combination {slug_combination} not found in tables")
            continue
        elif len(relevant_table) > 1:
            log.warning(f"Combination {slug_combination} found in multiple tables: {relevant_table}")

        # Construct the indicator path (if there are multiple matches, the first table is used).
        indicator_path = f"{relevant_table[0].metadata.dataset.uri}/{relevant_table[0].metadata.short_name}#{trim_long_variable_name(slug_combination)}"
        indicators = {
            "y": indicator_path,
        }
        # Append the combination to results.
        results.append({"dimensions": dimension_mapping, "indicators": indicators})

    if additional_config:
        # Include additional fields in all results.
        for result in results:
            result.update({"config": additional_config})

    return results
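
Reviewer note: the core of the function above (take the Cartesian product of the dimension choices, join each combination into a column slug, look that slug up in the tables) can be tried in isolation with the sketch below. The plain dictionary of column names stands in for the real `Table` objects, and the column names and indicator paths are shortened, hypothetical ones.

```python
from itertools import product

dimensions = [
    {"slug": "frequency", "choices": [{"slug": "annual"}, {"slug": "monthly"}]},
    {"slug": "source", "choices": [{"slug": "electricity"}, {"slug": "gas"}]},
    {"slug": "unit", "choices": [{"slug": "euro"}, {"slug": "pps"}]},
]

# Stand-ins for tables: table name -> column names (hypothetical, shortened).
tables = {
    "energy_prices_annual": ["annual_electricity_euro", "annual_electricity_pps", "annual_gas_euro"],
    "energy_prices_monthly": ["monthly_electricity_euro"],
}

# Map each dimension slug to the list of its choice slugs.
choices = {dim["slug"]: [choice["slug"] for choice in dim["choices"]] for dim in dimensions}

views = []
missing = []
for combination in product(*choices.values()):
    column = "_".join(combination)
    matches = [name for name, columns in tables.items() if column in columns]
    if not matches:
        # The real function optionally logs a warning here and skips the combination.
        missing.append(column)
        continue
    views.append(
        {
            "dimensions": dict(zip(choices, combination)),
            "indicators": {"y": f"{matches[0]}#{column}"},
        }
    )

print(f"{len(views)} views created, {len(missing)} combinations not found")
```

The lookup only works if the indicator names spell out the choice slugs in the same order as the dimensions, which is what `dimensions_order_in_slug` controls in the real function.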
@@ -0,0 +1,31 @@
{
"Austria": "Austria",
"Belgium": "Belgium",
"Bulgaria": "Bulgaria",
"Croatia": "Croatia",
"Czechia": "Czechia",
"Denmark": "Denmark",
"Estonia": "Estonia",
"Finland": "Finland",
"France": "France",
"Germany": "Germany",
"Greece": "Greece",
"Hungary": "Hungary",
"Ireland": "Ireland",
"Italy": "Italy",
"Latvia": "Latvia",
"Lithuania": "Lithuania",
"Luxembourg": "Luxembourg",
"Netherlands": "Netherlands",
"North Macedonia": "North Macedonia",
"Norway": "Norway",
"Poland": "Poland",
"Portugal": "Portugal",
"Romania": "Romania",
"Serbia": "Serbia",
"Slovakia": "Slovakia",
"Slovenia": "Slovenia",
"Spain": "Spain",
"Sweden": "Sweden",
"Switzerland": "Switzerland"
}