[Feature] Shared defaults for load_yaml_dags #297

Closed
1 task done
wearpants opened this issue Nov 21, 2024 · 9 comments · Fixed by #330

@wearpants

Description

It'd be nice to pass some shared default_args for a directory, either via a Python object or a defaults.yml file in the directory.

Use case/motivation

One DAG per file is easier for users IMO, and as a system administrator I'd like to be able to give them a shared set of pre-baked defaults (env vars, etc.)

Related issues

#289, #290

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
@cmarteepants
Collaborator

Something like this?

./dags/default.yml

default:
  catchup: false
  default_args:
    start_date: "2024-01-01"
  schedule_interval: "0 0 * * *"
  tasks:
    extract:
      operator: airflow.operators.python.PythonOperator
      python_callable_file: /usr/local/airflow/include/etl_helpers.py
      python_callable_name: extract_helper
    load:
      dependencies:
      - transform
      operator: airflow.operators.python.PythonOperator
      python_callable_file: /usr/local/airflow/include/etl_helpers.py
      python_callable_name: load_helper
    transform:
      dependencies:
      - extract
      op_kwargs:
        ds_nodash: '{{ds_nodash}}'
      operator: airflow.operators.python.PythonOperator
      python_callable_file: /usr/local/airflow/include/etl_helpers.py
      python_callable_name: transform_helper

./dags/bi.yml

business_analytics:
  schedule_interval: "@daily"
  tasks:
    load:
      op_kwargs:
        database_name: BA
        table_name: inventory

./dags/ds.yml

data_science:
  tasks:
    load:
      op_kwargs:
        database_name: DS
        table_name: daily_sales

./dags/ml.yml

machine_learning:
  tasks:
    load:
      op_kwargs:
        database_name: ML
        table_name: training_data

...

@jroach-astronomer
Member

@cmarteepants, the only thing I think I'd add here is referencing the default values in ./dags/bi.yml, etc.

@wearpants
Author

@cmarteepants So basically bi.yml etc. are merged on top of default.yml? Could you clarify how that works: does the merge happen across the entire YAML object tree, key by key, with lists extended, etc.? How would you do overrides? (Take a look at ChainMap for a simple comparison.)

I had mainly been thinking of this only for default_args, with defaults.yml not providing any tasks (cross-DAG dependencies could cover that). But if defaults.yml is more like a template or base class that can be extended and overridden, that opens up some interesting possibilities, though it's not totally clear how that would work.

Docker Compose does something similar, but the merge rules are kind of ad hoc yet sensible.
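
To make the ChainMap comparison concrete, here is a minimal sketch contrasting a shallow, first-match lookup with a recursive key-by-key merge. The dictionaries are trimmed-down versions of the default and business_analytics blocks from the earlier example. deep_merge is a hypothetical helper written for illustration only; it is not claimed to be what dag-factory does internally, and its choice to replace lists rather than extend them is just one possible rule.

```python
from collections import ChainMap
from copy import deepcopy

# Trimmed-down stand-ins for the "default" and "business_analytics" YAML blocks.
default = {
    "schedule_interval": "0 0 * * *",
    "tasks": {
        "load": {
            "operator": "airflow.operators.python.PythonOperator",
            "dependencies": ["transform"],
        },
    },
}
business_analytics = {
    "schedule_interval": "@daily",
    "tasks": {"load": {"op_kwargs": {"database_name": "BA"}}},
}

# Shallow lookup: the first mapping that has a top-level key wins outright,
# so the override's "tasks" value hides the default's "tasks" entirely.
shallow = dict(ChainMap(business_analytics, default))


def deep_merge(base, override):
    """Hypothetical recursive merge: nested dicts are merged key by key,
    while any other value (including lists) is replaced by the override."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


deep = deep_merge(default, business_analytics)
```

With the shallow lookup, the load task loses its operator and dependencies because the whole tasks mapping is taken from the override. With the recursive merge, the load task keeps the inherited operator and gains the new op_kwargs, which is closer to the "template that can be extended" behaviour being asked about.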

@cmarteepants
Collaborator

@wearpants If everything is contained in the same YAML file today, then yes, anything in default is more like a template that can be extended AND overridden.

As for the how? I'll be honest: I haven't delved much into the source code to understand how this was implemented. Could be something we are getting "for free" from pyyaml, but I never looked into it, as the capability was around from before Astronomer took over the project. I opened up issue #295 so we can document this properly. The examples in the issue are for extending, overriding and even generating the exact same DAG structure with different task ids, and they all work.

I really like your idea about splitting up the definitions into different files though, and allowing for different defaults per folder. I'd even go so far as to push that as a best practice. We'd need to allow for an order of precedence, but assuming we can pull it off (and I don't see why not, but I'm the PM :D) I agree, I think it would be really powerful.

I'll have someone on the engineering team start looking into this within the next few sprints. Do you want to be kept up to date as things progress?

@wearpants
Author

@cmarteepants yes, please keep me in the loop. Happy to hop on a quick design brainstorming call as well if that would be helpful.

@timkpaine

Sounds like a perfect use for https://hydra.cc
Might I suggest https://github.com/airflow-laminar/airflow-config

@tatiana
Collaborator

tatiana commented Jan 3, 2025

@wearpants please let us know if you have any feedback on the implementation: #330

We're planning to release this in DAG Factory 0.22, on 10 January - let us know if you have any thoughts.

@wearpants
Author

This is confusing:

  1. At DAG configuration YML
  2. At the top of the YML file
  3. default.yml

I understand (2), and (3) makes sense, but where/what is (1)?

Also, what's the precedence order between defaults.yml, default_args in my_dag.yml and the python-level self.config['default']? Extremely confusing.

Also, looking ahead to #290, I'd like that to have a defaults.yml per folder (like: "all the DAG files in this folder get these environment variables"). #330 seems to support a global singleton, and I worry a 4th option would be even more confusing.

All this just might be an issue of documentation however.
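
As a rough illustration of the per-folder idea, the sketch below lists which defaults.yml files could apply to a given DAG file, ordered from the DAGs root down to the file's own folder so that nearer folders can be applied last. The function name candidate_defaults_files, the example paths, and the "closest folder wins" ordering are all assumptions made for this example; nothing here reflects what #330 or #290 actually implement.

```python
from pathlib import PurePosixPath


def candidate_defaults_files(dag_file, dags_root, name="defaults.yml"):
    """Hypothetical helper: return the defaults files that could apply to
    dag_file, from the DAGs root (least specific) to the DAG's own folder
    (most specific). Raises ValueError if dag_file is outside dags_root."""
    root = PurePosixPath(dags_root)
    relative = PurePosixPath(dag_file).parent.relative_to(root)

    # Walk from the root down one path component at a time.
    folders = [root]
    for part in relative.parts:
        folders.append(folders[-1] / part)
    return [str(folder / name) for folder in folders]


# Example with made-up paths: a DAG two folders below the DAGs root.
paths = candidate_defaults_files("/dags/team_a/etl/my_dag.yml", "/dags")
```

A loader following this scheme would read whichever of these files exist and merge them in order, so a folder-level file overrides the root-level one. Only path arithmetic is shown here; no files are read.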

@pankajastro
Collaborator

pankajastro commented Jan 3, 2025

Example for all 3 supported ways for default_args

  1. At DAG configuration YML
  2. At the top of the YML file
  3. defaults.yml

(1) and (2) already exist, and (3) will be released in 0.22.0.

precedence order is: 1 > 2 > 3
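
The stated order (1 > 2 > 3) can be read as "apply the least specific layer first and let each more specific layer overwrite it". The sketch below shows that reading with a shallow, key-by-key update. The function resolve_default_args and the sample values are hypothetical and are not taken from the dag-factory source; whether the real merge is shallow or recursive is not stated in this thread.

```python
def resolve_default_args(dag_level, file_level, shared_level):
    """Hypothetical resolver for precedence 1 > 2 > 3:
    (3) shared defaults.yml is applied first,
    (2) the default block at the top of the YAML file overrides it,
    (1) default_args in the DAG's own configuration overrides both."""
    resolved = {}
    for layer in (shared_level, file_level, dag_level):
        resolved.update(layer)
    return resolved


# Made-up values for the three layers.
shared_level = {"owner": "platform", "retries": 1, "start_date": "2024-01-01"}
file_level = {"owner": "analytics", "retries": 2}
dag_level = {"retries": 3}

resolved = resolve_default_args(dag_level, file_level, shared_level)
```

In this example, retries comes from the DAG-level configuration, owner from the file-level default block, and start_date from the shared defaults file.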

6 participants