
consolidated three EOBS recipes in meta recipe #169

Merged: 101 commits, Oct 21, 2022

Conversation

@norlandrhagen (Contributor)

@pangeo-forge-bot

When I tried to load recipes/EOBS/meta.yaml, I got a <class 'yaml.scanner.ScannerError'>.

You should be able to replicate this error yourself.

First make sure you're in the root of your cloned staged-recipes repo. Then run this code in a Python interpreter:

import yaml  # note: you may need to `pip install pyyaml` first

with open("recipes/EOBS/meta.yaml", "r") as f:
    yaml.safe_load(f)

Please correct meta.yaml so that you're able to run this code without error, then commit the corrected meta.yaml.
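As an illustration (this is a hypothetical snippet, not the actual contents of this meta.yaml), a tab used for indentation is one common trigger for a `ScannerError`, and the exception's `problem_mark` pinpoints where parsing failed:

```python
import yaml  # PyYAML: pip install pyyaml

# Hypothetical snippet: YAML forbids tabs for indentation, which is a
# common cause of yaml.scanner.ScannerError.
bad_yaml = "recipes:\n\t- id: eobs\n"

try:
    yaml.safe_load(bad_yaml)
except yaml.scanner.ScannerError as err:
    # problem_mark reports the line and column where scanning failed
    print(err.problem)
    print(err.problem_mark)
```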

@norlandrhagen (Contributor, Author)

Hey there @cisaacstern. Any idea how to fix this pre-commit issue?

Also, I think this is ready for a test run.

@cisaacstern (Member)

Thanks for checking in here, @norlandrhagen. The pre-commit issue is being tracked in #88. It's on my radar to address soon, but we can ignore here for now.

Re: test run, yes, I'll do that now. 😄

@cisaacstern (Member)

/run eobs-tg-tn-tx-rr-hu-pp

@cisaacstern (Member)

/run eobs-wind-speed

@cisaacstern (Member)

/run eobs-surface-downwelling

@norlandrhagen (Contributor, Author)

Awesome thanks!

@cisaacstern (Member)

I'm confused about what happened to these test runs. If they failed to submit to Dataflow, the bot should have added a 😕 emoji reaction to the comments above. And if they failed on Dataflow, we should have gotten a new comment back about that.

🤔 This will require some digging in the logs to get to the bottom of. I'll try to do that this afternoon. Feel free to ping me here next week if I haven't gotten back with a response yet.

@andersy005 (Member)

pre-commit.ci autofix

@andersy005 (Member)

you bet, @norlandrhagen!

@cisaacstern (Member)

@andersy005, similar to my confusion in #169 (comment), I do not see jobs for these recipe runs on the Dataflow dashboard. I am wondering if, in light of pangeo-forge/pangeo-forge-orchestrator#147, the issue is that starting these tests in such close succession to each other is spiking memory on the backend, and causing their submissions to fail. I'm not done with pangeo-forge/pangeo-forge-orchestrator#143 yet, so actually "proving" this in the logs is difficult, but I'll try an experiment now, of just deploying one of these, and watching the logs in my terminal in real time...

@cisaacstern (Member)

/run eobs-wind-speed

Comment on lines 16 to 20
object: 'recipe:eobs-tg-tn-tx-rr-hu-pp'
- id: eobs-wind-speed
object: 'recipe:eobs_wind_speed'
- id: eobs-surface-downwelling
object: 'recipe:eobs_surface_downwelling'
@cisaacstern (Member) commented Sep 29, 2022

@norlandrhagen & @andersy005 the problem is that these objects defined in the meta.yaml don't exist in the recipe.py.

Raphael, I tried to figure out which of the recipes

 tg_tn_tx_rr_hu_pp_recipe = XarrayZarrRecipe(
     file_pattern=tg_tn_tx_rr_hu_pp_pattern, target_chunks=target_chunks, subset_inputs=subset_inputs
 )

 qq_pattern_recipe = XarrayZarrRecipe(
     file_pattern=qq_pattern, target_chunks=target_chunks, subset_inputs=subset_inputs
 )

 fg_pattern_recipe = XarrayZarrRecipe(
     file_pattern=fg_pattern, target_chunks=target_chunks, subset_inputs=subset_inputs
 )

correspond to each of these objects, but I couldn't figure it out based on the names alone.
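For illustration only (these variable names are hypothetical, not the actual mapping for this PR), each `object` entry in meta.yaml must take the form `module:variable`, where `variable` is a module-level name actually defined in recipe.py:

```yaml
# meta.yaml (hypothetical names)
recipes:
  - id: eobs-wind-speed
    # requires e.g. `wind_speed_recipe = XarrayZarrRecipe(...)` in recipe.py
    object: 'recipe:wind_speed_recipe'
```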

@andersy005 (Member)

> the problem is that these objects don't exist in the recipe.py

I presume you got this information from the Dataflow logs, right?

@cisaacstern (Member)

@andersy005, actually no. This is an important debugging trick, so I'll describe it here, and then open an issue elsewhere to document it.

So what I did was watch the server logs right after issuing the /run command for the test. And I saw:

2022-09-29T02:02:56.875589+00:00 app[web.1]: 2022-09-29 02:02:56,875 DEBUG - orchestrator - Dumping bakery config to json: {'Bake': {'bakery_class': 'pangeo_forge_runner.bakery.dataflow.DataflowBakery', 'job_name': 'a6170692e70616e67656f2d666f7267652e6f7267251191'}, 'TargetStorage': {'fsspec_class': 's3fs.S3FileSystem', 'fsspec_args': {'client_kwargs': {'endpoint_url': 'https://ncsa.osn.xsede.org'}, 'default_cache_type': 'none', 'default_fill_cache': False, 'use_listings_cache': False, 'key': SecretStr('**********'), 'secret': SecretStr('**********')}, 'root_path': 'Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr', 'public_url': 'https://ncsa.osn.xsede.org/{root_path}'}, 'InputCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {'bucket': 'pangeo-forge-prod-cache'}, 'root_path': 'pangeo-forge-prod-cache'}, 'MetadataCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {}, 'root_path': 'pangeo-forge-prod-cache/metadata/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr'}, 'DataflowBakery': {'temp_gcs_location': 'gs://pangeo-forge-prod-dataflow/temp'}}
2022-09-29T02:02:56.877275+00:00 app[web.1]: 2022-09-29 02:02:56,877 DEBUG - orchestrator - Running command: ['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef', '--json', '--prune', '--Bake.recipe_id=eobs-wind-speed', '-f=/tmp/tmplol5klds.json', '--feedstock-subdir=recipes/EOBS']
2022-09-29T02:02:58.116682+00:00 heroku[web.1]: Process running mem=647M(120.0%)
2022-09-29T02:02:58.143937+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2022-09-29T02:03:00.449388+00:00 app[web.1]: [2022-09-29 02:03:00 +0000] [60] [ERROR] Exception in ASGI application
2022-09-29T02:03:00.449397+00:00 app[web.1]: Traceback (most recent call last):
2022-09-29T02:03:00.449398+00:00 app[web.1]: File "/opt/app/pangeo_forge_orchestrator/routers/github_app.py", line 621, in run
2022-09-29T02:03:00.449398+00:00 app[web.1]: out = subprocess.check_output(cmd)
2022-09-29T02:03:00.449399+00:00 app[web.1]: File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
2022-09-29T02:03:00.449399+00:00 app[web.1]: return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2022-09-29T02:03:00.449400+00:00 app[web.1]: File "/usr/lib/python3.9/subprocess.py", line 528, in run
2022-09-29T02:03:00.449400+00:00 app[web.1]: raise CalledProcessError(retcode, process.args,
2022-09-29T02:03:00.449414+00:00 app[web.1]: subprocess.CalledProcessError: Command '['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef', '--json', '--prune', '--Bake.recipe_id=eobs-wind-speed', '-f=/tmp/tmplol5klds.json', '--feedstock-subdir=recipes/EOBS']' returned non-zero exit status 1.

But this didn't give much detail about what the CalledProcessError actually was. So I copy-and-pasted the config from the first log line

2022-09-29T02:02:56.875589+00:00 app[web.1]: 2022-09-29 02:02:56,875 DEBUG - orchestrator - Dumping bakery config to json: {'Bake': {'bakery_class': 'pangeo_forge_runner.bakery.dataflow.DataflowBakery', 'job_name': 'a6170692e70616e67656f2d666f7267652e6f7267251191'}, 'TargetStorage': {'fsspec_class': 's3fs.S3FileSystem', 'fsspec_args': {'client_kwargs': {'endpoint_url': 'https://ncsa.osn.xsede.org'}, 'default_cache_type': 'none', 'default_fill_cache': False, 'use_listings_cache': False, 'key': SecretStr('**********'), 'secret': SecretStr('**********')}, 'root_path': 'Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr', 'public_url': 'https://ncsa.osn.xsede.org/{root_path}'}, 'InputCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {'bucket': 'pangeo-forge-prod-cache'}, 'root_path': 'pangeo-forge-prod-cache'}, 'MetadataCacheStorage': {'fsspec_class': 'gcsfs.GCSFileSystem', 'fsspec_args': {}, 'root_path': 'pangeo-forge-prod-cache/metadata/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr'}, 'DataflowBakery': {'temp_gcs_location': 'gs://pangeo-forge-prod-dataflow/temp'}}

into a local JSON file on my laptop

{"Bake": {"bakery_class": "pangeo_forge_runner.bakery.dataflow.DataflowBakery", "job_name": "a6170692e70616e67656f2d666f7267652e6f7267251191"}, "TargetStorage": {"fsspec_class": "s3fs.S3FileSystem", "fsspec_args": {"client_kwargs": {"endpoint_url": "https://ncsa.osn.xsede.org"}, "default_cache_type": "none", "default_fill_cache": false, "use_listings_cache": false, "key": "**********", "secret": "**********"}, "root_path": "Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr", "public_url": "https://ncsa.osn.xsede.org/{root_path}"}, "InputCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {"bucket": "pangeo-forge-prod-cache"}, "root_path": "pangeo-forge-prod-cache"}, "MetadataCacheStorage": {"fsspec_class": "gcsfs.GCSFileSystem", "fsspec_args": {}, "root_path": "pangeo-forge-prod-cache/metadata/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1191/eobs-wind-speed.zarr"}, "DataflowBakery": {"temp_gcs_location": "gs://pangeo-forge-prod-dataflow/temp"}}

and then I copied the pangeo-forge-runner command from the second log line

2022-09-29T02:02:56.877275+00:00 app[web.1]: 2022-09-29 02:02:56,877 DEBUG - orchestrator - Running command: ['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef', '--json', '--prune', '--Bake.recipe_id=eobs-wind-speed', '-f=/tmp/tmplol5klds.json', '--feedstock-subdir=recipes/EOBS']

and, replacing '-f=/tmp/tmplol5klds.json' with the path to my local JSON config (c.json), ran

$ pangeo-forge-runner bake --repo=https://github.com/norlandrhagen/staged-recipes --ref=be1d60de2f3b9aed5fe480328b9d60e1ef0694ef --json --prune --Bake.recipe_id=eobs-wind-speed -f=c.json --feedstock-subdir=recipes/EOBS

which gave me a descriptive error

{"message": "Error during running: 'eobs-tg-tn-tx-rr-hu-pp'", "exc_info": "Traceback (most recent call last):\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/bin/pangeo-forge-runner\", line 8, in <module>\n    sys.exit(main())\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/cli.py\", line 28, in main\n    app.start()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/cli.py\", line 23, in start\n    super().start()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/traitlets/config/application.py\", line 462, in start\n    return self.subapp.start()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/commands/bake.py\", line 112, in start\n    recipes = feedstock.parse_recipes()\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/feedstock.py\", line 55, in parse_recipes\n    recipes[r[\"id\"]] = self._import(r[\"object\"])\n  File \"/Users/charlesstern/miniconda3/envs/pfo-new/lib/python3.9/site-packages/pangeo_forge_runner/feedstock.py\", line 43, in _import\n    return self._import_cache[module][export]\nKeyError: 'eobs-tg-tn-tx-rr-hu-pp'", "status": "failed"}
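For context, here is a rough sketch (assumed behavior, not the actual pangeo-forge-runner source) of how a `module:export` reference from meta.yaml gets resolved, and why a hyphenated id like the one above surfaces as a KeyError: exports must be valid Python identifiers defined in the recipe module.

```python
import types

def resolve(object_ref: str, module: types.ModuleType):
    """Resolve a meta.yaml 'module:export' reference against an imported
    recipe module; roughly what feedstock.py does via its import cache."""
    _, export = object_ref.split(":")
    if not hasattr(module, export):
        # Mirrors the KeyError seen in the traceback above
        raise KeyError(export)
    return getattr(module, export)

# Hypothetical recipe module: exports must be valid Python identifiers,
# so 'eobs-tg-tn-tx-rr-hu-pp' (with hyphens) can never be defined here.
recipe = types.ModuleType("recipe")
recipe.tg_tn_tx_rr_hu_pp_recipe = object()

resolve("recipe:tg_tn_tx_rr_hu_pp_recipe", recipe)  # resolves fine
```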

@cisaacstern (Member)

For the record, my theory in #169 (comment) about memory overruns was incorrect. I believe the problem is #169 (comment)

@andersy005 (Member)

thank you for the heads up about potential memory spikes and tracking this issue, @cisaacstern!

@norlandrhagen (Contributor, Author)

@cisaacstern Thanks for catching that issue. I updated the recipe, so that each id in the meta.yaml should now correspond to the correct recipe.

@andersy005 (Member)

/run eobs-tg-tn-tx-rr-hu-pp

@pangeo-forge (bot) commented Sep 29, 2022

The test failed, but I'm sure we can find out why!

Pangeo Forge maintainers are working diligently to provide public logs for contributors.
That feature is not quite ready yet, however, so please reach out on this thread to a
maintainer, and they'll help you diagnose the problem.

@cisaacstern (Member)

@andersy005, rather than manually grab these error logs myself, I'm going to try to add a dataflow logs endpoint to the orchestrator right now, so that you can be more self-sufficient in diagnosing these issues going forward. I'll check back here once it's ready!

@andersy005 (Member)

> @andersy005, rather than manually grab these error logs myself, I'm going to try to add a dataflow logs endpoint to the orchestrator right now, so that you can be more self-sufficient in diagnosing these issues going forward. I'll check back here once it's ready!

that would be awesome, @cisaacstern. I haven't had a chance to do a deep dive into the dataflow functionality. Ping me if you need any feedback/review.

@andersy005 (Member)

/run eobs-tg-tn-tx-rr-hu-pp

@pangeo-forge (bot) commented Oct 20, 2022

The test failed, but I'm sure we can find out why!

Pangeo Forge maintainers are working diligently to provide public logs for contributors.
That feature is not quite ready yet, however, so please reach out on this thread to a
maintainer, and they'll help you diagnose the problem.

Comment on lines 9 to 16
target_chunks = {'time': 40}
dataset_version = 'v23.1e'
grid_res = '0.1'
subset_inputs = {'time': 700}


def make_filename(time: str, variable: str) -> str:
return f'https://knmi-ecad-assets-prd.s3.amazonaws.com/ensembles/data/Grid_{grid_res}deg_reg_ensemble/{variable}_ens_mean_{grid_res}deg_reg_{dataset_version}.nc' # noqa: E501
@andersy005 (Member)

@norlandrhagen, it appears apache-beam doesn't work well with globally defined variables such as grid_res. Could you move these into the function body of make_filename?

  File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 624, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/usr/local/lib/python3.9/dist-packages/apache_beam/transforms/core.py", line 1956, in <lambda>
  File "/usr/local/lib/python3.9/dist-packages/pangeo_forge_recipes/executors/beam.py", line 40, in exec_stage
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 155, in cache_input
    fname = config.file_pattern[input_key]
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/patterns.py", line 219, in __getitem__
    fname = self.format_function(**format_function_kwargs)
  File "/tmp/tmpis_skwvg/recipes/EOBS/recipe.py", line 16, in make_filename
NameError: name 'grid_res' is not defined [while running 'Start|cache_input|Reshuffle_000|prepare_target|Reshuffle_001|store_chunk|Reshuffle_002|finalize_target|Reshuffle_003/cache_input/Execute-ptransform-56']
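A minimal sketch of the suggested fix, with the constants moved inside the function body so the serialized function carries everything it needs when Beam ships it to workers (values copied from the snippet above):

```python
def make_filename(time: str, variable: str) -> str:
    # Defined locally rather than at module level: the NameError above
    # occurred because module-level globals were not available on the
    # Beam workers that executed this function.
    dataset_version = 'v23.1e'
    grid_res = '0.1'
    return (
        'https://knmi-ecad-assets-prd.s3.amazonaws.com/ensembles/data/'
        f'Grid_{grid_res}deg_reg_ensemble/'
        f'{variable}_ens_mean_{grid_res}deg_reg_{dataset_version}.nc'
    )

print(make_filename('1950', 'fg'))
```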

@norlandrhagen (Contributor, Author)

Yeah totally. Thanks for looking into this @andersy005

@andersy005 (Member)

/run eobs-tg-tn-tx-rr-hu-pp

@andersy005 (Member)

/run eobs-wind-speed

@pangeo-forge (bot) commented Oct 21, 2022

🎉 The test run of eobs-wind-speed at d93d94b succeeded!

import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1284/eobs-wind-speed.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

@andersy005 (Member)

/run eobs-surface-downwelling

@norlandrhagen (Contributor, Author)

Woohoo! Thanks for picking this back up @andersy005

@pangeo-forge (bot) commented Oct 21, 2022

🎉 The test run of eobs-surface-downwelling at d93d94b succeeded!

import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1285/eobs-surface-downwelling.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

@pangeo-forge (bot) commented Oct 21, 2022

🎉 The test run of eobs-tg-tn-tx-rr-hu-pp at d93d94b succeeded!

import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1283/eobs-tg-tn-tx-rr-hu-pp.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

@andersy005 (Member)

> Woohoo! Thanks for picking this back up @andersy005

You bet! Do you plan on pushing more commits to this or should I go ahead and merge it?

@norlandrhagen (Contributor, Author)

Don't plan on pushing any more commits, so ready to merge!

@norlandrhagen norlandrhagen merged commit 6cb2e03 into pangeo-forge:master Oct 21, 2022
Labels
test-failed:needs-admin-debug A test failed, and this requires an admin to debug why.
9 participants