
ciroh-ngen-image build is broken with ngen latest #137

Closed
benlee0423 opened this issue Mar 25, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@benlee0423

Short description explaining the high-level reason for the new issue.

Current behavior

The GitHub Actions build fails with the following error message.

#17 41.49 [ 57%] Building CXX object src/core/catchment/CMakeFiles/core_catchment.dir/HY_CatchmentRealization.cpp.o
#17 41.55 [ 57%] Built target geopackage
#17 41.56 [ 58%] Building CXX object test/CMakeFiles/test_network.dir/core/NetworkTests.cpp.o
#17 41.87 terminate called after throwing an instance of 'pybind11::error_already_set'
#17 41.87   what():  ModuleNotFoundError: No module named 'numpy'
#17 41.99 CMake Error at /usr/local/lib64/python3.9/site-packages/cmake/data/share/cmake-3.28/Modules/GoogleTestAddTests.cmake:112 (message):
#17 41.99   Error running test executable.
#17 41.99 
#17 41.99     Path: '/ngen/ngen/cmake_build_serial/test/test_routing_pybind'
#17 41.99     Result: Subprocess aborted
#17 41.99     Output:
#17 41.99       
#17 41.99 
#17 41.99 Call Stack (most recent call first):

Expected behavior

Build the image without any error

Steps to replicate behavior (include URLs)

Screenshots

@benlee0423 benlee0423 added the bug Something isn't working label Mar 25, 2024
@benlee0423
Author

The last ngen commit that built successfully is f91e2ea, so I created a branch, ngen-commit-f91e2ea, pinned to it.
It builds successfully.

@arpita0911patel
Member

@benlee0423 is this for the x86 build?

@hellkite500
Collaborator

hellkite500 commented Mar 26, 2024

This issue appears to be the same as the one in the singularity thread (CIROH-UA/NGIAB-HPCInfra#12)

As mentioned in that thread:

https://github.com/CIROH-UA/NGIAB-CloudInfra/actions/runs/8426072053/job/23074620878#step:4:814

Shows that numpy is in /usr/local/lib64 but the site library for this install is /usr/lib

https://github.com/CIROH-UA/NGIAB-CloudInfra/actions/runs/8426072053/job/23074620878#step:4:810

NOAA-OWP/ngen#755 updated the pybind module in an attempt to fix up some other path issues related to pybind/pybind11#4471 but it appears that the issue may not be entirely resolved upstream (pybind/pybind11#4654).

I would suggest using a virtual environment for the ngen build and runtime, and installing numpy and the other required Python modules in that environment. This seems to avoid many of the potential path issues, such as the one seen in this issue. This is also the recommendation in the ngen dependencies documentation.
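To make the path mismatch described above easier to see, here is a minimal diagnostic sketch in Python. It reports where the interpreter it runs under looks for packages and where (or whether) a module such as numpy resolves. The function name `describe_module_lookup` is hypothetical, and the sketch assumes it is run with the same `python3` that the build uses; the interpreter embedded by pybind11 in the test executable may still compute a different search path, which is the problem discussed in this thread.

```python
import importlib.util
import sys
import sysconfig


def describe_module_lookup(module_name: str) -> dict:
    """Report this interpreter's install paths and where module_name resolves.

    Hypothetical helper for illustration; not part of ngen or pybind11.
    """
    paths = sysconfig.get_paths()
    # find_spec returns None when a top-level module cannot be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "purelib": paths["purelib"],  # pure-Python packages
        "platlib": paths["platlib"],  # compiled packages, e.g. a lib64 directory
        "sys_path": list(sys.path),
        "module_origin": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_module_lookup("numpy").items():
        print(f"{key}: {value}")
```

If `module_origin` points into a directory that does not appear in the search path used by the failing test, that would be consistent with the mismatch between /usr/local/lib64 and /usr/lib noted above.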

@JoshCu
Collaborator

JoshCu commented Mar 26, 2024

Unless Docker has been updated with a better way of doing this, a Python venv can be created like this:

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

I'd make the changes myself but I've got to head to class in 2 minutes
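As a follow-up to the venv lines above, a small Python check can confirm that the interpreter found first on `PATH` is actually the one inside the virtual environment. This is a sketch, not part of the proposed Dockerfile change; the function name `in_virtual_env` is hypothetical. It relies on the standard behaviour that `sys.prefix` differs from `sys.base_prefix` when Python runs from a venv.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when this interpreter is running from a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"prefix:      {sys.prefix}")
    print(f"base_prefix: {sys.base_prefix}")
    print(f"in venv:     {in_virtual_env()}")
```

Running this with `python3` in a later build step would show whether the `ENV PATH` line took effect; with the settings above, `prefix` would be expected to be /opt/venv.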

@JoshCu
Collaborator

JoshCu commented Jul 18, 2024

After investigating this further, the issue with the Docker build can be fixed by using a more recent release of pybind11.
The easiest way of doing this is to delete the pybind11 directory and re-clone it here.

This line should do it:

RUN cd /ngen/ngen/extern && rm -rf pybind11 && git clone https://github.com/pybind/pybind11.git && cd pybind11 && git checkout v2.11.0

It's what I've used in this dockerfile I've been working on to build ngen as fast as possible for testing.
https://github.com/JoshCu/NGIAB-CloudInfra/blob/328808dbf5fa08c36412840be618df52239e0ef4/docker/Dockerfile#L63

The latest version of pybind11, 2.13.1, also works, but I figured it would be best to use the next release after the version currently pinned in the ngen repo (pinned is 2.10.4, next after that is 2.11).

It looks like there are plans to bump pybind11 to 2.12 (NOAA-OWP/ngen#837), which will also fix this. It should also fix CIROH-UA/NGIAB-HPCInfra#12, and will likely fix NOAA-OWP/DMOD#630.
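To check which pybind11 version a checkout actually contains, one option is to read the version macros from its header. The sketch below is a hedged illustration: it assumes the checkout follows the usual pybind11 layout, where `include/pybind11/detail/common.h` defines `PYBIND11_VERSION_MAJOR` and `PYBIND11_VERSION_MINOR`, and the function names are hypothetical. The minimum of 2.11 is taken from the versions reported as working in this thread.

```python
import re


def pybind11_version_from_header(header_text: str) -> tuple:
    """Extract (major, minor) from the text of pybind11's common.h."""
    parts = []
    for name in ("MAJOR", "MINOR"):
        match = re.search(
            rf"#define\s+PYBIND11_VERSION_{name}\s+(\d+)", header_text
        )
        if match is None:
            raise ValueError(f"PYBIND11_VERSION_{name} not found in header")
        parts.append(int(match.group(1)))
    return tuple(parts)


def is_at_least(version: tuple, minimum: tuple = (2, 11)) -> bool:
    """Compare (major, minor) tuples; 2.11 is the first release reported to work here."""
    return version >= minimum


if __name__ == "__main__":
    # Assumed location, based on the extern directory used in the RUN line above.
    header_path = "/ngen/ngen/extern/pybind11/include/pybind11/detail/common.h"
    with open(header_path, encoding="utf-8") as handle:
        version = pybind11_version_from_header(handle.read())
    print(version, "ok" if is_at_least(version) else "too old for this fix")
```

Parsing the header rather than importing a Python package keeps the check tied to the vendored copy that ngen compiles against.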

@arpita0911patel
Member

arpita0911patel commented Aug 2, 2024

@benlee0423 - could you recheck this when you get a chance?

@benlee0423 benlee0423 changed the title ciroh-ngen-image build is broken ciroh-ngen-image build is broken with ngen latest Aug 2, 2024
@benlee0423
Author

@arpita0911patel
The build failed with the latest ngen code.
https://github.com/CIROH-UA/NGIAB-CloudInfra/actions/runs/10217505548
The error message from the build is the same as the one at the top of this issue.

@benlee0423
Author

@arpita0911patel
Josh's suggestion fixes the issue.
It was tested and verified.
#212 Good to merge now.
Thank you @JoshCu

@arpita0911patel
Member

Thank you Ben

@benlee0423
Author

Closing, as the PR has been merged.
