[Bug]: Dependencies from private repositories unable to be seen #25085

Closed
2 of 15 tasks
RobMcKiernan opened this issue Jan 19, 2023 · 19 comments · Fixed by #26471
Labels
bug, dataflow, done & done (Issue has been reviewed after it was closed for verification, followups, etc.), P2, python

Comments

@RobMcKiernan

RobMcKiernan commented Jan 19, 2023

What happened?

When running a GCP Dataflow job with the Python SDK 2.44.0, I can no longer access my private repositories. It works on 2.43.0.

My setup is as follows:

# These image repos have access to my private repos
FROM my.private.image-repo/python:3.8-slim-builder as builder
FROM my.private.image-repo/python:3.8-slim
COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
# this virtual env has all the dependencies I need pre-installed on it, including ones from private repos
COPY --from=builder $VENV_PATH $VENV_PATH
ENTRYPOINT ["/opt/apache/beam/boot"]

This is my run command:

poetry run python -m projname.main \
  --project="$PROJECT_ID" \
  --runner=DataFlowRunner \
  --temp_location=gs://"$BUCKET_NAME"/temp \
  --region="$REGION" \
  --job_name="$JOB_NAME" \
  --setup_file=./setup.py \
  --subnetwork https://www.googleapis.com/compute/v1/projects/"$PROJECT_ID"/regions/"$REGION"/subnetworks/"$SUBNET" \
  --experiment=use_runner_v2 \
  --sdk_container_image=$IMAGE_NAME \
  --template_location=gs://"$BUCKET_NAME"/templates/"$JOB_NAME"

Checking my Dataflow worker logs, I can see that it fails to find my private repos:

ERROR: Could not find a version that satisfies the requirement package-i-want<3.0.0,>=2.2.0 (from name-of-my-dataflow) (from versions: none)

I think this is the culprit PR: https://github.com/apache/beam/pull/23684/files#diff-cc1f3d7f808c692a6102847bec78809f2e4350c5ee34278100ce0f55d8c23d68R234

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn
Contributor

Abacn commented Jan 20, 2023

CC: @robertwb
CC: @tvalentyn

Sounds like a regression. Is there a workaround to mitigate this?

@tvalentyn
Contributor

tvalentyn commented Jan 20, 2023

ack, thanks, I'll try to get some eyes here.

@riteshghorse
Contributor

I looked at the mentioned culprit PR and I don't think it's quite the culprit, because it is not discarding anything that used to work earlier. I'll take a closer look at the bug for other possibilities.

@riteshghorse
Contributor

are your private dependencies listed in requirements.txt somehow and not pulled locally when running the job?

@tvalentyn
Contributor

FWIW, if there is a regression between versions, it should be possible to bisect the regression to an exact commit.
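
For anyone who wants to try this, here is a minimal sketch of the bisection idea in Python. It is only an illustration: `first_bad` and the `is_bad` callback are hypothetical names, and the sketch assumes the candidates are ordered oldest to newest, the first one is known good, the last one is known bad, and everything after the first bad candidate is also bad. In practice `is_bad` would build the SDK at a commit and run the failing job.

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest candidate for which is_bad() is true.

    Assumes candidates[0] is good and candidates[-1] is bad, so neither
    endpoint is re-tested; only the candidates in between are probed.
    """
    if len(candidates) < 2:
        raise ValueError("need at least one good and one bad candidate")
    good, bad = 0, len(candidates) - 1  # indices of a known-good and a known-bad candidate
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return candidates[bad]
```

Each round halves the remaining range, so the number of builds to test grows with the logarithm of the number of commits between the two releases.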

@tvalentyn
Contributor

ERROR: Could not find a version that satisfies the requirement package-i-want<3.0.0,>=2.2.0 (from name-of-my-dataflow) (from versions: none)

Re: 'from versions: none' - just to double check, when you changed versions of Beam, did you by chance also change the version of the Python interpreter in addition to the Beam version? Could you double check that it didn't change?

@RobMcKiernan
Author

Sorry I've been away for the past week.

are your private dependencies listed in requirements.txt somehow and not pulled locally when running the job?

I don't use a requirements.txt. Instead I use poetry, which creates a poetry.lock file, which serves a similar purpose as a requirements.txt. I have verified that my local poetry virtual env has my private python repos installed in it.

The other part to this is that I've created a base docker container for my workers on gcp to use. The private docker image referred to in my Dockerfile (FROM my.private.image-repo/python:3.8-slim-builder as builder) has access to my private python repositories. I've verified this by pulling the docker image myself and exec-ing into it. It seems it is at this point that it fails to have access to my private repos.

@tvalentyn no, I'm afraid my python version has remained constant.

@tvalentyn
Contributor

tvalentyn commented Feb 11, 2023

I see. It looks like you may be copying the site-packages directory from a different virtual environment. There was a change recently that creates one virtual environment per SDK process: #16658

It could be that you were impacted by this change, if you have been using a non-global site-packages directory to store your packages after the COPY --from=builder $VENV_PATH $VENV_PATH command.

Note that dependencies installed in the global python environment should still be accessible in individual python environments, which are created after #16658.
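
One way to see which kind of environment an interpreter is running in is to compare `sys.prefix` with `sys.base_prefix`. The sketch below uses only the standard library; `describe_python_environment` is a hypothetical helper name, not something Beam provides, and it says nothing about how Beam itself sets up its per-process environments.

```python
import site
import sys


def describe_python_environment() -> dict:
    """Summarize whether this interpreter runs inside a virtual environment."""
    return {
        "executable": sys.executable,
        # In a venv, prefix points at the venv and base_prefix at the
        # interpreter the venv was created from; otherwise they are equal.
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Directories this interpreter treats as its site-packages.
        "site_packages": site.getsitepackages(),
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print(f"{key}: {value}")
```

Running this inside the container image (for example with `docker run --entrypoint python`) shows whether the packages you installed live under the interpreter's own site-packages or under a separate venv prefix.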

@tvalentyn
Contributor

I think 2.44.0 is the first release that includes #16658, which matches the timing you describe.

@RobMcKiernan
Author

Ah ok, yep that sounds like it could be the culprit then.

I've noticed that the dataflow docs use pip to install python packages whereas I'm using poetry. I wonder if that plays into this? https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild

The env var VENV_PATH is set to /venv in my COPY --from=builder $VENV_PATH $VENV_PATH, if that helps illuminate anything.

@tvalentyn
Contributor

tvalentyn commented Feb 17, 2023

The global environment will have packages installed in /usr/local/lib/python3.8/site-packages. These packages will be available to other environments. If you activate a custom venv, I think it will be ignored now that the codepath has changed in #16658 and each Python process creates an individual environment.

I suppose you could try to manipulate the PYTHONPATH variable to include your environment, but that may be brittle if you have package mismatches.
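
If you go the PYTHONPATH route, it helps to confirm that the directory actually reaches the interpreter. The sketch below starts a fresh Python process with an extra directory on PYTHONPATH and returns that process's `sys.path`. `child_sys_path` is a hypothetical helper, and this only checks a plain child interpreter; whether the Beam worker processes see the same variable is an assumption you would need to verify on a real worker.

```python
import json
import os
import subprocess
import sys


def child_sys_path(extra_dir: str) -> list:
    """Start a fresh interpreter with extra_dir on PYTHONPATH and return its sys.path."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    # Put extra_dir first so it takes precedence over any existing entries.
    env["PYTHONPATH"] = extra_dir if not existing else extra_dir + os.pathsep + existing
    result = subprocess.run(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        env=env,
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `child_sys_path("/venv/lib/python3.8/site-packages")` (the path is an example, adjust it to your image) and looking for that directory in the result tells you whether the variable was picked up. It does not protect against the package mismatches mentioned above.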

@RobMcKiernan
Author

I'm back working on this now. I tried altering my PYTHONPATH in my Dockerfile, but that didn't seem to work, although I'm not quite sure why.

I'm now experimenting with poetry config virtualenvs.create false to install my packages in the global python environment. I'll let you know how it goes.

@tvalentyn
Contributor

sg, thanks.

@RobMcKiernan
Author

Yep, that worked! My new Dockerfile, in case it helps anyone:

# This image is just a thin wrapper around the standard Python 3.10 slim image. The standard image should work just as well.
FROM eu.gcr.io/my-proj/python:3.10-slim
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
COPY --from=apache/beam_python3.10_sdk:2.46.0 /opt/apache/beam /opt/apache/beam

ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    POETRY_NO_INTERACTION=1 \
    PATH=/usr/lib/google-cloud-sdk/bin:$PATH

WORKDIR /app

# -- Omitted Section to sort out my gcloud authentication, which I'm not including out of paranoia --

RUN pip install --no-cache-dir \
        poetry \
        keyring \
        keyrings.google-artifactregistry-auth

COPY ./pyproject.toml ./poetry.lock ./

# setting virtualenvs.create to false prevents poetry using venvs as
# beam >2.43 uses global python packages only
RUN poetry config virtualenvs.create false \
 && poetry install --no-cache --no-root --only main \
 && rm -rf /root/.cache

ENTRYPOINT ["/opt/apache/beam/boot"]

tl;dr for anyone skipping to the end: Make sure your python packages are installed in /usr/local/lib/python<version number>/site-packages in your docker container.
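
A quick way to check this from inside the container is to ask Python where a given distribution is installed. This is a rough sketch using `importlib.metadata` (Python 3.8+); `distribution_location` and `is_under` are hypothetical helper names, and the paths in the usage note are examples rather than values taken from any particular image.

```python
from importlib import metadata
from pathlib import Path
from typing import Optional


def distribution_location(dist_name: str) -> Optional[Path]:
    """Return the directory a distribution is installed into, or None if it is missing."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # locate_file("") resolves to the directory that holds the installed
    # files, which is normally a site-packages directory.
    return Path(str(dist.locate_file(""))).resolve()


def is_under(path, root) -> bool:
    """Return True if path is root itself or lives somewhere below it."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False
```

For example, `is_under(distribution_location("package-i-want"), "/usr/local/lib")` should be true in a working image, assuming the package is installed at all; if `distribution_location` returns None, the package is not visible to that interpreter.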

Cheers for your help everyone! Should I close, or would you like it kept open? I guess at a minimum this should be documented somewhere.

@tvalentyn
Contributor

tvalentyn commented Apr 28, 2023

You could modify CHANGES.md to further document suggestions/instructions pertaining to the change in behavior in 2.44.0 if you'd like and link this issue.

@tvalentyn
Contributor

Glad to hear you resolved the issue.

@RobMcKiernan
Author

I just tried raising a PR, but it appears that I don't have the needed permissions to push to this repo. This is the diff of my PR:

diff --git a/CHANGES.md b/CHANGES.md
index 871f24bf9d..c7578a8a61 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -254,6 +254,8 @@
   runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
 * Slices now use the Beam Iterable Coder. This enables cross language use, but breaks pipeline updates
   if a Slice type is used as a PCollection element or State API element. (Go)[#24339](https://github.com/apache/beam/issues/24339)
+* Custom worker Dockerfiles must now install their dependencies in the global python environment. For example, when using poetry
+  you must use `poetry config virtualenvs.create false` before installing deps [#25085](https://github.com/apache/beam/issues/25085)
 
 ## Deprecations
 
diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index 17ee452a57..46a7f69209 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -198,6 +198,7 @@ Beam offers a way to provide your own custom container image. The easiest way to
 >The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br>
 >**Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.**
 
+>**NOTE**: When using version >=2.44.0 you must ensure dependencies are installed in the global python environment in the resulting image 
 
 2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.

@tvalentyn
Contributor

np. You might have to fork the repo first to create PRs. Sent you #26471

@tvalentyn
Contributor

Thanks a lot!

@tvalentyn tvalentyn added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label May 9, 2023