DVC feature requests #560
Naïve requirements (first edition)
Dulwich authentication
Commands like GitHub Actions, as per the
[http "https://github.com/"]
    extraheader = AUTHORIZATION: basic ···
GitLab and others may be using different mechanisms, like SSH keys or credential helpers, so this would require further investigation. See jelmer/dulwich#873 and jelmer/dulwich#882 for a similar request.
Possible fixes
Automatic
|
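As a rough illustration of the extraheader entry quoted in the Dulwich authentication section, the sketch below splits such a header value into its scheme and credentials using only shell string operations. The function name `parse_extraheader` and the sample value are hypothetical; how Dulwich or DVC would actually consume the header is not covered here.

```shell
# Hypothetical helper: split "AUTHORIZATION: basic <credentials>" into
# two lines, the scheme first and the credentials second.
parse_extraheader() {
  header=$1
  value=${header#*: }        # drop the "AUTHORIZATION: " prefix
  scheme=${value%% *}        # text before the first space
  credentials=${value#* }    # text after the first space
  printf '%s\n%s\n' "$scheme" "$credentials"
}
```

For example, `parse_extraheader "AUTHORIZATION: basic abc123"` prints `basic` and then `abc123` on separate lines.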
don't see much of a problem with e.g. |
Me neither, that's why I used the 🤔 emoji above. Nevertheless, we would be silently masking any error (even network ones) under the experiment not found case, and this might not be good in a continuous integration scenario, where failing early is better than blindly using expensive resources. |
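One way to avoid masking unrelated errors, sketched here under assumptions: wrap the command and tolerate only the failure whose output contains a known "not found" message, while letting every other failure keep its exit status. The marker text and the function name `tolerate_not_found` are hypothetical; the real message printed by DVC may differ.

```shell
# Hypothetical marker text: the message a real tool prints may differ.
NOT_FOUND_MARKER="experiment not found"

# Run a command, tolerating only the "not found" failure.
# Any other failure (for example a network error) keeps its exit status.
tolerate_not_found() {
  if output=$("$@" 2>&1); then
    printf '%s\n' "$output"
    return 0
  else
    status=$?
  fi
  case "$output" in
    *"$NOT_FOUND_MARKER"*)
      echo "nothing to resume, starting from scratch"
      return 0
      ;;
  esac
  # Unknown failure: show the original output and fail early.
  printf '%s\n' "$output" >&2
  return "$status"
}
```

A CI step could then call `tolerate_not_found some-command args` instead of appending `|| echo ...`, so that a network error still stops the job.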
CML workflow stoppage or the endless training problem
If the workflow stops, the CML runner should be able to restart the workflow and continue the training from the last checkpoint. We conduct a series of experiments assuming that the training generates incremental checkpoints, as TensorFlow does.
Tensorflow example checkpoints
When
To simulate this stoppage we set up a workflow timeout of 1 min.
Expected
The workflow should be able to be restarted and continue training from the last checkpoint until completed.
Problem
With DVC as storage, all the experiments need at some point to handle
Trials
DVC repro
Implementation:
dvc push

.github/workflows/cml.yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true

train.sh
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push
  fi
done

dvc commit & push

.github/workflows/cml.yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true

train.sh
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc add models
    dvc commit
    dvc push
  fi
done

dvc push --run-cache

.github/workflows/cml.yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true

train.sh
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push --run-cache
  fi
done

dvc commit & push --run-cache

.github/workflows/cml.yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true

train.sh
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc add models
    dvc commit
    dvc push --run-cache
  fi
done
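
The resume behaviour that the train.sh trials above rely on can be sketched without DVC at all: skip every step whose checkpoint file already exists and continue with the first missing one. This is a minimal simulation under assumptions, not the CML or DVC implementation; the function names and the `max_new` parameter (which imitates the workflow timeout by limiting how many new checkpoints one run may write) are hypothetical.

```shell
# Print the first step whose checkpoint file is missing from DIR,
# or nothing when all TOTAL checkpoints already exist.
next_missing_step() {
  dir=$1
  total=$2
  step=1
  while [ "$step" -le "$total" ]; do
    if [ ! -f "$dir/model-checkpoint-$step" ]; then
      echo "$step"
      return 0
    fi
    step=$((step + 1))
  done
}

# Simulated training run: write missing checkpoints in order and stop
# after MAX_NEW new ones, imitating a run cut short by the timeout.
train() {
  dir=$1
  total=$2
  max_new=$3
  mkdir -p "$dir"
  written=0
  while [ "$written" -lt "$max_new" ]; do
    step=$(next_missing_step "$dir" "$total")
    [ -n "$step" ] || break      # every checkpoint exists: training is done
    echo "weights $step" > "$dir/model-checkpoint-$step"
    written=$((written + 1))
  done
}
```

Calling `train models 3 1` twice writes checkpoint 1 on the first call and checkpoint 2 on the second, which is the behaviour a restarted workflow needs once the earlier checkpoints have been restored from storage.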
Problems:
DVC run exp
Implementation:
Problems
dvc push

.github/workflows/cml.yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true

train.sh
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push
  fi
done

DVC run exp checkpoints:
dvc push

.github/workflows/cml.yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          checkpoint: true

train.sh
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push
    echo 'cml-pr'
    cml-pr '.gitignore' 'dvc.lock'
  fi
done

Problems
A plausible solution would be to merge the last PR, forcing the CI to restart and continue from there. |
My 2 cents: I feel that issue iterative/dvc#5369 is related to CML. I'd love to have the ability to check whether I need to repro the pipeline without having to spin up a self-hosted runner and pull the data. |
SIGINT is not very effective when running |
btw @0x2b3bfa0 would |
If we use the run cache to save checkpoints, that would be much more elegant than my earlier suggestion. |
potential workflow:
dvc exp run --name JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull
# ... (auto)kill via SIGINT after ~72h ... # CML does this 5 min early
dvc exp push # CML does this
dvc push # CML does this
# ... CML restarts the workflow

better alternative:
dvc exp run --name CI_JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull --push-every-checkpoint
# ... (auto)kill via SIGINT after ~72h ...
# ... CML restarts the workflow

Note that using COMMIT_SHA instead of CI_JOB_ID might not work in cases where the exp params are not stored in the commit (i.e. 2 job ids with different params but yet same commit sha). |
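The "5 min early" stop in the potential workflow above comes down to simple arithmetic on the job's start time, its hard limit, and a safety margin. The sketch below is only an illustration of that calculation; the function name `seconds_until_stop` and its argument order are hypothetical, and the 72 h and 5 min figures are taken from the comment rather than from any CML setting.

```shell
# Seconds left before the graceful stop point.
# Arguments (all in seconds): START and NOW as timestamps,
# LIMIT as the hard job limit, MARGIN as how early to stop.
seconds_until_stop() {
  start=$1
  now=$2
  limit=$3
  margin=$4
  remaining=$(( start + limit - margin - now ))
  # Never report a negative value: 0 means "stop and push now".
  [ "$remaining" -gt 0 ] || remaining=0
  echo "$remaining"
}
```

With a 72 h limit (259200 s) and a 5 min margin (300 s), `seconds_until_stop 0 0 259200 300` prints `258900`; once the stop point has passed, the function prints `0`.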
to be revisited |
Collection of DVC issues which CML functionality needs
Potentially needs
- dvc verify: New command: dvc verify - check that the pipeline is up to date without having to pull or run it dvc#5369
- dulwich auto-auth using CI config
- exp run and mid-checkpoint dvc.lock won't be generated: checkpoints: write dvc.lock on every checkpoint dvc#6180
- dvc exp run && dvc exp run should only execute once
- dvc exp run followed by calling dvc exp run again should resume (rather than start from checkpoint 0)
- dvc exp push for >50MB commits (e.g. somehow push to DVC remote rather than Git remote?): exp push: fails for >50MB commits dvc#6181
Needs
- dvc exp push upon each checkpoint (e.g. via user callback? Or builtin option checkpoints: flag to exp push && push dvc#6182?)
- DVC needs to be aware of total number of checkpoints expected per experiment https://github.com/iterative/dvc/issues/6183
dvc#6182?)DVC needs to be aware of total number of checkpoints expected per experiment https://github.com/iterative/dvc/issues/6183The text was updated successfully, but these errors were encountered: