Check if model folder exists on startup and request processing #3044

Closed
wants to merge 40 commits into from

Conversation

eliteprox
Collaborator

@eliteprox eliteprox commented May 6, 2024

What does this pull request do? Explain your changes. (required)

This PR is dependent on livepeer/ai-runner#79

This change checks whether the requested model folder exists when models are loaded at startup (warm models only) and gracefully handles requests from the gateway for a model whose folder is missing.

  • This improves response times on the network: when the orchestrator is missing the model, it immediately returns a 503 API error code. This is primarily useful for cold models.
  • This improves orchestrator onboarding by logging the exact path where the container looks for the model, both on startup and on individual requests when the model is not found.
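
As a rough illustration of the fail-fast behaviour described in the bullets above, the sketch below rejects a request with a 503 before any container work starts. It is not the code from this PR: `modelChecker`, `aiHandler`, `probe`, the `model_id` query parameter, and the `example-org/...` model names are all hypothetical. Only the 503 status and the "Insufficient capacity for modelID=" message come from the logs in this description.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// modelChecker reports whether a model is available locally.
// Hypothetical type, used here to keep the handler independent of the filesystem.
type modelChecker func(modelID string) bool

// aiHandler returns 503 immediately when the requested model is missing,
// instead of letting the request wait on a container that cannot start.
func aiHandler(exists modelChecker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Assumption: the model ID arrives as a query parameter.
		modelID := r.URL.Query().Get("model_id")
		if !exists(modelID) {
			http.Error(w, "Insufficient capacity for modelID="+modelID, http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe sends one in-memory request to the handler and returns the status code.
func probe(exists modelChecker, modelID string) int {
	req := httptest.NewRequest(http.MethodPost, "/?model_id="+url.QueryEscape(modelID), nil)
	rec := httptest.NewRecorder()
	aiHandler(exists)(rec, req)
	return rec.Code
}

func main() {
	onDisk := map[string]bool{"example-org/present-model": true}
	exists := func(id string) bool { return onDisk[id] }

	fmt.Println(probe(exists, "example-org/present-model"))
	fmt.Println(probe(exists, "example-org/missing-model"))
}
```

Passing the existence check in as a function keeps the handler testable without touching the disk.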

Gateway error log:

I0506 09:29:28.120307 1985227 discovery.go:180] Done fetching orch info numOrch=1 responses=1/1 timedOut=false
I0506 09:29:30.600500 1985227 ai_process.go:344] clientIP=127.0.0.1 request_id=14b57a61 Error submitting request cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1 try=1 orch=https://0.0.0.0:8936 err=Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600545 1985227 handlers.go:1479] clientIP=127.0.0.1 request_id=14b57a61 Error with API code=503 err=no orchestrators available within 2s timeout

AI Core error log on cold model request:

I0506 09:29:28.121922 1984042 ai_http.go:198] manifestID=27_stabilityai/stable-video-diffusion-img2vid-xt-1-1 orchSessionID=8983c425 clientIP=127.0.0.1 Received request id=6156387e cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
2024/05/06 09:29:30 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600020 1984042 handlers.go:1511] HTTP Response Error 503: Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1

AI Core error log on startup:

2024/05/06 10:04:25 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 10:04:25.144208 2005927 starter.go:549] Error AI worker warming text-to-image container: model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist
I0506 10:04:25.144224 2005927 db.go:368] Closing DB

Specific updates (required)

  • Checks whether the given model exists on startup and when processing requests.
  • Uses a new method, ModelExists, in ai-worker that returns a boolean indicating whether the specific model folder exists.
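
A minimal sketch of what such a folder check could look like, assuming the folder name is derived from the model ID the way the log lines above suggest (`models--<org>--<name>`). The real `ModelExists` lives in ai-worker and may differ in signature and behaviour. The two-argument form, the `modelDirName` helper, and the `example-org/present-model` ID are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// modelDirName converts a model ID such as "org/name" into "models--org--name".
// The mapping is inferred from the log output, not taken from the ai-worker source.
func modelDirName(modelID string) string {
	return "models--" + strings.ReplaceAll(modelID, "/", "--")
}

// ModelExists reports whether the folder for modelID exists under modelsDir.
// Hypothetical signature for illustration only.
func ModelExists(modelsDir, modelID string) bool {
	info, err := os.Stat(filepath.Join(modelsDir, modelDirName(modelID)))
	return err == nil && info.IsDir()
}

func main() {
	// Build a throwaway models directory containing one model folder.
	modelsDir, err := os.MkdirTemp("", "models")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(modelsDir)

	present := "example-org/present-model"
	if err := os.Mkdir(filepath.Join(modelsDir, modelDirName(present)), 0o755); err != nil {
		panic(err)
	}

	fmt.Println(ModelExists(modelsDir, present))
	fmt.Println(ModelExists(modelsDir, "stabilityai/stable-video-diffusion-img2vid-xt-1-1"))
}
```

The sketch treats any `os.Stat` error as "missing", which also covers permission problems. A stricter version could distinguish `os.IsNotExist` from other errors.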

How did you test each of these updates (required)

  1. Started go-livepeer with aiModels.json config containing a model that does not exist with warm set to true
  2. Started go-livepeer with aiModels.json config containing a model that does not exist with warm set to false
  3. Sent an AI request through the gateway to go-livepeer configured with a cold model that doesn't exist; received an immediate 503 error response from the orchestrator.
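
The first two test steps can be approximated in a small sketch: a warm entry for a missing model fails the startup check, while a cold entry passes it and is only rejected later, per request. The JSON field names (`pipeline`, `model_id`, `warm`), the `checkWarmModels` helper, and the model ID are assumptions for illustration and are not copied from go-livepeer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelConfig mirrors one entry of an aiModels.json-style config.
// Field names are assumptions.
type modelConfig struct {
	Pipeline string `json:"pipeline"`
	ModelID  string `json:"model_id"`
	Warm     bool   `json:"warm"`
}

// checkWarmModels returns an error for the first warm model that is missing.
// Cold models are skipped here; they would be checked when a request arrives.
func checkWarmModels(rawConfig []byte, exists func(modelID string) bool) error {
	var configs []modelConfig
	if err := json.Unmarshal(rawConfig, &configs); err != nil {
		return fmt.Errorf("parsing model config: %w", err)
	}
	for _, cfg := range configs {
		if cfg.Warm && !exists(cfg.ModelID) {
			return fmt.Errorf("warming %s container: model %s does not exist", cfg.Pipeline, cfg.ModelID)
		}
	}
	return nil
}

func main() {
	nothingOnDisk := func(string) bool { return false }

	warm := []byte(`[{"pipeline":"image-to-video","model_id":"example-org/missing-model","warm":true}]`)
	cold := []byte(`[{"pipeline":"image-to-video","model_id":"example-org/missing-model","warm":false}]`)

	fmt.Println(checkWarmModels(warm, nothingOnDisk)) // non-nil error: startup should stop
	fmt.Println(checkWarmModels(cold, nothingOnDisk)) // nil: the cold model is checked per request
}
```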

Does this pull request close any open issues?
Addresses LIV-117

Checklist:

@github-actions github-actions bot added the AI Issues and PR related to the AI-video branch. label May 6, 2024
@eliteprox eliteprox changed the title Check-model-folder Check if model folder exists on startup and request processing May 6, 2024
@eliteprox eliteprox marked this pull request as ready for review May 6, 2024 14:36
@eliteprox eliteprox requested a review from rickstaa as a code owner May 6, 2024 14:36
eliteprox and others added 23 commits May 7, 2024 07:36
This commit updates the 'ai-worker' dependency to the latest commit.
This commit adds the `gateway` flag and deprecates the `broadcaster` flag per core team decision (details: https://discord.com/channels/423160867534929930/1051963444598943784/1210356864643109004).
)

* Remove -pricePerUnit requirement for orchestrator with -AIWorker flag

* refactor: add PricePerUnit comment

This commit reintroduces the previously omitted comment for the
PricePerUnit variable, improving code readability and maintainability.

* refactor: simplify PricePerUnit flag check condition

This commit simplifies the conditional check used to check if the
`PricePerUnit` flag is needed.

---------

Co-authored-by: Rick Staa <[email protected]>
This commit updates the https://github.com/livepeer/ai-worker to the
latest version so that Orchestrators can enable the
[DeepCache](https://github.com/horseee/DeepCache) optimization. This
optimization will provide a 50% speedup for multi-step inference
requests.
This commit ensures that the global
https://pkg.go.dev/github.com/golang/mock/Mockgen package is correctly
found when the binary is built using the makescript.
This commit enables the NSFW filter on the AI Subnet that has been
implemented on the runner side in
livepeer/ai-runner#76.

BREAKING CHANGE: Depending on how dApps interact with the subnet this
could be a breaking change given that we return an extra `nsfw`
property.
This commit updates the ai-worker so that the right go bindings are
available and no nil errors are thrown.
This commit ensures that the livepeer builder is triggered when AI-version tags
are used (e.g., `v0.7.2-ai-video-1`).
This commit ensures that the ai-worker is up to date so that no `nil`
pointer runtime error is thrown when the runner container returns an
empty response.
* refactor(census): rename Broadcaster metrics to Gateway

This commit renames the metrics related to Broadcaster to Gateway, following
a team decision. More details can be found in the discussion
here:
[Team Discussion Link](https://discord.com/channels/423160867534929930/1051963444598943784/1210356864643109004).

* chore: update pending changelog
…vepeer#3061)

This commit adds the `pricePerGateway` flag and deprecates the
`pricePerBroadcaster` flag
per core team decision (details:
https://discord.com/channels/423160867534929930/1051963444598943784/1210356864643109004).
This commit introduces a safeguard to ensure that the Docker image
tagged
as 'stable' is only pushed when a new tag is created on the stable
branch.
This prevents unintended updates to the stable Docker image, ensuring
consistency and reliability for users relying on the stable tag.
This commit addresses a syntax error in the Docker image tag creation
step.
…vepeer#3059)

* Fix nil baseprice when pricePerUnit is unused in aiWorker

* fix: fix priceInfo 'nil' error on discovery

This commit ensures that when the `transcodePrice` is not set by the AI
orchestrator, no `nil` error is thrown when a Gateway requests the
orchestrator's OrchInfo.

* fix(ai): fix incorrect transcodePrice condition

This commit fixes the condition that checks whether transcodePrice
is set.

---------

Co-authored-by: Rick Staa <[email protected]>
This commit ensures that the livepeer_cli does not throw a `nil` error
when it tries to retrieve the orchestrator base price.
This commit allows orchestrators to pin the https://hub.docker.com/r/livepeer/ai-runner image, preventing disruptions from breaking changes in the latest tag.
This commit ensures that the stable tag is created on the master branch.
* add safety check to image-to-video input image

* refactor(ai): improve code syntax

This commit improves the code syntax by making the output format
generation step consistent between pipelines. It also updates the
ai-worker to the latest version.

---------

Co-authored-by: Brad P <[email protected]>
rickstaa and others added 15 commits July 26, 2024 14:57
This commit updates the ai-worker dependency to the latest version (i.e.
v0.0.4).
This commit updates the AI worker to v0.0.5 so that people can use the
new I2I pix2pix model.
This commit updates the ai-worker to the latest version (i.e. v0.0.6) in
order to fix a syntax error that was introduced due to an upstream
dependency in v0.0.4 and v0.0.5.
…re calculation (livepeer#3074)

* fix(ai): Fix accuracy of T2I latency score when num_inference_steps provided

* refactor(ai): update numInferenceSteps default

This commit ensures that the same numInferenceSteps default value is
used as the one set in
https://github.com/livepeer/ai-worker/blob/31fe460a45e1d9e908d3a1bdcfdd8822c3889214/runner/app/routes/text_to_image.py#L28.

---------

Co-authored-by: Elite Encoder <[email protected]>
This commit ensures that the go-livepeer ai-video branch uses the latest
ai-worker dependency (i.e. v0.0.7).
* add upscale image support using stabilityai/stable-diffusion-x4-upscaler model

* fix(ai): fix ai-worker client bindings

This commit ensures that the right golang client bindings response and
request types are used. It also cleans up the codebase a bit.

---------

Co-authored-by: Mike Zupper <[email protected]>
…3093)

This commit ensures that the I2I pipeline latency score calculation now
considers the number of images.
…ivepeer#3099)

This commit adds support for the `num_inference_steps` parameter to the
I2I, I2V and upscale pipelines. It also fixes an incorrect latencyScore
calculation for the bytedance model.
* Add speech-to-text pipeline, refactor processAIRequest and handleAIRequest to allow for various response types

* Pin gomod to ai-runner for testing

* Revert "Pin gomod to ai-runner for testing"

This reverts commit d4ba500.

* Update go mod dep for ai-worker

* Calculate pixel value of audio file

* fix go-mod deps

* Adjust price calculation

* one second per pixel

* cleanup, fix missing duration

* Add supported file types, calculate price by milliseconds

* Add bad request response for unsupported file types

* Update name of function

* Update go mod to ai-runner

* Use ffmpeg to get duration

* update install_ffmpeg.sh to parse audio better

* Check for audio codec instead of video codec

* gomod edits

* add docker file

* Update install_ffmpeg.sh to improve audio support, Add duration validation and logging, pin lpms

* rename speech-to-text to audio-to-text

* Update go-mod

* cleanup

* update go mod

* remove comment

* update gomod

* Update lpms mod

* Update to latest lpms

* Update lpms

* feat(ai): apply code improvements to AudioToText pipeline

This commit applies several code improvements to the AudioToText
codebase.

* Remove unnecessary logic

* Remove unused error

* Fix missing err

* Update go.mod and tidy

* chore(ai): update ai-worker and lpms to latest version

This commit ensures that the ai-worker and lpms are at the latest
versions which contain the changes needed for the audio-to-text
pipeline.

---------

Co-authored-by: 0xb79orch <[email protected]>
Co-authored-by: Rick Staa <[email protected]>
* Add gateway metric for roundtrip ai times by model and pipeline

* Rename metrics and add unique manifest

* Fix name mismatch

* modelsRequested not working correctly

* feat: add initial POC AI gateway metrics

This commit adds the initial AI gateway metrics so that they can be
reviewed by others. The code still needs to be cleaned up and the buckets
adjusted.

* feat: improve AI metrics

This commit improves the AI metrics so that they are easier to work
with.

* feat(ai): log no capacity error to metrics

This commit ensures that an error is logged when the Gateway could not
find orchestrators for a given model and capability.

* feat(ai): add TicketValueSent and TicketsSent metrics

This commit ensures that the `ticket_value_sent` and `tickets_sent`
metrics are also created for an AI Gateway.

* fix(ai): ensure that AI metrics have orch address label

This commit ensures that the AI gateway metrics contain the orch address
label.

* fix(ai): fix incorrect Gateway pricing metric

This commit ensures that the AI job pricing is calculated correctly and
cleans up the codebase.

* refactor(ai): remove Orch label from ai_request_price metric

This commit removes the Orch label from the ai_request_price metrics
since that information is better to be retrieved from another endpoint.

---------

Co-authored-by: Elite Encoder <[email protected]>
This commit adds the gateway metrics to the Audio-to-text pipeline.
* Add gateway metric for roundtrip ai times by model and pipeline

* Rename metrics and add unique manifest

* Fix name mismatch

* modelsRequested not working correctly

* feat: add initial POC AI gateway metrics

This commit adds the initial AI gateway metrics so that they can be
reviewed by others. The code still needs to be cleaned up and the buckets
adjusted.

* feat: improve AI metrics

This commit improves the AI metrics so that they are easier to work
with.

* feat(ai): log no capacity error to metrics

This commit ensures that an error is logged when the Gateway could not
find orchestrators for a given model and capability.

* feat(ai): add TicketValueSent and TicketsSent metrics

This commit ensures that the `ticket_value_sent` and `tickets_sent`
metrics are also created for an AI Gateway.

* fix(ai): ensure that AI metrics have orch address label

This commit ensures that the AI gateway metrics contain the orch address
label.

* feat(ai): add orchestrator AI census metrics

This commit introduces a suite of AI orchestrator metrics to the census
module, mirroring those received by the Gateway. The newly added metrics
include `ai_models_requested`, `ai_request_latency_score`,
`ai_request_price`, and `ai_request_errors`, facilitating comprehensive
tracking and analysis of AI request handling performance on the orchestrator side.

* refactor: improve orchestrator metrics tags

This commit ensures that the right tags are attached to the Orchestrator
AI metrics.

* refactor(ai): improve latency score calculations

This commit ensures that no divide-by-zero errors can occur in the
latency score calculations.
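
A guard of this kind might look like the sketch below. The formula (elapsed seconds divided by output pixels times inference steps) is an assumption pieced together from the commit messages in this timeline, and `latencyScore` is a hypothetical helper rather than the function changed by this commit.

```go
package main

import "fmt"

// latencyScore divides the elapsed time by a rough estimate of the work done.
// Hypothetical formula: seconds / (output pixels * inference steps).
// It returns 0 instead of dividing when either factor is non-positive.
func latencyScore(tookSeconds float64, outPixels int64, numInferenceSteps int) float64 {
	if outPixels <= 0 || numInferenceSteps <= 0 {
		return 0
	}
	return tookSeconds / (float64(outPixels) * float64(numInferenceSteps))
}

func main() {
	fmt.Println(latencyScore(8, 4, 2)) // ordinary case
	fmt.Println(latencyScore(8, 0, 2)) // zero pixels: guarded, no division
}
```

Returning 0 is one possible choice for the degenerate case. Reporting an error or skipping the metric would be equally reasonable.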

---------

Co-authored-by: Elite Encoder <[email protected]>
This commit applies some small comment changes to ease the conflicts
between the main and ai-video branch.
@eliteprox eliteprox closed this Jul 26, 2024
Labels
AI Issues and PR related to the AI-video branch.

3 participants