Improve test stability issues caused by resources failing to start due to "address in use" errors #6678

Closed
DamianEdwards opened this issue Nov 14, 2024 · 6 comments
Labels
area-app-testing Issues pertaining to the APIs in Aspire.Hosting.Testing area-orchestrator

@DamianEdwards
Member

Found in the dotnet/aspire-samples tests (e.g. this CI failure).

Sometimes when running app host integration tests using DistributedApplicationTestingBuilder, resources can fail to start due to the port that was randomly assigned by DCP already being in use by the time the resource goes to start:

[Image: screenshot not included]

This results in a test that fails, but that will most likely pass on re-run. We should look at how we can improve this so that tests are more reliable, e.g. doing automatic retries when resources fail to start.
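
To make the failure mode concrete, here is a minimal sketch of the same kind of race using plain TCP sockets. It is not how DCP allocates ports; the helper names `pick_free_port` and `try_listen` are hypothetical and exist only for this illustration. A port is chosen first and bound later, and anything that grabs the port in between causes an "address in use" failure.

```python
import errno
import socket
from typing import Optional

HOST = "127.0.0.1"


def pick_free_port() -> int:
    """Ask the OS for a free TCP port, then release it again.

    The port is only known to be free at this instant; nothing reserves it
    for the caller afterwards, which is where the race comes from.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((HOST, 0))  # port 0 lets the OS choose a free port
        return probe.getsockname()[1]


def try_listen(port: int) -> Optional[socket.socket]:
    """Bind and listen on the port; return None if the address is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((HOST, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise  # any other error is not the case being illustrated


if __name__ == "__main__":
    port = pick_free_port()
    # Simulate an unrelated process taking the port before the resource starts.
    squatter = try_listen(port)
    # The "resource" now tries to start on the port it was assigned.
    resource = try_listen(port)
    print("resource started:", resource is not None)
    if squatter is not None:
        squatter.close()
```

The window between choosing the port and binding it is short, which would be consistent with a failure that usually goes away on re-run.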

DamianEdwards added the area-app-testing, area-orchestrator, and untriaged labels on Nov 14, 2024
@KennethHoff

KennethHoff commented Nov 14, 2024

Is there something to be done about the "address in use" problem in general? (Or more specifically, Aspire being bad at ending its child processes.) I've had to resort to creating a script that first executes kill on various processes before starting Aspire in order to make restarts more palatable. This could presumably also be solved by making hot reloading more functional; not requiring a full restart when you change/add a new environment variable, for example.

@DamianEdwards
Member Author

> Is there something to be done about the "address in use" problem in general?

The challenge is always that it's inherently racy if you want to know what the port is before you assign it to the process. Also, when it does fail, it's not clear whether that's due to some other random process using the port, a previous instance of something launched by Aspire that wasn't shut down properly, or even random collisions in situations like parallel startup. I think adding some kind of support for startup retries could help here. I might see if I can add some startup-failure detection and retry logic to the testing infrastructure in dotnet/aspire-samples at the app host level, e.g. when a resource state changes to FailedToStart, scrape its logs for "port/address already in use" messages and, in those cases, issue a "Start" command.
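
A rough, language-neutral sketch of that retry idea follows. It does not use any Aspire API: `start_with_retry`, `is_port_conflict`, the `start` and `read_logs` callbacks, and the log pattern are all assumptions made for illustration. The only detail taken from the description above is the FailedToStart state and the idea of restarting only when the logs point to a port conflict.

```python
import re
from typing import Callable, Iterable

# Assumed pattern; real log messages differ across runtimes and platforms.
PORT_CONFLICT = re.compile(
    r"(address|port)( is)? (already )?in use|EADDRINUSE", re.IGNORECASE
)


def is_port_conflict(log_lines: Iterable[str]) -> bool:
    """Return True if any log line looks like a port/address conflict."""
    return any(PORT_CONFLICT.search(line) for line in log_lines)


def start_with_retry(
    start: Callable[[], str],
    read_logs: Callable[[], Iterable[str]],
    max_attempts: int = 3,
) -> str:
    """Start a resource, retrying only when it failed because of a port conflict.

    `start` is a hypothetical callback that issues the start command and
    returns the resulting state, e.g. "Running" or "FailedToStart".
    `read_logs` is a hypothetical callback returning the resource's log lines.
    """
    state = "FailedToStart"
    for _ in range(max_attempts):
        state = start()
        if state != "FailedToStart":
            return state
        if not is_port_conflict(read_logs()):
            # Failed for some other reason; retrying would only hide a real bug.
            return state
    return state
```

Restricting retries to port conflicts keeps genuine startup failures visible. A retry of this kind can only succeed if the conflicting process has released the port or the resource is given a different port on the next attempt.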

> Aspire being bad at ending its child processes

This was improved in between 9.0.0-rc.1 and 9.0.0, so I'd love to hear if you're seeing any improvement after updating to the latest version.

> This could presumably also be solved by making hot reloading more functional; not requiring a full restart when you change/add a new environment variable for example.

For sure, this is something we're thinking of improving but likely more in the mid-to-long term rather than short term.

@yoDon

yoDon commented Nov 18, 2024

This issue is not just a "test stability issue". Aspire 9 seems to have lots of problems stopping projects. In Aspire 8, I routinely started and stopped solutions without issue. In Aspire 9 on OSX with Podman Desktop and the Rider Aspire plugin, about half the time there are stuck ports preventing Aspire from starting.

@DamianEdwards
Member Author

@karolz-ms @danegsta FYI

@yoDon can you try seeing what it's like without using the Rider plug-in, i.e. starting/stopping from the cmd line using dotnet run? Trying to see if we can narrow down the scope.

@yoDon

yoDon commented Nov 18, 2024

@DamianEdwards
Yes, I am able to reproduce the issue by running the AppHost with dotnet run on OSX + Podman Desktop (no Rider involvement).

I'm also mentioning a non-test-related issue that I reported around this, in case it's helpful to connect the two: #6704

joperezr added this to the Backlog milestone on Nov 20, 2024
joperezr removed the untriaged label on Nov 20, 2024
joperezr changed the milestone from Backlog to 9.1 on Nov 20, 2024
karolz-ms self-assigned this on Dec 9, 2024
@davidfowl
Member

Fixed by #7098
