Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to connect to metrics listener in acceptance tests #1769

Closed
teor2345 opened this issue Feb 18, 2021 · 6 comments · Fixed by #1812
Closed

Unable to connect to metrics listener in acceptance tests #1769

teor2345 opened this issue Feb 18, 2021 · 6 comments · Fixed by #1812
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug I-integration-fail Continuous integration fails, including build and test failures
Milestone

Comments

@teor2345
Copy link
Contributor

teor2345 commented Feb 18, 2021

Next Steps

Wait for this failure to happen again, so we can see the diagnostics we added in #1770.

Version

Zebra main, PR #1766, commit 79d45e7.

Platform

Windows CI

Description

The metrics_endpoint test sometimes fails because the test can't connect to the zebrad metrics listener.

We don't know why, because the acceptance test doesn't show the zebrad logs for this error.

Commands

zebrad start with the metrics_endpoint test config.

Newer Logs

Error: 
   0: error trying to connect: tcp connect error: No connection could be made because the target machine actively refused it. (os error 10061)
   1: tcp connect error: No connection could be made because the target machine actively refused it. (os error 10061)
   2: No connection could be made because the target machine actively refused it. (os error 10061)

Stdout:
   Feb 23 21:25:42.629  INFO {zebrad="f7fa70f6" net="Main"}: zebrad::components::metrics: Trying to open metrics endpoint at 127.0.0.1:49821...

Stderr:
   The application panicked (crashed).
   Message:  Opening metrics endpoint listener 127.0.0.1:49821 failed: Hyper(hyper::Error(Listen, Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." })). Hint: Check if another zebrad or zcashd process is running. Try changing the metrics endpoint_addr in the Zebra config.
   Location: zebrad\src\components\metrics.rs:23
---- metrics_endpoint stdout ----
Error: 
   0: error trying to connect: tcp connect error: No connection could be made because the target machine actively refused it. (os error 10061)
   1: tcp connect error: No connection could be made because the target machine actively refused it. (os error 10061)
   2: No connection could be made because the target machine actively refused it. (os error 10061)

Location:
   /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\core\src\convert\mod.rs:539

Command:
   "D:\\a\\zebra\\zebra\\target\\debug\\zebrad.exe" "-c" "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\zebrad_tests.3uFrz3Klk6zO\\zebrad.toml" "start"

Exit Status:
   command exited unsuccessfully with status code -1073740791

Stdout:
   Feb 23 21:25:42.629  INFO {zebrad="f7fa70f6" net="Main"}: zebrad::components::metrics: Trying to open metrics endpoint at 127.0.0.1:49821...
   

Stderr:
   The application panicked (crashed).
   Message:  Opening metrics endpoint listener 127.0.0.1:49821 failed: Hyper(hyper::Error(Listen, Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." })). Hint: Check if another zebrad or zcashd process is running. Try changing the metrics endpoint_addr in the Zebra config.
   Location: zebrad\src\components\metrics.rs:23
   
   Metadata:
   version: 1.0.0-alpha.3
   git commit: f7fa70f6
   Zcash network: Mainnet
   
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ SPANTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     
      0: zebrad::application:: with zebrad="f7fa70f6" net="Main"
         at zebrad\src\application.rs:271
   
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                   ⋮ 11 frames hidden ⋮                              
     12: zebrad::components::metrics::MetricsEndpoint::new<unknown>
         at D:\a\zebra\zebra\zebrad\src\components\metrics.rs:23
           21 │                     info!("Opened metrics endpoint at {}", addr);
           22 │                 }
           23 >                 Err(e) => panic!(
           24 │                     "Opening metrics endpoint listener {:?} failed: {:?}. \
           25 │                      Hint: Check if another zebrad or zcashd process is running. \
     13: zebrad::application::{{impl}}::register_components<unknown>
         at D:\a\zebra\zebra\zebrad\src\application.rs:284
          282 │             components.push(Box::new(TokioComponent::new()?));
          283 │             components.push(Box::new(TracingEndpoint::new(cfg_ref)?));
          284 >             components.push(Box::new(MetricsEndpoint::new(cfg_ref)?));
          285 │         }
          286 │ 
     14: zebrad::application::{{impl}}::init<unknown>
         at D:\a\zebra\zebra\zebrad\src\application.rs:295
          293 │         // We do this first to calculate a proper dependency ordering before
          294 │         // application configuration is processed
          295 >         self.register_components(command)?;
          296 │ 
          297 │         // Fire callback to signal state in the application lifecycle
     15: abscissa_core::application::Application::run<zebrad::application::ZebradApp,std::env::Args><unknown>
         at C:\Users\runneradmin\.cargo\registry\src\jackfan.us.kg-1ecc6299db9ec823\abscissa_core-0.5.2\src\application.rs:60
           58 │         // Initialize application
           59 │         let mut app = Self::default();
           60 >         app.init(&command).unwrap_or_else(|e| fatal_error(&app, &e));
           61 │         app_cell.set_once(app);
           62 │ 
     16: abscissa_core::application::boot<zebrad::application::ZebradApp><unknown>
         at C:\Users\runneradmin\.cargo\registry\src\jackfan.us.kg-1ecc6299db9ec823\abscissa_core-0.5.2\src\application.rs:196
          194 │     let mut args = env::args();
          195 │     assert!(args.next().is_some(), "expected one argument but got zero");
          196 >     A::run(app_cell, args);
          197 │     process::exit(0);
          198 │ }
     17: zebrad::main<unknown>
         at D:\a\zebra\zebra\zebrad\src\bin\zebrad\main.rs:10
            8 │ /// Boot Zebrad
            9 │ fn main() {
           10 >     abscissa_core::boot(&APPLICATION);
           11 │ }
     18: core::ops::function::FnOnce::call_once<fn(),tuple<>><unknown>
         at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\core\src\ops\function.rs:227
                                   ⋮ 13 frames hidden ⋮                              
   
   Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
   

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                ⋮ 9 frames hidden ⋮                               
  10: core::convert::{{impl}}::into<hyper::error::Error,eyre::Report><unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\core\src\convert\mod.rs:539
  11: zebra_test::command::TestChild<tempdir::TempDir>::kill_on_error<tempdir::TempDir,http::response::Response<hyper::body::body::Body>,hyper::error::Error><unknown>
      at D:\a\zebra\zebra\zebra-test\src\command.rs:337
       335 │         let mut error: Report = match result {
       336 │             Ok(success) => return Ok((success, self)),
       337 >             Err(error) => error.into(),
       338 │         };
       339 │ 
  12: acceptance::metrics_endpoint::{{closure}}<unknown>
      at D:\a\zebra\zebra\zebrad\tests\acceptance.rs:954
       952 │     // Test metrics endpoint
       953 │     let res = client.get(url.try_into().expect("url is valid")).await;
       954 >     let (res, child) = child.kill_on_error(res)?;
       955 │     assert!(res.status().is_success());
       956 │     let body = hyper::body::to_bytes(res).await;
  13: core::future::from_generator::{{impl}}::poll<generator-0><unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\core\src\future\mod.rs:80
  14: core::future::future::{{impl}}::poll<mut core::future::from_generator::GenFuture<generator-0>*><unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\core\src\future\future.rs:119
                                ⋮ 14 frames hidden ⋮                              
  29: acceptance::metrics_endpoint<unknown>
      at D:\a\zebra\zebra\zebrad\tests\acceptance.rs:927
       925 │ }
       926 │ 
       927 > #[tokio::test]
       928 │ async fn metrics_endpoint() -> Result<()> {
       929 │     use hyper::Client;
                                ⋮ 3 frames hidden ⋮                               
  33: test::__rust_begin_short_backtrace<fn()><unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\/library\test\src\lib.rs:556
  34: alloc::boxed::{{impl}}::call_once<unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\alloc\src\boxed.rs:1521
                                ⋮ 20 frames hidden ⋮                              

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
---- end of test output ----

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                ⋮ 12 frames hidden ⋮                              
  13: acceptance::metrics_endpoint::{{closure}}<unknown>
      at D:\a\zebra\zebra\zebrad\tests\acceptance.rs:927
       925 │ }
       926 │ 
       927 > #[tokio::test]
       928 │ async fn metrics_endpoint() -> Result<()> {
       929 │     use hyper::Client;
                                ⋮ 2 frames hidden ⋮                               
  16: test::__rust_begin_short_backtrace<fn()><unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\/library\test\src\lib.rs:556
  17: alloc::boxed::{{impl}}::call_once<unknown>
      at /rustc/14b27555a03f4abf9b8d80ea34f59144e2ded5e4\library\alloc\src\boxed.rs:1521
                                ⋮ 20 frames hidden ⋮                              

Older Logs

---- metrics_endpoint stdout ----
Error: 
   0: error trying to connect: tcp connect error: No connection could be made because the target machine actively refused it. (os error 10061)
   1: tcp connect error: No connection could be made because the target machine actively refused it. (os error 10061)
   2: No connection could be made because the target machine actively refused it. (os error 10061)

Location:
   zebrad\tests\acceptance.rs:964
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                ⋮ 9 frames hidden ⋮                               
  10: acceptance::metrics_endpoint::{{closure}}<unknown>
      at D:\a\zebra\zebra\zebrad\tests\acceptance.rs:964
       962 │ 
       963 │     // Test metrics endpoint
       964 >     let res = client.get(url.try_into().expect("url is valid")).await?;
       965 │     assert!(res.status().is_success());
       966 │     let body = hyper::body::to_bytes(res).await?;
  11: core::future::from_generator::{{impl}}::poll<generator-0><unknown>
      at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\library\core\src\future\mod.rs:80
  12: core::future::future::{{impl}}::poll<mut core::future::from_generator::GenFuture<generator-0>*><unknown>
      at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\library\core\src\future\future.rs:119
                                ⋮ 14 frames hidden ⋮                              
  27: acceptance::metrics_endpoint<unknown>
      at D:\a\zebra\zebra\zebrad\tests\acceptance.rs:938
       936 │ }
       937 │ 
       938 > #[tokio::test]
       939 │ async fn metrics_endpoint() -> Result<()> {
       940 │     use hyper::Client;
                                ⋮ 3 frames hidden ⋮                               
  31: test::__rust_begin_short_backtrace<fn()><unknown>
      at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\/library\test\src\lib.rs:516
  32: alloc::boxed::{{impl}}::call_once<unknown>
      at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\library\alloc\src\boxed.rs:1328
                                ⋮ 20 frames hidden ⋮                              

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
---- end of test output ----

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                ⋮ 12 frames hidden ⋮                              
  13: acceptance::metrics_endpoint::{{closure}}<unknown>
      at D:\a\zebra\zebra\zebrad\tests\acceptance.rs:938
       936 │ }
       937 │ 
       938 > #[tokio::test]
       939 │ async fn metrics_endpoint() -> Result<()> {
       940 │     use hyper::Client;
                                ⋮ 2 frames hidden ⋮                               
  16: test::__rust_begin_short_backtrace<fn()><unknown>
      at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\/library\test\src\lib.rs:516
  17: alloc::boxed::{{impl}}::call_once<unknown>
      at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\library\alloc\src\boxed.rs:1328
                                ⋮ 20 frames hidden ⋮                                

https://github.com/ZcashFoundation/zebra/pull/1766/checks?check_run_id=1919269780#step:8:1585

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code S-needs-triage Status: A bug report needs triage P-High I-integration-fail Continuous integration fails, including build and test failures labels Feb 18, 2021
@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Feb 22, 2021
@oxarbitrage
Copy link
Contributor

This seems to don't be in the current sprint but it can potentially make our next PRs to fail, having to restart, etc.

This is a known issue even if we don't have the exact cause. What if we do the following in this sprint until we have more time to debug this:

  • Create a conditional for windows inside the code of the 2 conflict tests.
  • Look for the "trying" strings added at Diagnostics for listener conflict CI failures #1766 instead.
  • Add a comment explaining what is happening, why we are doing it and point it to this ticket.

@oxarbitrage
Copy link
Contributor

Another option:

  • Disable the 2 tests completely from windows runs.

@teor2345
Copy link
Contributor Author

We have a lot of open CI failure tickets and bugs.
So need to triage failures based on how often they happen.

Do you know how often this failure is happening right now?
Do we know if it only happens on Windows?

Whenever I see a CI failure, I add a note to the pull request, and include the ticket number for that CI failure. (Or I open a new CI failure ticket and tag it.)

I don't see any PRs tagged with this bug. So maybe we don't need to do anything about it right now.

Last time I saw this failure, I opened this ticket, and added more diagnostics to the test in PR #1770. So we need to wait for it to happen again. Then we might know how to fix it, or we might decide to disable it.

@teor2345
Copy link
Contributor Author

Do you know how often this failure is happening right now?
Do we know if it only happens on Windows?

Ah, I see that #1811 failed on Windows CI due to this bug.

I'll go check the new logs and see if they help us decide what to do.

@teor2345 teor2345 added this to the 2021 Sprint 4 milestone Feb 24, 2021
@teor2345
Copy link
Contributor Author

teor2345 commented Feb 24, 2021

This is actually a known issue in Windows Hyper-V, introduced by update KB4074588:

After installing this update, applications may not be able to reserve or bind to ports that previously worked.

Check if the port you want to bind to or reserve is already reserved on the container host. For example, you can use netsh as follows:
netsh interface ipv4 show excludedportrange protocol=tcp
If the port is not reserved, you may use it. If it is reserved, choose a different port.

We can't use that workaround, but we can adjust the port range to exclude typically reserved ports (less than 53500).
For details, see:

As an added bonus, we'll be less likely to conflict with ephemeral ports on OSes that allocate them sequentially (macOS, Windows):
https://dataplane.org/ephemeralports.html

@teor2345
Copy link
Contributor Author

(As an alternative, we could switch most tests to use port 0, and then check the logs for the OS-assigned port. But that's a lot more work.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug I-integration-fail Continuous integration fails, including build and test failures
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants