Connection issues when using k9s with tsh kube login compared to tsh proxy kube #49622

Closed
Ezzahhh opened this issue Dec 2, 2024 · 2 comments


Ezzahhh commented Dec 2, 2024

Expected behavior:
tsh kube login my-cluster
k9s
Navigate and use k9s as normal, with no waiting, context deadline errors, or connection issues.

Current behavior:
tsh kube login my-cluster
k9s
Try to navigate anywhere in k9s, for example viewing logs, describing pods, or changing namespaces.
k9s hangs or becomes slow, and often reports an error such as "http2: client connection lost" or "context deadline exceeded".
If left alone for a minute or two, it usually recovers and becomes fully usable, but not always.

If I use tsh proxy kube instead, I don't observe this issue and can immediately use k9s as normal.
If I use kubectl with tsh kube login, I also do not encounter these issues.

Bug details:
Teleport Version: 17.1.2
Cluster Helm configuration:

chartMode: aws
clusterName: teleport.x.com
kubeClusterName: internal
proxyListenerMode: multiplex
aws:
  region: us-east-1
  backendTable: teleport-helm-backend
  auditLogTable: teleport-helm-events
  auditLogMirrorOnStdout: false
  sessionRecordingBucket: my-teleport-sessions
  backups: true
  dynamoAutoScaling: false
highAvailability:
  replicaCount: 2
  certManager:
    enabled: true
    issuerKind: ClusterIssuer
    issuerName: letsencrypt-prod
enterprise: false
podSecurityPolicy:
  enabled: false
podMonitor:
  enabled: true
operator:
  enabled: true

The cluster sits behind an AWS Network Load Balancer, with multiplex (TLS routing) mode enabled.

Teleport Agent configuration (in another Kubernetes cluster):

kubeClusterName: {{.name}}
roles: kube,db,discovery
proxyAddr: teleport.x.com:443
joinParams:
  tokenName: iam-token
  method: iam
labels:
  cluster: {{.name}}
  environment: {{index .metadata.labels "environment"}}
highAvailability:
  replicaCount: 2
podMonitor:
  enabled: true

k9s logs:

11:36PM WRN namespace validation failed for: "default" error="user not authorized to list all namespaces"
11:36PM INF 🐶 K9s starting up...
11:36PM DBG Active Context "teleport.xcloud.com-x86"
11:36PM INF ✅ Kubernetes connectivity
11:36PM ERR Fail to load global/context configuration error="k9s config file \"/Users/x/Library/Application Support/k9s/config.yaml\" load failed:\nAdditional property fullScreen is not allowed"
11:36PM DBG No custom skin found. Using stock skin
11:36PM DBG Factory START with ns `"default"
11:36PM DBG Fetching latest k9s rev...
11:36PM DBG K9s latest rev: "v0.32.7"
11:36PM DBG No custom skin found. Using stock skin
11:36PM ERR can't connect to cluster error="Get \"https://teleport.xcloud.com:443/version?timeout=15s\": context deadline exceeded"
11:36PM ERR ClusterUpdater failed error="conn check failed (1/5)"
11:37PM ERR can't connect to cluster error="Get \"https://teleport.xcloud.com:443/version?timeout=15s\": context deadline exceeded"
11:37PM ERR ClusterUpdater failed error="conn check failed (2/5)"
11:37PM DBG Dropping update...
11:37PM DBG Describe model elapsed: 30.210608042s
11:37PM ERR reconcile failed "v1/pods" error="Get \"https://teleport.xcloud.com:443/api/v1/namespaces/default/pods/cluster-autoscaler-aws-cluster-autoscaler-chart-ff4f95d99-tdmh2\": http2: client connection lost"
11:37PM ERR refresh failed error="Get \"https://teleport.xcloud.com:443/api/v1/namespaces/default/pods/cluster-autoscaler-aws-cluster-autoscaler-chart-ff4f95d99-tdmh2\": http2: client connection lost"
11:37PM DBG Describe model elapsed: 500.36625ms
11:37PM ERR can't connect to cluster error="Get \"https://teleport.xcloud.com:443/version?timeout=15s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
11:37PM ERR ClusterUpdater failed error="conn check failed (1/5)"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM ERR failed to list namespaces error="user not authorized to list all namespaces"
11:38PM WRN Unable to dial discovery API error="no connection to dial"
11:38PM ERR Watcher failed for v1/pods -- ACCESS -- No API server connection error="ACCESS -- No API server connection"
11:38PM WRN reconciler exited error="context canceled"
11:38PM ERR can't connect to cluster error="Get \"https://teleport.xcloud.com:443/version?timeout=15s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
11:38PM ERR ClusterUpdater failed error="conn check failed (1/5)"
11:38PM WRN   Dial Failed! error="Post \"https://teleport.xcloud.com:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": context deadline exceeded"
11:38PM ERR Component init failed for "Secret" error="Post \"https://teleport.xcloud.com:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": context deadline exceeded"
11:38PM ERR can't connect to cluster error="Get \"https://teleport.xcloud.com:443/version?timeout=15s\": http2: client connection lost"
11:38PM ERR ClusterUpdater failed error="conn check failed (2/5)"
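
To see which failure dominates a log like the one above, the lines can be tallied by error type. The following is a minimal sketch, not part of the original report: `summarize_k9s_errors` is a hypothetical helper name, the four patterns are copied from the messages in this log, and the log file path is an assumption (`k9s info` should report the real one).

```shell
# Hypothetical helper: tally k9s log lines by failure type.
# Reads log lines on stdin and prints "<count><TAB><error type>",
# highest count first.
summarize_k9s_errors() {
  awk '
    /context deadline exceeded/     { c["context deadline exceeded"]++ }
    /http2: client connection lost/ { c["http2: client connection lost"]++ }
    /Client\.Timeout exceeded/      { c["Client.Timeout exceeded"]++ }
    /user not authorized/           { c["user not authorized"]++ }
    END { for (k in c) printf "%d\t%s\n", c[k], k }
  ' | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage (the path is an assumption):
#   summarize_k9s_errors < /path/to/k9s.log
```

Separating the authorization errors from the timeout errors this way makes it easier to see whether the timeouts cluster around startup or continue throughout the session.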

I only see 200 and 201 response codes in the teleport-agent and teleport-proxy logs. However, when k9s starts up I see no entries in the teleport-proxy logs at all, and context deadline errors start appearing once I issue commands in k9s. Is it possible that the initial connection(s) from k9s take a long time to establish?
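
One way to check whether the first requests after login are the slow ones is to time the endpoint k9s polls (`/version`) a few times in a row, once after `tsh kube login` and once under `tsh proxy kube`. This is a rough sketch, assuming `kubectl` is on the PATH. `probe` is a hypothetical helper, and because each `kubectl` call is a separate process with its own connection, it only approximates what k9s does.

```shell
# Hypothetical helper: run a command COUNT times in sequence and print
# "<attempt> <ok|fail> <elapsed seconds>s" for each run.
# usage: probe COUNT CMD [ARGS...]
probe() {
  n=$1
  shift
  i=1
  while [ "$i" -le "$n" ]; do
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then
      status=ok
    else
      status=fail
    fi
    end=$(date +%s)
    echo "$i $status $((end - start))s"
    i=$((i + 1))
  done
}

# Usage, mirroring the 15s timeout seen in the k9s log:
#   probe 5 kubectl get --raw /version --request-timeout=15s
```

If the first attempt is much slower than the rest in one mode but not the other, that would support the idea that connection setup, rather than the requests themselves, is where the time goes.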

Ezzahhh added the bug label Dec 2, 2024

Ezzahhh commented Dec 12, 2024

I'm seeing a lot of timeouts on selfsubjectaccessreviews, which I understand is the call used to determine whether a user may perform an action. This looks potentially related to our environment: perhaps the response to this call is slow when k9s opens many connections without the proxy?
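
The hypothesis above could be tested by sending a burst of access checks at once. `kubectl auth can-i` issues a SelfSubjectAccessReview, so running several in parallel gives a rough stand-in for many simultaneous connections. This is a sketch under assumptions: `burst` is a hypothetical helper, and the verb and resource in the usage line are only examples.

```shell
# Hypothetical helper: start COUNT copies of a command in parallel, wait
# for all of them, then report how many failed and the total wall time.
# usage: burst COUNT CMD [ARGS...]
burst() {
  n=$1
  shift
  start=$(date +%s)
  fails=$(
    i=0
    while [ "$i" -lt "$n" ]; do
      # Each job prints one "fail" line if the command exits non-zero.
      ( "$@" >/dev/null 2>&1 || echo fail ) &
      i=$((i + 1))
    done
    wait
  )
  nfail=$(printf '%s\n' "$fails" | grep -c fail || true)
  end=$(date +%s)
  echo "$n calls, $nfail failed, $((end - start))s"
}

# Usage (verb and resource are examples only):
#   burst 20 kubectl auth can-i list pods --request-timeout=15s
```

Comparing the result after `tsh kube login` with the result under `tsh proxy kube` would show whether the number of parallel connections matters.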


Ezzahhh commented Feb 10, 2025

I believe this is a network issue on my side, though not one I can easily control. When I use a VPN or move to a completely different network, the connection issues disappear. I have narrowed it down to my ISP's implementation of CGNAT, or possibly to CGNAT in general.

I'll close the issue, as it's not really a Teleport bug.
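
For anyone hitting the same symptoms, a quick way to tell whether a connection sits behind CGNAT is to look at the WAN address the router reports: carrier-grade NAT commonly uses the shared address space 100.64.0.0/10 reserved by RFC 6598. The helper below is a sketch with a hypothetical name (`is_cgnat`). It checks only that range, so an ISP that uses other private ranges for CGNAT would not be detected.

```shell
# Hypothetical helper: succeed if the IPv4 address lies in 100.64.0.0/10,
# i.e. first octet 100 and second octet between 64 and 127.
# usage: is_cgnat IPV4
is_cgnat() {
  # Reject anything that does not look like a dotted quad.
  case $1 in
    *.*.*.*) ;;
    *) return 1 ;;
  esac
  o1=${1%%.*}        # first octet
  rest=${1#*.}
  o2=${rest%%.*}     # second octet
  # The second octet must be a plain number before comparing it.
  case $o2 in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$o1" = "100" ] && [ "$o2" -ge 64 ] && [ "$o2" -le 127 ]
}

# Usage (the address is an example):
#   is_cgnat 100.72.10.5 && echo "WAN address is in the CGNAT range"
```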

Ezzahhh closed this as completed Feb 10, 2025