OBD devices are not always removed on umount #395

Open
bwjoh opened this issue Sep 26, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@bwjoh

bwjoh commented Sep 26, 2024

/kind bug

What happened?
Ran the following job on a cluster with aws-fsx-csi-driver:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: mount-stress-
spec:
  parallelism: 1
  completions: 100
  ttlSecondsAfterFinished: 10
  template:
    spec:
      containers:
      - name: busybox-mount
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ['sh', '-c', 'echo "Test Job Start" && sleep 15 && echo "Test Job End" && exit 0']
        resources:
          limits:
            memory: "2048Mi"
            cpu: "500m"
          requests:
            memory: "2048Mi"
            cpu: "500m"
        volumeMounts:
          - mountPath: /mnt/fsx/test
            name: fsx-mount
      restartPolicy: Never
      volumes:
        - name: fsx-mount
          persistentVolumeClaim:
            claimName: lustre-test
  backoffLimit: 4

OBD devices created when mounting the file system were removed on unmount only ~59% of the time. I monitored the device count using lctl dl | wc -l.

There is some documentation about monitoring devices here (due to a limit of 8192 by the Lustre client): https://aws.amazon.com/blogs/storage/best-practices-for-monitoring-amazon-fsx-for-lustre-clients-and-file-systems/
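
A minimal sketch of that monitoring, for anyone who wants to track the count over time instead of running the command by hand. It assumes `lctl dl` prints one device per non-empty line; the function names and the 80% warning threshold are illustrative choices, not anything from the driver or the Lustre client.

```python
import subprocess

OBD_DEVICE_LIMIT = 8192  # Lustre client device limit mentioned above


def count_obd_devices(lctl_dl_output: str) -> int:
    """Count devices, assuming `lctl dl` prints one device per non-empty line."""
    return sum(1 for line in lctl_dl_output.splitlines() if line.strip())


def read_lctl_dl() -> str:
    """Run `lctl dl` on the node (needs the Lustre client, usually as root)."""
    result = subprocess.run(
        ["lctl", "dl"], capture_output=True, text=True, check=True
    )
    return result.stdout


def near_limit(count: int, warn_fraction: float = 0.8) -> bool:
    """True once the count reaches warn_fraction of the limit (hypothetical threshold)."""
    return count >= OBD_DEVICE_LIMIT * warn_fraction
```

On a node, `count_obd_devices(read_lctl_dl())` should match `lctl dl | wc -l` as long as the output contains no blank lines, and `near_limit(...)` could feed an alert before the node needs recycling.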

What you expected to happen?
After the above job finishes, lctl dl | wc -l shows 0.

How to reproduce it (as minimally and precisely as possible)?
I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It uses Lustre client version 1.12.8.

On the host instance I haven't been able to reproduce this behaviour with mount and umount directly. There are no obvious errors from syslog on the host when devices are not removed (fsx-driver logs related to unmounting are all successful).
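
A rough sketch of a host-side mount/umount cycle test, assuming the device count is sampled after every unmount. The helper names, the placeholder file system address, and the mount point are hypothetical, and this is not necessarily the sequence the CSI driver itself performs.

```python
import subprocess
from typing import Callable, List


def run_cycles(
    mount: Callable[[], None],
    umount: Callable[[], None],
    count_devices: Callable[[], int],
    cycles: int,
) -> List[int]:
    """Mount and unmount `cycles` times; return the device count seen after each unmount."""
    leftovers = []
    for _ in range(cycles):
        mount()
        umount()
        leftovers.append(count_devices())
    return leftovers


def leaked_cycles(leftovers: List[int], baseline: int = 0) -> int:
    """Number of cycles after which the device count was higher than before the cycle."""
    leaked = 0
    previous = baseline
    for count in leftovers:
        if count > previous:
            leaked += 1
        previous = count
    return leaked


def host_mount() -> None:
    # Placeholder source: substitute the real file system DNS name and mount name.
    subprocess.run(
        ["mount", "-t", "lustre", "FS_DNS_NAME@tcp:/MOUNT_NAME", "/mnt/fsx/test"],
        check=True,
    )


def host_umount() -> None:
    subprocess.run(["umount", "/mnt/fsx/test"], check=True)


def host_count_devices() -> int:
    # Same measurement as `lctl dl | wc -l`, ignoring blank lines.
    out = subprocess.run(
        ["lctl", "dl"], capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())
```

Calling `leaked_cycles(run_cycles(host_mount, host_umount, host_count_devices, 100))` as root would give a number directly comparable with the 100-completion job above.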

I am not sure if this is an issue with Lustre client version, something specific to the CSI workflow, or something else.

Anything else we need to know?:
This has been problematic as we have workflows with short-lived pods, and nodes can be recycled frequently to avoid hitting the Lustre client 8192 device limit.

This may also be related to some memory issues we have had on nodes - /proc/vmallocinfo ends up with many cfs_hash_buckets_realloc entries (looks related to the Lustre client) from the leftover devices. We have not found a way to remove these leftover devices besides recycling nodes.
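
To quantify those entries, a small sketch along these lines could total them. It assumes the usual /proc/vmallocinfo layout (address range, size in bytes, then the caller); the function name is made up for illustration, and reading the file generally requires root.

```python
from typing import Tuple


def summarize_vmalloc(
    vmallocinfo_text: str, symbol: str = "cfs_hash_buckets_realloc"
) -> Tuple[int, int]:
    """Return (entry_count, total_bytes) for lines whose caller mentions `symbol`.

    Assumes each line looks like: <start>-<end> <size_bytes> <caller> ...
    """
    entries = 0
    total_bytes = 0
    for line in vmallocinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or symbol not in line:
            continue
        try:
            size = int(fields[1])
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        entries += 1
        total_bytes += size
    return entries, total_bytes


def read_vmallocinfo(path: str = "/proc/vmallocinfo") -> str:
    """Read the file from the node; typically only readable by root."""
    with open(path, "r") as handle:
        return handle.read()
```

Comparing `summarize_vmalloc(read_vmallocinfo())` before and after a batch of pods would show whether the entries grow along with the leftover device count.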

Any confirmation that others are hitting this issue, or guidance on how to avoid it, would be appreciated!

Environment

  • Kubernetes version (use kubectl version): 1.30.3
  • Driver version: v1.2.0

@k8s-ci-robot added the kind/bug label Sep 26, 2024

@jacobwolfaws
Contributor

I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It is using Lustre client with version 1.12.8.

Was this custom AMI built using an FSx for Lustre vended 2.12.8 Lustre client?

@bwjoh
Author

bwjoh commented Sep 27, 2024

Realized my initial post is a bit unclear - I haven't tried to reproduce this issue with a generic AWS AMI - I have only tested with a custom AMI.

The AMI we are using has Lustre client installed based on https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 26, 2024

@jacobwolfaws
Contributor

Hi @bwjoh can you provide some more information about the client version and kernel version being used for the underlying OS?
