Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[grid] ensure --drain-after-session-count is respected with a lot of sessions in the queue #14987

Merged
merged 1 commit into from
Dec 31, 2024

Conversation

joerg1985
Copy link
Member

@joerg1985 joerg1985 commented Dec 30, 2024

User description

Description

This PR will ensure a node does not start more than drain-after-session-count session by moving the counter check before starting a new session. The drainLock does ensure we do not drain to early while another session is still starting.

Motivation and Context

When a new node does register and there are alot of sessions in the queue, the node did start more than expected sessions.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Bug fix


Description

  • Fixed race condition issue where nodes could exceed configured drain-after-session-count when many sessions are queued
  • Implemented proper synchronization using ReadWriteLock to ensure accurate session counting
  • Moved session count validation before actual session creation to prevent over-allocation
  • Added proper error handling when drain limit is reached
  • Improved logging and error messages for drain-related scenarios

Changes walkthrough 📝

Relevant files
Bug fix
LocalNode.java
Fix race condition in session count tracking and node draining

java/src/org/openqa/selenium/grid/node/local/LocalNode.java

  • Added ReadWriteLock to synchronize session count checking and draining
  • Moved session count decrement check before session creation
  • Changed drainAfterSessions from AtomicBoolean to boolean
  • Added proper locking around session creation and drain checking
  • +49/-12 

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Lock Release

    The read lock is only released in the finally block of newSession() method. If an exception occurs before acquiring the lock, unlock() might be called on an unlocked lock which could cause issues.

    Lock lock = drainLock.readLock();
    lock.lock();
    
    try (Span span = tracer.getCurrentContext().createSpan("node.new_session")) {
      AttributeMap attributeMap = tracer.createAttributeMap();
      attributeMap.put(AttributeKey.LOGGER_CLASS.getKey(), getClass().getName());
      attributeMap.put(
          "session.request.capabilities", sessionRequest.getDesiredCapabilities().toString());
      attributeMap.put(
          "session.request.downstreamdialect", sessionRequest.getDownstreamDialects().toString());
    
      int currentSessionCount = getCurrentSessionCount();
      span.setAttribute("current.session.count", currentSessionCount);
      attributeMap.put("current.session.count", currentSessionCount);
    
      if (currentSessionCount >= maxSessionCount) {
        span.setAttribute(AttributeKey.ERROR.getKey(), true);
        span.setStatus(Status.RESOURCE_EXHAUSTED);
        attributeMap.put("max.session.count", maxSessionCount);
        span.addEvent("Max session count reached", attributeMap);
        return Either.left(new RetrySessionRequestException("Max session count reached."));
      }
    
      if (isDraining()) {
        span.setStatus(
            Status.UNAVAILABLE.withDescription(
                "The node is draining. Cannot accept new sessions."));
        return Either.left(
            new RetrySessionRequestException("The node is draining. Cannot accept new sessions."));
      }
    
      // Identify possible slots to use as quickly as possible to enable concurrent session starting
      SessionSlot slotToUse = null;
      synchronized (factories) {
        for (SessionSlot factory : factories) {
          if (!factory.isAvailable() || !factory.test(sessionRequest.getDesiredCapabilities())) {
            continue;
          }
    
          factory.reserve();
          slotToUse = factory;
          break;
        }
      }
    
      if (slotToUse == null) {
        span.setAttribute(AttributeKey.ERROR.getKey(), true);
        span.setStatus(Status.NOT_FOUND);
        span.addEvent("No slot matched the requested capabilities. ", attributeMap);
        return Either.left(
            new RetrySessionRequestException("No slot matched the requested capabilities."));
      }
    
      if (!decrementSessionCount()) {
        slotToUse.release();
        span.setAttribute(AttributeKey.ERROR.getKey(), true);
        span.setStatus(Status.RESOURCE_EXHAUSTED);
        attributeMap.put("drain.after.session.count", configuredSessionCount);
        span.addEvent("Drain after session count reached", attributeMap);
        return Either.left(new RetrySessionRequestException("Drain after session count reached."));
      }
    
      UUID uuidForSessionDownloads = UUID.randomUUID();
      Capabilities desiredCapabilities = sessionRequest.getDesiredCapabilities();
      if (managedDownloadsRequested(desiredCapabilities)) {
        Capabilities enhanced = setDownloadsDirectory(uuidForSessionDownloads, desiredCapabilities);
        enhanced = desiredCapabilities.merge(enhanced);
        sessionRequest =
            new CreateSessionRequest(
                sessionRequest.getDownstreamDialects(), enhanced, sessionRequest.getMetadata());
      }
    
      Either<WebDriverException, ActiveSession> possibleSession = slotToUse.apply(sessionRequest);
    
      if (possibleSession.isRight()) {
        ActiveSession session = possibleSession.right();
        sessionToDownloadsDir.put(session.getId(), uuidForSessionDownloads);
        currentSessions.put(session.getId(), slotToUse);
    
        SessionId sessionId = session.getId();
        Capabilities caps = session.getCapabilities();
        SESSION_ID.accept(span, sessionId);
        CAPABILITIES.accept(span, caps);
        String downstream = session.getDownstreamDialect().toString();
        String upstream = session.getUpstreamDialect().toString();
        String sessionUri = session.getUri().toString();
        span.setAttribute(AttributeKey.DOWNSTREAM_DIALECT.getKey(), downstream);
        span.setAttribute(AttributeKey.UPSTREAM_DIALECT.getKey(), upstream);
        span.setAttribute(AttributeKey.SESSION_URI.getKey(), sessionUri);
    
        // The session we return has to look like it came from the node, since we might be dealing
        // with a webdriver implementation that only accepts connections from localhost
        Session externalSession =
            createExternalSession(
                session,
                externalUri,
                slotToUse.isSupportingCdp(),
                slotToUse.isSupportingBiDi(),
                desiredCapabilities);
    
        String sessionCreatedMessage = "Session created by the Node";
        LOG.info(
            String.format(
                "%s. Id: %s, Caps: %s",
                sessionCreatedMessage, sessionId, externalSession.getCapabilities()));
    
        return Either.right(
            new CreateSessionResponse(
                externalSession,
                getEncoder(session.getDownstreamDialect()).apply(externalSession)));
      } else {
        slotToUse.release();
        span.setAttribute(AttributeKey.ERROR.getKey(), true);
        span.setStatus(Status.ABORTED);
        span.addEvent("Unable to create session with the driver", attributeMap);
        return Either.left(possibleSession.left());
      }
    } finally {
      lock.unlock();
    Race Condition

    The decrementSessionCount() method first checks drainAfterSessions then decrements the count. This could lead to a race condition if drainAfterSessions changes between the check and decrement.

    private boolean decrementSessionCount() {
      if (this.drainAfterSessions) {
        int remainingSessions = this.sessionCount.decrementAndGet();
        LOG.log(
            Debug.getDebugLogLevel(),
            "{0} remaining sessions before draining Node",
            remainingSessions);
        return remainingSessions >= 0;
      }
      return true;
    }

    Copy link
    Contributor

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Move session count validation before resource allocation to prevent race conditions

    The decrementSessionCount() should be called before acquiring a slot to prevent race
    conditions where multiple sessions could be created before the count is decremented,
    potentially exceeding the drain limit.

    java/src/org/openqa/selenium/grid/node/local/LocalNode.java [502-509]

     if (!decrementSessionCount()) {
    -  slotToUse.release();
       span.setAttribute(AttributeKey.ERROR.getKey(), true);
       span.setStatus(Status.RESOURCE_EXHAUSTED);
       attributeMap.put("drain.after.session.count", configuredSessionCount);
       span.addEvent("Drain after session count reached", attributeMap);
       return Either.left(new RetrySessionRequestException("Drain after session count reached."));
     }
     
    +Optional<SessionSlot> slotToUse = ...
    +
    Suggestion importance[1-10]: 8

    Why: Moving the session count validation before slot allocation is crucial for preventing race conditions that could lead to exceeding session limits. This is a significant improvement for thread safety and resource management.

    8
    Initialize counter values appropriately based on feature enablement to prevent incorrect behavior

    The sessionCount atomic integer should be initialized with the drain count only if
    draining is enabled, otherwise it could lead to premature draining.

    java/src/org/openqa/selenium/grid/node/local/LocalNode.java [182-184]

     this.configuredSessionCount = drainAfterSessionCount;
     this.drainAfterSessions = this.configuredSessionCount > 0;
    -this.sessionCount.set(drainAfterSessionCount);
    +this.sessionCount.set(this.drainAfterSessions ? drainAfterSessionCount : Integer.MAX_VALUE);
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Setting the session count to MAX_VALUE when draining is disabled prevents premature draining and potential system misconfigurations. This is a critical fix for proper system behavior.

    8
    Use appropriate lock type to prevent potential deadlocks in concurrent operations

    The checkSessionCount() method should acquire a read lock instead of trying to
    acquire a write lock to prevent potential deadlocks with the newSession method.

    java/src/org/openqa/selenium/grid/node/local/LocalNode.java [1041-1043]

    -Lock lock = drainLock.writeLock();
    +Lock lock = drainLock.readLock();
     if (!lock.tryLock()) {
       return;
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Using a read lock instead of a write lock is more appropriate here as it allows concurrent reads while preventing deadlocks with the newSession method. This improves concurrency and prevents potential system hangs.

    7

    @joerg1985
    Copy link
    Member Author

    @VietND96 i think this might be interresting for you too?

    @VietND96
    Copy link
    Member

    As far as I understand, this PR will handle a case where the node is set e.g --drain-after-session-count=3, and the queue has e.g 4 requests. It ensure that 3 requests are picked up properly and 1 request still waiting without fail?

    @joerg1985
    Copy link
    Member Author

    Yes, the node did start to drain to early due to this and the still running sessions did get stuck.

    @VietND96
    Copy link
    Member

    Yes, it makes sense since this is a case that I was also thinking about.

    @VietND96 VietND96 merged commit c826fec into trunk Dec 31, 2024
    33 checks passed
    @VietND96 VietND96 deleted the drain-after-session-count branch December 31, 2024 11:02
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants