
[CDAP-21096] Split ProgramLifecycleService for Appfabric service and processor #15824

Merged

vsethi09 merged 1 commit into develop from feature/CDAP-21096_split_ProgramLifecycleService on Jan 30, 2025

Conversation

@vsethi09 vsethi09 (Contributor) commented on Jan 27, 2025

Context

Multiple instances of ProgramLifecycleService cannot run because it uses ProgramRuntimeService, which keeps an in-memory cache of runtimeInfos. For this reason, ProgramLifecycleHttpHandler was temporarily moved to the singleton Appfabric processor. More details: #15773 (comment)

Change Description

  1. Split ProgramLifecycleHttpHandler into ProgramLifecycleHttpHandler (which runs on the Appfabric server) and ProgramRuntimeHttpHandler (which runs on the Appfabric processor).
    • ProgramRuntimeHttpHandler holds the handlers that manage runtimeInfo.
    • All common methods were moved to ProgramUtil to avoid duplication.
    • All methods were moved without modification.
  2. Split ProgramLifecycleService into ProgramLifecycleService (used by ProgramLifecycleHttpHandler) and ProgramUtil.
    • ProgramUtil contains common program utilities used by handlers and services. It does not reference runtimeInfo.
    • All methods were moved without modification, except checkConcurrentExecution:
      • ProgramLifecycleService.checkConcurrentExecution() now uses the store implementation instead of runtimeInfo (see the first sketch after this list).
  3. Changed RouterPathLookup to route the paths served by ProgramRuntimeHttpHandler to the Appfabric processor (see the second sketch after this list).
  4. Moved these handlers back to AppfabricServer (they had been moved to the processor temporarily, as per [CDAP-21096] Split Appfabric into stateless service and stateful processor #15773 (comment)):
    • AppLifecycleHttpHandler
    • AppLifecycleHttpHandlerInternal
    • ProgramLifecycleHttpHandler
    • ProgramLifecycleHttpHandlerInternal
    • WorkflowHttpHandler
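For item 2, a minimal sketch of what a store-backed checkConcurrentExecution can look like. RunRecordStore, countActiveRuns, and ConcurrentRunsException are stand-ins for illustration, not the actual CDAP APIs; the point is that the check reads persisted run records rather than the per-process runtimeInfo cache, so it gives the same answer on any Appfabric instance.

```java
// Illustrative only: RunRecordStore, countActiveRuns and ConcurrentRunsException
// are hypothetical stand-ins for the actual CDAP store APIs.
interface RunRecordStore {
  /** Number of runs in STARTING/RUNNING state recorded for the program. */
  int countActiveRuns(String programId);
}

class ConcurrentRunsException extends Exception {
  ConcurrentRunsException(String message) {
    super(message);
  }
}

class ConcurrencyCheckSketch {
  private final RunRecordStore store;

  ConcurrencyCheckSketch(RunRecordStore store) {
    this.store = store;
  }

  // Reads persisted run records instead of the in-memory runtimeInfo cache,
  // so the check no longer pins this logic to the singleton processor.
  void checkConcurrentExecution(String programId, int maxConcurrentRuns)
      throws ConcurrentRunsException {
    int activeRuns = store.countActiveRuns(programId);
    if (activeRuns >= maxConcurrentRuns) {
      throw new ConcurrentRunsException(String.format(
          "Program %s already has %d active runs; at most %d are allowed",
          programId, activeRuns, maxConcurrentRuns));
    }
  }
}
```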
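For item 3, a schematic of the routing split. The service names and path predicate below are made up for illustration; CDAP's actual RouterPathLookup matching is more involved.

```java
// Illustrative only: the path predicate and service names are hypothetical.
class RouterPathLookupSketch {
  static final String APP_FABRIC_SERVICE = "appfabric";             // scalable server
  static final String APP_FABRIC_PROCESSOR = "appfabric-processor"; // singleton

  // Endpoints backed by ProgramRuntimeHttpHandler (anything that touches the
  // in-memory runtimeInfo cache) must land on the singleton processor; every
  // other Appfabric endpoint can be served by any server instance.
  String getRoutingService(String path) {
    if (isRuntimeEndpoint(path)) {
      return APP_FABRIC_PROCESSOR;
    }
    return APP_FABRIC_SERVICE;
  }

  private boolean isRuntimeEndpoint(String path) {
    // Hypothetical predicate standing in for the real path matching rules.
    return path.contains("/runtime") || path.endsWith("/live-info");
  }
}
```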

Verification

  • Unit Tests
  • CDAP sandbox
  • Distributed docker image

@vsethi09 vsethi09 changed the title [CDAP-21096] Split ProgramLifecycleService for Appfabric service and processor [WIP][CDAP-21096] Split ProgramLifecycleService for Appfabric service and processor Jan 27, 2025
@vsethi09 vsethi09 added the build Triggers github actions build label Jan 27, 2025
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch 9 times, most recently from 619b867 to 6a1e84d on January 27, 2025 12:54
@vsethi09 vsethi09 marked this pull request as ready for review on January 27, 2025 14:06
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch from 6a1e84d to 4b9b796 on January 27, 2025 14:13
@vsethi09 vsethi09 changed the title [WIP][CDAP-21096] Split ProgramLifecycleService for Appfabric service and processor [CDAP-21096] Split ProgramLifecycleService for Appfabric service and processor on January 27, 2025
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch 5 times, most recently from 4ded2d9 to 27a0875 on January 29, 2025 20:17
@vsethi09 vsethi09 requested a review from tivv on January 29, 2025 20:24
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch from 27a0875 to 02796c6 on January 29, 2025 20:52
Review thread on this diff excerpt:

```java
@@ -121,11 +127,18 @@ void setRemoteTwillRunnerService(
  @Override
  public final RuntimeInfo run(ProgramDescriptor programDescriptor, ProgramOptions options,
      RunId runId) {
    ProgramId programId = programDescriptor.getProgramId();
    ProgramRunId programRunId = programId.run(runId);
    synchronized (this) {
```
A contributor commented:
You need to include updateRuntimeInfo in the synchronized block to have atomicity; otherwise this synchronized block does not do anything.
Note that updateRuntimeInfo in turn uses runtimeInfosLock.
So the correct option is probably to take the write lock on runtimeInfosLock instead of using synchronized, and hold it until after updateRuntimeInfo. Note that the write lock must be taken up front, since upgrading from a read lock to a write lock is not possible.

@vsethi09 vsethi09 (Contributor, Author) replied on Jan 30, 2025:

Done

Acquired writeLock() on the ReentrantReadWriteLock before the runtimeInfo lookup() and held it until after updateRuntimeInfo.
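The resulting pattern, sketched below. The class is simplified; the field and method names follow the discussion rather than the actual CDAP source.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class RuntimeInfoCacheSketch {
  private final ReadWriteLock runtimeInfosLock = new ReentrantReadWriteLock();
  private final ConcurrentHashMap<String, Object> runtimeInfos = new ConcurrentHashMap<>();

  Object run(String programRunId) {
    // Take the write lock up front and hold it through updateRuntimeInfo: it
    // must cover both the lookup and the update for the check-then-act to be
    // atomic, and a read lock cannot be upgraded to a write lock later.
    Lock writeLock = runtimeInfosLock.writeLock();
    writeLock.lock();
    try {
      Object existing = runtimeInfos.get(programRunId); // the lookup()
      if (existing != null) {
        return existing;
      }
      Object info = startRun(programRunId);
      updateRuntimeInfo(programRunId, info);
      return info;
    } finally {
      writeLock.unlock();
    }
  }

  private Object startRun(String programRunId) {
    return new Object(); // stand-in for actually launching the program run
  }

  private void updateRuntimeInfo(String programRunId, Object info) {
    runtimeInfos.put(programRunId, info); // cache write guarded by the same lock
  }
}
```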

A contributor replied:
BTW, for information sharing's sake: the ReadWriteLock is acquired through a factory as a bit of long-range thinking toward AppFabric horizontal scaling, where a central locking service would be introduced. It may never be used that way, since without batching a lock service would probably add too much latency to be useful, but at least it marks the places where non-horizontally-scalable things happen.

@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch from 02796c6 to 36a4832 on January 30, 2025 16:34
@tivv tivv (Contributor) left a comment:

LGTM with small comments

@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch from 36a4832 to 14a767f on January 30, 2025 17:14
Commit message: …processor

Replace inheritance with composition
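A schematic of the commit note's "replace inheritance with composition", using made-up members: the handler holds an injected ProgramUtil instead of extending a shared base class.

```java
// Schematic only: method names and wiring are illustrative, not CDAP's.
class ProgramUtil {
  void ensureProgramExists(String programId) {
    // ... validation shared by handlers and services ...
  }
}

class ProgramLifecycleHttpHandlerSketch {
  private final ProgramUtil programUtil;

  // Shared logic is injected (composition) rather than inherited, so the same
  // ProgramUtil can back handlers on both the Appfabric server and processor.
  ProgramLifecycleHttpHandlerSketch(ProgramUtil programUtil) {
    this.programUtil = programUtil;
  }

  void startProgram(String programId) {
    programUtil.ensureProgramExists(programId);
    // ... handler-specific work ...
  }
}
```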
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_split_ProgramLifecycleService branch from 14a767f to ee0968a on January 30, 2025 17:17
@vsethi09 vsethi09 merged commit 8143416 into develop Jan 30, 2025
9 checks passed
@vsethi09 vsethi09 deleted the feature/CDAP-21096_split_ProgramLifecycleService branch January 30, 2025 18:33
Quality Gate failed

Failed conditions
66.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud
