Database connection issues when using CRaC with quarkus #45857

Open
ybezsonov opened this issue Jan 25, 2025 · 9 comments
Labels
area/core kind/question Further information is requested

Comments


ybezsonov commented Jan 25, 2025

Describe the bug

Hi,
I have a Jakarta EE application which runs on Quarkus and uses a PostgreSQL datasource and jakarta.persistence.Entity. I provide datasource parameters via environment variables quarkus.datasource.* at startup, and everything works as expected.

I want to use CRaC with Quarkus. I enabled it, and when I do a checkpoint, I get this error:
#11 6.143 An exception during a checkpoint operation:
#11 6.143 jdk.internal.crac.mirror.CheckpointException
#11 6.143 Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenSocketException: Socket[addr=db-cluster.cluster.rds.amazonaws.com/10.0.3.155,port=5432,localport=39960]
#11 6.143 at java.base/jdk.internal.crac.JDKSocketResourceBase.lambda$beforeCheckpoint$0(JDKSocketResourceBase.java:68)

I tried to implement:
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
    dataSource.close();
}

The checkpoint is successful now, but when I try to restore and call an API which uses the database, I get this error:
jakarta.el.ELException: org.hibernate.exception.GenericJDBCException: Unable to acquire JDBC Connection [This pool is closed and does not handle any more connections!]

Is there a bug with handling database connection during CRaC Checkpoint and Restore? Could you please tell me if I need to do something else to handle database connection?
Is there a way to restore connection afterRestore?

Expected behavior

Quarkus should close database connections in beforeCheckpoint and restore them in afterRestore.

Actual behavior

The database connection is still open at beforeCheckpoint, which prevents the checkpoint from succeeding.
If the connection is closed in beforeCheckpoint, it is not restored automatically in afterRestore.

How to Reproduce?

import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

import io.agroal.api.AgroalDataSource;
import jakarta.inject.Inject;
import jakarta.servlet.ServletContextEvent;
import jakarta.servlet.ServletContextListener;

public class CheckpointUtil implements Resource, ServletContextListener {
    @Inject
    AgroalDataSource dataSource;

    public void contextInitialized(ServletContextEvent event) {
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // dataSource.close();
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
    }
}

/src/main/resources/META-INF/web.xml

<listener>
    <listener-class>com.unicorn.store.data.CheckpointUtil</listener-class>
</listener>

docker run -p 8080:8080 \
  -e QUARKUS_DATASOURCE_JDBC_URL=$QUARKUS_DATASOURCE_JDBC_URL \
  -e QUARKUS_DATASOURCE_PASSWORD=$QUARKUS_DATASOURCE_PASSWORD \
  -e QUARKUS_DATASOURCE_USERNAME=postgres \
  app:latest

Output of uname -a or ver

6.1.119-129.201.amzn2023.x86_64

Output of java -version

openjdk version "21.0.5" 2024-10-15 LTS OpenJDK Runtime Environment Corretto-21.0.5.11.1 (build 21.0.5+11-LTS) OpenJDK 64-Bit Server VM Corretto-21.0.5.11.1 (build 21.0.5+11-LTS, mixed mode, sharing)

Quarkus version or git rev

<quarkus.platform.version>3.17.7</quarkus.platform.version>

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.9

Additional information

No response

@ybezsonov ybezsonov added the kind/bug Something isn't working label Jan 25, 2025
@geoand geoand added area/jdbc Issues related to the JDBC extensions and removed triage/needs-triage labels Jan 27, 2025

quarkus-bot bot commented Jan 27, 2025

/cc @barreiro (jdbc)

Author

ybezsonov commented Jan 27, 2025

I created a repository to reproduce.

git clone https://github.com/ybezsonov/unicorn-store-jakarta

  • to build a container image without CRaC and run
    cd unicorn-store-jakarta
    ./scripts/quarkus-build.sh
    ./scripts/quarkus-run.sh

App runs on http://localhost:8080

  • to build CRaC image

  • start postgres container in one terminal
    cd unicorn-store-jakarta
    ./scripts/postgres-run.sh

  • in another console start the application build with Dockerfile-crac
    cd unicorn-store-jakarta
    ./scripts/quarkus-crac-build.sh

#15 6.134 An exception during a checkpoint operation:
#15 6.134 jdk.internal.crac.mirror.CheckpointException
#15 6.134 Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenSocketException: Socket[addr=host.docker.internal/192.168.65.254,port=5432,localport=38422]
#15 6.134 at java.base/jdk.internal.crac.JDKSocketResourceBase.lambda$beforeCheckpoint$0(JDKSocketResourceBase.java:68)
#15 6.134 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:170)
#15 6.134 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
#15 6.134 at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
#15 6.134 Caused by: java.lang.Exception: This file descriptor was created by agroal-11 at epoch:1737971395075 here
#15 6.134 at java.base/jdk.internal.crac.JDKFdResource.<init>(JDKFdResource.java:60)
#15 6.135 at java.base/jdk.internal.crac.JDKSocketResourceBase.<init>(JDKSocketResourceBase.java:44)
#15 6.135 at java.base/jdk.internal.crac.JDKSocketResource.<init>(JDKSocketResource.java:38)

@yrodiere
Member

Hello,

> I want to use CRaC with Quarkus

To be clear, that is not a supported/tested scenario as far as I'm aware. Requalifying this as a question.

> Is there a bug with handling database connection during CRaC Checkpoint and Restore? Could you please tell me if I need to do something else to handle database connection?

datasource.close() is incorrect, as that's a very definitive operation.

I think you want to drain the connection pool instead. That would be datasource.flush(FlushMode.ALL). You may have to set a minimum size of 0 for your pool in order for this to work as expected (I'm not entirely sure).

> Is there a way to restore connection afterRestore?

Assuming you drain the connection pool before the checkpoint, I don't think you need to. Agroal will automatically create new connections upon connection request if the pool is empty.

Though I assume CRaC may come with other constraints, so there may be other things to do before the checkpoint than draining the connection pool.
I think @galderz experimented with it, he might have some information?
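
To illustrate why closing and draining behave differently, here is a minimal, self-contained sketch using a toy pool. It is not Agroal's implementation: `ToyPool`, `ToyConnection`, and all their methods are hypothetical stand-ins. In the real application the corresponding calls would be `dataSource.close()` and `dataSource.flush(FlushMode.ALL)`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ToyPoolDemo {
    /** Hypothetical stand-in for a physical database connection. */
    static final class ToyConnection {
        boolean open = true;
    }

    /** Hypothetical pool: close() is terminal, drain() only drops idle connections. */
    static final class ToyPool {
        private final Deque<ToyConnection> idle = new ArrayDeque<>();
        private boolean closed;

        /** Reuses an idle connection, or creates a new one if the pool is empty. */
        ToyConnection acquire() {
            if (closed) {
                throw new IllegalStateException("This pool is closed");
            }
            ToyConnection c = idle.poll();
            return c != null ? c : new ToyConnection();
        }

        /** Returns a connection to the pool unless the pool or connection is closed. */
        void release(ToyConnection c) {
            if (!closed && c.open) {
                idle.push(c);
            }
        }

        /** Closes every idle connection but leaves the pool usable. */
        void drain() {
            for (ToyConnection c : idle) {
                c.open = false;
            }
            idle.clear();
        }

        /** Drains the pool and refuses any further acquire() calls. */
        void close() {
            drain();
            closed = true;
        }

        int idleCount() {
            return idle.size();
        }
    }

    public static void main(String[] args) {
        ToyPool drained = new ToyPool();
        drained.release(drained.acquire());
        drained.drain();
        // A drained pool creates a fresh connection on demand.
        System.out.println("after drain, acquire works: " + drained.acquire().open);

        ToyPool closedPool = new ToyPool();
        closedPool.close();
        try {
            closedPool.acquire();
        } catch (IllegalStateException e) {
            // A closed pool never hands out connections again.
            System.out.println("after close: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only the lifecycle difference: a drained pool holds no sockets but can still serve requests later, while a closed pool cannot.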

@yrodiere yrodiere added kind/question Further information is requested and removed kind/bug Something isn't working labels Jan 30, 2025
@ybezsonov
Author

Hello,
Thank you! I tested with dataSource.flush(FlushMode.ALL) and min-size=0, but it didn't help.
Do we have a way to open/recreate the connection after dataSource.close()?
That could even be the desired way, because we could create a snapshot in CI/CD and then restore the application with different datasource settings.

@yrodiere
Member

> Do we have a way to open/recreate connection after dataSource.close()?

Not that I know. If you close a datasource, you must re-create it, and the Agroal integration in Quarkus just isn't designed to do that. You would have to completely bypass the whole datasource integration. Which I guess is possible, but not something I'd recommend.

I'd focus on finding out why flushing isn't enough -- i.e. why after removing all connections, CRaC still fails. Surely it's no longer about connections? Perhaps some socket related to some protocol-specific side-channel in the PostgreSQL driver?

> It could even be the desired way, because we can create a snapshot in CI/CD and then restore the application with different datasource settings.

This too would be way beyond the anticipated use cases. Looks like you'll either need:

  • your own custom datasource creation code, not integrated with anything else in Quarkus
  • OR a dedicated CRaC integration in Quarkus (not sure if that's planned, @galderz ?) that basically executes runtime init just after the CRaC restore
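
As a rough idea of what the first option could look like, here is a minimal, self-contained sketch of a holder that closes its resource before a checkpoint and rebuilds it from freshly read settings after a restore. Everything in it is an assumption for illustration: `CheckpointAware` only mimics the shape of the two `org.crac.Resource` callbacks, and `Recreatable` and `FakeDataSource` are hypothetical names, not Quarkus or Agroal APIs.

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class RecreateOnRestore {
    /** Hypothetical stand-in with the same shape as the two org.crac.Resource callbacks. */
    interface CheckpointAware {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** Holds a closeable resource built from a setting that is re-read after restore. */
    static final class Recreatable<T extends AutoCloseable> implements CheckpointAware {
        private final Supplier<String> setting;
        private final Function<String, T> factory;
        private T current;

        Recreatable(Supplier<String> setting, Function<String, T> factory) {
            this.setting = setting;
            this.factory = factory;
            this.current = factory.apply(setting.get());
        }

        synchronized T get() {
            if (current == null) {
                throw new IllegalStateException("resource is closed for checkpoint");
            }
            return current;
        }

        @Override
        public synchronized void beforeCheckpoint() throws Exception {
            // Release the resource (and its sockets) so the checkpoint can proceed.
            if (current != null) {
                current.close();
                current = null;
            }
        }

        @Override
        public synchronized void afterRestore() {
            // Re-read the setting: it may differ from the value seen at build time.
            current = factory.apply(setting.get());
        }
    }

    /** Hypothetical resource that only records the URL it was built with. */
    static final class FakeDataSource implements AutoCloseable {
        final String url;
        boolean closed;

        FakeDataSource(String url) {
            this.url = url;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) throws Exception {
        // The array simulates an environment variable that changes between CI and runtime.
        String[] url = {"jdbc:postgresql://build-host/db"};
        Recreatable<FakeDataSource> ds = new Recreatable<>(() -> url[0], FakeDataSource::new);

        FakeDataSource before = ds.get();
        ds.beforeCheckpoint();
        url[0] = "jdbc:postgresql://prod-host/db";
        ds.afterRestore();

        System.out.println("old closed: " + before.closed + ", new url: " + ds.get().url);
    }
}
```

In a real application the supplier might read something like `System.getenv("QUARKUS_DATASOURCE_JDBC_URL")`, and every consumer would have to go through the holder instead of an injected datasource, which is why this bypasses the Quarkus integration entirely.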

@ybezsonov
Author

Yes, a dedicated CRaC integration with the Quarkus lifecycle would be preferable. Managing our own datasource is too much for such a case.

With flush I get this error:
#12 5.028 jdk.internal.crac.mirror.CheckpointException
#12 5.028 Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: /opt/quarkus-app/lib/main/io.agroal.agroal-pool-2.5.jar
#12 5.028 at java.base/jdk.internal.crac.JDKFileResource.lambda$beforeCheckpoint$1(JDKFileResource.java:118)
#12 5.028 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:170)
#12 5.028 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
#12 5.028 at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
#12 5.028 Caused by: java.lang.Exception: This file descriptor was created by agroal-11 at epoch:1738234558706 here
#12 5.028 at java.base/jdk.internal.crac.JDKFdResource.<init>(JDKFdResource.java:60)
#12 5.028 at java.base/jdk.internal.crac.JDKFileResource.<init>(JDKFileResource.java:44)
#12 5.028 at java.base/java.io.RandomAccessFile$1.<init>(RandomAccessFile.java:96)
#12 5.028 at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:96)
#12 5.028 at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:257)
#12 5.029 at java.base/java.util.zip.ZipFile$Source.<init>(ZipFile.java:1509)
#12 5.029 at java.base/java.util.zip.ZipFile$Source.get(ZipFile.java:1475)
#12 5.029 at java.base/java.util.zip.ZipFile$CleanableResource.<init>(ZipFile.java:726)
#12 5.029 at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:253)
#12 5.029 at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:182)
#12 5.029 at java.base/java.util.jar.JarFile.<init>(JarFile.java:345)
#12 5.029 at io.smallrye.common.io.jar.JarFiles.create(JarFiles.java:33)
#12 5.029 at io.quarkus.bootstrap.runner.JarFileReference.asyncLoadAcquiredJarFile(JarFileReference.java:207)
#12 5.029 at io.quarkus.bootstrap.runner.JarFileReference.withJarFile(JarFileReference.java:147)
#12 5.029 at io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:56)
#12 5.029 at io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:106)
#12 5.029 at io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:72)
#12 5.029 at io.agroal.pool.ConnectionPool$FlushTask.afterFlush(ConnectionPool.java:694)
#12 5.029 at io.agroal.pool.ConnectionPool$FlushTask.run(ConnectionPool.java:635)
#12 5.029 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
#12 5.029 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
#12 5.029 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
#12 5.029 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
#12 5.029 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
#12 5.030 at java.base/java.lang.Thread.run(Thread.java:1583)

Member

yrodiere commented Jan 30, 2025

This looks related to some async classloading in Quarkus that happens to keep references to JAR files -- presumably because the class is lazily-loaded?

You're looking at a general problem in Quarkus here, it really doesn't look related to datasources or Agroal.

So, yeah, we need to talk to Galder, whom I already pinged twice so I'll refrain from doing it again :)

@yrodiere yrodiere removed the area/jdbc Issues related to the JDBC extensions label Jan 30, 2025
Member

galderz commented Feb 4, 2025

AFAIK there's no support for CRaC in Quarkus to do save/restore. The only reason CRaC is currently in Quarkus is for AWS Lambda SnapStart functionality.

@ybezsonov
Author

Thank you for the explanation. I think I misunderstood the documentation and thought that other CRaC use cases are supported.
