
API: Extremely Poor Docker Resource Utilization Efficiency #2730

Open
palisadoes opened this issue Dec 2, 2024 · 81 comments
Labels: bug (Something isn't working) · good first issue (Good for newcomers)

@palisadoes
Contributor

Describe the bug

We run a demonstration instance of Talawa-API on a GoDaddy VPS server running Ubuntu. It has the following resources:

  1. 1 core
  2. 2 GB of RAM
  3. 40 GB of disk

Other information:

  1. The demo instance is intended to create an evaluation environment for new GitHub contributors and users alike as they decide to use Talawa. The DB of the demo instance gets reset every day.
  2. Talawa API runs natively on this VPS server with acceptable performance with one user. The load average is approximately 1, which is the target value for a system with only 1 core.
  3. When Talawa API runs on the server using Docker, the load average reaches 130 and the swap process is the top CPU consumer. The system is so overloaded that only one SSH session at a time is achievable.

The purpose of this issue is to find ways to tune all Talawa-API Dockerfile and app configurations to lower its CPU and RAM utilization by at least 75%.

  1. With the current Docker performance, very few developers or end users will want to try Talawa themselves.
  2. This has been a recurring issue with Talawa API. The poor performance threatens the success of our current MongoDB-based MVP.

To Reproduce
Steps to reproduce the behavior:

  1. Run Talawa-API on a system
  2. See excessive resource utilization (the commands sketched below are one way to observe it)
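
For anyone reproducing this, a quick way to watch the utilization while the API runs (a sketch; the container name talawa-api is an assumption, adjust it to your compose service name):

```bash
# Per-container CPU and memory usage (container name is an assumption):
docker stats talawa-api

# Host-side load average and swap activity:
uptime
vmstat 5
```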

Expected behavior

  1. Acceptable usage information such that it can run easily on a mid-range laptop without impacting its performance

Actual behavior

  1. Poor performance

Screenshots

image

Additional details
Add any other context or screenshots about the feature request here.

Potential internship candidates

Please read this if you are planning to apply for a Palisadoes Foundation internship

@palisadoes palisadoes added the bug Something isn't working label Dec 2, 2024
@github-actions github-actions bot added feature request unapproved Unapproved for Pull Request labels Dec 2, 2024
@prayanshchh
Contributor

Can you please assign me? I want to work on this issue, but I will need guidance.

@varshith257
Member

This is mostly related to reducing the Docker image size.

@varshith257 varshith257 removed the unapproved Unapproved for Pull Request label Dec 3, 2024
@palisadoes palisadoes changed the title Extremely Poor Docker Resource Utilization Efficiency API: Extremely Poor Docker Resource Utilization Efficiency Dec 4, 2024
@palisadoes palisadoes added good first issue Good for newcomers and removed feature request labels Dec 4, 2024
@prayanshchh
Contributor

prayanshchh commented Dec 6, 2024

Different ways to approach this issue:

1. Multi-Stage Builds
Using a multi-stage build can help separate the build and runtime environments, ensuring that only production-ready artifacts are included in the final image. This can be achieved by:

- Installing dependencies and building the application in the first stage.
- Copying only the necessary files (e.g., dist, node_modules) into a minimal runtime stage.
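
A minimal sketch of such a multi-stage Dockerfile (the stage layout, the dist/index.js entry point, and the npm run build script are assumptions, not the project's actual setup):

```dockerfile
# Stage 1: install all dependencies and build the app
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build   # assumed build script

# Stage 2: minimal runtime with production artifacts only
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]   # assumed entry point
```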

2. Optimizing Base Images
Switching to optimized base images can dramatically reduce size:

Baseline Image (Full Node.js): ~900 MB
Using Multi-Stage with Slim: ~400–500 MB
Using Multi-Stage with Alpine: ~250–300 MB
With Distroless: ~150–200 MB

3. Using Compression Tools
Tools like docker-slim can further compress the final image by analyzing and stripping unused dependencies and files:
With docker-slim: ~100–150 MB.
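
As a rough illustration, docker-slim is pointed at an existing image and emits a minified copy tagged &lt;image&gt;.slim (the image tag here is an assumption; newer releases rename the binary to slim):

```bash
# Analyze the image and strip files it never touches at runtime.
# --http-probe=false skips the default HTTP probing of the container.
docker-slim build --http-probe=false talawa-api:latest
```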

Please suggest a method that doesn't impact compatibility with the codebase.

@palisadoes
Contributor Author

@prayanshchh

Please investigate the best solution and propose it after testing on your system. It's not just RAM, but also ways to reduce the CPU overhead.

@prayanshchh
Contributor

alright sir

@vasujain275
Contributor

@palisadoes @prayanshchh

The main problem I found with the API is that we have to run it in dev mode in the production Docker environment because our build process for the Talawa API is broken, so we can't use npm run start. If we resolve the build issue, we can drastically improve the performance and security of the Docker container.

I think @varshith257 also tried to solve the build process issue a few months back; any updates on that?

@palisadoes
Contributor Author

Would this PR by @adithyanotfound provide any insights?

@palisadoes
Contributor Author

@vasujain275 Why do you say the build process is broken? Can you create an issue for someone else to try to fix it?

@prayanshchh
Contributor

Would this PR by @adithyanotfound provide any insights?

Yes, this helps. I will start my work on this in two days; I have end-semester exams.

@prayanshchh
Contributor

I am unassigning myself from this issue due to lack of progress.

@prayanshchh prayanshchh removed their assignment Dec 14, 2024
@PurnenduMIshra129th

@palisadoes Please assign me.

@PurnenduMIshra129th

@palisadoes What is the load average when the API runs without Docker, i.e., what is the baseline performance? I need this because I want to focus only on improving the Docker performance. If that number isn't available, I will have to use a profiler to measure what the exact issue is: whether it is related to the Docker container, or to unoptimized queries in the code.

@PurnenduMIshra129th

PurnenduMIshra129th commented Dec 17, 2024

@palisadoes For now I have limited the CPU and memory usage, added a multi-stage build, and used a lightweight base image. But I think this will only handle up to a certain number of users. To handle load effectively, can I use Kubernetes or another service, so it scales the pods as load increases, reducing CPU usage and improving performance? If not, does the VPS where the container is hosted provide such a mechanism? One more doubt: how do I put more load on this API, given that at testing time I am the only user?
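
For reference, a sketch of the kind of limits described above in a compose file (the service name and values are assumptions; depending on your Compose version these keys may instead live under deploy.resources.limits):

```yaml
services:
  talawa-api:
    # Cap the container so a runaway process can't saturate the single-core host.
    cpus: "1.0"
    mem_limit: 1g
    memswap_limit: 1g   # no extra swap beyond the memory limit
```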

@vasujain275
Contributor

vasujain275 commented Dec 17, 2024

@palisadoes For now I have limited the CPU and memory usage, added a multi-stage build, and used a lightweight base image. But I think this will only handle up to a certain number of users. To handle load effectively, can I use Kubernetes or another service, so it scales the pods as load increases, reducing CPU usage and improving performance? If not, does the VPS where the container is hosted provide such a mechanism?

  1. We don't need k8s.
  2. Multi-stage builds and a lightweight base image will not help; we already have multi-stage builds with Alpine images. The main issue is our build process.
  3. @palisadoes Due to my end-semester exams right now I am not able to create that GraphQL build error issue that is the main performance blocker on this. I will get to it in 2-3 days once my exams end. Sorry for the delay.
  4. I think we should close the Docker-performance-related issues, as they create unnecessary confusion. Our Docker images are well optimized. The main issue is that we are running our API in dev mode in them; once the build is fixed we can modify the Dockerfiles to see the performance improvements.

@PurnenduMIshra129th

PurnenduMIshra129th commented Dec 17, 2024

I don't understand what you mean by a build-related issue. Are you saying that unnecessary node modules or something similar end up in the build when the Docker image is built, and that they are causing the issue? I need further clarity. Also, you commented above that you are not able to run npm run start; it is working fine for me, because the API service starts.

@palisadoes
Contributor Author

@palisadoes For now I have limited the CPU and memory usage, added a multi-stage build, and used a lightweight base image. But I think this will only handle up to a certain number of users. To handle load effectively, can I use Kubernetes or another service, so it scales the pods as load increases, reducing CPU usage and improving performance? If not, does the VPS where the container is hosted provide such a mechanism?

1. We don't need k8s.

2. Multi-stage builds and a lightweight base image will not help; we already have multi-stage builds with Alpine images. The main issue is our build process.

3. @palisadoes Due to my end-semester exams right now I am not able to create that GraphQL build error issue that is the main performance blocker on this. I will get to it in 2-3 days once my exams end. Sorry for the delay.

4. I think we should close the Docker-performance-related issues, as they create unnecessary confusion. Our Docker images are well optimized. The main issue is that we are running our API in dev mode in them; once the build is fixed we can modify the Dockerfiles to see the performance improvements.

OK.

@PurnenduMIshra129th

@palisadoes I ran a load test against the server with and without Docker, configured for a 30-second duration at 2 req/sec, i.e., 60 requests in 30 seconds. In that scenario both setups had an equal success rate. But when I ran the same test for the same duration at a higher rate, 5 req/sec (150 requests in 30 seconds), the server without Docker performed slightly better. Either way, the server can't handle 150 requests in 30 seconds: many requests stay in processing and never complete, and only 40 of them succeeded. For a small user base on a low-end host (say 50 to 60 requests per 60 seconds on a mediocre device with 4 GB of RAM and 4 cores), Docker will handle the load easily if Talawa-API reduces its excessive CPU work; even with a CPU limit it will cope, though with some slowness. What do you say?
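
For repeatable numbers, a load-test configuration along those lines could look like this (a sketch using Artillery; the port 4000 and the /graphql path are assumptions):

```yaml
# load-test.yml — 2 req/sec for 30 s (60 requests), as in the first run above
config:
  target: "http://localhost:4000"
  phases:
    - duration: 30
      arrivalRate: 2   # raise to 5 for the 150-request run
scenarios:
  - flow:
      - post:
          url: "/graphql"
          json:
            query: "{ __typename }"
```

Run it with npx artillery run load-test.yml and compare the success rates with and without Docker.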

@palisadoes
Contributor Author

@PurnenduMIshra129th please coordinate with @vasujain275

There appear to be multiple causes. The application is clearly overusing resources.

Here is additional information.

@PurnenduMIshra129th

@vasujain275 Yes, you are correct: the build process is broken. After building, it does not work properly. Also, when I try to run npm run prod it fails with multiple errors. Do you have any thoughts on this? Should we be using import instead of require?

@palisadoes
Contributor Author

The PR was merged. We now need to:

  1. Deploy the API and Admin instances
  2. Determine the best develop / production strategy for the API

@PurnenduMIshra129th

@palisadoes OK, working on it.

@PurnenduMIshra129th

@palisadoes I have also done this for develop-postgres, so what should I do now? Should I make a PR for the develop branch or for develop-postgres?

@PurnenduMIshra129th

PurnenduMIshra129th commented Dec 27, 2024

@adithyanotfound Why is npm run prod not working? Any suggestions?
image
The file it tries to find is not present.

@adithyanotfound

adithyanotfound commented Dec 27, 2024

@adithyanotfound Why is npm run prod not working? Any suggestions?

@PurnenduMIshra129th Please run npm run generate:ssl-private-key to generate certs before running npm run prod.
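
In sequence, that is:

```bash
npm run generate:ssl-private-key   # creates the self-signed certs that prod mode expects
npm run prod
```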

@palisadoes
Contributor Author

@adithyanotfound Why is npm run prod not working? Any suggestions?

@PurnenduMIshra129th Please run npm run generate:ssl-private-key to generate certs before running npm run prod.

@adithyanotfound

  1. Is this documented? If not, please open a PR to do so.
  2. I'm assuming that this was done in the GitHub Action pull-request.yml file. Was this done?

@adithyanotfound

@adithyanotfound

  1. Is this documented? If not, please open a PR to do so.
  2. I'm assuming that this was done in the GitHub Action pull-request.yml file. Was this done?
  1. I'll update the docs.
  2. Yes, it was done in the GitHub Actions pull-request.yml file.

@palisadoes
Contributor Author

@palisadoes I have also done this for develop-postgres, so what should I do now? Should I make a PR for the develop branch or for develop-postgres?

Please work on this so that it's done correctly.

Take a look at the develop-postgres branch and see if it's appropriate to make any adjustments. You'll need to coordinate with @xoldd as he's working on it behind the scenes.

@PurnenduMIshra129th

@palisadoes ok

@palisadoes
Contributor Author

This will be an interesting issue. Get ready for the migration!

@PurnenduMIshra129th

@palisadoes One doubt: in the develop branch there is no need to maintain two separate files, docker-compose.dev and docker-compose.prod; one is sufficient. Can we delete one, write the logic in a single file, and provide common commands so it all starts? I will make two pull requests: one for develop and one for develop-postgres.
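
One way to fold both environments into a single file is Compose profiles (a sketch; the service names and build targets are assumptions):

```yaml
services:
  api-dev:
    build:
      context: .
      target: development   # assumed dev stage in the Dockerfile
    profiles: ["dev"]

  api-prod:
    build:
      context: .
      target: production    # assumed prod stage
    profiles: ["prod"]
```

docker compose --profile dev up or docker compose --profile prod up would then select the environment from the same file.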

@palisadoes
Contributor Author

  1. Only one docker file in develop.
    1. Update INSTALLATION.md
  2. The develop-postgres is a complete rewrite of the API. There is already a docker file there.
    1. compose.yml
    2. https://github.com/PalisadoesFoundation/talawa-api/tree/develop-postgres
  3. No work needs to be done on develop-postgres

@PurnenduMIshra129th

PurnenduMIshra129th commented Jan 1, 2025

@palisadoes Yes, I want to do the same for the develop branch. If that is not required, then I will change the prod build process and its compose file to add limit restrictions. It will work the same.

@PurnenduMIshra129th

See these screenshots: after removing some unused node modules there is a massive decrease in the prod image size, which will be useful going forward.
image

@palisadoes
Contributor Author

Should we add a GitHub workflow test that searches for unused packages in package.json, if it would improve the long-term health of the code? Your explanation gives the impression we should, to reduce the Docker image size.
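
A sketch of what such a workflow could look like, using depcheck as one possible tool (the tool choice, file name, and Node version are assumptions):

```yaml
# .github/workflows/unused-dependencies.yml
name: Unused dependency check
on: [pull_request]
jobs:
  depcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Report unused packages listed in package.json
        run: npx depcheck   # exits non-zero when unused dependencies are found
```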

@PurnenduMIshra129th

@palisadoes Yes, we can do that. It will be very helpful and effective at reducing the image size.

@palisadoes
Contributor Author

Please add that workflow to your PR

@PurnenduMIshra129th

@palisadoes ok will do.

@PurnenduMIshra129th

@palisadoes The issue I am facing is also an important one, related to .dockerignore. What is happening is that .dockerignore is not actually being honored during the build. As a result, videos, the document section, and all the static images that are not required end up in our Docker images, which increases the image size; that is issue no. 1. The second issue is that after removing the dev dependencies from production the code stops working, which it shouldn't. Can I open issues for both of these? And, as a feature request, should I open an issue to create a workflow for finding unnecessary node modules or packages to remove?
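
Note that .dockerignore must sit in the root of the build context to take effect, which may be the cause here. For reference, a sketch of entries along the lines described (the paths are illustrative assumptions):

```
# .dockerignore — paths are illustrative
node_modules
.git
videos/
images/
docs/
*.log
```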

@PurnenduMIshra129th

@palisadoes Please check the PR and tell me what I have to change.

@varshith257
Member

varshith257 commented Jan 11, 2025

It will be a mess if we don't differentiate between dev and prod environments; that's the reason separate dev and prod Docker files were introduced. I haven't gone through the whole conversation, but I want to keep them restricted to their purposes, so we have two Docker files:

  1. The dev image needs to stay what it is for in the codebase; don't try to reduce its size or swap things out, or it will turn into a rabbit hole for working locally in dev mode. The dev image should also be the standard Docker image, not one of the slim variants, as those will not be helpful here.

  2. For a cloud instance or deployment we use the prod image (currently the dev image has been used, due to build issues in the prod environment, but that is unrelated to optimizing dev). Here you can experiment with whatever optimizations you like: multi-stage builds, alpine/slim images, etc. If we can get the prod environment to build, the image is not going to take more than 200 MB, which is the best so far.

@varshith257
Member

@vasujain275, @SiddheshKukade, @varshith257

  1. Is there any way we can get a non-docker instance of the API and Admin apps running using the develop branch?
  2. We'll also need the API to reset its database every 24 hours as we originally planned
  3. The code will need to be updated and the apps restarted whenever there is an update as originally planned

We want to feature this as part of our GSoC 2025 organization application to help us get selected again. It's really important.

Yes, we have ways to do that. Can we move this discussion to the #maintainers channel?

@vasujain275
Contributor

@varshith257 I am almost done deploying the Docker instance of the API on the VPS; the only thing left to fix is Caddy. Do you have any idea why we are using self-signed certificates for localhost? They are causing problems with Caddy in the Docker setup.

@varshith257
Member

I have mentioned the solution and its root cause clearly in a Slack discussion. I have tagged you there.

@vasujain275
Contributor

@varshith257 Following up on the discussion in Slack.

@PurnenduMIshra129th

@varshith257 What exactly do you want? Can you clarify the points again? If the problem is that you want separate dev and prod images, then I can revert the changes, no problem; my changes will still be there. Just tell me the points I need to work on.
