
Fatal error in Airnode feed #207

Closed
metobom opened this issue Jan 31, 2024 · 25 comments
Labels: bug (Something isn't working), wontfix (This will not be worked on)

metobom (Member) commented Jan 31, 2024

Nodary's Airnode feed process failed with this error. Because we read the config from a raw GitHub URL and move the deployment file from candidate-deployments to active-deployments after the deployment, CF tried to redeploy the app but wget kept throwing a 404 error.

As a solution, I will update the CF EntryPoint to try the candidate-deployments path first and, if that fails, fall back to active-deployments.
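A minimal sketch of what that EntryPoint fallback could look like, assuming the config is fetched with wget from raw GitHub URLs (the <ORG>/<REPO> path, file name, target location, and start command below are placeholders, not the real ones):

#!/bin/sh
# Sketch only: <ORG>, <REPO>, the file name, and the paths are placeholders.
BASE="https://raw.githubusercontent.com/<ORG>/<REPO>/main"
CONFIG_FILE="airnode-feed.json"
TARGET="/app/config/airnode-feed.json"

# Try the candidate-deployments path first, then fall back to active-deployments.
wget -q -O "$TARGET" "$BASE/candidate-deployments/$CONFIG_FILE" ||
  wget -q -O "$TARGET" "$BASE/active-deployments/$CONFIG_FILE" ||
  { echo "Failed to fetch config from both locations" >&2; exit 1; }

exec node dist/index.js  # placeholder start command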

@Siegrift for visibility.

Siegrift (Collaborator) commented Jan 31, 2024

The error message doesn't tell much :/

I couldn't find related issues for "Check failed: is_clonable_js_type || is_clonable_wasm_type". Do we have access to the Nodary Airnode feed to check some metrics? Maybe it was a CPU/memory spike or a memory leak.

Regarding the GitHub URLs, we need to make sure that the production ones are immutable.

(I've assigned both of us on this issue for now)

metobom (Member, Author) commented Jan 31, 2024

> Do we have access to the Nodary Airnode feed to check some metrics?

Unfortunately, we don't have any metrics.

bdrhn9 (Contributor) commented Jan 31, 2024

> Do we have access to the Nodary Airnode feed to check some metrics?

For debugging purposes, I activated ECS CloudWatch Container Insights to collect metrics from the containers. If we experience the issue again, it will be helpful.

It incurs extra charges, so it shouldn't be enabled by default. But it's easy to enable in the CF template:

"AppCluster": {
      "Type": "AWS::ECS::Cluster",
      "Properties": {
        "ClusterName": "AirnodeFeedCluster-<SOME_ID>",
+       "ClusterSettings": [
+         {
+           "Name": "containerInsights",
+           "Value": "enabled"
+         }
+       ]
      }
    }
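As a side note, for a cluster that is already deployed, the same setting can also be toggled with the AWS CLI without redeploying the stack (the cluster name is the placeholder from the template above); changing it outside CloudFormation will show up as stack drift, though:

aws ecs update-cluster-settings \
  --cluster AirnodeFeedCluster-<SOME_ID> \
  --settings name=containerInsights,value=enabled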

Siegrift (Collaborator) commented:

We should be able to see at least CPU and memory usage, but when I was stress testing the container it was able to handle the load so I'd be surprised if it was caused by this. Hopefully, we will be able to reproduce it again with more insights.

metobom (Member, Author) commented Feb 1, 2024

The same error occurred in TwelveData's deployment too.

Siegrift added the "bug" label on Feb 2, 2024
Siegrift (Collaborator) commented Feb 2, 2024

Thanks. For reference, this is the error in Grafana.

The service seems operational again after the AWS restart, so I suspect there is some memory leak. I will try to reproduce it and fix it.
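One cheap way to confirm or rule out a leak before the next crash would be a periodic memory heartbeat in the feed's logs, e.g. something like the sketch below (not something that exists in the codebase today):

// Sketch: log process memory once a minute; a leak would show up in Grafana as
// steadily growing heapUsed/rss long before the process dies.
const MB = 1024 * 1024;

setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  console.info(
    `memory rss=${Math.round(rss / MB)}MB heapUsed=${Math.round(heapUsed / MB)}MB ` +
      `heapTotal=${Math.round(heapTotal / MB)}MB external=${Math.round(external / MB)}MB`
  );
}, 60_000).unref(); // unref() so the timer does not keep the process alive during shutdown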

Siegrift (Collaborator) commented Feb 2, 2024

Btw, it seems that the message in Grafana is trimmed. E.g. the error message pasted in this issue contains more information and suggests a race condition inside Node.js.

bdrhn9 (Contributor) commented Feb 2, 2024

I was mistaken; the metrics are still in place. Here is a snapshot of them. I expected to see a gradual increase in memory usage, but it seems that's not the case.

[Screenshot from 2024-02-02 14-55-16: container metrics snapshot]

Siegrift (Collaborator) commented Feb 3, 2024

Happened again with Finage.

Siegrift (Collaborator) commented Feb 3, 2024

I created an issue on the Node.js repo (nodejs/node#51652) and hope someone responds.

An idea would be to try migrating to a different Node.js image (or version). In particular, there are some mentions of using the Slim image instead of Alpine.

Siegrift added the "on hold" label on Feb 3, 2024
metobom (Member, Author) commented Feb 4, 2024

It happened to coinpaprika too. One possibly useful piece of information: it happens with the Airnode feeds that include more data feeds.

bbenligiray (Member) commented:

I wonder if this will happen with a configuration that excludes the Grafana log shipping stuff

aquarat (Contributor) commented Feb 5, 2024

> An idea would be to try migrating to a different Node.js image (or version). In particular, there are some mentions of using the Slim image instead of Alpine.

A good idea. AFAIK they use different C libraries (Alpine uses musl, Slim uses glibc) 👌 This definitely looks like a runtime issue.

aquarat (Contributor) commented Feb 6, 2024

I'm currently trying to recreate this issue by simulating lots of feeds.

vponline commented Feb 6, 2024

As already mentioned, this doesn't seem to be related to memory. I managed to limit the RAM for a local airnode-feed and make it crash due to running out of memory, and the error looks different:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

You can reproduce this by using this script in package.json:

"dev": "NODE_OPTIONS=--max-old-space-size=100 nodemon --ext ts,js,json,env  --exec \"pnpm ts-node src/index.ts\"",

aquarat (Contributor) commented Feb 6, 2024

I've been running it locally (docker images built from main on amd64, both intervals set to 0) with 15 000 feeds and so far it's been fine. I had to increase the memory (tentatively gave it 2048m). I'll let it run for a few hours and see what happens.

With so many feeds I've noticed there's a bottleneck when running post-processing, so it could be useful to move that logic into a worker thread in the future.
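For reference, a rough sketch of what moving that step to a worker could look like with node:worker_threads; postProcess and the number[] payload are made-up stand-ins for the feed's real post-processing logic:

// --- post-processing.worker.ts (sketch) ---
import { parentPort, workerData } from 'node:worker_threads';

// Hypothetical stand-in for the real post-processing step.
const postProcess = (values: number[]) => values.map((value) => value * 2);

parentPort?.postMessage(postProcess(workerData as number[]));

// --- caller on the main thread (sketch) ---
import path from 'node:path';
import { Worker } from 'node:worker_threads';

export const postProcessInWorker = (values: number[]) =>
  new Promise<number[]>((resolve, reject) => {
    const worker = new Worker(path.resolve(__dirname, 'post-processing.worker.js'), {
      workerData: values,
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
    });
  });

Whether this is worth it depends on how heavy the post-processing actually is; spawning a worker per batch adds its own overhead, so a small worker pool might be the better shape.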

aquarat (Contributor) commented Feb 6, 2024

So it's been running locally now with 15 000 feeds for about 4 hours and it hasn't died, so it may be something specific to AWS or to the RAM allocation (which affects processor resources). I'll try with reduced RAM.

aquarat (Contributor) commented Feb 7, 2024

It ran overnight with 15000 feeds and a reduced-speed CPU to try and simulate resource constraints. It still didn't crash, so I'm thinking even more that this may be specific to AWS.

I'm now running it with fuzzed responses from the data-provider API: randomly every 3rd API response is corrupted and every 2nd response is delayed by 0 to 3000 ms. I'll let it run like this for a few hours.
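For anyone who wants to reproduce the same setup, the fuzzing can be approximated with a small proxy in front of the data-provider API. The upstream URL and port below are placeholders; only the every-3rd-corrupted / every-2nd-delayed ratios come from the description above:

// Sketch of a fuzzing proxy: forwards requests to the real data-provider API,
// corrupting every 3rd response and delaying every 2nd one by 0-3000 ms.
import http from 'node:http';

const UPSTREAM = process.env.UPSTREAM_URL ?? 'http://localhost:8080'; // placeholder
let counter = 0;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

http
  .createServer(async (req, res) => {
    counter += 1;
    try {
      const upstream = await fetch(new URL(req.url ?? '/', UPSTREAM)); // global fetch, Node 18+
      let body = await upstream.text();

      if (counter % 3 === 0) body = body.slice(0, Math.floor(body.length / 2)); // corrupt
      if (counter % 2 === 0) await sleep(Math.random() * 3000); // delay 0-3000 ms

      res.writeHead(upstream.status, { 'content-type': 'application/json' });
      res.end(body);
    } catch {
      res.writeHead(502).end();
    }
  })
  .listen(9090); // point the Airnode feed's API base URL at this port instead of the provider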

aquarat (Contributor) commented Feb 9, 2024

It's been running for three days, 15k feeds, some fuzzing and it's still running, no crashes, so... this is a hard bug to trace 😆

metobom (Member, Author) commented Feb 10, 2024

It happened again in TwelveData's Airnode feed.

aquarat (Contributor) commented Feb 12, 2024

My local instance eventually crashed because it ran out of log space (400 GB), so I haven't been able to recreate this locally. Upgrading to Node 20 may help.

Siegrift (Collaborator) commented Mar 7, 2024

Let's close this one, otherwise it's going to remain on the board forever. The service restarts after crashing, so we are not much affected by this as of now.

We've tried using a different Node image (it didn't help) and upgraded the Node version (not yet confirmed whether that helps).

Siegrift closed this as completed on Mar 7, 2024
Siegrift added the "wontfix" label and removed the "on hold" label on Mar 7, 2024
aquarat (Contributor) commented Mar 15, 2024

Has this happened again with the updated Node version? Just curious.

bbenligiray (Member) commented:

I think it did

metobom (Member, Author) commented Mar 21, 2024

API providers' current deployments are on 0.5.1, and Node 20 is used in 0.6.0.
