influxdb output seems to leak memory #1081

Closed
mstoykov opened this issue Jul 12, 2019 · 0 comments

@mstoykov
Contributor

While investigating why a user was running out of memory after running a simple script for a while, I discovered that the influxdb output apparently leaks memory ... :(. I used the simple script below (optimized to generate a lot of metrics ;) ):

import http from "k6/http";
import {sleep} from "k6";

export let options = {
    duration: "10m",
    thresholds: {
        "responseTime": ["p(95)<2000", "p(70)<1500", "avg<1500", "med<1200", "min<500"],
    },
    discardResponseBodies: true,
};

export default function () {
    http.batch([
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
        ["GET", `${__ENV.HOST}`],
    ]);
}

Hitting a simple local Go HTTP server that serves empty files, you can exhaust 16GB of memory in 7 minutes with 200 VUs, while without the influxdb output you can run 2k VUs with somewhere around 7-8GB, including calculating thresholds.
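
For reference, the "simple local Go HTTP server that serves empty files" can be as small as the sketch below; the port and handler are assumptions for illustration, not the exact server used in the test:

// emptyserver.go - answers every request with an empty 200 response,
// just to give k6 something fast and local to hit.
package main

import (
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK) // no body
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

With that running, the script above can be pointed at it with something like k6 run -e HOST=http://localhost:8080 script.js.
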
Below is the difference in in-use objects between the 3rd and the 4th minute:

(pprof) top10
Showing nodes accounting for 7987343, 63.13% of 12653129 total
Dropped 83 nodes (cum <= 63265)
Showing top 10 nodes out of 96
      flat  flat%   sum%        cum   cum%
   1605656 12.69% 12.69%    5633669 44.52%  github.com/loadimpact/k6/vendor/github.com/influxdata/influxdb/client/v2.NewPoint
   1277291 10.09% 22.78%    1419291 11.22%  github.com/loadimpact/k6/vendor/github.com/influxdata/influxdb/models.MakeKey
   1271690 10.05% 32.83%    4072794 32.19%  github.com/loadimpact/k6/vendor/github.com/influxdata/influxdb/models.NewPoint
    813813  6.43% 39.27%     813813  6.43%  strconv.fmtF
    699521  5.53% 44.80%    1668937 13.19%  github.com/loadimpact/k6/lib/netext/httpext.(*transport).measureAndEmitMetrics
    557072  4.40% 49.20%     557072  4.40%  github.com/loadimpact/k6/vendor/github.com/dop251/goja.asciiString.concat
    502463  3.97% 53.17%    1316276 10.40%  github.com/loadimpact/k6/vendor/github.com/influxdata/influxdb/models.appendField
    475148  3.76% 56.92%     475148  3.76%  strings.(*Builder).WriteString
    418758  3.31% 60.23%     418758  3.31%  github.com/loadimpact/k6/stats/influxdb.(*Collector).extractTagsToValues
    365931  2.89% 63.13%    6713318 53.06%  github.com/loadimpact/k6/stats/influxdb.(*Collector).batchFromSamples
(pprof)
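
How the diff above was produced isn't shown in the issue; one common way to get such a comparison is to dump heap profiles at two points in time and diff them with go tool pprof -base. Below is a rough sketch, assuming the profiles are written from inside the process being investigated (this is not how the numbers above were actually collected):

// heapsnapshots.go - write a heap profile once a minute so that consecutive
// snapshots can be diffed, e.g.:
//   go tool pprof -inuse_objects -base heap-3.pb.gz ./k6 heap-4.pb.gz
package profiling

import (
    "fmt"
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

// DumpHeapProfiles is meant to be started with `go DumpHeapProfiles()` from
// the process under investigation.
func DumpHeapProfiles() {
    for i := 1; ; i++ {
        time.Sleep(time.Minute)
        runtime.GC() // make the in-use numbers current before dumping
        f, err := os.Create(fmt.Sprintf("heap-%d.pb.gz", i))
        if err != nil {
            continue
        }
        _ = pprof.WriteHeapProfile(f)
        f.Close()
    }
}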

Clearly some of those influxdb objects are the problem. I am under the impression that maybe the library shouldn't be used like that, but just in case (and because it will be way quicker) I would like to update the dependency and test again. Unfortunately, due to some influxdb refactoring (and probably the new version 2), the client has moved to a different repo ;( : https://github.com/influxdata/influxdb1-client . So maybe if we had been using modules this would've been easier :(

@na-- changed the title from "influxdb seems to leak memory" to "influxdb output seems to leak memory" on Jul 16, 2019
mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to influxdb every second, but if that write took more than 1 second it wouldn't start a second write and would instead wait for the first one to finish. This generally leads to the write times getting bigger and bigger as more and more data accumulates, until the maximum body size that influxdb will accept is reached, at which point it returns an error and k6 drops that data.

With this commit there is a configurable number of parallel writes (10 by default) that are triggered every 1 second (also now configurable). If those get exhausted, the samples are queued each second instead of being combined and then written as one big chunk, which has a chance of hitting the max body size.

I tested with a simple script doing batch requests for an empty local
file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB of
memory usage.

Previous to this commit, RAM usage was ~5.7GB at 5736 RPS, and practically
all the data gets lost unless you raise the max body size; even then a lot
of the data is lost while the memory usage keeps going up.

After this commit, RAM usage was ~2.4GB (or less in some of the tests) at
6273 RPS, and there was no loss of data.

Even with this commit, running that simple script for 2 hours dies after
1 hour and 35 minutes, using around 15GB (the test system has 16). I can't
be sure about loss of data, as influxdb ate 32GB of memory trying to
visualize it.

Some minor problems with this solution are:
1. We use a lot of goroutines if things start slowing down - probably
not a big problem.
2. We could probably batch things better if we keep all the unsent
samples together.
3. By far the biggest: because the writes are slow, if the test is
stopped (with Ctrl+C) or finishes naturally, waiting for those writes
can take a considerable amount of time - in the above example the
4-minute tests generally took around 5 minutes :(

All of those can be better handled with some more sophisticated queueing
code at a later time.

closes #1081, fixes #1100, fixes #182
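
A rough sketch of the write loop that commit message describes - a ticker fires every second, up to a configurable number of writes run concurrently, and when that budget is exhausted new batches are queued instead of being merged into one ever-growing request. Names, types and structure below are assumptions for the example, not the actual k6 influxdb collector code:

// Illustrative sketch of the "parallel writes + queueing" scheme.
package sketch

import "time"

// Sample stands in for one metric sample (name, tags, value, timestamp).
type Sample struct{}

// writeLoop ticks every pushInterval, drains the samples that arrived since
// the last tick into a batch, and hands batches to at most
// maxConcurrentWrites goroutines. When all write slots are busy, batches
// stay queued instead of being combined into one huge request that could
// hit influxdb's maximum body size.
func writeLoop(samples <-chan []Sample, writeBatch func([]Sample) error) {
    const maxConcurrentWrites = 10       // 10 by default, configurable per the commit
    const pushInterval = 1 * time.Second // also configurable per the commit

    sem := make(chan struct{}, maxConcurrentWrites) // write slots
    var queue [][]Sample                            // batches waiting for a free slot

    ticker := time.NewTicker(pushInterval)
    defer ticker.Stop()

    for range ticker.C {
        // collect everything that arrived since the last tick into one batch
        var batch []Sample
    drain:
        for {
            select {
            case ss, ok := <-samples:
                if !ok {
                    return // producer done; final flush omitted in this sketch
                }
                batch = append(batch, ss...)
            default:
                break drain
            }
        }
        if len(batch) > 0 {
            queue = append(queue, batch)
        }

        // start as many queued writes as the concurrency budget allows
    dispatch:
        for len(queue) > 0 {
            select {
            case sem <- struct{}{}: // grab a write slot
                b := queue[0]
                queue = queue[1:]
                go func(b []Sample) {
                    defer func() { <-sem }() // release the slot
                    _ = writeBatch(b)        // error handling omitted in this sketch
                }(b)
            default:
                break dispatch // all slots busy - keep the rest for the next tick
            }
        }
    }
}
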
mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to influxdb every second, but if that write took more than 1 second it wouldn't start a second write and would instead wait for the first one to finish. This generally leads to the write times getting bigger and bigger as more and more data accumulates, until the maximum body size that influxdb will accept is reached, at which point it returns an error and k6 drops that data.

With this commit there is a configurable number of parallel writes
(10 by default), starting again every 1 second (also now configurable).
Additionally, if we reach the 10 concurrent writes, instead of sending all
the data that accumulates we just queue the samples that were
generated. This should help considerably with not hitting the max body
size of influxdb.

I tested with a simple script, doing batch requests for an empty local
file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB of
memory usage.

Previous to this commit, RAM usage was ~5.7GB at 5736 RPS, and practically
all the data gets lost unless you raise the max body size; even then a lot
of the data is lost while the memory usage keeps going up.

After this commit, RAM usage was ~2.4GB (or less in some of the tests) at
6273 RPS, and there was no loss of data.

Even with this commit, running that simple script for 2 hours dies after
1 hour and 35 minutes, using around 15GB (the test system has 16). I can't
be sure about loss of data, as influxdb ate 32GB of memory trying to
visualize it and I had to kill it ;(.

Some problems with this solution are:
1. We use a lot of goroutines if things start slowing down - probably
not a big problem, but still a good idea to fix.
2. We could probably batch things better if we keep all the unsent
samples together and cut them into, let's say, 50k-sample chunks.
3. By far the biggest: because the writes are slow, if the test is
stopped (with Ctrl+C) or finishes naturally, waiting for those writes
can take a considerable amount of time - in the above example the
4-minute tests generally took around 5 minutes :(

All of those can be better handled with some more sophisticated queueing
code at a later time.

closes #1081, fixes #1100, fixes #182
@na-- added this to the v0.26.0 milestone on Aug 27, 2019
mstoykov added a commit that referenced this issue Aug 29, 2019