
[BUG] Memory leak #341

Closed

urielginsburg opened this issue Oct 20, 2023 · 2 comments · Fixed by #371
Labels: bug (Something isn't working)

Comments


urielginsburg commented Oct 20, 2023

ECS integration/library project(s) (e.g. Elastic.CommonSchema.Serilog):

ECS schema version (e.g. 1.4.0): 8.6.1

ECS .NET assembly version (e.g. 1.4.2): 8.6.1

Elasticsearch version (if applicable): 8.6.2

.NET framework / OS: .NET 7 / Any OS

During a soak test on an AKS cluster, during which we had verbose logging enabled, our services reached staggering memory sizes. I took a dump with dotnet-dump, and it turns out there are huge numbers of EcsDocument objects that remain in generation 2 even after they have been sent to Elasticsearch (I verified that several of the document instances actually exist in Elasticsearch).

Steps to reproduce:
As far as I can tell, simply logging a large number of log lines reproduces this behavior. I created a sample program to demonstrate this:

using Elastic.Serilog.Sinks;
using Serilog;

var memoryBefore = GC.GetTotalMemory(false);
Console.WriteLine($"Memory before: {memoryBefore} bytes");

Serilog.Debugging.SelfLog.Enable(s => Console.WriteLine(s));


Log.Logger = new LoggerConfiguration()
    .WriteTo.Elasticsearch(new[] { new Uri("http://localhost:9200") }, o =>
    {
        o.BootstrapMethod = Elastic.Ingest.Elasticsearch.BootstrapMethod.Failure;
        o.MinimumLevel = Serilog.Events.LogEventLevel.Verbose;
        o.DataStream = new Elastic.Ingest.Elasticsearch.DataStreams.DataStreamName("logs");
    }) 
    .CreateLogger();


var random = new Random();
// Produce 10,000 log events in parallel to push a burst of documents through the sink.
Parallel.For(0, 10000, i =>
{
    var randomData = $"Logging information {i} - Random value: {random.NextDouble()}";
    Log.Information(randomData);
});

Log.CloseAndFlush();

await Task.Delay(TimeSpan.FromMinutes(1));

// Force garbage collection
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

var memoryAfter = GC.GetTotalMemory(false);
Console.WriteLine($"Memory after: {memoryAfter} bytes");

var memoryUsed = memoryAfter - memoryBefore;
Console.WriteLine($"Memory used: {memoryUsed} bytes");

Here is the output of 'dotnet-dump analyze' for one of the instances:

> gcroot 0x7F09547ABCD8
HandleTable:
    00007f4904e113a8 (strong handle)
          -> 7f0904c04428     System.Object[]
          -> 7f092d05caf8     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>
          -> 7f092d05cb60     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+PerCoreLockedStacks[]
          -> 7f092712fea0     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+PerCoreLockedStacks
          -> 7f092712feb8     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+LockedStack[]
          -> 7f0927130298     System.Buffers.TlsOverPerCoreLockedStacksArrayPool<Elastic.CommonSchema.EcsDocument>+LockedStack
          -> 7f09271302b8     Elastic.CommonSchema.EcsDocument[][]
          -> 7f09567f2568     Elastic.CommonSchema.EcsDocument[]
          -> 7f09547abcd8     Elastic.CommonSchema.EcsDocument
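
For context, a standalone sketch (not the sink's code) of why a shared array pool can keep objects reachable, which matches the root chain above: ArrayPool<T>.Shared, implemented by TlsOverPerCoreLockedStacksArrayPool, does not clear reference-type arrays on Return unless asked to, so a returned array keeps its last elements rooted from the pool.

using System;
using System.Buffers;

// Rent an array of a reference type from the shared pool.
var pool = ArrayPool<object>.Shared;
var rented = pool.Rent(1024);
for (var i = 0; i < 1024; i++)
    rented[i] = new object(); // stand-in for EcsDocument instances

// Default Return: elements are left in place and stay GC-reachable via the pool's stacks.
pool.Return(rented);

// Passing clearArray: true nulls the slots so the GC can collect the old elements.
var rentedAgain = pool.Rent(1024);
pool.Return(rentedAgain, clearArray: true);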

urielginsburg added the bug label on Oct 20, 2023

pdeluca93 commented Oct 22, 2023

I'm experiencing a very similar problem with .NET 6 dockerized apps:

My stack:

  • .NET 6 dockerized using both the official Alpine and Debian images
  • NuGet packages for Serilog and Elastic setup:
    <ItemGroup>
        <PackageReference Include="Elastic.Apm.NetCoreAll" Version="1.24.0"/>
        <PackageReference Include="Elastic.Apm.SerilogEnricher" Version="8.6.1"/>
        <PackageReference Include="Elastic.CommonSchema.Serilog" Version="8.6.1"/>
        <PackageReference Include="Elastic.Serilog.Sinks" Version="8.6.1"/>
        <PackageReference Include="Serilog" Version="3.0.1"/>
        <PackageReference Include="Serilog.Enrichers.Environment" Version="2.3.0"/>
        <PackageReference Include="Serilog.Extensions.Hosting" Version="5.0.1"/>
        <PackageReference Include="Serilog.Extensions.Logging" Version="3.1.0"/>
        <PackageReference Include="Serilog.Settings.Configuration" Version="7.0.1"/>
        <PackageReference Include="Serilog.Sinks.Console" Version="4.1.0"/>
    </ItemGroup>
  • Appsettings for Serilog and Elastic setup:
"Serilog": {
   "Using": ["Serilog.Sinks.Console", "Elastic.Serilog.Sinks"],
   "MinimumLevel": {
     "Default": "Information",
     "Override": {
       "Microsoft": "Warning",
       "Microsoft.EntityFrameworkCore": "Information",
       "Microsoft.Hosting.Lifetime": "Information"
     }
   },
   "WriteTo": [
     {
       "Name": "Console",
       "Args": {
         "customFormatter": "Elastic.CommonSchema.Serilog.EcsTextFormatter, Elastic.CommonSchema.Serilog",
         "InlineFields": "True"
       }
     },
     {
       "Name": "ElasticCloud",
       "Args": {
         "cloudId": "xxx",
         "username": "xxx",
         "password": "xxx",
         "customFormatter": "Elastic.CommonSchema.Serilog.EcsTextFormatter"
       }
     }
   ],
   "Enrich": [
     "FromLogContext",
     "WithMachineName",
     "WithElasticApmCorrelationInfo",
     "WithEnvironmentName"
   ]
 },
 "ElasticApm": {
   "SecretToken": "xxx",
   "ServerUrl": "xxx",
   "ServiceName": "xxx",
   "Enabled": "true"
 },
  • Elasticsearch stack version: 8.10.2 (I also encountered this issue with version 8.6.2)

The problem:

At random points, my containers begin to consume an excessive amount of memory, even during periods of inactivity. The memory consumption steadily increases without limit.

I've observed that the following error is logged when this problem begins:

2023-10-21T22:32:53.5989859Z Observed an exception while writing to logs-dotnet-default
2023-10-21T22:32:53.5990190Z System.ArgumentNullException: Value cannot be null. (Parameter 'source')
   at System.Linq.ThrowHelper.ThrowArgumentNullException(ExceptionArgument argument)
   at System.Linq.Enumerable.Where[TSource](IEnumerable`1 source, Func`2 predicate)
   at Elastic.Serilog.Sinks.SelfLogCallbackListener`1.<>c.<.ctor>b__42_1(BulkResponse response, IWriteTrackingBuffer _)
   at Elastic.Channels.BufferedChannelBase`3.ExportBufferAsync(ArraySegment`1 items, IOutboundBuffer`1 buffer)

As a workaround that decreases the impact (in my context) but doesn't solve the underlying issue, I've set this environment variable to disable the server GC mode:

DOTNET_gcServer: 0

And applied memory limits to my containers.

As a side note, it's very frustrating that the sink lacks documentation on how to use Serilog.Settings.Configuration to apply all the settings from the appsettings files. I've figured out how to set some properties, but not how to configure BootstrapMethod, ConfigureChannel, and DataStream.
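
Until that documentation exists, the options that don't map cleanly to appsettings can at least be set in code through the same overload used in the reproduction sample above. A minimal sketch based on that sample, not on official documentation; ConfigureChannel is deliberately left out because its exact option type isn't shown anywhere in this thread:

using Elastic.Serilog.Sinks;
using Serilog;

Log.Logger = new LoggerConfiguration()
    .WriteTo.Elasticsearch(new[] { new Uri("http://localhost:9200") }, o =>
    {
        // These properties appear in the reproduction sample at the top of this issue.
        o.BootstrapMethod = Elastic.Ingest.Elasticsearch.BootstrapMethod.Failure;
        o.MinimumLevel = Serilog.Events.LogEventLevel.Information;
        o.DataStream = new Elastic.Ingest.Elasticsearch.DataStreams.DataStreamName("logs");
    })
    .CreateLogger();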

Mpdreamz (Member) commented:

Thank you for reporting this to us and apologies for getting to it late.

This memory leak should now be fixed as of: #371

All the documents are owned by the array pools backing System.Threading.Channels and get promoted to gen 2 rather quickly. We also have a high default for InboundBufferMaxSize of 100k, which will eat memory for caching purposes. However, since we use BoundedChannelFullMode.Wait, in cases where we extremely overproduce the channel can allocate far more than expected, because it maintains its own internal queue to push these messages at a later time; see: https://github.com/dotnet/runtime/blob/fadd8313653f71abd0068c8bf914be88edb2c8d3/src/libraries/System.Threading.Channels/src/System/Threading/Channels/BoundedChannel.cs#L609.
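
To illustrate that Wait behaviour, here is a standalone sketch using System.Threading.Channels directly (not the sink's internal channel): once the bounded channel is full, extra WriteAsync calls are not rejected, they are parked by the runtime until the reader frees capacity, so heavy over-production keeps all of the pending items alive.

using System;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

// A tiny bounded channel in Wait mode, the same FullMode the sink's channel uses.
var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity: 2)
{
    FullMode = BoundedChannelFullMode.Wait
});

// Many concurrent producers (like Parallel.For in the reproduction above). Once the
// channel is full, each extra WriteAsync is queued inside the channel with its item
// until the reader frees capacity, so nothing is dropped but everything stays alive.
var producers = Enumerable.Range(0, 100)
    .Select(i => Task.Run(async () => await channel.Writer.WriteAsync($"item {i}")))
    .ToArray();

// Complete the writer once every producer has managed to enqueue its item.
_ = Task.WhenAll(producers).ContinueWith(_ => channel.Writer.Complete());

// A deliberately slow consumer: the blocked writers drain only as items are read.
await foreach (var item in channel.Reader.ReadAllAsync())
{
    Console.WriteLine(item);
    await Task.Delay(10);
}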

While we were already disposing the buffers, we now localise their lifetimes more tightly so the GC can release them sooner: elastic/elastic-ingest-dotnet#52

The callbacks also no longer get passed an instance of the buffer itself, only a copy of its metadata: elastic/elastic-ingest-dotnet#53

Both of these greatly ease the memory pressure.

Some more background on how the buffer sizes play into memory usage:

elastic/elastic-ingest-dotnet#54
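
For anyone who wants to cap the footprint before upgrading, lowering the inbound buffer should help. A sketch of what that might look like, assuming ConfigureChannel on the sink options exposes the BufferOptions whose InboundBufferMaxSize is mentioned above; the exact option type and property path are assumptions based on the names in this thread, not a documented API:

using Elastic.Serilog.Sinks;
using Serilog;

Log.Logger = new LoggerConfiguration()
    .WriteTo.Elasticsearch(new[] { new Uri("http://localhost:9200") }, o =>
    {
        o.DataStream = new Elastic.Ingest.Elasticsearch.DataStreams.DataStreamName("logs");
        // Assumption: ConfigureChannel hands back the channel options whose
        // BufferOptions.InboundBufferMaxSize defaults to 100,000 as described above.
        // Lowering it caps how many EcsDocument instances the inbound buffer can hold.
        o.ConfigureChannel = channelOptions =>
            channelOptions.BufferOptions.InboundBufferMaxSize = 10_000;
    })
    .CreateLogger();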
