Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v7.9.0] Filebeat constantly trying to migrate the registry to a new format #20705

Closed
GiedriusS opened this issue Aug 20, 2020 · 6 comments · Fixed by #20717
Closed

[v7.9.0] Filebeat constantly trying to migrate the registry to a new format #20705

GiedriusS opened this issue Aug 20, 2020 · 6 comments · Fixed by #20717
Assignees
Labels
bug Team:Services (Deprecated) Label for the former Integrations-Services team

Comments

@GiedriusS
Copy link

GiedriusS commented Aug 20, 2020

Same problem as:
#11834 (comment)

Looking at the goroutine dump this comes up over and over again:

1 @ 0x1794ce6 0x1cb2b56 0x1cb2b2b 0x1cb365d 0x1c3a43c 0x1c964b1 0x1c96388 0x1c96285 0x1c3c9a9 0x1c33f81 0x2b66172 0x2b66156 0x2b660f1 0x2b65962 0x2b6a276 0x2b69ff8 0x2b627df 0x2b6f4e3 0x2b6c7b3 0x2b6ba3e 0x2bfe2f5 0x2b334c5 0x2b3ba1a 0x2b318b8 0x2b4853d 0x17f494d 0x17f553a 0x3505fb1 0x3505f9e 0x165ad2a 0x168b791
#	0x1794ce5	bufio.(*Writer).Write+0x235										/usr/local/go/src/bufio/bufio.go:617
#	0x1cb2b55	github.com/elastic/go-structform/json.writer.write+0x55							/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:78
#	0x1cb2b2a	github.com/elastic/go-structform/json.(*Visitor).writeByte+0x2a						/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:97
#	0x1cb365c	github.com/elastic/go-structform/json.(*Visitor).OnString+0x6c						/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:193
#	0x1c3a43b	github.com/elastic/go-structform/gotype.reFoldString+0x6b						/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_primitives.go:55
#	0x1c964b0	github.com/elastic/go-structform/gotype.makeFieldFold.func1+0xe0					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:308
#	0x1c96387	github.com/elastic/go-structform/gotype.makeFieldsFold.func1+0x87					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:194
#	0x1c96284	github.com/elastic/go-structform/gotype.makeStructFold.func1+0x94					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:184
#	0x1c3c9a8	github.com/elastic/go-structform/gotype.foldAnyReflect+0xb8						/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:533
#	0x1c33f80	github.com/elastic/go-structform/gotype.foldInterfaceValue+0x2d0					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold.go:109
#	0x2b66171	github.com/elastic/go-structform/gotype.(*Iterator).Fold+0x401						/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold.go:86
#	0x2b66155	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*jsonEncoder).Encode+0x3e5		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/json.go:52
#	0x2b660f0	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*diskstore).checkpointTmpFile+0x380	/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/diskstore.go:365
#	0x2b65961	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*diskstore).WriteCheckpoint+0xf1		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/diskstore.go:293
#	0x2b6a275	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*store).logOperation+0x55		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/store.go:204
#	0x2b69ff7	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*store).Set+0x177			/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/store.go:186
#	0x2b627de	github.com/elastic/beats/v7/libbeat/statestore.(*Store).Set+0x13e					/go/src/github.com/elastic/beats/libbeat/statestore/store.go:118
#	0x2b6f4e2	github.com/elastic/beats/v7/filebeat/registrar.writeStates+0xf2						/go/src/github.com/elastic/beats/filebeat/registrar/registrar.go:306
#	0x2b6c7b2	github.com/elastic/beats/v7/filebeat/registrar.(*Migrator).updateToVersion1+0x3d2			/go/src/github.com/elastic/beats/filebeat/registrar/migrate.go:233
#	0x2b6ba3d	github.com/elastic/beats/v7/filebeat/registrar.(*Migrator).Run+0x26d					/go/src/github.com/elastic/beats/filebeat/registrar/migrate.go:96
#	0x2bfe2f4	github.com/elastic/beats/v7/filebeat/beater.(*Filebeat).Run+0x2e4					/go/src/github.com/elastic/beats/filebeat/beater/filebeat.go:336
#	0x2b334c4	github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).launch+0x794					/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:456
#	0x2b3ba19	github.com/elastic/beats/v7/libbeat/cmd/instance.Run.func1+0x559					/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:189
#	0x2b318b7	github.com/elastic/beats/v7/libbeat/cmd/instance.Run+0x147						/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:190
#	0x2b4853c	github.com/elastic/beats/v7/libbeat/cmd.genRunCmd.func1+0x9c						/go/src/github.com/elastic/beats/libbeat/cmd/run.go:36
#	0x17f494c	github.com/spf13/cobra.(*Command).execute+0x29c								/go/pkg/mod/github.com/spf13/[email protected]/command.go:766
#	0x17f5539	github.com/spf13/cobra.(*Command).ExecuteC+0x2e9							/go/pkg/mod/github.com/spf13/[email protected]/command.go:852
#	0x3505fb0	github.com/spf13/cobra.(*Command).Execute+0x30								/go/pkg/mod/github.com/spf13/[email protected]/command.go:800
#	0x3505f9d	main.main+0x1d												/go/src/github.com/elastic/beats/x-pack/filebeat/main.go:22
#	0x165ad29	runtime.main+0x1f9											/usr/local/go/src/runtime/proc.go:203

The full goroutine dump:

goroutine profile: total 17
2 @ 0x165b0f0 0x16532fa 0x1652865 0x16bd965 0x16be7fb 0x16be7dd 0x1761b4f 0x1774e5e 0x1a54084 0x1792e43 0x1793b5d 0x1793d94 0x19db62c 0x1a4e7a4 0x1a4e7d3 0x1a553b1 0x1a598b4 0x168b791
#	0x1652864	internal/poll.runtime_pollWait+0x54		/usr/local/go/src/runtime/netpoll.go:203
#	0x16bd964	internal/poll.(*pollDesc).wait+0x44		/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x16be7fa	internal/poll.(*pollDesc).waitRead+0x19a	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x16be7dc	internal/poll.(*FD).Read+0x17c			/usr/local/go/src/internal/poll/fd_unix.go:169
#	0x1761b4e	net.(*netFD).Read+0x4e				/usr/local/go/src/net/fd_unix.go:202
#	0x1774e5d	net.(*conn).Read+0x8d				/usr/local/go/src/net/net.go:184
#	0x1a54083	net/http.(*connReader).Read+0xf3		/usr/local/go/src/net/http/server.go:786
#	0x1792e42	bufio.(*Reader).fill+0x102			/usr/local/go/src/bufio/bufio.go:100
#	0x1793b5c	bufio.(*Reader).ReadSlice+0x3c			/usr/local/go/src/bufio/bufio.go:359
#	0x1793d93	bufio.(*Reader).ReadLine+0x33			/usr/local/go/src/bufio/bufio.go:388
#	0x19db62b	net/textproto.(*Reader).readLineSlice+0x6b	/usr/local/go/src/net/textproto/reader.go:58
#	0x1a4e7a3	net/textproto.(*Reader).ReadLine+0xa3		/usr/local/go/src/net/textproto/reader.go:39
#	0x1a4e7d2	net/http.readRequest+0xd2			/usr/local/go/src/net/http/request.go:1015
#	0x1a553b0	net/http.(*conn).readRequest+0x190		/usr/local/go/src/net/http/server.go:966
#	0x1a598b3	net/http.(*conn).serve+0x6d3			/usr/local/go/src/net/http/server.go:1822

1 @ 0x162e4c4 0x167048c 0x1e790a2 0x168b791
#	0x167048b	os/signal.signal_recv+0x9b	/usr/local/go/src/runtime/sigqueue.go:147
#	0x1e790a1	os/signal.loop+0x21		/usr/local/go/src/os/signal/signal_unix.go:23

1 @ 0x165b0f0 0x1628ec7 0x1628b8b 0x1e7a3c2 0x168b791
#	0x1e7a3c1	github.com/elastic/beats/v7/libbeat/service.HandleSignals.func1+0x51	/go/src/github.com/elastic/beats/libbeat/service/service.go:50

1 @ 0x165b0f0 0x1628ec7 0x1628bcb 0x22277bb 0x168b791
#	0x22277ba	k8s.io/klog.(*loggingT).flushDaemon+0x8a	/go/pkg/mod/k8s.io/[email protected]/klog.go:1010

1 @ 0x165b0f0 0x16532fa 0x1652865 0x16bd965 0x16c01a4 0x16c0186 0x1762402 0x1780f42 0x177fcd4 0x1a5e37d 0x1b7ebe4 0x1b7eb7b 0x168b791
#	0x1652864	internal/poll.runtime_pollWait+0x54					/usr/local/go/src/runtime/netpoll.go:203
#	0x16bd964	internal/poll.(*pollDesc).wait+0x44					/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x16c01a3	internal/poll.(*pollDesc).waitRead+0x1d3				/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x16c0185	internal/poll.(*FD).Accept+0x1b5					/usr/local/go/src/internal/poll/fd_unix.go:384
#	0x1762401	net.(*netFD).accept+0x41						/usr/local/go/src/net/fd_unix.go:238
#	0x1780f41	net.(*TCPListener).accept+0x31						/usr/local/go/src/net/tcpsock_posix.go:139
#	0x177fcd3	net.(*TCPListener).Accept+0x63						/usr/local/go/src/net/tcpsock.go:261
#	0x1a5e37c	net/http.(*Server).Serve+0x25c						/usr/local/go/src/net/http/server.go:2901
#	0x1b7ebe3	net/http.Serve+0x1b3							/usr/local/go/src/net/http/server.go:2468
#	0x1b7eb7a	github.com/elastic/beats/v7/libbeat/api.(*Server).Start.func1+0x14a	/go/src/github.com/elastic/beats/libbeat/api/server.go:65

1 @ 0x165b0f0 0x16532fa 0x1652865 0x16bd965 0x16c01a4 0x16c0186 0x1762402 0x1780f42 0x177fcd4 0x1a5e37d 0x1e7a707 0x1e7a6b2 0x168b791
#	0x1652864	internal/poll.runtime_pollWait+0x54					/usr/local/go/src/runtime/netpoll.go:203
#	0x16bd964	internal/poll.(*pollDesc).wait+0x44					/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x16c01a3	internal/poll.(*pollDesc).waitRead+0x1d3				/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x16c0185	internal/poll.(*FD).Accept+0x1b5					/usr/local/go/src/internal/poll/fd_unix.go:384
#	0x1762401	net.(*netFD).accept+0x41						/usr/local/go/src/net/fd_unix.go:238
#	0x1780f41	net.(*TCPListener).accept+0x31						/usr/local/go/src/net/tcpsock_posix.go:139
#	0x177fcd3	net.(*TCPListener).Accept+0x63						/usr/local/go/src/net/tcpsock.go:261
#	0x1a5e37c	net/http.(*Server).Serve+0x25c						/usr/local/go/src/net/http/server.go:2901
#	0x1e7a706	net/http.Serve+0x76							/usr/local/go/src/net/http/server.go:2468
#	0x1e7a6b1	github.com/elastic/beats/v7/libbeat/service.BeforeRun.func2+0x21	/go/src/github.com/elastic/beats/libbeat/service/service.go:130

1 @ 0x165b0f0 0x166b3e3 0x1ce9c3b 0x1ceafb0 0x168b791
#	0x1ce9c3a	github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).snapshotLoop+0x3ca	/go/src/github.com/elastic/beats/libbeat/monitoring/report/log/log.go:129
#	0x1ceafaf	github.com/elastic/beats/v7/libbeat/monitoring/report/log.MakeReporter.func1+0x4f		/go/src/github.com/elastic/beats/libbeat/monitoring/report/log/log.go:107

1 @ 0x165b0f0 0x166b3e3 0x1e5a9e1 0x168b791
#	0x1e5a9e0	github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run+0xd0	/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/output.go:127

1 @ 0x165b0f0 0x166b3e3 0x1e5daa0 0x168b791
#	0x1e5da9f	github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*retryer).loop+0x24f	/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/retry.go:135

1 @ 0x165b0f0 0x166b3e3 0x1e86ee7 0x1e8db99 0x168b791
#	0x1e86ee6	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*ackLoop).run+0x106	/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/ackloop.go:60
#	0x1e8db98	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.NewQueue.func2+0x58	/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/broker.go:175

1 @ 0x165b0f0 0x166b3e3 0x1e88de9 0x1e57b6a 0x1e5e7dc 0x168b791
#	0x1e88de8	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*consumer).Get+0xe8	/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/consume.go:65
#	0x1e57b69	github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*eventConsumer).loop+0x199	/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/consumer.go:182
#	0x1e5e7db	github.com/elastic/beats/v7/libbeat/publisher/pipeline.newEventConsumer.func1+0x6b	/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/consumer.go:86

1 @ 0x165b0f0 0x166b3e3 0x1e8a73f 0x1e8daff 0x168b791
#	0x1e8a73e	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).run+0x1ce	/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/eventloop.go:316
#	0x1e8dafe	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.NewQueue.func1+0x5e		/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/broker.go:171

1 @ 0x165b0f0 0x166b3e3 0x32ee590 0x168b791
#	0x32ee58f	go.opencensus.io/stats/view.(*worker).start+0xff	/go/pkg/mod/[email protected]/stats/view/worker.go:154

1 @ 0x16a7c45 0x16a5a0a 0x16bf419 0x16bf3a4 0x16c6127 0x16c60f7 0x2b6a90f 0x1794925 0x1794baa 0x1cb2c35 0x1cb2ba2 0x1cb3af1 0x1cb30c0 0x1c36b7f 0x1c96566 0x1c96388 0x1c96285 0x1c3c9a9 0x1c33f81 0x2b66172 0x2b66156 0x2b660f1 0x2b65962 0x2b6a276 0x2b69ff8 0x2b627df 0x2b6f4e3 0x2b6c7b3 0x2b6ba3e 0x2bfe2f5 0x2b334c5 0x2b3ba1a
#	0x16a7c44	syscall.Syscall+0x4											/usr/local/go/src/syscall/asm_linux_amd64.s:18
#	0x16a5a09	syscall.write+0x59											/usr/local/go/src/syscall/zsyscall_linux_amd64.go:914
#	0x16bf418	syscall.Write+0x178											/usr/local/go/src/syscall/syscall_unix.go:214
#	0x16bf3a3	internal/poll.(*FD).Write+0x103										/usr/local/go/src/internal/poll/fd_unix.go:268
#	0x16c6126	os.(*File).write+0x76											/usr/local/go/src/os/file_unix.go:280
#	0x16c60f6	os.(*File).Write+0x46											/usr/local/go/src/os/file.go:153
#	0x2b6a90e	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*ensureWriter).Write+0x7e		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/util.go:56
#	0x1794924	bufio.(*Writer).Flush+0x74										/usr/local/go/src/bufio/bufio.go:591
#	0x1794ba9	bufio.(*Writer).Write+0xf9										/usr/local/go/src/bufio/bufio.go:627
#	0x1cb2c34	github.com/elastic/go-structform/json.writer.write+0xb4							/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:78
#	0x1cb2ba1	github.com/elastic/go-structform/json.(*Visitor).writeString+0x21					/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:101
#	0x1cb3af0	github.com/elastic/go-structform/json.(*Visitor).OnString+0x500						/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:262
#	0x1cb30bf	github.com/elastic/go-structform/json.(*Visitor).OnKey+0x5f						/go/pkg/mod/github.com/elastic/[email protected]/json/visitor.go:137
#	0x1c36b7e	github.com/elastic/go-structform/gotype.foldMapInlineInterface+0x11e					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_map_inline.generated.go:58
#	0x1c96565	github.com/elastic/go-structform/gotype.makeFieldInlineFold.func1+0x85					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:314
#	0x1c96387	github.com/elastic/go-structform/gotype.makeFieldsFold.func1+0x87					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:194
#	0x1c96284	github.com/elastic/go-structform/gotype.makeStructFold.func1+0x94					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:184
#	0x1c3c9a8	github.com/elastic/go-structform/gotype.foldAnyReflect+0xb8						/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold_reflect.go:533
#	0x1c33f80	github.com/elastic/go-structform/gotype.foldInterfaceValue+0x2d0					/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold.go:109
#	0x2b66171	github.com/elastic/go-structform/gotype.(*Iterator).Fold+0x401						/go/pkg/mod/github.com/elastic/[email protected]/gotype/fold.go:86
#	0x2b66155	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*jsonEncoder).Encode+0x3e5		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/json.go:52
#	0x2b660f0	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*diskstore).checkpointTmpFile+0x380	/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/diskstore.go:365
#	0x2b65961	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*diskstore).WriteCheckpoint+0xf1		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/diskstore.go:293
#	0x2b6a275	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*store).logOperation+0x55		/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/store.go:204
#	0x2b69ff7	github.com/elastic/beats/v7/libbeat/statestore/backend/memlog.(*store).Set+0x177			/go/src/github.com/elastic/beats/libbeat/statestore/backend/memlog/store.go:186
#	0x2b627de	github.com/elastic/beats/v7/libbeat/statestore.(*Store).Set+0x13e					/go/src/github.com/elastic/beats/libbeat/statestore/store.go:118
#	0x2b6f4e2	github.com/elastic/beats/v7/filebeat/registrar.writeStates+0xf2						/go/src/github.com/elastic/beats/filebeat/registrar/registrar.go:306
#	0x2b6c7b2	github.com/elastic/beats/v7/filebeat/registrar.(*Migrator).updateToVersion1+0x3d2			/go/src/github.com/elastic/beats/filebeat/registrar/migrate.go:233
#	0x2b6ba3d	github.com/elastic/beats/v7/filebeat/registrar.(*Migrator).Run+0x26d					/go/src/github.com/elastic/beats/filebeat/registrar/migrate.go:96
#	0x2bfe2f4	github.com/elastic/beats/v7/filebeat/beater.(*Filebeat).Run+0x2e4					/go/src/github.com/elastic/beats/filebeat/beater/filebeat.go:336
#	0x2b334c4	github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).launch+0x794					/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:456
#	0x2b3ba19	github.com/elastic/beats/v7/libbeat/cmd/instance.Run.func1+0x559					/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:189

1 @ 0x1a53bd1 0x168b791
#	0x1a53bd0	net/http.(*connReader).backgroundRead+0x0	/usr/local/go/src/net/http/server.go:677

1 @ 0x1b0c9b5 0x1b0c7d0 0x1b0956a 0x1e7792a 0x1e78385 0x1a5ac44 0x1a5ca85 0x1e7a67b 0x1a5ac44 0x1a5ca85 0x1a5dfd3 0x1a59a4c 0x168b791
#	0x1b0c9b4	runtime/pprof.writeRuntimeProfile+0x94					/usr/local/go/src/runtime/pprof/pprof.go:694
#	0x1b0c7cf	runtime/pprof.writeGoroutine+0x9f					/usr/local/go/src/runtime/pprof/pprof.go:656
#	0x1b09569	runtime/pprof.(*Profile).WriteTo+0x3d9					/usr/local/go/src/runtime/pprof/pprof.go:329
#	0x1e77929	net/http/pprof.handler.ServeHTTP+0x339					/usr/local/go/src/net/http/pprof/pprof.go:248
#	0x1e78384	net/http/pprof.Index+0x734						/usr/local/go/src/net/http/pprof/pprof.go:271
#	0x1a5ac43	net/http.HandlerFunc.ServeHTTP+0x43					/usr/local/go/src/net/http/server.go:2012
#	0x1a5ca84	net/http.(*ServeMux).ServeHTTP+0x1a4					/usr/local/go/src/net/http/server.go:2387
#	0x1e7a67a	github.com/elastic/beats/v7/libbeat/service.BeforeRun.func1+0x4a	/go/src/github.com/elastic/beats/libbeat/service/service.go:115
#	0x1a5ac43	net/http.HandlerFunc.ServeHTTP+0x43					/usr/local/go/src/net/http/server.go:2012
#	0x1a5ca84	net/http.(*ServeMux).ServeHTTP+0x1a4					/usr/local/go/src/net/http/server.go:2387
#	0x1a5dfd2	net/http.serverHandler.ServeHTTP+0xa2					/usr/local/go/src/net/http/server.go:2807
#	0x1a59a4b	net/http.(*conn).serve+0x86b						/usr/local/go/src/net/http/server.go:1895

Filebeat logs on boot with the debug log level and stops doing any useful work:

2020-08-20T08:58:46.093Z        INFO    [monitoring]    log/log.go:118  Starting metrics logging every 30s
2020-08-20T08:58:46.093Z        INFO    instance/beat.go:450    filebeat start running.
2020-08-20T08:58:46.093Z        INFO    registrar/migrate.go:186        Migrate registry version 0 to version 1
2020-08-20T08:58:46.521Z        INFO    memlog/store.go:119     Loading data file of '/usr/share/filebeat/data/registry/filebeat' succeeded. Active transaction id=15969
2020-08-20T08:58:46.521Z        INFO    memlog/store.go:124     Finished loading transaction log file for '/usr/share/filebeat/data/registry/filebeat'. Active transaction id=15969
2020-08-20T08:59:16.096Z        INFO    [monitoring]    log/log.go:145  Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":6450,"time":{"ms":6452}},"total":{"ticks
":13480,"time":{"ms":13489},"value":13480},"user":{"ticks":7030,"time":{"ms":7037}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":14},"info":{"ephemeral_id":"a3fab650-b6d9-4879-85fd-59374917d304","upt
ime":{"ms":30163}},"memstats":{"gc_next":53599232,"memory_alloc":44572368,"memory_total":127486704,"rss":119959552},"runtime":{"goroutines":14}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"co
nfig":{"module":{"running":0}},"output":{"type":"kafka"},"pipeline":{"clients":0,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"cpu":{"cores":24},"load":{"1":3.26,"15":5.68,"5":4.97,"norm"
:{"1":0.1358,"15":0.2367,"5":0.2071}}}}}}

Pull request that broke this most likely: https://github.com/elastic/beats/pull/19633/files

cc @urso

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Aug 20, 2020
@ph ph added Team:Platforms Label for the Integrations - Platforms team and removed needs_team Indicates that the issue/PR needs a Team:* label labels Aug 20, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@andresrc andresrc added the Team:Services (Deprecated) Label for the former Integrations-Services team label Aug 20, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@andresrc andresrc removed the Team:Platforms Label for the Integrations - Platforms team label Aug 20, 2020
@urso
Copy link

urso commented Aug 20, 2020

Filebeat migrates old registry files by reading+deserializing all state from the old state into memory first. Once it has read all state into memory it fixes some old entries, updates the schema and finally opens the new store and writes entry by entry into the new store.

Once the migration is finished the old file is deleted and the meta.json file is updated with the new verson ID.

Unfortunately from the the trace and logs provided I can't tell if this process is just slow. But all in all, the migration happens on startup only. Support for running inputs or autodiscovery is not yet enabled in Filebeat.

From the logs it looks like Filebeat was killed during the migration process:

2020-08-20T08:58:46.521Z        INFO    memlog/store.go:119     Loading data file of '/usr/share/filebeat/data/registry/filebeat' succeeded. Active transaction id=15969
2020-08-20T08:58:46.521Z        INFO    memlog/store.go:124     Finished loading transaction log file for '/usr/share/filebeat/data/registry/filebeat'. Active transaction id=15969

A we are still in the migration process, this log indicates that 15969 entries have already been written. Due to the restart during the migration Filebeat will attempt to migrate again on startup, overwriting already written entries.

How many entries does the original registry file has?

Have you noticed IO errors in older runs?

The stack trace shows that the migration is doing a write. Normally entries are written to an append only log file. If the log file gets full (default 10MiB limit) a checkpoint operation is triggered (the WriteCheckpoint call in the trace). The checkpoint removes duplicates, writes the total state to disk, and calls fsync (pretty how the old registry used to operate). If the registry is very big or IO is impacted by disk, then the process can take a while.

What is the state of the /usr/share/filebeat/data/registry/filebeat ? Do you see it changing?

@urso
Copy link

urso commented Aug 20, 2020

Spend some time debugging and benchmarking. While testing I noticed that the migrator disables the limit check, which forces the store to do an fsync operation after every entry migrated. The sweet spot between, this is fast and this is unbearable depends on the machine. On my machine 10000+ registry entries seem to take forever. Will work on a patch that fixes the fsync on every write (already fixes the issue on my machine with 100k taking <30s) and also will try to do the migration in less IO steps, such that hopefully only one fsync will be required in total.

@GiedriusS
Copy link
Author

GiedriusS commented Aug 20, 2020

Once the migration is finished the old file is deleted and the meta.json file is updated with the new verson ID.

Unfortunately from the the trace and logs provided I can't tell if this process is just slow. But all in all, the migration happens on startup only. Support for running inputs or autodiscovery is not yet enabled in Filebeat.

From the logs it looks like Filebeat was killed during the migration process:

2020-08-20T08:58:46.521Z        INFO    memlog/store.go:119     Loading data file of '/usr/share/filebeat/data/registry/filebeat' succeeded. Active transaction id=15969
2020-08-20T08:58:46.521Z        INFO    memlog/store.go:124     Finished loading transaction log file for '/usr/share/filebeat/data/registry/filebeat'. Active transaction id=15969

A we are still in the migration process, this log indicates that 15969 entries have already been written. Due to the restart during the migration Filebeat will attempt to migrate again on startup, overwriting already written entries.

How many entries does the original registry file has?

Have you noticed IO errors in older runs?

The stack trace shows that the migration is doing a write. Normally entries are written to an append only log file. If the log file gets full (default 10MiB limit) a checkpoint operation is triggered (the WriteCheckpoint call in the trace). The checkpoint removes duplicates, writes the total state to disk, and calls fsync (pretty how the old registry used to operate). If the registry is very big or IO is impacted by disk, then the process can take a while.

What is the state of the /usr/share/filebeat/data/registry/filebeat ? Do you see it changing?

@urso you are correct, this just takes a very long time.

I see that there are files *.json being deleted/created constantly. Their names look like auto-incremented IDs. They have content similar to:

{"_key":"filebeat::logs::native::2023885830-2065","identifier_name":"native","source":"/var/lib/docker/containers/b72df8764272db5c309fa48a229848af6fa2f67dcb2d5020863153d6da3bd573/b72df8764272db5c309fa48a229848af6f
a2f67dcb2d5020863153d6da3bd573-json.log","offset":1.101769e+06,"prev_id":"","timestamp":[2.81471359971532e+14,1.594952678e+09],"type":"container","ttl":-2,"FileStateOS":{"device":2065,"inode":2.02388583e+09},"id":
"native::2023885830-2065"}

checkpoint.new is also being constantly updated.

data.json and meta.json stay the same. What's interesting is that all of this is happening on local storage, a beefy SSD and it is taking this long. The old registry file on this Filebeat node that I am currently looking at takes 6MiB of space.

It took on one Filebeat instance to do this operation 2 hours and 15 minutes which is quite a lot. I can close this as a non-issue but it'd be great to have some performance improvements in this area in the future. Thanks for this software and your work on it!!! ❤️

If you will do any follow-up PR with those improvements then feel free to close this

@urso
Copy link

urso commented Aug 20, 2020

It took on one Filebeat instance to do this operation 2 hours and 15 minutes which is quite a lot. I can close this as a non-issue but it'd be great to have some performance improvements in this area in the future. Thanks for this software and your work on it!!!

Let's keep the issue open. This long migration time is much too bad for my likings and should be treated as a bug.

@urso urso added the bug label Aug 20, 2020
kvch pushed a commit that referenced this issue Aug 25, 2020
## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes #20705
kvch pushed a commit to kvch/beats that referenced this issue Aug 25, 2020
## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes elastic#20705

(cherry picked from commit 03748b3)
kvch pushed a commit to kvch/beats that referenced this issue Aug 25, 2020
## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes elastic#20705

(cherry picked from commit 03748b3)
kvch added a commit that referenced this issue Aug 25, 2020
## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes #20705

(cherry picked from commit 03748b3)

Co-authored-by: Steffen Siering <[email protected]>
kvch added a commit that referenced this issue Aug 25, 2020
#20769)

* Improve registry file migration performance (#20717)

## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes #20705

(cherry picked from commit 03748b3)

* fix changelog

Co-authored-by: Steffen Siering <[email protected]>
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this issue Oct 14, 2020
## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes elastic#20705
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…formance (elastic#20769)

* Improve registry file migration performance (elastic#20717)

## What does this PR do?

Ensure that fsync is called only once after the migration of old state entries is complete.

## Why is it important?

The registry uses a checkpoint-predicate that when true writes all state to
disk and calls fsync. The checkpoint operation is supposed to be disabled
when migration the old registry file.
During migration, the old state will be directly copied (after cleanup
and schema changes are applied). The old state will only be deleted
after the migration of all states is complete.

Unfortunately, the checkpoint predicate did return true, instead of
false, which did trigger a checkpoint operation per state to be
migrated. The fix disables fsync, and now finalizes the migration by
calling Checkpoint directly.

The PR also provides a benchmark (each "op" is one migration attempt). Before this fix (go test did kill the run after 10min for 10k entries):

```
BenchmarkMigration0To1/1-32  	             286	   4203353 ns/op
BenchmarkMigration0To1/10-32 	              34	  35730680 ns/op
BenchmarkMigration0To1/100-32         	       2	 720890839 ns/op
BenchmarkMigration0To1/1000-32        	       1	31633569085 ns/op
... test timed out after 10min
```

Benchmark results with the fix (migration 100k entries took ~7.6s):

```
BenchmarkMigration0To1/1-32  	             274	   4371400 ns/op
BenchmarkMigration0To1/10-32 	             259	   4639209 ns/op
BenchmarkMigration0To1/100-32         	     100	  13374147 ns/op
BenchmarkMigration0To1/1000-32        	      13	 104220944 ns/op
BenchmarkMigration0To1/10000-32       	       2	 916656798 ns/op
BenchmarkMigration0To1/100000-32      	       1	7616648790 ns/op
PASS
```

Closes elastic#20705

(cherry picked from commit c2be4a7)

* fix changelog

Co-authored-by: Steffen Siering <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Services (Deprecated) Label for the former Integrations-Services team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants