wal: add write ahead log package #332
Conversation
I smell some offline KubeCon discussions 🤔 Are there any more discussions about how and where this would be used?
I figured it to be in relation to this proposal.
@krasi-georgiev yes, that's the right one. No offline discussions and nothing outside of this doc really. Initial benchmarking after integrating this shows TSDB overall looking at a ~20% decrease in sample throughput. So a bit worse on the benchmark scale, but the durability guarantees are actually a fair bit better with this WAL.
Looked long and hard and didn't see any problems with the implementation.
The code is easy to follow and read 👍
if err := prev.Close(); err != nil {
	level.Error(w.logger).Log("msg", "close previous segment", "err", err)
}
}
Seems that the actor is used only in a couple of places; is there no way to use a simple mutex and get rid of run()
altogether, to simplify?
Unfortunately not; a mutex would mean the writing goroutines would block. Fsync can take several seconds, or much longer in bad conditions.
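For illustration, a minimal sketch of the actor pattern being discussed, with hypothetical names (`actorc`, `completeSegment`) rather than the PR's exact code: writers only enqueue the fsync/close of a finished segment, so a slow fsync never blocks the write path the way a shared mutex would.

```go
package main

import (
	"log"
	"os"
)

// WAL holds a channel of deferred operations that a single background
// goroutine executes sequentially. Hypothetical sketch, not the PR's code.
type WAL struct {
	actorc chan func()
}

// run executes queued operations until the channel is closed.
func (w *WAL) run() {
	for f := range w.actorc {
		f()
	}
}

// completeSegment is called on the hot write path; it only enqueues work.
func (w *WAL) completeSegment(prev *os.File) {
	w.actorc <- func() {
		if err := prev.Sync(); err != nil {
			log.Println("sync previous segment:", err)
		}
		if err := prev.Close(); err != nil {
			log.Println("close previous segment:", err)
		}
	}
}

func main() {
	w := &WAL{actorc: make(chan func(), 16)}
	done := make(chan struct{})
	go func() { w.run(); close(done) }()

	f, _ := os.CreateTemp("", "segment")
	w.completeSegment(f)

	close(w.actorc) // let run() drain and exit for this demo
	<-done
}
```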
wal/wal.go
Outdated
return &Reader{rdr: bufio.NewReader(r)}
}

// Next advances the reader to the next records and returns true iff it exists.
iff -> if
Inevitably gets pointed out each time I try to use it ;) https://en.wikipedia.org/wiki/If_and_only_if
Doesn't seem to catch on – when no one knows what it means, it's probably not worth using.
wal/wal.go
Outdated
default:
	return errors.Errorf("unexpected record type %d", typ)
}
// Only increment i for non-zero records since we use it below
below -> above
Force-pushed from 273b107 to fbcedb5.
Added 3 commits to change that :P I know it's quite a bit of code for a single PR, but it's well-separated bottom-up along commits. I think some parts are just better to review with the full picture at hand. The old WAL code was not removed; it needs to stick around anyway for migration procedures.
if err != nil || k >= n {
	continue
}
if err := os.RemoveAll(filepath.Join(dir, fi.Name())); err != nil {
If one checkpoint can't be deleted (it becomes read-only?), then later checkpoints won't be deleted until that error is solved, and the failure mode will be upgraded from "can't delete this" to "I'm filling up the disk". Would it make sense to continue deleting them?
Also, I see Checkpoint() would fail. Would that cause the Prometheus server to abort, or something else that could cause data loss? In that case, I think this function shouldn't fail if a directory can't be deleted.
You are right. We shouldn't abort on failure in either place, especially since those leftover checkpoints generally don't cause problems – we just ignore them.
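A minimal sketch of the best-effort cleanup being agreed on here (hypothetical directory naming and helper, not the PR's code): deletion failures are logged and skipped instead of aborting, so one stuck directory doesn't block removal of the rest.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// deleteOldCheckpoints removes "checkpoint.<k>" directories with k below n.
// Failures are logged and skipped rather than returned, so cleanup continues
// past an undeletable directory. Hypothetical sketch only.
func deleteOldCheckpoints(dir string, n int) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		log.Println("list checkpoints:", err)
		return
	}
	for _, e := range entries {
		k, err := strconv.Atoi(strings.TrimPrefix(e.Name(), "checkpoint."))
		if err != nil || k >= n {
			continue // not an old checkpoint; leave it alone
		}
		if err := os.RemoveAll(filepath.Join(dir, e.Name())); err != nil {
			log.Println("delete old checkpoint:", err)
			continue // best effort: keep deleting the remaining ones
		}
	}
}

func main() {
	deleteOldCheckpoints(os.TempDir(), 5)
}
```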
checkpoint.go
Outdated
//
// If no errors occurs, it returns the number of the highest segment that's
// part of the checkpoint.
func Checkpoint(w *wal.WAL, m, n int, keep func(id uint64) bool, mint int64) (*CheckpointStats, error) {
What are m and n? Can you document them?
checkpoint.go
Outdated
	repl = append(repl, s)
	break
} else {
|
Delete empty block?
head.go
Outdated
@@ -139,9 +141,9 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
}, func() float64 {
	return float64(h.MinTime())
})
m.walTruncateDuration = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "prometheus_tsdb_wal_truncate_duration_seconds",
Are we assuming nobody out there relies on this metric for monitoring Prometheus? Would it make sense to preserve it?
Metrics are explicitly excluded from our stability guarantees – but for critical metrics the concern makes sense nonetheless, of course.
I think this is more of a debug metric than an alerting/dashboard one, but it may make sense to keep it anyway – especially since the name is still generally accurate.
func (w *WAL) fsync(f *Segment) error {
	start := time.Now()
	err := fileutil.Fsync(f.File)
	w.fsyncDuration.Observe(time.Since(start).Seconds())
Nit, is fsync often taking 10s of seconds or longer? If closer to 1 second (or below), I'd use more granular unit, e.g. ms.
In Prometheus we have a few instrumentation conventions that are not baked into our model. One of them is that we always use base units, i.e. bytes, seconds, ...
The idea is that you don't have to make unit adjustments if you want to do binary ops between metrics and such, and can just do a final division/multiplication on the overall result. Dashboard builders also often allow you to set a unit and they'll pick the most sensible scale – you could then just always default to the base unit and don't have to handpick it for each graph.
Since everything is a double in our world, this choice conveniently doesn't really impact data size or similar.
Time.Seconds() returns a float64 by the way, so we aren't losing the sub-seconds in case you meant that.
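To illustrate the convention, here is a small sketch using client_golang with a hypothetical metric name: durations are observed in seconds as float64, so sub-second values keep their precision and dashboards can rescale to ms as needed.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// fsyncDuration follows the base-unit convention: the value is always seconds,
// expressed as a float64. The metric name is hypothetical, for the example.
var fsyncDuration = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "example_fsync_duration_seconds",
	Help: "Duration of fsync calls in seconds.",
})

func timedFsync(sync func() error) error {
	start := time.Now()
	err := sync()
	// time.Duration.Seconds() returns a float64, e.g. 0.0042 for 4.2ms.
	fsyncDuration.Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(fsyncDuration)
	_ = timedFsync(func() error { time.Sleep(5 * time.Millisecond); return nil })
}
```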
wal/wal.go
Outdated
|
// flushPage writes the contents of the page to disk.
// If clear is set or no more records will reasonbly fit into it, its
// remaining size gets written as zero data and its reset for the next physical page.
I'm not sure I understand the last part. Is it saying that the page will be padded with byte-0 if clear is true or the page doesn't have room for another record?
If clear is false, we may fill up the remainder of the page with zero bytes if we think there's not enough space left to fit in another (partial) record. If clear is true, we force the zero padding no matter how much space is left. We need the latter on shutdown or when completing a segment (when no partial records may be written), since we never want to leave a segment with a size that's not a multiple of 32KB.
I will make the comment clearer.
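A minimal sketch of the padding behaviour described above, with hypothetical field names and a 32 KiB page size taken from the discussion; this is not the PR's exact code.

```go
package main

const pageSize = 32 * 1024 // assumed from the discussion above

// recordHeaderSize is a hypothetical minimum size a (partial) record needs.
const recordHeaderSize = 7

type page struct {
	buf   [pageSize]byte
	alloc int // bytes used so far
}

// flushPage sketches the rule from the comment: if clear is set, or too little
// room remains for even a partial record, the rest of the page is written as
// zero bytes and the page is reset for the next physical page.
func flushPage(p *page, write func([]byte) error, clear bool) error {
	clear = clear || p.alloc > pageSize-recordHeaderSize

	n := p.alloc
	if clear {
		n = pageSize // include the zero padding, keeping the segment 32KiB-aligned
	}
	if err := write(p.buf[:n]); err != nil {
		return err
	}
	if clear {
		*p = page{} // reset for the next physical page
	}
	return nil
}

func main() {}
```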
Force-pushed from d80e448 to 36ac113.
Some initial feedback after running this in a small Prometheus server with high-frequency scrapes: Memory and CPU usage seem about 3-5% lower on average, which may just be noise. Max usage shows no notable difference. Startups got about 3x slower; this is not super concerning, but also not great for big setups and worth investigating.
@brian-brazil @gouthamve can I have a review from one of you? We introduced a regression a while back regarding block order. It never made it into Prometheus though. Fixed it in here and added a test.
This is looking good.
During performance analysis of this change, I found a regression. A buffer that seems stack-allocated doesn't pass escape analysis and gets thrown on the heap. That resulted in big spikes. I pushed a commit to fix this. The graph shows memory usage before and after the change, comparing the released v2.2.1 with a version that has the new WAL and in-memory metadata store.
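For illustration, a sketch of the kind of fix described (hypothetical names and framing, not the actual commit): a scratch buffer kept on the reader struct is reused across calls, instead of a per-call allocation that escapes to the heap and shows up as allocation spikes.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// Reader keeps its scratch buffer on the struct so it is allocated once and
// reused, rather than escaping to the heap on every Next() call.
// Hypothetical sketch, not the PR's code.
type Reader struct {
	rdr *bufio.Reader
	buf []byte // reused across calls to Next
}

func NewReader(r io.Reader) *Reader {
	return &Reader{rdr: bufio.NewReader(r), buf: make([]byte, 0, 4096)}
}

// Next reads one record using a toy 1-byte length prefix into the reused buffer.
func (r *Reader) Next() ([]byte, error) {
	n, err := r.rdr.ReadByte()
	if err != nil {
		return nil, err
	}
	if cap(r.buf) < int(n) {
		r.buf = make([]byte, n)
	}
	r.buf = r.buf[:n]
	if _, err := io.ReadFull(r.rdr, r.buf); err != nil {
		return nil, err
	}
	return r.buf, nil
}

func main() {
	r := NewReader(strings.NewReader("\x05hello"))
	rec, _ := r.Next()
	_ = rec
}
```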
I haven't dug too deeply into this. The new file format should be added to the docs.
checkpoint.go
Outdated
// This makes it easy to read it through the WAL package and concatenate
// it with the original WAL.
//
// Non-critical errors are locked and not returned.
logged?
checkpoint.go
Outdated
return nil, errors.New("invalid record type")
}
if len(buf[start:]) == 0 {
	continue // all contents discarded
Comment style
checkpoint.go
Outdated
}
recs = append(recs, buf[start:])

// flush records in 1 MB increments.
Capital letter
@@ -613,10 +616,6 @@ func OverlappingBlocks(bm []BlockMeta) Overlaps {
if len(bm) <= 1 {
	return nil
}
sort.Slice(bm, func(i, j int) bool {
Why are you removing this?
This was mutating the input slice in a function that should do a read-only check. Thus current master actually has a regression. The sort is moved properly into the reload function.
This is fixed and tested in master via #335 so no longer relevant for this PR.
@@ -23,3 +24,25 @@ func ReadDir(dirpath string) ([]string, error) {
	sort.Strings(names)
	return names, nil
}

// Rename safely renames a file.
This has a race condition between the remove and the rename
Care to elaborate?
The doc says it's "safe"; however, there's a race condition. In what way is this meant to be safe?
I meant to elaborate on what exactly the race condition is.
From the time the RemoveAll starts to when the rename ends, a partial file or no file may be present.
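For context, a common pattern for shrinking that window (a sketch, not necessarily what this PR ends up doing): write the new content to a temporary file, fsync it, rename it over the destination (atomic on POSIX filesystems), then fsync the parent directory.

```go
package main

import (
	"os"
	"path/filepath"
)

// replaceFile writes data to dst via a temporary file and an atomic rename, so
// readers see either the old complete file or the new complete file, never a
// partial one. Sketch only; not the helper discussed in the diff.
func replaceFile(dst string, data []byte) error {
	tmp := dst + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // flush file contents to disk first
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, dst); err != nil { // atomic on POSIX filesystems
		return err
	}
	// Fsync the parent directory so the rename itself survives a crash
	// (POSIX behaviour; may be a no-op or error on other platforms).
	dir, err := os.Open(filepath.Dir(dst))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}

func main() {
	_ = replaceFile(filepath.Join(os.TempDir(), "example.dat"), []byte("hello"))
}
```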
head.go
Outdated
return nil
}

// Init backfills data from the write ahead log and prepares the head for writes.
This is more a load than a backfill, which is confusing
} else {
	level.Error(h.logger).Log("msg", "WAL truncation failed", "err", err, "duration", time.Since(start))
if _, err = Checkpoint(h.logger, h.wal, m, n, keep, mint); err != nil {
	return errors.Wrap(err, "create checkpoint")
}
h.metrics.walTruncateDuration.Observe(time.Since(start).Seconds())
We should probably have a separate metric for the checkpoint duration
It was renamed to checkpoint but, as @jkohen pointed out, it might break alerts and dashboards. It is essentially still part of the truncation process, so keeping the name seemed fine. The WAL.Truncate call is just simple file removal. I don't think there's anything worth measuring in there.
wal/wal.go
Outdated
w.segment = next
w.donePages = 0

// Don't block further writes by fsyinc the last segment.
fsyncing
}
}

donec := make(chan struct{})
I think a single-level channel is sufficient here
It's not; we have to wait until run has terminated successfully. Currently it returns right when it reads from the channel, but this is quite implementation-dependent and would be an easy regression target.
It looks like unnecessary complexity currently, especially as WAL is a struct rather than an interface.
wal/wal.go
Outdated

func (w *WAL) run() {
	for {
		// Processing all pending functions has precedence over shutdown.
What happens if we shut down with pending functions? The current code seems to allow for that.
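For background, the "precedence over shutdown" comment usually refers to Go's nested-select idiom. This is a sketch of the general pattern with hypothetical channel names, not necessarily this PR's exact loop; it also shows the donec pattern discussed above for waiting until run has fully terminated.

```go
package main

import "fmt"

// run drains pending functions with priority over the stop signal: the outer
// select only falls through to checking stopc when actorc is empty.
func run(actorc chan func(), stopc, donec chan struct{}) {
	defer close(donec)
	for {
		select {
		case f := <-actorc:
			f()
		default:
			select {
			case f := <-actorc:
				f()
			case <-stopc:
				return
			}
		}
	}
}

func main() {
	actorc := make(chan func(), 2)
	stopc := make(chan struct{})
	donec := make(chan struct{})
	actorc <- func() { fmt.Println("pending work 1") }
	actorc <- func() { fmt.Println("pending work 2") }
	go run(actorc, stopc, donec)
	close(stopc)
	<-donec // Close waits until run has fully terminated.
}
```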
@@ -221,17 +223,27 @@ func (h *Head) processWALSamples(
	h.metrics.chunksCreated.Inc()
	h.metrics.chunks.Inc()
}
if s.T > maxt {
	maxt = s.T
}
what about mint?
I actually uncovered several more issues around this general area. I've adjusted things to make it work, but it touches a few more places. So I'd rather add this in a PR on top.
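As a small illustration of the point being raised (a hedged sketch with hypothetical types, not the follow-up PR): tracking the lower bound of the time range while replaying samples is symmetrical to the maxt tracking in the diff above.

```go
package main

import (
	"fmt"
	"math"
)

type sample struct {
	T int64 // timestamp in milliseconds
	V float64
}

// timeRange returns the min and max timestamps seen while replaying samples,
// mirroring the maxt tracking above with the symmetric mint case.
func timeRange(samples []sample) (mint, maxt int64) {
	mint, maxt = math.MaxInt64, math.MinInt64
	for _, s := range samples {
		if s.T < mint {
			mint = s.T
		}
		if s.T > maxt {
			maxt = s.T
		}
	}
	return mint, maxt
}

func main() {
	mint, maxt := timeRange([]sample{{T: 1500, V: 1}, {T: 1000, V: 2}, {T: 2000, V: 3}})
	fmt.Println(mint, maxt) // 1000 2000
}
```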
btw did you try to run this with
@krasi-georgiev yea, that looks all okay.
@brian-brazil how does this one look? I've two follow-up PRs ready.
I'm still waiting on file format docs/protocol before giving it a full review.
The record encoding itself did not change, but so far the WAL format isn't documented at all. I've another PR in the queue that changes the record format and backfills documentation. I can backport it to this one, though, without those changes.
@brian-brazil docs added
Allow to repair the WAL based on the error returned by a reader during a full scan over all records. Signed-off-by: Fabian Reinartz <[email protected]>
Create checkpoints from a sequence of WAL segments while filtering out obsolete data. The checkpoint format is again a sequence of WAL segments, which allows us to reuse the serialization format and implementation. Signed-off-by: Fabian Reinartz <[email protected]>
Remove the old WAL and drop in the new one Signed-off-by: Fabian Reinartz <[email protected]>
We assume in multiple places that the block list held by DB has blocks sequential by time. A regression caused us to hold them ordered by ULID, i.e. by creation time instead. Signed-off-by: Fabian Reinartz <[email protected]>
The buffers we allocated were escaping to the heap, resulting in large memory usage spikes during startup and checkpointing in Prometheus. This attaches the buffer to the reader object to prevent this. Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
On startup, rewrite the old write ahead log into the new format once. Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
Rebased and resolved conflicts.
This fixes various issues when initializing the head time range under different starting conditions. Signed-off-by: Fabian Reinartz <[email protected]>
Signed-off-by: Fabian Reinartz <[email protected]>
👍
Just some minor doc nits.
checkpoint.go
Outdated
DroppedTombstones int
TotalSeries int // Processed series including dropped ones.
TotalSamples int // Processed samples inlcuding dropped ones.
TotalTombstones int // Processed tombstones including droppes ones.
dropped
docs/format/wal.md
Outdated
└─────────────────────────────────────────────────────┘
```

[1][https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format]
GitHub isn't rendering this properly.
wal/wal.go
Outdated

const (
	recPageTerm recType = 0 // rest of page is empty
	recFull recType = 1 // full record
Full stops and capital letters. There's others in this file too.
Signed-off-by: Fabian Reinartz <[email protected]>
Properly initialize head time
Signed-off-by: Fabian Reinartz <[email protected]>
Migrate write ahead log
Signed-off-by: Fabian Reinartz <[email protected]>
👍
This adds a new WAL that's agnostic to the actual record contents.
It's much simpler and should be more resilient than the existing one.
It is safe for concurrent reads across different processes.
@brian-brazil @jkohen
Integrating this into the rest of TSDB is another can of worms. I may need to make some slight tweaks here and there, but in general it seems complete and working.
The batch mode is mostly there because I'd like to implement the checkpointing (formerly truncation) by just writing another WAL so we don't need to implement another disk format. But in that batch write scenario, firing off millions of tiny writes wouldn't be ideal.
Also, we need some nicer recovery handling than in the last one – but that should be simpler to do as well.
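To make the batch-mode point concrete, here is a hedged sketch of what a batched append might look like (hypothetical API and framing, not necessarily the PR's signature or record format): many small records are framed into one buffer and flushed with a single write, instead of millions of tiny writes during checkpointing.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// logBatch frames all records into one buffer and hands it to the underlying
// writer in a single call. Hypothetical 4-byte big-endian length prefix,
// chosen only for the example.
func logBatch(w *bytes.Buffer, recs ...[]byte) error {
	var buf bytes.Buffer
	for _, rec := range recs {
		var hdr [4]byte
		binary.BigEndian.PutUint32(hdr[:], uint32(len(rec)))
		buf.Write(hdr[:])
		buf.Write(rec)
	}
	_, err := w.Write(buf.Bytes()) // one write for the whole batch
	return err
}

func main() {
	var segment bytes.Buffer
	_ = logBatch(&segment, []byte("series"), []byte("samples"), []byte("tombstones"))
	fmt.Println("segment bytes:", segment.Len())
}
```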