What version of Badger are you using?
v4.1.0 (tested on latest main too)
What version of Go are you using?
go version go1.20.3 darwin/arm64
Have you tried reproducing the issue with the latest release?
Yes
What is the hardware spec (RAM, CPU, OS)?
MacBook Pro M2
What steps will reproduce the bug?
When 2 concurrent transactions read a non-existent key, and then write to that key, the behavior I currently observe is that one of the transactions will conflict (this, I believe, is expected). For example:
package main

import (
	"log"
	"os"
	"time"

	"github.com/dgraph-io/badger/v4"
	"golang.org/x/sync/errgroup"
)

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

func run() error {
	dir, err := os.MkdirTemp(os.TempDir(), "badger")
	if err != nil {
		return err
	}
	defer func() {
		log.Printf("cleaning up %s", dir)
		_ = os.RemoveAll(dir)
	}()
	db, err := badger.Open(badger.DefaultOptions(dir).WithLoggingLevel(badger.ERROR))
	if err != nil {
		return err
	}
	defer func() {
		_ = db.Close()
	}()
	key := []byte("key-1")
	eg := errgroup.Group{}
	eg.Go(func() error {
		return db.Update(func(txn *badger.Txn) error {
			<-time.After(time.Second) // pause long enough to ensure the other goroutine is running
			_, _ = txn.Get(key)
			return txn.Set(key, []byte("value-1"))
		})
	})
	eg.Go(func() error {
		return db.Update(func(txn *badger.Txn) error {
			<-time.After(time.Second) // pause long enough to ensure the other goroutine is running
			_, _ = txn.Get(key)
			return txn.Set(key, []byte("value-2"))
		})
	})
	return eg.Wait()
}
...yields...
$ go run .
2023/06/04 19:26:09 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger3191223375
2023/06/04 19:26:09 Transaction Conflict. Please retry
exit status 1
Another way to observe this behavior is to use 2 manually managed transactions and "inline" the executed steps as it were, i.e. execute one after the other (with the 2 transactions active at the same time). This is generally an easier way to trigger this behavior. For example:
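A minimal sketch of that manual variant (my reconstruction, not the original code; it assumes the same db handle and imports as the example above, plus fmt): both transactions are opened with db.NewTransaction(true), their reads and writes are interleaved by hand, and the second Commit is expected to return badger.ErrConflict.

func runManual(db *badger.DB) error {
	key := []byte("key-1")

	// Open both transactions before either one commits, so they share the same snapshot.
	tx1 := db.NewTransaction(true)
	defer tx1.Discard()
	tx2 := db.NewTransaction(true)
	defer tx2.Discard()

	// Both read the (non-existent) key, then both write to it.
	_, _ = tx1.Get(key)
	_, _ = tx2.Get(key)
	if err := tx1.Set(key, []byte("value-1")); err != nil {
		return err
	}
	if err := tx2.Set(key, []byte("value-2")); err != nil {
		return err
	}

	// tx1 commits first; tx2 read a key that tx1 has since written, so its
	// commit should fail with badger.ErrConflict.
	if err := tx1.Commit(); err != nil {
		return fmt.Errorf("tx1 failed: %w", err)
	}
	if err := tx2.Commit(); err != nil {
		return fmt.Errorf("tx2 failed: %w", err)
	}
	return nil
}

The output from the author's run of this flow was: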
$ go run .
<snip>
2023/06/04 19:27:41 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger3558899794
2023/06/04 19:27:41 tx2 failed: Transaction Conflict. Please retry
exit status 1
Now, I have a couple of transactions that are slightly more convoluted than the above, but they are expected to conflict with each other for the same reasons. These transactions are executed from different goroutines, so conflicts can happen in normal operation. To simplify things, I've done the same as above: I've taken what would be executed by 2 goroutines concurrently and "inlined" the steps to highlight a scenario where these 2 transactions could conflict with each other (this is the doWork function below). This "inlined" version conflicts as expected. However, if I run those steps across many goroutines at the same time (each operating on completely different keys), 1 of the executions fails to conflict, unexpectedly.
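The doWork referred to above is not reproduced verbatim here; the sketch below is my reconstruction of the shape described (the 1000-goroutine count, the per-goroutine key names, the small random delays, and the log messages are assumptions inferred from the output that follows). Each goroutine runs the same inlined two-transaction sequence on its own key and is expected to end in badger.ErrConflict; any other result is reported as unexpected. It assumes the imports from the first example plus errors, fmt, math/rand, and sync/atomic.

func doWork(db *badger.DB, key []byte) error {
	tx1 := db.NewTransaction(true)
	defer tx1.Discard()
	tx2 := db.NewTransaction(true)
	defer tx2.Discard()

	// Both transactions read the key before either writes it.
	_, _ = tx1.Get(key)
	_, _ = tx2.Get(key)

	// Small random delay; per the note below, this seems to help trigger the issue.
	time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)

	if err := tx1.Set(key, []byte("value-1")); err != nil {
		return err
	}
	if err := tx2.Set(key, []byte("value-2")); err != nil {
		return err
	}
	if err := tx1.Commit(); err != nil {
		return err
	}
	return tx2.Commit() // expected: badger.ErrConflict
}

func runMany(db *badger.DB) error {
	const n = 1000
	var conflicts atomic.Int64
	eg := errgroup.Group{}
	for i := 0; i < n; i++ {
		i := i
		eg.Go(func() error {
			// Each goroutine operates on a completely different key.
			err := doWork(db, []byte(fmt.Sprintf("key-%d", i)))
			if errors.Is(err, badger.ErrConflict) {
				conflicts.Add(1)
				return nil
			}
			return fmt.Errorf("unexpected result: err = %v (i = %d)", err, i)
		})
	}
	if err := eg.Wait(); err != nil {
		log.Printf("failed with %d conflicts", conflicts.Load())
		return err
	}
	log.Printf("completed as expected with %d conflicts", conflicts.Load())
	return nil
}

Repeated runs looked like this: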
$ while go run .; do echo "---"; done
2023/06/04 19:58:58 completed as expected with 1000 conflicts
2023/06/04 19:58:58 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger944970449
---
2023/06/04 19:58:58 completed as expected with 1000 conflicts
2023/06/04 19:58:58 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger358100745
---
2023/06/04 19:58:59 completed as expected with 1000 conflicts
2023/06/04 19:58:59 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger373541670
---
2023/06/04 19:59:00 completed as expected with 1000 conflicts
2023/06/04 19:59:00 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger55216248
---
2023/06/04 19:59:00 failed with 999 conflicts
2023/06/04 19:59:00 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger2814289974
2023/06/04 19:59:00 unexpected result: err = <nil> (i = 101)
exit status 1
So essentially 1 execution of the doWork() function has failed to conflict, whilst 999 (and 4k prior) executions did conflict as expected. Note that the inclusion of the random delays seems to help trigger the conditions for this, though it is unclear to me why.
Expected behavior and actual result.
It's unclear to me if my expectation for things to conflict is reasonable, but it does strike me as odd that the behaviour can apparently differ. Failure to conflict can result in loss of writes.
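To make the loss-of-writes point concrete: callers of Update typically treat a reported conflict as a signal to re-read and re-apply, along the lines of the hedged sketch below (updateWithRetry is my name for illustration, not a badger API). When a genuine conflict goes unreported, this protection never kicks in and the earlier write is silently overwritten.

// updateWithRetry re-runs fn whenever badger reports a conflict, so the losing
// transaction re-reads the current value before writing again.
func updateWithRetry(db *badger.DB, fn func(txn *badger.Txn) error) error {
	for {
		err := db.Update(fn)
		if !errors.Is(err, badger.ErrConflict) {
			return err
		}
	}
}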
Additional information
The same behavior is observed when using in-memory mode (see the snippet below).
The behavior is still observed even with a much reduced number of goroutines running (e.g. 10).
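For reference, the in-memory run mentioned above presumably only changes how the DB is opened (a sketch; everything else in the examples stays the same):

// In-memory mode: no directory on disk, all data is kept in RAM.
db, err := badger.Open(badger.DefaultOptions("").WithInMemory(true))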
Thanks for filing a detailed bug, I am looking into it. It seems like when txn timestamps (readTs) are zero, the transactions don't conflict. I am trying to figure out why that is the case.
This seems like a bug to me: the read watermark used by badger does not handle the case when ts=0 well. I am looking for a solution for it. Thanks again for filing a reproducible bug.
Fixes #1962
We have assumed that the index won't be zero for a WaterMark, but in badger's unmanaged mode we start transactions with readTs = 0. This affects oracle.readMark, which can have values starting at 0.
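As a standalone illustration of the class of problem described (this is a toy, not badger's actual y.WaterMark or oracle code, and all names here are mine): if a watermark treats index 0 as "never used", a reader that begins at 0 is simply not recorded, so the structure can report that everything is done while that reader is still active, and state kept around only for active readers can be discarded from under it.

// toyMark is a simplified stand-in for a read watermark: readers call begin(ts)
// when they start and done(ts) when they finish; doneUntil() reports the
// timestamp up to which no reader is still pending.
type toyMark struct {
	pending map[uint64]int // ts -> readers still active at that ts
	maxSeen uint64
}

func newToyMark() *toyMark { return &toyMark{pending: map[uint64]int{}} }

func (m *toyMark) begin(ts uint64) {
	if ts == 0 {
		// The flawed assumption under illustration: "indices are never zero",
		// so a reader at ts = 0 is never registered.
		return
	}
	m.pending[ts]++
	if ts > m.maxSeen {
		m.maxSeen = ts
	}
}

func (m *toyMark) done(ts uint64) {
	if ts == 0 {
		return
	}
	m.pending[ts]--
}

func (m *toyMark) doneUntil() uint64 {
	// Everything strictly below the smallest still-pending ts counts as done.
	for ts := uint64(1); ts <= m.maxSeen; ts++ {
		if m.pending[ts] > 0 {
			return ts - 1
		}
	}
	return m.maxSeen
}

Since unmanaged-mode badger starts transactions with readTs = 0 (per the description above), an assumption like the one in begin() would leave those transactions invisible to the read mark, which would be consistent with the missed conflict appearing only occasionally.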