
Kafka: Reordered segment allocation causes spurious failures #5761

Closed
gianm opened this issue May 9, 2018 · 2 comments

Comments

gianm (Contributor) commented May 9, 2018

In one of our test clusters we saw Kafka index task failures corresponding to sequence_name_prev_id_sha1 unique constraint violations on the overlord. The query being run was this one:

            handle.createStatement(
                StringUtils.format(
                    "INSERT INTO %1$s (id, dataSource, created_date, start, %2$send%2$s, sequence_name, sequence_prev_id, sequence_name_prev_id_sha1, payload) "
                    + "VALUES (:id, :dataSource, :created_date, :start, :end, :sequence_name, :sequence_prev_id, :sequence_name_prev_id_sha1, :payload)",
                    dbTables.getPendingSegmentsTable(), connector.getQuoteString()
                )
            )
                  .bind("id", newIdentifier.getIdentifierAsString())
                  .bind("dataSource", dataSource)
                  .bind("created_date", DateTimes.nowUtc().toString())
                  .bind("start", interval.getStart().toString())
                  .bind("end", interval.getEnd().toString())
                  .bind("sequence_name", sequenceName)
                  .bind("sequence_prev_id", previousSegmentIdNotNull)
                  .bind("sequence_name_prev_id_sha1", sequenceNamePrevIdSha1)
                  .bind("payload", jsonMapper.writeValueAsBytes(newIdentifier))
                  .execute();

The constraint violation happens when the (sequence_name, sequence_prev_id) pair is not unique. In particular, it happens when the segment sequence "forks": it goes from the same segment X into two different subsequent segments Y and Z. From looking at the tasks involved, the sequence of events looked something like this:
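To make the collision concrete, here is a minimal sketch of how a column like sequence_name_prev_id_sha1 is presumably derived from the (sequence name, previous segment id) pair. The class name, separator byte, and exact concatenation scheme are assumptions for illustration, not the actual Druid implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical reconstruction of the hashed uniqueness key. A "fork" means two
// new segments both claim the same previous segment under the same sequence
// name, so they produce the same hash and the second INSERT violates the
// unique constraint.
public class SequenceHashDemo
{
  static String sequenceNamePrevIdSha1(String sequenceName, String previousSegmentIdNotNull)
  {
    try {
      final MessageDigest digest = MessageDigest.getInstance("SHA-1");
      digest.update(sequenceName.getBytes(StandardCharsets.UTF_8));
      digest.update((byte) 0xff); // separator so ("ab","c") != ("a","bc")
      digest.update(previousSegmentIdNotNull.getBytes(StandardCharsets.UTF_8));
      final StringBuilder sb = new StringBuilder();
      for (byte b : digest.digest()) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    }
    catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-1 is always available on the JVM
    }
  }

  public static void main(String[] args)
  {
    // Segments Y and Z both descend from segment X under the same sequence:
    // identical key, so the second row insert hits the unique constraint.
    final String rowForY = sequenceNamePrevIdSha1("index_kafka_seq_0", "segment_X");
    final String rowForZ = sequenceNamePrevIdSha1("index_kafka_seq_0", "segment_X");
    System.out.println(rowForY.equals(rowForZ)); // prints "true"
  }
}
```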

  1. A Kafka index task allocated some segments and then failed for some reason.
  2. A new task was started up to re-read from the same offset.
  3. The new task also tried to allocate segments, and it worked for a while, but eventually it tried to allocate a segment in a different order than the original task had. That allocation fails, and so does the whole task.
  4. Another task starts up, but it fails for the same reason.
  5. The cycle of violence continues ad infinitum.
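The failure loop above can be modeled with a toy example (not Druid code): two tasks read the same partitions/offsets but see messages interleaved differently, so they allocate the same intervals in different orders and record different (previous id, new id) chains. The interval labels and the START sentinel are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the failure mode: the set of allocated intervals ends up the
// same, but the order-dependent chain of (previous -> next) links does not,
// so the replacement task's inserts collide with rows the original task
// already persisted.
public class OrderingDemo
{
  // Record the chain of (previousSegmentId -> newSegmentId) allocations in
  // the order intervals are first encountered.
  static List<String> allocationChain(List<String> intervalsInArrivalOrder)
  {
    final List<String> chain = new ArrayList<>();
    String prev = "START";
    for (String interval : intervalsInArrivalOrder) {
      chain.add(prev + "->" + interval);
      prev = interval;
    }
    return chain;
  }

  public static void main(String[] args)
  {
    // Original task happened to see partition 0's messages first; the
    // replacement task saw partition 1's first.
    final List<String> task1 = allocationChain(Arrays.asList("10:00", "11:00", "12:00"));
    final List<String> task2 = allocationChain(Arrays.asList("11:00", "10:00", "12:00"));

    // Same intervals, different chains: e.g. "START->10:00" vs "START->11:00".
    System.out.println("chains equal: " + task1.equals(task2)); // prints "chains equal: false"
  }
}
```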

I think the problem is that allocation fails unless segments are allocated in the same order as they originally were. But since #4815, it is no longer guaranteed that two Kafka index tasks starting from the same partitions/offsets will create segments in the same order, because the interleaving of messages from different partitions is non-deterministic. I believe this is meant to be okay: the hypothetical two tasks would still end up in the same place, due to the restriction that there is now just one segment per interval.

(Aside: if there were more than one segment per interval, the tasks would not be guaranteed to end up at the same place, since they wouldn't necessarily make the same decision about which rows to put in which segments for the same interval.)

So I am thinking we should:

  1. Confirm that there is meant to only be one segment per interval per sequence now, and add sanity checks to enforce this if they do not already exist.
  2. Confirm there is no longer any need for the previous-segment-id check.
  3. Remove the check: allow a sequence to allocate segments out of order, subject to there being at most one per interval per sequence.
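The relaxation proposed in step 3 can be sketched as an idempotent allocator keyed by (sequence name, interval). This is a hypothetical illustration, not the eventual Druid fix; the class and method names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed relaxation: drop the previous-segment-id check and
// key pending segments by (sequenceName, interval). Out-of-order allocation
// is fine, and at most one segment exists per interval per sequence.
public class RelaxedAllocator
{
  private final Map<String, String> pendingByKey = new HashMap<>();

  // Returns the existing identifier if this (sequence, interval) pair was
  // already allocated; otherwise records the proposed one and returns it.
  public String allocate(String sequenceName, String interval, String proposedId)
  {
    return pendingByKey.computeIfAbsent(sequenceName + "|" + interval, k -> proposedId);
  }

  public static void main(String[] args)
  {
    final RelaxedAllocator allocator = new RelaxedAllocator();
    // A replacement task may hit intervals in any order; both orders converge
    // on the same segment per interval, with no order-dependent check to trip.
    System.out.println(allocator.allocate("seq_0", "11:00", "seg_B")); // prints "seg_B"
    System.out.println(allocator.allocate("seq_0", "10:00", "seg_A")); // prints "seg_A"
    System.out.println(allocator.allocate("seq_0", "11:00", "seg_C")); // prints "seg_B" (idempotent)
  }
}
```

The idempotent return is the key design point: a replacement task asking for an interval that was already allocated simply gets the existing segment back instead of failing.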

/cc @pjain1 @jihoonson

jihoonson (Contributor) commented:

@gianm nice catch! The supposed solution sounds good to me.

jihoonson (Contributor) commented:

I'm working on this and #5729.
