-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AppenderatorDriver metadata loses information about publishing segments before they are published #4743
Comments
@jihoonson can you confirm this ? |
@pjain1 thanks for pointing out this problem. Your suggestion looks reasonable. I have one comment below.
It seems that now the segments in AppenderatorDriver have 4 states, i.e., active, inactive, publishing, and published. Since the |
True that The main thing is to prevent multiple publish of same sequence on duplicate publish requests and more importantly to actually redo publish of segments for a sequence after the task is restarted if previous publish request did not complete fully. So, if this can be achieved by using only state machine then I am good with that. |
Sorry for insufficient description.
I thought that this could be achieved by maintaining the segments' state machines in an in-memory structure like I think this can solve the problem because all segments are kept in memory until they are published. Please correct me if I missed something. |
So can the segments move directly from The map for maintaining the state of segments can look like Sorry these points are not clear to me, can you please clarify. Thanks |
Sure. Here are more details.
If this is
The state transition can be summarized as active -> (inactive) -> publishing. Transition to inactive state is optional.
Yes, this map should be persisted when committer runs. On a task restart, the map is read from the commit metadata. The segment states on task restart can be different depending on the implementation. Here are some solutions I can think of now.
The driver is able to know that each segment is in which state. When a task restarts, the driver should be able to find the segments to be published in the map because every segment are in there until they are published. |
Thanks for the detailed explanation. The state transitions look good to me, you have already pointed out but mentioning it again that state transition to inactive should not be mandatory as some task may not call For the restart logic, I think rewinding Kafka consumer may not be a good idea, its just extra work and can be problematic if retention on Kafka side is low. I like the idea of driver start publishing the segment by itself, however the only problem I can see there is how the task will get the future for publishing segments, as the task cannot start new publish until the previous one is done and also the task does few things when a publish future is resolved like registering for handoff, adjusting its metadata etc. May be the |
Yeah, it sounds needed. Or it's also possible that |
@jihoonson second approach sounds easier to plug with the current code I have. So, lets go with that if you think it is good. Also, it would be better if the driver can return sequence names instead of segment information. |
@pjain1 sounds good. Thanks! |
Its not blocking me, its a bug that already exists and I was hoping to get this fixed in 0.11. Anyways, I can pick it up once I am done with my current tasks. Thanks |
@pjain1 cool, thanks! |
Fixed in #4815 |
I see a problem with the current way of managing metadata about publishing segments in AppenderatorDriver. Lets say a task calls AppenderatorDriver to publish segments for a sequence, then the driver will remove the sequence information from activeSegments and publishPendingSegments map. Now, if the task is restarted at a point after in memory data is persisted and metadata is committed but before any mergeAndPush or publish happens, then on restart if the task again tells driver to publish the same sequence, there is no way for the driver to know what segments to publish.
If you look at the code for publish or publishAll method of AppenderatorDriver the first thing that is done is to remove the sequence information from activeSegments and publishPendingSegments map. After that push is called on Appenderator with wrapped committer which will contain driver metadata (with the sequence information removed). In the push implementation in memory data is persisted using persist method which also commits the metadata to disk. So, what I was saying is that if task is restarted at this point, the restored metadata might be incomplete as this sequence information will not be restored. Any further calls to publish this sequence wouldn't do anything.
One way to resolve this would be to maintain an additional in-memory structure that contains sequences which are being published. On call to publish, sequence information is removed from activeSegments map but not from publishPendingSegments and added to the new in-memory structure. On successful publish sequence information is removed from the publishPendingSegments and in-memory structure. Any call to publish should try to publish sequence if it is not in the in-memory structure (it may not be in the activeSegment map when task is restarted at the point of time above described).
The text was updated successfully, but these errors were encountered: