-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define git merge compatible storage format #74
Comments
@martin.lysk1 Is there a difference between a conflict in vanilla git, and a conflict in lix? If so, do the requirements listed above apply to git conflicts or lix conflicts? |
lix conflicts and git conflicts are curently the same. |
I think this relates to @jan.johannes feedback in this thread. If lix can remove any unwanted conflicts and find all missing conflicts automatically, then you could argue that the new format is not required in order to meet the requirements listed above. |
Seeking alignment. Different paths exists to "solve" merge conflicts:
I think I favor option 1 (ignore lix for now, design for git) while trying to delay "app scoped" merge resolution in inlang until users run into the problem. The other options seems too early. Having a generalizable lix merge api seems out of scope as long as we have to be git compatible (aka as long as lixnet doesn't exist and is widely adopted). @jan.johannes @martin.lysk1 @jldec do we have alignment for option 1 (basically what martin is exploring here)? |
@samuel.stroschein I agree with the overall strategy (1.) of making the inlang file format work nicely with vanilla git. My concern is that it feels like we are jumping directly into a new non-JSON format when a simpler JSON-based solution might cover many of the requirements above, like avoiding merge conflicts in most cases, and triggering merge conflicts in those cases we want. You mentioned:
Do I interpret ☝️ correctly, that (for now) we should try to rely on developers to use available git merge conflict resolution tools, instead of extending the SDK to parse files with merge conflicts, and sending conflicts to the Fink UI? If the simple solution doesn't work, I'll be fine exploring new formats as well, but I'd prefer if we validated that first. To help with that, I've adjusted the v2-persistence serializer to add linefeeds between bundles and messages in the JSON, so we can try out how that works with git merge conflict detection and resolution. The change is in the PR, screenshot added below (using GitHub) |
@jldec uh i like the idea of combining @martin.lysk1 idea with plain json but it seems like @martin.lysk1 approach of a file format that stores inline json is the way to go: how do you deal with merge markers from git that automatically break json? Any JSOn containing merge markers in invalid json
|
Not quite. I am saying we should wait until users run into merge conflicts to see how we should design merge resolution [and not think about it now (except for minimizing merge conflicts in the first place)] |
I understand introducing another parser for a format and not storing json directly sounds a bit far off. If you want to tinker around and give it a further run i found it usefull to run gits merge directly on files in your cli instead of fighting json stings with replace operations. Create a files like: base.txt
left.txt (translator adds an 'de' message)
right.txt (translator adds an 'es' message)
and just run The three JSON's also outline one of the key properties we will need to solve here:
This is a false positive that we should not have to deal with and we should no get merge conflicts with. Please also comment the properties i outlineds in the top - i expect this list to be NOT complete yet... |
For now, the sdk won't load broken JSON - If a merge was initiated locally, and the inlang messages file contains merge markers, then tools like paraglide would should show an error - and someone needs to resolve using non-inlang tools before the final merge can complete and the changes appear on the branch. I'm not sure how fink users could get into this situation - if at all? My gut says that if we do a good job avoiding merge conflicts, then this situation doesn't feel too unreasonable and other more-commonly used features like comments would add more value |
Its multiple properties (see description) but regarding conflicts the one that freaks me out the most is number 2.
Its not fink that i am worried about - its parglide, and sherlock and devs that have to solve conflicts for translators - that should actually be no merge conflicts.
|
Addition:
A reason why I like the outcome of the research so far more and more - it doesn‘t only load in conflicting state - it even gives the conflict marker a meaning within the format. |
This is a good argument. For things that are no merge conflict, no merge conflict should arise. If we need a custom file format for it, we probably should do it as two fink users shouldn't face a merge conflict that needs to be resolved by a dev for something that shouldn't be a merge conflict anyways. Handing off "real" merge conflicts for developers to solve until we know how we are going to tackle merge conflicts i.e. in inlang apps or lix, is also reasonable. |
@samuel.stroschein why do you think this is the way to go? Little things that we take for granted like validation, syntax highlighting, jq, vscode language support, fast parsers in browsers etc. now will stop working or require that we run files through a preprocessor first, to extract the fragments. |
If we don't find a way to make the merge conflict scenarios like 2 (Parallel creation of records should not lead to Conflicts) work with plain json, a custom format that stores json is the way to go. the value of avoiding merge conflicts seems higher than syntax highlighting, vscode language support, etc. people shouldn't edit internal files anyways.
the sdk should load files with merge markers if no merge conflict should have been raised (scenario 2 again, MESDK-114). a mix between "plain json" and fallback to a custom parser with merge markers could be the solution? @jldec @martin.lysk1 can you research if a a json format is possible that avoids merge conflicts like scenario 2? |
actually……. a .json ending that external tooling is operating on is out question anyways to minimize merge conflicts: external formatters/tooling will mess up the formatting of the json, leading to merge conflicts. we had this multiple times already. (it can be a json under the hood but must not end with |
The latest commit in the v2-persistence PR stores JSON with spaces between messages and adds "slots" for untranslated langagues e.g. [
{"id":"message_key_1","alias":{},"messages":[
{"locale":"da","slot":true},
{"locale":"de","slot":true},
{"locale":"en","declarations":[],"selectors":[],"variants":[{"match":[],"pattern":[{"type":"text","value":"Generated message (1)"}]}]},
{"locale":"es","slot":true},
{"locale":"fr","slot":true}]},
{"id":"message_key_2","alias":{},"messages":[
{"locale":"da","slot":true},
{"locale":"de","slot":true},
{"locale":"en","declarations":[],"selectors":[],"variants":[{"match":[],"pattern":[{"type":"text","value":"Generated message (2)"}]}]},
{"locale":"es","slot":true},
{"locale":"fr","slot":true}]}] This format appears to prevent merge conflicts when multiple language translations are merged at the same time. Test files in the zip if you want to try. |
…sorry for the long reponse With the original proposal in the grapic above, I took inspiration form packfile format - that can be read comparable fast and allows to seek inside of the stream. But I consider this as a premature optimization (and therefore may even into the wrong direction). With your input @jldec we can also model a pure json version of the slot file without the binary parsing that I described earlier.
One important aspect of the slot approach - I didn't outlined in the video yet - is the separation of schema stored inside of the slot and the storage/slot managment themself. Instead of using language tags to pre populate slots i would pre populate slots for objets of any kind - we would have to define on what level we want the file be mergable without conflicts in other words - at what level bundle/message/variants we want parallel inserts to work. Why does this matter? Lets take the current iteration on the json here - I know this is just for tinkering around @jldec - but it outlines the problem pretty well: The structure currently models multiple message in one file and uses slots for languages inside of messages:
This has two problems:
@jan.johannes pointed to the fact that - in conflict state we can load the orignal left and right versions that don't contain conflict markers via lix and do the conflict resolution with two valid json files on application level. We will have to inform lix about our conflict resolution anyway to get the git index out of conflict state anyway - spinning another loop to get the left and right file states and marking those atoms as conflicting state in the application state (SDK) sounds way better than parsing conflict markers. The central question for this topic is:When is it ok to get into conflicting state in git because of inlang files. So lets see when this could happen: Only when the messages are changed:
To specify when conflicts are expected and when we have to avoid them we should think about Atomicity Atomicity - MessageBundle vs. Message vs. VariantProperties of an Atom
There are two ways to reach atomacity with the usual git client - separate topic just to mention it:
Now that we defined properties and conflict behavior we can look at the our MessageBundle, Message, and variant and see what should be an Atom. Atomicity on Message BundleProperty 1
Property 2
Property 3
Atomacy on MessageProperty 1
Property 2
Property 3
Atomacy on VariantsWould say no - do anybody see a reason not to conflict when variants are edited in parallel? With atomicty on Message Level we can do the math on a project like cal.com to check out our storage options: ~2000 k messages times 32 Languages File based atomicity:If we store one file per atom we would store ~64k files
Slot based atomicity:With a slot files we could distribute the messages over ~32 files (with 1000k slots each) |
@jan.johannes good. then we are all aligned that we "ignore lix for now and design the inlang storage format for git" for now. |
Agree. 👍 |
example of sqlite-as-a-file usage in the wild |
@jan.johannes in this case they treat the sqlite file as a build artifact, but I don't think it would be a huge leap for them to treat it as source as well, if we gave them an easy, git-aligned way to track changes to content inside the file. |
I watched the latest loom https://discord.com/channels/897438559458430986/1084748995349446667/1247896920966303814. @martin.lysk1 did you do research on how to build an index? Looping over all possible files and message, potentially even variants, to perform CRUD seems to be O(n^2) or even O(n^3). Seems like we need indices. So yes, we are building a git optimized database :/ But it seems to be the easiest way forward. So, I am OK with it until lix matures. |
cancelled in favor of sqlite https://opral.substack.com/p/accelerate-by-years-iv-the-prototype |
MESDK-114
See loom for more infos: https://www.loom.com/share/2927ef8417a4456589f8e2a5fc0a186a?sid=100077c8-01aa-4c83-9417-44ba577046d8
Term definition:
A record is an atomic entity in a database. Operations on such an entity can happen sequencialy but not in parallel. Updates in parallel should raise a conflict to the application level.
What represents an entity and should be atomic needs to be carefully defined during the design of the data model.
In terms of inlang an atom should most likely be a message:
A change in one part of a message, like editing/removing or adding a variant can usually not happen independently. Changes on separate languages - namely spoken, in separate messages - often do not relate.
An ideal storage format should therefore full fill the following properties:
1. A Merge of edits in different records should not lead to conflicts
Example:
User Alice changes recordA
User Bob changes recordB
Should be merged on database level without conflicts.
In inlang:
If one translator changes a variant in spanish and another changes a variant in german in the same message bundle auto merge should be possible - no conflict expected
2. Parallel creation of records should not lead to Conflicts
Example:
User Alice creates recordA
User Bob creates recordB
In inlang:
Two translators a german and an english one should be able to add a message without a conflicts.
3. A deletion and a creation happening in parallel should not lead to conflicts
In inlang:
if a translator deletes a message (german one) while another inserts a spanish one
4. Parallel edits should lead to a conflict - with conflict marker on the specific record!
4.1. It should raise a conflict when editing the same property
Examples
User Alice changes the name property on recordA and recordB
User Bob changes recordA and recordB on the same property
4.2. It should raise a conflict when editing different properties
User Alice changes name on recordA
User Bob changes age (a different property) on recordA
5. A conflicting file, merged by the default git tools, should be readable by our tools
If a merge leads to a conflict application level needs to be able to handle it. A conflict in json will lead to an unreadable file:
- if message file conflicts because of simulatanous edits fink should still function
Proposal
A first draft of a format definition:
The text was updated successfully, but these errors were encountered: