Allow selecting previously unselected streams in a source without requiring a reset of the connection #3520

sherifnada · 2021-05-21T00:23:13Z

Tell us about the problem you're trying to solve

When a user selects some streams from a schema, we retain only the streams they selected in the catalog we store on the backend. For example, if my SQL DB has tables 1-10 but I initially replicate only table 1 (say, to test out Airbyte, or because I don't yet need to replicate more data), then on the backend we store a catalog that contains only the one stream corresponding to the selected table.

If I later want to replicate more streams from my source, I have to refresh the schema.

This is problematic because refreshing the schema requires resetting the state for that connection, which means we end up re-replicating lots of data. This can be very costly if we're pulling data from a rate limited API or a DB with lots of data.

Describe the solution you’d like

There are a few options we can go with:

20/80 approach: Add a metadata field to each configured stream that denotes whether it was selected. This way we can store the full catalog at all times and never need to reset the connection. This seems like the simplest approach, requiring the least effort. It would, however, require changes to the CDK and our connectors to make sure they respect this flag (today, the presence of a stream in the catalog passed to a connector implies that it is selected for syncing). That said, we can do this in a backwards-compatible way inside the worker until all connectors are migrated to this approach.
Whole hog: refreshing schemas should not require resetting the connection state and re-syncing data -- this should only happen if there was a schema migration like a rename or something.
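
A minimal sketch of how the 20/80 approach could look, assuming a hypothetical `selected` flag on each configured stream. The class and function names below are illustrative only and are not the actual Airbyte protocol or CDK API. The idea is that the backend keeps the full catalog, and the worker filters out unselected streams before handing the catalog to a connector that has not been migrated yet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConfiguredStream:
    """Hypothetical configured stream; `selected` is the proposed metadata flag."""
    name: str
    selected: bool = True


@dataclass
class ConfiguredCatalog:
    """Hypothetical catalog that always holds every discovered stream."""
    streams: List[ConfiguredStream] = field(default_factory=list)


def catalog_for_legacy_connector(catalog: ConfiguredCatalog) -> ConfiguredCatalog:
    """Worker-side shim: legacy connectors treat presence as selection,
    so pass them only the streams whose flag is set."""
    return ConfiguredCatalog(streams=[s for s in catalog.streams if s.selected])


def select_stream(catalog: ConfiguredCatalog, name: str) -> None:
    """Mark a previously unselected stream as selected, without touching the others."""
    for stream in catalog.streams:
        if stream.name == name:
            stream.selected = True
            return
    raise KeyError(f"stream not found in stored catalog: {name}")
```

With this shape, selecting table 2 later is a flag flip on the stored catalog rather than a schema refresh, so nothing in this sketch requires clearing the state of the streams that were already syncing.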

Describe the alternative you’ve considered or used

Do nothing about this problem: re-sync data every time you want to update the schema selection.


fixico-abdel · 2021-06-04T12:57:27Z

upvote

harshithmullapudi · 2021-06-11T11:56:22Z

I didn't dig deep, but we could do it the way Singer does: keep all the fields and use the selected field to tell which ones the user selected. I think this could also avoid having both source_catalog and singer_rendered_catalog.json. What do you say, @davinchia @sherifnada?

JCWahoo · 2021-06-14T15:22:58Z

upvote as well !

tylerdelange · 2021-06-22T19:44:24Z

This functionality is going to be incredibly important to us as we have a single DB with 300 tables we are ingesting. The thought of having to re-import around 1B rows of data every time we have a schema change gives me shivers up my spine.

The ability of software to handle schema changes is why competitors like Fivetran and Stitch exist. Airbyte needs to be able to handle (automagically) incoming schema changes from the source tables if it wants to compete. I am rooting for you and in your corner. I hope that this is something we can put on the roadmap soon.

ChristopheDuong · 2021-07-21T19:06:58Z

related to #4295?

kyle-cheung · 2021-08-17T19:08:41Z

would be nice to even be able to select which tables you want to refresh

sherifnada · 2021-08-17T21:02:12Z

@kyle-cheung agreed - we'd love to ship this. No firm ETA yet, but it will probably ship late this quarter (September) or next quarter (Oct/Nov).

bkrausz · 2021-08-19T18:08:22Z

Does anyone have pointers on how to hackily accomplish this today while we wait for the feature? It seems like we should be able to get the source schema for the new stream and insert it into the STANDARD_SYNC config, but I want to be cautious since an accidental reset would be unpleasant for us.

Similarly, a guide on how to jumpstart incremental syncs would be excellent as it would give confidence in changing configs being safe. For example, I have a backup snapshot of an old sync that took days to run in Snowflake. Being able to tell Airbyte to only sync details after that backup would be immensely helpful.

kyle-cheung · 2021-09-21T19:52:41Z

Sorry for the double comment, but this has been happening a lot over the past 2 weeks. We're adding more tables in production and also piping more tables into Snowflake. I'm running into purely human error when reselecting the tables after updating the source streams (missing some previously selected tables). A second problem is that I'm using Incremental + Deduped, so I actually lose the SCDs when resetting. That is totally unnecessary, since I only need to bring in 1 additional table but am forced to resync the entire DB.

hoanghapham · 2021-10-15T09:51:21Z

Upvote this. I also think this is a much needed feature.

aforlorncat1 · 2021-11-15T08:19:23Z

Upvoted. This feature is critical for our use cases.

grkhr · 2021-12-07T08:59:17Z

Upvote. We have a large db and resyncing every time a new field is added to a small table is hell.

thomasclavet · 2021-12-15T09:59:52Z

Upvote; very interesting feature IMO as well

acolyer · 2022-04-26T08:23:56Z

Upvote, this is a major pain for us too.

sherifnada · 2022-04-27T01:24:48Z

@acolyer and all - good news! This is happening this quarter. Actively being worked on at the moment.

kyle-cheung · 2022-04-27T16:40:55Z

@sherifnada this is awesome! Will this feature allow us to reset only a single table? Say, if the schema changed for only 1 table. Will there also be anything along the lines of auto schema change detection?

andyjih · 2022-05-06T22:03:08Z

@kyle-cheung That's correct! If the schema changed only for 1 table, then only that 1 table would be reset.
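
As a rough illustration of the per-table reset described above, here is a sketch that assumes connection state is stored as one entry per stream. The dictionary shape and the `reset_streams` helper are hypothetical and are not Airbyte's actual state format.

```python
from typing import Any, Dict, Set

# Hypothetical shape: stream name -> that stream's saved cursor/state.
StreamStates = Dict[str, Dict[str, Any]]


def reset_streams(state: StreamStates, changed: Set[str]) -> StreamStates:
    """Return a new state with only the changed streams cleared.

    Streams that are not in `changed` keep their saved state, so they
    resume incrementally instead of being re-replicated from scratch.
    """
    return {name: saved for name, saved in state.items() if name not in changed}
```

A stream missing from the returned state would be synced from the beginning on the next run, while every other stream continues from where it left off.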

I've connected this issue to the main Epic where we're tracking all of our work towards this feature, so when that work is done, we'll follow-up here.

Lastly, we are planning on supporting auto-schema change detection! It'll be the next project we tackle after this one. I'd love to get your thoughts on what you'd expect with auto-detecting schema changes!

Soufraz · 2022-06-14T15:38:33Z

Hi @andyjih. Will it be available for all connectors? Or only for Postgres, as it was the most reported issue mentioned here?

kyle-cheung · 2022-06-14T16:32:10Z

@andyjih that's amazing, super looking forward to this change. For schema resets, I wonder if there's a way for Airbyte to create a temporary table and then swap the new table with the old, so that during the reset period there's no lost data or empty table. This is generally an issue whenever I reset my connection now: there's a lull of a few hours where the database is empty. My workaround is to create a separate dev schema and then swap the schemas once the reset completes.
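
A minimal sketch of the load-then-swap idea, using Python's built-in sqlite3 as a stand-in for the destination. This is not how Airbyte performs resets; the `reset_with_swap` helper, the `__tmp` suffix, and the two-column table layout are assumptions made for illustration, and whether a given warehouse supports transactional DDL or an atomic swap is destination-specific.

```python
import sqlite3
from typing import Iterable, Tuple


def reset_with_swap(
    conn: sqlite3.Connection, table: str, rows: Iterable[Tuple[int, str]]
) -> None:
    """Rebuild `table` in a temporary table, then swap it in.

    Assumes `conn` was opened with isolation_level=None so that the
    explicit BEGIN/COMMIT below controls the transaction.
    """
    tmp = f"{table}__tmp"  # hypothetical naming convention

    # Slow part: load the fresh data while the old table stays queryable.
    conn.execute(f"DROP TABLE IF EXISTS {tmp}")
    conn.execute(f"CREATE TABLE {tmp} (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany(f"INSERT INTO {tmp} (id, value) VALUES (?, ?)", rows)

    # Fast part: replace the old table with the new one in one transaction.
    conn.execute("BEGIN")
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"ALTER TABLE {tmp} RENAME TO {table}")
    conn.execute("COMMIT")
```

The point of the design is that the long-running load happens against the temporary table, so the window in which the real table is missing shrinks to the final drop-and-rename step.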

validumitru · 2022-06-15T12:52:03Z

Upvote! Major pain point for us also! Each table is 100 GB and we have hundreds of them.

cgardens · 2022-06-29T23:45:29Z

Duplicate of #6912. Will be released shortly.

sherifnada added the type/enhancement, area/connectors, and area/platform labels May 21, 2021

ChristopheDuong mentioned this issue Jun 2, 2021

[Epic] Next Steps for DBT / normalization #2566

Closed

12 tasks

cgardens assigned ChristopheDuong Jul 21, 2021

This was referenced Nov 10, 2021

Handle change in schema at source #7809

Closed

Ability to not reset data when schema changes #6089

Closed

sherifnada unassigned ChristopheDuong Jan 7, 2022

sherifnada removed the area/connectors label Jan 7, 2022

jedwards-gc mentioned this issue Jan 28, 2022

🐛 Destination Postgres: Reset Doesn't Handle Source Schema Changes, e.g. doesn't really reset #9880

Closed

bleonard added autoteam team/compose team/platform-move labels Apr 26, 2022

andyjih mentioned this issue May 6, 2022

[EPIC] Per-Stream State #6912

Closed

marcosmarxm mentioned this issue May 17, 2022

[QUESTION] - Why is Airbyte forcing a data reset when adding a new stream? #12918

Closed

cgardens closed this as completed Jun 29, 2022

rx007 mentioned this issue Nov 30, 2023

[Snyk] Fix for 1 vulnerabilities rx007/airbyte#2199

Open
