Allow selecting previously unselected streams in a source without requiring a reset of the connection #3520

Closed
sherifnada opened this issue May 21, 2021 · 21 comments

@sherifnada
Contributor

sherifnada commented May 21, 2021

Tell us about the problem you're trying to solve

When a user selects some streams from a schema, we only retain the selected streams in the catalog we store on the backend. For example, if my SQL DB has tables 1-10 but I initially only replicate table 1 (say, to test out Airbyte, or because I only need more of the data later), then on the backend we store a catalog which contains only one stream, corresponding to the selected table.

If I later want to replicate more streams from my source, I have to refresh the schema.

This is problematic because refreshing the schema requires resetting the state for that connection, which means we end up re-replicating lots of data. This can be very costly if we're pulling data from a rate limited API or a DB with lots of data.

Describe the solution you’d like

There are a few options we can go with:

  1. 20/80 approach: Add a metadata field to each configured stream which denotes whether it was selected or not. This way we can store the full catalog at all times and never need to reset the connection. This seems like the simplest approach, with the least effort required. It would, however, require changes to the CDK and our connectors to make sure they respect this flag (today, the presence of a stream in the catalog passed to a connector implies that it is selected for syncing), although we can do this in a backwards-compatible way inside the worker until all connectors are migrated to this approach.
  2. Whole hog: refreshing schemas should not require resetting the connection state and re-syncing data -- this should only happen if there was a schema migration like a rename or something.
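
A minimal sketch of option 1, assuming a hypothetical boolean `selected` key on each configured stream. The key name and the dict layout are illustrative only, not Airbyte's actual catalog schema. The backend keeps every discovered stream, and a worker-side shim strips the unselected ones before handing the catalog to a connector that still treats presence as selection:

```python
from copy import deepcopy


def catalog_for_legacy_connector(stored_catalog: dict) -> dict:
    """Return a copy of the catalog containing only the selected streams.

    `selected` is a hypothetical flag. A missing flag is treated as selected,
    so catalogs written before the flag existed keep their current meaning.
    """
    filtered = deepcopy(stored_catalog)
    filtered["streams"] = [
        s for s in filtered["streams"] if s.get("selected", True)
    ]
    return filtered


# The stored catalog keeps every discovered stream, selected or not.
stored = {
    "streams": [
        {"stream": {"name": "table_1"}, "sync_mode": "incremental", "selected": True},
        {"stream": {"name": "table_2"}, "sync_mode": "incremental", "selected": False},
    ]
}

to_connector = catalog_for_legacy_connector(stored)
print([s["stream"]["name"] for s in to_connector["streams"]])
```

Selecting `table_2` later would then only flip its flag in the stored catalog. Nothing about `table_1` changes, so its state would not need a reset.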

Describe the alternative you’ve considered or used

Do nothing about this problem; re-sync data every time you want to update the schema selection.


@sherifnada added the labels type/enhancement (New feature or request), area/connectors (Connector related issues), and area/platform (issues related to the platform) on May 21, 2021
@fixico-abdel

upvote

@harshithmullapudi
Contributor

harshithmullapudi commented Jun 11, 2021

I didn't dig deep, but we could do it the way Singer does: keep all the fields in the catalog and use a selected field to tell which ones the user selected. I think this could also avoid having both source_catalog and singer_rendered_catalog.json. What do you say, @davinchia @sherifnada?
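
For reference, a rough sketch of reading a Singer-style catalog, where each stream carries a `metadata` list and the entry with an empty `breadcrumb` holds the stream-level `selected` flag. The stream names are made up, and treating a missing flag as "not selected" is a simplifying assumption of this sketch:

```python
def is_selected(stream: dict) -> bool:
    """True if the stream-level metadata entry (empty breadcrumb) is selected."""
    for entry in stream.get("metadata", []):
        if entry.get("breadcrumb") == []:
            return bool(entry.get("metadata", {}).get("selected", False))
    return False


# Every discovered stream stays in the catalog; only the flag differs.
catalog = {
    "streams": [
        {"tap_stream_id": "users", "metadata": [{"breadcrumb": [], "metadata": {"selected": True}}]},
        {"tap_stream_id": "orders", "metadata": [{"breadcrumb": [], "metadata": {}}]},
    ]
}

selected_ids = [s["tap_stream_id"] for s in catalog["streams"] if is_selected(s)]
print(selected_ids)
```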

@JCWahoo

JCWahoo commented Jun 14, 2021

upvote as well !

@tylerdelange

tylerdelange commented Jun 22, 2021

This functionality is going to be incredibly important to us as we have a single DB with 300 tables we are ingesting. The thought of having to re-import around 1B rows of data every time we have a schema change gives me shivers up my spine.

The ability of software to handle schema changes is why competitors like Fivetran and Stitch exist. Airbyte needs to handle (automagically) incoming schema changes from the source tables if it wants to compete. I am rooting for you and in your corner. I hope that this is something we can put on the roadmap soon.

@ChristopheDuong
Contributor

related to #4295?

@kyle-cheung

would be nice to even be able to select which tables you want to refresh

@sherifnada
Contributor Author

@kyle-cheung agreed - we'd love to ship this. No firm ETA yet, but it will probably ship late this quarter (September) or next quarter (Oct/Nov).

@bkrausz
Contributor

bkrausz commented Aug 19, 2021

Does anyone have pointers on how to hackily accomplish this today while we wait for the feature? It seems like we should be able to get the source schema for the new stream and insert it into the STANDARD_SYNC config, but I want to be cautious since an accidental reset would be unpleasant for us.

Similarly, a guide on how to jumpstart incremental syncs would be excellent as it would give confidence in changing configs being safe. For example, I have a backup snapshot of an old sync that took days to run in Snowflake. Being able to tell Airbyte to only sync details after that backup would be immensely helpful.

@kyle-cheung

kyle-cheung commented Sep 21, 2021

Sorry for the double comment, but this has been happening a lot over the past 2 weeks. We're adding more tables in production and also piping more tables into Snowflake. I keep running into purely human error when reselecting the tables after updating the source streams (missing some previously selected tables). A second problem is that I'm using Incremental + Deduped, so I actually lose the SCDs when resetting. That is totally unnecessary: I only need to bring in 1 additional table, but I'm forced to resync the entire DB.

@hoanghapham

hoanghapham commented Oct 15, 2021

Upvote this. I also think this is a much needed feature.

@aforlorncat1

Upvoted. This feature is critical for our use cases.

@grkhr
Contributor

grkhr commented Dec 7, 2021

Upvote. We have a large db and resyncing every time a new field is added to a small table is hell.

@thomasclavet

Upvote; very interesting feature IMO as well

@acolyer

acolyer commented Apr 26, 2022

Upvote, this is a major pain for us too.

@sherifnada
Contributor Author

@acolyer and all - good news! This is happening this quarter. Actively being worked on at the moment.

@kyle-cheung

@sherifnada this is awesome! Will this feature allow us to reset only a single table, say if the schema changed for only 1 table? Will there also be anything along the lines of auto schema change detection?

@andyjih
Contributor

andyjih commented May 6, 2022

@kyle-cheung That's correct! If the schema changed only for 1 table, then only that 1 table would be reset.
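
As a rough illustration of that behavior (not the actual implementation), the sketch below diffs the old and new schemas per stream and keeps sync state for every stream whose schema is unchanged. Modeling schemas and state as plain dicts keyed by stream name is an assumption made for the example:

```python
def streams_to_reset(old_schemas: dict, new_schemas: dict) -> set:
    """Names of streams present in both catalogs whose schema differs."""
    return {
        name
        for name, schema in new_schemas.items()
        if name in old_schemas and old_schemas[name] != schema
    }


def carry_over_state(state: dict, old_schemas: dict, new_schemas: dict) -> dict:
    """Keep per-stream state for unchanged streams; drop it for changed or removed ones."""
    reset = streams_to_reset(old_schemas, new_schemas)
    return {
        name: cursor
        for name, cursor in state.items()
        if name in new_schemas and name not in reset
    }


old = {"users": {"id": "integer"}, "orders": {"id": "integer"}}
new = {
    "users": {"id": "integer", "email": "string"},  # schema changed
    "orders": {"id": "integer"},                    # unchanged
    "payments": {"id": "integer"},                  # newly added stream
}
state = {"users": {"cursor": "2022-05-01"}, "orders": {"cursor": "2022-05-02"}}

print(streams_to_reset(old, new))
print(carry_over_state(state, old, new))
```

Here only `users` is flagged for reset; `orders` keeps its cursor, and the new `payments` stream simply starts with no state.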

I've connected this issue to the main Epic where we're tracking all of our work towards this feature, so when that work is done, we'll follow-up here.

Lastly, we are planning on supporting auto-schema change detection! It'll be the next project we tackle after this one. I'd love to get your thoughts on what you'd expect with auto-detecting schema changes!

@Soufraz

Soufraz commented Jun 14, 2022

Hi @andyjih. Will it be available for all connectors? Or only for Postgres, since that was the most reported issue mentioned here?

@kyle-cheung

@andyjih that's amazing, super looking forward to this change. For schema resets, I wonder if there's a way for Airbyte to create a temporary table and then swap the new table with the old, so that during the reset period there's no lost data or empty table. This is generally an issue whenever I reset my connection now: there's a lull of a few hours where the database is empty. My workaround is to create a separate dev schema and then swap the schemas once the reset completes.
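
A small sketch of the staging-and-swap idea, using an in-memory SQLite database as a stand-in for the warehouse. The table names are hypothetical, and this is not how Airbyte performs resets; it only shows the pattern of loading into a side table and renaming inside one transaction:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction below is controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'old row')")

# 1. Load the re-synced data into a staging table while `users` stays readable.
conn.execute("CREATE TABLE users_staging (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users_staging VALUES (?, ?)",
    [(1, "fresh row"), (2, "new row")],
)

# 2. Swap the tables in a single transaction so readers never see an empty table.
conn.execute("BEGIN")
conn.execute("ALTER TABLE users RENAME TO users_old")
conn.execute("ALTER TABLE users_staging RENAME TO users")
conn.execute("DROP TABLE users_old")
conn.execute("COMMIT")

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)
```

Whether the swap is atomic from a reader's point of view depends on the destination. The rename statements and transaction semantics differ between warehouses, so the SQL above should be read as SQLite-specific.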

@validumitru

Upvote! Major pain point for us also! Each table is 100 GB and we have hundreds of them.

@cgardens
Contributor

Duplicate of #6912. Will be released shortly.
