Allow selecting previously unselected streams in a source without requiring a reset of the connection #3520
Comments
upvote
I didn't dig deep, but we can use the way Singer does it: have all the fields and use
upvote as well!
This functionality is going to be incredibly important to us, as we have a single DB with 300 tables we are ingesting. The thought of having to re-import around 1B rows of data every time we have a schema change sends shivers up my spine. The ability of software to handle schema changes is why competitors like Fivetran and Stitch exist. Airbyte needs the ability to handle (automagically) incoming schema changes in the source tables if it wants to compete. I am rooting for you and in your corner. I hope that this is something we can put on the roadmap soon.
related to #4295?
would be nice to even be able to select which tables you want to refresh
@kyle-cheung agreed - we'd love to ship this. No firm ETA yet, but it will probably ship late this quarter (September) or next quarter (Oct/Nov)
Does anyone have pointers on how to hackily accomplish this today while we wait for the feature? It seems like we should be able to get the source schema for the new stream and insert it into the Similarly, a guide on how to jumpstart incremental syncs would be excellent, as it would give confidence that changing configs is safe. For example, I have a backup snapshot in Snowflake of an old sync that took days to run. Being able to tell Airbyte to only sync data after that backup would be immensely helpful.
Sorry for the double comment, but this has been happening a lot over the past 2 weeks. We're adding more tables in production and also piping more tables into Snowflake. I'm running into purely human error when reselecting the tables after updating the source streams (missing some previously selected tables). A second problem is that I'm using Incremental + Deduped, so I actually lose the SCDs when resetting, which is totally unnecessary: I only need to bring in 1 additional table but am forced to resync the entire DB
Upvote this. I also think this is a much needed feature.
Upvoted. This feature is critical for our use cases.
Upvote. We have a large db and resyncing every time a new field is added to a small table is hell.
Upvote; very interesting feature IMO as well
Upvote, this is a major pain for us too.
@acolyer and all - good news! This is happening this quarter. Actively being worked on at the moment.
@sherifnada this is awesome! Will this feature allow us to reset only a single table, say if the schema changed for only 1 table? Will there also be anything along the lines of auto schema change detection?
@kyle-cheung That's correct! If the schema changed only for 1 table, then only that 1 table would be reset. I've connected this issue to the main Epic where we're tracking all of our work towards this feature, so when that work is done, we'll follow up here. Lastly, we are planning on supporting auto-schema change detection! It'll be the next project we tackle after this one. I'd love to get your thoughts on what you'd expect with auto-detecting schema changes!
Hi @andyjih. Will it be available for all connectors? Or only for Postgres, since most of the issues reported here mention it?
@andyjih that's amazing, super looking forward to this change. For schema resets, I wonder if there's a way for Airbyte to create a temporary table and then swap the new table with the old, so that during the reset period there's no lost data or empty table. This is generally an issue whenever I reset my connection now: there's a lull of a few hours where the database is empty. My workaround is to create a separate dev schema and then swap the schemas once the reset completes
Upvote! Major pain point for us also! Each table has 100 GB and we have hundreds of them
Duplicate of #6912. Will be released shortly.
Tell us about the problem you're trying to solve
When a user selects some streams from a schema, we only retain the selected streams in the catalog we store on the backend. For example, if my SQL DB has tables 1-10 but I initially only replicate table 1 (say, to test out Airbyte, or because I only later need to replicate more data), then on the backend we store a catalog which contains only the 1 stream corresponding to the selected table.
If I later want to replicate more streams from my source, I have to refresh the schema.
This is problematic because refreshing the schema requires resetting the state for that connection, which means we end up re-replicating lots of data. This can be very costly if we're pulling data from a rate limited API or a DB with lots of data.
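To make the problem concrete, here is a minimal sketch in Python of the behaviour being asked for: adding newly selected streams to the stored catalog while leaving the saved state of existing streams alone. The dict shapes and the function names `add_streams` and `streams_needing_full_sync` are hypothetical and chosen only for illustration; they are not Airbyte's actual catalog or state formats, and this is not how the backend is implemented.

```python
from copy import deepcopy


def add_streams(stored_catalog, discovered_catalog, state, new_stream_names):
    """Return (catalog, state) with the requested streams added.

    Assumed shapes (hypothetical, for illustration only):
      catalog: {"streams": [{"name": str, "json_schema": dict}, ...]}
      state:   {stream_name: cursor_dict}, one entry per already-synced stream
    """
    existing = {s["name"] for s in stored_catalog["streams"]}
    discovered = {s["name"]: s for s in discovered_catalog["streams"]}

    catalog = deepcopy(stored_catalog)
    for name in new_stream_names:
        if name in existing:
            continue  # already selected; keep its catalog entry and state
        if name not in discovered:
            raise KeyError(f"stream {name!r} not found in the source schema")
        catalog["streams"].append(deepcopy(discovered[name]))

    # State is copied unchanged: existing streams keep their cursors,
    # and new streams simply have no entry yet.
    return catalog, deepcopy(state)


def streams_needing_full_sync(catalog, state):
    """Streams with no saved state are the only ones that must start from scratch."""
    return [s["name"] for s in catalog["streams"] if s["name"] not in state]
```

With tables 1-10 in the source and only `table_1` synced so far, adding `table_2` this way would leave `table_1`'s cursor untouched, and `streams_needing_full_sync` would report only `table_2`.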
Describe the solution you’d like
There are a few options we can go with:
Describe the alternative you’ve considered or used
Do nothing about this problem; re-sync data every time you want to update the schema selection.