-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
source shopify: v0.1.26 stream rename causing child stream to run before parent #10595
Comments
Let me know if logs from one of the refunds-first syncs will be helpful at all, this seems to cause it to sift through a far larger number of orders for refunds than I'd expect it to find assuming that its starting parent state is just the state of the orders stream as of the previous sync (3k+ when the previous sync only pulled/updated 100 orders). |
@dsnam |
@bazarnov Still seems to sync in the same order, although if I should expect things to not always be in order in the logs then I will need to test with one of our real syncs to see if the performance is better with this version. Here's the source part of the logs from a test dataset sync:
|
There is no issue with reading order, as far as |
If it's not related to the stream order do you have any idea what could have caused the sudden performance degradation across all of our shopify syncs? I have 6 examples where sync times went from minutes to hours for roughly the same number of records that are synced daily, with one taking over 18 hours. None of these were initial syncs. The only commonality I was able to find between them was that we had manually edited the connections in the web UI to set some flags in our dbt transforms and then the stream order change was the only notable difference in the logs. Using the API to change the sync catalog order back returned our syncs to the expected runtimes. Here's another example of the observed change, with the sync on top being after the connection was edited in the web ui. |
Didn't find any issues with the sync so far, however restored the natural order of the sync for all |
What volume of orders and refunds does your test shop have? I similarly see basically no effect with our test shop but our test shop only has a few hundred orders and a handful of refunds. The issues start to appear when the refunds stream starts making calls for thousands of orders despite the orders stream only syncing 100-200. Also if I am reading that PR right you're fixing this by reverting to the old stream names? I think that will break existing connections created since 1.26, the reverse situation was discussed on another issue earlier: #9994 We went ahead and updated stream names in the sync catalog with the api for our existing connections since it seemed like that change was staying around, this will require doing it again and I don't see any mention of it in the changelog. |
There is Suggestion: sync the
That is the thing, there is no reason of changing that at this point, I'm aware of the reverting issues afterwards, so will revert these PR changes back as they are now, if you will. |
The refunds stream uses the state of the orders stream to know what orders may have new refunds data to fetch, right? If that's the case can you help me understand what is happening in those logs then? In |
May I ask you, have you updated something else related to airbyte so far? Such behavior is not expected at all, this looks like the connection had lost it's stream states somehow, and therefore the orders_refunds stream started to run the full-refresh. Another thing to check is the |
The start date hasn't changed, I'll try to trace what all we did with these connections to give more context: The only change to airbyte itself that we made was downgrading to v0.35.11 from v0.35.15 after we ran into some sync issues that we couldn't quite figure out, so we returned to a version we had no issues with previously as a precaution. As far as the connector goes I attempted to bump the version to 1.30 on feb 1st or so but ran into the config issues I reported in #9994. I downgraded to 1.26, discovered the stream name changes, and then downgraded again to 1.22 while I sorted out renaming the streams in the connection configs over the API. Once that was finished I upgraded to 1.26 and left it there since the fix for the creds config wasn't in yet. On feb 17 we needed to set some flags in our custom transforms to fully rebuild some models, so we did that via the airbyte UI and that is when the sync performance tanked on all of our connections and I observed the refunds stream being read before orders in the logs (note the 2nd sync on 2/17 in the screenshot below that takes 18h, it is the one corresponding to the |
I see, based on that, after updating the connector and airbyte itself, somehow the state was lost, therefore the I'm a bit lost here, if you've considered to maintain the fix by yourself and it's suitable for you to have it as it is, should we proceed further with this issue? Because the normal resolution is to rename the |
Yeah it is strange, the behavior we see when refunds runs first seems to suggest that it incurs a full read of its parent stream every time. I see that this connector uses an def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, Any]]]:
"""
Reading the parent stream for slices with structure:
EXAMPLE: for given nested_record as `id` of Orders,
Output: [ {slice_key: 123}, {slice_key: 456}, ..., {slice_key: 999} ]
"""
parent_stream = self.parent_stream_class(self.config)
parent_stream_state = stream_state_cache.cached_state.get(parent_stream.name)
for record in parent_stream.read_records(stream_state=parent_stream_state, **kwargs):
# to limit the number of API Calls and reduce the time of data fetch,
# we can pull the ready data for child_substream, if nested data is present,
# and corresponds to the data of child_substream we need.
if self.nested_substream:
if record.get(self.nested_substream):
yield {self.slice_key: record[self.nested_record]}
else:
yield {self.slice_key: record[self.nested_record]} If you are not seeing this happen at all though then I am probably just misunderstanding how the state is passed around. To try to narrow this down I made a fresh connection for one of the shops that I know has a decent volume of refunds and I left the current default stream order intact (refunds first). Once it finishes the initial sync I'll wait a day or so to build up some new data to pull and test to see how it performs, hopefully that should isolate us from any weird stuff that may have happened with the state for the other connections with the upgrades/downgrades. |
This PR should resolve the standalone sync of the SubStream without parent stream, another words you can now select the single stream and it will be properly synced without parent stream to be synced. |
Environment
Current Behavior
It seems like streams are read in the order they appear in the sync catalog. The web ui sorts them and
order_refunds
now appears beforeorders
after the stream rename, so if you create a new connection or save an edit to one through the web UI it causes that child stream to run before its parent.Here is an example of the performance change we saw: the sync at the bottom was running refunds before orders, and the sync at the top is running orders before refunds after I fixed the order with a
connections/update
call.Slack thread:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1645646538929519
Expected Behavior
Connector should probably have some way to specify stream order so that the catalog order doesn't matter.
Steps to Reproduce
The text was updated successfully, but these errors were encountered: