source shopify: v0.1.26 stream rename causing child stream to run before parent #10595

Closed
Tracked by #10281
dsnam opened this issue Feb 23, 2022 · 15 comments · Fixed by #10915

dsnam commented Feb 23, 2022

Environment

  • Airbyte version: 0.35.11 & 0.35.12
  • OS Version / Instance: Ubuntu on EC2
  • Deployment: docker
  • Source Connector and version: shopify 0.1.26
  • Severity: High
  • Step where error happened: setting up a new connection through the UI or editing a connection through the UI; create/update over the API is not affected

Current Behavior

Streams appear to be read in the order they appear in the sync catalog. The web UI sorts them, and order_refunds now appears before orders after the stream rename, so creating a new connection or saving an edit to one through the web UI causes that child stream to run before its parent.

Here is an example of the performance change we saw: the sync at the bottom was running refunds before orders, and the sync at the top is running orders before refunds after I fixed the order with a connections/update call.
[screenshot: sync durations before and after reordering the streams]

Slack thread:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1645646538929519

Expected Behavior

The connector should probably have some way to specify stream order so that the catalog order doesn't matter.
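
In the meantime, the workaround that restored normal runtimes for us was reordering the streams in the sync catalog over the API. A rough sketch of that idea (hypothetical script, assuming a local Airbyte API at http://localhost:8000; depending on the Airbyte version, connections/update may require more fields echoed back than shown here):

    # Rough workaround sketch: pull the connection, move parent streams to the front of
    # the configured catalog, and push it back via connections/update.
    import requests

    API = "http://localhost:8000/api/v1"
    CONNECTION_ID = "<connection-uuid>"  # placeholder
    PARENTS = {"orders"}                 # add other parent streams if needed

    conn = requests.post(f"{API}/connections/get",
                         json={"connectionId": CONNECTION_ID}).json()

    # Stable sort: parent streams move to the front, everything else keeps its order.
    conn["syncCatalog"]["streams"].sort(
        key=lambda s: 0 if s["stream"]["name"] in PARENTS else 1
    )

    payload = {
        "connectionId": CONNECTION_ID,
        "syncCatalog": conn["syncCatalog"],
        # Echo back the existing settings so the update doesn't change anything else.
        "schedule": conn.get("schedule"),
        "status": conn.get("status"),
        "namespaceDefinition": conn.get("namespaceDefinition"),
        "namespaceFormat": conn.get("namespaceFormat"),
        "prefix": conn.get("prefix"),
        "operationIds": conn.get("operationIds"),
    }
    requests.post(f"{API}/connections/update", json=payload).raise_for_status()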

Steps to Reproduce

  1. Create a new Shopify connection, or use the web UI to edit an existing one that was created with orders ahead of refunds in the catalog.
  2. Run a sync; you should see the refunds stream start before its parent.

dsnam commented Feb 23, 2022

Let me know if logs from one of the refunds-first syncs would be helpful. This seems to cause it to sift through a far larger number of orders for refunds than I'd expect it to find, assuming its starting parent state is just the state of the orders stream as of the previous sync (3k+ when the previous sync only pulled/updated 100 orders).

@misteryeo moved this to Prioritized for scoping in GL Roadmap Feb 23, 2022
@oustynova removed the status in GL Roadmap Feb 28, 2022
@oustynova moved this to Scoping in progress in GL Roadmap Feb 28, 2022
@oustynova moved this from Scoping in progress to Prioritized for scoping in GL Roadmap Feb 28, 2022

bazarnov commented Mar 1, 2022

@dsnam
Could you please update the connector to the latest version, 0.1.33, and try again?

@bazarnov self-assigned this Mar 1, 2022

dsnam commented Mar 1, 2022

@bazarnov It still seems to sync in the same order, although if the log order isn't expected to reflect the actual read order, then I'll need to test with one of our real syncs to see whether performance is better with this version. Here's the source part of the logs from a test dataset sync:

2022-03-01 14:52:56 destination > 2022-03-01 14:52:56 WARN c.n.s.JsonMetaSchema(newValidator):338 - {} - Unknown keyword examples - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
2022-03-01 14:52:56 destination > 2022-03-01 14:52:56 WARN c.n.s.JsonMetaSchema(newValidator):338 - {} - Unknown keyword airbyte_secret - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
2022-03-01 14:52:56 destination > 2022-03-01 14:52:56 INFO i.a.i.d.j.c.SwitchingDestination(getConsumer):65 - {} - Using destination type: COPY_S3
2022-03-01 14:52:57 destination > 2022-03-01 14:52:57 INFO i.a.i.d.b.BufferedStreamConsumer(startTracked):125 - {} - class io.airbyte.integrations.destination.buffered_stream_consumer.BufferedStreamConsumer started.
2022-03-01 14:53:00 source > Reading order_refunds for order_id: 4637860298995
2022-03-01 14:53:01 source > Read 1 records from order_refunds stream
2022-03-01 14:53:01 source > Finished syncing order_refunds
2022-03-01 14:53:01 source > SourceShopify runtimes:
Syncing stream order_refunds 0:00:05.282927
2022-03-01 14:53:01 source > Syncing stream: orders 
2022-03-01 14:53:03 source > Read 144 records from orders stream
2022-03-01 14:53:03 source > Finished syncing orders
2022-03-01 14:53:03 source > SourceShopify runtimes:
Syncing stream order_refunds 0:00:05.282927
Syncing stream orders 0:00:01.569425
2022-03-01 14:53:03 source > Finished syncing SourceShopify


bazarnov commented Mar 2, 2022

There is no issue with the read order, since order_refunds reads Orders first to get the actual list of order_ids, then fetches the refunds based on that. Orders is a separate stream. I had no issues pulling the data.
The initial sync (the very first sync after the connector is set up) will take a while, but subsequent syncs will be faster, since we use the state of the parent stream, in this case Orders.

@bazarnov moved this from Prioritized for scoping to Scoping Finished in GL Roadmap Mar 2, 2022

dsnam commented Mar 2, 2022

If it's not related to the stream order, do you have any idea what could have caused the sudden performance degradation across all of our Shopify syncs? I have 6 examples where sync times went from minutes to hours for roughly the same number of records synced daily, with one taking over 18 hours. None of these were initial syncs. The only commonality I could find between them is that we had manually edited the connections in the web UI to set some flags in our dbt transforms, and the stream order change was the only notable difference in the logs. Using the API to change the sync catalog order back returned our syncs to the expected runtimes. Here's another example of the observed change, with the sync on top being after the connection was edited in the web UI.

[screenshot: sync durations before and after the web UI edit]

normal_sync_logs.txt

slow_sync_logs.txt


bazarnov commented Mar 3, 2022

I didn't find any issues with the sync so far; however, I restored the natural sync order for all Orders-dependent streams, like order_refunds and order_risks, within this PR.
Please wait for it to be merged, then try again with a completely new connection (just to have a clean test).

@bazarnov moved this from Scoping Finished to Implementation in progress in GL Roadmap Mar 3, 2022

dsnam commented Mar 3, 2022

What volume of orders and refunds does your test shop have? I see basically no effect with our test shop as well, but it only has a few hundred orders and a handful of refunds. The issues start to appear when the refunds stream starts making calls for thousands of orders despite the orders stream only syncing 100-200.

Also, if I am reading that PR right, you're fixing this by reverting to the old stream names? I think that will break existing connections created since 0.1.26; the reverse situation was discussed in an earlier issue: #9994

We went ahead and updated the stream names in the sync catalog via the API for our existing connections, since it seemed like that change was staying around. This will require doing it again, and I don't see any mention of it in the changelog.


bazarnov commented Mar 3, 2022

> What volume of orders and refunds does your test shop have? I see basically no effect with our test shop as well, but it only has a few hundred orders and a handful of refunds. The issues start to appear when the refunds stream starts making calls for thousands of orders despite the orders stream only syncing 100-200.

There is refund information for each order inside the Orders stream itself; have you checked whether that data suits your needs? If so, you could sync only the Orders table. It contains basically the same information as order_refunds; the difference is the number of fields in the latter compared to the Orders table. Check the schemas to see which fields are in each stream.
The reason the order_refunds sync takes a while is that the Shopify API accepts one request at a time per order_id, which adds up for a big list of orders. I believe this is the issue you have run into.
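
To illustrate the cost (a simplified sketch, not the connector's actual code; the shop domain, token handling and API version are placeholders), the per-order pattern looks roughly like this, so a big list of order ids means an equally big number of requests regardless of how few refunds come back:

    # Simplified illustration: the Shopify REST refunds endpoint is fetched once per
    # order id, so the number of API calls grows with the size of the parent order list.
    import requests

    def fetch_refunds(shop: str, token: str, order_ids):
        session = requests.Session()
        session.headers["X-Shopify-Access-Token"] = token
        for order_id in order_ids:  # one request per order_id
            url = f"https://{shop}.myshopify.com/admin/api/2022-01/orders/{order_id}/refunds.json"
            resp = session.get(url)
            resp.raise_for_status()
            yield from resp.json().get("refunds", [])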

Suggestion: sync only the Orders stream, get the normalised output (using the base normalisation option with Postgres or any other DB), and check the refunds information there; you can probably use that data the same way as order_refunds.

> Also, if I am reading that PR right, you're fixing this by reverting to the old stream names? I think that will break existing connections created since 0.1.26; the reverse situation was discussed in an earlier issue: #9994

> We went ahead and updated the stream names in the sync catalog via the API for our existing connections, since it seemed like that change was staying around. This will require doing it again, and I don't see any mention of it in the changelog.

That is the thing: there is no reason to change that at this point. I'm aware of the issues reverting would cause afterwards, so I will revert those PR changes back to how things are now, if you will.


dsnam commented Mar 3, 2022

> The reason the order_refunds sync takes a while is that the Shopify API accepts one request at a time per order_id, which adds up for a big list of orders. I believe this is the issue you have run into.

The refunds stream uses the state of the orders stream to know which orders may have new refunds data to fetch, right? If that's the case, can you help me understand what is happening in those logs? In normal_sync_logs the orders stream runs first and reads 1217 records; the refunds stream then runs, makes 63 calls, and reads 49 records. In slow_sync_logs (which ran the following day) the refunds stream runs first and makes 58326 calls to read 78 records, then the orders stream runs and reads 1700 records. I know making that many calls will take a while, but it seems like unexpected behavior to make that many at all. I have 6 Shopify connections (to different shops) that all had this happen at the same time.


bazarnov commented Mar 3, 2022

May I ask, have you updated anything else related to Airbyte recently?

Such behavior is not expected at all; it looks like the connection somehow lost its stream states, and therefore the order_refunds stream started to run a full refresh.

Another thing to check is the replication start date: perhaps it was changed, which would cause many more API calls than expected after the first sync, along with the existing stream state being flushed. If so, we can run another clean test, making sure those aspects stay constant.


dsnam commented Mar 3, 2022

The start date hasn't changed. I'll try to trace everything we did with these connections to give more context:

The only change we made to Airbyte itself was downgrading to v0.35.11 from v0.35.15 after we ran into some sync issues that we couldn't quite figure out, so we returned to a version we'd had no issues with previously as a precaution.

As far as the connector goes, I attempted to bump the version to 0.1.30 around Feb 1 but ran into the config issues I reported in #9994. I downgraded to 0.1.26, discovered the stream name changes, and then downgraded again to 0.1.22 while I sorted out renaming the streams in the connection configs over the API. Once that was finished I upgraded to 0.1.26 and left it there, since the fix for the creds config wasn't in yet. On Feb 17 we needed to set some flags in our custom transforms to fully rebuild some models, so we did that via the Airbyte UI, and that is when sync performance tanked on all of our connections and I observed the refunds stream being read before orders in the logs (note the 2nd sync on 2/17 in the screenshot below, the one taking 18h; it corresponds to the slow_sync_logs file above). The sync at the top of this screenshot is one I ran manually after using the API to swap the order of the streams in the sync catalog back to having orders ahead of refunds. Since then it has run normally, with about the same runtime as the sync at the top of the screenshot.

[screenshot: connection sync history showing the 18h sync on 2/17 and the normal runtimes after reordering]


bazarnov commented Mar 4, 2022

I see. Based on that, after updating the connector and Airbyte itself, the state was somehow lost, so the order_refunds stream started to sync using full refresh to rebuild its state, without the parent state of the orders stream.

I'm a bit lost here: if you're willing to maintain the fix yourself and it's acceptable to leave things as they are, should we proceed further with this issue? The straightforward resolution is to rename the orders-dependent streams back to their previous names, to restore the natural order of operations, but I'm aware of the situation with resetting the connections. WDYT?


dsnam commented Mar 4, 2022

Yeah, it is strange. The behavior we see when refunds runs first suggests that it incurs a full read of its parent stream every time. I see that this connector uses an EagerlyCachedStreamState util that is populated every time request_params is called. Is there also a place where it gets populated with the initial connector state that gets passed in? Is it possible that stream_state_cache has no entry for orders when refunds runs first? I'm not super familiar with all the CDK code that is being overridden/also running here, so there may be a place where this is taken care of, but I couldn't find where this cache, which is used to get the parent stream state in the method below, gets initialized as anything other than empty.

    def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, Any]]]:
        """
        Reading the parent stream for slices with structure:
        EXAMPLE: for given nested_record as `id` of Orders,

        Output: [ {slice_key: 123}, {slice_key: 456}, ..., {slice_key: 999} ]
        """
        parent_stream = self.parent_stream_class(self.config)
        parent_stream_state = stream_state_cache.cached_state.get(parent_stream.name)
        for record in parent_stream.read_records(stream_state=parent_stream_state, **kwargs):
            # to limit the number of API Calls and reduce the time of data fetch,
            # we can pull the ready data for child_substream, if nested data is present,
            # and corresponds to the data of child_substream we need.
            if self.nested_substream:
                if record.get(self.nested_substream):
                    yield {self.slice_key: record[self.nested_record]}
            else:
                yield {self.slice_key: record[self.nested_record]}
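
For clarity, the kind of initialization I'm asking about would look roughly like this (a hypothetical sketch; only stream_state_cache.cached_state comes from the util above, and the seeding function itself is made up):

    # Hypothetical sketch: seed the shared cache from the incoming connector state so a
    # child stream that happens to run first can still slice its parent incrementally.
    from typing import Any, Mapping

    def seed_cache_from_connector_state(stream_state_cache, state: Mapping[str, Mapping[str, Any]]) -> None:
        # state is keyed by stream name, e.g. {"orders": {"updated_at": "..."}, ...}
        for stream_name, stream_state in state.items():
            stream_state_cache.cached_state.setdefault(stream_name, stream_state)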

If you're not seeing this happen at all, though, then I'm probably just misunderstanding how the state is passed around. To narrow this down I made a fresh connection for one of the shops that I know has a decent volume of refunds, and I left the current default stream order intact (refunds first). Once it finishes the initial sync I'll wait a day or so to build up some new data to pull, then test to see how it performs; hopefully that should isolate us from any weird things that may have happened with the state of the other connections through the upgrades/downgrades.


bazarnov commented Mar 7, 2022

This PR should resolve the standalone sync of a SubStream without its parent stream; in other words, you can now select a single child stream and it will be properly synced without the parent stream needing to be synced.

@bazarnov moved this from Implementation in progress to In review (internal) in GL Roadmap Mar 7, 2022

dsnam commented Mar 7, 2022

Thanks, I will test it out once it's available. FWIW, I did see the same behavior from that fresh test connection: an incremental sync of ~24h of data took the same amount of time as the initial sync, so it seems like the cache change in that PR will be the fix.

[screenshot: test connection sync history showing initial and incremental runtimes]

@bazarnov moved this from In review (internal) to In review (Airbyte) in GL Roadmap Mar 9, 2022
@bazarnov moved this from In review (Airbyte) to Done in GL Roadmap Mar 9, 2022