Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚨 🚨 ✨ Source Tik Tok Marketing: Migration to Low-Code #38316

Merged

Conversation

darynaishchenko
Copy link
Collaborator

@darynaishchenko darynaishchenko commented May 17, 2024

What

resolved: https://github.com/airbytehq/airbyte-internal-issues/issues/7824

How

Migrated source to use low-code cdk instead of python cdk.
Regression tests are described here: #38316 (comment)
Main changes:

  • State: Previously all incremental streams used incorrect state without partition. On low-code cdk all incremental streams use per partition state.
  • Lifetime reports: Previously implementation used lifetime=true as request param, which is deprecated on API v1.3. Now lifetime reports use query_lifetime=true, with this param start_date and end_date should not be provided. Exception: advertiser_lifetime_report: API v1.3 doesn't allow query_lifetime=true` with advertiser reports, so this stream was implemented exactly as in py version with start_date and end_date query params(range >=365d)
  • Advertiser Ids stream: schema was changed to use advertiser_id as type of stream to be up to date with API docs.
  • Discover for configs with granularity: In py implementation were missing streams(campaigns_audience_reports, ad_group_audience_reports_by_platform, ad_group_audience_reports_by_country, ads_audience_reports_by_country, advertisers_audience_reports_by_country, campaigns_audience_reports_by_platform, advertisers_audience_reports_by_platform, ads_audience_reports_by_platform, ads_audience_reports_by_province), which users with provided granularity actually can use but streams method didn't return them. For configs with granularity source removes granularity from stream name as it was previously named.

Review guide

User Impact

Breaking change users will need to follow migration guide for affected streams.

Can this PR be safely reverted and rolled back?

Breaking change due to changes in schema and state format.

  • YES 💚
  • NO ❌

@darynaishchenko darynaishchenko self-assigned this May 17, 2024
Copy link

vercel bot commented May 17, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 1, 2024 0:37am

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label May 23, 2024
@darynaishchenko
Copy link
Collaborator Author

Regression test results:
test_catalog_are_the_same [failed] – updated advertiser_id: integer - string. (breaking change described in the docs)

TestDataIntegrity.test_record_schema_match_without_state [failed] - Value of root['properties']['budget']['type'] changed from "integer" to "number". Value of root['properties']['roas_bid']['type'] changed from "integer" to "number". (same error for all fields with type number in schema but actual type is integer).
Both versions have type number, but default type transformer was added in low code version so 0 value is changed to 0.0. For db with transformations(e.g. BigQuery) it’s not a breaking change as destination already converts this data values to a number.
Streams are in a list of breaking changes affected by state changes, so users will do refresh&clear anyway.
This change from 0 to 0.0 occured due to added Default schema normalization in low code to be compatible with stream schemas that was added for api v1.2.0 and in v1.3.0 some fields have new type. For example *_id was changed from integer to string and stream schemas for v.1.2.0 use integer as type.

TestDataIntegrity.test_all_records_are_the_same_without_state [failed] - Same differences with integer/number as above.

Read URLs: some requests in py version due to HttpAvailabilityStrategy

PS: Reviewer can ask me to send the full html report in slack dm. Regression tests were running locally as I needed to change start date in config and chose testing without state due to breaking changes.

Copy link
Contributor

@brianjlai brianjlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the manifest and the schemas overall look good and given the size of the manifest and number of streams, I am going to trust that we've carefully run live tests to verify that the changes are working and the breaking changes are expected. I didn't see anything glaring.

I did however have some questions to clarify my understanding for the custom components and some suggestions on the code itself. Especially around why exactly we need for two types of advertiser id (+ids) partition routers.

Copy link
Contributor

@brianjlai brianjlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note on naming and Just one last discussion point on the need for the MultipleAdvertiserIdsPartitionRouter. given it's only used on one stream, depending on how drastically it reduces requests, I think we might want to get rid of it even if that deviates from the original behavior.

What I want to figure out is how much the separate partition router benefits us. Basically, does combining the advertiser ids into a single slice results in them all getting bundled up and we only have to go through a single full iteration? Versus, if we separate them into individual slices and that means we have to perform one full iteration per advertiser_id slice. For example, if we have 5 advertiser_ids, then we end up making 5x the requests. If thats the case we can leave as is.

After we clear that up this is good to go. nice work!

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differences identified

I tried to run regression testing using ad_groups and there is a lot of red. Should we be worried about that? The output didn't allow me to validate if there were actually errors so for the rest, I tried to check a bit manually and noted my observations below. Note that I haven't checked a couple of things like if format of the states were compatible because we require a reset anyway. Also, I haven't done all the streams, only those documented below.

For all streams I've validated

  • [Accepted] The first request isn't performed, probably because of the availability strategy so I'm fine with this
  • [Accepted] state for full refresh now supports RFR as it emits a state like {'__ab_no_cursor_state_message': True}

ads_reports_daily

  • [To review] The records are different. The most recent version has record["metrics"]["cost_per_secondary_goal_result"] == None while the new one has a value of -. There might be other differences

ad_groups_reports_daily

  • [To review] page_size query param is not passed anymore EDIT: This was working
  • [To review] like ads_reports_daily, the records are different. Based on this re-occurrence, I expect this change to have been applied to all the records exposing metrics

audiences

OK

Conclusion

I would like us to understand the report for the regression tests. Once we can explain the differences, I'll approve this PR

@darynaishchenko
Copy link
Collaborator Author

@maxi297
Please take a look at my regression tests results here #38316 (comment)

The most recent version has record["metrics"]["cost_per_secondary_goal_result"] == None while the new one has a value of -. There might be other differences

This was fixed by adding DefaultTransformation in low code. But I double check. This is the reason that we have differences in records for integer values that described as number in stream schema.

Also double check page size failure.
Thanks.

@darynaishchenko
Copy link
Collaborator Author

@maxi297

  • Fixed record["metrics"]["cost_per_secondary_goal_result"] == None while the new one has a value of - for all report based streams by adding TransformEmptyMetrics custom component. Thanks for pointing it.
  • page_size query param is not passed anymore, I guess you just didn't see it in requests because it is now in the end of request(due to order of params in the definition of requester). I double checked it. All request params are identical with py version, except order of params, but it doesn't affect a response.
    Request looks like:
    https://business-api.tiktok.com/open_api/v1.3/report/integrated/get/?service_type=AUCTION&report_type=BASIC&data_level=AUCTION_ADGROUP&dimensions=[dimensions]&metrics=[metrics]&start_date=2022-09-30&end_date=2022-10-29&page_size=1000&advertiser_id=7001035076276387841.

@darynaishchenko darynaishchenko requested a review from maxi297 June 17, 2024 10:10
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, there were no issue with the page size.

LGTM!

@darynaishchenko darynaishchenko merged commit 1b85b28 into master Jul 1, 2024
31 checks passed
@darynaishchenko darynaishchenko deleted the daryna/source-tik-tok-marketing/migartion-to-low-code branch July 1, 2024 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/tiktok-marketing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants