
Use PyAirbyte for SQL source and Blob destination performance testing? #561

Open
ramandatascientist opened this issue Dec 16, 2024 · 2 comments

Comments

@ramandatascientist

Hello

We are using SQL as a source and Blob as a destination. We have a limited number of records in our dev SQL source, and we want to run performance testing to understand how Airbyte will perform with 1M, 10M, or 100M records in a sync.

Is there a way to run performance tests on Airbyte using PyAirbyte with a fake-data library, or some other way?

@aaronsteers
Contributor

aaronsteers commented Dec 17, 2024

@ramandatascientist - Thanks for raising this. The benchmark CLI command may be of help to you. I'm about to merge this PR to our auto-generated docs, adding a new docs page for the cli module.

Other things which should be helpful:

  1. After each sync, a performance log path is printed, and that file will have a JSONL line for every sync run, including records/second, bytes/second, and many other helpful performance stats.
  2. We also have the convenience functions airbyte.sources.get_benchmark_source() and airbyte.destinations.get_noop_destination(), which you can use in your scripts if you want more control than the pyab benchmark CLI command offers, while still leveraging existing patterns.
  3. However you run the sync operations, stats will be appended to the same performance log file.

Does this help? Let us know how it goes!
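As a small illustration of point 1, the JSONL performance log can be parsed with just the standard library. The field names below (records_per_second, bytes_per_second) are illustrative assumptions; check the actual log file PyAirbyte prints the path to after a sync:

```python
import json
from pathlib import Path

# Write two hypothetical performance-log lines; in practice this file
# is created by PyAirbyte and each sync appends one JSON object per line.
log_path = Path("perf-log.jsonl")
log_path.write_text(
    '{"records_per_second": 1250.0, "bytes_per_second": 98304}\n'
    '{"records_per_second": 1410.5, "bytes_per_second": 110592}\n'
)

# JSONL means one JSON document per line, so parse line by line.
rates = []
for line in log_path.read_text().splitlines():
    stats = json.loads(line)
    rates.append(stats["records_per_second"])

print(f"best throughput: {max(rates)} records/second")
```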

@ramandatascientist
Author

ramandatascientist commented Dec 17, 2024

Hi @aaronsteers, thank you for your response. Is there a way to mock the datasets against Azure SQL? We are running Airbyte against dev Azure SQL databases that have a limited number of records, so I am wondering if there is a way to fake/mock the datasets and run performance tests against them.
