
Provide a configurable batch size to reduce memory usage #36

Merged · 2 commits into singer-io:master · Aug 21, 2019

Conversation

drewbanin (Contributor) commented:

fixes #25

Looks like there are a couple of reports in #25 of Stitch running out of memory while running this tap, causing it to fail with a -9 exit code. This PR adds a configurable batch_size parameter that controls the number of records fetched per request for each stream. When the batch size is set to a smaller value like 100 or 1,000, this tap uses considerably less memory than it does with the default of 2,500 records per batch.
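For context, a Singer tap reads its settings from a JSON config file, so the new knob is just one extra key. A minimal sketch (the credential field names here are illustrative, not the tap's exact config schema; only batch_size is the setting this PR adds):

```json
{
  "client_id": "<marketing-cloud-client-id>",
  "client_secret": "<marketing-cloud-client-secret>",
  "start_date": "2019-01-01T00:00:00Z",
  "batch_size": 500
}
```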

I tried really hard to make this work natively with the FuelSDK, but I encountered a couple of problems:

  1. The getMoreResults method in the FuelSDK ignores the BatchSize option provided on the cursor object.
  2. There are serious logic errors in the FuelSDK codebase, like one place that incorrectly checks the m_filter argument instead of m_options.
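The second problem is a classic wrong-variable guard. A simplified sketch of the bug class described above (the function and dict shapes are illustrative, not the FuelSDK source; only the m_filter/m_options names follow its convention):

```python
def build_request(m_filter, m_options):
    """Buggy version: options are silently dropped when no filter is set."""
    request = {}
    if m_filter is not None:
        request["Filter"] = m_filter
    # BUG: tests m_filter instead of m_options, so Options (e.g. BatchSize)
    # are only attached when a filter happens to be present.
    if m_filter is not None and m_options is not None:
        request["Options"] = m_options
    return request


def build_request_fixed(m_filter, m_options):
    """Fixed version: each argument is checked independently."""
    request = {}
    if m_filter is not None:
        request["Filter"] = m_filter
    if m_options is not None:  # fixed: test the options argument itself
        request["Options"] = m_options
    return request
```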

To make this work, I instead pulled the getMoreResults implementation out of the FuelSDK module and into this tap so that I could change the definition of ET_Continue. In the ET_Continue constructor, I added this line of code:

        ws_continueRequest.Options.BatchSize = batch_size

All told, this correctly sets the batch size for requests to the Salesforce Marketing Cloud / Exact Target API. I verified this with a new log line that reports the number of rows synced in each batch.
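The approach above can be sketched as follows. This is a minimal stand-in, not the PR's actual code: the ContinueRequest/Options classes imitate the shape of the FuelSDK's SOAP objects, and only the Options.BatchSize assignment mirrors the line quoted above:

```python
class Options:
    """Stand-in for the SOAP Options object on a retrieve request."""
    def __init__(self):
        self.BatchSize = None


class ContinueRequest:
    """Stand-in for the SOAP continue-request object."""
    def __init__(self):
        self.Options = Options()
        self.ContinueRequest = None  # request ID of the batch to resume


class ET_Continue:
    """Vendored getMoreResults-style continuation that honors batch_size."""
    def __init__(self, auth_stub, request_id, batch_size):
        ws_continueRequest = ContinueRequest()
        ws_continueRequest.ContinueRequest = request_id
        # The key change: propagate the configured batch size onto the
        # continue request so the API returns at most batch_size rows.
        ws_continueRequest.Options.BatchSize = batch_size
        self.request = ws_continueRequest
```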

As-is, this PR won't actually fix the memory issues we're seeing in Stitch. I preserved all of the old defaults, so this tap should work exactly as it did prior to this PR. My hope is that Stitch can either expose the batch_size config in the UI, or set it to some lower value (like 1,000 or 100) as required on their servers.

Lastly, I did some quick profiling to figure out what impact this change has on the tap's memory usage. The default batch size (2,500 rows) peaks at around 1 GB of memory, which I assume is the limit Stitch has in place for taps? The lower batch sizes use considerably less memory, but require more API calls. This is non-scientific, but it helped confirm for me that reducing the batch size reduces memory usage:

[Screenshot: memory-usage profile at different batch sizes, 2019-08-07]
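The profiling above was informal. One quick way to reproduce the comparison yourself is to wrap a batch-materialization step with the standard library's tracemalloc and compare peak allocations at different batch sizes (the record shape below is made up for illustration):

```python
import tracemalloc


def peak_memory_for_batch(batch_size):
    """Return peak bytes allocated while holding one batch of records."""
    tracemalloc.start()
    # Stand-in for one batch of API rows held in memory before being emitted.
    batch = [{"id": i, "payload": "x" * 200} for i in range(batch_size)]
    peak = tracemalloc.get_traced_memory()[1]  # (current, peak) -> peak
    tracemalloc.stop()
    del batch
    return peak


# Smaller batches should peak well below the 2,500-row default.
small = peak_memory_for_batch(100)
default = peak_memory_for_batch(2500)
```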

Please let me know if you have any questions - we're keen to get this tap running successfully in Stitch!

KAllan357 (Contributor) commented:

👍 thanks

KAllan357 merged commit 22a9b9e into singer-io:master on Aug 21, 2019.
Linked issue closed by this PR: Tap failed with code -9 (#25)