Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterator-based JDBC Source (and Redshift bugfix) #1887

Merged
merged 21 commits into from
Feb 3, 2021

Conversation

cgardens
Copy link
Contributor

@cgardens cgardens commented Jan 29, 2021

What

There are 2 things happening in this PR.

  • Removes usage of streams from the AbstractJdbcSource in favor of iterators.
  • Implements JdbcStandardTest for the Redshift source.

Why?

  • A user (@roshan) this morning report that the Redshift source was failing. The first issue was very obviously that there were . characters in the table names, which was a known issue that we actually had a PR up for, we just hadn't merged it because we didn't want to risk adding unknown bugs right before the HN launch. I pushed a version of this PR to docker hub and gave him access. This fixed the initial issue but then we found the redshift source was hanging...
  • It seems like the redshift source was hanging because we no longer lazily query the database. We made the change to stop lazily querying the db when we ran into the issue with flatMap a few weeks ago (PR)). This seemed relatively safe since we fetch data from the dbs in chunks. So the first chunks would be pulled eagerly but the rest would wait until we started consuming the output. This worked with the other jdbc databases. It did not work for Redshift. I believe this is due to how connection pooling is happening in Redshift where. Empirically the database hangs up after there are multiple (>9) queries queued up. I still need to dig a little deeper into the exact limitation here.
  • That all being said, it was clear that we were eating a bad side effect of our eager querying. We already knew this wasn't good and were struggling with other stream issues in the JdbcSource, so I went ahead switched the internals of the JdbcSource to use iterators. This allows us to get exactly the behavior we want when consuming data from the sources (lazy, element-wize, closeable, etc). Making the querying lazy and making sure we close connections allows the Redshift source to work properly.
  • This took a long time to debug because Redshift was the only database that didn't implement the JdbcStandardTest suite. This is because there are no test containers for Redshift so it was a really big pain to set it up in this test suite. That said, the pain was warranted today, because this was the most reasonable way to dig in and identify where the source was hanging.

How

  • Add MoreIterators to provide us "stream-like" syntactic sugar when dealing with iterators.
  • Migrate AbstractJdbcSource to use iterators
  • Implement JdbcStandardTest for redshift.
  • Modify JdbcStandardTest to handle its own clean up and not assume that there will be a new database available for every test (a constraint of working with redshift without test containers)
  • Modify JdbcStandardTest to handle schemas namespaces more clearly / robustly.

Pre-merge Checklist

  • Unit tests for MoreIterators
  • Verify all integration tests still pass
  • Verify all jdbc standard test pass
  • Bump versions

Recommended reading order

  1. Source.java
  2. IntegrationRunner.java
  3. MoreIterators.java
  4. RedshiftJdbcStandardTest.java
  5. the rest

Other

  • We still have an outstanding issue from a user about the redshift destination hanging. I do not yet know if that is at all related.

@cgardens cgardens force-pushed the cgardens/jdbc_source_iterator branch from 8bcfa2b to 7193ea5 Compare January 29, 2021 00:09
@cgardens cgardens marked this pull request as ready for review January 29, 2021 00:33
* @param <T> type
* @return auto closing iterator
*/
public static <T> Iterator<T> autoCloseIterator(Iterator<T> iterator, SafeVoidCallable onClose) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a VERY dangerous function. It should not return an Iterator, it should return an AutoCloseIterator. Otherwise there is absolutely no way to release resources but to consume the iterator to the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

* @param <T> type
* @return vanilla iterator
*/
public static <T> Iterator<T> toCloseableIterator(CloseableIterator<T> iterator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to kill this function, it cannot exist. Way too dangerous for the reason I mentioned earlier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@cgardens cgardens force-pushed the cgardens/jdbc_source_iterator branch from 7371392 to f181958 Compare January 30, 2021 00:47
@cgardens
Copy link
Contributor Author

cgardens commented Jan 30, 2021

@michel-tricot i think i hit your feedback. the AutoClosingIterator now implements the ResourceIterator interface so it can be closed at any time. I think this removes the danger you were pointing out in your previous review. Let me know what you think.

I'm glad we are removing the headache of streams. The iterator approach is unfortunately taking a ton of boilerplate to recreate some of the features of streams that we need though. Let me know if you feel that there's opportunity to reduce this boilerplate.

Still need to do unit tests.


@Override
public void close() throws Exception {
internalIterator.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do the boolean management here and just not do anything if close has already been called? You probably want to use an AtomicBoolean with a get&set if you do so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to push for composition here the way one would thing about a stream. Each of these iterators is designed to do one thing. They can be then composed to get whatever behavior you want.

So if you want close to be called once, then you compose this iterator with the default iterator. I think of this as good default behavior which is why it's called the DefaultResourceIterator but if we think that needs to named more explicitly we can do that to.

Anyway, this attempt at composition is why I am not doing an AutoClosingLazySingleCloseIterator and not combining the behavior you're describing here. Are you not a fan of this approach?


public class ResourceIterators {

public static <T> ResourceIterator<T> emptyIterator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: empty()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

* @param <T> type
* @return closeable iterator
*/
public static <T> ResourceIterator<T> resourceIterator(Iterator<T> iterator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: fromIterator

* @param <T> type
* @return new resource iterator with the close function appended
*/
public static <T> ResourceIterator<T> resourceIterator(Iterator<T> iterator, VoidCallableNoException onClose) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: fromIterator

* @param <T> type
* @return resource iterator
*/
public static <T> ResourceIterator<T> resourceIterator(Stream<T> stream) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: fromStream

@michel-tricot
Copy link
Contributor

I am not completely satisfied with my suggestion for autoclosable

@cgardens cgardens mentioned this pull request Feb 1, 2021
@cgardens cgardens force-pushed the cgardens/jdbc_source_iterator branch 3 times, most recently from 5bab2a4 to 7b20b59 Compare February 1, 2021 23:28
@cgardens
Copy link
Contributor Author

cgardens commented Feb 2, 2021

@michel-tricot I am still working through testing but I think I have now hit all of the feedback you left (and we discussed offline).

I am a little worried that we are being too slow to push out a fix for an issue that makes our redshift source unusable and is potentially causing issues in on jdbc sources that we just haven't had reported yet. Unless there are "one-way" door sort of issues, I think it's time to push this thing over across the finish line and do additional nice to haves later. Definitely agreed that the first set of iterators were too dangerous, but this second round feels like stuff we could do as follow on tasks. @michel-tricot wdyt?

(I am hammering through unit tests now, and will not merge without them)

Copy link
Contributor

@michel-tricot michel-tricot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it!

@cgardens cgardens force-pushed the cgardens/jdbc_source_iterator branch from aec251a to de5a5e2 Compare February 2, 2021 22:16
@cgardens
Copy link
Contributor Author

cgardens commented Feb 2, 2021

Experiencing OOMs again. Debugging...

Fixed. Actually a problem with the stress test itself, not the source code.

@cgardens
Copy link
Contributor Author

cgardens commented Feb 2, 2021

/publish connector=connectors/source-postgres

🕑 connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/532313169
✅ connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/532313169

@cgardens
Copy link
Contributor Author

cgardens commented Feb 2, 2021

/publish connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/532315862
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/532315862

@cgardens
Copy link
Contributor Author

cgardens commented Feb 2, 2021

/publish connector=connectors/source-mssql

🕑 connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/532316321
✅ connectors/source-mssql https://github.com/airbytehq/airbyte/actions/runs/532316321

@cgardens
Copy link
Contributor Author

cgardens commented Feb 2, 2021

/publish connector=connectors/source-redshift

🕑 connectors/source-redshift https://github.com/airbytehq/airbyte/actions/runs/532316607
✅ connectors/source-redshift https://github.com/airbytehq/airbyte/actions/runs/532316607

@cgardens cgardens merged commit aadfae2 into master Feb 3, 2021
@cgardens cgardens deleted the cgardens/jdbc_source_iterator branch February 3, 2021 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants