
🎉 New source: AWS CloudTrail #4122

Merged
merged 19 commits into master from yaroslav-dudar/2418-source-aws-cloudtrail
Jun 25, 2021

Conversation

yaroslav-dudar
Contributor

@yaroslav-dudar yaroslav-dudar commented Jun 15, 2021

What

#2418 - AWS CloudTrail source

How

The connector is implemented using the Boto3 library.

The connector supports management events as a stream. There is no API to generate Insight events (these event types are produced automatically by AWS ML algorithms), so it's hard to implement acceptance tests for them.

Added a records_limit option, since an account may contain thousands of events, which would make acceptance tests slow to execute.
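The records_limit behavior described above can be sketched as a simple cap on the record iterator. This is illustrative only; the function name and shape here mirror the PR's option but are not the connector's exact code:

```python
from typing import Iterable, Iterator, Optional

def read_records(events: Iterable[dict], records_limit: Optional[int] = None) -> Iterator[dict]:
    # Stop yielding once the configured limit is reached, so acceptance
    # tests don't crawl thousands of events in a large account.
    for count, event in enumerate(events):
        if records_limit is not None and count >= records_limit:
            return
        yield event
```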

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

Expand the checklist which is relevant for this PR.

Connector checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • Secrets are annotated with airbyte_secret in output spec
  • Unit & integration tests added as appropriate (and are passing)
    • Community members: please provide proof of this succeeding locally e.g: screenshot or copy-paste acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • /test connector=connectors/<name> command as documented here is passing.
    • Community members can skip this, Airbyters will run this for you.
  • Code reviews completed
  • Credentials added to GitHub CI if needed and not already present. See the instructions for injecting secrets into CI.
  • Documentation updated
    • README
    • CHANGELOG.md
    • Reference docs in the docs/integrations/ directory.
    • Build status added to build page
  • Build is successful
  • Connector version bumped like described here
  • New Connector version released on Dockerhub by running the /publish command described here
  • No major blockers
  • PR merged into master branch
  • Follow up tickets have been created
  • Associated tickets have been closed & stakeholders notified

Connector Generator checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed.

@github-actions github-actions bot added area/connectors Connector related issues area/documentation Improvements or additions to documentation labels Jun 15, 2021
@yaroslav-dudar yaroslav-dudar changed the title 🎉 Add AWS CloudTrail source connector 🎉 New source: AWS CloudTrail Jun 15, 2021
@bazarnov
Collaborator

bazarnov commented Jun 15, 2021

/test connector=connectors/source-aws-cloudtrail

🕑 connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/939566443
❌ connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/939566443

Collaborator

@bazarnov bazarnov left a comment


A small question about fetch date periods: could we incorporate an end_date for streams, to limit the amount of data if there is demand to pull data for a specific period?

Also, the /test command is failing on the integration-tests because of the credentials on the CI side.
Did you edit ci_credentials.sh with the credentials?
Did you add the test credentials to Github Secrets? If not, please contact @tuliren for this.

docs/integrations/sources/aws-cloudtrail.md (review thread resolved)
@yaroslav-dudar
Contributor Author

yaroslav-dudar commented Jun 16, 2021

/test connector=connectors/source-aws-cloudtrail

🕑 connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/941961027
✅ connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/941961027

@yaroslav-dudar yaroslav-dudar requested a review from bazarnov June 16, 2021 15:47
@yaroslav-dudar
Contributor Author

@subodh1810 you can put your thoughts about the records_limit option here

@yaroslav-dudar
Contributor Author

For example, I can remove records_limit from spec.json (so it will not be available in the UI) but still parse it if it is present in the config (--config secrets/config.json). Is that acceptable?

@tuliren
Contributor

tuliren commented Jun 17, 2021

For example, I can remove records_limit from spec.json (so it will not be available in the UI) but still parse it if it is present in the config (--config secrets/config.json). Is that acceptable?

This sounds reasonable if this parameter is only meant for testing.

@yaroslav-dudar yaroslav-dudar marked this pull request as ready for review June 22, 2021 12:19

def parse_response(self, response: dict, **kwargs) -> Iterable[Mapping]:
    for event in response[self.data_field]:
        # boto3 converts timestamps to datetime objects
Contributor

good job leaving a comment here to explain the non-obvious 👍🏼

def limit(self):
    return super().limit

# API does not support read in ascending order
Contributor

maybe we should use stream slicing to set boundaries on start/end dates, and sync one day at a time? If there is a lot of events it would make fault tolerance much better since we can save state continuously

Contributor Author

There is a problem if we fail in the middle of the operation: it may lead to data loss, since the most recent events come first and the state will always end up at the latest date.

Contributor

if we fail in the middle of syncing a particular slice, then state is not saved. State is only saved once the full slice has been synced. See Stream Slicing. WDYT?
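The slicing approach described here can be sketched as follows. This is a minimal illustration, not the connector's actual stream_slices implementation; the StartTime/EndTime keys follow the excerpts elsewhere in this PR:

```python
from datetime import datetime, timedelta

def day_slices(start: datetime, end: datetime) -> list:
    # Split [start, end) into contiguous one-day slices; state can then be
    # checkpointed after each fully synced slice, improving fault tolerance.
    slices = []
    cursor = start
    while cursor < end:
        slice_end = min(cursor + timedelta(days=1), end)
        slices.append({"StartTime": cursor, "EndTime": slice_end})
        cursor = slice_end
    return slices
```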

Contributor Author

yes, I got your idea

Contributor Author

@sherifnada added stream_slices method for the stream and some unit tests



class IncrementalAwsCloudtrailStream(AwsCloudtrailStream, ABC):
    @property
Contributor

I think this is not needed, since self.limit would already be equal to super().limit.


def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, Any]:
    # Increment record time to avoid data duplication in the next syncs
    record_time = latest_record[self.time_field] + 1
Contributor

It's OK to duplicate one record: Airbyte offers an at-least-once delivery guarantee, not exactly-once. This will almost never happen, but it's better to be safe (don't add 1: potentially duplicate records, but definitely capture all of them) than to miss records.

Contributor Author

but be aware that each new incremental sync will produce at least one duplicated record
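The reviewer's suggestion (keep the raw cursor value, accept a possible duplicate) could look like this sketch; the EventTime field name follows the test excerpts later in this PR, and the rest is illustrative:

```python
def get_updated_state(current_state: dict, latest_record: dict, cursor_field: str = "EventTime") -> dict:
    # Keep the raw cursor value instead of adding 1: the next sync may
    # re-read one record (at-least-once), but no record can be missed.
    current = current_state.get(cursor_field, 0)
    return {cursor_field: max(current, latest_record[cursor_field])}
```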

except botocore.exceptions.ClientError as err:
    # returns no records if either the start time occurs after the end time
    # or the time range is outside the range of possible values.
    if err.response["Error"]["Code"] == "InvalidTimeRangeException":
Contributor

shouldn't this throw an exception? this seems like a legitimate error which would result in no records returning from the sync?


Contributor

ah I see. I think it is better that we throw an error in this case rather than potentially silently fail in production just to make a test succeed.

If you comment out the abnormal state value in the YML, then that test will be skipped. I suggest we do that.

Contributor Author

@yaroslav-dudar commented Jun 22, 2021

It seems like boto3 (or the API itself) adds the EndTime parameter by default, set to the current time.
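For reference, the error-code check under discussion operates on the error response dict that botocore's ClientError carries; a minimal sketch (the helper name is hypothetical):

```python
def is_invalid_time_range(error_response: dict) -> bool:
    # botocore's ClientError exposes the service error code under
    # response["Error"]["Code"]; CloudTrail uses InvalidTimeRangeException
    # when the requested window is empty or out of range.
    return error_response.get("Error", {}).get("Code") == "InvalidTimeRangeException"
```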

while not pagination_complete:
    params = self.request_params(stream_state=stream_state, stream_slice=stream_slice, next_page_token=next_page_token)
    try:
        response = self.send_request(**params)
Contributor

send_request should be defined as an abstract method on this class to make it clear child classes must implement it
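The suggestion can be sketched with abc; class names here are illustrative, not the PR's exact hierarchy:

```python
from abc import ABC, abstractmethod

class BaseAwsStream(ABC):
    # Declaring send_request abstract makes the contract explicit:
    # subclasses cannot be instantiated without implementing it.
    @abstractmethod
    def send_request(self, **params) -> dict:
        ...

class ManagementEventsStream(BaseAwsStream):
    def send_request(self, **params) -> dict:
        # Toy implementation for illustration only.
        return {"Events": [], "params": params}
```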

* AWS Secret access key
* AWS region name

### Setup guide
Contributor

Could you add a section for the changelog like here? Changelogs now live in the external-facing docs.


### Performance considerations

The rate of lookup requests for `events` stream is limited to two per second, per account, per region. If this limit is exceeded, a throttling error occurs.
Contributor

The connector retries it though, right? So there shouldn't be an error under normal circumstances. If so, I suggest adding: "This connector gracefully retries when encountering a throttling error. However, if the errors continue repeatedly after multiple retries (for example, if you set up many instances of this connector using the same account and region), the connector sync will fail."
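The graceful-retry behavior referenced here can be sketched as a generic exponential backoff loop. This is illustrative only; the connector's actual retry mechanism isn't shown in this excerpt, and ThrottlingError below is a stand-in exception name:

```python
import time

class ThrottlingError(Exception):
    """Stand-in for a CloudTrail throttling error."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Retry fn with exponentially growing delays; re-raise once the
    # retry budget is exhausted (the sync then fails, as the doc notes).
    for attempt in range(max_retries):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```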

| Feature | Supported?\(Yes/No\) | Notes |
| :--- | :--- | :--- |
| Full Refresh Sync | Yes | |
| Incremental - Append Sync | Yes | |
Contributor

Incremental-append is the old wording. Incremental is the correct new wording. You might need to update the docs template.

Suggested change
| Incremental - Append Sync | Yes | |
| Incremental Sync | Yes | |


The AWS CloudTrail source supports both Full Refresh and Incremental syncs. You can choose if this connector will copy only the new or updated data, or all rows in the tables and columns you set up for replication, every time a sync is run.

This Source Connector is based on a [Boto3 CloudTrail](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudtrail.html).
Contributor

I think this is helpful for future reference and debugging.

Contributor

It's present in setup.py, so a developer should be able to see it. Or are you saying a user might actually be interested in it? Fair enough. Apologies for the back and forth, Yaro; mind adding it back in?

Contributor

Hmm, this is not in the setup.py in this PR. Are you referring to a file outside of this PR?

Right, I think a user may need it for reference. There are usually detailed explanations of each parameter in this kind of documentation, and we cannot replicate all of that information in spec.json.



@yaroslav-dudar
Contributor Author

yaroslav-dudar commented Jun 23, 2021

/test connector=connectors/source-aws-cloudtrail

🕑 connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/963961553
✅ connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/963961553

Contributor

@sherifnada sherifnada left a comment

could you also add the connector to docs/SUMMARY.md and airbyte-integrations/builds.md?


| Version | Date | Pull Request | Subject |
| :------ | :-------- | :----- | :------ |
| 0.1.0 | 2021-06-23 | [4122](https://github.com/airbytehq/airbyte/pull/4122) | LookupEvent API |
Contributor

Suggested change
| 0.1.0 | 2021-06-23 | [4122](https://github.com/airbytehq/airbyte/pull/4122) | LookupEvent API |
| 0.1.0 | 2021-06-23 | [4122](https://github.com/airbytehq/airbyte/pull/4122) | Initial release supporting the LookupEvent API |

@@ -0,0 +1,38 @@
{
"documentationUrl": "https://docsurl.com",
Contributor

Suggested change
"documentationUrl": "https://docsurl.com",
"documentationUrl": "https://docs.airbyte.io/integrations/sources/aws-cloudtrail",

if next_page_token:
    params["NextToken"] = next_page_token

if not stream_slice:
Contributor

I suggest moving these lines (86-95) to the ManagementEvents class, since it is where slices are defined. Something like:

def request_params(self, stream_state=None, stream_slice=None, next_page_token=None):
    params = super().request_params(stream_state=stream_state, stream_slice=stream_slice, next_page_token=next_page_token)
    if stream_slice:
        ...  # set the slice-specific params here
    return params

stream_state=stream_state,
)

assert len(slices) < ManagementEvents.data_lifetime
Contributor

Do you think this check is useful? It will start failing if we make the slicing period smaller than 24h, which doesn't seem like an important detail.

sync_mode=SyncMode.full_refresh, cursor_field=stream.cursor_field
)

assert len(slices) >= ManagementEvents.data_lifetime
Contributor

Should we care how many slices there are? Ideally we should just test for the invariants, which are that slices are contiguous and mutually exclusive.

Contributor Author

agree

"start_date": "2020-05-01",
}


Contributor

We should also add a test case asserting that if the start date is later than the state, we start from the start date.
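The requested test case could be as small as this sketch; the helper and field names are hypothetical, and the max() form mirrors the simplification suggested in another thread of this review:

```python
def compute_sync_start(start_date: int, stream_state: dict, cursor_field: str = "EventTime") -> int:
    # If the configured start_date is later than the saved cursor,
    # the sync should begin from start_date, ignoring the older state.
    return max(start_date, stream_state.get(cursor_field, 0))

def test_start_date_overrides_older_state():
    assert compute_sync_start(1_600_000_000, {"EventTime": 1_500_000_000}) == 1_600_000_000
    assert compute_sync_start(1_500_000_000, {"EventTime": 1_600_000_000}) == 1_600_000_000
```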


assert len(slices) < ManagementEvents.data_lifetime
# checks that start time is not more than 15 days before now
assert slices[0]["StartTime"] >= stream_state["EventTime"]
Contributor

shouldn't this be == not >=?

"""
cursor_data = stream_state.get(self.cursor_field) if stream_state else None
# ignores state if start_date option is higher than cursor
if cursor_data and cursor_data > self.start_date:
Contributor

It feels like we could make this block a lot simpler if we just do:

start_time = max(now - 90 days, self.start_date, cursor_data or 0)

and I think in that case we don't really need this if/else, and we don't need normalize_start_time
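Spelled out, the suggested simplification might look like the following sketch (timestamps in epoch seconds; the 90-day lookback matches the comment above, and the function and parameter names are illustrative):

```python
def compute_start_time(now_ts: int, start_date_ts: int, cursor_ts=None, lookback_days: int = 90) -> int:
    # One max() replaces the if/else: take the latest of the API's
    # earliest allowed time, the configured start_date, and the saved cursor.
    earliest_allowed = now_ts - lookback_days * 24 * 60 * 60
    return max(earliest_allowed, start_date_ts, cursor_ts or 0)
```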

{
    "StartTime": last_start_time,
    # decrement a second, as the API includes records with the specified StartTime and EndTime
    "EndTime": last_start_time + self.day_in_seconds - 1,
Contributor

WDYT about using a library like pendulum to do the date math, to avoid hardcoding constants? It would let us do something like:

last_start_time.add(days=1).int_timestamp
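For comparison, here is the same computation without the hardcoded day_in_seconds constant, using only the stdlib (pendulum's last_start_time.add(days=1).int_timestamp would read almost identically); the helper name is illustrative:

```python
from datetime import datetime, timedelta, timezone

def slice_end_time(last_start_time: datetime) -> int:
    # End one second before the next day starts, since the API treats
    # StartTime and EndTime as inclusive bounds.
    return int((last_start_time + timedelta(days=1)).timestamp()) - 1
```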

@yaroslav-dudar
Contributor Author

yaroslav-dudar commented Jun 24, 2021

/test connector=connectors/source-aws-cloudtrail

🕑 connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/968219941
❌ connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/968219941

@yaroslav-dudar
Contributor Author

yaroslav-dudar commented Jun 24, 2021

/test connector=connectors/source-aws-cloudtrail

🕑 connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/968328617
✅ connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/968328617

@yaroslav-dudar
Contributor Author

yaroslav-dudar commented Jun 24, 2021

@sherifnada FYI I fixed the test issue

@yaroslav-dudar
Contributor Author

yaroslav-dudar commented Jun 25, 2021

/publish connector=connectors/source-aws-cloudtrail

🕑 connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/970495990
✅ connectors/source-aws-cloudtrail https://github.com/airbytehq/airbyte/actions/runs/970495990

@yaroslav-dudar yaroslav-dudar merged commit 2ccb1c2 into master Jun 25, 2021
@yaroslav-dudar yaroslav-dudar deleted the yaroslav-dudar/2418-source-aws-cloudtrail branch June 25, 2021 08:10
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/aws-cloudtrail connectors/sources-api