Releases: awslabs/aws-serverless-data-lake-framework
Serverless Data Lake Framework 2.0.0-beta.0
Work is ongoing on a new major version of the Serverless Data Lake Framework. This is a pre-release, not ready for production workloads.
What’s New
- SDLF components are now CloudFormation modules
  - there is one module per component: foundations, team, pipeline, stageA, stageB, dataset.
  - datalakeLibrary and pipLibrary are used to build Lambda layers; they're not CloudFormation modules.
  - `deploy.sh` takes care of deploying the CICD infrastructure used to build these modules and registering them in the private CloudFormation registry of each account. Modules are updated whenever there is a change to their source repository.
- SDLF CICD pipelines now live in the Shared DevOps account
  - CloudFormation stacks are created in child accounts through cross-account IAM roles.
  - SDLF can deploy an arbitrary number of child accounts driven from a single devops account.
- `pDomain` (which defaults to `datalake`) can be provided when deploying foundations.
  - each domain can have the usual three environments (`dev`, `test`, `prod`).
- Deploying foundations and teams is now done from a new repository called `sdlf-main`.
  - this repository is created during the initial setup with deploy.sh.
  - foundations deployment happens in `foundations-{domain}-{env}.yaml` and teams in `teams-{domain}-{env}.yaml`.
  - sdlf-main works the same way everything works in SDLF - `master`, `test` and `dev` branches are expected.
  - it is easier to know which teams have been created, and to remove them as they don't share the same set of parameters in `parameters-{env}.json`.
- Deploying pipelines and datasets is now done from a new repository called `sdlf-{domain}-{team name}-main`.
  - this repository is created when a new team is created.
  - pipelines deployment happens in `pipelines-{env}.yaml` and datasets in `datasets-{env}.yaml`.
  - sdlf-{team name}-main works the same way everything works in SDLF - `master`, `test` and `dev` branches are expected.
  - it is easier to know which pipelines and teams have been created, and to remove them as they don't share the same set of parameters in `parameters-{env}.json`.
- Mappings between datasets and transforms in stageB are now done directly when defining a dataset.
  - this mapping used to be done by a CodeBuild project and a script in `sdlf-datalakeLibrary`. They are no longer needed and have been removed.
  - it is now defined through the `pPipelineDetails` parameter when defining a dataset in `sdlf-dataset`. This parameter goes even further and can be used to store more information that stages can use. These details are stored in the Datasets DynamoDB table (as was already the case in SDLFv1).
- Stages in a pipeline are now driven by EventBridge rules exclusively.
  - the rule can be an event pattern or a schedule (cron expression).
  - stageA no longer sends messages to a queue for stageB to process. StageB is configured with an event pattern to listen for stageA runs (`pEventPattern` in the example), and then processes these events on a schedule (`pSchedule`).
  - it is now easier to have pipelines with a single stage, pipelines with dependent stages and overall more complex pipelines than in SDLFv1, as long as there is an event pattern to listen for.
- New optional component: `sdlf-monitoring`, with CloudTrail, ELK and SNS.
  - in SDLFv1 CloudTrail is optional but enabled by default. Here it is optional and not enabled as long as `sdlf-monitoring` is not deployed.
- New optional stage: `sdlf-stage-dataquality`
  - deequ is now entirely optional. While it wasn't enabled by default in SDLFv1, dedicated infrastructure was still created while deploying sdlf-foundations. This is no longer the case.
  - `sdlf-stage-dataquality` can now be used as an example of how to add a third stage to the default stageA and stageB pipeline.
- Outside the initial `deploy.sh`, there are no more shell scripts.
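
To illustrate the event-driven stages described above, here is a minimal Python sketch of how an EventBridge-style event pattern selects events. It is not SDLF code and not the EventBridge implementation: the matcher only handles nested objects and lists of accepted literal values, and the `source`, `detail-type` and `detail` values are hypothetical, chosen for this example only.

```python
# Simplified matcher for EventBridge-style event patterns (illustrative only).
# Supported: nested objects and lists of accepted literal values.
# Not supported: prefix, numeric, anything-but and other content filters.


def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field named in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the pattern lists the accepted values for this field.
            return False
    return True


# Hypothetical pattern a stageB could use to listen for stageA runs.
# All values below are assumptions made up for this example.
STAGE_B_PATTERN = {
    "source": ["example.stage-a"],
    "detail-type": ["StageRunSucceeded"],
    "detail": {"team": ["engineering"], "dataset": ["legislators"]},
}

stage_a_event = {
    "source": "example.stage-a",
    "detail-type": "StageRunSucceeded",
    "detail": {"team": "engineering", "dataset": "legislators", "key": "raw/file.json"},
}

other_team_event = {
    "source": "example.stage-a",
    "detail-type": "StageRunSucceeded",
    "detail": {"team": "marketing", "dataset": "legislators"},
}

if __name__ == "__main__":
    print(matches(STAGE_B_PATTERN, stage_a_event))
    print(matches(STAGE_B_PATTERN, other_team_event))
```

Fields present in the event but absent from the pattern (such as `key` here) are ignored, which is why a stage only needs a pattern for the fields it cares about.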
Full Changelog: 1.5.2...2.0.0-beta.0
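
The repository and file naming conventions listed in this release can be summarised with a small Python sketch. The helper names are hypothetical and exist only for this example, and the mapping of the `master` branch to the `prod` environment is an assumption: the notes list the branch names and the environments but do not state how they pair up.

```python
# Naming conventions from the 2.0.0-beta.0 notes, expressed as helpers.
# Function names are hypothetical; only the name templates come from the notes.

ENVIRONMENTS = ("dev", "test", "prod")

# Assumption: master deploys to prod. The notes only list the branch names.
BRANCH_TO_ENV = {"dev": "dev", "test": "test", "master": "prod"}


def main_repo_files(env: str, domain: str = "datalake") -> dict:
    """Files in sdlf-main that drive foundations and teams deployments."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return {
        "foundations": f"foundations-{domain}-{env}.yaml",
        "teams": f"teams-{domain}-{env}.yaml",
    }


def team_repo_name(team: str, domain: str = "datalake") -> str:
    """Repository created for a team, holding its pipelines and datasets."""
    return f"sdlf-{domain}-{team}-main"


def team_repo_files(env: str) -> dict:
    """Files in a team repository that drive pipelines and datasets deployments."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return {
        "pipelines": f"pipelines-{env}.yaml",
        "datasets": f"datasets-{env}.yaml",
    }


if __name__ == "__main__":
    env = BRANCH_TO_ENV["master"]
    print(main_repo_files(env))
    print(team_repo_name("engineering"))
    print(team_repo_files(env))
```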
Serverless Data Lake Framework 1.5.2
What's Changed
Full Changelog: 1.5.1...1.5.2
Serverless Data Lake Framework 1.5.1
Serverless Data Lake Framework 1.5.0
Features & Enhancements
- ELK Update by @cnfait in #136
- rework sdlf-cicd rCodeBuildRole IAM role to avoid using wildcards by @cnfait in #130
- avoid wildcards in sdlf-lakeformation-admin role permissions by @cnfait in #132
- avoid wildcards in data quality lambda permissions by @cnfait in #131
- disable cfn_nag W11 on CodeCommit roles by @cnfait in #133
- update awswrangler (aws sdk for pandas) to the latest 2.x version by @ntlohi in #134
Full Changelog: 1.4.0...1.5.0
Thanks
We thank all the contributors/users for their work on this release, in particular @ntlohi.
Serverless Data Lake Framework 1.4.0
Noteworthy
Features & Enhancements
- update codebuild image from standard:4.0 to amazonlinux2-x86_64-standard:4.0 by @cnfait in #113
- validate.sh: replace flake8, isort with ruff by @cnfait in #126
- Support for specifying glue arguments in dynamodb dataset table by @cnfait in #127
- add emr tagging permissions by @cnfait in #129
Full Changelog: 1.3.1...1.4.0
Thanks
We thank all the contributors/users for their work on this release.
Serverless Data Lake Framework 1.3.1
Bug Fixes
- fix pipeline stage dynamodb entry creation by @cnfait in 675517a
- use deequ 1.1.0 instead of 1.2.2 as it breaks glue jobs by @cnfait and @piers-walter-ibm in 5065cd5
- deploy failed when you deploy a new environment (missing sns permission) by @YuliemAlavez in #118
Minor Changes
- gitlab support: readme file by @cnfait in b456258
- remove executable bit from json files by @cnfait in #112
- fix minor typo by @cnfait in #114
Features & Enhancements
Full Changelog: 1.3.0...1.3.1
Thanks
We thank all the contributors/users for their work on this release, in particular @YuliemAlavez and @piers-walter-ibm.
Serverless Data Lake Framework 1.3.0
Noteworthy
- Third-party SCM support (mirroring to CodeCommit): GitLab🔥
- As of version 1.1.0, released on December 7th, 2022, there is now a public roadmap.
Features & Enhancements
- third-party scm support: gitlab by @cnfait in #104
- enable versioning on central/raw/stage/analytics buckets by @cnfait in #106
- add security configuration to sdlf-dataset glue crawler by @cnfait in #107
- encrypt cloudtrail logs when using externally-provided bucket by @cnfait in #108
Full Changelog: 1.2.0...1.3.0
Thanks
We thank all the contributors/users for their work on this release.
Serverless Data Lake Framework 1.2.0
Noteworthy
- As of version 1.1.0, released on December 7th, 2022, there is now a public roadmap.
- As of version 1.1.0, released on December 7th, 2022, the main branch of the repository has been renamed to `main` from `master`. This is to be in line with what other projects the team is working on are using. `master` is still available with the same content as `main` to avoid breaking existing workflows. However, only `master` is currently supported by SDLF CICD infrastructure.
- As of version 1.1.0, released on December 7th, 2022, Semantic Versioning is now used for SDLF releases. This is to be in line with other projects from the same team.
Bug Fixes
- correct and clean manifests and cloudfront examples by @mariandumitrascu-p in #71
- fix bitbucket team pipeline when checking repositories by @cnfait in #103 - Thanks @YuliemAlavez!
Features & Enhancements
- Python 3.9 as default for Lambda functions, Lambda layers and CodeBuild runtimes by @cnfait in #93
- Align GlueVersion to 2.0 for all Glue jobs by @cnfait in #94
- Update Deequ from 1.0.X to Deequ 1.2.2-spark2.4 by @cnfait in #95
- Update ElasticSearch domain from 6.3 to 6.8 by @cnfait in #96
- Add simple shell script and configuration files to help improve code quality by @cnfait in #97
- isort by @cnfait in #98
- black by @cnfait in #99
- flake8 by @cnfait in #100
- shellcheck by @cnfait in #101
Full Changelog: 1.1.0...1.2.0
Thanks
We thank all the contributors/users for their work on this release.
Serverless Data Lake Framework 1.1.0
Noteworthy
- This release is just a snapshot of the repository as of December 7th, 2022. There are no new features or changes if you already pulled the code from the main branch.
- There is now a public roadmap.
- The main branch of the repository has been renamed to `main` from `master`. This is to be in line with what other projects the team is working on are using. `master` is still available with the same content as `main` to avoid breaking existing workflows.
- Semantic Versioning is now used for SDLF releases. This is to be in line with other projects from the same team.
Features & Enhancements
- Added bucket policies to enforce in transit encryption for s3 buckets #14
- Update catalog lambda to handle S3 multipart upload events #19
- Update catalog lambda to support DeleteMarkerCreated events #24
- 3rd party SCM providers - Azure DevOps integration #22
- Bumping Wrangler to 2.3.0 and removing ListBucket condition
- 3rd party SCM providers - Bitbucket integration #26
- Enable python 3.8 runtime for non-default lambda layers #29
- Add alias option for target e-mail #32
- Enable Manifest Based Processing in SDLF #30
- Adding Glue Jobs Deployer utility #34
- Feature to add pre-existing whl files without having to build them #39
- Adding deploy mode for datasets #40
- Enable NodeToNodeEncryptionOptions (CFN_Nag W85) #43
- Add update stack logic for cross-account team role stack #44
- Adding Data Lake testing #45
- Enable tracing for step functions #49
- Lambda cloudwatch log encryption retention #46
- Add template protection function #48
- Update key and bucket retention policies #50
- Adding PutLifecycleConfiguration permission
- Adding in a CloudFormation template that sets up automated testing for CodeCommit Pull Requests #47
- Datalake Workload Management #52
- Point-in-time recovery (PITR) enabled for DynamoDB tables #53
- Modifying user agent
- Adding few more examples and public references #58
- Sqoop ingestion extension #57
- Reducing size policy #62
- Removing slf4j logger calls
- EMR security configuration #59
- Python runtime updated #67
Bug Fixes
- Adding missing sdlf-utils and reinstating PubRef
- Correct typo of Glue Job's name #33
- Deleting additional Images, fixing README and parameters-dev errors #42
- Fixing Topic Modelling Example
- Sqoop ingestion minor fixes #66
- Fix unsupported resource arn format on rXXBucketLakeFormationS3Registration resources #77
- Fix S3 buckets ARN - Lakeformation integration #75
Documentation
- Adjusting Contributing file to latest template
- Adjusting workshop URLs to support i18n
- Better documentation for new service connection strategy #25
Thanks
We thank all the contributors/users for their work on this release.
Full Changelog: v1.0.4.0...1.1.0