This project uses the AWS CDK (Cloud Development Kit) to build AWS infrastructure that splits a PDF into chunks, processes the chunks via AWS Step Functions, and merges the resulting chunks back using ECS tasks. The infrastructure also includes CloudWatch dashboards and metrics for tracking progress.
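To make the split step concrete, here is a minimal sketch of how page ranges for fixed-size chunks could be computed. The function name `chunk_page_ranges` and the chunk size are assumptions for illustration only; the project's actual splitting logic lives in the split_pdf Lambda and may work differently.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (start, end) page ranges that cover the document."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    ranges = []
    for start in range(1, total_pages + 1, pages_per_chunk):
        # The last chunk may be shorter than pages_per_chunk.
        end = min(start + pages_per_chunk - 1, total_pages)
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # Hypothetical example: a 25-page PDF split into chunks of 10 pages.
    print(chunk_page_ranges(25, 10))
```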
Before running the AWS CDK stack, ensure the following are installed and configured:
-
Amazon Bedrock Access: Ensure your AWS account has access to the Claude 3.5 Sonnet v2 model in Amazon Bedrock.
- Request access to Amazon Bedrock through the AWS console if not already enabled.
-
Adobe API Access: An enterprise-level contract or a trial account (for testing) for Adobe's API is required.
- Use the Adobe PDF Services API to obtain API credentials.
-
Python (3.7 or later)
- Download Python
- Set up a virtual environment
python -m venv .venv
source .venv/bin/activate # For macOS/Linux
.venv\Scripts\activate # For Windows
- On Windows, confirm the Python path in cmd before deploying by running:
where python
-
AWS CLI: To interact with AWS services and set up credentials.
-
npm
- npm is required to install AWS CDK. Install npm by installing Node.js:
- Download Node.js (includes npm).
- Verify npm installation:
npm --version
-
AWS CDK: For defining cloud infrastructure in code.
- Install AWS CDK
npm install -g aws-cdk
-
Docker: Required to build and run Docker images for the ECS tasks.
- Install Docker
- Verify installation:
docker --version
-
AWS Account Permissions
- Ensure permissions to create and manage AWS resources like S3, Lambda, ECS, ECR, Step Functions, and CloudWatch.
- AWS IAM Policies and Permissions
- For ease of deployment, create an IAM user in the account you want to deploy to, attach administrator access to that user, and use that user's access key and secret key.
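The command-line prerequisites above can be checked in one pass. The following is a minimal sketch that looks for each tool on PATH; the list of command names is an assumption (for example, Python may be installed as `python3` on some systems), and it does not verify versions or AWS permissions.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed command names for the prerequisites listed above.
REQUIRED_TOOLS = ("python", "aws", "npm", "cdk", "docker")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```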
Ensure your project has the following structure:
├── app.py (Main CDK app)
├── lambda/
│ ├── split_pdf/ (Python Lambda for splitting PDF)
│ └── java_lambda/ (Java Lambda for merging PDFs)
├── docker_autotag/ (Python Docker image for ECS task)
├── javascript_docker/ (JavaScript Docker image for ECS task)
└── client_credentials.json (The client ID and client secret for Adobe)
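A quick way to confirm the layout is to check that each expected path exists. The sketch below assumes it is run from the project root and that client_credentials.json sits at the root, as the setup steps describe; the credentials file will be reported as missing until it has been created.

```python
from pathlib import Path
from typing import Iterable, List

# Paths taken from the structure above, relative to the project root.
EXPECTED_PATHS = (
    "app.py",
    "lambda/split_pdf",
    "lambda/java_lambda",
    "docker_autotag",
    "javascript_docker",
    "client_credentials.json",
)


def missing_paths(root: Path, expected: Iterable[str] = EXPECTED_PATHS) -> List[str]:
    """Return the expected paths that do not exist under root."""
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(Path.cwd())
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("Project structure looks complete.")
```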
-
Clone the Repository:
- Clone this repository containing the CDK code, Docker configurations, and Lambda functions.
-
Set Up Your Environment:
- Configure AWS CLI with your AWS account credentials:
aws configure
- Make sure the region is set to
us-east-1
-
Set Up CDK Environment:
- Bootstrap your AWS environment for CDK (run only once per AWS account/region):
cdk bootstrap
-
Create Adobe API Credentials:
- Create a file called
client_credentials.json
in the root directory with the following structure:
{
  "client_credentials": {
    "PDF_SERVICES_CLIENT_ID": "<Your client ID here>",
    "PDF_SERVICES_CLIENT_SECRET": "<Your secret ID here>"
  }
}
- Replace <Your client ID here> and <Your secret ID here> with the Client ID and Client Secret provided by Adobe. Replace only these placeholders, not the whole file.
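Before uploading the file, it can help to confirm that it parses as JSON and that the placeholders were actually replaced. This is a minimal sketch; the function name `credential_problems` is hypothetical, and the placeholder check (a value wrapped in angle brackets) is an assumption based on the template above.

```python
import json
from pathlib import Path
from typing import List

REQUIRED_KEYS = ("PDF_SERVICES_CLIENT_ID", "PDF_SERVICES_CLIENT_SECRET")


def credential_problems(text: str) -> List[str]:
    """Return a list of problems found in the credentials JSON text."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    creds = data.get("client_credentials") if isinstance(data, dict) else None
    if not isinstance(creds, dict):
        return ['missing top-level "client_credentials" object']
    problems = []
    for key in REQUIRED_KEYS:
        value = creds.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} is missing or empty")
        elif value.startswith("<") and value.endswith(">"):
            # Assumption: template placeholders look like "<Your client ID here>".
            problems.append(f"{key} still contains the placeholder text")
    return problems


if __name__ == "__main__":
    path = Path("client_credentials.json")
    if path.exists():
        found = credential_problems(path.read_text(encoding="utf-8"))
        print("\n".join(found) if found else "client_credentials.json looks valid.")
    else:
        print("client_credentials.json not found in the current directory.")
```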
-
Upload Credentials to Secrets Manager:
- Run this command in a terminal at the project root to push the credentials to AWS Secrets Manager:
- For Mac:
aws secretsmanager create-secret \
  --name /myapp/client_credentials \
  --description "Client credentials for PDF services" \
  --secret-string file://client_credentials.json
- For Windows:
aws secretsmanager create-secret --name /myapp/client_credentials --description "Client credentials for PDF services" --secret-string file://client_credentials.json
- Run this command if you have already uploaded the keys and want to update them in Secrets Manager:
- For Mac:
aws secretsmanager update-secret \
  --secret-id /myapp/client_credentials \
  --description "Updated client credentials for PDF services" \
  --secret-string file://client_credentials.json
- For Windows:
aws secretsmanager update-secret --secret-id /myapp/client_credentials --description "Updated client credentials for PDF services" --secret-string file://client_credentials.json
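The create and update commands can also be combined into a single "create, or update if it already exists" step. The sketch below is an alternative to the CLI commands, not part of this project; it assumes boto3 is installed and configured with the same credentials, and `upsert_secret` is a hypothetical helper name.

```python
from pathlib import Path

SECRET_NAME = "/myapp/client_credentials"  # same name as in the CLI commands


def upsert_secret(client, secret_string: str, name: str = SECRET_NAME) -> str:
    """Create the secret, or update it if it already exists. Returns the action taken."""
    try:
        client.create_secret(
            Name=name,
            Description="Client credentials for PDF services",
            SecretString=secret_string,
        )
        return "created"
    except client.exceptions.ResourceExistsException:
        # The secret was uploaded earlier, so overwrite its value instead.
        client.update_secret(
            SecretId=name,
            Description="Updated client credentials for PDF services",
            SecretString=secret_string,
        )
        return "updated"


def main() -> None:
    import boto3  # assumption: installed separately, e.g. pip install boto3

    client = boto3.client("secretsmanager", region_name="us-east-1")
    payload = Path("client_credentials.json").read_text(encoding="utf-8")
    print(upsert_secret(client, payload))
```

Calling `main()` from the project root reads client_credentials.json and reports whether the secret was created or updated.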
-
Install the Requirements:
- For both Mac and Windows
-
pip install -r requirements.txt
-
Connect to ECR:
- Ensure Docker Desktop is running, then execute:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
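The `<ACCOUNT_ID>` placeholder in the command above is your 12-digit AWS account ID. As a small sketch, the registry host and the matching `docker login` arguments can be built as follows; the helper names are hypothetical, and the region default mirrors the us-east-1 setting used in this guide.

```python
import re
from typing import List


def ecr_registry(account_id: str, region: str = "us-east-1") -> str:
    """Return the ECR registry host for an account and region."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def docker_login_command(registry: str) -> List[str]:
    """Return the docker login arguments; the password is expected on stdin."""
    return ["docker", "login", "--username", "AWS", "--password-stdin", registry]


if __name__ == "__main__":
    # Placeholder account ID for illustration only.
    print(" ".join(docker_login_command(ecr_registry("123456789012"))))
```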
-
Windows only: Set an environment variable before deployment
- Windows users need to set an environment variable before deployment. This step ensures compatibility and prevents deployment issues.
- See Troubleshooting for more details.
set BUILDX_NO_DEFAULT_ATTESTATIONS=1
-
Deploy the CDK Stack:
- Deploy the stack to AWS:
cdk deploy
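If you prefer a single entry point for every platform, the Windows-only environment variable and the deploy command can be wrapped together. This is a sketch under stated assumptions rather than part of the project: the helper names are hypothetical, and it assumes `cdk` is on PATH.

```python
import os
import shutil
import subprocess
import sys
from typing import Dict, Mapping


def deploy_environment(base: Mapping[str, str], platform: str = sys.platform) -> Dict[str, str]:
    """Copy the environment, adding the Buildx setting on Windows only."""
    env = dict(base)
    if platform.startswith("win"):
        env["BUILDX_NO_DEFAULT_ATTESTATIONS"] = "1"
    return env


def deploy() -> int:
    """Run cdk deploy with the adjusted environment and return its exit code."""
    cdk = shutil.which("cdk")  # resolves cdk.cmd on Windows
    if cdk is None:
        raise FileNotFoundError("cdk was not found on PATH")
    return subprocess.run([cdk, "deploy"], env=deploy_environment(os.environ)).returncode
```

Calling `deploy()` from the project root is then equivalent to the manual steps on either platform.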
Once the infrastructure is deployed:
- Create a pdf/ folder in the S3 bucket created by the CDK stack.
- Upload a PDF file to the pdf/ folder in the S3 bucket.
- The process will automatically trigger and start processing the PDF.
- Use the CloudWatch dashboards created by the stack to monitor the progress and performance of the PDF processing pipeline.
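The upload step can also be scripted. The sketch below assumes boto3 is installed and that you pass in the name of the bucket created by the stack; the helper names are hypothetical. S3 has no real folders, so uploading an object whose key starts with pdf/ is enough to place it in that folder.

```python
from pathlib import PurePath


def pdf_object_key(local_path: str, prefix: str = "pdf/") -> str:
    """Return the S3 key that places the file under the pdf/ prefix."""
    name = PurePath(local_path).name
    if not name.lower().endswith(".pdf"):
        raise ValueError("expected a .pdf file")
    return prefix + name


def upload_pdf(s3_client, bucket: str, local_path: str) -> str:
    """Upload the file with an S3 client (e.g. boto3.client("s3")) and return its key."""
    key = pdf_object_key(local_path)
    s3_client.upload_file(local_path, bucket, key)
    return key
```

For example, `upload_pdf(boto3.client("s3"), "<your-bucket-name>", "report.pdf")` uploads the file as pdf/report.pdf.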
If you encounter any issues during setup or deployment, please check the following:
- Ensure all prerequisites are correctly installed and configured.
- Verify that your AWS credentials have the necessary permissions.
- Check CloudWatch logs for any error messages in the Lambda functions or ECS tasks.
- If cdk deploy responds with:
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases. Subprocess exited with error 9009
try changing "app": "python3 app.py" to "app": "python app.py" in the cdk.json file.
- If cdk deploy responds with:
Resource handler returned message: "The maximum number of addresses has been reached.
request additional IPs from AWS. Go to https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas and search for "IP". Then choose "EC2-VPC Elastic IPs". Note that the AWS region is included in the URL; change it to the region you are deploying into. Requests for additional IPs are usually completed within minutes.
- If any Docker images are not pushing to ECR, manually deploy to ECR using the push commands provided in the ECR console. Then manually update the ECS service by creating a new revision of the task definition and updating the image URI with the one just deployed.
- If you encounter issues with the 9th step, refer to the related discussion on the AWS CDK GitHub repository for further troubleshooting: CDK GitHub Issue
For further assistance, please open an issue in this repository.
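The cdk.json change for the "Python was not found" error can be made by hand or with a short script. This minimal sketch rewrites only the leading `python3` in the "app" entry; `use_python_command` is a hypothetical helper, and re-serializing the file will normalize its indentation.

```python
import json
from pathlib import Path


def use_python_command(cdk_json_text: str) -> str:
    """Return cdk.json text with a leading 'python3' in "app" replaced by 'python'."""
    config = json.loads(cdk_json_text)
    parts = str(config.get("app", "")).split()
    if parts and parts[0] == "python3":
        parts[0] = "python"
        config["app"] = " ".join(parts)
    return json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    path = Path("cdk.json")
    if path.exists():
        path.write_text(use_python_command(path.read_text(encoding="utf-8")), encoding="utf-8")
        print("Updated cdk.json.")
    else:
        print("cdk.json not found in the current directory.")
```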
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.