Experience the PDF Remediation solution developed at ASU AI Cloud Innovation Center. This innovative tool remediates PDF documents to meet WCAG 2.1 Level AA standards with tagging, metadata cleanup, and AI-powered alt-text generation, promoting digital accessibility for everyone.

ASUCICREPO/PDF_Accessibility

PDF Processing AWS Infrastructure

This project uses the AWS CDK (Cloud Development Kit) to build AWS infrastructure that splits a PDF into chunks, processes the chunks via AWS Step Functions, and merges the resulting chunks back using ECS tasks. The infrastructure also includes monitoring via CloudWatch dashboards and metrics for tracking progress.

Prerequisites

Before running the AWS CDK stack, ensure the following are installed and configured:

  1. Amazon Bedrock Access: Ensure your AWS account has access to the Claude 3.5 Sonnet v2 model in Amazon Bedrock.

  2. Adobe API Access: An enterprise-level contract or a trial account (for testing) for Adobe's API is required.

  3. Python (3.7 or later)

    • Download Python
    • Set up a virtual environment
      python -m venv .venv
      source .venv/bin/activate  # For macOS/Linux
      .venv\Scripts\activate     # For Windows
    • On Windows, also confirm the Python path in cmd before deploying by running:
      where python
  4. AWS CLI: To interact with AWS services and set up credentials.

  5. npm

    • npm is required to install AWS CDK. Install npm by installing Node.js.
    • Verify npm installation:
      npm --version
  6. AWS CDK: For defining cloud infrastructure in code.

  7. Docker: Required to build and run Docker images for the ECS tasks.

  8. AWS Account Permissions

    • Ensure permissions to create and manage AWS resources like S3, Lambda, ECS, ECR, Step Functions, and CloudWatch.
    • AWS IAM Policies and Permissions
    • Also, for ease of deployment, create an IAM user in the account you want to deploy to, attach administrator access to that user, and use that user's access key and secret key.
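
As a quick sanity check before deploying, the command-line prerequisites above can be looked up on the PATH. This is a minimal sketch, not part of the repository: the tool names in REQUIRED_TOOLS are assumptions taken from the list above (on some systems the interpreter is named python3 rather than python), and it only confirms that each executable can be found, not that its version is suitable.

```python
import shutil

# Assumed executable names, taken from the prerequisites list above.
# Adjust them for your platform (for example "python3" instead of "python").
REQUIRED_TOOLS = ["python", "aws", "npm", "cdk", "docker"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that could not be found on PATH."""
    return [tool for tool, path in find_tools(tools).items() if path is None]
```

Calling missing_tools() returns an empty list when every tool was found; any names it does return point to prerequisites that still need to be installed or added to the PATH.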

Directory Structure

Ensure your project has the following structure:

├── app.py (Main CDK app)
├── lambda/
│   ├── split_pdf/ (Python Lambda for splitting PDF)
│   └── java_lambda/ (Java Lambda for merging PDFs)
├── docker_autotag/ (Python Docker image for ECS task)
├── javascript_docker/ (JavaScript Docker image for ECS task)
└── client_credentials.json (The client ID and client secret for Adobe)
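
The layout above can be checked with a few lines of Python before running any CDK commands. This is an illustrative sketch only: the EXPECTED list simply mirrors the tree shown above, and missing_paths is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED = [
    "app.py",
    "lambda/split_pdf",
    "lambda/java_lambda",
    "docker_autotag",
    "javascript_docker",
    "client_credentials.json",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Running missing_paths(".") from the project root should return an empty list once client_credentials.json has been created.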

Setup and Deployment

  1. Clone the Repository:

    • Clone this repository containing the CDK code, Docker configurations, and Lambda functions.
  2. Set Up Your Environment:

    • Configure AWS CLI with your AWS account credentials:
      aws configure
    • Make sure the region is set to:
      us-east-1
      
  3. Set Up CDK Environment:

    • Bootstrap your AWS environment for CDK (run only once per AWS account/region):
      cdk bootstrap
      
  4. Create Adobe API Credentials:

    • Create a file called client_credentials.json in the root directory with the following structure:
      {
        "client_credentials": {
          "PDF_SERVICES_CLIENT_ID": "<Your client ID here>",
          "PDF_SERVICES_CLIENT_SECRET": "<Your secret ID here>"
        }
      }
    • Replace only the placeholders <Your client ID here> and <Your secret ID here> with the actual Client ID and Client Secret provided by Adobe; do not replace the whole file.
  5. Upload Credentials to Secrets Manager:

    • Run this command in a terminal at the project root to push the secret keys to Secrets Manager:
    • For Mac
      aws secretsmanager create-secret \
          --name /myapp/client_credentials \
          --description "Client credentials for PDF services" \
          --secret-string file://client_credentials.json
      
    • For Windows
      aws secretsmanager create-secret --name /myapp/client_credentials --description "Client credentials for PDF services" --secret-string file://client_credentials.json
    • If you have already uploaded the keys and would like to update them in Secrets Manager, run this command instead:
    • For Mac:
      aws secretsmanager update-secret \
          --secret-id /myapp/client_credentials \
          --description "Updated client credentials for PDF services" \
          --secret-string file://client_credentials.json
      
    • For Windows:
      aws secretsmanager update-secret --secret-id /myapp/client_credentials --description "Updated client credentials for PDF services" --secret-string file://client_credentials.json
  6. Install the Requirements:

    • For both Mac and Windows:
      pip install -r requirements.txt
  7. Connect to ECR:

    • Ensure Docker Desktop is running, then execute:
      aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
      
  8. Windows Only: Set an Environment Variable Once for Deployment:

    • Windows users need to set an environment variable before deployment. This step ensures compatibility and prevents deployment issues.
    • See Troubleshooting for more details.
      set BUILDX_NO_DEFAULT_ATTESTATIONS=1
      
  9. Deploy the CDK Stack:

    • Deploy the stack to AWS:
      cdk deploy
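
A common slip in step 4 is uploading client_credentials.json with the template placeholders still in place or with a key misspelled. The sketch below checks the file contents against the structure shown in step 4 before the secret is pushed. It is an assumption-laden illustration: credential_problems is a hypothetical helper, and treating any value wrapped in angle brackets as an unfilled placeholder is a heuristic, not an Adobe rule.

```python
import json

# Key names copied from the JSON structure shown in step 4.
REQUIRED_KEYS = ("PDF_SERVICES_CLIENT_ID", "PDF_SERVICES_CLIENT_SECRET")


def credential_problems(text):
    """Return a list of problems found in the JSON text; empty means none found."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    creds = data.get("client_credentials") if isinstance(data, dict) else None
    if not isinstance(creds, dict):
        return ['missing top-level "client_credentials" object']

    problems = []
    for key in REQUIRED_KEYS:
        value = creds.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} is missing or empty")
        elif value.startswith("<") and value.endswith(">"):
            # Heuristic: the template values look like "<Your client ID here>".
            problems.append(f"{key} still holds the template placeholder")
    return problems
```

For example, passing the contents of client_credentials.json (read with open("client_credentials.json").read()) should yield an empty list before running the create-secret or update-secret command.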
    

Usage

Once the infrastructure is deployed:

  1. Create a pdf/ folder in the S3 bucket created by the CDK stack.
  2. Upload a PDF file to the pdf/ folder in the S3 bucket.
  3. The process will automatically trigger and start processing the PDF.
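
The first two steps can also be done from a script. S3 has no real folders, so uploading an object whose key starts with pdf/ places it in that folder without creating it first. The sketch below assumes boto3 is installed and AWS credentials are configured; upload_pdf and pdf_object_key are hypothetical helpers, the bucket name must be the one created by your stack, and it is an assumption that the pipeline trigger matches on the pdf/ prefix as the steps above suggest.

```python
from pathlib import PurePosixPath


def pdf_object_key(local_path, prefix="pdf/"):
    """Build the S3 key that places the file under the pdf/ folder."""
    # Normalize Windows separators so only the file name is kept.
    name = PurePosixPath(str(local_path).replace("\\", "/")).name
    if not name.lower().endswith(".pdf"):
        raise ValueError(f"not a PDF file: {name}")
    return prefix + name


def upload_pdf(local_path, bucket):
    """Upload a PDF under pdf/ in the given bucket and return the object key."""
    # Imported lazily so pdf_object_key can be used without boto3 installed.
    import boto3

    key = pdf_object_key(local_path)
    boto3.client("s3").upload_file(str(local_path), bucket, key)
    return key
```

A call such as upload_pdf("report.pdf", "<your-bucket-name>") would store the file as pdf/report.pdf; the bucket placeholder must be replaced with the actual bucket name from the deployed stack.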

Monitoring

  • Use the CloudWatch dashboards created by the stack to monitor the progress and performance of the PDF processing pipeline.

Troubleshooting

If you encounter any issues during setup or deployment, please check the following:

  • Ensure all prerequisites are correctly installed and configured.
  • Verify that your AWS credentials have the necessary permissions.
  • Check CloudWatch logs for any error messages in the Lambda functions or ECS tasks.
  • If cdk deploy responds with: "Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases. Subprocess exited with error 9009", try changing "app": "python3 app.py" to "app": "python app.py" in the cdk.json file.
  • If cdk deploy responds with: Resource handler returned message: "The maximum number of addresses has been reached.", request additional IPs from AWS. Go to https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas and search for "IP". Then choose "EC2-VPC Elastic IPs". Note that the AWS region is included in the URL; change it to the region you are deploying into. Requests for additional IPs are usually completed within minutes.
  • If any Docker images are not pushing to ECR, manually deploy to ECR using the push commands provided in the ECR console. Then, manually update the ECS service by creating a new revision of the task definition and updating the image URI with the one just deployed. For further assistance, please open an issue in this repository.
  • If you encounter issues with step 9 (Deploy the CDK Stack), refer to the related discussion on the AWS CDK GitHub repository for further troubleshooting: CDK GitHub Issue
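
The cdk.json change suggested for the "Python was not found" error can be made by hand or with a small script. The following is a minimal sketch: use_python_command is a hypothetical helper, and it assumes the "app" entry starts with python3 as quoted in the error description above. It leaves every other setting in the file unchanged, although the output is re-serialized with two-space indentation.

```python
import json


def use_python_command(cdk_json_text):
    """Return cdk.json text with a leading "python3" in "app" replaced by "python"."""
    config = json.loads(cdk_json_text)
    parts = config.get("app", "").split()
    if parts and parts[0] == "python3":
        parts[0] = "python"
        config["app"] = " ".join(parts)
    return json.dumps(config, indent=2)
```

To apply it, read cdk.json, pass its contents through use_python_command, and write the result back, for example with pathlib.Path("cdk.json").read_text() and write_text().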

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.
