Fix shared asset bundler error for build on windows and large file type batch job #254

Merged (4 commits) on Dec 15, 2023

@QuinnGT
Contributor

QuinnGT commented Dec 3, 2023

Issues: #212 and #185

Description of changes:

Fix 1: Minor change to fix the shared asset bundler error on Windows. A full build now succeeds in both Windows PowerShell and macOS bash.

Fix 2: Upgraded Unstructured.io to 0.11.2, changed the instance type to m6a.large (cheaper and more performant), increased the container memory to 2048 and the vCPU count to 2, and added retry attempts in case the container errors out.

Tested by uploading a batch of files. Old version: 107 uploads, 19 failures. New version: 107 uploads, 0 failures.
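The PR text does not show the job definition itself, so the following is only a sketch of how the Fix 2 settings could be expressed as an AWS Batch job definition request in Python. That the job runs on AWS Batch, the job name, the image URI, and the number of retry attempts are all assumptions; only the memory (2048) and vCPU (2) values come from the description above. The instance type (m6a.large) would belong to the compute environment and is not shown.

```python
import json


def build_job_definition(image_uri: str, attempts: int = 3) -> dict:
    """Build keyword arguments in the shape of an AWS Batch RegisterJobDefinition request."""
    return {
        "jobDefinitionName": "file-import-job",  # hypothetical name, not from the PR
        "type": "container",
        "containerProperties": {
            "image": image_uri,
            "resourceRequirements": [
                {"type": "VCPU", "value": "2"},       # vCPU count from the PR description
                {"type": "MEMORY", "value": "2048"},  # AWS Batch reads this value as MiB
            ],
        },
        # Retry the job if the container exits with an error.
        # The attempt count is an assumption; the PR does not state it.
        "retryStrategy": {"attempts": attempts},
    }


# The image URI below is a placeholder, not the project's real image.
job_definition = build_job_definition("example.invalid/unstructured:0.11.2")
print(json.dumps(job_definition, indent=2))
```

With AWS credentials configured, a dictionary like this could be passed as `boto3.client("batch").register_job_definition(**job_definition)`.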

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@QuinnGT QuinnGT changed the title from "Fix shared asset bundler error for build on windows" to "Fix shared asset bundler error for build on windows and large file type batch job" on Dec 3, 2023
@alexeyshishkin01

Hello all! I have tested the Docker image creation.

As you can see, the latest image is really huge: 19GB.
Even the older image was 9GB, which is also big.
Downloading 9GB when you deploy the application already takes some time, and now it will be roughly double the size and double the time.

root@wsl2:~/docker# docker images
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
0.11.2       latest   3b15124ac785   4 days ago     19.3GB
0.10.20      latest   62c7d5f47207   7 weeks ago    9.3GB
0.10.19      latest   a2c34220d04f   2 months ago   9.1GB

I wonder why the Docker image is so big?
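To put a number on the growth in the listing above, here is a small Python sketch that compares the reported sizes. The sizes are copied from the `docker images` output; the function and variable names are illustrative only.

```python
# Image sizes in GB, copied from the `docker images` listing above.
sizes_gb = {"0.10.19": 9.1, "0.10.20": 9.3, "0.11.2": 19.3}


def growth_factor(old: str, new: str) -> float:
    """Return how many times larger the `new` image is than the `old` one."""
    return sizes_gb[new] / sizes_gb[old]


factor = growth_factor("0.10.20", "0.11.2")
print(f"0.11.2 is {factor:.2f}x the size of 0.10.20")
```

The ratio comes out slightly above 2, which matches the "double size" estimate.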

@QuinnGT
Contributor Author

QuinnGT commented Dec 4, 2023

It’s due to the new version of Unstructured. Luckily this isn’t a Lambda function anymore, so we can use the container, but I agree that it’s quite large. I’ll look into why that is.

@QuinnGT
Contributor Author

QuinnGT commented Dec 5, 2023

I pulled the latest Unstructured Docker image and confirmed its base size is 19.28GB. A lot has changed in the 11 months since the original image. If we wanted to limit the types of files to parse, we could build our own Docker image and shrink the size.

IMO, this pull request solves the ingestion errors and a new issue/task should be created for looking at improving the document ingestion process as a whole. Unstructured is by far the best tool for processing any type of document and I don't think we are using the full features. Something to consider on the next version of the app.

Update to latest version
@alexeyshishkin01

Thanks. I tried to build a custom Docker image based on the mentioned Dockerfile: https://github.com/Unstructured-IO/unstructured/blob/main/Dockerfile

My Dockerfile was really simple:
FROM quay.io/unstructured-io/base-images:rocky9.2-8@sha256:68b11677eab35ea702cfa682202ddae33f2053ea16c14c951120781a2dcac1b2 as base
USER root

The resulting image is almost 12GB:
root@wsl2:~/docker# docker images
REPOSITORY   TAG      IMAGE ID       CREATED       SIZE
<none>       <none>   1468c6b2beef   3 weeks ago   11.8GB

They build their image on top of a base image that is already 11.8GB.
I wonder what the base image contains to make it so huge?

@QuinnGT
Contributor Author

QuinnGT commented Dec 15, 2023

Can we get this PR merged? I'd like to submit additional PRs and continue work.

@QuinnGT
Contributor Author

QuinnGT commented Dec 15, 2023

It's all of the packages used in unstructured. It's why it can literally parse and split just about any type of file. It is bloated though. They should've split it into individual images based on function desired.

Check out these image layers. They pull in NVIDIA CUDA drivers, LibreOffice, tesseract-ocr, etc.

Layers by size:

First large layer

RUN /bin/sh -c dnf -y update && dnf -y upgrade && dnf -y install poppler-utils xz-devel wget tar make which mailcap dnf-plugins-core compat-openssl11 && ARCH=$(uname -m) && if [[ "$ARCH" == "x86_64" ]] || [[ "$ARCH" == "amd64" ]]; then dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo; dnf -y install cuda-11-8; else dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/cross-linux-sbsa/cuda-rhel9-cross-linux-sbsa.repo && dnf -y install cuda-cross-sbsa-11-8; fi && dnf -y install epel-release && cp /etc/yum.repos.d/rocky-devel.repo /etc/yum.repos.d/Rocky-Devel.repo && dnf config-manager --enable crb && sed -i 's/enabled=0/enabled=1/g' /etc/yum.repos.d/Rocky-Devel.repo && ARCH=$(uname -m) && if [ "$ARCH" == "aarch64" ]; then ARCH="arm64"; else ARCH="amd64"; fi && wget https://github.com/jgm/pandoc/releases/download/3.1.9/pandoc-3.1.9-linux-$ARCH.tar.gz && tar xvzf pandoc-3.1.9-linux-$ARCH.tar.gz --strip-components 1 -C '/usr/local' && rm -rf pandoc-3.1.9-linux-$ARCH.tar.gz && dnf -y install libreoffice-writer libreoffice-base libreoffice-impress libreoffice-draw libreoffice-math libreoffice-core && sed -i 's/enabled=1/enabled=0/g' /etc/yum.repos.d/Rocky-Devel.repo && rm -rf /var/cache/yum/* && rm -f /etc/yum.repos.d/Rocky-Devel.repo && dnf clean all # buildkit

Second
RUN /bin/sh -c set -ex && dnf install -y opencv opencv* zlib zlib-devel perl-core clang libpng libpng-devel libtiff libtiff-devel libwebp libwebp-devel libjpeg libjpeg-devel libjpeg-turbo-devel git-core libtool pkgconfig xz && wget https://github.com/DanBloomberg/leptonica/releases/download/1.83.1/leptonica-1.83.1.tar.gz && tar -xzvf leptonica-1.83.1.tar.gz && cd leptonica-1.83.1 || exit && ./configure && make && make install && cd .. && wget http://mirror.squ.edu.om/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz && tar -xvf autoconf-archive-2017.09.28.tar.xz && cd autoconf-archive-2017.09.28 || exit && ./configure && make && make install && cp m4/* /usr/share/aclocal && cd .. && git clone --depth 1 --branch 5.3.3 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr && cd tesseract-ocr || exit && export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig && ./autogen.sh && ./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/ --with-extra-includes=/usr/local/lib/ && make && make install && cd .. && git clone https://github.com/tesseract-ocr/tessdata.git && cp tessdata/*.traineddata /usr/local/share/tessdata && rm -rf /tesseract-ocr /tessdata /autoconf-archive-2017.09.28* /leptonica-1.83.1* && dnf -y remove opencv* perl-core clang libpng-devel libtiff-devel libwebp-devel libjpeg-devel libjpeg-turbo-devel git-core libtool zlib-devel pkconfig xz && rm -rf /var/cache/yum/* && rm -rf /tmp/* && dnf clean all # buildkit

Third
RUN /bin/sh -c python3.10 -m pip install pip==${PIP_VERSION} && dnf -y groupinstall "Development Tools" && find requirements/ -type f -name "*.txt" -exec python3 -m pip install --no-cache -r '{}' ';' && dnf -y groupremove "Development Tools" && dnf clean all # buildkit

@alexeyshishkin01

Many thanks. Could you share the link for the layer commands you mentioned?
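One way to obtain a per-layer listing like the one quoted earlier in this thread is `docker history`. The Python sketch below is an assumption about how such a listing could be produced, not necessarily how the layers above were obtained, and the helper names are hypothetical. It relies on the `--no-trunc`, `--human=false`, and `--format` options of `docker history`.

```python
import subprocess
from typing import List, Tuple


def parse_history(output: str) -> List[Tuple[int, str]]:
    """Parse 'size<TAB>command' lines and return layers sorted largest first."""
    layers = []
    for line in output.splitlines():
        size, sep, created_by = line.partition("\t")
        if not (sep and size.isdigit()):
            continue  # skip blank lines and continuation lines of multi-line commands
        layers.append((int(size), created_by))
    return sorted(layers, key=lambda layer: layer[0], reverse=True)


def layers_by_size(image: str) -> List[Tuple[int, str]]:
    """Run `docker history` on a locally available image (requires the Docker CLI)."""
    result = subprocess.run(
        ["docker", "history", "--no-trunc", "--human=false",
         "--format", "{{.Size}}\t{{.CreatedBy}}", image],
        capture_output=True, text=True, check=True,
    )
    return parse_history(result.stdout)


# Made-up sample output, used only to show the parsing step without Docker.
sample = "0\tUSER root\n2048\tRUN dnf -y install example\n512\tCOPY . /app\n"
largest_first = parse_history(sample)
print(largest_first[0])
```

Calling `layers_by_size(...)` with the name or ID of an image that has already been pulled would return every layer with its size in bytes and the command that created it, largest first.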

@bigadsoleiman bigadsoleiman merged commit 976784d into aws-samples:main Dec 15, 2023