Fix shared asset bundler error for build on windows and large file type batch job #254
Hello, all! I have tested the Docker image creation. You can see that the latest image is really huge: 19GB. root@wsl2:~/docker# docker images I wonder why the Docker image is so big?
It’s due to the new version of Unstructured. Luckily, it isn’t a Lambda function anymore, so we can use the container, but I agree that it’s quite large. I’ll look into why that is.
I pulled the latest Unstructured Docker image and confirmed its base size is 19.28GB. A lot has changed in the 11 months since the original image. If we wanted to limit the types of files to parse, we could build our own Docker image and shrink the size. IMO, this pull request solves the ingestion errors, and a new issue/task should be created to look at improving the document ingestion process as a whole. Unstructured is by far the best tool for processing any type of document, and I don't think we are using its full feature set. Something to consider for the next version of the app.
Update to latest version
Thanks. I tried to create a custom Docker image from the mentioned https://github.com/Unstructured-IO/unstructured/blob/main/Dockerfile. My Dockerfile was really simple: The resulting image is almost 12GB: They build the new image from a base image whose size is 11.8GB.
Can we get this PR merged? I'd like to submit additional PRs and continue work.
It's all of the packages used in Unstructured. That's why it can literally parse and split just about any type of file. It is bloated, though. They should've split it into individual images based on the function desired. Check out these image layers. They are pulling NVIDIA CUDA drivers, LibreOffice, tesseract-ocr, etc. Layers by size: [screenshots of the first, second, and third largest layers]
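For readers who want to reproduce a layers-by-size listing, here is a minimal Python sketch. It assumes the Docker CLI is installed and relies on `docker history` with a `--format` template; the helper names (`parse_size`, `layers_by_size`, `docker_history`) are hypothetical and are not taken from this thread, and the commenter may well have used a different tool.

```python
import subprocess

# Decimal multipliers for the unit suffixes Docker prints in human-readable sizes.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a size string such as '2GB' or '0B' to a number of bytes."""
    text = text.strip()
    # Check the two-letter suffixes first so 'GB' is not mistaken for 'B'.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return float(text[: -len(unit)]) * _UNITS[unit]
    raise ValueError(f"unrecognized size: {text!r}")


def layers_by_size(history_lines):
    """Sort 'SIZE<TAB>CREATED_BY' lines from the largest layer to the smallest."""
    rows = []
    for line in history_lines:
        if not line.strip():
            continue  # skip blank lines
        size, _, created_by = line.partition("\t")
        rows.append((parse_size(size), created_by.strip()))
    return sorted(rows, key=lambda row: row[0], reverse=True)


def docker_history(image):
    """Run `docker history` for an image and return its output lines."""
    cmd = [
        "docker", "history", "--no-trunc",
        "--format", "{{.Size}}\t{{.CreatedBy}}",
        image,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Calling `layers_by_size(docker_history(image_name))[:3]` would then give the three largest layers along with the build step that created each one.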
Many thanks. Could you share a link to the commands you used to list the layers?
Issue #212: and Issue #185:
Description of changes:
Fix 1: Pushed a minor change to fix the shared asset bundler error on Windows. Full build is successful in Windows PowerShell and Mac bash.
Fix 2: Upgraded Unstructured.io to 0.11.2, changed the instance type to the cheaper, more performant m6a.large, and increased container memory to 2048 and the vCPU count to 2. Also added retry attempts in case the container errors out.
Tested batch file uploads. Old version: 107 uploads, 19 failures. New version: 107 uploads, 0 failures.
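As an illustration of the Fix 2 container settings, here is a minimal Python sketch of how they might look as an AWS Batch job definition request. It assumes the job runs on AWS Batch and that the memory value is in MiB; the job definition name, the image reference, and the default of 3 retry attempts are hypothetical, since the description does not state them. The repository's actual infrastructure code may express this differently.

```python
def build_job_definition(image, attempts=3):
    """Build keyword arguments for an AWS Batch job definition.

    The vCPU and memory values come from the PR description; the name,
    image, and default attempt count are placeholders for illustration.
    """
    # AWS Batch accepts between 1 and 10 attempts in a retry strategy.
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    return {
        "jobDefinitionName": "file-import-job",  # hypothetical name
        "type": "container",
        "containerProperties": {
            "image": image,
            "resourceRequirements": [
                {"type": "VCPU", "value": "2"},
                {"type": "MEMORY", "value": "2048"},  # assumed to be MiB
            ],
        },
        # Re-run the job if the container exits with an error.
        "retryStrategy": {"attempts": attempts},
    }
```

The resulting dictionary has the shape expected by boto3's Batch `register_job_definition` call. The m6a.large instance type does not appear here because, on AWS Batch, instance types are configured on the compute environment and not on the job definition.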
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.