-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build: Allow multiple PDFs #10438
base: main
Are you sure you want to change the base?
Build: Allow multiple PDFs #10438
Conversation
…ature/multiple-pdfs
…ature/multiple-pdfs
Will be following up with test case updates and some additional manual testing (both with and without the feature flag). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I need to do a deeper review and local QA still, tho.
In the meanwhile, I have some questions to understand what's the status of the PR and what is the final goal:
- what are the names of the resulting PDF files when the project has multiple PDFs using our Sphinx builder? (I guess they will be
<project-slug>_<counter>.pdf
. However, I'd like the user to be able to define this by using thelatex_documents
Sphinx config) - does this PR add support for multiple HTMLZip and ePUB as well? (I guess it doesn't)
- what are the resulting URLs when there are multiple PDFs generated?
- what's the API response for a particular version? (see https://docs.readthedocs.io/en/stable/api/v3.html#get--api-v3-projects-(string-project_slug)-versions-(string-version_slug)-)
I think these questions are important since they will tell us how we are going to move forward with this PR. These decisions will make the future integration of the backend easier with the front end.
Also, if projects generating multiple PDF files are not told those cannot be easily accessed by the users, it looks broken to me and it's something I'd prefer to not expose to users yet.
# Isolate temporary files in the _readthedocs/ folder | ||
# Used for new feature ENABLE_MULTIPLE_PDFS | ||
self.absolute_host_tmp_root = os.path.join( | ||
self.project.checkout_path(self.version.slug), | ||
"_readthedocs/tmp", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need a temporary directory we should mktemp --directory
as we are doing in other parts of this code. I prefer to keep _readthedocs/
clean since it's a directory where we output files we are going to serve.
@@ -132,13 +132,13 @@ def _check_suspicious_path(self, path): | |||
def _rclone(self): | |||
raise NotImplementedError | |||
|
|||
def rclone_sync_directory(self, source, destination): | |||
def rclone_sync_directory(self, source, destination, **sync_kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we pass filter_extensions
here directly instead of a dictionary without knowing exactly what it has inside?
|
||
This method mostly exists so we have a pattern that is test-friend (can be mocked). | ||
""" | ||
tex_files = glob(os.path.join(self.absolute_host_output_dir, f"*.{extension}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line could be combined with _get_epub_files_generated()
and make it generic.
""" | ||
tex_files = glob(os.path.join(self.absolute_host_output_dir, f"*.{extension}")) | ||
if not tex_files: | ||
raise BuildUserError("No *.{extension} files were found.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build error string should be an attribute like BuildUserError.NO_ARTIFACT_FILES_FOUND
or similar.
|
||
# Run LaTeX -> PDF conversions | ||
success = self._build_latexmk(self.project_path) | ||
|
||
self._post_build() | ||
if self.project.has_feature(Feature.ENABLE_MULTIPLE_PDFS): | ||
self._post_build_multiple() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._post_build_multiple() | |
self._post_build_multiple_pdf() |
# There is only 1 PDF file. We will call it project_slug.pdf | ||
# This is the old behavior. | ||
if len(pdf_files) == 1: | ||
os.rename( | ||
os.path.join(self.absolute_host_output_dir, pdf_files[0]), | ||
os.path.join( | ||
self.absolute_host_output_dir, f"{self.project.slug}.pdf" | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand why we would like to keep "the old behavior here". I think we will always want "the new behavior" for projects using this feature. However, I'm not sure to understand what would be the filename of the resulting PDF in the new behavior --but it should be the name the user has defined in the docs/conf.py
(see my other comment about -jobname
).
# We cannot use '*' in commands sent to the host machine, the asterisk gets escaped. | ||
# So we opt for iterating from outside the container | ||
pdf_file_names = [] | ||
for fname in pdf_files: | ||
shutil.move( | ||
fname, | ||
os.path.join(self.absolute_host_tmp_root, os.path.basename(fname)), | ||
) | ||
pdf_file_names.append(os.path.basename(fname)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably use self.run(..., escape=False)
or similar if we want to avoid this.
# Map media types to their know extensions. | ||
# This is used for validating and uploading artifacts. | ||
MEDIA_TYPES_EXTENSIONS = { | ||
MEDIA_TYPE_PDF: ("pdf",), | ||
MEDIA_TYPE_EPUB: ("epub",), | ||
MEDIA_TYPE_HTMLZIP: ("zip",), | ||
MEDIA_TYPE_JSON: ("json",), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using a tuple in the value here because we want to have multiple extensions per each media type?
@app.task(queue="web") | ||
def sync_downloadable_artifacts( | ||
version_pk, commit, build, artifacts_found_for_download | ||
): | ||
""" | ||
Create ImportedFile objects for downloadable files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this task is part of the search tasks. I understand it's not related with search itself, but more with the build process, in my opinion. So, I would probably move it to projects.tasks.utils
or similar.
ImportedFile.objects.create( | ||
name=name, | ||
project=version.project, | ||
version=version, | ||
path=fpath, | ||
commit=commit, | ||
build=build, | ||
ignore=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only "creating new objects here", but don't we need to delete the old ImportedFile
for this particular version. Otherwise, wouldn't be exposing "old filenames" to the user?
Reviewing notes
This is intended to be a very limited change to our builders, whereby we can enable the Feature flag for a couple of projects and have PDFs build with the new method.
The filename for single-file projects is left intact, which should mean that we can switch on the feature for a couple of live-projects as well, and the generated PDFs should be served through the current URLs and APIs.
Would be great with some after-thoughts on using the ImportedFile model.
TODO
Fixes #10424
Fixes #2045
trigger_sync_downloadable_artifacts
APIs will be added in a separate PR.
Other aspects covered