Perform on-host conversion for the pixels to PDF stage #748
This is pretty incredible. Congrats! 🥳 A lot of work went before this and now this feels like the cherry on top. I have some minor code improvement suggestions.
What I still have to do:
- test on windows and macOS
Other observations:
- thanks for removing the dead code!
- the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.
- ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?
- dummy can have pixels_to_pdf removed
- Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines by one saying "converting page X". And when OCR is used we could say "making page X searchable"
ae9090d to 8884cb8
I'll reply to some of your observations as well:
In my Fedora 39 dev environment, the GUI seems to work. Can you provide the error log?
I was thinking of either reusing PyMuPDF within the container, or using Tesseract just for Ubuntu Focal. I'll let you know.
Yeap, you're right.
Yeap, you're right.
da0dd54 to 10522c2
I worked on this. The code is in the branch. On macOS it seems to be failing, but I haven't had time to investigate. If you have the chance before me, feel free to continue where I left off, @apyrgio.
48eba2b to 4d70bd9
8f918c8 to 3125a59
Add a new way to detect where the Tesseract data are stored in a user's system. On Linux, the Tesseract data should be installed via the package manager. On macOS and Windows, they should be bundled with the Dangerzone application. There is also the exception of running Dangerzone locally, where even on Linux, we should get the Tesseract data from the Dangerzone share/ folder.
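The lookup order described in this commit message could be sketched roughly as follows. This is only an illustration of the idea: the function name `get_tessdata_dir`, the `running_from_source` flag, and the Linux candidate paths are assumptions made for the example, not the code in this PR.

```python
import platform
from pathlib import Path
from typing import Iterable, Optional

# Example system-wide locations; the real paths depend on the distribution (assumption).
DEFAULT_LINUX_CANDIDATES = (
    Path("/usr/share/tessdata"),
    Path("/usr/share/tesseract/tessdata"),
)


def get_tessdata_dir(
    share_dir: Path,
    running_from_source: bool,
    system: Optional[str] = None,
    linux_candidates: Iterable[Path] = DEFAULT_LINUX_CANDIDATES,
) -> Path:
    """Return the directory expected to hold Tesseract language data."""
    system = system or platform.system()

    # Local runs, as well as the macOS and Windows bundles, use share/tessdata.
    if running_from_source or system in ("Darwin", "Windows"):
        return share_dir / "tessdata"

    # On Linux, the data should come from the package manager.
    for candidate in linux_candidates:
        if candidate.is_dir():
            return candidate
    raise RuntimeError("Tesseract language data not found on this system")
```

Passing `system` and `linux_candidates` explicitly keeps the function easy to test on a single platform.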
The PyMuPDF package was previously mainly used within the Dangerzone container, as well as on Qubes. With on-host conversion, PyMuPDF will be used in all supported platforms by default. For this reason, we can promote it to a main dependency.
Update .deb/.rpm specs to include PyMuPDF as a required package.
Extend the base isolation provider to immediately convert each page to a PDF, and optionally use OCR. In contrast with the way we did things previously, there are no longer two separate stages (document to pixels, pixels to PDF). We now handle each page individually, for two main reasons: 1. We don't want to buffer pixel data, either on disk or in memory, since they take a lot of space, and can potentially leave traces. 2. We can perform these operations in parallel, saving time. This is more evident when OCR is not used, where the time to convert a page to pixels and the time to convert it back to a PDF are comparable.
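A minimal sketch of the page-at-a-time idea, under stated assumptions: the stream framing used here (a 2-byte big-endian page count, then per page a 2-byte width, a 2-byte height and raw RGB bytes) and the names `convert_pages` and `to_pdf` are hypothetical. In the PR, the role of `to_pdf` is played by PyMuPDF, optionally with OCR.

```python
import struct
from typing import BinaryIO, Callable, Iterator

# Hypothetical page converter: (width, height, rgb_bytes) -> PDF bytes for one page.
PageConverter = Callable[[int, int, bytes], bytes]


def read_exact(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes from a buffered stream, or raise on early EOF."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError(f"expected {size} bytes, got {len(data)}")
    return data


def convert_pages(stream: BinaryIO, to_pdf: PageConverter) -> Iterator[bytes]:
    """Yield one converted page at a time while the producer keeps writing."""
    (n_pages,) = struct.unpack(">H", read_exact(stream, 2))
    for _ in range(n_pages):
        width, height = struct.unpack(">HH", read_exact(stream, 4))
        # RGB data, 3 bytes per pixel. Only the current page is held in memory;
        # the name is rebound on the next iteration, so the old buffer can be freed.
        pixels = read_exact(stream, width * height * 3)
        yield to_pdf(width, height, pixels)
```

Because the generator consumes the stream lazily, the host can convert page N while the sandbox is still rendering page N+1, and no pixel data needs to be written to disk.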
Move the logic for grabbing debug logs to a new place, now that we have merged the two conversion stages (doc to pixels, pixels to PDF).
Make the Dummy isolation provider follow the rest of the isolation providers and perform the second part of the conversion on the host. The first part of the conversion is just a dummy script that reads a file from stdin and prints pixels to stdout.
ef45fb4 to 1302a1f
The PR is ready for review once more. The commit messages may require a bit more ❤️ and
Awesome work Alex! I've tested the branch locally and it works (macOS m1), congrats 👍 🎉
In addition to the review comments I left inline, I believe we could check that the tesseract data is present before asking PyMuPDF to use it, and disable OCR if it is not. Right now, the conversion fails if the data is not installed (which should not happen, but I believe this is the right time to handle it).
I see two ways of doing this:
- Show a warning next to the OCR setting, mentioning that the tesseract data is not installed (for the selected language?)
- If no tesseract data is detected, remove the OCR setting and put a warning instead.
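Either option needs a presence check first. A minimal sketch, assuming the language data follows Tesseract's usual `<lang>.traineddata` file naming; the helper names `ocr_language_available` and `resolve_ocr` are hypothetical and not part of Dangerzone:

```python
from pathlib import Path
from typing import Optional, Tuple


def ocr_language_available(tessdata_dir: Path, lang: str) -> bool:
    """Return True if `<lang>.traineddata` exists in `tessdata_dir`."""
    return (tessdata_dir / f"{lang}.traineddata").is_file()


def resolve_ocr(
    tessdata_dir: Path, lang: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (language to use for OCR or None, warning to show or None)."""
    if lang is None:
        # OCR was not requested, so there is nothing to warn about.
        return None, None
    if ocr_language_available(tessdata_dir, lang):
        return lang, None
    # Fall back to a non-searchable PDF instead of failing the conversion.
    return None, f"Tesseract data for '{lang}' is not installed; OCR is disabled"
```

The returned warning could be displayed next to the OCR setting (first option) or used to replace the setting altogether (second option).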
text = (
    f"Converting page {page}/{n_pages} from pixels to {searchable}PDF"
)
percentage += step
self.print_progress(document, False, text, percentage)
Should it be the responsibility of self.print_progress to actually decide on the message to be shown?
I think so, yeah. Any issue you see with this approach?
Sorry, my previous message wasn't clear :-) I'm thinking about passing a different context to print_progress so that it actually prints the progress itself. Here the message is computed outside of this function, and I'm thinking we could do this inside the method instead.
Hm, something like passing a context that holds the total number of pages and the current page, so that print_progress() can construct the proper "Converting page..." message? I guess we can, but we'll still need to call print_progress() with an arbitrary message, e.g., in case of errors. Is there another benefit that we gain from this?
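For reference, the context-based variant discussed in this thread might look something like the sketch below. The `PageProgress` class and the `progress_text` helper are hypothetical, and the value of the `searchable` prefix is an assumption; only the message template is taken from the snippet under review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageProgress:
    """Hypothetical context describing which page is being converted."""

    page: int
    n_pages: int
    ocr: bool = False

    def message(self) -> str:
        # Assumed prefix; the reviewed code only shows the `{searchable}` placeholder.
        searchable = "searchable " if self.ocr else ""
        return (
            f"Converting page {self.page}/{self.n_pages} "
            f"from pixels to {searchable}PDF"
        )


def progress_text(
    context: Optional[PageProgress] = None, text: Optional[str] = None
) -> str:
    """Use an explicit message (e.g. for errors), else build one from the context."""
    if text is not None:
        return text
    if context is not None:
        return context.message()
    raise ValueError("either a context or a text message is required")
```

This keeps the arbitrary-message path that error reporting needs, while moving the page message formatting out of the caller.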
install/common/download-tessdata.py
Outdated
files = {f.name for f in tessdata_dir.iterdir()}
if files == expected_files:
    msg = "> Skipping tessdata download, language data already exists"
    print(msg, file=sys.stderr)
As mentioned elsewhere, we might want to use the logging module here instead.
Did so in c37ff73.
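As an illustration of the suggestion (not necessarily what c37ff73 does), the same check could use the standard logging module along these lines. The logger name and the `should_skip_download` helper are assumptions made for the example.

```python
import logging
import sys
from pathlib import Path
from typing import Set

log = logging.getLogger("download-tessdata")  # logger name is an assumption


def should_skip_download(tessdata_dir: Path, expected_files: Set[str]) -> bool:
    """Return True when the directory already holds exactly the expected files."""
    if not tessdata_dir.is_dir():
        return False
    files = {f.name for f in tessdata_dir.iterdir()}
    if files == expected_files:
        log.info("Skipping tessdata download, language data already exists")
        return True
    return False


if __name__ == "__main__":
    # Keep the previous behaviour: messages go to stderr with a "> " prefix.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO, format="> %(message)s")
```

Configuring the handler once at the entry point lets callers change verbosity or redirect output without touching each message.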
cmd = ["poetry", "export", "--only", "container"]
container_requirements_txt = subprocess.check_output(cmd)

# XXX: Hack for Ubuntu Focal.
🫣 You probably have this on your mind already, but it's probably worth adding a comment here about why this is.
I did so actually in the #940 PR. Let me know what you think there.
tests/isolation_provider/base.py
Outdated
@@ -46,7 +46,7 @@ def test_max_pages_client_enforcement(
         doc = Document(sample_doc)
         p = provider.start_doc_to_pixels_proc(doc)
         with pytest.raises(errors.MaxPagesException):
-            provider.doc_to_pixels(doc, tmpdir, p)
+            provider._convert(doc, None, p)
It seems weird to test private methods here. Should we rename it to a public method instead?
Fixed in d9eaec4.
This reverts commit 6a5b6e4.
This PR introduces a fundamental change in the way Dangerzone processes documents. Instead of first grabbing all of the pixel data from the first container, storing them on disk, and then reconstructing the PDF in a second container, Dangerzone now immediately reconstructs the PDF on the host, while the doc to pixels conversion is still running in the first container. The sanitization is no less safe, since the boundaries between the sandbox and the host are still respected.
What we gain is that we no longer use mounts, and we have much faster conversions, especially on Windows and macOS.
Fixes #625
Note
This PR still has some rough edges. Off the top of my head, we need to:
- Remove the tool.poetry.group.container.dependencies section from pyproject.toml, as it's duplicated info.
- --userns keep-id option in Podman.
- download-tessdata.py cacheable in our CI runs.
- share/tessdata in our .debs / .rpms.
- ARCHITECTURE.md, which will be the source of truth on how Dangerzone works now.
All these cannot be tackled in a single PR, but we at least need to have issues for the ones we won't tackle immediately, before merging this PR.