Perform on-host conversion for the pixels to PDF stage #748

apyrgio · 2024-03-14T11:37:27Z

This PR introduces a fundamental change in the way Dangerzone processes documents. Instead of first grabbing all of the pixel data from the first container, storing them on disk, and then reconstructing the PDF on a second container, Dangerzone now immediately reconstructs the PDF on the host, while the doc to pixels conversion is still running on the first container. The sanitization is no less safe, since the boundaries between the sandbox and the host are still respected.

What we gain is that we no longer use mounts, and we have much faster conversions, especially on Windows and macOS.

Fixes #625

Note

This PR still has some rough edges. Off the top of my head, we need to:

- Test the changes across all of our supported platforms, and fix all of our CI errors.
- ~~Remove tool.poetry.group.container.dependencies section from pyproject.toml, as it's duplicated info.~~
  - Actually, it still has its uses
- Remove --userns keep-id option in Podman.
- Make download-tessdata.py cacheable in our CI runs.
- Turn OCR language deps into recommendations in Linux systems, and handle the case where some are not installed.
- Improve our Dummy isolation provider, so that the steps that run in the host actually run in our Windows / macOS CI runners.
- Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.
- Update our wording in various places, so that we no longer refer to using two containers for the sanitization.
- Draft an ARCHITECTURE.md, which will be the source of truth on how Dangerzone works now.

Not all of these can be tackled in a single PR, but we at least need to have issues for the ones we won't tackle immediately, before merging this PR.

install/linux/dangerzone.spec

deeplow

This is pretty incredible. Congrats! 🥳 A lot of work went before this and now this feels like the cherry on top. I have some minor code improvement suggestions.

What I still have to do:

test on Windows and macOS

Other observations:

- thanks for removing the dead code!
- the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.
- ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?
- dummy can have pixels_to_pdf removed
- Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines by one saying "converting page X". And when OCR is used we could say "making page X searchable"

dangerzone/isolation_provider/base.py

dangerzone/conversion/common.py

dangerzone/isolation_provider/base.py

dangerzone/conversion/pixels_to_pdf.py

dangerzone/isolation_provider/container.py

dangerzone/isolation_provider/base.py

apyrgio · 2024-03-27T15:03:25Z

I'll reply to some of your observations as well:

> the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.

In my Fedora 39 dev environment, the GUI seems to work. Can you provide the error log?

> ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?

I was thinking of either reusing PyMuPDF within the container, or using Tesseract just for Ubuntu Focal. I'll let you know.

> dummy can have pixels_to_pdf removed

Yeap, you're right.

> Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines by one saying "converting page X". And when OCR is used we could say "making page X searchable"

Yeap, you're right.

install/linux/dangerzone.spec

deeplow · 2024-03-28T17:23:34Z

> Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.

I worked on this. The code is in the branch 625-host-stream-tessdata-packaging. A lot of stuff had to be moved and I didn't manage to finish testing this week. I tested on Fedora and Debian and it seems to be building fine. The only thing is that it includes the .gitkeep in share/container.

On macOS it seems to be failing but I haven't had time to investigate. If you have the chance before me, feel free to continue where I left off, @apyrgio.

stdeb.cfg

dangerzone/isolation_provider/base.py

dangerzone/util.py

Add a new way to detect where the Tesseract data are stored in a user's system. On Linux, the Tesseract data should be installed via the package manager. On macOS and Windows, they should be bundled with the Dangerzone application. There is also the exception of running Dangerzone locally, where even on Linux, we should get the Tesseract data from the Dangerzone share/ folder.
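The lookup order described in this commit message could be sketched roughly as follows. This is only an illustration: the function name `find_tessdata_dir`, its parameters, and the Linux candidate directories are assumptions, not the code that lives in dangerzone/util.py.

```python
import platform
from pathlib import Path
from typing import Iterable, Optional

# Assumed locations where distro packages may install Tesseract data.
LINUX_TESSDATA_CANDIDATES = (
    Path("/usr/share/tessdata"),
    Path("/usr/share/tesseract/tessdata"),
    Path("/usr/share/tesseract-ocr/tessdata"),
)


def find_tessdata_dir(
    share_dir: Path,
    running_from_source: bool,
    system: Optional[str] = None,
    linux_candidates: Iterable[Path] = LINUX_TESSDATA_CANDIDATES,
) -> Path:
    """Return the directory that should hold the *.traineddata files."""
    system = system or platform.system()
    bundled = share_dir / "tessdata"

    # Local (dev) runs and macOS/Windows builds use the bundled copy.
    if running_from_source or system in ("Darwin", "Windows"):
        return bundled

    # On Linux, rely on whatever the package manager installed.
    for candidate in linux_candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("Tesseract data directory not found")
```

Passing `system` and `linux_candidates` explicitly keeps the function easy to exercise in tests on any platform.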

The PyMuPDF package was previously mainly used within the Dangerzone container, as well as on Qubes. With on-host conversion, PyMuPDF will be used in all supported platforms by default. For this reason, we can promote it to a main dependency.

Update .deb/.rpm specs to include PyMuPDF as a required package.

Extend the base isolation provider to immediately convert each page to a PDF, and optionally use OCR. In contrast with the way we did things previously, there are no longer two separate stages (document to pixels, pixels to PDF). We now handle each page individually, for two main reasons:

1. We don't want to buffer pixel data, either on disk or in memory, since they take a lot of space, and can potentially leave traces.
2. We can perform these operations in parallel, saving time. This is more evident when OCR is not used, where the times to convert a page to pixels and then back to a PDF are comparable.
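To make the page-by-page idea concrete, here is a minimal sketch of a host-side loop that consumes pages as they arrive. The framing is an assumption made for illustration (a 2-byte big-endian page count, then for each page a 2-byte width, a 2-byte height, and `width * height * 3` bytes of RGB data), and `read_exact`, `iter_pages`, `convert` and the `handle_page` callback are hypothetical names, not the provider's actual API.

```python
import struct
from typing import BinaryIO, Callable, Iterator, Tuple

Page = Tuple[int, int, bytes]  # (width, height, raw RGB bytes)


def read_exact(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes, failing if the stream ends early."""
    chunks = []
    remaining = size
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise ValueError("pixel stream ended unexpectedly")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_u16(stream: BinaryIO) -> int:
    """Read a 2-byte big-endian unsigned integer (assumed framing)."""
    return struct.unpack(">H", read_exact(stream, 2))[0]


def iter_pages(stream: BinaryIO) -> Iterator[Page]:
    """Yield pages one at a time, so only one page is held in memory."""
    n_pages = read_u16(stream)
    for _ in range(n_pages):
        width = read_u16(stream)
        height = read_u16(stream)
        yield width, height, read_exact(stream, width * height * 3)


def convert(stream: BinaryIO, handle_page: Callable[[int, int, bytes], None]) -> int:
    """Hand each page to `handle_page` as soon as it has been read."""
    count = 0
    for width, height, rgb in iter_pages(stream):
        handle_page(width, height, rgb)  # e.g. append the page to the output PDF
        count += 1
    return count
```

When `stream` is the stdout pipe of the sandboxed process, the sandbox can keep rendering the next page while the host handles the current one, which is where the parallelism mentioned above would come from.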

Move the logic for grabbing debug logs to a new place, now that we have merged the two conversion stages (doc to pixels, pixels to PDF).

Make the Dummy isolation provider follow the rest of the isolation providers and perform the second part of the conversion on the host. The first part of the conversion is just a dummy script that reads a file from stdin and prints pixels to stdout.
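As an illustration of such a dummy first stage, the sketch below drains its input and emits a single white page. The output framing (2-byte big-endian page count, width and height, followed by raw RGB bytes) and the function name `dummy_doc_to_pixels` are assumptions for this example, not the script shipped in the PR.

```python
import struct
from typing import BinaryIO


def dummy_doc_to_pixels(
    stdin: BinaryIO, stdout: BinaryIO, width: int = 8, height: int = 8
) -> None:
    """Consume the document and emit one white page in the assumed framing."""
    stdin.read()  # drain the input; its content is ignored
    stdout.write(struct.pack(">H", 1))  # page count
    stdout.write(struct.pack(">HH", width, height))  # page dimensions
    stdout.write(b"\xff" * (width * height * 3))  # white RGB pixels
    stdout.flush()
```

A standalone script would call it with `sys.stdin.buffer` and `sys.stdout.buffer`, so that the binary data is not mangled by text encoding.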

apyrgio · 2024-10-08T18:10:43Z

The PR is ready for review once more. The commit messages may require a bit more ❤️ and `make lint` complains, but other than that, it's as ready and tested as it can be.

almet

Awesome work Alex! I've tested the branch locally and it works (macOS m1), congrats 👍 🎉

In addition to the review comments I left inline, I believe we could check that the tesseract data is present before asking PyMuPDF to use it, and disable OCR if it is not. Right now, the conversion fails if the data is not installed (which should not happen, but I believe now is the right time to handle it).

I see two ways of doing this:

- Show a warning next to the OCR setting, mentioning that the tesseract data is not installed (for the selected language?)
- If no tesseract data is detected, remove the OCR setting and put a warning instead.
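Either option needs a way to tell which languages are actually installed. A minimal sketch, relying only on Tesseract's `<lang>.traineddata` file naming; the helper names `installed_ocr_languages` and `ocr_warning` and the warning texts are hypothetical:

```python
from pathlib import Path
from typing import List


def installed_ocr_languages(tessdata_dir: Path) -> List[str]:
    """List the language codes that have a `<code>.traineddata` file."""
    if not tessdata_dir.is_dir():
        return []
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))


def ocr_warning(tessdata_dir: Path, lang: str) -> str:
    """Return a warning for the OCR setting, or '' if OCR is usable."""
    languages = installed_ocr_languages(tessdata_dir)
    if not languages:
        return "OCR is disabled: no Tesseract language data was found."
    if lang not in languages:
        return f"OCR is unavailable for '{lang}': language data is not installed."
    return ""
```

The first branch corresponds to removing the OCR setting altogether, and the second to the per-language warning.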

.github/workflows/ci.yml

almet · 2024-10-09T12:01:12Z

dangerzone/isolation_provider/base.py

+                text = (
+                    f"Converting page {page}/{n_pages} from pixels to {searchable}PDF"
+                )
+                percentage += step
                self.print_progress(document, False, text, percentage)


Should it be the responsibility of self.print_progress to actually decide on the message to be shown?

I think so, yeah. Any issue you see with this approach?

Sorry, my previous message wasn't clear :-) I'm thinking about passing a different context to print_progress so that it actually prints the progress itself. Here the message is computed outside of this function, and I'm thinking we could do this inside the method instead.

Hm, something like passing a context that holds the total number of pages and the current page, so that `print_progress()` can construct the proper "Converting page..." message? I guess we can, but we'll still need to call `print_progress()` with an arbitrary message, e.g., in case of errors. Is there another benefit that we gain from this?
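For reference, the context-passing idea discussed in this thread might look like the sketch below. `PageProgress` and `progress_line` are hypothetical names, and the `"searchable "` prefix is an assumption about what the `searchable` variable in the diff holds. Free-form messages (for errors, say) still pass through unchanged.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class PageProgress:
    """Context describing which page is being converted."""

    page: int
    n_pages: int
    ocr: bool = False

    @property
    def percentage(self) -> float:
        return 100.0 * self.page / self.n_pages

    def message(self) -> str:
        searchable = "searchable " if self.ocr else ""
        return (
            f"Converting page {self.page}/{self.n_pages} "
            f"from pixels to {searchable}PDF"
        )


def progress_line(
    update: Union[PageProgress, str], percentage: float = 0.0
) -> Tuple[str, float]:
    """Build the (text, percentage) pair from a context or a plain message."""
    if isinstance(update, PageProgress):
        return update.message(), update.percentage
    return update, percentage
```

The trade-off raised above remains: the method has to accept both shapes, so the gain is mostly in keeping message formatting in one place.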

dangerzone/isolation_provider/base.py

dangerzone/isolation_provider/dummy.py

dangerzone/util.py

almet · 2024-10-09T12:27:28Z

install/common/download-tessdata.py

+        files = {f.name for f in tessdata_dir.iterdir()}
+        if files == expected_files:
+            msg = "> Skipping tessdata download, language data already exists"
+            print(msg, file=sys.stderr)


As mentioned elsewhere, we might want to use the logging module here instead.

Did so in c37ff73.
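A sketch of what the same check looks like with the logging module instead of `print(..., file=sys.stderr)`. The function name `tessdata_is_cached` is made up for this example, and the code in c37ff73 may differ:

```python
import logging
from pathlib import Path
from typing import Set

logger = logging.getLogger(__name__)


def tessdata_is_cached(tessdata_dir: Path, expected_files: Set[str]) -> bool:
    """Return True when the directory holds exactly the expected files."""
    if not tessdata_dir.is_dir():
        return False
    files = {f.name for f in tessdata_dir.iterdir()}
    if files == expected_files:
        logger.info("Skipping tessdata download, language data already exists")
        return True
    return False
```

With `logging.basicConfig(level=logging.INFO)` in the script's entry point, the message still goes to stderr, which is the default stream for the handler that `basicConfig()` installs.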

install/common/download-tessdata.py

install/linux/vendor-pymupdf.py

almet · 2024-10-09T12:33:45Z

install/linux/vendor-pymupdf.py

+    cmd = ["poetry", "export", "--only", "container"]
+    container_requirements_txt = subprocess.check_output(cmd)
+
+    # XXX: Hack for Ubuntu Focal.


🫣 You probably have this on your mind already, but it's probably worth adding a comment here about why this is.

I did so actually in the #940 PR. Let me know what you think there.

almet · 2024-10-09T12:35:52Z

tests/isolation_provider/base.py

@@ -46,7 +46,7 @@ def test_max_pages_client_enforcement(
        doc = Document(sample_doc)
        p = provider.start_doc_to_pixels_proc(doc)
        with pytest.raises(errors.MaxPagesException):
-            provider.doc_to_pixels(doc, tmpdir, p)
+            provider._convert(doc, None, p)


It seems weird to test private methods here. Should we rename it to a public method instead?

Fixed in d9eaec4.

apyrgio mentioned this pull request Mar 14, 2024

Sandbox all document processing in gVisor #590

Merged

deeplow reviewed Mar 14, 2024

install/linux/dangerzone.spec (outdated, resolved)

deeplow requested changes Mar 14, 2024

apyrgio force-pushed the 625-host-stream branch 2 times, most recently from ae9090d to 8884cb8 Compare March 27, 2024 12:24

deeplow reviewed Mar 28, 2024

install/linux/dangerzone.spec (resolved)

apyrgio force-pushed the 625-host-stream branch 5 times, most recently from da0dd54 to 10522c2 Compare March 28, 2024 16:29

apyrgio force-pushed the 625-host-stream branch 3 times, most recently from 48eba2b to 4d70bd9 Compare March 28, 2024 17:59

apyrgio mentioned this pull request Apr 1, 2024

OSError: [Errno 39] Directory not empty: 'pixels' when aborting during doc to pixels stage #759

Open

apyrgio mentioned this pull request Apr 9, 2024

Handle various termination scenarios of the conversion process #772

Merged

deeplow reviewed Apr 15, 2024

stdeb.cfg (outdated, resolved)

apyrgio mentioned this pull request Apr 18, 2024

pixels-to-pdf failed #781

Closed

apyrgio mentioned this pull request May 22, 2024

Catch out of RAM errors in client and server #578

Open

apyrgio added this to the 0.7.0 milestone Jun 3, 2024

almet reviewed Jun 5, 2024

dangerzone/isolation_provider/base.py (outdated, resolved)

almet reviewed Jun 5, 2024

dangerzone/util.py (outdated, resolved)

apyrgio force-pushed the 625-host-stream branch from 4d70bd9 to c69feba Compare June 11, 2024 17:01

almet removed this from the 0.7.0 milestone Jun 12, 2024

apyrgio force-pushed the 625-host-stream branch 2 times, most recently from 8f918c8 to 3125a59 Compare June 17, 2024 16:48

apyrgio mentioned this pull request Aug 8, 2024

GUI v2: MVP #894

Open

12 tasks

eloquence mentioned this pull request Aug 19, 2024

Update "How it works" section and add some articles about Dangerzone freedomofpress/dangerzone.rocks#39

Merged

apyrgio added 10 commits October 8, 2024 19:14

Make PyMuPDF a main Dangerzone dependency

0f2be58

Update .deb/.rpm dependencies

afe8179

Update the way we get debug logs

62c3267

Remove dead code

1ab3aab

Remove dead docs

2c08b5f

tests: Remove provider_wait fixtures

328ddbe

tests: Improve test for top-level conversion errors

1302a1f

apyrgio force-pushed the 625-host-stream branch from ef45fb4 to 1302a1f Compare October 8, 2024 16:17

almet reviewed Oct 9, 2024

apyrgio added 9 commits October 9, 2024 18:26

FIXUP: Use 'in' instead of '=='

4c7db48

FIXUP: Fix progress percentages

3e12aa3

ci: Check OCR in Debian/Fedora tests

80e972b

FIXUP: Use pathlib.Path for newer code

09bb125

FIXUP: debian: Explain why we ignore share/tessdata

79f6fcc

FIXUP: Add tesseract-ocr-all as a required dependency for Debian

e690b25

FIXUP: Factor out git_root

6a5b6e4

FIXUP: Detect proper tessdata dir for Linux systems

b3d8ddc

Revert "FIXUP: Factor out git_root"

8db9261

This reverts commit 6a5b6e4.

almet mentioned this pull request Oct 9, 2024

Put dev scripts into their own python module #946

Open

apyrgio added 7 commits October 9, 2024 21:54

FIXUP: Replace print statements with logging

c37ff73

FIXUP: Fix a deprecation warning for filter=

149ba23

FIXUP: Make _convert a public method

d9eaec4

FIXUP: Fix lint errors

d31a10f

FIXUP: Make run-tests CI job require cached tessdata

297fe5e

FIXUP: Fix lint and log to stderr

073a6e6

FIXUP: Include more tessdata dirs

6b65881
