Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform on-host conversion for the pixels to PDF stage #748

Open
wants to merge 44 commits into
base: main
Choose a base branch
from

Conversation

apyrgio
Copy link
Contributor

@apyrgio apyrgio commented Mar 14, 2024

This PR introduces a fundamental change in the way Dangerzone processes documents. Instead of first grabbing all of the pixel data from the first container, storing them on disk, and then reconstructing the PDF on a second container, Dangerzone now immediately reconstructs the PDF on the host, while the doc to pixels conversion is still running on the first container. The sanitzation is no less safe, since the boundaries between the sandbox and the host are still respected.

What we gain is that we no longer use mounts, and we have much faster conversions, especially on Windows and macOS.

Fixes #625

Note

This PR still has some rough edges. Off the top of my head, we need to:

  • Test the changes across all of our supported platforms, and fix all of our CI errors.
  • Remove tool.poetry.group.container.dependencies section from pyproject.toml, as it's duplicated info.
    • Actually, it still has its uses
  • Remove --userns keep-id option in Podman.
  • Make donwload-tessdata.py cacheable in our CI runs.
  • Turn OCR language deps into recommendations in Linux systems, and handle if some are not installed.
  • Improve our Dummy isolation provider, so that the steps that run in the host actually run in our Windows / macOS CI runners.
  • Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.
  • Update our wording in various places, so that we no longer refer to using two containers for the sanitization.
  • Draft an ARCHITECTURE.md, which will be the source of truth on how Dangerzone works now.

All these cannot be tackled in a single PR, but we at least need to have issues for the ones we won't tackle immediately, before merging this PR.

Copy link
Contributor

@deeplow deeplow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty incredible. Congrats! 🥳 A lot of work went before this and now this feels like the cherry on top. I have some minor code improvement suggestions.

What I still have to do:

  • test on windows and macOS

Other observations:

  • thanks for removing the dead code!
  • the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.
  • ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?
  • dummy can have pixels_to_pdf removed
  • Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines by one saying "converting page X". And when OCR is used we could say "making page X searchable"

dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/conversion/common.py Show resolved Hide resolved
dangerzone/conversion/common.py Show resolved Hide resolved
dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Show resolved Hide resolved
dangerzone/isolation_provider/container.py Show resolved Hide resolved
dangerzone/isolation_provider/container.py Show resolved Hide resolved
dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
@apyrgio apyrgio force-pushed the 625-host-stream branch 2 times, most recently from ae9090d to 8884cb8 Compare March 27, 2024 12:24
@apyrgio
Copy link
Contributor Author

apyrgio commented Mar 27, 2024

I'll reply to some of your observations as well:

the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.

In my Fedora 39 dev environment, the GUI seems to work. Can you provide the error log?

ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?

I was thinking of either reusing PyMuPDF within the container, or using Tesseract just for Ubuntu Focal. I'll let you know.

dummy can have pixels_to_pdf removed

Yeap, you're right.

Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines by one saying "converting page X". And when OCR is used we could say "making page X searchable"

Yeap, you're right.

@apyrgio apyrgio force-pushed the 625-host-stream branch 5 times, most recently from da0dd54 to 10522c2 Compare March 28, 2024 16:29
@deeplow
Copy link
Contributor

deeplow commented Mar 28, 2024

Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.

I worked on this. The code is in the branch 625-host-stream-tessdata-packaging. A lot of stuff had to be moved and I didn't manage to finish testing this week. I tested on fedora and debian and it seems to be building fine. The only thing is that it includes the .gitkeep in share/container.

On macOS it seems to be failing but I haven't had time to investigate. If you have the chance before me, feel free to continue where I left @apyrgio.

stdeb.cfg Outdated Show resolved Hide resolved
@apyrgio apyrgio mentioned this pull request Apr 18, 2024
@apyrgio apyrgio added this to the 0.7.0 milestone Jun 3, 2024
dangerzone/util.py Outdated Show resolved Hide resolved
Add a new way to detect where the Tesseract data are stored in a user's
system. On Linux, the Tesseract data should be installed via the package
manager. On macOS and Windows, they should be bundled with the
Dangerzone application.

There is also the exception of running Dangerzone locally, where even
on Linux, we should get the Tesseract data from the Dangerzone share/
folder.
The PyMuPDF package was previously mainly used within the Dangerzone
container, as well as on Qubes. With on-host conversion, PyMuPDF will be
used in all supported platforms by default. For this reason, we can
promote it to a main dependency.
Update .deb/.rpm specs to include PyMuPDF as a required package.
Extend the base isolation provider to immediately convert each page to
a PDF, and optionally use OCR. In contract with the way we did things
previously, there are no more two separate stages (document to pixels,
pixels to PDF). We now handle each page individually, for two main
reasons:

1. We don't want to buffer pixel data, either on disk or in memory,
   since they take a lot of space, and can potentially leave traces.
2. We can perform these operations in parallel, saving time. This is
   more evident when OCR is not used, where the time to convert a page
   to pixels, and then back to a PDF are comparable.
Move the logic for grabbing debug logs to a new place, now that we have
merged the two conversion stages (doc to pixels, pixels to PDF).
Make the Dummy isolation provider follow the rest of the isolation
providers and perform the second part of the conversion on the host. The
first part of the conversion is just a dummy script that reads a file
from stdin and prints pixels to stdout.
@apyrgio
Copy link
Contributor Author

apyrgio commented Oct 8, 2024

The PR is ready for review once more. The commit messages may require a bit more ❤️ and make lint complains, but other than that, it's as ready and tested as it can be.

Copy link
Contributor

@almet almet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work Alex! I've tested the branch locally and it works (macOS m1), congrats 👍 🎉

Additionally to the review comments I left inline, I believe we could check that the tesseract data is present before asking PyMuPDF to use it, disabling this behavior if not present. Right now, it fails if not installed (which should not happen, but I believe it's the right timing to disable this).

I see two ways of doing this:

  • Show a warning next to the OCR setting, mentioning that the tesseract data is not installed (for the selected language?)
  • If no tesseract data is detected, remove the OCR setting and put a warning instead.

.github/workflows/ci.yml Show resolved Hide resolved
text = (
f"Converting page {page}/{n_pages} from pixels to {searchable}PDF"
)
percentage += step
self.print_progress(document, False, text, percentage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be the responsibility of self.print_progress to actually decide on the the message to be shown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, yeah. Any issue you see with this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my previous message wasn't clear :-) I'm thinking about passing a different context to print_progress so that it actually prints the progress itself. Here the message is computed outside of this function, and I'm thinking we could do this inside the method instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, something like passing a context that holds the total number of pages and the current page, so that print_progress() can construct the proper "Converting page..." message? I guess we can, but we'll still need to call `print_progress() with an arbitrary message, e.g., in case of errors. Is there another benefit that we gain from this?

dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/dummy.py Show resolved Hide resolved
dangerzone/util.py Outdated Show resolved Hide resolved
files = {f.name for f in tessdata_dir.iterdir()}
if files == expected_files:
msg = "> Skipping tessdata download, language data already exists"
print(msg, file=sys.stderr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned elsewhere, we might want to use the logging module here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did so in c37ff73.

install/common/download-tessdata.py Show resolved Hide resolved
install/linux/vendor-pymupdf.py Show resolved Hide resolved
cmd = ["poetry", "export", "--only", "container"]
container_requirements_txt = subprocess.check_output(cmd)

# XXX: Hack for Ubuntu Focal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🫣 You probably have this on your mind already, but it's probably worth adding a comment here about why this is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did so actually in the #940 PR. Let me know what you think there.

@@ -46,7 +46,7 @@ def test_max_pages_client_enforcement(
doc = Document(sample_doc)
p = provider.start_doc_to_pixels_proc(doc)
with pytest.raises(errors.MaxPagesException):
provider.doc_to_pixels(doc, tmpdir, p)
provider._convert(doc, None, p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to test private methods here. Should we rename it to a public method instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d9eaec4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On-host pixels to PDF conversion
3 participants