Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skopeo-based adapter for working with OCI images #136

Closed
wants to merge 19 commits into from

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Nov 6, 2020

This series adds an adapter that, like the existing docker-archive-based adapter, can be used to track a Docker image in a dataset. The disadvantage is that skopeo needs to be installed, while the main advantage is that the layers downloaded from docker:// sources can be linked to registry URLs.

This depends on support for registry URLs in datalad. Although the approach will likely change, there is a draft PR at datalad/datalad#5129.

Related:
#98
#135

Closes #106.


@codecov
Copy link

codecov bot commented Nov 7, 2020

Codecov Report

Merging #136 (fc7b847) into master (0f41d68) will decrease coverage by 7.75%.
The diff coverage is 50.98%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
- Coverage   86.54%   78.79%   -7.76%     
==========================================
  Files          17       22       +5     
  Lines         929     1193     +264     
==========================================
+ Hits          804      940     +136     
- Misses        125      253     +128     
Impacted Files Coverage Δ
datalad_container/adapters/docker.py 43.05% <28.57%> (+7.10%) ⬆️
datalad_container/adapters/tests/test_oci_more.py 30.30% <30.30%> (ø)
datalad_container/adapters/utils.py 30.76% <30.76%> (ø)
datalad_container/adapters/oci.py 42.67% <42.67%> (ø)
datalad_container/containers_add.py 89.85% <50.00%> (-3.16%) ⬇️
datalad_container/adapters/tests/test_oci.py 100.00% <100.00%> (ø)
datalad_container/tests/test_utils.py 100.00% <100.00%> (ø)
datalad_container/utils.py 100.00% <100.00%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f41d68...fc7b847. Read the comment docs.

@yarikoptic
Copy link
Member

yarikoptic commented Nov 9, 2020

just a note (sorry if I am repeating what is already known etc): Installing "yet another tool" (which is not automagically installed from pypi) could challenge some users. I wondered if we could somehow simplify deployment of skopeo... but it seems to rely on a handful of libraries

$> ldd /usr/bin/skopeo
	linux-vdso.so.1 (0x00007fff1295c000)
	libgpgme.so.11 => /lib/x86_64-linux-gnu/libgpgme.so.11 (0x00007fd9aa6c6000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd9aa6a4000)
	libdevmapper.so.1.02.1 => /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1 (0x00007fd9aa637000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd9aa472000)
	libassuan.so.0 => /lib/x86_64-linux-gnu/libassuan.so.0 (0x00007fd9aa45c000)
	libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007fd9aa436000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fd9aa754000)
	libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fd9aa408000)
	libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007fd9aa3e1000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd9aa29d000)
	libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007fd9aa20d000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fd9aa207000)

good news is that skopeo is available from debian and conda-forge : https://anaconda.org/conda-forge/skopeo so at least in those packaging/deployment scenarios we could streamline (add as a dependency or Recommends: in debian).

edit: just now mentioned -- no windows build on conda-forge, only osx and linux :-/ filed conda-forge/skopeo-feedstock#9

@kyleam kyleam force-pushed the skopeo branch 2 times, most recently from 290743a to c45cef6 Compare November 10, 2020 19:51
@kyleam
Copy link
Contributor Author

kyleam commented Nov 10, 2020

I've pushed an update to avoid duplicating logic across the docker and skopeo adapters. The first eight commits of the series do that, with the first commit being a temporary commit with @adswa's Windows-related docker adapter fixes from gh-137 and gh-140.

range-diff
 -:  -------- >  1:  6074460a [drop] base for recent docker adapter PRs
 -:  -------- >  2:  22118280 MV: docker adapter: Move some bits to adapters.utils
 -:  -------- >  3:  01b1a202 RF: adapters.utils: Rename _list_images to get_docker_image_ids
 -:  -------- >  4:  c52a0c0c RF: adapters: Prefer subprocess.run over check_* methods
 -:  -------- >  5:  3072f6d3 MV: adapters: Move logging configuration to utils
 -:  -------- >  6:  faf5256c ENH: adapters.utils: Avoid double logging
 -:  -------- >  7:  59ed18f7 ENH: adapters.utils: Add helper for main() handling
 -:  -------- >  8:  94363806 ENH: adapters.utils: Display captured stderr on exit
 1:  ac3db32a !  9:  cdea6a52 NF: oci: Add adapter for working with OCI images
    @@ Commit message
         via podman in the future.

         Add an initial skopeo-based OCI adapter.  At this point, it has the
    -    same functionality as the docker adapter.  Structurally it's fairly
    -    similar to the docker adapter, but it's not clear that it's worth the
    -    effort to try to share any common parts, at least until the oci
    -    interface settles and we decide to keep the docker adapter around.
    +    same functionality as the docker adapter.

      ## datalad_container/adapters/oci.py (new) ##
     @@
    @@ datalad_container/adapters/oci.py (new)
     +
     +import json
     +import logging
    -+import os
     +from pathlib import Path
     +import re
     +import subprocess as sp
     +import sys
     +
    ++from datalad_container.adapters.utils import (
    ++    docker_run,
    ++    get_docker_image_ids,
    ++    log_and_exit,
    ++    setup_logger,
    ++)
    ++
     +lgr = logging.getLogger("datalad.container.adapters.oci")
     +
     +
    @@ datalad_container/adapters/oci.py (new)
     +    return info["config"]["digest"]
     +
     +
    -+def _get_docker_image_ids():
    -+    out = sp.run(
    -+        ["docker", "images", "--all", "--quiet", "--no-trunc"],
    -+        capture_output=True, universal_newlines=True, check=True)
    -+    return out.stdout.splitlines()
    -+
    -+
     +def load(path):
     +    """Load OCI image from `path`.
     +
    @@ datalad_container/adapters/oci.py (new)
     +    An image ID (str)
     +    """
     +    image_id = get_image_id(path)
    -+    if image_id not in _get_docker_image_ids():
    ++    if image_id not in get_docker_image_ids():
     +        lgr.debug("Loading %s", image_id)
     +        # The image is copied with a datalad-container/ prefix to reduce the
     +        # chance of collisions with existing names registered with the Docker
    @@ datalad_container/adapters/oci.py (new)
     +
     +def cli_run(namespace):
     +    image_id = load(namespace.path)
    -+    prefix = ["docker", "run",
    -+              # FIXME: The -v/-w settings are convenient for testing, but they
    -+              # should be configurable. Another option would be to not specify
    -+              # any arguments to docker-run and let everything be handled by
    -+              # the caller.
    -+              "-v", "{}:/tmp".format(os.getcwd()),
    -+              "-w", "/tmp",
    -+              # Make it possible for the output files to be added to the
    -+              # dataset without the user needing to manually adjust the
    -+              # permissions.
    -+              #
    -+              # FIXME: But if the container doesn't have these IDs, some
    -+              # commands may fail.
    -+              "-u", "{}:{}".format(os.getuid(), os.getgid()),
    -+              "--rm"]
    -+    if sys.stdin.isatty():
    -+        prefix.extend(["--interactive", "--tty=true"])
    -+    prefix.append(image_id)
    -+    cmd = prefix + namespace.cmd
    -+    lgr.debug("Running %r", cmd)
    -+    sp.run(cmd, check=True)
    ++    docker_run(image_id, namespace.cmd)
     +
     +
     +def main(args):
    @@ datalad_container/adapters/oci.py (new)
     +
     +    namespace = parser.parse_args(args[1:])
     +
    -+    lgr.setLevel(logging.DEBUG if namespace.verbose else logging.INFO)
    ++    setup_logger(lgr, logging.DEBUG if namespace.verbose else logging.INFO)
     +
     +    namespace.func(namespace)
     +
     +
     +if __name__ == "__main__":
    -+    try:
    ++    with log_and_exit(lgr):
     +        main(sys.argv)
    -+    except Exception as exc:
    -+        lgr.exception("Failed to execute %s", sys.argv)
    -+        if isinstance(exc, sp.CalledProcessError):
    -+            excode = exc.returncode
    -+            if exc.stderr:
    -+                sys.stderr.write(exc.stderr)
    -+        else:
    -+            excode = 1
    -+        sys.exit(excode)
 2:  62c3e97e = 10:  9013fdf6 DOC: oci: Add todo comment about image ID mismatch
 3:  5710e8f8 <  -:  -------- ENH: oci: Wire up logging handler when script called via path
 4:  1bbfac5d = 11:  27cd9417 DOC: oci: Add comment about alternative destinations
 5:  8cae9302 = 12:  02fa4922 ENH: oci: Silence 'skopeo copy' output when loading image
 6:  45a7d332 ! 13:  954b5d84 ENH: oci: Add utility for parsing Docker reference
    @@ datalad_container/adapters/oci.py
     +from collections import namedtuple
      import json
      import logging
    - import os
    + from pathlib import Path
     @@
      lgr = logging.getLogger("datalad.container.adapters.oci")

 7:  a51391b9 = 14:  47a2054e ENH: oci: Copy over tag when saving OCI directory
 8:  cb262015 = 15:  aa7653a3 ENH: oci: Add utilities for storing and reading annotation field
 9:  c398c789 = 16:  a04e6a5f ENH: oci: Record source for skopeo-copy in image's annotation
10:  fa6b5388 = 17:  d2f33fd4 ENH: oci: Register a more informative name with docker-daemon
11:  3d670b99 ! 18:  ee1a33af ENH: containers-add: Wire up OCI adapter
    @@ datalad_container/adapters/tests/test_oci_more.py (new)
     +    WitlessRunner,
     +)
     +from datalad_container.adapters import oci
    ++from datalad_container.adapters.utils import get_docker_image_ids
     +from datalad.tests.utils import (
     +    assert_in,
     +    integration,
    @@ datalad_container/adapters/tests/test_oci_more.py (new)
     +
     +    image_path = ds.repo.pathobj / ".datalad" / "environments" / "bb" / "image"
     +    image_id = oci.get_image_id(image_path)
    -+    existed = image_id in oci._get_docker_image_ids()
    ++    existed = image_id in get_docker_image_ids()
     +
     +    try:
     +        out = WitlessRunner(cwd=ds.path).run(
12:  a5a13621 = 19:  b4d73a17 MV: Add utils module with containers-add's _ensure_datalad_remote
13:  290743a9 ! 20:  c45cef66 ENH: containers-add: Try to link layers in OCI directory
    @@ datalad_container/adapters/tests/test_oci_more.py
     +    DATALAD_SPECIAL_REMOTES_UUIDS,
     +)
      from datalad_container.adapters import oci
    + from datalad_container.adapters.utils import get_docker_image_ids
      from datalad.tests.utils import (
          assert_in,
          integration,

These parts will be useful for the upcoming skopeo adapter as well.
"get" is probably clearer than "list", and tacking on "_ids" makes it
clearer what the return value is.

Also, drop the leading underscore, which is a holdover from the
function being in the adapters.docker module.
The minimum Python version of DataLad is new enough that we can assume
subprocess.run() is available.  It's recommended by the docs, and I
like it more, so switch to it.

Note that we might want to eventually switch to using WitlessRunner
here.  The original idea with using the subprocess module directly was
that it'd be nice for the docker adapter to be standalone, as nothing
in the adapter depended on datalad at the time.  That's not the case
anymore after the adapters.utils split and the use of datalad.utils
within it.  (And the upcoming skopeo adapter will make heavier use of
datalad for adding URLs to the layers.)
This logic will get a bit more involved in the next commit, and it
will be needed by the skopeo adapter too.
When the adapter is called from the command line (as containers-run
does) and datalad gets imported, the level set via the --verbose
argument doesn't have an effect and logging happens twice, once
through datalad's handler and once through the adapter's.

Before 313c4f0 (WIN/Workaround: don't pass gid and uid to docker run
call, 2020-11-10), the above was the case when docker.main() was
triggered with the documented `python -m datalad_container.adapters
...` invocation, but not when the script path was passed to python.
Following that commit, the adapter imports datalad, so datalad's
logger is always configured.

Adjust setup_logger() to set the log level of loggers under the
datalad.containers.adapters namespace so that the adapter's logging
level is in effect for command line calls to the adapter.

As mentioned above, datalad is now loaded in all cases, so a handler
is always configured, but, in case this changes in the future, add a
simpler handler if one isn't already configured.
The same handling will be needed in the skopeo adapter.  Avoid
repeating it.
Some of the subprocess calls capture stderr.  Show it to the caller on
failure.
In order to be able to track Docker containers in a dataset, we
introduced the docker-save-based docker adapter in 68a1462 (Add
prototype of a Docker adapter, 2018-05-18).  It's not clear how much
this has been used, but at least conceptually it seems to be viable.
One problem, however, is that ideally we'd be able to assign Docker
registry URLs to the image files stored in the dataset (particularly
the large non-configuration files).  There doesn't seem to be a way to
do this with the docker-save archives.

Another option for storing the image in a dataset is the Open
Container Initiative image format.  Skopeo can be used to copy images
in Docker registries (and some other destinations) to an OCI-compliant
directory.  When Docker Hub is used as the source, the resulting
layers blobs can be re-obtained via GET /v2/NAME/blobs/ID.

Using skopeo/OCI also has the advantage of making it easier to execute
via podman in the future.

Add an initial skopeo-based OCI adapter.  At this point, it has the
same functionality as the docker adapter.
After running `skopeo copy docker://docker.io/... oci:<dir>`, we can
link up the layer to the Docker registry.  However, other digests
aren't preserved.  One notable mismatch is between the image ID if you
run

    docker pull x

versus

    skopeo copy docker://x oci:x && skopeo copy oci:x docker-daemon:x

I haven't really wrapped my head around all the different digests and
when they can change.  However, skopeo's issue tracker has a good deal
of discussion about this, and it looks complicated (e.g., issues 11,
469, 949, 1046, and 1097).

The adapter docstring should probably note this, though at this point
I'm not sure I could say something coherent.  Anyway, add a to-do
note...
I _think_ containers-storage: is what we'd use for podman-run, but I
haven't attempted it.
Prevent skopeo-copy output from being shown, since it's probably
confusing to see output under run's "Command start (output follows)"
tag for a command that the user didn't explicitly call.  However, for
large images, this has the downside that the user might want some
signs of life, so this may need to be revisited.
We'll need this information in order to add a tag to the oci:
destination and to make the entry copied to docker-daemon more
informative.  I've tried to base the rules on containers/image
implementation, which is what skopeo uses underneath.
An image stored as an OCI directory can have a tag.  If the source has
a tag specified, copy it over to the destination.

Note that in upcoming commits will store the full source specification
as an image annotation, so we won't rely on this when copying the
image to docker-daemon:, but it still seems nice to have (e.g., when
looking at the directory with skopeo-inspect).
These will be used to store the value of the skopeo-copy source and
then retrieve it at load time to make the docker-daemon: entry more
informative.
The OCI format allows annotations.  Add one with the source value
(which will be determined by what the caller gives to containers-add)
so that we can use this information when copying the information to a
docker-daemon: destination.
The images copied to the daemon look like this

    $ docker images
    REPOSITORY             TAG                 IMAGE ID            CREATED             SIZE
    datalad-container/bb   sha256-98345e4      98345e418eb7        3 weeks ago         69.2MB

That tag isn't useful because it just repeats the image ID.  And the
name after "datalad-container/" is the name of the directory, so with
the default containers-add location it would be an uninformative
"image".

With the last commit, we store the source specification as an
annotation in the OCI directory.  Parse it and reuse the original
repository name and tag.

   REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE
   datalad-container/debian   buster-slim         98345e418eb7        3 weeks ago         69.2MB

If the source has a digest instead of the tag, construct the daemon
tag from that.
Add a new oci: scheme.  The stacking of the schemes isn't ideal
(oci:docker://, oci:docker-daemon:), but it allows for any skopeo
transport to be used.

Note: I'm not avoiding appending "//" for a conceptual reason
(although there might be a valid one), but because I find
"oci://docker://" to be ugly.  Perhaps the consistency with "shub://"
and "dhub://" outweighs that though.
The next commit will use this logic in the oci adapter as well, and,
it'd be nice (though not strictly necessary) to avoid oci and
containers_add importing each other.
TODO: Finalize approach in Datalad for Docker Registry URLs.
@mih
Copy link
Member

mih commented Oct 9, 2023

I am closing this draft after some years. It is evident that nobody is working on it, and #106 exists to remind that this effort was made.

@mih mih closed this Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider skopeo-based adapter
3 participants