
Action with Pangea-3 installation reproduction and ppc64le emulation #257

Merged: 34 commits, Sep 13, 2024

Conversation

@Algiane (Contributor) commented Jan 29, 2024

New job that:

  • emulates a ppc64le architecture (using docker/setup-qemu-action, which relies on QEMU through the qemu-user-static image);

  • deploys an AlmaLinux-8 image on which the TPLs' dependencies are installed, matching the Pangea-3 modules needed to build the TPLs:

    • CMake-3.26
    • gcc-9.4.0
    • ompi-4.1.2
    • cuda-11.5.0
    • openblas-0.3.18
    • lsf-10.1

    The Dockerfile used to build this image is provided in docker/TotalEnergies/Pangea3-base.Dockerfile; the image is available on my DockerHub account under the name pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18 with tag 4: 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4.

  • adds a docker/TotalEnergies/Pangea3.Dockerfile file that builds the ppc64le docker image with the TPLs built and installed for GEOS;

  • adds a RUNS_ON variable to the job matrix to allow the use of different runners (a self-hosted runner more powerful than the default GitHub runners is needed because of the slowdown introduced by the emulation layer);

  • removes the push step from the docker_build_and_push.sh script and renames this script to docker_build.sh;

  • moves the Docker authentication to just before the image push and no longer logs out on streak2: this fixes "access denied" errors when pushing images, caused by a race condition between jobs (if two jobs run at the same time on the machine, one job may remove the login credentials between the moment the other logs in to Docker and the moment it attempts to push its image);

  • adds a dedicated step for the docker push command (see the workflow sketch after this list).
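
A minimal sketch of how these pieces could fit together in the workflow; the job name, matrix values, script path, secret names and environment variables are illustrative assumptions, not copied from the actual workflow file:

```yaml
# Hypothetical excerpt of the ppc64le TPL job; names and values are
# illustrative, not the exact ones from this PR.
jobs:
  build_tpls_ppc64le:
    strategy:
      matrix:
        include:
          - RUNS_ON: ubuntu-22.04   # default GitHub-hosted runner
          - RUNS_ON: streak2        # more powerful self-hosted runner
    runs-on: ${{ matrix.RUNS_ON }}
    steps:
      - uses: actions/checkout@v4

      # Register QEMU binfmt handlers (via the qemu-user-static image)
      # so ppc64le containers run transparently on an amd64 host.
      - uses: docker/setup-qemu-action@v3
        with:
          platforms: ppc64le

      # Build only; pushing is now a separate, dedicated step.
      - name: Build TPL image
        run: ./docker_build.sh

      # Log in immediately before pushing, and do not log out, to avoid
      # the credential race between concurrent jobs on the same machine.
      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Push TPL image
        run: docker push "${DOCKER_REPOSITORY}:${TAG}"
```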

Linked to EPIC TTE Builds and Geos PR 3159

  - rename the DOCKERFILE variable to TPL_DOCKERFILE to avoid conflicts with the run-on-arch-action variable
  - call a ppc64le ubuntu20 image to install docker, and call docker build to build the suitable TPL image
@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 1620a6a to 17a52a9 on January 29, 2024 18:12
@Algiane Algiane marked this pull request as draft February 26, 2024 15:22
@Algiane (Contributor, Author) commented Feb 26, 2024

Preliminary Remarks

  • for now, the Docker images and associated Dockerfiles have been produced as a PoC and no attempt was made to reduce their sizes, which are pretty large (~4 GB compressed on DockerHub, ~9 GB once the image is run).
    The CUDA install alone, for example, uses a lot of storage (~4 GB); I guess it can be reduced by copying only the needed library files.

  • the Pangea-3 job is a draft too, and a little time (~5-10 min) could be saved by creating a specific image dedicated to the uraimo job (we need a Linux OS running on the ppc64le arch with Docker available). For now, Docker is installed by the job on the very light ubuntu image provided by uraimo, roughly as sketched below.
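
A minimal sketch of that uraimo step, assuming the action's documented inputs; the install/run commands here are illustrative, not the job's exact ones:

```yaml
# Hypothetical sketch: run the TPL docker build inside an emulated
# ppc64le ubuntu container provided by run-on-arch-action.
- uses: uraimo/run-on-arch-action@v2
  with:
    arch: ppc64le
    distro: ubuntu20.04
    githubToken: ${{ github.token }}  # lets the action cache its base container
    # Docker is absent from the very light base image, so the job
    # installs it; a dedicated pre-built image would skip this step.
    install: |
      apt-get update -y
      apt-get install -y docker.io
    run: |
      docker build -f docker/TotalEnergies/Pangea3.Dockerfile .
```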

Job Failure

  • Compilation fails due to the time limit (the TPL compilation takes too much time).

Evaluation of emulation layer slowdown

  1. qemu layer: slowdown by a factor of about 14.
    The slowdown linked to the qemu-user-static emulation layer is evaluated by comparing the compilation times of the finiteElement library of the GEOS repository, on 4 cores with 32 GB of memory (methodology sketched after this list):

    • without qemu: real 2m11.743s - user 4m22.582s
    • with qemu: real 27m58.993s - user 73m23.212s
  2. uraimo/run-on-arch-action: slowdown by a factor of about 15 (no particular degradation compared to the bare qemu layer).
    The slowdown linked to run-on-arch-action has been evaluated on an external code without any dependencies (for the sake of simplicity, since it removes the need to build a suitable docker image for the target architecture). Test results:

    • without run-on-arch emu: real 1m20 - user 1m13
    • with run-on-arch emu: real 20m32.581s - user 20m3.864s
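
The comparison boils down to timing the same build natively and under emulation, roughly along the following lines; mount points and make targets are illustrative assumptions:

```bash
# Baseline: native amd64 build of the finiteElement library.
time make -C geos/build -j 4 finiteElement

# Register the qemu-user-static binfmt handlers (docker/setup-qemu-action
# does the equivalent in CI), then repeat the same build under emulation.
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
time docker run --rm --platform linux/ppc64le \
  -v "$PWD/geos":/geos \
  7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4 \
  make -C /geos/build -j 4 finiteElement
```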

Perspectives

Even if we manage to build the TPLs within the allotted time, the GEOS CUDA build is a lot slower (~73 min on 4 cores in Debug mode and ~100 min in Release), so it will not be possible to use the emulation layer as-is.

Nevertheless, we can list some improvement paths for the current PR and the TPL build.

From a more global perspective (GEOS project), thanks to @sframba's propositions:

  • we may attempt to cross-compile the TPLs and GEOS for the target arch and to run the unit tests using a self-hosted runner with GPUs and the emulation layer (the tests take about 2 or 3 minutes in Release mode). Note that it may be a little tricky to deploy (I am no expert in cross-compilation, so I don't yet see how we would ensure the suitable versions of the various needed shared libs);

  • we may try to get a ppc64le host to run our tests. It seems that this is not natively provided by GitHub (see the related doc), but some workarounds exist (see GitHub: Self-hosted runners on ppc64le architectures, or the list of self-hosted GitHub Action runners).

@Algiane (Contributor, Author) commented Mar 4, 2024

@sframba @TotoGaz: you can read the PR comments if you are interested in feedback on this work, which you initiated with @XL64.

@TotoGaz (Contributor) commented Mar 5, 2024

Hello @Algiane, thank you for your comments.

The timing issue is surely something to keep in mind, but before getting to this, I'd like to get a little more information about the process.

  • Using qemu, are you able to compile a ppc executable that runs on P3?
  • Same question, but with some simple CUDA program. Can you compile it and run it on P3?

@Algiane (Contributor, Author) commented Mar 11, 2024

Hi @TotoGaz ,

On Pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built.
I don't know how to verify that it really uses the GPUs, but running the same test on the P3_USERS_CPU queue fails with the 'no CUDA-capable device' error.

The geos TPLs and geos binary have been built:

  • on amd64;
  • with the deployment of the qemu-user-static docker image for the emulation layer;
  • using the 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:3 ppc64le docker image.

For now, testing the executable on P3 involves a few manual tweaks (sketched after this list). I:

  • copied the TPL install directory, the lvarray shared library created by the geos build, and the geos binary to P3;
  • created a symlink from the python3.6 library to the python3.8 one (I didn't pay attention to the python version in my docker image; of course it is not the same as on P3);
  • exported the suitable paths to the TPL libraries in my LD_LIBRARY_PATH;
  • loaded the suitable gcc, cuda, ompi and openblas modules.
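
Roughly, that manual setup looks like the following; every path and module name here is a hypothetical stand-in for illustration:

```bash
# Load modules matching the toolchain baked into the docker image
# (hypothetical module names).
module load gcc/9.4.0 cuda/11.5.0 ompi/4.1.2 openblas/0.3.18

# Point the loader at the copied TPL installs and the lvarray library
# (hypothetical locations).
export LD_LIBRARY_PATH="$HOME/tpl-install/lib:$HOME/geos-libs:$LD_LIBRARY_PATH"

# The binary was linked against the image's python (3.6) while P3
# provides 3.8: alias the expected soname to the available library
# (hypothetical paths).
ln -s /usr/lib64/libpython3.8.so.1.0 "$HOME/geos-libs/libpython3.6m.so.1.0"
```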

Please let me know if you need more tests.

Best

@TotoGaz (Contributor) commented Mar 11, 2024

> On Pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built. I don't know how to verify that it really uses the GPUs, but running the same test on the P3_USERS_CPU queue fails with the 'no CUDA-capable device' error.

For that specific purpose, you can run geos with the --trace-data-migration command-line option ("Trace host-device data migration"). You'll be able to see data moving to and from the device.
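
For instance, assuming the usual -i input flag and the test case mentioned above:

```bash
# Host<->device copies show up in the log, confirming the GPU is used.
geos -i acous3D_abc_smoke.xml --trace-data-migration
```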

@TotoGaz (Contributor) commented Mar 11, 2024

@Algiane Is it fair to state that the issue is now really a timing issue? That if we had a very, very powerful machine, it would work OK?

Cross-compiling can be very challenging. Furthermore, cross-compiling the TPLs means cross-compiling ~20 libs with their sometimes clunky build systems, and CUDA on top of that. I do not know how to manage that; it would require a lot of dedication, to say the least.

@Algiane (Contributor, Author) commented Mar 11, 2024

Thanks for the --trace-data-migration tip: it confirms that some LvArrays are moved to/from the GPUs.

For me, with this method we have 2 issues:

    1. the compilation time;
    2. the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit, and I think that the base image (the image with the copy of the pangea modules needed to build the TPLs, but without the TPLs built) is not far behind. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home directory to avoid the 'no space left on device' error.

For now, as the emulation seems to be a dead end but we still don't have a solution to test the P3 configuration, I will leave this PR as a draft and look into connecting a ppc64le runner to GitHub Actions as a self-hosted runner: it could be an alternative if we can buy a small ppc64le machine.

Best

@TotoGaz (Contributor) commented Mar 11, 2024

> 1. the compilation time;

We have a powerful self-hosted machine. Do you think that could do it?

> 2. the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit, and I think that the base image (the image with the copy of the pangea modules needed to build the TPLs, but without the TPLs built) is not far behind. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home directory to avoid the 'no space left on device' error.

I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4 GB (still very big, but half). Do you know what gets it so big? We make heavy use of the multi-stage approach to remove the temporaries. Are you doing the same?
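
(For context, a minimal sketch of that multi-stage pattern; the base image, stage names, paths and build script are illustrative:)

```dockerfile
# Hypothetical multi-stage layout: compile in a throwaway stage, keep
# only the install tree in the final image.
FROM almalinux:8 AS builder
COPY . /tmp/src
RUN /tmp/src/build_tpls.sh --install-prefix=/opt/tpls   # hypothetical script

FROM almalinux:8
# Sources, build trees and other temporaries from the builder stage are
# dropped; only the install prefix is copied into the final image.
COPY --from=builder /opt/tpls /opt/tpls
```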

Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?

@Algiane (Contributor, Author) commented Mar 11, 2024

> > 1. the compilation time;
>
> We have a powerful self-hosted machine. Do you think that could do it?

Maybe: it depends on the time needed to build the TPLs and GEOS on this machine. Multiplying these times by ~15 gives an order of magnitude of the times needed with the emulation layer (for example, a 30 min native build would take roughly 7.5 hours emulated).

> > 2. the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit, and I think that the base image (the image with the copy of the pangea modules needed to build the TPLs, but without the TPLs built) is not far behind. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home directory to avoid the 'no space left on device' error.
>
> I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4 GB (still very big, but half). Do you know what gets it so big? We make heavy use of the multi-stage approach to remove the temporaries. Are you doing the same?
>
> Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?

I get about the same size for the image on DockerHub, but that is the compressed size: once pulled, for example, the pecan-gpu image is about 10.8 GB, and I quickly run out of space.
This is less annoying than the time issue (as it is possible to work in an external volume).

@Algiane (Contributor, Author) commented Mar 12, 2024

@sframba: I have tested connecting a ppc64le self-hosted runner to GitHub Actions using a non-official runner (https://github.com/ChristopherHX/github-act-runner). It worked smoothly for a simple script execution; a sketch of the registration flow is below.
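
A sketch of what registering such a runner could look like; the flags shown follow the official runner's config.sh conventions and are assumptions here, so the exact, current syntax should be taken from the github-act-runner README:

```bash
# Hypothetical registration of a ppc64le self-hosted runner.
./github-act-runner configure \
  --url https://github.com/<org>/<repo> \
  --token <registration-token> \
  --labels self-hosted,ppc64le
./github-act-runner run
```

A workflow could then target it with runs-on: [self-hosted, ppc64le].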

@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 3a5d6ba to 8701d91 on June 4, 2024 14:05
@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 702f8be to fc924a5 on June 18, 2024 11:27
@Algiane Algiane marked this pull request as ready for review June 21, 2024 17:24
@Algiane Algiane added the enhancement New feature or request label Jun 21, 2024
@Algiane Algiane self-assigned this Jun 21, 2024
@rrsettgast rrsettgast requested a review from CusiniM July 8, 2024 21:37
@sframba sframba removed the enhancement New feature or request label Aug 19, 2024
@rrsettgast rrsettgast merged commit becdf06 into master Sep 13, 2024
10 checks passed