
Action with Pangea-3 installation reproduction and ppc64le emulation #257

Merged: 34 commits, Sep 13, 2024

Conversation

@Algiane (Contributor) commented Jan 29, 2024

New job that:

  • emulates a ppc64le architecture (using docker/setup-qemu-action, which relies on QEMU through the qemu-user-static image);

  • deploys an AlmaLinux-8 image on which the TPLs' dependencies are installed, matching the Pangea-3 modules needed to build the TPLs:

    • CMake-3.26
    • gcc-9.4.0
    • ompi-4.1.2
    • cuda-11.5.0
    • openblas-0.3.18
    • lsf-10.1

    The Dockerfile used to build this image is provided in docker/TotalEnergies/Pangea3-base.Dockerfile; the image is available on my DockerHub account under the name pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18 with tag 4: 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4.

  • adds a docker/TotalEnergies/Pangea3.Dockerfile file that builds the ppc64le docker image with the TPLs built and installed for GEOS;

  • adds a RUNS_ON variable to the job matrix to allow the use of different runners (a self-hosted runner more powerful than the default GitHub runners is needed because of the slowdown introduced by the emulation layer);

  • removes the push step from the docker_build_and_push.sh script and renames this script to docker_build.sh;

  • moves the Docker authentication to just before the image push and no longer logs out on streak2: this fixes "access denied" errors when pushing images, caused by a race condition between jobs (if two jobs run at the same time on the machine, one job may remove the login credentials between the moment the other logs in to Docker and the moment it attempts to push its image);

  • adds a dedicated step for the docker push command (see the workflow sketch after this list).
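
A minimal sketch of how these pieces could fit together in the workflow; the job name, matrix values, script path, secret names and environment variables are illustrative assumptions, not copied from the actual workflow file:

```yaml
# Hypothetical excerpt of the ppc64le TPL job; names and values are
# illustrative, not the exact ones from this PR.
jobs:
  build_tpls_ppc64le:
    strategy:
      matrix:
        include:
          - RUNS_ON: ubuntu-22.04   # default GitHub-hosted runner
          - RUNS_ON: streak2        # more powerful self-hosted runner
    runs-on: ${{ matrix.RUNS_ON }}
    steps:
      - uses: actions/checkout@v4

      # Register QEMU binfmt handlers (via the qemu-user-static image)
      # so ppc64le containers run transparently on an amd64 host.
      - uses: docker/setup-qemu-action@v3
        with:
          platforms: ppc64le

      # Build only; pushing is now a separate, dedicated step.
      - name: Build TPL image
        run: ./docker_build.sh

      # Log in immediately before pushing, and do not log out, to avoid
      # the credential race between concurrent jobs on the same machine.
      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Push TPL image
        run: docker push "${DOCKER_REPOSITORY}:${TAG}"
```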

Linked to EPIC TTE Builds and Geos PR 3159

  - rename the DOCKERFILE variable to TPL_DOCKERFILE to avoid conflicts with the run-on-arch-action variable
  - call a ppc64le ubuntu20 image to install docker, and call docker build to build the suitable TPL image
@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 1620a6a to 17a52a9 on January 29, 2024 18:12
@Algiane Algiane marked this pull request as draft February 26, 2024 15:22
@Algiane (Contributor, Author) commented Feb 26, 2024

Preliminary Remarks

  • for now, the Docker images and associated Dockerfiles have been produced as a PoC and no attempt was made to reduce their sizes, which are pretty large (~4 GB compressed on DockerHub, ~9 GB once the image is run).
    The CUDA install alone, for example, uses a lot of storage (~4 GB); I guess it can be reduced by copying only the needed library files.

  • the Pangea-3 job is a draft too, and a little time (~5-10 min) could be saved by creating a specific image dedicated to the uraimo job (we need a Linux OS running on the ppc64le arch with Docker available). For now, Docker is installed by the job on the very light ubuntu image provided by uraimo, roughly as sketched below.
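
A minimal sketch of that uraimo step, assuming the action's documented inputs; the install/run commands here are illustrative, not the job's exact ones:

```yaml
# Hypothetical sketch: run the TPL docker build inside an emulated
# ppc64le ubuntu container provided by run-on-arch-action.
- uses: uraimo/run-on-arch-action@v2
  with:
    arch: ppc64le
    distro: ubuntu20.04
    githubToken: ${{ github.token }}  # lets the action cache its base container
    # Docker is absent from the very light base image, so the job
    # installs it; a dedicated pre-built image would skip this step.
    install: |
      apt-get update -y
      apt-get install -y docker.io
    run: |
      docker build -f docker/TotalEnergies/Pangea3.Dockerfile .
```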

Job Failure

  • Compilation fails due to the time limit (the TPL compilation takes too much time).

Evaluation of emulation layer slowdown

  1. qemu layer: slowdown by a factor of about 14.
    The slowdown linked to the qemu-user-static emulation layer is evaluated by comparing the compilation times of the finiteElement library of the GEOS repository, on 4 cores with 32 GB of memory (methodology sketched after this list):

    • without qemu: real 2m11.743s - user 4m22.582s
    • with qemu: real 27m58.993s - user 73m23.212s
  2. uraimo/run-on-arch-action: slowdown by a factor of about 15 (no particular degradation compared to the bare qemu layer).
    The slowdown linked to run-on-arch-action has been evaluated on an external code without any dependencies (for the sake of simplicity, since it removes the need to build a suitable docker image for the target architecture). Test results:

    • without run-on-arch emu: real 1m20 - user 1m13
    • with run-on-arch emu: real 20m32.581s - user 20m3.864s
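
The comparison boils down to timing the same build natively and under emulation, roughly along the following lines; mount points and make targets are illustrative assumptions:

```bash
# Baseline: native amd64 build of the finiteElement library.
time make -C geos/build -j 4 finiteElement

# Register the qemu-user-static binfmt handlers (docker/setup-qemu-action
# does the equivalent in CI), then repeat the same build under emulation.
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
time docker run --rm --platform linux/ppc64le \
  -v "$PWD/geos":/geos \
  7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:4 \
  make -C /geos/build -j 4 finiteElement
```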

Perspectives

Even if we manage to build the TPLs within the allotted time, the GEOS CUDA build is a lot slower (~73 min on 4 cores in Debug mode and ~100 min in Release), so it will not be possible to use the emulation layer as-is.

Nevertheless, we can list some improvement paths for the current PR and the TPL build.

From a more global perspective (GEOS project), thanks to @sframba's propositions:

  • we may attempt to cross-compile the TPLs and GEOS for the target arch and to run the unit tests using a self-hosted runner with GPUs and the emulation layer (the tests take about 2 or 3 minutes in Release mode). Note that it may be a little tricky to deploy (I am no expert in cross-compilation, so I don't yet see how we would ensure the suitable versions of the various needed shared libs);

  • we may try to get a ppc64le host to run our tests. It seems that this is not natively provided by GitHub (see the related doc), but some workarounds exist (see GitHub: Self-hosted runners on ppc64le architectures, or the list of self-hosted GitHub Action runners).

@Algiane (Contributor, Author) commented Mar 4, 2024

@sframba @TotoGaz: you can read the PR comments if you are interested in feedback on this work, which you initiated with @XL64.

@TotoGaz (Contributor) commented Mar 5, 2024

Hello @Algiane, thank you for your comments.

The timing issue is surely something to keep in mind, but before getting to this, I'd like to get a little more information about the process.

  • Using qemu, are you able to compile a ppc executable that runs on P3?
  • Same question, but with some simple CUDA program. Can you compile it and run it on P3?

@Algiane (Contributor, Author) commented Mar 11, 2024

Hi @TotoGaz ,

On Pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built.
I don't know how to verify that it really uses the GPUs, but running the same test on the P3_USERS_CPU queue fails with the 'no CUDA-capable device' error.

The geos TPLs and geos binary have been built:

  • on amd64;
  • with the deployment of the qemu-user-static docker image for the emulation layer;
  • using the 7g8efcehpff/pangea-almalinux8-gcc9.4-openmpi4.1.2-cuda11.5.0-openblas0.3.18:3 ppc64le docker image.

For now, testing the executable on P3 involves a few manual tweaks (sketched after this list). I:

  • copied the TPL install directory, the lvarray shared library created by the geos build, and the geos binary to P3;
  • created a symlink from the python3.6 library to the python3.8 one (I didn't pay attention to the python version in my docker image; of course it is not the same as on P3);
  • exported the suitable paths to the TPL libraries in my LD_LIBRARY_PATH;
  • loaded the suitable gcc, cuda, ompi and openblas modules.
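
Roughly, that manual setup looks like the following; every path and module name here is a hypothetical stand-in for illustration:

```bash
# Load modules matching the toolchain baked into the docker image
# (hypothetical module names).
module load gcc/9.4.0 cuda/11.5.0 ompi/4.1.2 openblas/0.3.18

# Point the loader at the copied TPL installs and the lvarray library
# (hypothetical locations).
export LD_LIBRARY_PATH="$HOME/tpl-install/lib:$HOME/geos-libs:$LD_LIBRARY_PATH"

# The binary was linked against the image's python (3.6) while P3
# provides 3.8: alias the expected soname to the available library
# (hypothetical paths).
ln -s /usr/lib64/libpython3.8.so.1.0 "$HOME/geos-libs/libpython3.6m.so.1.0"
```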

Please let me know if you need more tests.

Best

@TotoGaz (Contributor) commented Mar 11, 2024

> On Pangea III, I can run the acous3D_abc_smoke.xml test case with the geos binary I built. I don't know how to verify that it really uses the GPUs, but running the same test on the P3_USERS_CPU queue fails with the 'no CUDA-capable device' error.

For that specific purpose, you can run geos with the --trace-data-migration command-line option ("Trace host-device data migration"). You'll be able to see data moving to and from the device.
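
For instance, assuming the usual -i input flag and the test case mentioned above:

```bash
# Host<->device copies show up in the log, confirming the GPU is used.
geos -i acous3D_abc_smoke.xml --trace-data-migration
```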

@TotoGaz (Contributor) commented Mar 11, 2024

@Algiane Is it fair to state that the issue is now really a timing issue? That if we had a very, very powerful machine, it would work OK?

Cross-compiling can be very challenging. Furthermore, cross-compiling the TPLs means cross-compiling ~20 libs with their sometimes clunky build systems, and CUDA on top of that. I do not know how to manage that; it would require a lot of dedication, to say the least.

@Algiane (Contributor, Author) commented Mar 11, 2024

Thanks for the --trace-data-migration tip: it confirms that some LvArrays are moved to/from the GPUs.

For me, with this method we have 2 issues:

    1. the compilation time;
    2. the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit, and I think that the base image (the image with the copy of the pangea modules needed to build the TPLs, but without the TPLs built) is not far behind. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home directory to avoid the 'no space left on device' error.

For now, as the emulation seems to be a dead end but we still don't have a solution to test the P3 configuration, I will leave this PR as a draft and look into connecting a ppc64le runner to GitHub Actions as a self-hosted runner: it could be an alternative if we can buy a small ppc64le machine.

Best

@TotoGaz (Contributor) commented Mar 11, 2024

> 1. the compilation time;

We have a powerful self-hosted machine. Do you think that could do it?

> 2. the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit, and I think that the base image (the image with the copy of the pangea modules needed to build the TPLs, but without the TPLs built) is not far behind. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home directory to avoid the 'no space left on device' error.

I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4 GB (still very big, but half). Do you know what gets it so big? We make heavy use of the multi-stage approach to remove the temporaries. Are you doing the same?
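
(For context, a minimal sketch of that multi-stage pattern; the base image, stage names, paths and build script are illustrative:)

```dockerfile
# Hypothetical multi-stage layout: compile in a throwaway stage, keep
# only the install tree in the final image.
FROM almalinux:8 AS builder
COPY . /tmp/src
RUN /tmp/src/build_tpls.sh --install-prefix=/opt/tpls   # hypothetical script

FROM almalinux:8
# Sources, build trees and other temporaries from the builder stage are
# dropped; only the install prefix is copied into the final image.
COPY --from=builder /opt/tpls /opt/tpls
```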

Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?

@Algiane (Contributor, Author) commented Mar 11, 2024

> > 1. the compilation time;
>
> We have a powerful self-hosted machine. Do you think that could do it?

Maybe: it depends on the time needed to build the TPLs and GEOS on this machine. Multiplying these times by ~15 gives an order of magnitude of the times needed with the emulation layer (for example, a 30 min native build would take roughly 7.5 hours emulated).

> > 2. the size of the docker images: the image with the pre-built TPLs is very close to the 10 GB limit, and I think that the base image (the image with the copy of the pangea modules needed to build the TPLs, but without the TPLs built) is not far behind. The cuda module alone is already more than 4 GB. Finally, it was not possible to work directly inside the containers and I had to mount my home directory to avoid the 'no space left on device' error.
>
> I'm surprised that this gets so big. E.g. https://hub.docker.com/r/geosx/pecan-gpu-gcc8.2.0-openmpi4.0.1-mkl2019.5-cuda11.5.119/tags is ~4.4 GB (still very big, but half). Do you know what gets it so big? We make heavy use of the multi-stage approach to remove the temporaries. Are you doing the same?
>
> Also, if we manage to run it on a comfortable self-hosted machine, would the size issue become secondary?

I get about the same size for the image on DockerHub, but that is the compressed size: once pulled, for example, the pecan-gpu image is about 10.8 GB, and I quickly run out of space.
This is less annoying than the time issue (as it is possible to work in an external volume).

@Algiane (Contributor, Author) commented Mar 12, 2024

@sframba: I have tested connecting a ppc64le self-hosted runner to GitHub Actions using a non-official runner (https://github.com/ChristopherHX/github-act-runner). It worked smoothly for a simple script execution; a sketch of the registration flow is below.
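
A sketch of what registering such a runner could look like; the flags shown follow the official runner's config.sh conventions and are assumptions here, so the exact, current syntax should be taken from the github-act-runner README:

```bash
# Hypothetical registration of a ppc64le self-hosted runner.
./github-act-runner configure \
  --url https://github.com/<org>/<repo> \
  --token <registration-token> \
  --labels self-hosted,ppc64le
./github-act-runner run
```

A workflow could then target it with runs-on: [self-hosted, ppc64le].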

@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 3a5d6ba to 8701d91 on June 4, 2024 14:05
@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 702f8be to fc924a5 on June 18, 2024 11:27
@Algiane Algiane marked this pull request as ready for review June 21, 2024 17:24
@Algiane Algiane added the enhancement New feature or request label Jun 21, 2024
@Algiane Algiane self-assigned this Jun 21, 2024
@rrsettgast rrsettgast requested a review from CusiniM July 8, 2024 21:37
@sframba sframba removed the enhancement New feature or request label Aug 19, 2024
@rrsettgast rrsettgast merged commit becdf06 into master Sep 13, 2024
10 checks passed