Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sierra #4

Merged
merged 36 commits into from
Apr 11, 2024
Merged

Sierra #4

merged 36 commits into from
Apr 11, 2024

Conversation

selamberson
Copy link
Contributor

The changes in this branch are needed to get sierra versions 5.X to build and run with yogimpi. Commits a9bf7ae and de7576f are the most "surprising" and are probably the most important changes for everyone to understand.

Before merging this branch, we need to field test the branch with current Helios and Kestrel. I have Kestrel working on my workstation with gcc and both openmpi and mpich backends. I recommend rebuilding avtools in addition to the final products as some of the function signatures will have changed.

@bpittman, let me know when you have enough infrastructure in place for me to run Jenkins and ATS, and I'll run those.

- The --version argument is now passed down to the "real" mpi wrapper,
  which is then passed on to the serial compiler.
- The mpif77 and mpif90 wrappers are removed, and replaced with the
  mpifort wrapper.
- mpiCC and mpic++ are installed as symlinks to mpicxx
- mpif77 and mpif90 are installed as symlinks to mpifort
- Add missing constant MPI_MAX_LIBRARY_VERSION_STRING
- Add missing constant MPI_ERRCODES_IGNORE
- Add missing function MPI_Comm_spawn
- Added "YogiMPI_VERSION_STR" to yogimpi.h and mpif.h, which should be
  set to the version of YogiMPI used at compile time.
- Added "make dist" target, which will create a yogimpi-<VERSION>.tar.gz
  file, where <VERSION> is determined by $(git describe --tags).
By inspection, there is more to do here.
Before this commit, when client code would pass a copy or delete
callback to MPI_Comm_create_keyval, if the callbacks had any MPI calls
inside them then the code would segfault.  The segfault was due to the
native MPI passing in the native MPI_Comm to the callback, and then
YogiMPI would segfault when tyring to covert the comm to a MPI_Comm
because Yogi did not know the comm was already converted.

This change captures the callback functions and wraps them in lambdas
that un-converts the comm back to a YogiMPI_Comm so it is safe for the
callbacks to contain MPI calls.

There are likely other failure modes associated with user-supplied
callbacks to MPI routines.  Hopefully this will be a good pattern to
repeat if we find similar issues with other callbacks.
This commit wraps almost every single native MPI call with a callDepth
increment/decrement.  Currently this is only used for indenting calls in
the yogimpi.log.* files.

There are a certain class of problems that are hard to diagnose, and the
commonality is that YogiMPI has called the native MPI and then that MPI
has called a callback that was wrapped by YogiMPI.  Adding this
callDepth and indenting the log file based on callDepth makes it much
easier to diagnose the problem.
This commit wraps MPI_User_functions in lambdas, so we can revert the
MPI_Datatype back to a YogiMPI_Datatype before passing it to the
callback so that the callback can be called from the native MPI routines
(e.g. MPI_Reduce, MPI_Scan, etc.).
@rtrigg
Copy link
Contributor

rtrigg commented Jun 22, 2023

I tried with intel and also a pure gnu build on rhel7 with mpich, and all system tests pass. I also built some third party solvers against this and those tests passed, so I'm pretty happy in that regard.

I haven't done any large-scale testing yet, but I can do that next week if you want. Alternatively, I'm okay with approving this after you or @bpittman does that testing.

I'm going to be out Fri-Wed, but if you ping me on here I'll see it and respond.

@selamberson
Copy link
Contributor Author

@rtrigg, thanks. I'd say approve it whenever you are comfortable with it. The changes are significant enough that I don't want to merge it until we have run our ATS with it. We did go ahead and tag it v1.4.0rc1 though so @bpittman doesn't need to rename all his VMs and such if we don't find any problems.

@selamberson
Copy link
Contributor Author

@rtrigg and @bpittman, I have made progress on my yogimpi+sierra work. I have a new rc to test. Please test v1.4.0rc2 and let me know if it works for you. If it works, please "approve" this pull request so github will let me merge it.

@selamberson
Copy link
Contributor Author

@rtrigg and @bpittman, the Kestrel ATS on v1.4.0rc2 indicates that rc2 is ready from my point of view. Can you two please look at the merge request, do whatever testing you need to do, and approve the pull request if it is ready so github will let me merge it.

Copy link
Contributor

@rtrigg rtrigg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

system tests are passing, I'm happy. I converted the README to README.md a while back, so I added a change for make dist

@rtrigg
Copy link
Contributor

rtrigg commented Dec 14, 2023

it really looks like I broke the CI with my update, but I swear I didn't :) I think this is an issue with a change to oneAPI; I'll look into it and report back.

I'll add, this doesn't change my opinion about the merge: I think it's ready. this is more of a build issue in the CI.

Still have more testing to do and a bit more cleanup, but I need to move
my exploration to another machine.
Now only datatypes are using the new insertIntoPoolIfNotExists call.

When using the new call for comms and requests things stop working for
some reason I don't yet understand, so I'm only using the new call for
the datatype which was causing the most issues (I think ...).
@selamberson selamberson merged commit 90d7866 into master Apr 11, 2024
6 checks passed
@selamberson selamberson deleted the sierra branch April 11, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants