axom::Array constructor crash on CUDA. #1432

Closed
BradWhitlock opened this issue Sep 30, 2024 · 5 comments · Fixed by #1434

Comments

@BradWhitlock (Member) commented Sep 30, 2024

Code like the following resulted in Array::Array trying to initialize elements of a device-allocated array using placement new on the host. The code SEGV'd.

using ExecSpace = axom::CUDA_EXEC<256>;
const int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
axom::Array<int> arr(n, n, allocatorID);

This constructor calls initialize() with two arguments, so the third argument takes its default value of true, which requests default-construction of the elements.

initialize(num_elements, capacity);

OpHelper {m_allocator_id, m_executeOnGPU}.init(m_data, 0, num_elements);
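For illustration, here is a minimal, compilable sketch (not axom's actual code; the parameter names are my guesses) of why calling initialize with two arguments ends up requesting default-construction:

#include <iostream>

// Paraphrased shape of the declaration: the third parameter defaults to true.
void initialize(int num_elements, int capacity, bool default_construct = true)
{
  std::cout << "default_construct = " << std::boolalpha << default_construct << "\n";
}

int main()
{
  initialize(10, 10);  // prints "default_construct = true"
}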

I think the root of the problem could be that Array::m_executeOnGPU is not initialized anywhere. Valgrind was logging uninitialized memory in this area and m_executeOnGPU is probably the culprit.

Calling axom::Array(n, n, allocatorID) where allocatorID is a CUDA allocator should not cause a SEGV and it should initialize the data as needed on device.

I was told that ATS might have some bearing here too. (Address Translation Services on these machines lets the host and device share address translation, so host-side access to device memory may not always fault.)

rzansel7{whitlocb}103: detect_ats
rzansel7     ATS detected
@BradWhitlock (Member Author)

I'm trying a fix that sets m_executeOnGPU based on the memory space, inside Array::initialize, Array::initialize_from_other, and one of the constructors that calls neither of those methods. A sketch of the idea follows.
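A minimal sketch of the idea, assuming Umpire backs axom's allocator IDs (it does when axom is built with Umpire). The helper name allocatorIsDevice and the exact wiring into Array are my assumptions, not the actual patch:

#include "umpire/ResourceManager.hpp"
#include "umpire/util/Platform.hpp"

// Hypothetical helper: decide whether an allocator ID refers to GPU memory
// by asking Umpire for the allocator's platform.
static bool allocatorIsDevice(int allocatorID)
{
  auto allocator = umpire::ResourceManager::getInstance().getAllocator(allocatorID);
  const umpire::Platform platform = allocator.getPlatform();
  return platform == umpire::Platform::cuda || platform == umpire::Platform::hip;
}

// Then, in Array::initialize / initialize_from_other (paraphrased):
//   m_executeOnGPU = allocatorIsDevice(m_allocator_id);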

@bmhan12 (Contributor) commented Sep 30, 2024

As mentioned previously, here is the documentation on setting/disabling Address Translation Services (ATS) and on checking whether it is enabled (Point 19): https://lc.llnl.gov/confluence/display/SIERRA/Quickstart+Guide

@rhornung67 (Member)

Thanks @BradWhitlock. We can have @publixsubfan look into this to make sure other issues don't occur.

@BradWhitlock (Member Author) commented Oct 1, 2024

Update: I've had some trouble reproducing the crash on develop. The Array::m_executeOnGPU member is uninitialized, but that does not seem to matter much. When it fails in my branch, it looks like a bad optimization might be at work. I was getting the allocatorID to pass in from execution_space<ExecSpace>::allocatorID(), and in TotalView it appeared that the allocatorID was being optimized out. If I mark it volatile so the value cannot be optimized away, I can see that it returns 3 and everything works normally. The code resembles:

void buildShapeMap(axom::ArrayView<axom::IndexType> &values, axom::ArrayView<axom::IndexType> &ids, int allocatorID)
{
  const axom::IndexType n = /* get the size */;
  values = axom::Array<axom::IndexType>(n, n, allocatorID);
  ids = axom::Array<axom::IndexType>(n, n, allocatorID);
  // Fill values, ids here
}
...

/*volatile*/ int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
axom::ArrayView<axom::IndexType> values, ids;
buildShapeMap(values, ids, allocatorID);

@publixsubfan (Contributor)

Yes, I believe we need to initialize m_executeOnGPU to an appropriate default value. Good catch @BradWhitlock.

But is this happening with CUDA device-only memory? The value of that variable should be immaterial -- we should be passing through to special logic for that case.
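For reference, a hedged sketch of what that special-case path could look like: default-constructing the elements on the device with axom::for_all instead of host-side placement new. axom::for_all and AXOM_LAMBDA are real axom utilities, but this helper and its wiring into Array are assumptions on my part, not the actual internals:

#include <new>
#include "axom/core/Macros.hpp"
#include "axom/core/Types.hpp"
#include "axom/core/execution/for_all.hpp"

// Hypothetical helper: construct n elements of a device-allocated buffer
// from a device kernel, so no host-side placement new touches device memory.
template <typename ExecSpace, typename T>
void defaultConstructOnDevice(T* data, axom::IndexType n)
{
  axom::for_all<ExecSpace>(n, AXOM_LAMBDA(axom::IndexType i) {
    new(&data[i]) T();
  });
}

// e.g. defaultConstructOnDevice<axom::CUDA_EXEC<256>>(m_data, num_elements);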
