axom::Array constructor crash on CUDA. #1432

Closed
BradWhitlock opened this issue Sep 30, 2024 · 5 comments · Fixed by #1434

Comments

@BradWhitlock (Member) commented Sep 30, 2024

Code like the following resulted in Array::Array trying to initialize elements of a device-allocated array using placement new on the host. The code SEGV'd.

using ExecSpace = axom::CUDA_EXEC<256>;
const int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
axom::Array<int> arr(n, n, allocatorID);

This constructor calls initialize() with two arguments, so the third argument takes its default value of true, which requests default-construction of the elements.

initialize(num_elements, capacity);

OpHelper {m_allocator_id, m_executeOnGPU}.init(m_data, 0, num_elements);
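For illustration, here is a minimal, compilable sketch (not axom's actual code; the parameter names are my guesses) of why calling initialize with two arguments ends up requesting default-construction:

#include <iostream>

// Paraphrased shape of the declaration: the third parameter defaults to true.
void initialize(int num_elements, int capacity, bool default_construct = true)
{
  std::cout << "default_construct = " << std::boolalpha << default_construct << "\n";
}

int main()
{
  initialize(10, 10);  // prints "default_construct = true"
}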

I think the root of the problem could be that Array::m_executeOnGPU is not initialized anywhere. Valgrind was logging uninitialized memory in this area and m_executeOnGPU is probably the culprit.

Calling axom::Array(n, n, allocatorID) where allocatorID is a CUDA allocator should not cause a SEGV and it should initialize the data as needed on device.

I was told that ATS might have some bearing here too. (Address Translation Services on these machines lets the host and device share address translation, so host-side access to device memory may not always fault.)

rzansel7{whitlocb}103: detect_ats
rzansel7     ATS detected
@BradWhitlock (Member Author)

I'm trying a fix that sets m_executeOnGPU based on the memory space, inside Array::initialize, Array::initialize_from_other, and one of the constructors that calls neither of those methods. A sketch of the idea follows.
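A minimal sketch of the idea, assuming Umpire backs axom's allocator IDs (it does when axom is built with Umpire). The helper name allocatorIsDevice and the exact wiring into Array are my assumptions, not the actual patch:

#include "umpire/ResourceManager.hpp"
#include "umpire/util/Platform.hpp"

// Hypothetical helper: decide whether an allocator ID refers to GPU memory
// by asking Umpire for the allocator's platform.
static bool allocatorIsDevice(int allocatorID)
{
  auto allocator = umpire::ResourceManager::getInstance().getAllocator(allocatorID);
  const umpire::Platform platform = allocator.getPlatform();
  return platform == umpire::Platform::cuda || platform == umpire::Platform::hip;
}

// Then, in Array::initialize / initialize_from_other (paraphrased):
//   m_executeOnGPU = allocatorIsDevice(m_allocator_id);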

@bmhan12 (Contributor) commented Sep 30, 2024

As mentioned previously, here is the documentation on setting/disabling Address Translation Services (ATS) and on checking whether it is enabled (Point 19): https://lc.llnl.gov/confluence/display/SIERRA/Quickstart+Guide

@rhornung67 (Member)

Thanks @BradWhitlock. We can have @publixsubfan look into this to make sure other issues don't occur.

@BradWhitlock (Member Author) commented Oct 1, 2024

Update: I've had some trouble reproducing the crash on develop. The Array::m_executeOnGPU member is uninitialized, but that does not seem to matter much. When it fails in my branch, it looks like a bad optimization might be at work. I was getting the allocatorID to pass in from execution_space<ExecSpace>::allocatorID(), and in TotalView it appeared that the allocatorID was being optimized out. If I mark it volatile so the value cannot be optimized away, I can see that it returns 3 and everything works normally. The code resembles:

void buildShapeMap(axom::ArrayView<axom::IndexType> &values, axom::ArrayView<axom::IndexType> &ids, int allocatorID)
{
  const axom::IndexType n = /* get the size */;
  values = axom::Array<axom::IndexType>(n, n, allocatorID);
  ids = axom::Array<axom::IndexType>(n, n, allocatorID);
  // Fill values, ids here
}
...

/*volatile*/ int allocatorID = axom::execution_space<ExecSpace>::allocatorID();
axom::ArrayView<axom::IndexType> values, ids;
buildShapeMap(values, ids, allocatorID);

@publixsubfan (Contributor)

Yes, I believe we need to initialize m_executeOnGPU to an appropriate default value. Good catch @BradWhitlock.

But is this happening with CUDA device-only memory? The value of that variable should be immaterial -- we should be passing through to special logic for that case.
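For reference, a hedged sketch of what that special-case path could look like: default-constructing the elements on the device with axom::for_all instead of host-side placement new. axom::for_all and AXOM_LAMBDA are real axom utilities, but this helper and its wiring into Array are assumptions on my part, not the actual internals:

#include <new>
#include "axom/core/Macros.hpp"
#include "axom/core/Types.hpp"
#include "axom/core/execution/for_all.hpp"

// Hypothetical helper: construct n elements of a device-allocated buffer
// from a device kernel, so no host-side placement new touches device memory.
template <typename ExecSpace, typename T>
void defaultConstructOnDevice(T* data, axom::IndexType n)
{
  axom::for_all<ExecSpace>(n, AXOM_LAMBDA(axom::IndexType i) {
    new(&data[i]) T();
  });
}

// e.g. defaultConstructOnDevice<axom::CUDA_EXEC<256>>(m_data, num_elements);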
