Kab163/update readmes #16
Merged 9 commits on Aug 14, 2023
3 changes: 2 additions & 1 deletion Intro_Tutorial/README.md
@@ -10,5 +10,6 @@ You can find lessons in the lessons subdirectory. Each lesson has a README file
which will introduce new concepts and provide instructions to move forward.

Each lesson builds upon the previous one, so if you get stuck, you can look at
the next lesson to see the complete code.
the next lesson to see the complete code. Additionally, some lessons have a
`solutions` folder with a provided solution.

2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/02/README.md
@@ -12,7 +12,7 @@ Additionally, since we have configured this project to use CUDA, BLT provides a
`cuda` target to ensure that executables will be built with CUDA support.

The `blt_add_executable` macro has another argument, `DEPENDS_ON`, that you can
use to list dependendencies.
use to list dependencies.

```
blt_add_executable(
5 changes: 5 additions & 0 deletions Intro_Tutorial/lessons/03/README.md
@@ -23,10 +23,15 @@ The Allocator class provides methods for allocating and deallocating memory. You
can view these methods in the Umpire source code documentation here:
https://umpire.readthedocs.io/en/develop/doxygen/html/classumpire_1_1Allocator.html

To use an Umpire allocator, use the following code, replacing "size in bytes" with
the desired size for your allocation:

```
void* memory = allocator.allocate(size in bytes);
```

Don't forget to deallocate your memory afterwards!
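Putting the pieces together, a minimal allocate/deallocate cycle might look like this sketch (assuming the `"HOST"` resource and a 100-element `double` buffer; the names are illustrative):

```cpp
#include "umpire/ResourceManager.hpp"

int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  auto allocator = rm.getAllocator("HOST");

  // allocate() takes a size in bytes: room for 100 doubles here
  void* memory = allocator.allocate(100 * sizeof(double));

  // ... use the memory ...

  allocator.deallocate(memory);  // don't forget!
  return 0;
}
```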

For more details, you can check out the Umpire documentation:
https://umpire.readthedocs.io/en/develop/sphinx/tutorial/allocators.html

7 changes: 2 additions & 5 deletions Intro_Tutorial/lessons/04/README.md
@@ -24,11 +24,11 @@ The lambda expression needs to take one argument, the loop index:
[=](int i) { // loop body }
```

the `[=]` syntax tells the lambda to capture arguments by value (e.g. create a
The `[=]` syntax tells the lambda to capture arguments by value (e.g. create a
copy, rather than a reference).

The `EXEC_POLICY` template argument controls how the loop will be executed. In
this example, we will use the `RAJA::loop_exec` policy to execute this loop on
this example, we will use the `RAJA::seq_exec` policy to execute this loop on
the CPU. In later lessons, we will learn about other policies that allow us to
run code on a GPU.

@@ -46,6 +46,3 @@ Address of data:
data[50] = 50
```




2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/05/CMakeLists.txt
@@ -1,4 +1,4 @@
blt_add_executable(
NAME five
SOURCES five.cpp
DEPENDS_ON RAJA umpire cuda)
DEPENDS_ON RAJA umpire)
7 changes: 5 additions & 2 deletions Intro_Tutorial/lessons/05/README.md
@@ -29,8 +29,11 @@ https://raja.readthedocs.io/en/develop/sphinx/user_guide/feature/policies.html#r
The second parameter, the `TYPE` parameter, is just the data type of the
variable, such as `int`.

In the file `five.cpp`, follow the instruction in the `TODO` comment. Once
you have filled in the correct reduction statement, compile and run:
In the file `five.cpp`, follow the instructions in the `TODO` comments to create
a RAJA reduction using `seq_exec`.
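The general shape of a sequential RAJA reduction is sketched below (assuming the `seq_reduce` policy; the helper and variable names are illustrative, and the lesson's solution file has the exact code):

```cpp
#include "RAJA/RAJA.hpp"

// Sketch: sum the elements of an array with seq_exec / seq_reduce.
double array_sum(const double* a, int N)
{
  RAJA::ReduceSum<RAJA::seq_reduce, double> total(0.0);

  RAJA::forall<RAJA::seq_exec>(RAJA::TypedRangeSegment<int>(0, N),
    [=](int i) {
      total += a[i];  // each iteration contributes to the reduction
    });

  return total.get();  // read the reduced value back out
}
```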


Once you have filled in the correct reduction statement, compile and run:

```
$ make five
6 changes: 5 additions & 1 deletion Intro_Tutorial/lessons/05/five.cpp
@@ -15,14 +15,18 @@ int main()
a = static_cast<double*>(allocator.allocate(N*sizeof(double)));
b = static_cast<double*>(allocator.allocate(N*sizeof(double)));

// TODO: Change this dot variable to instead use a RAJA reduction
// TODO: to calculate and output the dot product of a and b
double dot{0.0};

RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a[i] = 1.0;
b[i] = 1.0;
}
);

// TODO: use a reduction to calculate and output the dotproduct of a and b
std::cout << "dot product is "<< dot << std::endl;

allocator.deallocate(a);
allocator.deallocate(b);
2 changes: 0 additions & 2 deletions Intro_Tutorial/lessons/05/solution/five_solution.cpp
@@ -25,8 +25,6 @@ int main()
}
);

// TODO: use a reduction to calculate and output the dotproduct of a and b

std::cout << "dot product is "<< dot << std::endl;

allocator.deallocate(a);
23 changes: 21 additions & 2 deletions Intro_Tutorial/lessons/06/README.md
@@ -22,8 +22,22 @@ The predefined names can include:
In this example, you can use the "UM" resource so that the data can be accessed
by the CPU or GPU.

There is a `TODO` comment in the `six.cpp` exercise file where you
can modify the code to allocate GPU memory. When you are done, build
You will also find that we are adjusting the `RAJA::forall` to now work on the GPU.
In order for this to happen, we need a few extra things. First, we create a
`CUDA_BLOCK_SIZE` variable to tell RAJA how big we want our CUDA blocks to be.
Since a GPU warp contains 32 threads, a block size that is a multiple of 32,
such as 128, 256, or 512, is a good choice; 256 tends to work well, but the
best value depends on your GPU.

Additionally, the `RAJA::forall` needs the CUDA execution policy. More on GPU
execution policies can be found here: https://raja.readthedocs.io/en/develop/sphinx/user_guide/feature/policies.html#gpu-policies-for-cuda-and-hip

The `cuda_exec` policy takes the CUDA block size we created before as a
template parameter. Finally, the lambda passed to the `RAJA::forall` must be
marked as a device function so that it can execute on the GPU. This can be
done with the `__device__` keyword directly, or portably with the
`RAJA_DEVICE` macro.
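Put together, the pattern looks roughly like this sketch (not the exercise solution; `d_a` stands for an array already allocated in GPU-accessible memory):

```cpp
#include "RAJA/RAJA.hpp"

constexpr std::size_t CUDA_BLOCK_SIZE{256};

// Sketch: set every element of d_a to 1.0 on the GPU.
void fill_on_gpu(double* d_a, int N)
{
  RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(
    RAJA::TypedRangeSegment<int>(0, N),
    [=] RAJA_DEVICE (int i) {   // RAJA_DEVICE marks the lambda for the GPU
      d_a[i] = 1.0;
    });
}
```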

There are several `TODO` comments in the `six.cpp` exercise file where you
can modify the code to work on a GPU. When you are done, build
and run the example:

```
@@ -33,3 +47,8 @@ $ ./bin/six

For more information on Umpire's resources, see our documentation:
https://umpire.readthedocs.io/en/develop/index.html

You can also read more about RAJA foralls and kernels here:
https://raja.readthedocs.io/en/develop/sphinx/user_guide/tutorial/add_vectors.html?highlight=RAJA_DEVICE#basic-loop-execution-vector-addition
and
https://raja.readthedocs.io/en/develop/sphinx/user_guide/tutorial/dot_product.html#raja-variants
21 changes: 15 additions & 6 deletions Intro_Tutorial/lessons/06/six.cpp
@@ -3,10 +3,13 @@
#include "RAJA/RAJA.hpp"
#include "umpire/Umpire.hpp"

#if defined(COMPILE)

int main()
{
constexpr int N{10000};
constexpr std::size_t CUDA_BLOCK_SIZE{256};
//TODO: Set up a block size value
constexpr std::size_t CUDA_BLOCK_SIZE{????};
double* a{nullptr};
double* b{nullptr};

@@ -17,18 +20,22 @@ int main()
a = static_cast<double*>(allocator.allocate(N*sizeof(double)));
b = static_cast<double*>(allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(
RAJA::TypedRangeSegment<int>(0, N), [=] RAJA_DEVICE (int i) {
//TODO: fill in the forall statement with the CUDA execution policy
//TODO: and its block size argument. Then be sure to use RAJA_DEVICE
RAJA::forall<????? <?????> >(
RAJA::TypedRangeSegment<int>(0, N), [=] ?????? (int i) {
a[i] = 1.0;
b[i] = 1.0;
}
);

double dot{0.0};
RAJA::ReduceSum<RAJA::cuda_reduce, double> cudot(0.0);
//TODO: create a RAJA::ReduceSum with cuda_reduce called "cudot" for the GPU

RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::TypedRangeSegment<int>(0, N),
[=] RAJA_DEVICE (int i) {
//TODO: fill in the forall statement with the CUDA execution policy
//TODO: and its block size argument. Then be sure to use RAJA_DEVICE
RAJA::forall<?????<????>>(RAJA::TypedRangeSegment<int>(0, N),
[=] ???? (int i) {
cudot += a[i] * b[i];
});

@@ -38,4 +45,6 @@ int main()

allocator.deallocate(a);
allocator.deallocate(b);
#endif
return 0;
}
@@ -11,7 +11,6 @@ int main()
double* b{nullptr};

auto& rm = umpire::ResourceManager::getInstance();
// TODO: allocate with device unified memory
auto allocator = rm.getAllocator("UM");

a = static_cast<double*>(allocator.allocate(N*sizeof(double)));
@@ -38,4 +37,6 @@ int main()

allocator.deallocate(a);
allocator.deallocate(b);

return 0;
}
14 changes: 14 additions & 0 deletions Intro_Tutorial/lessons/07/README.md
@@ -3,6 +3,20 @@
In this lesson, you will learn how to use Umpire's operations to copy data
between CPU and GPU memory in a portable way.

In `seven.cpp`, we create an allocator for the GPU with:
```
auto allocator = rm.getAllocator("DEVICE");
```

and a separate allocator on the CPU with:

```
auto host_allocator = rm.getAllocator("HOST");
```

We will initialize the data on the CPU, but we want to do the computation on
the GPU, so we need some of Umpire's "Operations" to move data between the two.
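For example, a host-to-device copy with the `ResourceManager` might look like this sketch (illustrative names; `rm.copy` looks up each pointer's allocator, so no copy direction needs to be given):

```cpp
#include "umpire/ResourceManager.hpp"

int main()
{
  constexpr int N{100};
  auto& rm = umpire::ResourceManager::getInstance();

  auto host_allocator = rm.getAllocator("HOST");
  auto allocator      = rm.getAllocator("DEVICE");

  double* a_h = static_cast<double*>(host_allocator.allocate(N * sizeof(double)));
  double* a_d = static_cast<double*>(allocator.allocate(N * sizeof(double)));

  // ... initialize a_h on the CPU ...

  rm.copy(a_d, a_h);  // destination first, then source

  host_allocator.deallocate(a_h);
  allocator.deallocate(a_d);
  return 0;
}
```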

Umpire provides a number of operations implemented as methods on the
`ResourceManager`. These typically take pointer and size arguments, but you do
not need to tell Umpire which Allocator each pointer came from. Umpire keeps
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/07/seven.cpp
@@ -21,7 +21,7 @@ int main()
a_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
@@ -21,7 +21,7 @@ int main()
a_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/08/README.md
@@ -23,7 +23,7 @@ This newly created `pool` is an `umpire::Allocator` using the `QuickPool` strate
in: (1) the name we would like the pool to have, and (2) the allocator we previously created with the `ResourceManager` (see line 17 in the
file `eight.cpp`).

There are other arguments that could be passed to the pool constructor if needed. See the documentation page for more: https://umpire.readthedocs.io/en/develop/doxygen/html/index.html
There are other arguments that can be passed to the pool constructor if needed. These additional optional arguments are more advanced and beyond the scope of this tutorial, but you can visit the documentation page for more: https://umpire.readthedocs.io/en/develop/doxygen/html/index.html
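With only the two required arguments, creating and using the pool might look like this sketch (the pool name and sizes are illustrative):

```cpp
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"

int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  auto allocator = rm.getAllocator("HOST");

  // (1) the pool's name, (2) the underlying allocator it draws from
  auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("POOL", allocator);

  // The pool is itself an umpire::Allocator.
  void* data = pool.allocate(1024);
  pool.deallocate(data);
  return 0;
}
```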

When you have created your QuickPool allocator, uncomment the COMPILE define on line 7;
then compile and run the code:
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/08/eight.cpp
@@ -29,7 +29,7 @@ int main()
a_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/08/solution/eight_solution.cpp
@@ -26,7 +26,7 @@ int main()
a_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
4 changes: 3 additions & 1 deletion Intro_Tutorial/lessons/09/README.md
@@ -34,7 +34,9 @@ where `data` is a `double*`, and `N` is the size of each dimension. The size of
`data` should be at least `N*N`.

In the file `nine.cpp`, there is a `TODO` comment where you should create three
views, A, B, and C. When you are ready, uncomment the COMPILE define on line 7;
views, A, B, and C. Notice that we are performing the same dot product
calculation as before, but now for each row-column pair of two matrices; in
other words, a matrix multiplication. When you are ready, uncomment the COMPILE define on line 7;
then you can compile and run the code:

```
8 changes: 4 additions & 4 deletions Intro_Tutorial/lessons/09/nine.cpp
@@ -30,15 +30,15 @@ int main()
// TODO: Create a view for A, B, and C
constexpr int DIM = 2;

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
A(row, col) = row;
B(row, col) = col;
});
});

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += A(row, k) * B(k, col);
8 changes: 4 additions & 4 deletions Intro_Tutorial/lessons/09/solution/nine_solution.cpp
@@ -29,15 +29,15 @@ int main()
RAJA::View<double, RAJA::Layout<DIM>> B(b, N, N);
RAJA::View<double, RAJA::Layout<DIM>> C(c, N, N);

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
A(row, col) = row;
B(row, col) = col;
});
});

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += A(row, k) * B(k, col);
45 changes: 45 additions & 0 deletions Intro_Tutorial/lessons/10/README.md
@@ -6,6 +6,47 @@ The previous lesson used multiple `RAJA::forall` calls, nested inside each
other, to implement a matrix multiplication. This pattern will work when
executing on the CPU, but not on a GPU. It is also less efficient.

That's why RAJA provides the `RAJA::kernel` functionality. We can create a
`RAJA::KernelPolicy` to describe the layout of our nested loops. For example,
this triply nested loop on the CPU:

```
for (int k = kmin; k < kmax; ++k) {
for (int j = jmin; j < jmax; ++j) {
for (int i = imin; i < imax; ++i) {
printf( " (%d, %d, %d) \n", i, j, k);
}
}
}
```

will require a kernel policy and kernel like this:

```
using KJI_EXECPOL = RAJA::KernelPolicy<
RAJA::statement::For<2, RAJA::seq_exec, // k
RAJA::statement::For<1, RAJA::seq_exec, // j
RAJA::statement::For<0, RAJA::seq_exec, // i
RAJA::statement::Lambda<0>
>
>
>
>;

RAJA::kernel<KJI_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
[=] (int i, int j, int k) {
printf( " (%d, %d, %d) \n", i, j, k);
});
```

where `IRange`, `JRange`, and `KRange` are simply defined as:

```
RAJA::TypedRangeSegment<int> KRange(0, kmax);
RAJA::TypedRangeSegment<int> JRange(0, jmax);
RAJA::TypedRangeSegment<int> IRange(0, imax);
```

Take a look at the RAJA documentation for a detailed explanation of the
`RAJA::kernel` method:
https://raja.readthedocs.io/en/develop/sphinx/user_guide/tutorial/kernel_nested_loop_reorder.html
@@ -19,6 +60,10 @@ If you are stuck, you can reference the matrix-multiply example in the RAJA
repository:
https://github.com/LLNL/RAJA/blob/develop/examples/tut_matrix-multiply.cpp

Keep in mind that this matrix multiplication lesson builds upon the previous
dot product lessons: each entry of the result matrix is the dot product of a
row of the first matrix with a column of the second. The `RAJA::View` objects
help us see this connection better.

When you have finished making your changes, uncomment the COMPILE define on line 7;
then compile and run the code:

6 changes: 3 additions & 3 deletions Intro_Tutorial/lessons/10/solution/ten_solution.cpp
@@ -17,7 +17,7 @@ int main()
auto& rm = umpire::ResourceManager::getInstance();

auto allocator = rm.getAllocator("HOST");
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("POOL", allocator);
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("myPOOL", allocator);

a = static_cast<double *>(pool.allocate(N*N*sizeof(double)));
b = static_cast<double *>(pool.allocate(N*N*sizeof(double)));
@@ -34,8 +34,8 @@ int main()
// TODO: initialization loop
using EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::loop_exec, // row
RAJA::statement::For<0, RAJA::loop_exec, // col
RAJA::statement::For<1, RAJA::seq_exec, // row
RAJA::statement::For<0, RAJA::seq_exec, // col
RAJA::statement::Lambda<0>
>
>