Kab163/update readmes #16
Merged 9 commits on Aug 14, 2023
3 changes: 2 additions & 1 deletion Intro_Tutorial/README.md
@@ -10,5 +10,6 @@ You can find lessons in the lessons subdirectory. Each lesson has a README file
which will introduce new concepts and provide instructions to move forward.

Each lesson builds upon the previous one, so if you get stuck, you can look at
the next lesson to see the complete code.
the next lesson to see the complete code. Additionally, some lessons have a
`solutions` folder with a provided solution.

2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/02/README.md
@@ -12,7 +12,7 @@ Additionally, since we have configured this project to use CUDA, BLT provides a
`cuda` target to ensure that executables will be built with CUDA support.

The `blt_add_executable` macro has another argument, `DEPENDS_ON`, that you can
use to list dependendencies.
use to list dependencies.

```
blt_add_executable(
5 changes: 5 additions & 0 deletions Intro_Tutorial/lessons/03/README.md
@@ -23,10 +23,15 @@ The Allocator class provides methods for allocating and deallocating memory. You
can view these methods in the Umpire source code documentation here:
https://umpire.readthedocs.io/en/develop/doxygen/html/classumpire_1_1Allocator.html

To use an Umpire allocator, use the following code, replacing "size in bytes" with
the desired size for your allocation:

```
void* memory = allocator.allocate(size in bytes);
```

Don't forget to deallocate your memory afterwards!
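Putting the pieces together, a minimal allocate/deallocate cycle might look like this sketch (assuming the `"HOST"` resource and a 100-element `double` buffer; the names are illustrative):

```cpp
#include "umpire/ResourceManager.hpp"

int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  auto allocator = rm.getAllocator("HOST");

  // allocate() takes a size in bytes: room for 100 doubles here
  void* memory = allocator.allocate(100 * sizeof(double));

  // ... use the memory ...

  allocator.deallocate(memory);  // don't forget!
  return 0;
}
```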

For more details, you can check out the Umpire documentation:
https://umpire.readthedocs.io/en/develop/sphinx/tutorial/allocators.html

7 changes: 2 additions & 5 deletions Intro_Tutorial/lessons/04/README.md
@@ -24,11 +24,11 @@ The lambda expression needs to take one argument, the loop index:
[=](int i) { // loop body }
```

the `[=]` syntax tells the lambda to capture arguments by value (e.g. create a
The `[=]` syntax tells the lambda to capture arguments by value (e.g. create a
copy, rather than a reference).

The `EXEC_POLICY` template argument controls how the loop will be executed. In
this example, we will use the `RAJA::loop_exec` policy to execute this loop on
this example, we will use the `RAJA::seq_exec` policy to execute this loop on
the CPU. In later lessons, we will learn about other policies that allow us to
run code on a GPU.

@@ -46,6 +46,3 @@ Address of data:
data[50] = 50
```




2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/05/CMakeLists.txt
@@ -1,4 +1,4 @@
blt_add_executable(
NAME five
SOURCES five.cpp
DEPENDS_ON RAJA umpire cuda)
DEPENDS_ON RAJA umpire)
7 changes: 5 additions & 2 deletions Intro_Tutorial/lessons/05/README.md
@@ -29,8 +29,11 @@ https://raja.readthedocs.io/en/develop/sphinx/user_guide/feature/policies.html#r
The second parameter, the `TYPE` parameter, is just the data type of the
variable, such as `int`.

In the file `five.cpp`, follow the instruction in the `TODO` comment. Once
you have filled in the correct reduction statement, compile and run:
In the file `five.cpp`, follow the instructions in the `TODO` comments to create
a RAJA reduction using `seq_exec`.
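The general shape of a sequential RAJA reduction is sketched below (assuming the `seq_reduce` policy; the helper and variable names are illustrative, and the lesson's solution file has the exact code):

```cpp
#include "RAJA/RAJA.hpp"

// Sketch: sum the elements of an array with seq_exec / seq_reduce.
double array_sum(const double* a, int N)
{
  RAJA::ReduceSum<RAJA::seq_reduce, double> total(0.0);

  RAJA::forall<RAJA::seq_exec>(RAJA::TypedRangeSegment<int>(0, N),
    [=](int i) {
      total += a[i];  // each iteration contributes to the reduction
    });

  return total.get();  // read the reduced value back out
}
```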


Once you have filled in the correct reduction statement, compile and run:

```
$ make five
6 changes: 5 additions & 1 deletion Intro_Tutorial/lessons/05/five.cpp
@@ -15,14 +15,18 @@ int main()
a = static_cast<double*>(allocator.allocate(N*sizeof(double)));
b = static_cast<double*>(allocator.allocate(N*sizeof(double)));

// TODO: Change this dot variable to instead use a RAJA reduction
// TODO: to calculate and output the dot product of a and b
double dot{0.0};

RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a[i] = 1.0;
b[i] = 1.0;
}
);

// TODO: use a reduction to calculate and output the dotproduct of a and b
std::cout << "dot product is "<< dot << std::endl;

allocator.deallocate(a);
allocator.deallocate(b);
2 changes: 0 additions & 2 deletions Intro_Tutorial/lessons/05/solution/five_solution.cpp
@@ -25,8 +25,6 @@ int main()
}
);

// TODO: use a reduction to calculate and output the dotproduct of a and b

std::cout << "dot product is "<< dot << std::endl;

allocator.deallocate(a);
23 changes: 21 additions & 2 deletions Intro_Tutorial/lessons/06/README.md
@@ -22,8 +22,22 @@ The predefined names can include:
In this example, you can use the "UM" resource so that the data can be accessed
by the CPU or GPU.

There is a `TODO` comment in the `six.cpp` exercise file where you
can modify the code to allocate GPU memory. When you are done, build
You will also find that we are adjusting the `RAJA::forall` to now work on the GPU.
In order for this to happen, we need a few extra things. First, we create a
`CUDA_BLOCK_SIZE` variable to tell RAJA how big we want our CUDA blocks to be.
Since a GPU warp contains 32 threads, a block size that is a multiple of 32,
such as 128, 256, or 512, is a good choice; 256 tends to work well, but the
best value depends on your GPU.

Additionally, the `RAJA::forall` needs the CUDA execution policy. More on GPU
execution policies can be found here: https://raja.readthedocs.io/en/develop/sphinx/user_guide/feature/policies.html#gpu-policies-for-cuda-and-hip

The `cuda_exec` policy takes the CUDA block size we created before as a
template parameter. Finally, the lambda passed to the `RAJA::forall` must be
marked as a device function so that it can execute on the GPU. This can be
done with the `__device__` keyword directly, or portably with the
`RAJA_DEVICE` macro.
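Put together, the pattern looks roughly like this sketch (not the exercise solution; `d_a` stands for an array already allocated in GPU-accessible memory):

```cpp
#include "RAJA/RAJA.hpp"

constexpr std::size_t CUDA_BLOCK_SIZE{256};

// Sketch: set every element of d_a to 1.0 on the GPU.
void fill_on_gpu(double* d_a, int N)
{
  RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(
    RAJA::TypedRangeSegment<int>(0, N),
    [=] RAJA_DEVICE (int i) {   // RAJA_DEVICE marks the lambda for the GPU
      d_a[i] = 1.0;
    });
}
```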

There are several `TODO` comments in the `six.cpp` exercise file where you
can modify the code to work on a GPU. When you are done, build
and run the example:

```
@@ -33,3 +47,8 @@ $ ./bin/six

For more information on Umpire's resources, see our documentation:
https://umpire.readthedocs.io/en/develop/index.html

You can also read more about RAJA foralls and kernels here:
https://raja.readthedocs.io/en/develop/sphinx/user_guide/tutorial/add_vectors.html?highlight=RAJA_DEVICE#basic-loop-execution-vector-addition
and
https://raja.readthedocs.io/en/develop/sphinx/user_guide/tutorial/dot_product.html#raja-variants
21 changes: 15 additions & 6 deletions Intro_Tutorial/lessons/06/six.cpp
@@ -3,10 +3,13 @@
#include "RAJA/RAJA.hpp"
#include "umpire/Umpire.hpp"

#if defined(COMPILE)

int main()
{
constexpr int N{10000};
constexpr std::size_t CUDA_BLOCK_SIZE{256};
//TODO: Set up a block size value
constexpr std::size_t CUDA_BLOCK_SIZE{????};
double* a{nullptr};
double* b{nullptr};

@@ -17,18 +20,22 @@ int main()
a = static_cast<double*>(allocator.allocate(N*sizeof(double)));
b = static_cast<double*>(allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(
RAJA::TypedRangeSegment<int>(0, N), [=] RAJA_DEVICE (int i) {
//TODO: fill in the forall statement with the CUDA execution policy
//TODO: and its block size argument. Then be sure to use RAJA_DEVICE
RAJA::forall<????? <?????> >(
RAJA::TypedRangeSegment<int>(0, N), [=] ?????? (int i) {
a[i] = 1.0;
b[i] = 1.0;
}
);

double dot{0.0};
RAJA::ReduceSum<RAJA::cuda_reduce, double> cudot(0.0);
//TODO: create a RAJA::ReduceSum with cuda_reduce called "cudot" for the GPU

RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::TypedRangeSegment<int>(0, N),
[=] RAJA_DEVICE (int i) {
//TODO: fill in the forall statement with the CUDA execution policy
//TODO: and its block size argument. Then be sure to use RAJA_DEVICE
RAJA::forall<?????<????>>(RAJA::TypedRangeSegment<int>(0, N),
[=] ???? (int i) {
cudot += a[i] * b[i];
});

@@ -38,4 +45,6 @@ int main()

allocator.deallocate(a);
allocator.deallocate(b);
#endif
return 0;
}
@@ -11,7 +11,6 @@ int main()
double* b{nullptr};

auto& rm = umpire::ResourceManager::getInstance();
// TODO: allocate with device unified memory
auto allocator = rm.getAllocator("UM");

a = static_cast<double*>(allocator.allocate(N*sizeof(double)));
@@ -38,4 +37,6 @@ int main()

allocator.deallocate(a);
allocator.deallocate(b);

return 0;
}
14 changes: 14 additions & 0 deletions Intro_Tutorial/lessons/07/README.md
@@ -3,6 +3,20 @@
In this lesson, you will learn how to use Umpire's operations to copy data
between CPU and GPU memory in a portable way.

In `seven.cpp`, we create an allocator for the GPU with:
```
auto allocator = rm.getAllocator("DEVICE");
```

and a separate allocator on the CPU with:

```
auto host_allocator = rm.getAllocator("HOST");
```

We will initialize the data on the CPU, but we want to do the computation on
the GPU, so we need some of Umpire's "Operations" to move data between the two.
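For example, a host-to-device copy with the `ResourceManager` might look like this sketch (illustrative names; `rm.copy` looks up each pointer's allocator, so no copy direction needs to be given):

```cpp
#include "umpire/ResourceManager.hpp"

int main()
{
  constexpr int N{100};
  auto& rm = umpire::ResourceManager::getInstance();

  auto host_allocator = rm.getAllocator("HOST");
  auto allocator      = rm.getAllocator("DEVICE");

  double* a_h = static_cast<double*>(host_allocator.allocate(N * sizeof(double)));
  double* a_d = static_cast<double*>(allocator.allocate(N * sizeof(double)));

  // ... initialize a_h on the CPU ...

  rm.copy(a_d, a_h);  // destination first, then source

  host_allocator.deallocate(a_h);
  allocator.deallocate(a_d);
  return 0;
}
```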

Umpire provides a number of operations implemented as methods on the
`ResourceManager`. These typically take pointer and size arguments, but you do
not need to tell Umpire which Allocator each pointer came from. Umpire keeps
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/07/seven.cpp
@@ -21,7 +21,7 @@ int main()
a_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
@@ -21,7 +21,7 @@ int main()
a_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double*>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/08/README.md
@@ -23,7 +23,7 @@ This newly created `pool` is an `umpire::Allocator` using the `QuickPool` strate
in: (1) the name we would like the pool to have, and (2) the allocator we previously created with the `ResourceManager` (see line 17 in the
file `eight.cpp`).

There are other arguments that could be passed to the pool constructor if needed. See the documentation page for more: https://umpire.readthedocs.io/en/develop/doxygen/html/index.html
There are other arguments that can be passed to the pool constructor if needed. These additional optional arguments are more advanced and beyond the scope of this tutorial, but you can visit the documentation page for more: https://umpire.readthedocs.io/en/develop/doxygen/html/index.html
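With only the two required arguments, creating and using the pool might look like this sketch (the pool name and sizes are illustrative):

```cpp
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"

int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  auto allocator = rm.getAllocator("HOST");

  // (1) the pool's name, (2) the underlying allocator it draws from
  auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("POOL", allocator);

  // The pool is itself an umpire::Allocator.
  void* data = pool.allocate(1024);
  pool.deallocate(data);
  return 0;
}
```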

When you have created your QuickPool allocator, uncomment the COMPILE define on line 7;
then compile and run the code:
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/08/eight.cpp
@@ -29,7 +29,7 @@ int main()
a_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
2 changes: 1 addition & 1 deletion Intro_Tutorial/lessons/08/solution/eight_solution.cpp
@@ -26,7 +26,7 @@ int main()
a_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));
b_h = static_cast<double *>(host_allocator.allocate(N*sizeof(double)));

RAJA::forall< RAJA::loop_exec >(
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
a_h[i] = 1.0;
b_h[i] = 1.0;
4 changes: 3 additions & 1 deletion Intro_Tutorial/lessons/09/README.md
@@ -34,7 +34,9 @@ where `data` is a `double*`, and `N` is the size of each dimension. The size of
`data` should be at least `N*N`.

In the file `nine.cpp`, there is a `TODO` comment where you should create three
views, A, B, and C. When you are ready, uncomment the COMPILE define on line 7;
views, A, B, and C. Notice that we are performing the same dot product
calculation as before, but now for each row-column pair of two matrices; in
other words, a matrix multiplication. When you are ready, uncomment the COMPILE define on line 7;
then you can compile and run the code:

```
8 changes: 4 additions & 4 deletions Intro_Tutorial/lessons/09/nine.cpp
@@ -30,15 +30,15 @@ int main()
// TODO: Create a view for A, B, and C
constexpr int DIM = 2;

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
A(row, col) = row;
B(row, col) = col;
});
});

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += A(row, k) * B(k, col);
8 changes: 4 additions & 4 deletions Intro_Tutorial/lessons/09/solution/nine_solution.cpp
@@ -29,15 +29,15 @@ int main()
RAJA::View<double, RAJA::Layout<DIM>> B(b, N, N);
RAJA::View<double, RAJA::Layout<DIM>> C(c, N, N);

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
A(row, col) = row;
B(row, col) = col;
});
});

RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
RAJA::forall<RAJA::seq_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::seq_exec>( col_range, [=](int col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += A(row, k) * B(k, col);
45 changes: 45 additions & 0 deletions Intro_Tutorial/lessons/10/README.md
@@ -6,6 +6,47 @@ The previous lesson used multiple `RAJA::forall` calls, nested inside each
other, to implement a matrix multiplication. This pattern will work when
executing on the CPU, but not on a GPU. It is also less efficient.

That's why RAJA provides the `RAJA::kernel` functionality. We can create a
`RAJA::KernelPolicy` to describe the layout of our nested loops. For example,
this triply nested loop on the CPU:

```
for (int k = kmin; k < kmax; ++k) {
for (int j = jmin; j < jmax; ++j) {
for (int i = imin; i < imax; ++i) {
printf( " (%d, %d, %d) \n", i, j, k);
}
}
}
```

will require a kernel policy and kernel like this:

```
using KJI_EXECPOL = RAJA::KernelPolicy<
RAJA::statement::For<2, RAJA::seq_exec, // k
RAJA::statement::For<1, RAJA::seq_exec, // j
RAJA::statement::For<0, RAJA::seq_exec, // i
RAJA::statement::Lambda<0>
>
>
>
>;

RAJA::kernel<KJI_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
[=] (int i, int j, int k) {
printf( " (%d, %d, %d) \n", i, j, k);
});
```

where `IRange`, `JRange`, and `KRange` are simply defined as:

```
RAJA::TypedRangeSegment<int> KRange(0, kmax);
RAJA::TypedRangeSegment<int> JRange(0, jmax);
RAJA::TypedRangeSegment<int> IRange(0, imax);
```

Take a look at the RAJA documentation for a detailed explanation of the
`RAJA::kernel` method:
https://raja.readthedocs.io/en/develop/sphinx/user_guide/tutorial/kernel_nested_loop_reorder.html
@@ -19,6 +60,10 @@ If you are stuck, you can reference the matrix-multiply example in the RAJA
repository:
https://github.com/LLNL/RAJA/blob/develop/examples/tut_matrix-multiply.cpp

Keep in mind that this matrix multiplication lesson builds upon the previous
dot product lessons: each entry of the result matrix is the dot product of a
row of the first matrix with a column of the second. The `RAJA::View` objects
help us see this connection better.

When you have finished making your changes, uncomment the COMPILE define on line 7;
then compile and run the code:

6 changes: 3 additions & 3 deletions Intro_Tutorial/lessons/10/solution/ten_solution.cpp
@@ -17,7 +17,7 @@ int main()
auto& rm = umpire::ResourceManager::getInstance();

auto allocator = rm.getAllocator("HOST");
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("POOL", allocator);
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("myPOOL", allocator);

a = static_cast<double *>(pool.allocate(N*N*sizeof(double)));
b = static_cast<double *>(pool.allocate(N*N*sizeof(double)));
@@ -34,8 +34,8 @@ int main()
// TODO: initialization loop
using EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::loop_exec, // row
RAJA::statement::For<0, RAJA::loop_exec, // col
RAJA::statement::For<1, RAJA::seq_exec, // row
RAJA::statement::For<0, RAJA::seq_exec, // col
RAJA::statement::Lambda<0>
>
>