chore: Added 3.11.0 release docs #288

Open
wants to merge 2 commits into
base: master
@@ -0,0 +1,28 @@
---
id: architecture-summary
title: Architecture summary
sidebar_label: Architecture summary
---

---

<img src={require("../assets/architecture-summary.png").default} alt="Architecture Overview" />

The Litmus architecture can be divided into two parts:

1. **Control plane:** Contains the components required for the functioning of ChaosCenter, the web-based portal for Litmus.

2. **Execution plane:** Contains the components required for the injection of chaos in the target resources.

- The control plane is used to create and schedule chaos experiments. A chaos experiment is a set of chaos faults arranged in a defined sequence to achieve the desired chaos impact on the target resources upon execution. Users can log in to ChaosCenter using the web UI or the APIs to define a chaos experiment and assess the resilience of target workloads.

- Once the user creates a chaos experiment using ChaosCenter, it is passed on to the execution plane. The execution plane can run either in the same cluster as ChaosCenter, if a self chaos infrastructure is being used, or in a remote cluster, if an external chaos infrastructure is being used. The execution plane interprets the chaos experiment as a list of actions that inject chaos into the target workloads. It ensures efficient orchestration of chaos in various cloud-native environments using Kubernetes custom resources.

- Once the chaos experiment is executed, the execution plane sends the chaos results to the control plane for post-processing, using either the built-in monitoring dashboard of Litmus or external observability tools such as Prometheus DB and a Grafana dashboard. Litmus can also run chaos experiments automatically as part of a CI/CD pipeline, based on a set of defined conditions, using GitOps.

:::note
With the latest release of LitmusChaos 3.0.0:
- The term **Chaos Delegate/Agent** has been changed to **Chaos Infrastructure**.
- The term **Chaos Experiment** has been changed to **Chaos Fault**.
- The term **Chaos Scenario/Workflow** has been changed to **Chaos Experiment**.
:::
@@ -0,0 +1,40 @@
---
id: chaos-control-plane
title: Chaos control plane
sidebar_label: Chaos control plane
---

---

<img src={require("../assets/chaos-control-plane.png").default} alt="Chaos Control Plane" />

The chaos control plane consists of micro-services responsible for the functioning of ChaosCenter, the web-based portal that can be used to interact with Litmus as an alternative to the CLI. The control plane facilitates the creation and scheduling of chaos experiments, system observability during chaos, and post-processing and analysis of fault results.

## Chaos control plane components

- **Authentication Server:** A Golang micro-service responsible for authenticating and authorizing the requests received from ChaosCenter and for managing users along with their projects. It primarily handles user creation, user login, password resets, user information updates, project creation, and project-related operations.

- **Backend Server:** A GraphQL based Golang micro-service that serves the requests received from ChaosCenter, by either querying the database for the relevant information or by fetching information from the Execution Plane.

- **Database:** A NoSQL MongoDB database micro-service that stores users' information, past chaos experiments, saved chaos experiment templates, user projects, ChaosHubs, and GitOps details, among other information.

- **ChaosCenter:** Refers to the interfaces used by Litmus for creation and scheduling of chaos experiments, system observability during chaos injection, and post-chaos result analysis. It includes:

  - **Web UI:** A React.js based frontend application micro-service with built-in system observability capabilities and an analytics dashboard. It also lets teams of users collaborate on chaos experiments using role-based user accounts.

  - **Litmusctl:** A command-line tool for managing Litmus Chaos Infrastructure components. It can be used to create chaos infrastructures and projects, and to manage multiple Litmus accounts.

  - **Litmus API:** Refers to two different Litmus APIs, namely the Litmus Authentication API and the Litmus Portal API:

    - **Litmus Authentication API:** Used to authenticate the identity of a user and to perform several user and project specific tasks, such as creating new users, updating profiles, updating passwords, creating projects, inviting users to a project, and getting project details. It uses the Authentication Server to perform these tasks.

    - **Litmus Portal API:** Provides the command-line and UI experience for managing and monitoring the events around chaos experiments. It uses the Backend Server to perform its functions.

## Standard Chaos Control Plane Flow

1. The user logs in to ChaosCenter with valid login credentials. A default project is created for the user on initial login. Every user is part of a project and has a role assigned to them. To schedule a chaos experiment, the user needs to have the Owner role in the project.
2. The user uploads a Chaos Experiment manifest using ChaosCenter, which is received by the Backend Server.
3. The Backend Server stores the manifest in the Database and also sends it to the Chaos Infrastructure.
4. The Chaos Infrastructure uses the Chaos Experiment manifest to inject chaos into the target resources. The steps of the Chaos Experiment execution can be visualized using ChaosCenter.
5. The Chaos Infrastructure returns the results of the chaos faults that were part of the chaos experiment to the Backend Server, along with the fault logs.
6. The Backend Server then sends the chaos fault results and logs to ChaosCenter. It also stores the results in the Database for generating post-chaos experiment statistics and information.
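
The flow above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `ControlPlaneSketch`, `submit_experiment`, and `receive_results` are hypothetical names invented here, the dictionary stands in for the Database, and nothing below is part of the actual Litmus Backend Server API.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlaneSketch:
    """Toy stand-in for the Backend Server (hypothetical, not the Litmus API)."""

    database: dict = field(default_factory=dict)        # stands in for the Database
    sent_to_infra: list = field(default_factory=list)   # manifests forwarded to the Chaos Infrastructure

    def submit_experiment(self, role: str, name: str, manifest: str) -> None:
        # Step 1: scheduling requires the Owner role in the project.
        if role != "Owner":
            raise PermissionError(f"role {role!r} cannot schedule experiments")
        # Steps 2-3: store the manifest, then forward it to the infrastructure.
        self.database[name] = {"manifest": manifest, "results": []}
        self.sent_to_infra.append(name)

    def receive_results(self, name: str, fault_results: list) -> list:
        # Steps 5-6: persist the fault results and return them for display.
        self.database[name]["results"].extend(fault_results)
        return self.database[name]["results"]
```

Keeping the role check ahead of any write mirrors the order of the steps: a request that is not allowed to schedule never reaches the Database or the Chaos Infrastructure.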
@@ -0,0 +1,71 @@
---
id: chaos-execution-plane
title: Chaos execution plane
sidebar_label: Chaos execution plane
---

---

<img src={require("../assets/chaos-execution-plane.png").default} alt="Chaos Execution Plane" />

The Chaos Execution Plane contains the components responsible for orchestrating chaos injection in the target resources. They are installed either in an external target cluster, if an external chaos infrastructure is being used, or in the host cluster containing the control plane, if a self chaos infrastructure is being used. The execution plane can be further divided into Litmus Chaos Infrastructure components and Litmus Backend Execution Infrastructure components.

## Litmus Execution Plane Components

Litmus Chaos Infrastructure components help facilitate the chaos injection, manage chaos observability, and enable chaos automation for target resources. These components include:

1. **Workflow Controller:** The Argo Workflow Controller, responsible for creating Chaos Experiments from the Chaos Experiment CR.

2. **Subscriber:** Serves as the link between the Chaos Execution Plane and the Control Plane. Its responsibilities include performing health checks on all the components in the Chaos Execution Plane, creating a Chaos Experiment CR from a Chaos Experiment template, watching for Chaos Experiment events during execution, and sending the chaos experiment result to the Control Plane.

3. **Event Tracker:** An optional component that can trigger automated chaos experiment runs based on a set of defined conditions for any given resources in the cluster. It is a controller that manages the EventTrackerPolicy CR, which is the set of defined conditions that the Event Tracker validates. If the current state of the tracked resources matches the state defined in the EventTrackerPolicy CR, the chaos experiment run is triggered. This feature can only be used if GitOps is enabled.

4. **Chaos Exporter:** An optional component that facilitates external observability in Litmus by exporting the chaos metrics generated during chaos injection as time-series data to the Prometheus DB for processing and analysis.
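
The Event Tracker's matching rule described above can be sketched as follows. This is a deliberately simplified illustration: the policy is modeled as a flat dictionary of expected values compared by equality, and `policy_matches` and `maybe_trigger` are hypothetical helper names. The real EventTrackerPolicy CR schema is not reproduced here.

```python
def policy_matches(policy: dict, resource_state: dict) -> bool:
    """Return True when every condition in the policy holds for the resource."""
    return all(
        resource_state.get(key) == expected for key, expected in policy.items()
    )


def maybe_trigger(policy: dict, resource_state: dict, gitops_enabled: bool) -> bool:
    """Decide whether a chaos experiment run should be triggered."""
    # The feature is only usable when GitOps is enabled.
    return gitops_enabled and policy_matches(policy, resource_state)
```

For example, a policy of `{"image": "nginx:1.25"}` would match a tracked resource whose current state includes that image value, and would not trigger anything while GitOps is disabled.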

Litmus Backend Execution Infrastructure components orchestrate the execution of Chaos Experiments on target resources. These components include:

1. **Chaos Experiment CR:** Refers to the Argo Workflow CR, which describes the steps that are executed as part of the chaos experiment. It can be used to define failures under a certain workload condition (for example, a given percentage load), multiple (parallel) failures of dependent and independent services, and so on.

2. **ChaosExperiment CR:** Used for defining the low-level execution information for any Litmus chaos fault as well as to store the various fault tunables.

3. **ChaosEngine CR:** Used to hold information about how the chaos faults are executed. It connects an application instance with one or more chaos faults while allowing the users to specify run-level details.

4. **Chaos Operator:** A Kubernetes custom controller that manages the lifecycle of certain resources or applications with the intent of validating their "desired state". It helps reconcile the state of the ChaosEngine by performing specific actions upon CRUD of the ChaosEngine. It also defines a secondary resource (the ChaosEngine Runner pod), which it creates and manages to implement the reconcile functions.

<div style={{textAlign: 'center'}}>
<img src={require("../assets/chaos-execution-plane-chaos-operator.png").default} alt="Chaos Operator" />
</div>

5. **ChaosResult CR:** Holds the results of a chaos fault, such as the ChaosEngine reference, the fault state, the verdict of the fault (on completion), and salient application/result attributes. It also acts as a source for metrics collection for observability.

6. **Chaos Runner:** Acts as a bridge between the Chaos Operator and Chaos Faults. It is a lifecycle manager for the chaos faults that creates Fault Jobs for the execution of fault business logic and monitors the fault pods (jobs) until completion.

<div style={{textAlign: 'center'}}>
<img src={require("../assets/chaos-execution-plane-chaos-runner.png").default} alt="Chaos Runner" />
</div>

7. **Fault Jobs:** Refers to the pods that execute the fault logic. One fault pod is created per chaos fault in the chaos experiment.

## Standard Chaos Execution Plane Flow

1. Subscriber receives the Chaos Experiment manifest from the Control Plane and applies the manifest to create a Chaos Experiment CR.
2. Chaos Experiment CRs are tracked by the Argo Workflow Controller. When the Workflow Controller finds a new Chaos Experiment CR, it creates the ChaosExperiment (Chaos Fault) CRs and the ChaosEngine CRs for the chaos faults that are part of the chaos experiment.
3. ChaosEngine CRs are tracked by the Chaos Operator. Once a ChaosEngine CR is ready, the Chaos Operator updates the ChaosEngine state to reflect that the particular ChaosEngine is now being executed.
4. For each ChaosEngine resource, a Chaos Runner is created by the Chaos Operator.
5. The Chaos Runner first reads the chaos parameters from the ChaosExperiment (Chaos Fault) CR and overrides them with values from the ChaosEngine CR. It then constructs the Fault Jobs and monitors them until completion.
6. Fault Jobs execute the fault business logic and undertake chaos injection on target resources. Once done, the ChaosResult is updated with the fault verdict.
7. Chaos Runner then fetches the updated ChaosResult and updates the ChaosEngine status as well as the verdict.
8. Once the ChaosEngine is updated, Subscriber fetches the ChaosEngine details and the ChaosResult and forwards them to Chaos Control Plane.
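
The override behaviour in step 5 amounts to a dictionary merge in which ChaosEngine values take precedence. The sketch below assumes tunables can be modeled as flat key/value pairs. `resolve_tunables` is a hypothetical helper and the key names in the usage note are illustrative; this is not the Chaos Runner's actual code.

```python
def resolve_tunables(fault_defaults: dict, engine_overrides: dict) -> dict:
    """Merge defaults from the ChaosExperiment CR with ChaosEngine overrides."""
    resolved = dict(fault_defaults)    # copy so the defaults stay untouched
    resolved.update(engine_overrides)  # ChaosEngine values win on conflicts
    return resolved
```

With defaults of `{"TOTAL_CHAOS_DURATION": "15", "CHAOS_INTERVAL": "5"}` and an override of `{"TOTAL_CHAOS_DURATION": "60"}`, the resolved set keeps the default interval and uses the overridden duration.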

It is worth noting that:

- If configured, the Chaos Exporter fetches data from the ChaosResult CR and converts it into a time-series format to be consumed by the Prometheus DB.

- An Event Tracker Policy can also be set up as part of the Backend GitOps, where the Backend GitOps Controller tracks a set of specified resources in the target cluster for any change. If any of the tracked resources undergoes a change and its resulting state matches the state defined in the Event Tracker Policy, a pre-defined Chaos Experiment is executed.
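
To make the exporter's conversion concrete, here is a minimal sketch that renders a ChaosResult-like dictionary as one sample in the Prometheus text exposition format. The metric name `chaos_fault_passed`, the label set, and the function `to_prometheus_line` are assumptions made for illustration; they are not the metrics the Chaos Exporter actually emits.

```python
def to_prometheus_line(result: dict, timestamp_ms: int) -> str:
    """Render one ChaosResult-like dict as a Prometheus text-format sample."""
    # Labels identify which engine and fault the sample belongs to.
    labels = ",".join(
        f'{key}="{result[key]}"' for key in ("engine", "fault", "verdict")
    )
    # Encode the verdict as a gauge value: 1 for Pass, 0 for anything else.
    value = 1 if result["verdict"] == "Pass" else 0
    return f"chaos_fault_passed{{{labels}}} {value} {timestamp_ms}"
```

Each call yields a single `name{labels} value timestamp` line, which is the shape a Prometheus scrape expects for a timestamped sample.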

:::note
With the latest release of LitmusChaos 3.0.0:
- The term **Chaos Delegate/Agent** has been changed to **Chaos Infrastructure**.
- The term **Chaos Experiment** has been changed to **Chaos Fault**.
- The term **Chaos Scenario/Workflow** has been changed to **Chaos Experiment**.
:::
@@ -0,0 +1,37 @@
---
id: chaos-fault-flow
title: Chaos fault flow
sidebar_label: Chaos fault flow
---

---

<img src={require("../assets/experiment-flow.png").default} alt="Chaos Fault Flow" />

Fault execution is triggered upon the creation of a ChaosEngine resource. The ChaosEngine resource interacts with the Chaos Runner, which is created by the Chaos Operator. The Chaos Runner creates Fault Jobs that execute the fault business logic. Typically, ChaosEngines are embedded within the 'steps' of a Litmus Chaos Experiment. However, ChaosEngines can also be created and applied manually, in which case the Chaos Operator reconciles the resource and triggers the fault execution. Chaos faults are classified as:

- Kubernetes Faults
- Pod-Level Chaos
- Node-Level Chaos
- Application Chaos
- Cloud Infrastructure

## Chaos Fault Flow Steps

1. Chaos fault execution gets triggered by the Fault Job.
2. Fault tunables and low-level execution details are fetched.
3. ChaosResult is initialized and its verdict is set to "Awaited" to indicate that the fault is currently running.
4. The steady-state condition for the respective fault is validated. If the condition does not hold, the fault execution is stopped and the ChaosResult verdict is set to "Fail".
5. Once the steady-state condition is validated, fault resources are created to facilitate the chaos injection.
6. Chaos injection is performed on the target resources for the specified chaos duration.
7. Chaos injection gets reverted.
8. Post chaos status-check is performed to ensure that the steady-state is still maintained.
9. If the check fails, the ChaosEngine and ChaosResult verdicts are set to "Fail"; otherwise they are set to "Pass".
10. Fault execution ends.
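
The steps above can be read as a small state machine over the ChaosResult verdict. The sketch below is illustrative only: `run_fault` and its callback parameters are hypothetical names, and reverting inside a `finally` block is a design choice made here rather than something the source specifies.

```python
from typing import Callable


def run_fault(
    pre_check: Callable[[], bool],
    inject: Callable[[], None],
    revert: Callable[[], None],
    post_check: Callable[[], bool],
) -> list:
    """Return the sequence of verdicts a ChaosResult would pass through."""
    verdicts = ["Awaited"]        # step 3: the fault is running
    if not pre_check():           # step 4: steady state must hold before chaos
        verdicts.append("Fail")
        return verdicts
    try:
        inject()                  # steps 5-6: create resources and inject chaos
    finally:
        revert()                  # step 7: revert the injection even on error
    # Steps 8-9: the post-chaos check decides the final verdict.
    verdicts.append("Pass" if post_check() else "Fail")
    return verdicts
```

A run whose checks both succeed yields `["Awaited", "Pass"]`, while a failed pre-check yields `["Awaited", "Fail"]` without ever calling `inject`.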

:::note
With the latest release of LitmusChaos 3.0.0:
- The term **Chaos Delegate/Agent** has been changed to **Chaos Infrastructure**.
- The term **Chaos Experiment** has been changed to **Chaos Fault**.
- The term **Chaos Scenario/Workflow** has been changed to **Chaos Experiment**.
:::
25 changes: 25 additions & 0 deletions website/versioned_docs/version-3.11.0/architecture/overview.md
@@ -0,0 +1,25 @@
---
id: overview
title: Overview
sidebar_label: Overview
---

---

The Architecture section contains the component overview, sequence diagrams, and a description of the flow of information through the Litmus architecture.

### [Architecture Summary](architecture-summary.md)

A high-level overview of the entire Litmus architecture that highlights the flow of information through the various components.

### [Control Plane](chaos-control-plane.md)

Consists of micro-services responsible for the functioning of ChaosCenter, the web-based portal used for creating, scheduling, and monitoring chaos experiments.

### [Execution Plane](chaos-execution-plane.md)

Contains the components required for the orchestration of chaos injection in the target resources.

### [Chaos Fault Flow](chaos-experiment-flow.md)

Flow of information during the execution of Litmus chaos experiments, grouped into categories such as pod-level, node-level, application-level, and public-cloud.