[ec2-terminate-by-tag] Handle interval 0/1 case #483

Open
yogeek opened this issue Feb 7, 2022 · 1 comment
Labels: 2.15.0 (Issues to be considered for this release), bug (Something isn't working)

yogeek commented Feb 7, 2022

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

What happened:

In the case of ec2-terminate-by-tag with MANAGED_SUBGROUP=enable (when the EC2 instances are managed by an ASG), there is an issue when trying to execute the chaos only once.
In general, setting CHAOS_INTERVAL=TOTAL_CHAOS_DURATION is the way to get a single execution.
But if we set CHAOS_INTERVAL<TOTAL_CHAOS_DURATION, the chaos fails because of the following behavior:

it seems that the code loops during the whole CHAOS_DURATION

for duration < experimentsDetails.ChaosDuration {

and inside it, it loops over the instanceIDList, so it can try to stop the same instance multiple times during the chaos duration. In the case of MANAGED_SUBGROUP=disable, the instance is "stopped" instead of terminated, so it will stop/start/stop... the same instance without any issue. But in the case of MANAGED_SUBGROUP=enable, the instance is "terminated", which causes an issue: since the instance has not been removed from the instanceIDList, it no longer exists in the next iteration and therefore cannot be stopped...
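To make the failure mode concrete, here is a minimal, runnable Go sketch of the loop shape described above (the experimentDetails struct, the stopInstance helper, and the terminated map are illustrative assumptions, not the actual litmus-go code):

```go
package main

import (
	"fmt"
	"time"
)

// experimentDetails mirrors the env vars involved (names assumed).
type experimentDetails struct {
	ChaosDuration int // TOTAL_CHAOS_DURATION, in seconds
	ChaosInterval int // CHAOS_INTERVAL, in seconds
}

// stopInstance stands in for the AWS StopInstances call. With
// MANAGED_SUBGROUP=enable the ASG terminates the stopped instance,
// so a second stop attempt fails with IncorrectInstanceState.
func stopInstance(id string, terminated map[string]bool) error {
	if terminated[id] {
		return fmt.Errorf("IncorrectInstanceState: instance %s is not in a state from which it can be stopped", id)
	}
	terminated[id] = true // the ASG eventually terminates it
	return nil
}

func main() {
	exp := experimentDetails{ChaosDuration: 500, ChaosInterval: 0}
	instanceIDList := []string{"i-0fd0da669ea93c044"}
	terminated := map[string]bool{}

	start := time.Now()
	for duration := 0; duration < exp.ChaosDuration; {
		// instanceIDList is never shrunk, so the second pass hits an
		// instance that no longer exists and the experiment fails.
		for _, id := range instanceIDList {
			if err := stopInstance(id, terminated); err != nil {
				fmt.Println("experiment failed:", err)
				return
			}
		}
		time.Sleep(time.Duration(exp.ChaosInterval) * time.Second)
		duration = int(time.Since(start).Seconds())
	}
	fmt.Println("experiment completed")
}
```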

The only way to get a success is to set CHAOS_INTERVAL=TOTAL_CHAOS_DURATION, but then we have to wait the full CHAOS_INTERVAL for nothing at the end of the first (and only) chaos iteration.

=> the case when the interval is 0/1 should be handled

The details are explained in this Slack discussion: https://kubernetes.slack.com/archives/CNXNB0ZTN/p1643826054494339?thread_ts=1643739932.025119&cid=CNXNB0ZTN

What you expected to happen:

In the case of MANAGED_SUBGROUP=enable, the instance has to be removed from the instanceIDList to avoid trying to stop it again in subsequent iterations.
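A possible shape for that fix, as a minimal sketch only (removeInstance is a hypothetical helper name, not the actual litmus-go implementation):

```go
// removeInstance returns a copy of the list without the given instance ID,
// so an instance that the ASG has already terminated is not targeted again.
func removeInstance(list []string, target string) []string {
	out := make([]string, 0, len(list))
	for _, id := range list {
		if id != target {
			out = append(out, id)
		}
	}
	return out
}
```

After each chaos round under MANAGED_SUBGROUP=enable, the experiment could call instanceIDList = removeInstance(instanceIDList, id) for every terminated instance, and stop looping once the list is empty.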

How to reproduce it (as minimally and precisely as possible):

  • tag an instance with chaos=allowed
  • launch the experiment with: MANAGED_SUBGROUP=enable, TOTAL_CHAOS_DURATION=500s (a sufficient time to allow the ASG to terminate the stopped instance), CHAOS_INTERVAL=0 (or any value < TOTAL_CHAOS_DURATION), and INSTANCE_TAG='chaos:allowed'
  • the instance is stopped; after several minutes, the instance is terminated by the ASG
  • the code waits CHAOS_INTERVAL => 0 seconds
  • the instance is still in the list of instances to stop => the experiment fails with err: ec2 instance failed to stop, err: IncorrectInstanceState: This instance 'i-0fd0da669ea93c044' is not in a state from which it can be stopped.

The only way not to fail is to set CHAOS_INTERVAL=TOTAL_CHAOS_DURATION, but then:

  • the instance is stopped; after several minutes, the instance is terminated by the ASG
  • the code waits the full CHAOS_INTERVAL (so 400s for nothing)
  • the experiment succeeds (but with a useless waiting period of CHAOS_INTERVAL)
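For the interval handling itself, one possible sketch is to replace the unconditional sleep at the bottom of the chaos loop (see the sketch in the first comment, same assumed names) with a check that skips the wait when no further round will run, so a single-shot run exits right after the injection:

```go
// End of the chaos loop body: sleep between rounds only if there is
// still time left for another round; otherwise stop immediately.
remaining := exp.ChaosDuration - int(time.Since(start).Seconds())
if exp.ChaosInterval >= remaining {
	break // last (or only) round: no point waiting out the interval
}
time.Sleep(time.Duration(exp.ChaosInterval) * time.Second)
```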

Anything else we need to know?:

@ksatchit and @uditgaurav already agreed on this missing behavior; thanks to them for their support in understanding this issue 👍

yogeek commented Feb 8, 2022

Additional info: the above behavior also causes an issue if I tag more than one instance with chaos: allowed.
Indeed, if I tag 2 instances, 2 targeted instances are detected: the first is stopped, then terminated, and then, as its instanceId is still in the list, the code tries to stop the 1st instance again, which is already terminated, and it loops over this error...
The workflow never ends, I have to delete it manually, and of course the 2nd instance is never stopped.

@ispeakc0de added the bug (Something isn't working) and 2.15.0 (Issues to be considered for this release) labels on Oct 12, 2022