[DOCS] Master Cluster Tutorial - Suggested HAProxy Timeout Settings Cause Unstable Communication #66888

Open
clayoster opened this issue Sep 14, 2024 · 5 comments
Labels
Documentation Relates to Salt documentation needs-triage

Comments

@clayoster
Contributor

Description
I have been testing the suggested HAProxy configuration from the Master Cluster tutorial and have found that the suggested client/server timeout values of 1m cause unstable minion communication, specifically with the publisher port (4505).

I am using the default transport with 3 masters and 50 minions all running 3007.1. My HAProxy version is 2.6.12-1 (Debian 12), though I have tested older and newer versions with the same results.

Adjusting TCP keepalive values on the masters and minions does not seem to stop HAProxy from closing TCP sessions after 1 minute of inactivity. Reducing tcp_keepalive_idle does, however, help minions reconnect faster after HAProxy closes the connection.

It seems that no matter how frequently the master and minions send keepalives, HAProxy will close the sessions after 1 minute if no data is sent through the session. If I run something like salt '*' test.ping every 30 seconds, this keeps the sessions with the publisher port alive for longer than 1 minute.
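A possible explanation for this, offered as an assumption rather than something confirmed in this thread: TCP keepalive probes are sent and answered by the kernel and carry no application payload, so a proxy that measures inactivity by forwarded data never sees them. The sketch below only illustrates where such keepalive settings live (on the socket itself). The helper name `enable_keepalive` and the values used are hypothetical and are not Salt's implementation.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Turn on kernel-level TCP keepalive for a socket (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options below are platform specific (available on Linux),
    # so each one is only applied when the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
# The probes configured here are empty segments handled by the kernel;
# no bytes are ever handed to the application (or to a proxy's data path).
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

If that assumption holds, only real traffic such as the periodic test.ping described above would reset the proxy's inactivity timer.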

To Reproduce:

  • On the HAProxy server, run watch "netstat -nalpt | grep 4505" and watch for TCP sessions to switch from "ESTABLISHED" to "TIME_WAIT". This should happen within a minute. Run salt '*' test.ping from the master while sessions are in this state, and minions will fail to respond because they never saw the event published by the master.
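For anyone who would rather tally the states than eyeball the watch output, here is a minimal sketch that counts TCP states for port 4505 from netstat-style text. The function name `count_states` and the sample lines are made up for illustration, and the column layout is assumed to match Linux `netstat -nalpt` output.

```python
from collections import Counter


def count_states(netstat_output, port=4505):
    """Count TCP states for connections that involve the given port."""
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local foreign state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or foreign.endswith(suffix):
            counts[state] += 1
    return counts


# Fabricated sample lines, only to show the expected input shape.
sample = """\
tcp        0      0 10.0.0.5:4505           10.0.0.21:51234         ESTABLISHED 1234/haproxy
tcp        0      0 10.0.0.5:4505           10.0.0.22:40112         TIME_WAIT   -
tcp        0      0 10.0.0.5:4506           10.0.0.21:51300         ESTABLISHED 1234/haproxy
"""
print(dict(count_states(sample)))
```

A growing TIME_WAIT count for port 4505 roughly one minute after the last publish would match the behaviour described in this issue.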

I am currently using timeout values of 12h on the publisher and request server ports to reduce how often TCP sessions are killed off. While this probably isn't the best solution, it does keep minion communication stable, because minions have to re-establish their connection with the master far less often.

Suggested Fix
Is there other configuration that should be set on the master and minion to allow stable minion communication with the suggested 1 minute timeouts in HAProxy?

Type of documentation
Tutorial

Location or format of documentation
https://docs.saltproject.io/en/latest/topics/tutorials/master-cluster.html

@clayoster clayoster added Documentation Relates to Salt documentation needs-triage labels Sep 14, 2024
@dwoz
Contributor

dwoz commented Sep 14, 2024

Yes, the timeouts for the publish port should match the 'publish_session'. The docs need to be updated. This has been in my backlog for some time.
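As a rough illustration of what "matching" could look like, the sketch below converts a publish_session value into an HAProxy-style duration for the client and server timeouts. It assumes publish_session is expressed in seconds and uses the 24-hour default mentioned in this thread; the helper name `haproxy_timeout` is hypothetical and is not part of Salt or HAProxy.

```python
def haproxy_timeout(seconds):
    """Render a duration in seconds using HAProxy's time suffixes (s, m, h, d)."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    # Prefer the largest unit that divides the value evenly.
    for unit, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds % size == 0:
            return f"{seconds // size}{unit}"
    return f"{seconds}s"


publish_session = 86400  # assumed to be in seconds (24 hours)
value = haproxy_timeout(publish_session)
# These two directives are the client/server timeouts discussed in this issue.
print(f"timeout client {value}")
print(f"timeout server {value}")
```

The resulting lines would go in the section of the HAProxy configuration that handles the publish port (4505); whether the request server port needs the same value is not settled in this thread.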

@clayoster
Contributor Author

clayoster commented Sep 15, 2024

@dwoz Would you recommend leaving publish_session at the default of 24 hours, or setting it to something lower? Additionally, would it be a good idea to set ping_on_rotate: True in this configuration?

@anthonyra

I'm also trying to follow this tutorial and can't get the minion to "sign_in" with one of the masters, even if I run the salt-minion directly against the salt-master in question. @clayoster do your minion configs simply have master: salt.example.com, or are other settings needed?

@clayoster
Contributor Author

@anthonyra Correct, that is the only master-related setting in my minion config file. Using your example, salt.example.com points to my HAProxy server which load balances connections to 3 masters. My HAProxy config follows the example in the master cluster tutorial aside from using 12h instead of 1m for the timeout values as mentioned in this issue.

Do the minion logs give an indication of what the issue may be?

@anthonyra

I think it was a combination of two things: the master default config sets the user to salt while the minion sets the user to root, and the bootstrap script auto-starts the daemons (so when I changed the config I didn't restart the service, I just started it). I was able to get it working. Thank you for the help and feedback!
