ffmuc-mesh-vpn-wireguard-vxlan: on very busy connection checkuplink often incorrectly recognizes the tunnel as dead and reconnects. #131

Open
istrator2 opened this issue Aug 9, 2024 · 1 comment

istrator2 commented Aug 9, 2024

I have a ZTE MF281 with LTE.

When the connection is under heavy load, checkuplink often incorrectly treats the tunnel as dead and reconnects.

Under heavy load, the latency of client connections increases significantly.
Client ping RTT through the tunnel measured 800-1500 ms on average, with peaks up to 5000 ms and occasional packet loss.
(LTE ping RTT from Gluon without the tunnel rises to approx. 350 ms, with no packet loss, and stays stable there.)
The tunnel is not dead in these cases, it just has high latency!

My idea:

Increase the timeouts (this helps, but not enough)
+
repeat the tests and only treat the tunnel as dead if 3 out of 3 attempts fail.

(Plus a high-latency-mode option to enable or disable the extended tests.)

The code might look like this.
I'm testing it at the moment and it looks promising.

retry_wget() {
    local url="$1"
    local max_attempts=3
    local attempt=1
    local delay=1
    local ret=0

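    while [ "$attempt" -le "$max_attempts" ]
    do
        wget "$url" --timeout=10 -O/dev/null -q && return 0 || ret=$?
        # After the final attempt, report the failure without sleeping again.
        if [ "$attempt" -ge "$max_attempts" ]; then
            logger -p warn -t checkuplink "wget attempt $attempt failed with code $ret, giving up"
            break
        fi
        logger -p warn -t checkuplink "wget attempt $attempt failed with code $ret, retrying in $delay seconds..."
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))    # exponential backoff: 1s, 2s, ...
    done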

    return $ret
}

retry_batctl_ping() {
    local gwmac="$1"
    local max_attempts=3
    local attempt=1
    local delay=1

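    while [ "$attempt" -le "$max_attempts" ]
    do
        if batctl ping -c 7 -t 10 -i 1 "$gwmac" > /dev/null 2>&1; then
            return 0
        fi
        # After the final attempt, report the failure without sleeping again.
        if [ "$attempt" -ge "$max_attempts" ]; then
            logger -p warn -t checkuplink "batctl ping attempt $attempt failed, giving up"
            break
        fi
        logger -p warn -t checkuplink "batctl ping attempt $attempt failed, retrying in $delay seconds..."
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))    # exponential backoff: 1s, 2s, ...
    done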

    return 1
}

is_connected() {
        if retry_wget "http://[$(wg|grep fe80|awk '{split($3,A,"/")};{print A[1]}')%$MESH_VPN_IFACE]/";
        then
                GWMAC=$(batctl gwl|awk '/[*]/{print $2}')
                if retry_batctl_ping "$GWMAC";
                then
                        return 0
                fi
        fi
        return 1
}
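
The on/off switch for the extended tests mentioned above is not part of the code yet. A minimal sketch of how it could be wired in, assuming a hypothetical `HIGH_LATENCY_MODE` variable and placeholder values for attempts and timeout (neither exists in the package today):

```shell
# Sketch only: HIGH_LATENCY_MODE, CHECK_ATTEMPTS, CHECK_TIMEOUT and the
# concrete numbers are assumptions, not existing settings of the package.
set_check_params() {
    case "$1" in
        1|on|true)
            # Extended tests: several attempts and a longer timeout.
            CHECK_ATTEMPTS=3
            CHECK_TIMEOUT=10
            ;;
        *)
            # Plain mode: a single attempt with a shorter timeout.
            CHECK_ATTEMPTS=1
            CHECK_TIMEOUT=5
            ;;
    esac
}

set_check_params "${HIGH_LATENCY_MODE:-0}"
echo "attempts=$CHECK_ATTEMPTS timeout=$CHECK_TIMEOUT"
```

The retry functions could then read `$CHECK_ATTEMPTS` and `$CHECK_TIMEOUT` instead of hard-coding `3` and `10`.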
@blocktrron
Member

Just my 2 cents here.

The nature of the check does not seem well suited to what you are trying to achieve.

Because the HTTP request is blocking and the pings at the L2 routing-protocol level are not executed continuously (which would give a better assessment of the connection state on a lossy link), this check can certainly fail.

If WireGuard does not provide information about bidirectional link health, my suggestion would be to implement a daemon that continuously monitors link health.

You can do this by sending regular UDP packets with sequenced bodies (at fixed or adaptive intervals) in order to assess the link's properties in terms of loss over multiple intervals. With this you can model anomaly conditions in more detail. This can be done on the Ethernet layer within the VXLAN tunnel, with a responder on the other end.

Other indicators might be out-of-order delivery, packet checksums, ...
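
To illustrate the evaluation side of such sequenced probes, here is a rough sketch that turns the sequence numbers seen in one interval into a loss percentage and an out-of-order count. The function name `probe_stats` and its argument layout are made up for this example; sending and receiving the UDP packets is left out entirely.

```shell
# probe_stats SENT SEQ...
#   SENT    number of probes sent in the interval
#   SEQ...  sequence numbers of the replies, in arrival order
# Prints "<loss_percent> <out_of_order_count>".
# Hypothetical helper for illustration; duplicates are not handled.
probe_stats() {
    local sent="$1"
    local received=0
    local reordered=0
    local highest=-1
    local seq
    [ "$sent" -gt 0 ] || return 1
    shift
    for seq in "$@"; do
        received=$((received + 1))
        if [ "$seq" -lt "$highest" ]; then
            # Arrived after a packet with a higher sequence number.
            reordered=$((reordered + 1))
        else
            highest=$seq
        fi
    done
    echo "$(( (sent - received) * 100 / sent )) $reordered"
}

# 10 probes sent, 8 replies, one of them (3) arriving late:
probe_stats 10 0 1 2 4 3 5 8 9    # prints "20 1"
```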

Examples would be:

  • Increasing the request rate when 100% loss is detected continuously over a short interval A
  • Decreasing the request rate when 0% loss is seen over a short interval A
  • Setting different intervals / thresholds based on the uplink type (cellular, etc.)
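
The first two examples could be sketched roughly like this. The function name `next_interval` and the 1 s / 60 s bounds are assumptions for illustration; per-uplink-type settings would simply use different bounds.

```shell
# next_interval CURRENT_SECONDS LOSS_PERCENT
# Prints the probe interval to use next. Hypothetical helper; the
# bounds below are placeholder values, not tuned recommendations.
next_interval() {
    local cur="$1"
    local loss="$2"
    local min=1
    local max=60
    local next
    if [ "$loss" -ge 100 ]; then
        next=$((cur / 2))    # total loss: probe more often
    elif [ "$loss" -eq 0 ]; then
        next=$((cur * 2))    # clean interval: probe less often
    else
        next=$cur            # partial loss: keep the current rate
    fi
    if [ "$next" -lt "$min" ]; then next=$min; fi
    if [ "$next" -gt "$max" ]; then next=$max; fi
    echo "$next"
}

next_interval 8 100    # prints 4
next_interval 8 0      # prints 16
```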

You can also react in other ways, such as updating or implementing shapers when detecting continuous loss.

When implemented as a separate service (interfaced via a regular Unix socket, ubus, a status file, you name it), you can still use the whole of your script.
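
As one possible shape for the status-file variant, a sketch in which the monitor publishes its verdict and the script only reads it. The file path, the function names and the "state timestamp" format are all assumptions made up for this example.

```shell
# Hypothetical status file shared between the monitor and the script.
STATUS_FILE="${STATUS_FILE:-/tmp/linkhealth.status}"

# Monitor side: publish "up" or "down" with a timestamp.
# Write to a temp file first so readers never see a partial line.
write_status() {
    local tmp="$STATUS_FILE.$$"
    printf '%s %s\n' "$1" "$(date +%s)" > "$tmp" && mv "$tmp" "$STATUS_FILE"
}

# Script side: succeed only if the last verdict is "up" and not older
# than MAX_AGE seconds (default 60, an arbitrary placeholder).
link_is_up() {
    local max_age="${1:-60}"
    local state stamp now
    [ -r "$STATUS_FILE" ] || return 1
    read -r state stamp < "$STATUS_FILE" || return 1
    now=$(date +%s)
    [ "$state" = "up" ] && [ $((now - stamp)) -le "$max_age" ]
}
```

The existing script could then call `link_is_up` in place of, or in addition to, its own blocking checks.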

As a second thought, the interface also has Rx packet counters you can base your anomaly assumption on. Granted, this does not replace a check of bidirectional connectivity (assuming that is what you are after, given the current implementation), but you can take it as a factor and adjust your other means of detection accordingly.
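
A small sketch of reading that counter from sysfs and checking whether it moved between two samples. The helper names are invented for this example, and the 10-second window in the usage comment is an arbitrary choice.

```shell
# rx_packets IFACE [SYSFS_ROOT]
# Prints the Rx packet counter of IFACE. The optional second argument
# only exists so the helper can be pointed at a different directory.
rx_packets() {
    cat "${2:-/sys/class/net}/$1/statistics/rx_packets"
}

# rx_advanced PREVIOUS CURRENT
# Succeeds if the counter changed between the two samples.
rx_advanced() {
    [ "$2" -ne "$1" ]
}

# Usage sketch (MESH_VPN_IFACE is the variable the script already uses):
#   before=$(rx_packets "$MESH_VPN_IFACE")
#   sleep 10
#   after=$(rx_packets "$MESH_VPN_IFACE")
#   rx_advanced "$before" "$after" || logger -t checkuplink "no Rx traffic in 10s"
```

A counter that keeps moving only shows that packets arrive, so it can serve as a hint to be more lenient with the other checks rather than as proof of connectivity.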

None of this is meant as go-to implementation guidance, just my ideas on how I would tackle this.
