ffmuc-mesh-vpn-wireguard-vxlan: on very busy connection checkuplink often incorrectly recognizes the tunnel as dead and reconnects. #131

istrator2 · 2024-08-09T07:25:34Z

I have a ZTE MF281 with LTE.

When the connection is under heavy load, checkuplink often wrongly declares the tunnel dead and reconnects.

Under heavy load, the latency of the client connections increases significantly.
I measured client ping RTTs through the tunnel of 800-1500 ms on average, with peaks up to 5000 ms and occasional packet loss.
(The LTE ping RTT from gluon without the tunnel rises to approx. 350 ms with no packet loss and stays stable there.)
The tunnel is not dead, it just has high latency!

My idea:

Increase the timeouts (helps, but not enough)
+
Repeat the tests and treat the tunnel as dead only if 3 out of 3 attempts fail.

(Plus some high.latency.mode option to enable/disable the extended tests.)

Code might look like this.
I'm testing this at the moment, looks promising.

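# Fetch a URL, retrying with exponential backoff before giving up.
# Returns 0 on the first success, otherwise the exit code of the last wget.
retry_wget() {
    local url="$1"
    local max_attempts=3
    local attempt=1
    local delay=1
    local ret=0

    while [ "$attempt" -le "$max_attempts" ]
    do
        wget "$url" --timeout=10 -O/dev/null -q && return 0 || ret=$?
        # Do not log "retrying" or sleep after the final attempt.
        [ "$attempt" -ge "$max_attempts" ] && break
        logger -p warn -t checkuplink "wget attempt $attempt failed with code $ret, retrying in $delay seconds..."
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done

    return "$ret"
}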

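# Ping the gateway MAC via batctl, retrying with exponential backoff.
# Returns 0 on the first successful ping run, 1 if all attempts fail.
retry_batctl_ping() {
    local gwmac="$1"
    local max_attempts=3
    local attempt=1
    local delay=1

    while [ "$attempt" -le "$max_attempts" ]
    do
        if batctl ping -c 7 -t 10 -i 1 "$gwmac" > /dev/null 2>&1; then
            return 0
        fi
        # Do not log "retrying" or sleep after the final attempt.
        [ "$attempt" -ge "$max_attempts" ] && break
        logger -p warn -t checkuplink "batctl ping attempt $attempt failed, retrying in $delay seconds..."
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done

    return 1
}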

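# The tunnel counts as connected only if both checks pass:
# 1. HTTP request to the peer's link-local address on the WireGuard interface
# 2. batctl ping to the currently selected gateway
is_connected() {
    if retry_wget "http://[$(wg|grep fe80|awk '{split($3,A,"/")};{print A[1]}')%$MESH_VPN_IFACE]/"
    then
        # MAC address of the selected gateway (the line marked with "*")
        GWMAC=$(batctl gwl|awk '/[*]/{print $2}')
        # Without a selected gateway there is nothing to ping
        [ -n "$GWMAC" ] || return 1
        if retry_batctl_ping "$GWMAC"
        then
            return 0
        fi
    fi
    return 1
}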
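
The optional high-latency mode mentioned above could be wired in roughly like this. This is only a sketch: `HIGH_LATENCY_MODE` and `RETRY_DELAY` are hypothetical variable names, not existing checkuplink settings, and the generic `with_retries` helper is an assumption that would replace the two specialised retry functions.

```shell
# Sketch only: HIGH_LATENCY_MODE and RETRY_DELAY are hypothetical names,
# not existing checkuplink settings.

# Print how many attempts each check should get.
check_attempts() {
    if [ "${HIGH_LATENCY_MODE:-0}" = "1" ]; then
        echo 3
    else
        echo 1
    fi
}

# with_retries MAX_ATTEMPTS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay between attempts. Returns the exit code of the last failed attempt.
with_retries() {
    local max_attempts="$1"
    shift
    local attempt=1
    local delay="${RETRY_DELAY:-1}"
    local ret=1

    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        else
            ret=$?
        fi
        # No point in sleeping after the final attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            echo "attempt $attempt failed with code $ret, retrying in $delay seconds..." >&2
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done

    return "$ret"
}
```

With this, a check would be called as `with_retries "$(check_attempts)" batctl ping -c 7 -t 10 -i 1 "$GWMAC"`, so that with the mode disabled each test runs exactly once, as before.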

blocktrron · 2024-08-09T08:51:13Z

Just my 2 cents here.

The nature of the check does not seem well suited to what you are trying to achieve.

Given that the HTTP request is blocking and the pings at the L2 routing protocol level are not executed continuously (which would give a better assessment of the connection state on a lossy link), this can certainly fail.

If WireGuard does not provide information about bidirectional link health, my suggestion would be to implement a daemon which continuously monitors the link health.

You can do this by sending regular UDP packets with sequenced bodies (at fixed or adaptive intervals) in order to assess the link's properties in terms of loss over multiple intervals. With this you can model the anomaly conditions in a more detailed way. This can be done on the Ethernet layer within the vxlan tunnel with a responder on the other end.

Other indicators might be out-of-order delivery, packet checksums, ...
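
To make the sequenced-probe idea concrete, here is a minimal sketch of the evaluation step only. It assumes some prober (not shown, and not part of the existing script) has sent probes numbered 1..SENT during one interval and recorded the sequence numbers of the replies in arrival order. The function name `probe_stats` is hypothetical, and duplicate replies are not handled.

```shell
# probe_stats SENT [SEQ...]
#   SENT    number of probes sent in the interval
#   SEQ...  sequence numbers of the replies, in arrival order
# Prints "<loss percent> <out-of-order count>".
probe_stats() {
    local sent="$1"
    shift
    local received=0
    local reordered=0
    local highest=0
    local seq

    if [ "$sent" -le 0 ]; then
        echo "0 0"
        return 1
    fi

    for seq in "$@"; do
        received=$((received + 1))
        if [ "$seq" -lt "$highest" ]; then
            # Arrived after a probe with a higher sequence number
            reordered=$((reordered + 1))
        else
            highest="$seq"
        fi
    done

    echo "$(( (sent - received) * 100 / sent )) $reordered"
}
```

For example, `probe_stats 10 1 2 4 3 5` reports 50% loss and one out-of-order reply, which is a much richer signal than a single pass/fail ping.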

Examples would be:

Increasing the request rate when 100% loss is detected continuously over a short interval A
Decreasing the request rate when 0% loss is seen over a short interval A
Setting different intervals / thresholds based on the uplink type (cellular / etc.)
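
The adaptive behaviour in these examples could look roughly like the sketch below. The function names and all numeric limits are assumptions chosen for illustration, not measured or recommended values.

```shell
# uplink_limits TYPE
# Print "<min interval> <max interval>" in seconds for an uplink type.
# The numbers are placeholders, not tuned values.
uplink_limits() {
    case "$1" in
        cellular) echo "2 120" ;;
        *)        echo "1 60" ;;
    esac
}

# next_interval CURRENT LOSS_PERCENT [MIN] [MAX]
# Probe faster (halve the interval) on 100% loss, slower (double it) on
# 0% loss, and keep the interval unchanged for anything in between.
next_interval() {
    local current="$1"
    local loss="$2"
    local min="${3:-1}"
    local max="${4:-60}"
    local next="$current"

    if [ "$loss" -ge 100 ]; then
        next=$((current / 2))
    elif [ "$loss" -eq 0 ]; then
        next=$((current * 2))
    fi

    # Clamp to the allowed range
    if [ "$next" -lt "$min" ]; then
        next="$min"
    fi
    if [ "$next" -gt "$max" ]; then
        next="$max"
    fi

    echo "$next"
}
```

A daemon loop would call `next_interval` after each measurement interval, passing the limits from `uplink_limits` as the third and fourth arguments.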

You can also react in other ways, such as updating or implementing shapers when detecting continuous loss.

When implemented as a separate service (interfaced via a regular Unix socket, ubus, a status file, you name it), you can still use the whole of your script.

As a second thought, the interface also has Rx packet counters you can base your anomaly assumption on. Granted, this does not replace a check of bidirectional connectivity (assuming this is what you are after, given the current implementation), but you can take it as a factor and adjust your other means of detection accordingly.
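
A minimal sketch of the Rx counter idea, assuming the standard Linux sysfs statistics files are available on the router. The function names and the `SYSFS_NET` override (used only to make the path testable) are hypothetical, and counter wrap-around is ignored.

```shell
# rx_packets IFACE
# Print the current Rx packet counter of a network interface.
# SYSFS_NET is a hypothetical override for the sysfs path.
rx_packets() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/rx_packets"
}

# rx_advanced BEFORE AFTER
# Succeed if the counter grew between two samples, i.e. the interface
# received at least one packet in the meantime.
rx_advanced() {
    [ "$2" -gt "$1" ]
}
```

Sampling `rx_packets "$MESH_VPN_IFACE"` before and after a failed check and passing both values to `rx_advanced` would show whether traffic is still arriving. If it is, the failure is more likely congestion than a dead tunnel, so a reconnect could be postponed.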

None of this is meant as definitive implementation guidance, just my ideas on how I would tackle this.
