
No Healthy Upstream following MS Docs on AAG for Containers #124454

Open
dan93-93 opened this issue Sep 18, 2024 · 7 comments

Comments

@dan93-93

dan93-93 commented Sep 18, 2024

I've been following the docs verbatim over the past two days and pulling my hair out trying to work out why this error is occurring.

Firstly, I ran all the az shell commands from this article:

Quickstart: Deploy Application Gateway for Containers ALB Controller.

All resources were deployed successfully, including the ALB controller. The description for HELM_NAMESPACE='<your cluster name>' did strike me as odd (is naming the Helm namespace after the cluster a convention?), so I just set it to 'default'.

Then I ran through the BYO deployment article:

Quickstart: Create Application Gateway for Containers - bring your own deployment

Created a new vNet and subnet as I'm doing BYO:

VNET_ADDRESS_PREFIX='10.0.0.0/16'  # Allows for multiple subnets
SUBNET_ADDRESS_PREFIX='10.0.1.0/24'  # Provides 256 addresses, meeting the 250 requirement
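
For reference, the creation itself was along these lines (the VNet/subnet names here are illustrative and the exact flag spellings are from memory, so treat this as a sketch rather than the exact commands from my script):

az network vnet create \
  --resource-group rg-dan-microservices-001 \
  --name vnet-alb-test \
  --address-prefixes $VNET_ADDRESS_PREFIX \
  --subnet-name subnet-alb-test \
  --subnet-prefixes $SUBNET_ADDRESS_PREFIX

# AGC needs the subnet delegated to the traffic controller service
az network vnet subnet update \
  --resource-group rg-dan-microservices-001 \
  --vnet-name vnet-alb-test \
  --name subnet-alb-test \
  --delegations 'Microsoft.ServiceNetworking/trafficControllers'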

I noticed this specific article has you create a frontend whose name doesn't match the frontend name used in the next article. It'd be better if the documentation referenced the same frontend name throughout to avoid misconfiguring the gateway: FRONTEND_NAME='test-frontend'?
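
For context, the frontend was created roughly like this; the only deviation from the docs is the name, and the flag names are as I recall them from the alb CLI extension, so double-check them:

FRONTEND_NAME='test-frontend'
az network alb frontend create \
  --resource-group rg-dan-microservices-001 \
  --alb-name alb-test \
  -n $FRONTEND_NAME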

The last article I followed covered SSL offloading:

SSL offloading with Application Gateway for Containers - Gateway API

As mentioned above, the frontend is referenced here as FRONTEND_NAME='frontend', whereas the previous article used FRONTEND_NAME='test-frontend' (please correct me if this isn't right, but it would seem more appropriate to reference the previous article's frontend name).

Going through the documentation, doing nothing beyond what the docs describe (bar the frontend name change), and making sure the route and gateway deploy successfully, curling the FQDN still returns a No healthy upstream response.
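
For clarity, this is how the FQDN is being resolved and curled (same pattern as the docs, using the gateway/namespace names that appear later in this thread):

fqdn=$(kubectl get gateway gateway-01 -n test-infra -o jsonpath='{.status.addresses[0].value}')
curl --insecure https://$fqdn/
# returns: no healthy upstream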

I've attached a .txt file I created to track all the shell commands that needed to be run, AAG4C Deployment.txt; if there's any misconfiguration in there I'd really appreciate knowing where. I've redacted some info in it, and you can convert it to a .sh for easier reading.

I have also gone through the troubleshooting guidance and pulled the ALB controller logs, which suggest the config update went through:

{"level":"info","version":"1.2.3","AGC":"alb-test","alb-resource-id":"/subscriptions/b6f333ab-db5c-42fe-a88f-508eef404579/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test","operationID":"f70b34f7-39ad-4359-be93-6fc2412708ed","Timestamp":"2024-09-18T16:32:53.603597699Z","message":"Application Gateway for Containers resource config update OPERATION_STATUS_SUCCESS with operation ID f70b34f7-39ad-4359-be93-6fc2412708ed"}

So I'm a bit stuck, I've got to admit; I'd really like to know where I've gone wrong.


services: application-gateway
author: @GregLin
ms.service: azure-application-gateway
ms.subservice: appgw-for-containers
ms.custom: devx-track-azurecli
ms.topic: quickstart
ms.author: @GregLin

@ManoharLakkoju-MSFT
Contributor

@dan93-93
Thanks for your feedback! We will investigate and update as appropriate.

@dan93-93
Author

dan93-93 commented Sep 19, 2024

I'd also like to add the following:

I've tried troubleshooting the issue and looking at backend health:

kubectl get pods -n azure-alb-system
kubectl logs alb-controller-5cdcb6459b-ck2pf -n azure-alb-system -c alb-controller # Standby controller
kubectl logs alb-controller-5cdcb6459b-lqdhv -n azure-alb-system -c alb-controller # Elected controller

kubectl port-forward alb-controller-5cdcb6459b-lqdhv -n $CONTROLLER_NAMESPACE 8000 8001
curl 'http://127.0.0.1:8000/backendHealth?service-name=test-infra/echo/80&detailed=true'

What was returned:

{
  "services": [
    {
      "serviceName": "test-infra/echo/80",
      "serviceHealth": [
        {
          "albId": "/subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test",
          "totalEndpoints": 1,
          "totalHealthyEndpoints": 0,
          "totalUnhealthyEndpoints": 1,
          "endpoints": [
            {
              "address": "10.224.0.42",
              "health": {
                "status": "UNHEALTHY"
              }
            }
          ]
        }
      ]
    }
  ]
}

I created a deployment.yaml file and added a HealthCheckPolicy (used kubectl apply -f deployment.yaml):

apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: gateway-health-check-policy
  namespace: test-infra
spec:
  targetRef:
    group: ""
    kind: Service
    name: echo
    namespace: test-infra
  default:
    interval: 5s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 1
    port: 80
    http:
      path: /
      match:
        statusCodes:
        - start: 200
          end: 299

And I also configured a readiness probe:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: echo
  name: echo
  namespace: test-infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - image: gcr.io/k8s-staging-ingressconformance/echoserver:v20220815-e21d1a4
        name: echo
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]
        ports:
          - containerPort: 3000
        readinessProbe:
          httpGet:
            path: /
            port: 3000
          periodSeconds: 3
          timeoutSeconds: 1
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace

Unfortunately, even with these additions it still returns UNHEALTHY. Note that 10.224.0.42 is also the Pod IP; surely it shouldn't care about the Pod IP but rather the Service itself?

I deployed a curl pod:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
  namespace: test-infra
spec:
  containers:
  - name: curl-container
    image: curlimages/curl:latest
    command: ["sleep", "3600"]
EOF

Running the commands below returns the same response from both:

kubectl exec -it curl-pod -n test-infra -- /bin/sh
curl http://echo.test-infra.svc.cluster.local:80/
curl http://10.224.0.42:3000/

{
 "path": "/",
 "host": "echo.test-infra.svc.cluster.local",
 "method": "GET",
 "proto": "HTTP/1.1",
 "headers": {
  "Accept": [
   "*/*"
  ],
  "User-Agent": [
   "curl/8.10.1"
  ]
 },
 "namespace": "test-infra",
 "ingress": "",
 "service": "",
 "pod": "echo-7965899f7d-hvw4l"
}
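
If it helps narrow things down, the Service's port-to-targetPort mapping and the endpoints behind it can be checked with something like:

# Show how the echo Service maps its port to the container targetPort
kubectl get svc echo -n test-infra -o jsonpath='{.spec.ports}'

# Show the endpoint addresses the Service currently has (should list 10.224.0.42:3000)
kubectl get endpoints echo -n test-infra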

So why is backendHealth reporting the endpoint as unhealthy? What constitutes it being unhealthy? The detailed logs don't give any indication as to why either...

@TPavanBalaji
Contributor

@dan93-93
Thank you for bringing this to our attention.
I've delegated this to the content author, who will review it and offer their insights.

@dan93-93
Author

Is there any update, or is there any more information you need from me?

@smithjw

smithjw commented Sep 30, 2024

@dan93-93 @greg-lindsay I believe I've found the issue here, as I was also running into the dreaded No healthy upstream error with my containers. The problem appears to be with the default HealthCheckPolicy. What I found is that you need to configure your own custom HealthCheckPolicy pointing to the targetPort of the Service that the HTTPRoute references, not the backendRefs port. As soon as that custom HealthCheckPolicy was created and deployed, my App Gateways started working as intended.

This is likely something that needs to be better called out in the documentation, as it will catch a lot of people out. Following the example in SSL offloading with Application Gateway for Containers - Gateway API, if you add a health check like the following, things should work:

apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: echo-healthcheck
  namespace: test-infra
spec:
  targetRef:
    group: ''
    kind: Service
    name: echo
    namespace: test-infra
  default:
    interval: 10s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 5
    port: 3000 # targetPort of the service, not port
    http:
      path: /
      match:
        statusCodes:
          - start: 200
            end: 299
    useTLS: false
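
Once that's applied, it's worth confirming the policy was accepted before re-testing, e.g.:

kubectl get healthcheckpolicy echo-healthcheck -n test-infra -o yaml
# status.conditions should report type: Accepted, status: True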

@dan93-93
Author

@smithjw thanks for the input. I just ran the below and unfortunately I'm still running into this error, even with the policy in place using the targetPort:

kubectl apply -f https://trafficcontrollerdocs.blob.core.windows.net/examples/https-scenario/ssl-termination/deployment.yaml

Then created a healthcheckpolicy:

kubectl apply -f - <<EOF
apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: echo-healthcheck
  namespace: test-infra
spec:
  targetRef:
    group: ''
    kind: Service
    name: echo
    namespace: test-infra
  default:
    interval: 10s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 5
    port: 3000
    http:
      path: /
      match:
        statusCodes:
          - start: 200
            end: 299
    useTLS: false
EOF

The policy appears to be valid:

Status:
  Conditions:
    Last Transition Time:  2024-09-30T14:33:41Z
    Message:               Valid HealthCheckPolicy
    Observed Generation:   2
    Reason:                Accepted
    Status:                True
    Type:                  Accepted
Events:                    <none>

Then I created the gateway and HTTPRoute as per the docs, checking that each resource deployed successfully:

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway-01
  namespace: test-infra
  annotations:
    alb.networking.azure.io/alb-id: $RESOURCE_ID
spec:
  gatewayClassName: azure-alb-external
  listeners:
  - name: https-listener
    port: 443
    protocol: HTTPS
    allowedRoutes:
      namespaces:
        from: Same
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        group: ""
        name: listener-tls-secret
  addresses:
  - type: alb.networking.azure.io/alb-frontend
    value: $FRONTEND_NAME
EOF

# Check gateway
kubectl get gateway gateway-01 -n test-infra -o yaml

# Create HttpRoute - do not deploy manually if using PythonApp.yaml (it's in the file)
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: https-route
  namespace: test-infra
spec:
  parentRefs:
  - name: gateway-01
    sectionName: https-listener
  rules:
  - backendRefs:
    - name: echo
      port: 80
EOF

# Check HttpRoute
kubectl get httproute https-route -n test-infra -o yaml

It still thinks the pod's endpoint is unhealthy:

{
  "services": [
    {
      "serviceName": "test-infra/echo/80",
      "serviceHealth": [
        {
          "albId": "/subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test",
          "totalEndpoints": 1,
          "totalHealthyEndpoints": 0,
          "totalUnhealthyEndpoints": 1,
          "endpoints": [
            {
              "address": "10.224.0.7",
              "health": {
                "status": "UNHEALTHY"
              }
            }
          ]
        }
      ]
    }
  ]
}

I deleted the pod as well to see if a fresh pod might help, but sadly not.

@smithjw

smithjw commented Oct 4, 2024

@dan93-93 OK, something else I've seen which I think may be it. I spun up a new deployment and service, modified my HTTPRoute, and added a custom health check with the backend port, but saw the same Unhealthy message.

Just on a whim I tried setting the HealthCheckPolicy back to the Service port, deployed that, then modified the HealthCheckPolicy back to the backend port, deployed again, and now it's reporting healthy.

I have no idea why it doesn't work the first time, but it seems to require creating the HealthCheckPolicy, changing the port, and then changing it back again before it reports as healthy.
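
In case it helps anyone reproduce the workaround, the toggle amounts to something like this (kubectl patch is just one way to do it; re-applying edited YAML works the same, and the names below are the ones used earlier in this thread):

# 1. Point the probe at the Service port first...
kubectl patch healthcheckpolicy echo-healthcheck -n test-infra \
  --type merge -p '{"spec":{"default":{"port":80}}}'

# 2. ...then switch it back to the container targetPort
kubectl patch healthcheckpolicy echo-healthcheck -n test-infra \
  --type merge -p '{"spec":{"default":{"port":3000}}}'

# 3. Re-check backend health (with the controller port-forward from earlier still running)
curl 'http://127.0.0.1:8000/backendHealth?service-name=test-infra/echo/80&detailed=true'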
