diff --git a/README.md b/README.md
index 5c6c9bb..4e2c9c8 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,99 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+### Resource Requirements
+
+Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size.
+e.g. for the default 500MB model file, the peak memory usage can reach 1.8GB - 2GB.
+However, memory usage drops back to a much lower level (around 800MB - 900MB for the default model) once the model is loaded.
+Please make sure your Kubernetes cluster has enough resources to run the service.
+
+## Requirements
+
+Kubernetes: `>= 1.21.0`
+
+| Repository | Name | Version |
+| ----------------------------- | ------------ | ------- |
+| oci://ghcr.io/magda-io/charts | magda-common | 4.2.1 |
+
+## Values
+
+| Key | Type | Default | Description |
+| ---------------------------------- | ------ | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| affinity | object | `{}` | |
+| autoscaling.hpa.enabled | bool | `false` | |
+| autoscaling.hpa.maxReplicas | int | `3` | |
+| autoscaling.hpa.minReplicas | int | `1` | |
+| autoscaling.hpa.targetCPU | int | `90` | |
+| autoscaling.hpa.targetMemory | string | `""` | |
+| bodyLimit | int | Defaults to 10485760 (10MB). | Defines the maximum payload, in bytes, that the server is allowed to accept. |
+| closeGraceDelay | int | Defaults to 25000 (25s). | The maximum amount of time before forcefully closing pending requests. This should be set to a value lower than the Pod's termination grace period (which defaults to 30s). |
+| debug | bool | `false` | Start the Fastify app in debug mode with the Node.js inspector (inspector port 9320). |
+| defaultImage.imagePullSecret | bool | `false` | |
+| defaultImage.pullPolicy | string | `"IfNotPresent"` | |
+| defaultImage.repository | string | `"ghcr.io/magda-io"` | |
+| deploymentAnnotations | object | `{}` | |
+| envFrom | list | `[]` | |
+| extraContainers | string | `""` | |
+| extraEnvs | list | `[]` | |
+| extraInitContainers | string | `""` | |
+| extraVolumeMounts | list | `[]` | |
+| extraVolumes | list | `[]` | |
+| fullnameOverride | string | `""` | |
+| global.image | object | `{}` | |
+| global.rollingUpdate | object | `{}` | |
+| hostAliases | list | `[]` | |
+| image.name | string | `"magda-embedding-api"` | |
+| lifecycle | object | `{}` | Pod lifecycle policies as outlined here: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks |
+| livenessProbe.failureThreshold | int | `10` | |
+| livenessProbe.httpGet.path | string | `"/status/liveness"` | |
+| livenessProbe.httpGet.port | int | `3000` | |
+| livenessProbe.initialDelaySeconds | int | `10` | |
+| livenessProbe.periodSeconds | int | `20` | |
+| livenessProbe.successThreshold | int | `1` | |
+| livenessProbe.timeoutSeconds | int | `5` | |
+| logLevel | string | `info` | The log level of the application. One of 'fatal', 'error', 'warn', 'info', 'debug', 'trace'; 'silent' is also supported to disable logging. Any other value defines a custom level and requires supplying a level value via levelVal. |
+| nameOverride | string | `""` | |
+| nodeSelector | object | `{}` | |
+| pluginTimeout | int | Defaults to 10000 (10 seconds). | The maximum amount of time, in milliseconds, in which a Fastify plugin can load. If a plugin does not load within this time, ready will complete with an Error with code 'ERR_AVVIO_PLUGIN_TIMEOUT'. |
+| podAnnotations | object | `{}` | |
+| podSecurityContext.runAsUser | int | `1000` | |
+| priorityClassName | string | `"magda-9"` | |
+| rbac.automountServiceAccountToken | bool | `false` | Controls whether or not the Service Account token is automatically mounted to /var/run/secrets/kubernetes.io/serviceaccount |
+| rbac.create | bool | `false` | |
+| rbac.serviceAccountAnnotations | object | `{}` | |
+| rbac.serviceAccountName | string | `""` | |
+| readinessProbe.failureThreshold | int | `10` | |
+| readinessProbe.httpGet.path | string | `"/status/readiness"` | |
+| readinessProbe.httpGet.port | int | `3000` | |
+| readinessProbe.initialDelaySeconds | int | `10` | |
+| readinessProbe.periodSeconds | int | `20` | |
+| readinessProbe.successThreshold | int | `1` | |
+| readinessProbe.timeoutSeconds | int | `5` | |
+| replicas | int | `1` | |
+| resources.limits.memory | string | `"2000M"` | The memory limit of the container. Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size. When changing the default model, be sure to test the peak memory usage of the service before setting the memory limit. |
+| resources.requests.cpu | string | `"100m"` | |
+| resources.requests.memory | string | `"850M"` | |
+| service.annotations | object | `{}` | |
+| service.httpPortName | string | `"http"` | |
+| service.labels | object | `{}` | |
+| service.loadBalancerIP | string | `""` | |
+| service.loadBalancerSourceRanges | list | `[]` | |
+| service.name | string | `"magda-embedding-api"` | |
+| service.nodePort | string | `""` | |
+| service.port | int | `80` | |
+| service.targetPort | int | `3000` | |
+| service.type | string | `"ClusterIP"` | |
+| startupProbe.failureThreshold | int | `30` | |
+| startupProbe.httpGet.path | string | `"/status/startup"` | |
+| startupProbe.httpGet.port | int | `3000` | |
+| startupProbe.initialDelaySeconds | int | `10` | |
+| startupProbe.periodSeconds | int | `10` | |
+| startupProbe.successThreshold | int | `1` | |
+| startupProbe.timeoutSeconds | int | `5` | |
+| tolerations | list | `[]` | |
+| topologySpreadConstraints | list | `[]` | The pod topology spread constraints; see https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ |
+
 ### Build & Run for Local Development
 
 > Please note: for production deployment, please use the released [Docker images](https://github.com/magda-io/magda-embedding-api/pkgs/container/magda-embedding-api) & [helm charts](https://github.com/magda-io/magda-embedding-api/pkgs/container/charts%2Fmagda-embedding-api).
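+Once the service is built and running locally, you can sanity-check it over HTTP. This is a sketch only: it assumes the service listens on port 3000 (the chart's default probe and `service.targetPort` value); the `/status/readiness` path comes from the chart's probe configuration above, while the `/v1/embeddings` path and the `model` value are assumptions based on the OpenAI-compatible interface, so confirm them against your deployment:
+
+```bash
+# Readiness check (path taken from the chart's readinessProbe settings)
+curl http://localhost:3000/status/readiness
+
+# Request an embedding via the OpenAI-compatible interface; treat the
+# /v1/embeddings path and the model name as placeholders to verify
+curl http://localhost:3000/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{"input": "Hello, world!", "model": "default"}'
+```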
@@ -53,89 +146,3 @@ Deploy to minikube Cluster
 
 ```bash
 helm -n test upgrade --install test ./deploy/magda-embedding-api -f ./deploy/test-deploy.yaml
 ```
-
-## Requirements
-
-Kubernetes: `>= 1.21.0`
-
-| Repository | Name | Version |
-| ----------------------------- | ------------ | ------- |
-| oci://ghcr.io/magda-io/charts | magda-common | 4.2.1 |
-
-## Values
-
-| Key | Type | Default | Description |
-| ---------------------------------- | ------ | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| affinity | object | `{}` | |
-| autoscaling.hpa.enabled | bool | `false` | |
-| autoscaling.hpa.maxReplicas | int | `3` | |
-| autoscaling.hpa.minReplicas | int | `1` | |
-| autoscaling.hpa.targetCPU | int | `90` | |
-| autoscaling.hpa.targetMemory | string | `""` | |
-| bodyLimit | int | Default to 10485760 (10MB). | Defines the maximum payload, in bytes, that the server is allowed to accept |
-| closeGraceDelay | int | Default to 25000 (25s). | The maximum amount of time before forcefully closing pending requests. This should set to a value lower than the Pod's termination grace period (which is default to 30s) |
-| debug | bool | `false` | Start Fastify app in debug mode with nodejs inspector inspector port is 9320 |
-| defaultImage.imagePullSecret | bool | `false` | |
-| defaultImage.pullPolicy | string | `"IfNotPresent"` | |
-| defaultImage.repository | string | `"ghcr.io/magda-io"` | |
-| deploymentAnnotations | object | `{}` | |
-| envFrom | list | `[]` | |
-| extraContainers | string | `""` | |
-| extraEnvs | list | `[]` | |
-| extraInitContainers | string | `""` | |
-| extraVolumeMounts | list | `[]` | |
-| extraVolumes | list | `[]` | |
-| fullnameOverride | string | `""` | |
-| global.image | object | `{}` | |
-| global.rollingUpdate | object | `{}` | |
-| hostAliases | list | `[]` | |
-| image.name | string | `"magda-embedding-api"` | |
-| lifecycle | object | `{}` | pod lifecycle policies as outlined here: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks |
-| livenessProbe.failureThreshold | int | `10` | |
-| livenessProbe.httpGet.path | string | `"/status/liveness"` | |
-| livenessProbe.httpGet.port | int | `3000` | |
-| livenessProbe.initialDelaySeconds | int | `10` | |
-| livenessProbe.periodSeconds | int | `20` | |
-| livenessProbe.successThreshold | int | `1` | |
-| livenessProbe.timeoutSeconds | int | `5` | |
-| logLevel | string | `info`. | The log level of the application. one of 'fatal', 'error', 'warn', 'info', 'debug', 'trace'; also 'silent' is supported to disable logging. Any other value defines a custom level and requires supplying a level value via levelVal. |
-| nameOverride | string | `""` | |
-| nodeSelector | object | `{}` | |
-| pluginTimeout | int | Default to 10000 (10 seconds). | The maximum amount of time in milliseconds in which a fastify plugin can load. If not, ready will complete with an Error with code 'ERR_AVVIO_PLUGIN_TIMEOUT'. |
-| podAnnotations | object | `{}` | |
-| podSecurityContext.runAsUser | int | `1000` | |
-| priorityClassName | string | `"magda-9"` | |
-| rbac.automountServiceAccountToken | bool | `false` | Controls whether or not the Service Account token is automatically mounted to /var/run/secrets/kubernetes.io/serviceaccount |
-| rbac.create | bool | `false` | |
-| rbac.serviceAccountAnnotations | object | `{}` | |
-| rbac.serviceAccountName | string | `""` | |
-| readinessProbe.failureThreshold | int | `10` | |
-| readinessProbe.httpGet.path | string | `"/status/readiness"` | |
-| readinessProbe.httpGet.port | int | `3000` | |
-| readinessProbe.initialDelaySeconds | int | `10` | |
-| readinessProbe.periodSeconds | int | `20` | |
-| readinessProbe.successThreshold | int | `1` | |
-| readinessProbe.timeoutSeconds | int | `5` | |
-| replicas | int | `1` | |
-| resources.limits.memory | string | `"2000M"` | |
-| resources.requests.cpu | string | `"100m"` | |
-| resources.requests.memory | string | `"850M"` | |
-| service.annotations | object | `{}` | |
-| service.httpPortName | string | `"http"` | |
-| service.labels | object | `{}` | |
-| service.loadBalancerIP | string | `""` | |
-| service.loadBalancerSourceRanges | list | `[]` | |
-| service.name | string | `"magda-embedding-api"` | |
-| service.nodePort | string | `""` | |
-| service.port | int | `80` | |
-| service.targetPort | int | `3000` | |
-| service.type | string | `"ClusterIP"` | |
-| startupProbe.failureThreshold | int | `30` | |
-| startupProbe.httpGet.path | string | `"/status/startup"` | |
-| startupProbe.httpGet.port | int | `3000` | |
-| startupProbe.initialDelaySeconds | int | `10` | |
-| startupProbe.periodSeconds | int | `10` | |
-| startupProbe.successThreshold | int | `1` | |
-| startupProbe.timeoutSeconds | int | `5` | |
-| tolerations | list | `[]` | |
-| topologySpreadConstraints | list | `[]` | This is the pod topology spread constraints https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ |
diff --git a/README.md.gotmpl b/README.md.gotmpl
index 3a76bdc..1671c23 100644
--- a/README.md.gotmpl
+++ b/README.md.gotmpl
@@ -19,6 +19,21 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+### Resource Requirements
+
+Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size.
+e.g. for the default 500MB model file, the peak memory usage can reach 1.8GB - 2GB.
+However, memory usage drops back to a much lower level (around 800MB - 900MB for the default model) once the model is loaded.
+Please make sure your Kubernetes cluster has enough resources to run the service.
+
+{{ template "chart.maintainersSection" . }}
+
+{{ template "chart.requirementsSection" . }}
+
+{{ template "chart.valuesHeader" . }}
+
+{{ template "chart.valuesTable" . }}
+
 ### Build & Run for Local Development
 
 > Please note: for production deployment, please use the released [Docker images](https://github.com/magda-io/magda-embedding-api/pkgs/container/magda-embedding-api) & [helm charts](https://github.com/magda-io/magda-embedding-api/pkgs/container/charts%2Fmagda-embedding-api).
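+When swapping in a larger model, the default memory settings may need to be raised. Below is a hedged sketch of overriding them at install time, reusing the chart's documented `resources.*` keys; the specific numbers are illustrative placeholders, not recommendations:
+
+```bash
+# Raise the memory request/limit for a larger model; measure the model's
+# actual peak memory usage (see the note above) before settling on values
+helm -n test upgrade --install test ./deploy/magda-embedding-api \
+  --set resources.requests.memory=1200M \
+  --set resources.limits.memory=3000M
+```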
@@ -53,12 +68,4 @@ Deploy to minikube Cluster
 
 ```bash
 helm -n test upgrade --install test ./deploy/magda-embedding-api -f ./deploy/test-deploy.yaml
-```
-
-{{ template "chart.maintainersSection" . }}
-
-{{ template "chart.requirementsSection" . }}
-
-{{ template "chart.valuesHeader" . }}
-
-{{ template "chart.valuesTable" . }}
\ No newline at end of file
+```
\ No newline at end of file
diff --git a/deploy/magda-embedding-api/values.yaml b/deploy/magda-embedding-api/values.yaml
index ec3d5a7..a6d8c54 100644
--- a/deploy/magda-embedding-api/values.yaml
+++ b/deploy/magda-embedding-api/values.yaml
@@ -161,4 +161,7 @@ resources:
     cpu: "100m"
     memory: "850M"
   limits:
+    # -- (string) The memory limit of the container.
+    # Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size.
+    # When changing the default model, be sure to test the peak memory usage of the service before setting the memory limit.
    memory: "2000M"
\ No newline at end of file
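To test the peak memory usage before committing to a limit, one rough approach is to watch the pod while exercising the service. A sketch only: it assumes the `test` namespace used in the deploy example above and requires metrics-server to be available in the cluster:

```bash
# Observe current memory usage of the pods; run this while the model is
# loading and while serving embedding requests to capture the peak
kubectl -n test top pods
```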