Smart Elasticity in Practice: Integrating Kubernetes HPA with Guance Custom Metrics

    Preface

    Kubernetes, the leader in container orchestration, has established itself as the de facto standard of cloud-native technology. Among its core components, the Horizontal Pod Autoscaler (HPA) plays a crucial role: it lets applications adjust resources dynamically based on real-time workload and performance metrics, keeping services highly available and responsive.

    At the same time, the Kubernetes Custom Metrics API and External Metrics API give the HPA powerful extension points, allowing users to tailor scaling policies to concrete business needs. Whether the signal is a business-specific metric such as QPS or some other, more complex performance parameter, users can achieve precise resource management.

    This article takes a deep dive into integrating the Guance platform with the Custom Metrics API to implement intelligent autoscaling driven by custom metrics. Through a practical business scenario, we show how to automatically adjust resources based on key performance indicators such as per-replica QPS, adapting to ever-changing business demand.

    By following this guide, you will learn how to build a flexible and efficient containerized application environment that maintains optimal performance across a variety of business scenarios.

    Procedure

    1. Deploy the business application

    Deploy the application container to the cluster. Here we provide a prepackaged httpserver image. The program exposes an HTTP request metric named http_requests_total, a cumulative count of all requests served.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: httpserver
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: httpserver
      template:
        metadata:
          labels:
            app: httpserver
        spec:
          containers:
            - name: httpserver
              image: pubrepo.jiagouyun.com/demo/httpserver:guance-metrics
              imagePullPolicy: Always
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: httpserver
      namespace: default
      labels:
        app: httpserver
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/path: '/metrics'
        prometheus.io/port: 'http'
    spec:
      type: ClusterIP
      ports:
      - name: http
        port: 80
        targetPort: 80
      selector:
        app: httpserver
    

    2. Install guance-metrics-apiserver

    When a request hits a standard Kubernetes API endpoint, it first passes through the Aggregation layer, where kube-aggregator acts as a proxy that forwards the request to the appropriate backend service. kube-aggregator is essentially a URL-routing proxy server that dispatches traffic to different API backends based on the request URL. This design gives the Kubernetes API great flexibility and extensibility; its workflow is illustrated in the figure below:

    The guance-metrics-apiserver we use here is a custom API backend deployed inside the Kubernetes cluster. It receives, validates, and injects custom metric data, which is the concrete implementation of the blue access path in the figure above.

    The YAML manifest is as follows:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: guance-metrics
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: guance-metrics:system:auth-delegator
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:auth-delegator
    subjects:
      - kind: ServiceAccount
        name: guance-metrics-apiserver
        namespace: guance-metrics
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: guance-metrics-auth-reader
      namespace: kube-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: extension-apiserver-authentication-reader
    subjects:
      - kind: ServiceAccount
        name: guance-metrics-apiserver
        namespace: guance-metrics
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: guance-metrics-apiserver
      name: guance-metrics-apiserver
      namespace: guance-metrics
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: guance-metrics-apiserver
      template:
        metadata:
          labels:
            app: guance-metrics-apiserver
          name: guance-metrics-apiserver
        spec:
          serviceAccountName: guance-metrics-apiserver
          containers:
            - name: guance-metrics-apiserver
              image: pubrepo.jiagouyun.com/base/guance-metrics-adapter:latest
              imagePullPolicy: IfNotPresent
              args:
                - --secure-port=6443
                - --cert-dir=/var/run/serving-cert
                - --v=10
              ports:
                - containerPort: 6443
                  name: https
                - containerPort: 8080
                  name: http
              volumeMounts:
                - mountPath: /tmp
                  name: temp-vol
                  readOnly: false
                - mountPath: /var/run/serving-cert
                  name: volume-serving-cert
                  readOnly: false
          volumes:
            - name: temp-vol
              emptyDir: {}
            - name: volume-serving-cert
              emptyDir: {}
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: guance-metrics-resource-reader
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: guance-metrics-resource-reader
    subjects:
      - kind: ServiceAccount
        name: guance-metrics-apiserver
        namespace: guance-metrics
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: guance-metrics-services-proxy
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: guance-metrics-services-proxy
    subjects:
      - kind: ServiceAccount
        name: guance-metrics-apiserver
        namespace: guance-metrics
    ---
    kind: ServiceAccount
    apiVersion: v1
    metadata:
      name: guance-metrics-apiserver
      namespace: guance-metrics
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: guance-metrics-apiserver
      namespace: guance-metrics
      annotations:
        kubernetes.io/service-account.name: guance-metrics-apiserver
    type: kubernetes.io/service-account-token
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: guance-metrics-apiserver
      namespace: guance-metrics
    spec:
      ports:
        - name: https
          port: 443
          targetPort: 6443
        - name: http
          port: 80
          targetPort: 8080
      selector:
        app: guance-metrics-apiserver
    ---
    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      name: v1beta1.custom.metrics.k8s.io
    spec:
      service:
        name: guance-metrics-apiserver
        namespace: guance-metrics
      group: custom.metrics.k8s.io
      version: v1beta1
      insecureSkipTLSVerify: true
      groupPriorityMinimum: 100
      versionPriority: 100
    ---
    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      name: v1beta2.custom.metrics.k8s.io
    spec:
      service:
        name: guance-metrics-apiserver
        namespace: guance-metrics
      group: custom.metrics.k8s.io
      version: v1beta2
      insecureSkipTLSVerify: true
      groupPriorityMinimum: 100
      versionPriority: 200
    ---
    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      name: v1beta1.external.metrics.k8s.io
    spec:
      service:
        name: guance-metrics-apiserver
        namespace: guance-metrics
      group: external.metrics.k8s.io
      version: v1beta1
      insecureSkipTLSVerify: true
      groupPriorityMinimum: 100
      versionPriority: 100
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: guance-metrics-server-resources
    rules:
      - apiGroups:
          - custom.metrics.k8s.io
        resources: ['*']
        verbs: ['*']
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: guance-metrics-services-proxy
    rules:
      - apiGroups:
          - ''
        resources:
          - services/proxy
        verbs: ['*']
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: guance-metrics-resource-reader
    rules:
      - apiGroups:
          - ''
        resources:
          - namespaces
          - pods
          - services
        verbs:
          - get
          - list
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: hpa-controller-guance-metrics
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: guance-metrics-server-resources
    subjects:
      - kind: ServiceAccount
        name: horizontal-pod-autoscaler
        namespace: kube-system
    

    After the deployment succeeds, run the following command on a Kubernetes control-plane node to verify:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
    

    The expected response looks like this:
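    The response is an APIResourceList from the custom metrics API group. The exact resources listed depend on which metrics have already been written; before any are written, its general shape (illustrative, not verbatim output) is:

```json
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": []
}
```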

    3. Collect business metrics with Guance

    Before the experiment itself, we need Guance to collect the httpserver service's metrics (see the Guance integration documentation: https://docs.guance.com/integrations/kubernetes-prom/). In practice this can be any custom business metric.

    In this example the application records HTTP requests in the http_requests_total metric. On the DataFlux Func platform we can write a collection-and-transformation script: it queries the custom metric from Guance with the DataFlux Query Language (DQL), converts the result into the standard Kubernetes metrics format, and sends the processed data to guance-metrics-apiserver.

    Since http_requests_total in the metric set is a cumulative value, it cannot be used directly. The following statement derives each business Pod's QPS; the computed metric is named http_requests_qps.

    The DQL query is:

    rate("M::httpserver:(avg(http_requests_total)) [2m::] BY pod_name")
    
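    Conceptually, rate() turns the cumulative counter into a per-second rate: the increase over the window divided by the window length. A sketch of the arithmetic in plain Python (not the DQL implementation itself):

```python
# rate() over a counter: increase during the window / window length.
def qps(prev_total, curr_total, interval_seconds):
    """Per-second request rate between two cumulative counter samples."""
    return (curr_total - prev_total) / interval_seconds

# e.g. the counter grew from 1200 to 4800 over a 2-minute (120 s) window:
print(qps(1200, 4800, 120))  # 30.0
```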

    Write the function script on the Func platform; the complete example script is below:

    import json
    import requests
    from guance_toolkit__guance import OpenWay
    
    # Query Guance data with DQL
    def query_dql_data(dql):
        openway = OpenWay('tkn_xxxxxxxxxxxxx')  # replace with a real Guance workspace token
        try:
            dql_res = openway.dql_query(dql)
            return dql_res
        except Exception as e:
            print(f"DQL query failed: {e}")
            return None
    
    # Flatten the query result set
    def process_query_result(dql_res):
        processed_series = []
        for series in dql_res['content'][0]['series']:
            time_value = series['values'][0][0]
            rate_value = series['values'][0][1]
            pod_name = series['tags']['pod_name']
            processed_series.append({
                'time': time_value,
                'pod_name': pod_name,
                'http_requests_qps': rate_value
            })
        return processed_series
    
    # Fetch and post-process the computed metric data
    def get_processed_content():
        # Average QPS of the httpserver endpoints over the last 2 minutes
        dql_query = '''rate("M::httpserver:(avg(http_requests_total)) [2m::] BY pod_name")'''
        dql_res = query_dql_data(dql_query)
        if dql_res is not None:
            processed_content = process_query_result(dql_res)
        else:
            print("Query failed, nothing to process.")
            processed_content = []
        return processed_content
    
    # Send the metric data to the Guance Metrics APIServer
    def post_guance_metrics_apiserver(api_server, token):
        headers = {
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json'
        }
    
        # In-cluster proxy endpoint for writing metrics
        remote_proxy_api_url = f'{api_server}/api/v1/namespaces/guance-metrics/services/guance-metrics-apiserver:http/proxy/write-metrics'
    
        ### Endpoint format:
        # /api/v1/namespaces/<CUSTOMAPI_NAMESPACE>/services/<NAME_OF_CUSTOM_METRICS_SERVICE>:http/proxy/write-metrics/namespaces/<APP_NAMESPACE>/pods/<APP_POD_NAME>/<METRIC_NAME>
    
        base_url = f'{remote_proxy_api_url}/namespaces/default/pods/'
        data_list = get_processed_content()
    
        for data in data_list:
            pod_name = data['pod_name']
            http_requests_qps_value = data['http_requests_qps']
            url = f"{base_url}{pod_name}/http_requests_qps"
            data_raw = f'{http_requests_qps_value:.2f}'
    
            try:
                response = requests.post(url, headers=headers, data=json.dumps(data_raw), verify=False)
                response.raise_for_status()
                print(f"Metric written: {url} {data_raw}")
            except requests.exceptions.RequestException as e:
                print(f"Write failed: {url} {e}")
    
    @DFF.API('Average QPS of httpserver over the last 2 minutes')
    def main():
        api_server = 'https://xxx.xxx.xxx.xxx:6443'  # replace with the cluster's Kubernetes API server address
        token = DFF.ENV('ack_bearer_token')  # K8s admin bearer token; see the command below
        post_guance_metrics_apiserver(api_server, token)
    

    Note: replace the placeholder parameters in the script with your own values. The K8s admin bearer token can be obtained with the command below (the Secret resource is generated automatically once step 2 is complete); store the returned token with Func's built-in environment-variable feature:

    kubectl get secret -n guance-metrics guance-metrics-apiserver -o json | jq -r .data.token | base64 -d
    

    After finishing the script on the Func platform, click the Publish button at the top right of the editor. Then open the management console and add a scheduled task that runs every 2 minutes.

    Once the scheduled task is created, wait for it to trigger automatically.

    After the task has run and the metric data has been written, verify from a terminal:

    $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_qps" | jq .
    

    4. Test and verify the HPA

    4.1 Create the HPA rule

    Manually create an HPA rule in the cluster: a scale-out is triggered when the average QPS per business Pod reaches 30, with a minimum of 1 replica and a maximum of 10. Example configuration:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: httpserver-hpa
      namespace: default
    spec:
      minReplicas: 1
      maxReplicas: 10
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: httpserver
      metrics:
        - pods:
            metric:
              name: http_requests_qps
            target:
              averageValue: 30
              type: AverageValue
          type: Pods
    
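    With an AverageValue target, the HPA controller computes the desired replica count as ceil(currentReplicas × currentAverage / target), clamped to minReplicas/maxReplicas. A sketch of the arithmetic for this example's configuration:

```python
import math

# HPA scaling formula for an AverageValue target, clamped to the replica
# bounds from the manifest above (minReplicas=1, maxReplicas=10, target=30).
def desired_replicas(current, metric_avg, target=30, min_r=1, max_r=10):
    raw = math.ceil(current * metric_avg / target)
    return max(min_r, min(max_r, raw))

print(desired_replicas(1, 500))   # 10 (ceil(16.7) = 17, capped at maxReplicas)
print(desired_replicas(10, 15))   # 5  (scale-in toward ceil(5))
```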

    Check the HPA status; the TARGETS column shows the metric is being read correctly:

    4.2 Verify scale-out

    Simulate load with the following commands; the target IP is the httpserver Service's ClusterIP:

    $ kubectl get svc httpserver
    NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
    httpserver   ClusterIP   10.106.27.174   <none>        80/TCP    96m
    $ ab -n 10000 -c 500 http://10.106.27.174/metrics
    

    Under load, check the HPA status again. In the TARGETS column, the left-hand value shows the current metric at 500 and the right-hand 30 is the scale-out threshold, so the trigger condition is met:

    Watch the output: REPLICAS grows until it equals MAXPODS, which confirms that the HPA scales elastically on the custom business metric.

    The HPA's events show that the scale-out succeeded:

    4.3 Verify scale-in

    Stop the load test and wait roughly 10 minutes; checking the HPA events again will then show the scale-in in progress:

    Note: because of the HPA's built-in stabilization mechanism, after a successful scale-out it takes about 10 minutes before the automatic scale-in occurs. See the official documentation: https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale/#stabilization-window
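    The downscale stabilization window defaults to 300 seconds; combined with metric collection lag, the observed delay can be longer. It can be tuned per HPA via spec.behavior in the autoscaling/v2 API (a sketch):

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # default; raise or lower as needed
```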

    With this, we have implemented automatic scale-out and scale-in based on a custom metric.

    Summary

    In real production environments, relying solely on CPU and memory when defining autoscaling triggers is far from enough. For most web service backends, real-time request volume (QPS) is a critical metric. We can even use business metrics such as order counts, job-queue length, service response time, or API error counts as scaling criteria. Weighing multiple dimensions like these lets us respond to traffic peaks more precisely, handle sudden incidents better, and keep services stable.
