How to Monitor a Kubernetes Cluster with Prometheus and Grafana

August 16, 2020 · InfoQ

Author | Kubernetes Advocate

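Prometheus is a free, open-source toolkit for event monitoring and alerting. It records real-time metrics in a time-series database, is built on an HTTP pull model, and supports flexible queries and real-time alerting. We can use Prometheus to monitor an entire Kubernetes cluster.

The Prometheus stack includes: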

Prometheus
Alertmanager
kube-state-metrics
node-exporter
Grafana

The stack can also include the following dashboards and alerts.

Dashboards

Capacity planning
Cluster health
Deployments
k8s cluster rsrc use
k8s node rsrc use
k8s resources cluster
k8s resources namespace
k8s resources pod
kube DNS
kubelet
Nodes
Pods
StatefulSet
Kubernetes all-nodes
Kubernetes cluster-all
Kubernetes pods-cluster
Kubernetes resources-requests

Alerts

Component Down (API Server, Kubelet, Node exporter, Alertmanager, Prometheus, and so on)
Pod alerts (CrashLoopBackOff, Pending, not ready)
Workload controller alerts (Replicas Mismatch, DaemonSet NotScheduled, DaemonSet MisScheduled, Job Failed, and Long-running Jobs)
Resources alerts (CPU overcommit, Memory overcommit, Quota exceeded)
Persistent Volume alerts
Kube API error and Client alerts
Prometheus configuration error alerts

Installation

Step 1: Clone the Prometheus-grafana repository from GitHub:

git clone <URL to GIT REPO>

Step 2: Combine the manifests into a single manifest file:

cd Prometheus-grafana
awk 'FNR==1 {print "---"}{print}' manifests/* > "prometheus_grafana_manifest.yaml"
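To see what the awk one-liner does, the sketch below runs it against two stand-in files in a temporary directory. The file names and contents are made up for illustration and are not taken from the repository.

```shell
# Build a throwaway manifests/ directory with two tiny stand-in files
# (hypothetical names and contents, for illustration only).
workdir="$(mktemp -d)"
mkdir "$workdir/manifests"
printf 'kind: Namespace\n' > "$workdir/manifests/0-namespace.yaml"
printf 'kind: Secret\n'    > "$workdir/manifests/1-secret.yaml"

# FNR==1 is true on the first line of each input file, so awk prints a
# YAML document separator before the contents of every file.
awk 'FNR==1 {print "---"} {print}' "$workdir"/manifests/* > "$workdir/combined.yaml"

# Show the combined result, then clean up the temporary directory.
combined="$(cat "$workdir/combined.yaml")"
printf '%s\n' "$combined"
rm -rf "$workdir"
```

Each input file ends up as its own YAML document, which is what lets a single kubectl apply -f handle all of them.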

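Step 3: Install the Prometheus-Grafana stack:

kubectl apply -f prometheus_grafana_manifest.yaml

Step 4: Create an ingress for Grafana:

If the cluster has an ingress controller, update the domain and ingress class in grafana-ingress.yaml, then create the ingress resource.

kubectl apply -f grafana-ingress.yaml

If there is no ingress controller, Grafana can still be reached through a LoadBalancer Service, a NodePort Service, or a kubectl proxy.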
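For quick local access without an ingress controller, one option is kubectl port-forward. The sketch below wraps it in a small helper function. The Service name grafana and port 3000 (Grafana's default HTTP port) are assumptions about this stack, so confirm them with kubectl get svc --namespace prometheus before relying on them.

```shell
# Forward a local port to the Grafana Service.
# Defaults are assumptions: namespace "prometheus", Service "grafana",
# local and remote port 3000. Override them via positional arguments.
grafana_port_forward() {
  pf_namespace="${1:-prometheus}"
  pf_service="${2:-grafana}"
  pf_local_port="${3:-3000}"
  pf_remote_port="${4:-3000}"
  kubectl port-forward --namespace "$pf_namespace" \
    "service/$pf_service" "$pf_local_port:$pf_remote_port"
}
```

Running grafana_port_forward and opening http://localhost:3000 should then show the Grafana login page, provided the assumed names match the cluster. The command blocks until interrupted.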

Grafana Credentials

The default Grafana credentials are:

Username: Cloud
Password: Cloud

(Figure: Grafana login page)

(Figure: Grafana Nodes dashboard)

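You can set your own username and password.

Before updating the values in the credentials Secret file, the username and password must be base64-encoded. Use echo -n so that the trailing newline is not encoded as part of the value.

echo -n 'myuser' | base64
bXl1c2Vy
echo -n 'HgTf0n9L@wrd' | base64
SGdUZjBuOUxAd3Jk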
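A trailing newline is easy to encode by accident, and it then becomes part of the stored username or password. The following sketch compares the two ways of feeding a value to base64 and does a round-trip check. It uses only the sample username from this section.

```shell
# echo appends a newline, and that newline is encoded along with the value.
with_newline="$(echo 'myuser' | base64)"

# printf '%s' writes only the characters given, with no trailing newline.
without_newline="$(printf '%s' 'myuser' | base64)"

echo "echo:   $with_newline"
echo "printf: $without_newline"

# Round-trip check: decoding the clean value gives back exactly the input.
decoded="$(printf '%s' "$without_newline" | base64 --decode)"
echo "decoded: $decoded"
```

The first value ends in Cg==, which is the encoded newline. The second is the one that belongs in the Secret.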

Now update the admin-user and admin-password values in 2-grafana-credentials-secret.yaml, under the manifests directory, with the base64-encoded username and password.

apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: prometheus
  labels:
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/component: grafana
type: Opaque
data:
  admin-user: bXl1c2Vy
  admin-password: SGdUZjBuOUxAd3Jk

Run the command:

kubectl apply -f 2-grafana-credentials-secret.yaml

If Grafana is already installed and running, you must delete the existing Pod. A new Pod will then be created with the updated configuration.
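Deleting the Pod by label avoids looking up its generated name. The helper below is a sketch that assumes the Grafana Pod carries the same app.kubernetes.io/component=grafana label as the Secret above. That is not confirmed by the manifests shown here, so check with kubectl get pods --namespace prometheus --show-labels first.

```shell
# Delete all Pods in a namespace that carry a given component label.
# Assumption: the stack labels its Pods with app.kubernetes.io/component,
# as it does for the Grafana Secret. The namespace defaults to "prometheus".
restart_pods_by_component() {
  rp_component="$1"
  rp_namespace="${2:-prometheus}"
  kubectl delete pod --namespace "$rp_namespace" \
    --selector "app.kubernetes.io/component=$rp_component"
}
```

With those assumptions, restart_pods_by_component grafana removes the running Pod so that a replacement can start with the new credentials.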

Retrieving the Grafana Credentials

You can retrieve the credentials by decoding the values stored in the Secret:

echo "Username: $(kubectl get secret grafana --namespace prometheus \
  --output=jsonpath='{.data.admin-user}' | base64 --decode)"
echo "Password: $(kubectl get secret grafana --namespace prometheus \
  --output=jsonpath='{.data.admin-password}' | base64 --decode)"

Note that the Prometheus web interface can be accessed without any authentication.

(Figure: Prometheus web interface)

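Configuring Alertmanager

When installing the stack, you must provide the alert receiver details.

Otherwise, you will never be notified about changes in cluster state or resource utilization.

The configuration can be changed as needed.

Alertmanager is configured through a YAML file that defines inhibition rules, notification routing, and receivers.

Below are example configurations for Email, Slack, and Webhook receivers: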

Email:

global:
  resolve_timeout: 5m
receivers:
- name: email_config
  email_configs:
  - to: "< to_address >"
    from: "< from_address >"
    smarthost: "< smtp_host:port >"
    auth_username: "< smtp_username >"
    auth_password: "< smtp_password >"
route:
  group_by:
  - job
  receiver: email_config
  group_interval: 5m
  group_wait: 30s
  repeat_interval: 30m

Slack:

global:
  resolve_timeout: 5m
  slack_api_url: "< slack_webhook_url >"
receivers:
- name: "slack-notifications"
  slack_configs:
  - channel: "#alerts"
route:
  group_by:
  - job
  receiver: "slack-notifications"
  group_interval: 5m
  group_wait: 30s
  repeat_interval: 30m

Webhook:

global:
  resolve_timeout: 5m
receivers:
- name: webhook
  webhook_configs:
  - url: "< webhook_url >"
route:
  group_by:
  - job
  repeat_interval: 30m
  group_interval: 5m
  group_wait: 30s
  receiver: webhook
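Indentation mistakes in these files are easy to make, so it can help to validate a configuration before loading it. The sketch below uses amtool, the command-line tool that ships with Alertmanager, and its check-config subcommand. It assumes the configuration has been saved on its own as a file (the name alertmanager.yaml is hypothetical), not wrapped in a ConfigMap manifest.

```shell
# Validate a standalone Alertmanager configuration file with amtool.
# If amtool is not installed, print a notice and skip the check.
check_alertmanager_config() {
  am_config_file="$1"
  if command -v amtool >/dev/null 2>&1; then
    amtool check-config "$am_config_file"
  else
    echo "amtool not found; skipping validation of $am_config_file" >&2
    return 0
  fi
}
```

A typical call would be check_alertmanager_config alertmanager.yaml. A non-zero exit status from amtool indicates that the file should be fixed before it is applied.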

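Update the configuration as described above in 1-alertmanager-configmap.yaml, under the manifests directory, then apply it.

kubectl apply -f 1-alertmanager-configmap.yaml

After updating the ConfigMap, restart the running alertmanager pod. A new pod will be created with the updated configuration.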

Further reading:

https://medium.com/faun/how-to-monitor-kubernetes-cluster-with-prometheus-and-grafana-8ec7e060896f
