grafana / grafana-cli: operating the visualization platform

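One-line definition

grafana is the most widely used visualization platform in the K8s ecosystem; it displays data from prometheus / loki / elastic. Most day-to-day operations happen in the web UI (building dashboards / panels / alerts), but the CLI is hard to replace for installing plugins and resetting passwords, and the HTTP API for bulk-managing dashboards / datasources.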

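Typical scenarios

Install a plugin: grafana-cli plugins install <name>
Forgotten admin password: grafana-cli admin reset-admin-password
Install a dashboard: call the HTTP API with curl
Diagnose the configuration file
Upgrade / back up the configuration

Grafana itself is not a CLI tool, but command-line operations have their own role. This article focuses on CLI + API operations.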

Common Grafana deployment forms

Scenario	Tool
Running in K8s (most common)	helm install grafana grafana/grafana, or the kube-prometheus-stack bundle
Single host	apt install grafana / docker run
Cloud	Grafana Cloud (nothing to install)

In K8s, most configuration goes through K8s resources (ConfigMap / Secret / Deployment env) plus Provisioning (declarative datasources / dashboards).

Installing grafana-cli

Grafana ships grafana-cli alongside the grafana-server binary.

In a container setup, run it inside the grafana pod:

kubectl exec -it -n monitoring grafana-xxx -- grafana-cli ...

Or install it on the local machine:

apt install grafana                                  # installs both grafana-server and grafana-cli

Installing / removing plugins

# inside the grafana pod
grafana-cli plugins list-remote                       # list installable plugins
grafana-cli plugins install grafana-clock-panel       # install one
grafana-cli plugins install <plugin-id> <version>     # pin a version
grafana-cli plugins ls                                 # list installed plugins
grafana-cli plugins update <plugin-id>                # upgrade
grafana-cli plugins remove <plugin-id>                # remove

# restart grafana after installing
kubectl rollout restart deploy/grafana -n monitoring

Recommended in K8s: declare the plugins in helm values:

plugins:
  - grafana-clock-panel
  - grafana-piechart-panel

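They are installed automatically on every restart. Avoid installing with the CLI here (the plugin is lost after a restart).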

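Resetting the admin password

The classic request: "I forgot the admin password".

Method 1: grafana-cli

kubectl exec -it -n monitoring grafana-xxx -- grafana-cli admin reset-admin-password new-strong-password

Method 2: change the Secret (recommended in K8s)

For a grafana installed from the helm chart, the password is stored in a Secret: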

kubectl get secret -n monitoring grafana -o jsonpath='{.data.admin-password}' | base64 -d
# old-password

# change it
kubectl patch secret -n monitoring grafana \
  -p '{"data":{"admin-password":"'$(echo -n new-password | base64)'"}}'

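# restart grafana so the pod picks up the new Secret value
# (admin_password is applied only when the admin user is first created;
#  with a persistent database, use method 1 instead)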
kubectl rollout restart deploy/grafana -n monitoring
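
As a small sketch of the patch step above, the payload can be built by a helper so the base64 encoding is not buried in nested quotes. The function name secret_patch_json is made up for this example; it only assumes that base64, tr and printf are available.

```shell
# Hypothetical helper (not part of kubectl or Grafana): build the JSON patch
# for one Secret key. Secret values must be base64-encoded without a trailing
# newline, so printf is used instead of echo.
secret_patch_json() {
  key=$1
  value=$2
  # tr strips the line wraps that some base64 implementations add to long values
  encoded=$(printf '%s' "$value" | base64 | tr -d '\n')
  printf '{"data":{"%s":"%s"}}\n' "$key" "$encoded"
}

# Usage (assumes the Secret is named grafana, as above):
# kubectl patch secret -n monitoring grafana \
#   -p "$(secret_patch_json admin-password new-password)"
```

The helper does not escape JSON special characters in the key, which is acceptable for fixed key names such as admin-password.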

Method 3: environment variable

# helm values
adminPassword: new-strong-password
# or reference an existing K8s Secret via admin.existingSecret

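Provisioning: declarative configuration (recommended)

On startup, Grafana reads the configuration under /etc/grafana/provisioning/ and sets up datasources / dashboards / notifiers automatically. This is more reliable than clicking through the UI by hand: in K8s, a grafana pod restart does not lose the configuration.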

Datasource

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090
    isDefault: true

Dashboard

# /etc/grafana/provisioning/dashboards/all.yaml
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards

# put the dashboard JSON files in /var/lib/grafana/dashboards/*.json

In a K8s helm install, inject them through a ConfigMap:

# helm values
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: 'default'
        folder: ''
        type: file
        options:
          path: /var/lib/grafana/dashboards/default

dashboardsConfigMaps:
  default: "grafana-dashboards-default"

The ConfigMap grafana-dashboards-default contains the dashboard JSON files.

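HTTP API: bulk management with curl

Grafana exposes a full REST API; almost every UI action can also be done with curl.

Authentication

# with the admin password (Basic Auth)
curl -u admin:password http://grafana:3000/api/...

# with an API token (recommended)
# create a service account + token in the UI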
curl -H "Authorization: Bearer <token>" http://grafana:3000/api/...

List datasources

curl -u admin:pwd http://grafana:3000/api/datasources | jq

Add a datasource

curl -X POST -u admin:pwd \
  -H "Content-Type: application/json" \
  http://grafana:3000/api/datasources \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prom:9090",
    "access": "proxy",
    "isDefault": true
  }'

Import a dashboard

# look up the dashboard ID on grafana.com
DASH_ID=315                                            # K8s cluster monitoring
curl -s https://grafana.com/api/dashboards/$DASH_ID/revisions/3/download \
  | jq '{dashboard: ., inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}], overwrite: true}' \
  | curl -X POST -u admin:pwd \
      -H "Content-Type: application/json" \
      http://grafana:3000/api/dashboards/import \
      -d @-

Delete a dashboard

curl -X DELETE -u admin:pwd http://grafana:3000/api/dashboards/uid/<uid>

List all dashboards

curl -s -u admin:pwd 'http://grafana:3000/api/search?type=dash-db' | jq '.[].title'

Back up all dashboards to local disk

mkdir -p /tmp/dashboards
curl -s -u admin:pwd 'http://grafana:3000/api/search?type=dash-db' \
  | jq -r '.[].uid' \
  | while read -r uid; do
      curl -s -u admin:pwd "http://grafana:3000/api/dashboards/uid/$uid" \
        > "/tmp/dashboards/$uid.json"
    done

Essential for migration / DR.
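
Each file written by the backup loop holds the full API response, which wraps the dashboard as {"meta": ..., "dashboard": ...}. A possible restore step, sketched under the assumption that jq is installed and that POST /api/dashboards/db is used for the write-back, unwraps the dashboard and clears its numeric id so a different instance accepts it. Both function names are illustrative.

```shell
# Illustrative helper: turn a saved /api/dashboards/uid/<uid> response (stdin)
# into a payload for POST /api/dashboards/db (stdout). Requires jq.
to_restore_payload() {
  # .id is instance-specific, so set it to null; the uid is kept so links survive
  jq '{dashboard: (.dashboard | .id = null), overwrite: true}'
}

# Illustrative helper: restore every backup file in a directory.
# Assumes the same admin:pwd credentials and URL used in the backup loop.
restore_dashboards() {
  dir=${1:-/tmp/dashboards}
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue          # skip when the glob matched nothing
    to_restore_payload < "$f" \
      | curl -s -X POST -u admin:pwd \
          -H "Content-Type: application/json" \
          http://grafana:3000/api/dashboards/db \
          -d @-
  done
}

# Usage:
# restore_dashboards /tmp/dashboards
```

Folder placement is not restored by this sketch; every dashboard lands in the default folder unless the payload is extended.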

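Configuration file (`/etc/grafana/grafana.ini`)

The location of Grafana's main configuration file depends on the install method:

Install method	Path
apt	`/etc/grafana/grafana.ini`
Docker	`/etc/grafana/grafana.ini` (inside the container)
helm	injected via env / ConfigMap

Main sections (abridged):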

[server]
http_port = 3000
domain = grafana.example.com
root_url = https://grafana.example.com

[database]
type = sqlite3                                          # default on a single host; use postgres / mysql in production

[security]
admin_user = admin
admin_password = $__env{GF_SECURITY_ADMIN_PASSWORD}      # reference an environment variable

[auth.anonymous]
enabled = true                                          # allow anonymous viewing
org_role = Viewer

[smtp]
enabled = true
host = smtp.example.com:587

[unified_alerting]
enabled = true                                          # enable unified alerting (opt-in in v8, default since v9)

In K8s, do not edit the ini directly; inject env vars through helm values (any grafana setting can be overridden in the form GF_<SECTION>_<KEY>=value):

# helm values
env:
  GF_SECURITY_ADMIN_PASSWORD: "strong-password"
  GF_SMTP_ENABLED: "true"
  GF_SMTP_HOST: "smtp:587"
  GF_SERVER_DOMAIN: "grafana.example.com"
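
The naming rule can be sketched as a small function: upper-case the section and key, and replace dots and dashes with underscores, so a dotted section such as auth.anonymous maps cleanly. gf_env_name is a name invented for this example.

```shell
# Hypothetical helper: derive the GF_<SECTION>_<KEY> variable name for an
# ini section and key. Dots and dashes become underscores, letters upper-case.
gf_env_name() {
  section=$1
  key=$2
  printf 'GF_%s_%s\n' "$section" "$key" \
    | tr '[:lower:]' '[:upper:]' \
    | tr '.-' '__'
}

# Usage:
# gf_env_name auth.anonymous enabled     # prints GF_AUTH_ANONYMOUS_ENABLED
# gf_env_name smtp host                  # prints GF_SMTP_HOST
```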

Troubleshooting

Check the grafana logs

kubectl logs -n monitoring deploy/grafana -f

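Most startup failures come from:

A syntax error in the configuration file
An unreachable database (when using an external postgres / mysql)
A plugin installed at the wrong version

Check grafana health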

curl http://grafana:3000/api/health
# {"commit":"...","database":"ok","version":"..."}

database: ok means Grafana can reach its database and is working normally.
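
For scripts (readiness checks, CI), the same check can be reduced to an exit code. This sketch matches the database field with grep so it needs no JSON tooling; health_is_ok is a made-up name, and the pattern assumes the response shape shown above.

```shell
# Hypothetical helper: read an /api/health response on stdin and succeed
# only if it reports "database": "ok" (with or without spaces around the colon).
health_is_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (URL as in the example above):
# curl -s http://grafana:3000/api/health | health_is_ok && echo healthy
```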

Check whether a datasource is really reachable

The datasource edit page in the UI has "Save & Test". From the command line:

curl -X GET -u admin:pwd \
  http://grafana:3000/api/datasources/uid/<ds-uid>/health
# {"status":"OK","message":"Data source is working"}

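Common pitfalls

Pitfall 1: a CLI-installed plugin disappears after a helm upgrade

grafana-cli plugins install x
# after a pod restart / helm upgrade the plugin is gone

A CLI install is temporary (it lives in the pod filesystem). To persist a plugin in K8s, declare it in helm values: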

plugins:
  - x

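Pitfall 2: the admin password was changed but did not stick

grafana-cli admin reset-admin-password xxx
# password changed

# but helm controls the deployment; after a restart the old password is back (from env var GF_SECURITY_ADMIN_PASSWORD)

For settings in general, an env var overrides the config ini. The admin password is a special case: admin_password (from env or ini) is applied only when the admin user is first created, and afterwards the password lives in the database. If the pod has no persistent storage, the database is re-created on every restart and the env value is applied again, which undoes the CLI reset.

Fix: change the helm Secret or values: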

adminPassword: new-strong-password

helm upgrade grafana grafana/grafana -n monitoring -f values.yaml

Pitfall 3: a provisioned datasource does not show up

# /etc/grafana/provisioning/datasources/prom.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    ...

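The file lives in the pod, and grafana reads it only once at startup. Restart after changing it:

kubectl rollout restart deploy/grafana -n monitoring

In K8s this configuration is usually injected through a ConfigMap; after the ConfigMap changes, a pod restart loads the new content.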

Pitfall 4: an API curl returns 401

curl http://grafana:3000/api/datasources
# {"message":"Unauthorized"}

Add authentication: -u admin:pwd or a Bearer token.

Note that service account tokens are normally created in the UI (Administration → Service accounts).

Pitfall 5: wrong datasource UID when importing a dashboard

curl ... /api/dashboards/import -d '{"dashboard": {...}}'
# the import succeeds but every panel shows "Datasource not found"

The datasource UID in the dashboard JSON belongs to the source grafana; the target grafana has no datasource with that UID.

Fix: use the inputs field so grafana maps the datasource at import time:

{
  "dashboard": { ... },
  "inputs": [
    {
      "name": "DS_PROMETHEUS",
      "type": "datasource",
      "pluginId": "prometheus",
      "value": "<target datasource UID or name>"
    }
  ],
  "overwrite": true
}

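Pitfall 6: sqlite with multiple replicas

replicas: 3                                            # aiming for HA
database.type: sqlite3                                  # the default

sqlite is a single file → each pod writes to its own sqlite file → the data diverges.

Fix: in production, switch to postgres / mysql and share the dashboard folder through a PVC (or keep it in git):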

database:
  type: postgres
  host: postgres.example.com:5432
  user: grafana
  password: ...
  name: grafana

Pitfall 7: anonymous auth exposes internal data

[auth.anonymous]
enabled = true

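Users who are not logged in can view dashboards. That is fine for internal monitoring; be careful when the instance is exposed to the public internet, because sensitive metric names / labels would leak.

OAuth / OIDC is recommended for production: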

[auth.google]
enabled = true
client_id = ...
client_secret = ...
allowed_domains = example.com

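Pitfall 8: unified alerting coexisting with legacy alerting (v9+)

Unified Alerting arrived in Grafana 8 as an opt-in feature and became the default in Grafana 9, while legacy alerting and its Notification channels were still present.

Alerts defined on old dashboards may not carry over the way you expect, so review them. To see which admin subcommands your version offers:

grafana-cli admin help                               # list the admin subcommands (migration ones included)

Moving to unified alerting is an engineering project; plan it before doing it.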

Related commands

prometheus: grafana's main datasource
alertmanager: grafana unified alerting can also push to Alertmanager
curl: essential for the grafana HTTP API
jq: processes grafana API JSON
helm: helm install grafana grafana/grafana
kubectl: manages pods / secrets / configmaps