Prometheus Integration
KubeBuddy can enrich its cluster health reports by querying Prometheus directly, whether running in-cluster or as an external endpoint.
Prometheus integration has two distinct outputs:
- Prometheus-backed checks such as `PROM001` through `PROM008`
- an optional 24-hour snapshot used for report metrics cards, charts, and JSON `metrics`
These two paths are related but independent. It is possible for Prometheus-backed checks to run successfully while the 24-hour snapshot remains unavailable.
Why Integrate Prometheus?
By pulling time-series data you can detect:
- API server latency (p99)
- Node/pod CPU & memory usage
- Pod restart patterns
- Disk, network and capacity pressure
- Node sizing opportunities (underutilized vs saturated nodes using p95 trends)
- Pod/container sizing opportunities (p95-based request and memory limit recommendations)
Snapshot Behavior
When --include-prometheus is enabled, KubeBuddy attempts to collect a separate 24-hour Prometheus snapshot for:
- cluster CPU and memory trend cards in the HTML report
- per-node CPU, memory, and disk trend cards in the HTML report
- the top-level `metrics` object in JSON output
This snapshot requires usable node-level metric series. If KubeBuddy can run Prometheus-backed checks but cannot build the snapshot, the report will still be generated:
- Prometheus checks such as `PROM006` and `PROM007` continue to run
- JSON `metrics` remains `null`
- JSON metadata includes `prometheusSnapshotStatus` and `prometheusSnapshotReason`
- console output includes a clear snapshot-unavailable message
Common reasons the snapshot is unavailable:
- the cluster is new and GMP/Prometheus has not populated node metrics yet
- the Prometheus workspace does not expose the required node-level metrics
- provider-specific metric families are present only partially
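The two outcomes described above can be told apart by inspecting the JSON report. The following is a minimal sketch, not KubeBuddy code: it assumes the snapshot fields sit under a top-level `metadata` object (the exact location is an assumption), and the function name `snapshot_summary` is hypothetical.

```python
from typing import Any, Dict


def snapshot_summary(report: Dict[str, Any]) -> str:
    """Describe whether the 24-hour snapshot made it into a JSON report."""
    # A non-null top-level "metrics" object means the snapshot was built.
    if report.get("metrics") is not None:
        return "snapshot available"
    # Assumption: the status fields live under a top-level "metadata" object.
    metadata = report.get("metadata") or {}
    status = metadata.get("prometheusSnapshotStatus", "unknown")
    reason = metadata.get("prometheusSnapshotReason", "no reason recorded")
    return f"snapshot unavailable (status={status}): {reason}"
```

To use it, load the report with `json.load` and pass the resulting dictionary to `snapshot_summary`. A `null` value for `metrics` does not mean the Prometheus-backed checks failed; their findings are reported separately.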
Supported Prometheus Modes
| Mode | Description | Auth Required | Typical Use Case |
|---|---|---|---|
| `local` | In-cluster Prometheus (e.g. kube-prometheus-stack) | No | No auth needed inside the cluster |
| `basic` | External Prometheus with HTTP Basic auth | Yes | Behind an ingress or firewall |
| `bearer` | External Prometheus secured by bearer token | Yes | OAuth proxy, API gateway, etc. |
| `azure` | Azure Monitor Managed Prometheus (AKS + Monitor) | Yes (AAD token) | AKS + Azure Monitor workspace |
| `gcp` | Google Managed Service for Prometheus | Yes (ADC token) | GKE + Cloud Monitoring |
How to Authenticate
For the native Go CLI, the auth model is:
- `local`: no extra auth inputs
- `azure`: uses existing Azure auth from the current shell or environment
- `gcp`: uses Google Application Default Credentials from the current shell or environment
- `bearer`: uses `--prometheus-bearer-token-env` to read a bearer token from an environment variable
- `basic`: reads `PROMETHEUS_USERNAME` and `PROMETHEUS_PASSWORD` from the environment
The PowerShell wrapper maps onto the same runtime, but can also help populate those environment variables for you.
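To make the `basic` and `bearer` modes concrete, here is a rough sketch of how such environment variables typically become an HTTP `Authorization` header. It illustrates standard HTTP auth only and is not KubeBuddy's implementation; the function name `auth_header` and the default variable name `PROMETHEUS_TOKEN` are assumptions for illustration.

```python
import base64
from typing import Mapping, Optional


def auth_header(mode: str, env: Mapping[str, str],
                bearer_env: str = "PROMETHEUS_TOKEN") -> Optional[str]:
    """Return the Authorization header value for a mode, or None for local."""
    if mode == "local":
        # In-cluster Prometheus: no credentials are sent.
        return None
    if mode == "basic":
        # HTTP Basic auth: base64 of "username:password".
        pair = f"{env['PROMETHEUS_USERNAME']}:{env['PROMETHEUS_PASSWORD']}"
        return "Basic " + base64.b64encode(pair.encode("utf-8")).decode("ascii")
    if mode == "bearer":
        # The token is read from the variable named by bearer_env.
        return f"Bearer {env[bearer_env]}"
    if mode in ("azure", "gcp"):
        # These modes obtain tokens from the cloud credential chain,
        # which is outside the scope of this sketch.
        raise NotImplementedError(f"{mode} tokens are not sketched here")
    raise ValueError(f"unknown Prometheus mode: {mode}")
```

Passing `os.environ` as `env` mirrors how the CLI reads its inputs. Keeping secrets in environment variables rather than flags keeps them out of shell history and process listings.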
Native CLI Examples
Local (no auth)
kubebuddy run \
--html-report \
--include-prometheus \
--prometheus-url "http://prometheus.monitoring.svc:9090" \
--prometheus-mode local \
--yes
Bearer Token
export PROMETHEUS_TOKEN="<your-token>"
kubebuddy run \
--include-prometheus \
--prometheus-url "https://prom.example.com" \
--prometheus-mode bearer \
--prometheus-bearer-token-env PROMETHEUS_TOKEN \
--yes
Azure Monitor (AAD)
Use the current Azure identity in your environment. Local shells often use `az login`; containers and CI usually use `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, and `AZURE_TENANT_ID`.
kubebuddy run \
--include-prometheus \
--prometheus-url "https://<workspace>.prometheus.monitor.azure.com" \
--prometheus-mode azure \
--yes
Google Managed Service for Prometheus (ADC)
Use Application Default Credentials from your current environment. Local shells often use `gcloud auth application-default login`; GKE, Cloud Run, and CI typically use attached service accounts or `GOOGLE_APPLICATION_CREDENTIALS`.
kubebuddy run \
--include-prometheus \
--prometheus-url "https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus" \
--prometheus-mode gcp \
--yes
Notes for Google Managed Service for Prometheus:
- KubeBuddy uses the GMP Prometheus-compatible API at `https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus`
- some Google-managed metric families require explicit label matchers such as `monitored_resource="k8s_node"`
- new GKE clusters may expose Prometheus-backed checks before enough node-level series exist for the report snapshot
Basic Auth
export PROMETHEUS_USERNAME="admin"
export PROMETHEUS_PASSWORD="s3cr3t"
kubebuddy run \
--include-prometheus \
--prometheus-url "https://prom.example.com" \
--prometheus-mode basic \
--yes
PowerShell Wrapper Examples
Local (no auth)
Invoke-KubeBuddy `
-HtmlReport `
-IncludePrometheus `
-PrometheusUrl "http://prometheus.monitoring.svc:9090" `
-PrometheusMode local
Basic Auth
$env:PROMETHEUS_USERNAME = "admin"
$env:PROMETHEUS_PASSWORD = "s3cr3t"
Invoke-KubeBuddy `
-IncludePrometheus `
-PrometheusUrl "https://prom.example.com" `
-PrometheusMode basic
Bearer Token
$env:PROMETHEUS_TOKEN = "<your-token>"
Invoke-KubeBuddy `
-IncludePrometheus `
-PrometheusUrl "https://prom.example.com" `
-PrometheusMode bearer `
-PrometheusBearerTokenEnv PROMETHEUS_TOKEN
Azure Monitor (AAD)
# Ensure AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID are set
Invoke-KubeBuddy `
-IncludePrometheus `
-PrometheusUrl "https://<workspace>.prometheus.monitor.azure.com" `
-PrometheusMode azure
Google Managed Service for Prometheus (ADC)
# Ensure Application Default Credentials are available
Invoke-KubeBuddy `
-IncludePrometheus `
-PrometheusUrl "https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus" `
-PrometheusMode gcp
Example Query
p99 API-server latency over the last hour:
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))
For GMP, refresh your access token before manual testing:
TOKEN="$(gcloud auth print-access-token)"
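For manual testing, the example query can be sent to the standard Prometheus HTTP API endpoint `/api/v1/query`. The sketch below uses only the Python standard library; the helper names `build_query_url` and `run_query` are hypothetical, and the optional bearer token is an assumption about how your endpoint is secured.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def run_query(base_url: str, promql: str,
              token: Optional[str] = None) -> Dict[str, Any]:
    """Send the query and return the decoded JSON response."""
    request = urllib.request.Request(build_query_url(base_url, promql))
    if token:
        # Bearer token, e.g. the TOKEN value obtained for GMP above.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `run_query("http://prometheus.monitoring.svc:9090", "up")` queries an in-cluster Prometheus without credentials. For GMP, pass the project URL ending in `/prometheus` as `base_url` together with the access token.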
Time-Window Configuration
Rather than being fixed, the look-back window is now driven by your YAML's `Range.Duration`. You can specify minutes (`m`), hours (`h`), or days (`d`):
Prometheus:
  Query: 'sum(rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m])) by (pod)'
  Range:
    Step: "5m"
    Duration: "24h"   # supports "m"=minutes, "h"=hours, "d"=days
KubeBuddy will translate that into start = now - 24h (or 30m, or 2d, etc.) automatically.
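The translation from a duration string to a query window can be sketched as follows. This is an illustration of the `start = now - duration` rule, not KubeBuddy's actual parser; the function names are hypothetical, and rejecting any unit other than `m`, `h`, or `d` is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Map the documented unit suffixes to timedelta keyword arguments.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "30m", "24h", or "2d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def query_window(duration: str,
                 now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (start, end), where start = end - duration."""
    end = now or datetime.now(timezone.utc)
    return end - parse_duration(duration), end
```

Calling `query_window("24h")` yields a window that ends at the current UTC time and starts 24 hours earlier.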
CLI Usage
Use any combination of report outputs:
# HTML report with Prometheus
$promCred = Get-Credential
Invoke-KubeBuddy `
-HtmlReport `
-IncludePrometheus `
-PrometheusUrl "https://prometheus.example.com" `
-PrometheusMode basic `
-PrometheusCredential $promCred `
-OutputPath "C:\reports\cluster.html"
# Text report with Prometheus
Invoke-KubeBuddy `
-txtReport `
-IncludePrometheus `
-PrometheusUrl "http://prometheus.monitoring.svc:9090" `
-PrometheusMode local `
-OutputPath "/home/user/kube.txt"
# JSON report, Azure Monitor mode
Invoke-KubeBuddy `
-jsonReport `
-IncludePrometheus `
-PrometheusUrl "https://<workspace>.prometheus.monitor.azure.com" `
-PrometheusMode azure `
-OutputPath "/reports/cluster.json"
Node Sizing Insights
When Prometheus integration is enabled, KubeBuddy runs PROM006 and classifies each node using fixed 7-day p95 CPU/memory usage:
- `Underutilized`: candidate for smaller SKU or scale-in
- `Right-sized`: keep current sizing
- `Saturated`: candidate for larger SKU or scale-out
PROM006 now also includes:
- Current Allocatable (vCPU/Gi) from node allocatable capacity
- Suggested Target Capacity (vCPU/Gi) estimated from p95 utilization with safety headroom
Minimum data rule:
- KubeBuddy requires at least 7 days of Prometheus history before emitting node sizing recommendations.
- If history is below 7 days, reports include an explicit Insufficient Prometheus history row instead of recommendations.
This 7-day rule affects sizing recommendations only. It is separate from the 24-hour report snapshot described above.
The check surfaces in the Nodes tab and in JSON/text output like any other check.
In HTML reports, Overview now includes a Rightsizing at a Glance section that summarizes:
- Node sizing distribution (Underutilized / Saturated / Right-sized)
- Pod sizing action counts from PROM007
- Impact buckets and quick links to PROM006 and PROM007
Optional Threshold Overrides
You can tune the classification in ~/.kube/kubebuddy-config.yaml:
thresholds:
  node_sizing_downsize_cpu_p95: 35
  node_sizing_downsize_mem_p95: 40
  node_sizing_upsize_cpu_p95: 80
  node_sizing_upsize_mem_p95: 85
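As a rough illustration of how thresholds like these could drive the three-way classification, consider the sketch below. The combination rule is an assumption, not documented KubeBuddy behavior: a node counts as saturated when either metric reaches its upsize threshold and as underutilized only when both metrics fall below their downsize thresholds. The class and function names are hypothetical, and the defaults simply reuse the example values above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSizingThresholds:
    """p95 utilization thresholds in percent (example values from above)."""
    downsize_cpu_p95: float = 35
    downsize_mem_p95: float = 40
    upsize_cpu_p95: float = 80
    upsize_mem_p95: float = 85


def classify_node(cpu_p95: float, mem_p95: float,
                  t: NodeSizingThresholds = NodeSizingThresholds()) -> str:
    """Classify a node from its 7-day p95 CPU and memory utilization (%)."""
    # Assumed rule: pressure on either resource marks the node saturated.
    if cpu_p95 >= t.upsize_cpu_p95 or mem_p95 >= t.upsize_mem_p95:
        return "Saturated"
    # Assumed rule: both resources must be idle to suggest downsizing.
    if cpu_p95 < t.downsize_cpu_p95 and mem_p95 < t.downsize_mem_p95:
        return "Underutilized"
    return "Right-sized"
```

Under these assumptions, a node at 20% CPU and 60% memory stays `Right-sized`, because only one of the two resources is idle.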
Pod Sizing Insights
When Prometheus integration is enabled, KubeBuddy also runs PROM007 for per-container recommendations using fixed 7-day p95 usage:
- CPU request recommendation (millicores)
- Memory request recommendation (MiB)
- Memory limit recommendation (MiB)
- CPU limit recommendation defaults to `none`
Minimum data rule:
- KubeBuddy requires at least 7 days of Prometheus history before emitting pod sizing recommendations.
- If history is below 7 days, reports include an explicit Insufficient Prometheus history row instead of recommendations.
This 7-day rule does not by itself explain JSON `metrics: null`; that value indicates the separate snapshot collector could not build usable node-level metrics.
Why CPU limit defaults to none
By default, KubeBuddy recommends no CPU limit because:
- CPU is compressible; requests already control fair scheduling.
- Hard CPU limits can trigger CFS throttling and add latency jitter.
- In many production workloads, setting requests (without limits) gives better tail latency.
Set CPU limits only when strict tenant caps are required.
Optional Pod Sizing Threshold Overrides
thresholds:
  pod_sizing_profile: balanced              # conservative|balanced|aggressive
  pod_sizing_compare_profiles: true         # HTML/JSON include all 3 profiles by default
  pod_sizing_target_cpu_utilization: 65
  pod_sizing_target_mem_utilization: 75
  pod_sizing_cpu_request_floor_mcores: 25
  pod_sizing_mem_request_floor_mib: 128
  pod_sizing_mem_limit_buffer_percent: 20
Profile behavior:
- conservative: higher requests/floors (more headroom)
- balanced: default behavior (CPU floor: 25m)
- aggressive: lower requests/floors (higher packing efficiency, CPU floor: 10m)
Comparison mode:
- pod_sizing_compare_profiles is enabled by default to emit all three profile results in JSON and HTML.
- Set pod_sizing_compare_profiles: false if you want only the active profile.
- HTML report includes a profile selector on PROM007 findings so you can switch between profiles.
- Text/CLI remain focused on the single active profile.
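One plausible reading of how the pod sizing thresholds above combine is sketched here. The arithmetic is an assumption made for illustration, not KubeBuddy's documented formula: each request is the p95 usage divided by the target utilization, raised to the configured floor, and the memory limit adds the buffer percentage on top of the memory request. All names in the code are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PodSizingThresholds:
    """Values mirror the example override file above."""
    target_cpu_utilization: float = 65
    target_mem_utilization: float = 75
    cpu_request_floor_mcores: int = 25
    mem_request_floor_mib: int = 128
    mem_limit_buffer_percent: float = 20


def recommend(cpu_p95_mcores: float, mem_p95_mib: float,
              t: PodSizingThresholds = PodSizingThresholds()
              ) -> Dict[str, Optional[int]]:
    """Suggest requests and a memory limit from 7-day p95 usage."""
    # Assumed formula: size the request so p95 usage sits at the target.
    cpu_request = max(t.cpu_request_floor_mcores,
                      math.ceil(cpu_p95_mcores * 100 / t.target_cpu_utilization))
    mem_request = max(t.mem_request_floor_mib,
                      math.ceil(mem_p95_mib * 100 / t.target_mem_utilization))
    # Assumed formula: the memory limit is the request plus the buffer.
    mem_limit = math.ceil(mem_request * (100 + t.mem_limit_buffer_percent) / 100)
    return {
        "cpu_request_mcores": cpu_request,
        "mem_request_mib": mem_request,
        "mem_limit_mib": mem_limit,
        "cpu_limit": None,  # matches the documented default of no CPU limit
    }
```

Under these assumptions, a container with a p95 of 130 millicores and 300 MiB would get a 200m CPU request, a 400 MiB memory request, and a 480 MiB memory limit. A nearly idle container would fall back to the 25m and 128 MiB floors.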
Docker Usage with Prometheus
For full Docker details, see the Docker Usage guide. Here's a minimal Prometheus-enabled example:
export tagId="v0.0.19"
docker run -it --rm \
-e KUBECONFIG="/home/kubeuser/.kube/config" \
-e HTML_REPORT="true" \
-e INCLUDE_PROMETHEUS="true" \
-e PROMETHEUS_URL="https://prom.example.com" \
-e PROMETHEUS_MODE="basic" \
-e PROMETHEUS_USERNAME="admin" \
-e PROMETHEUS_PASSWORD="s3cr3t" \
-v $HOME/.kube/config:/tmp/kubeconfig-original:ro \
-v $HOME/kubebuddy-report:/app/Reports \
ghcr.io/kubedeckio/kubebuddy:$tagId