📊 Prometheus Integration¶

KubeBuddy can enrich its cluster health reports by querying Prometheus directly, whether Prometheus runs in-cluster or is exposed as an external endpoint.

Prometheus integration has two distinct outputs:

Prometheus-backed checks such as PROM001 through PROM008
an optional 24-hour snapshot used for report metrics cards, charts, and JSON metrics

These two paths are related but independent. It is possible for Prometheus-backed checks to run successfully while the 24-hour snapshot remains unavailable.

🔍 Why Integrate Prometheus?¶

By pulling time-series data you can detect:

API server latency (p99)
Node/pod CPU & memory usage
Pod restart patterns
Disk, network and capacity pressure
Node sizing opportunities (underutilized vs saturated nodes using p95 trends)
Pod/container sizing opportunities (p95-based request and memory limit recommendations)

Snapshot Behavior¶

When --include-prometheus is enabled, KubeBuddy attempts to collect a separate 24-hour Prometheus snapshot for:

cluster CPU and memory trend cards in the HTML report
per-node CPU, memory, and disk trend cards in the HTML report
the top-level metrics object in JSON output

This snapshot requires usable node-level metric series. If KubeBuddy can run Prometheus-backed checks but cannot build the snapshot, the report will still be generated:

Prometheus checks such as PROM006 and PROM007 continue to run
JSON metrics remains null
JSON metadata includes prometheusSnapshotStatus and prometheusSnapshotReason
console output includes a clear snapshot-unavailable message

Common reasons the snapshot is unavailable:

the cluster is new and GMP/Prometheus has not populated node metrics yet
the Prometheus workspace does not expose the required node-level metrics
provider-specific metric families are present only partially
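A report consumer can detect the snapshot-unavailable case from the JSON output. The following is a minimal sketch, assuming jq is installed and the report was written to a file. The source only says the status fields are in the "JSON metadata", so the `.metadata.*` paths and the `snapshot_summary` helper name are assumptions to adjust against a real report.

```shell
#!/usr/bin/env bash
# Summarize whether a KubeBuddy JSON report contains the 24-hour snapshot.
# Assumption: the status fields live at .metadata.prometheusSnapshotStatus and
# .metadata.prometheusSnapshotReason; the real layout may differ.
snapshot_summary() {
  local report="$1"
  jq -r '
    if .metrics == null then
      "snapshot unavailable: status=\(.metadata.prometheusSnapshotStatus // "unknown") reason=\(.metadata.prometheusSnapshotReason // "unknown")"
    else
      "snapshot present"
    end
  ' "$report"
}

# Example (hypothetical report path):
#   snapshot_summary ./cluster.json
```

A null `metrics` value on its own does not mean the Prometheus-backed checks failed, so a pipeline would normally treat this as a warning rather than an error.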

✅ Supported Prometheus Modes¶

Mode	Description	Auth Required	Typical Use Case
`local`	In-cluster Prometheus (e.g. kube-prometheus-stack)	❌	No auth needed inside the cluster
`basic`	External Prometheus with HTTP Basic auth	✅	Behind an ingress or firewall
`bearer`	External Prometheus secured by bearer token	✅	OAuth proxy, API gateway, etc.
`azure`	Azure Monitor Managed Prometheus (AKS + Monitor)	✅ AAD token	AKS + Azure Monitor workspace
`gcp`	Google Managed Service for Prometheus	✅ ADC token	GKE + Cloud Monitoring

🔐 How to Authenticate¶

For the native Go CLI, the auth model is:

local: no extra auth inputs
azure: uses existing Azure auth from the current shell or environment
gcp: uses Google Application Default Credentials from the current shell or environment
bearer: uses --prometheus-bearer-token-env to read a bearer token from an environment variable
basic: reads PROMETHEUS_USERNAME and PROMETHEUS_PASSWORD from the environment

The PowerShell wrapper maps onto the same runtime, but can also help populate those environment variables for you.
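Because each mode reads different inputs, a preflight check before invoking the CLI can catch a missing variable early. The sketch below is one way to do that in bash; `check_prometheus_env` is a hypothetical helper, not part of KubeBuddy, and it only covers the environment variables named above.

```shell
#!/usr/bin/env bash
# Preflight check: verify that the environment variables a mode reads are set.
# check_prometheus_env is a hypothetical helper, not a KubeBuddy command.
# Usage: check_prometheus_env <mode> [token-env-name]
check_prometheus_env() {
  local mode="${1:-}" token_env="${2:-}" required="" name missing=0
  case "$mode" in
    local|azure|gcp)
      # Nothing to check here: local needs no auth, and azure/gcp use the
      # ambient Azure or Google credentials, which this helper does not inspect.
      ;;
    bearer)
      if [ -z "$token_env" ]; then
        echo "bearer mode needs the name of the token variable" >&2
        return 2
      fi
      required="$token_env"
      ;;
    basic)
      required="PROMETHEUS_USERNAME PROMETHEUS_PASSWORD"
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 2
      ;;
  esac
  for name in $required; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_prometheus_env bearer PROMETHEUS_TOKEN && kubebuddy run ...` would only start the run when the token variable is non-empty. The variables must also be exported for the CLI process to see them.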

Native CLI Examples¶

Local (no auth)¶

kubebuddy run \
  --html-report \
  --include-prometheus \
  --prometheus-url "http://prometheus.monitoring.svc:9090" \
  --prometheus-mode local \
  --yes

Bearer Token¶

export PROMETHEUS_TOKEN="<your-token>"

kubebuddy run \
  --include-prometheus \
  --prometheus-url "https://prom.example.com" \
  --prometheus-mode bearer \
  --prometheus-bearer-token-env PROMETHEUS_TOKEN \
  --yes

Azure Monitor (AAD)¶

Use the current Azure identity in your environment. Local shells often use az login; containers and CI usually use AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID.

kubebuddy run \
  --include-prometheus \
  --prometheus-url "https://<workspace>.prometheus.monitor.azure.com" \
  --prometheus-mode azure \
  --yes

Google Managed Service for Prometheus (ADC)¶

Use Application Default Credentials from your current environment. Local shells often use gcloud auth application-default login; GKE, Cloud Run, and CI typically use attached service accounts or GOOGLE_APPLICATION_CREDENTIALS.

kubebuddy run \
  --include-prometheus \
  --prometheus-url "https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus" \
  --prometheus-mode gcp \
  --yes

Notes for Google Managed Service for Prometheus:

KubeBuddy uses the GMP Prometheus-compatible API at https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus
some Google-managed metric families require explicit label matchers such as monitored_resource="k8s_node"
new GKE clusters may expose Prometheus-backed checks before enough node-level series exist for the report snapshot
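Since the GMP endpoint differs only by project ID, it can be built from a variable instead of being hand-edited. This is a small sketch; `gmp_prometheus_url` is a hypothetical helper name, and the URL pattern is the one quoted above.

```shell
#!/usr/bin/env bash
# Build the GMP Prometheus-compatible endpoint for a Google Cloud project.
# gmp_prometheus_url is a hypothetical helper, not part of KubeBuddy.
gmp_prometheus_url() {
  local project_id="${1:-}"
  if [ -z "$project_id" ]; then
    echo "usage: gmp_prometheus_url <project-id>" >&2
    return 1
  fi
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus\n' "$project_id"
}

# Example, using the flags shown above:
#   kubebuddy run --include-prometheus \
#     --prometheus-url "$(gmp_prometheus_url my-project)" \
#     --prometheus-mode gcp --yes
```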

Basic Auth¶

export PROMETHEUS_USERNAME="admin"
export PROMETHEUS_PASSWORD="s3cr3t"

kubebuddy run \
  --include-prometheus \
  --prometheus-url "https://prom.example.com" \
  --prometheus-mode basic \
  --yes

PowerShell Wrapper Examples¶

Local (no auth)¶

Invoke-KubeBuddy `
  -HtmlReport `
  -IncludePrometheus `
  -PrometheusUrl "http://prometheus.monitoring.svc:9090" `
  -PrometheusMode local

Basic Auth¶

$env:PROMETHEUS_USERNAME = "admin"
$env:PROMETHEUS_PASSWORD = "s3cr3t"

Invoke-KubeBuddy `
  -IncludePrometheus `
  -PrometheusUrl "https://prom.example.com" `
  -PrometheusMode basic

Bearer Token¶

$env:PROMETHEUS_TOKEN = "<your-token>"
Invoke-KubeBuddy `
  -IncludePrometheus `
  -PrometheusUrl "https://prom.example.com" `
  -PrometheusMode bearer `
  -PrometheusBearerTokenEnv PROMETHEUS_TOKEN

Azure Monitor (AAD)¶

# Ensure AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID are set, or sign in with az login
Invoke-KubeBuddy `
  -IncludePrometheus `
  -PrometheusUrl "https://<workspace>.prometheus.monitor.azure.com" `
  -PrometheusMode azure

Google Managed Service for Prometheus (ADC)¶

# Ensure Application Default Credentials are available
Invoke-KubeBuddy `
  -IncludePrometheus `
  -PrometheusUrl "https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus" `
  -PrometheusMode gcp

🧪 Example Query¶

# p99 API-server latency over the last hour
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))

For GMP, refresh your access token before manual testing:

TOKEN="$(gcloud auth print-access-token)"
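To run a query by hand, one option is curl against the standard Prometheus HTTP API instant-query path, `/api/v1/query`, appended to the base URL. The sketch below is an illustration, not how KubeBuddy issues queries. `prom_instant_query` is a hypothetical helper, and the `CURL` override exists only so the command line can be inspected without network access.

```shell
#!/usr/bin/env bash
# Send one instant query to a Prometheus-compatible endpoint.
# prom_instant_query is a hypothetical helper, not part of KubeBuddy.
# Usage: prom_instant_query <base-url> <promql> [bearer-token]
prom_instant_query() {
  local base_url="$1" query="$2" token="${3:-}"
  # --get with --data-urlencode puts the URL-encoded query in the query string.
  local -a args=(--silent --show-error --fail --get --data-urlencode "query=${query}")
  if [ -n "$token" ]; then
    args+=(--header "Authorization: Bearer ${token}")
  fi
  # Strip a trailing slash from the base URL before appending the API path.
  "${CURL:-curl}" "${args[@]}" "${base_url%/}/api/v1/query"
}

# Example for GMP, using the token from gcloud:
#   TOKEN="$(gcloud auth print-access-token)"
#   prom_instant_query \
#     "https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus" \
#     'histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))' \
#     "$TOKEN"
```

For `local` mode the token argument is simply omitted.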

⏱️ Time-Window Configuration¶

Rather than being fixed, the look-back window is now driven by your YAML’s Range.Duration. You can specify minutes (m), hours (h), or days (d):

Prometheus:
  Query: 'sum(rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m])) by (pod)'
  Range:
    Step:    "5m"
    Duration: "24h"    # supports "m"=minutes, "h"=hours, "d"=days

KubeBuddy will translate that into start = now - 24h (or 30m, or 2d, etc.) automatically.
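The translation from a duration string to a start time can be sketched in a few lines of bash. This illustrates the arithmetic only and is not KubeBuddy's implementation. `duration_to_seconds` and `range_start` are hypothetical names, and only the `m`, `h`, and `d` suffixes listed above are accepted.

```shell
#!/usr/bin/env bash
# Convert a Range.Duration value such as "30m", "24h", or "2d" into seconds.
duration_to_seconds() {
  local value="${1:-}"
  if [[ ! "$value" =~ ^([0-9]+)([mhd])$ ]]; then
    echo "unsupported duration: $value" >&2
    return 1
  fi
  local amount="${BASH_REMATCH[1]}" unit="${BASH_REMATCH[2]}"
  # 10# forces base 10 so a value like "08h" is not parsed as octal.
  case "$unit" in
    m) echo $(( 10#$amount * 60 )) ;;
    h) echo $(( 10#$amount * 3600 )) ;;
    d) echo $(( 10#$amount * 86400 )) ;;
  esac
}

# Compute the window start as a Unix timestamp: start = now - duration.
# Usage: range_start <now-epoch-seconds> <duration>
range_start() {
  local now="$1" seconds
  seconds="$(duration_to_seconds "$2")" || return 1
  echo $(( now - seconds ))
}

# Example: range_start "$(date +%s)" 24h
```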

▶️ CLI Usage¶

Use any combination of report outputs:

# HTML report with Prometheus
$promCred = Get-Credential

Invoke-KubeBuddy `
  -HtmlReport `
  -IncludePrometheus `
  -PrometheusUrl "https://prometheus.example.com" `
  -PrometheusMode basic `
  -PrometheusCredential $promCred `
  -OutputPath "C:\reports\cluster.html"

# Text report with Prometheus
Invoke-KubeBuddy `
  -txtReport `
  -IncludePrometheus `
  -PrometheusUrl "http://prometheus.monitoring.svc:9090" `
  -PrometheusMode local `
  -OutputPath "/home/user/kube.txt"

# JSON report, Azure Monitor mode
Invoke-KubeBuddy `
  -jsonReport `
  -IncludePrometheus `
  -PrometheusUrl "https://<workspace>.prometheus.monitor.azure.com" `
  -PrometheusMode azure `
  -OutputPath "/reports/cluster.json"

📐 Node Sizing Insights¶

When Prometheus integration is enabled, KubeBuddy runs PROM006 and classifies each node using fixed 7-day p95 CPU/memory usage:

Underutilized: candidate for smaller SKU or scale-in
Right-sized: keep current sizing
Saturated: candidate for larger SKU or scale-out

PROM006 now also includes:

Current Allocatable (vCPU/Gi) from node allocatable capacity
Suggested Target Capacity (vCPU/Gi) estimated from p95 utilization with safety headroom

Minimum data rule:

KubeBuddy requires at least 7 days of Prometheus history before emitting node sizing recommendations.
If history is below 7 days, reports include an explicit Insufficient Prometheus history row instead of recommendations.

This 7-day rule affects sizing recommendations only. It is separate from the 24-hour report snapshot described above.

The check surfaces in the Nodes tab and in JSON/text output like any other check.

In HTML reports, Overview now includes a Rightsizing at a Glance section that summarizes:

Node sizing distribution (Underutilized / Saturated / Right-sized)
Pod sizing action counts from PROM007
Impact buckets and quick links to PROM006 and PROM007

Optional Threshold Overrides¶

You can tune the classification in ~/.kube/kubebuddy-config.yaml:

thresholds:
  node_sizing_downsize_cpu_p95: 35
  node_sizing_downsize_mem_p95: 40
  node_sizing_upsize_cpu_p95: 80
  node_sizing_upsize_mem_p95: 85
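To see how the four thresholds might interact, here is a sketch of a classifier in bash. The source does not state how CPU and memory are combined. This example assumes a node is Saturated when either p95 value reaches its upsize threshold and Underutilized only when both fall below their downsize thresholds. The defaults copy the example values above, and `classify_node` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Classify a node from integer p95 CPU and memory utilization percentages.
# Assumptions (not confirmed by KubeBuddy): "either" triggers Saturated,
# "both" are required for Underutilized, and upsize thresholds are inclusive.
# Usage: classify_node <cpu_p95> <mem_p95> [down_cpu down_mem up_cpu up_mem]
classify_node() {
  local cpu_p95="$1" mem_p95="$2"
  local down_cpu="${3:-35}" down_mem="${4:-40}" up_cpu="${5:-80}" up_mem="${6:-85}"
  if (( cpu_p95 >= up_cpu || mem_p95 >= up_mem )); then
    echo "Saturated"
  elif (( cpu_p95 < down_cpu && mem_p95 < down_mem )); then
    echo "Underutilized"
  else
    echo "Right-sized"
  fi
}

# Example: classify_node 20 30
```

Under these assumptions, raising `node_sizing_downsize_cpu_p95` flags more nodes as downsizing candidates, while lowering `node_sizing_upsize_cpu_p95` flags more nodes as saturated.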

📦 Pod Sizing Insights¶

When Prometheus integration is enabled, KubeBuddy also runs PROM007 for per-container recommendations using fixed 7-day p95 usage:

CPU request recommendation (millicores)
Memory request recommendation (MiB)
Memory limit recommendation (MiB)
CPU limit recommendation defaults to none

Minimum data rule:

KubeBuddy requires at least 7 days of Prometheus history before emitting pod sizing recommendations.
If history is below 7 days, reports include an explicit Insufficient Prometheus history row instead of recommendations.

This 7-day rule does not by itself explain JSON metrics: null; that value indicates the separate snapshot collector could not build usable node-level metrics.

Why CPU limit defaults to `none`¶

By default, KubeBuddy recommends no CPU limit because:

CPU is compressible; requests already control fair scheduling.
Hard CPU limits can trigger CFS throttling and add latency jitter.
In many production workloads, setting requests (without limits) gives better tail latency.

Set CPU limits only when strict tenant caps are required.

Optional Pod Sizing Threshold Overrides¶

thresholds:
  pod_sizing_profile: balanced   # conservative|balanced|aggressive
  pod_sizing_compare_profiles: true  # HTML/JSON include all 3 profiles by default
  pod_sizing_target_cpu_utilization: 65
  pod_sizing_target_mem_utilization: 75
  pod_sizing_cpu_request_floor_mcores: 25
  pod_sizing_mem_request_floor_mib: 128
  pod_sizing_mem_limit_buffer_percent: 20

Profile behavior:

conservative: higher requests/floors (more headroom)
balanced: default behavior (CPU floor: 25m)
aggressive: lower requests/floors (higher packing efficiency, CPU floor: 10m)

Comparison mode:

pod_sizing_compare_profiles is enabled by default to emit all three profile results in JSON and HTML.
Set pod_sizing_compare_profiles: false if you want only the active profile.
HTML report includes a profile selector on PROM007 findings so you can switch between profiles.
Text/CLI remain focused on the single active profile.
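The source does not give the formula behind these settings. One plausible reading is that a request is sized so that observed p95 usage lands at the target utilization, subject to the floor, and that the memory limit adds the buffer percentage on top of the request. The bash sketch below works through that reading with integer arithmetic. The formulas and the helper names `ceil_div`, `recommend_request`, and `recommend_mem_limit` are assumptions, not KubeBuddy's implementation.

```shell
#!/usr/bin/env bash
# ceil_div A B prints ceil(A / B) for positive integers.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Assumed formula: request = max(floor, ceil(p95 / (target_pct / 100))).
# Usage: recommend_request <p95> <target_pct> <floor>
recommend_request() {
  local p95="$1" target_pct="$2" floor="$3" rec
  rec="$(ceil_div $(( p95 * 100 )) "$target_pct")"
  if (( rec < floor )); then
    rec="$floor"
  fi
  echo "$rec"
}

# Assumed formula: limit = ceil(request * (1 + buffer_pct / 100)).
# Usage: recommend_mem_limit <request_mib> <buffer_pct>
recommend_mem_limit() {
  local request="$1" buffer_pct="$2"
  ceil_div $(( request * (100 + buffer_pct) )) 100
}

# Worked example with the values from the YAML above:
#   CPU:    p95 = 130m,    target 65%, floor 25m     -> recommend_request 130 65 25
#   Memory: p95 = 300 MiB, target 75%, floor 128 MiB -> recommend_request 300 75 128
#   Limit:  20% buffer on that memory request        -> recommend_mem_limit 400 20
```

With these inputs the sketch yields a 200m CPU request, a 400 MiB memory request, and a 480 MiB memory limit. A container whose p95 CPU is only 5m would be raised to the 25m floor.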

🐳 Docker Usage with Prometheus¶

For full Docker details, see the Docker Usage guide. Here’s a minimal Prometheus-enabled example:

export tagId="v0.0.19"

docker run -it --rm \
  -e KUBECONFIG="/home/kubeuser/.kube/config" \
  -e HTML_REPORT="true" \
  -e INCLUDE_PROMETHEUS="true" \
  -e PROMETHEUS_URL="https://prom.example.com" \
  -e PROMETHEUS_MODE="basic" \
  -e PROMETHEUS_USERNAME="admin" \
  -e PROMETHEUS_PASSWORD="s3cr3t" \
  -v $HOME/.kube/config:/tmp/kubeconfig-original:ro \
  -v $HOME/kubebuddy-report:/app/Reports \
  ghcr.io/kubedeckio/kubebuddy:$tagId
