An agent, asked “why is my app returning 500s?”, runs a full incident triage — checking pod health, recent Kubernetes events, error logs, and deployment rollout history — composed across multiple execute calls, reasoning about each result before deciding what to check next.

The Triage Flow

This isn’t a single code block — it’s how the agent thinks. Each step is one execute call, but the agent decides what to check based on what it finds.
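The control flow can be pictured as a loop: run a check, inspect the structured result, pick the next check. This is an illustrative sketch of that shape, not the agent's actual implementation — `runStep`, `nextCheck`, and the step names are hypothetical:

```javascript
// Illustrative shape of the triage loop: each iteration is one execute
// call, and the finding from one step determines the next step (or none).
const triage = async (runStep) => {
  let finding = await runStep("pod-health");
  while (finding.nextCheck) {
    finding = await runStep(finding.nextCheck);
  }
  return finding;
};

// Example with stubbed steps (sample data, not real cluster output):
const stubSteps = {
  "pod-health": { nextCheck: "events" },
  "events": { nextCheck: null, rootCause: "OOMKilled" },
};
triage(name => Promise.resolve(stubSteps[name]))
  .then(finding => console.log(finding.rootCause)); // → "OOMKilled"
```

In the real flow the "loop" runs inside the LLM's reasoning rather than in code — each iteration is a fresh execute call informed by the previous result.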

Step 1 — Pod Health Check

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // resolved by the agent from conversation

  const kube = (path) => cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube/${path}`,
  }).then(r => r.body);

  const pods = await kube(`api/v1/namespaces/${namespace}/pods`);

  return pods.items.map(p => ({
    name: p.metadata.name,
    phase: p.status.phase,
    restarts: p.status.containerStatuses?.reduce((s, c) => s + c.restartCount, 0) || 0,
    ready: p.status.containerStatuses?.every(c => c.ready) || false,
    containers: p.status.containerStatuses?.map(c => ({
      name: c.name,
      ready: c.ready,
      restarts: c.restartCount,
      state: Object.keys(c.state || {})[0],
      reason: c.state?.waiting?.reason || c.state?.terminated?.reason || null,
    })),
  }));
}

The agent sees a pod in CrashLoopBackOff with 12 restarts. It decides to check events and logs.
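Picking out which pods deserve attention is a simple filter over the step-1 summary. A sketch, assuming the mapped shape returned above (`findSuspectPods` and the restart threshold are illustrative, not part of the platform):

```javascript
// Flag pods worth investigating from the step-1 summary: anything not
// Running and ready, or anything that has restarted repeatedly.
const findSuspectPods = (pods, restartThreshold = 5) =>
  pods.filter(p =>
    p.phase !== "Running" || !p.ready || p.restarts >= restartThreshold
  );

// Example against a step-1-shaped result (sample data, not real output):
const sample = [
  { name: "api-proxy-7f8b4c", phase: "Running", ready: false, restarts: 12 },
  { name: "web-5d9f", phase: "Running", ready: true, restarts: 0 },
];
console.log(findSuspectPods(sample).map(p => p.name)); // → ["api-proxy-7f8b4c"]
```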

Step 2 — Recent Events

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // from step 1
  const podName = "api-proxy-7f8b4c..."; // from step 1 results

  const kube = (path) => cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube/${path}`,
  }).then(r => r.body);

  const events = await kube(
    `api/v1/namespaces/${namespace}/events?fieldSelector=involvedObject.name=${podName}`
  );

  // Sort by last timestamp, return most recent
  return events.items
    .sort((a, b) => new Date(b.lastTimestamp) - new Date(a.lastTimestamp))
    .slice(0, 15)
    .map(e => ({
      type: e.type,
      reason: e.reason,
      message: e.message,
      count: e.count,
      last: e.lastTimestamp,
    }));
}

Events show OOMKilled — the container ran out of memory. The agent checks logs to confirm.
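Reasoning over events gets easier when common event reasons map to a likely next check. A hedged sketch — the reason strings are standard Kubernetes event reasons, but this mapping is illustrative, not something the platform ships:

```javascript
// Map common Kubernetes event reasons to a suggested next check.
// The agent does this kind of mapping in its reasoning; this is just
// an explicit version of that decision.
const nextCheckFor = (events) => {
  const reasons = new Set(events.map(e => e.reason));
  if (reasons.has("OOMKilling") || reasons.has("OOMKilled")) return "previous-container-logs";
  if (reasons.has("FailedScheduling")) return "node-capacity";
  if (reasons.has("BackOff")) return "container-logs";
  return "describe-pod";
};

console.log(nextCheckFor([{ reason: "OOMKilling" }, { reason: "BackOff" }]));
// → "previous-container-logs"
```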

Step 3 — Error Logs

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // from step 1
  const podName = "api-proxy-7f8b4c..."; // from step 1 results

  const logs = await cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube/api/v1/namespaces/${namespace}/pods/${podName}/log`,
    query: { tailLines: "200", previous: "true" },
  }).then(r => r.body);

  // Filter for errors and warnings
  const lines = logs.split("\n");
  const errors = lines.filter(l =>
    /error|fatal|panic|exception|oom|killed/i.test(l)
  );

  return {
    total_lines: lines.length,
    error_lines: errors.length,
    errors: errors.slice(-20),
  };
}

Note previous: "true" — the agent fetches logs from the crashed container, not the restarting one. It finds memory allocation failures in the last 20 error lines.
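The log filter from step 3 is just a case-insensitive regex over lines. Applied to sample lines (the log content below is made up for illustration):

```javascript
// Same pattern as step 3: keep only lines that look like errors.
const errorPattern = /error|fatal|panic|exception|oom|killed/i;

const sampleLines = [
  "2024-05-01T12:00:00Z INFO request handled in 12ms",
  "2024-05-01T12:00:01Z FATAL failed to allocate 512MiB",
  "signal: killed (OOM)",
];
const errors = sampleLines.filter(l => errorPattern.test(l));
console.log(errors.length); // → 2
```

Returning counts plus the last few matches, as step 3 does, keeps the payload small: the LLM sees 20 error lines instead of 200 raw ones.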

Step 4 — Deployment Rollout History

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // from step 1
  const deploymentName = "api-proxy"; // from step 1 results

  const kube = (path) => cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube/${path}`,
  }).then(r => r.body);

  const [deployment, replicaSets] = await Promise.all([
    kube(`apis/apps/v1/namespaces/${namespace}/deployments/${deploymentName}`),
    kube(`apis/apps/v1/namespaces/${namespace}/replicasets`),
  ]);

  // Find ReplicaSets owned by this deployment
  const owned = replicaSets.items
    .filter(rs => rs.metadata.ownerReferences?.some(o => o.kind === "Deployment" && o.name === deploymentName))
    .sort((a, b) => parseInt(b.metadata.annotations?.["deployment.kubernetes.io/revision"] || "0")
                   - parseInt(a.metadata.annotations?.["deployment.kubernetes.io/revision"] || "0"));

  return {
    current_image: deployment.spec.template.spec.containers[0]?.image,
    current_limits: deployment.spec.template.spec.containers[0]?.resources?.limits,
    revisions: owned.slice(0, 5).map(rs => ({
      revision: rs.metadata.annotations?.["deployment.kubernetes.io/revision"],
      image: rs.spec.template.spec.containers[0]?.image,
      replicas: rs.status.replicas,
      created: rs.metadata.creationTimestamp,
    })),
  };
}

The agent finds that the latest revision changed the image but removed memory limits — root cause identified.
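Spotting "removed memory limits" is a diff between the resource limits of two revisions. A minimal sketch, assuming the step-4 output shape (`limitsChanged` and the sample values are illustrative):

```javascript
// Diff resource limits between two revisions of a pod template.
// Returns one entry per key that was added, removed, or changed.
const limitsChanged = (oldLimits = {}, newLimits = {}) => {
  const keys = new Set([...Object.keys(oldLimits), ...Object.keys(newLimits)]);
  return [...keys]
    .filter(k => oldLimits[k] !== newLimits[k])
    .map(k => ({ key: k, before: oldLimits[k] ?? null, after: newLimits[k] ?? null }));
};

// Previous revision had a memory limit; the current one dropped it:
console.log(limitsChanged({ memory: "512Mi", cpu: "500m" }, { cpu: "500m" }));
// → [ { key: "memory", before: "512Mi", after: null } ]
```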

Why This Matters

An SRE manually doing this would:
  1. kubectl get pods — check status
  2. kubectl describe pod — read events
  3. kubectl logs --previous — check crash logs
  4. kubectl rollout history — check what changed
That’s 4 separate commands with raw output they need to mentally parse. The agent does it in 4 execute calls, but each one filters and extracts only what’s relevant, so the LLM reasons about structured findings, not walls of YAML.

More importantly, the agent adapts. It doesn’t run a fixed checklist — it sees OOMKilled and decides to check previous container logs and deployment history. A traditional MCP tool would need a pre-built “debug pod” tool that tries to anticipate every scenario.