Run Coordinator: Tool Output Continuity at Scale

What This Covers

You’ll reason about an Assistants-style thread/run coordinator and a tool-output submission handler under transient infrastructure instability. The focus is on debugging run lifecycles, correlating tool results to run steps, and preventing user-visible side effects.

The System

  • A backend service manages conversation threads and runs, polling run status and resuming runs when they enter requires_action.
  • Tools are executed by internal workers; results flow through a Tool Output Submission Handler that validates payloads and submits outputs back to the run.
  • The submission handler maintains correlation IDs (thread_id, run_id, tool_call_id) and enforces step ordering.
  • The coordinator is deployed on a horizontally scaled serving layer; replicas can restart during load spikes.
  • External tools include side-effectful actions (e.g., ticket creation, email sending), each of which returns an audit ID.
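
The lifecycle above can be sketched as a polling loop. This is a hedged, self-contained illustration, not the service's real API: the client methods (fetch_run, execute_tool, submit_tool_outputs) and the status names beyond requires_action are assumptions for the sake of the sketch.

```python
import time

def poll_run(client, thread_id, run_id, interval_s=0.01):
    """Poll a run until it reaches a terminal state, resuming it whenever it
    enters requires_action by executing pending tool calls and submitting
    their outputs, correlated by tool_call_id."""
    while True:
        run = client.fetch_run(thread_id, run_id)
        if run["status"] == "requires_action":
            outputs = [
                {"tool_call_id": call["id"], "output": client.execute_tool(call)}
                for call in run["pending_tool_calls"]
            ]
            client.submit_tool_outputs(thread_id, run_id, outputs)
        elif run["status"] in ("completed", "failed", "cancelled", "expired"):
            return run
        time.sleep(interval_s)
```

Note that nothing in this loop distinguishes a first submit from a retried one; that gap is exactly where the incident below lives.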

What Happened

On Tuesday, between 14:05 and 14:25 UTC, Support reported a burst of duplicate outbound actions for a small subset of users (e.g., duplicate tickets or emails created within seconds of each other). The on-call also saw runs spending longer in requires_action before completing. During the evaluation, you'll investigate this and related scenarios.

Anchor Artifact

2026-03-19T14:12:08.441Z  serve/replica-llm-7  WARN  Replica restarting (exit=137, OOMKilled=true)
2026-03-19T14:12:09.003Z  coordinator         INFO  run.status thread=thd_9c2a run=run_31f7 status=requires_action pending_tool_calls=1
2026-03-19T14:12:09.017Z  tool-exec           INFO  POST /tools/create_ticket req=gw_8b91 user=usr_1042 payload_hash=9f3c... timeout=10s
2026-03-19T14:12:19.028Z  tool-exec           WARN  create_ticket timeout req=gw_8b91 elapsed_ms=10011
2026-03-19T14:12:19.041Z  tool-exec           INFO  POST /tools/create_ticket req=gw_8b92 user=usr_1042 payload_hash=9f3c... timeout=10s
2026-03-19T14:12:19.612Z  tool-exec           INFO  201 Created req=gw_8b92 tool_audit_id=tkt_772901
2026-03-19T14:12:19.640Z  submitter           INFO  submit_tool_outputs thread=thd_9c2a run=run_31f7 tool_call_id=call_A1 step=stp_12 bytes=684
2026-03-19T14:12:20.104Z  tool-exec           INFO  201 Created req=gw_8b91 tool_audit_id=tkt_772900
2026-03-19T14:12:20.131Z  submitter           WARN  duplicate submit attempt thread=thd_9c2a run=run_31f7 tool_call_id=call_A1 step=stp_12

Observed Signals

  • Monitoring shows a short spike in replica restarts and p95 run completion time (from 4.2s to 11.8s) during the same window.
  • The team noticed that some external systems contained near-identical records created seconds apart, while the app UI showed only one confirmation.
  • A few customers reported “I clicked once” while Support chat transcripts show occasional repeated user messages.

Constraints

  • You can’t change the external tool providers’ behavior this week.
  • Any mitigation must preserve throughput (no global locking) and keep median latency within 10% of baseline.
  • You must assume tool execution and tool-output submission can be retried by infrastructure.
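
One common shape for a mitigation under these constraints is to key the side effect on the stable correlation IDs rather than the per-attempt request ID, so a retry reuses the same key. The sketch below is illustrative only: the in-memory dict stands in for a shared store (e.g., a compare-and-set in Redis), and the function names are assumptions, not existing code.

```python
import hashlib

# Illustrative in-memory idempotency store: key -> tool_audit_id.
# A real deployment needs a shared store with TTL and an atomic
# check-and-reserve, or this dedupes only within one replica.
_sent: dict = {}

def idempotency_key(run_id: str, tool_call_id: str, payload_hash: str) -> str:
    # Stable across retries: a timed-out attempt and its retry share all
    # three inputs (as gw_8b91 and gw_8b92 share payload_hash in the logs).
    raw = f"{run_id}:{tool_call_id}:{payload_hash}"
    return hashlib.sha256(raw.encode()).hexdigest()

def create_ticket_once(run_id, tool_call_id, payload_hash, do_create):
    """Perform the side effect at most once per (run, tool_call, payload)."""
    key = idempotency_key(run_id, tool_call_id, payload_hash)
    if key in _sent:
        return _sent[key]          # retry path: return the prior audit ID
    audit_id = do_create()         # first attempt: perform the side effect
    _sent[key] = audit_id
    return audit_id
```

A caveat worth surfacing in your answer: a client-side check like this does not close the in-flight race in the anchor artifact, where the first request (gw_8b91) timed out but later succeeded. Fully closing that requires the provider to honor an idempotency key, or a reservation written to the shared store before the call is issued.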

What You’ll Be Evaluated On

  • How you triage and narrow hypotheses from partial logs/metrics
  • Run/thread lifecycle reasoning, including requires_action handling and step correlation
  • Designing safe retry/idempotency and deduplication strategies across components
  • Proposing mitigations that reduce user impact quickly, plus longer-term fixes
  • Identifying what additional telemetry you’d add (and where) to confirm root cause
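
For the last bullet, one concrete shape of the missing telemetry is a structured submit event carrying the full correlation tuple plus an attempt counter, so duplicate submits can be joined across the coordinator, tool-exec, and submitter logs. The field names below are assumptions, not an existing schema.

```python
import json
import time

def submit_event(thread_id, run_id, tool_call_id, step_id, attempt, outcome):
    """Emit one structured log line per tool-output submit attempt."""
    return json.dumps({
        "ts": time.time(),
        "event": "submit_tool_outputs",
        "thread_id": thread_id,
        "run_id": run_id,
        "tool_call_id": tool_call_id,
        "step_id": step_id,
        "attempt": attempt,   # distinguishes the first submit from retries
        "outcome": outcome,   # e.g. "ok" | "duplicate" | "error"
    }, sort_keys=True)
```

With attempt and outcome recorded at every hop, the 14:12 window above would show attempt=2 succeeding before attempt=1, confirming the retry-race hypothesis directly instead of inferring it from payload hashes.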