Run Coordinator: Tool Output Continuity at Scale
What This Covers
You’ll reason about an Assistants-style thread/run coordinator and a tool-output submission handler under transient infrastructure instability. The focus is on debugging run lifecycles, correlating tool results to run steps, and preventing user-visible side effects.
The System
- A backend service manages conversation threads and runs, polling run status and resuming runs when they enter requires_action.
- Tools are executed by internal workers; results flow through a Tool Output Submission Handler that validates payloads and submits outputs back to the run.
- The submission handler maintains correlation IDs (thread_id, run_id, tool_call_id) and enforces step ordering.
- The coordinator is deployed on a horizontally scaled serving layer; replicas can restart during load spikes.
- External tools include side-effectful actions (e.g., ticket creation, email sending) and return an audit ID.
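The lifecycle described above can be sketched as a simple polling loop. This is a minimal illustration, not the service's real API: the client object, the `poll_run`/`submit_tool_outputs` method names, and the status strings are assumptions modeled on common Assistants-style run semantics.

```python
import time

# Statuses assumed terminal for this sketch.
TERMINAL = {"completed", "failed", "cancelled", "expired"}

def drive_run(client, thread_id, run_id, execute_tool, poll_interval=0.5):
    """Poll a run; when it enters requires_action, execute the pending
    tool calls and submit their outputs back to the run."""
    while True:
        run = client.poll_run(thread_id, run_id)  # hypothetical call
        if run["status"] in TERMINAL:
            return run
        if run["status"] == "requires_action":
            outputs = []
            for call in run["pending_tool_calls"]:
                result = execute_tool(call)  # internal worker runs the tool
                outputs.append({"tool_call_id": call["id"], "output": result})
            client.submit_tool_outputs(thread_id, run_id, outputs)
        time.sleep(poll_interval)
```

Note that nothing in this loop is restart-safe: if the replica dies between `execute_tool` and `submit_tool_outputs`, a rescheduled worker may re-execute the tool, which matters once tools have side effects.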
What Happened
On 2026-03-19 between 14:05 and 14:25 UTC, Support reported a burst of duplicate outbound actions for a small subset of users (e.g., duplicate tickets/emails created within seconds). The on-call also saw a rise in runs spending longer in requires_action before completing. During the evaluation, you’ll investigate this and related scenarios.
Anchor Artifact
2026-03-19T14:12:08.441Z serve/replica-llm-7 WARN Replica restarting (exit=137, OOMKilled=true)
2026-03-19T14:12:09.003Z coordinator INFO run.status thread=thd_9c2a run=run_31f7 status=requires_action pending_tool_calls=1
2026-03-19T14:12:09.017Z tool-exec INFO POST /tools/create_ticket req=gw_8b91 user=usr_1042 payload_hash=9f3c... timeout=10s
2026-03-19T14:12:19.028Z tool-exec WARN create_ticket timeout req=gw_8b91 elapsed_ms=10011
2026-03-19T14:12:19.041Z tool-exec INFO POST /tools/create_ticket req=gw_8b92 user=usr_1042 payload_hash=9f3c... timeout=10s
2026-03-19T14:12:19.612Z tool-exec INFO 201 Created req=gw_8b92 tool_audit_id=tkt_772901
2026-03-19T14:12:19.640Z submitter INFO submit_tool_outputs thread=thd_9c2a run=run_31f7 tool_call_id=call_A1 step=stp_12 bytes=684
2026-03-19T14:12:20.104Z tool-exec INFO 201 Created req=gw_8b91 tool_audit_id=tkt_772900
2026-03-19T14:12:20.131Z submitter WARN duplicate submit attempt thread=thd_9c2a run=run_31f7 tool_call_id=call_A1 step=stp_12
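The final WARN line shows the submitter rejecting a second submission for the same tool call. A guard keyed on (run_id, tool_call_id) would behave this way; the sketch below is one plausible shape for it, with illustrative names, not the actual handler code.

```python
class ToolOutputSubmitter:
    """Deduplicates tool-output submissions per (run_id, tool_call_id)."""

    def __init__(self, client):
        self.client = client
        self._submitted = set()  # logical calls already sent for this process

    def submit(self, thread_id, run_id, tool_call_id, output):
        key = (run_id, tool_call_id)
        if key in self._submitted:
            return False  # duplicate attempt: drop, emit a WARN upstream
        self.client.submit_tool_outputs(
            thread_id, run_id,
            [{"tool_call_id": tool_call_id, "output": output}],
        )
        self._submitted.add(key)
        return True
```

Worth noticing against the log: an in-memory set like `_submitted` does not survive the OOMKilled restart at 14:12:08, and it only guards the submission path. Both ticket creations (tkt_772900 and tkt_772901) had already happened at the tool-exec layer, so the guard caught the duplicate submit but not the duplicate side effect.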
Observed Signals
- Monitoring shows a short spike in replica restarts and p95 run completion time (from 4.2s to 11.8s) during the same window.
- The team noticed some external systems contain near-identical records created seconds apart, while the app UI shows only one confirmation.
- A few customers reported “I clicked once” while Support chat transcripts show occasional repeated user messages.
Constraints
- You can’t change the external tool providers’ behavior this week.
- Any mitigation must preserve throughput (no global locking) and keep median latency within 10% of baseline.
- You must assume tool execution and tool-output submission can be retried by infrastructure.
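Given that retries are unavoidable and providers cannot change this week, one common mitigation is a deterministic idempotency key derived from stable correlation fields, checked against a shared store before each side-effectful call. The derivation below is a sketch under that assumption; the field choices are illustrative.

```python
import hashlib

def idempotency_key(thread_id, run_id, tool_call_id, payload_hash):
    """Derive a stable key for one logical tool call.

    Retries of the same call (same tool_call_id, same payload) map to the
    same key, so a timeout-driven re-send can be detected before it creates
    a second ticket. A new run or a genuinely new payload yields a new key.
    """
    raw = f"{thread_id}|{run_id}|{tool_call_id}|{payload_hash}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]
```

Because the key is derived, not stored per attempt, any replica can compute it after a restart; the dedup check then needs only a shared key-value store with a short TTL, which avoids global locking and keeps the latency budget.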
What You’ll Be Evaluated On
- How you triage and narrow hypotheses from partial logs/metrics
- Run/thread lifecycle reasoning, including requires_action handling and step correlation
- Designing safe retry/idempotency and deduplication strategies across components
- Proposing mitigations that reduce user impact quickly, plus longer-term fixes
- Identifying what additional telemetry you’d add (and where) to confirm root cause
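On the telemetry point: the anchor log lets you see the duplicate only by eyeballing payload_hash across req IDs. A structured per-attempt event carrying the full correlation tuple plus an attempt counter makes the duplicate groupable in one query. The event shape below is a hypothetical example, not the service's existing schema.

```python
import json

def tool_attempt_event(thread_id, run_id, tool_call_id, req_id, attempt, payload_hash):
    """One structured log event per tool-execution attempt.

    Grouping events by (run_id, tool_call_id, payload_hash) with
    count(distinct req_id) > 1 surfaces duplicated side effects directly.
    """
    return json.dumps({
        "event": "tool.attempt",
        "thread_id": thread_id,
        "run_id": run_id,
        "tool_call_id": tool_call_id,
        "req_id": req_id,
        "attempt": attempt,
        "payload_hash": payload_hash,
    }, sort_keys=True)
```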