Assistant Run Lifecycle & Tool Output Coordination Reliability

Evaluation Overview

This evaluation focuses on debugging and hardening an Assistants-style orchestration layer that manages conversation threads, run lifecycles, and tool execution handoffs—especially around requires_action states and submitting tool outputs back to the correct run step. The emphasis is on reasoning about distributed behavior, retries, and correctness when multiple systems (LLM serving, tool gateway, and run coordinator) interact.

You’ll be evaluated on your ability to:

  • Trace a multi-step run across thread/run state, tool calls, and tool-output submission
  • Reason about ordering, correlation IDs, and idempotency/deduplication boundaries
  • Use logs/metrics to form and test hypotheses under ambiguity
  • Identify reliability risks in “at-least-once” delivery paths and propose mitigations
  • Balance strict validation vs tolerance for partial/late tool results
  • Propose changes that improve correctness without unacceptable latency or throughput loss

System Overview

Architecture

  • Product: a customer-support agent that can call tools (e.g., create_ticket, send_email, refund_payment) during a run and then continue the run after tool outputs are submitted.
  • The Assistant Thread/Run Coordinator owns thread state, creates runs, monitors run status, and resumes runs when tool outputs are available.
  • The Tool Output Submission Handler receives tool execution results from a tool gateway, normalizes/validates them, correlates them to the correct run_id + tool_call_id, and submits them back to the Assistants API.
  • Runs frequently enter requires_action with multiple tool calls in a single step; the system supports parallel tool execution and then submits a batch of tool outputs.
  • Deployment is horizontally scaled: multiple coordinator instances, a shared submission queue, and a shared idempotency/deduplication store.
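As a sketch of the parallel-execution path described above (all function and tool names here are hypothetical stand-ins; the real system dispatches to the tool gateway and receives results asynchronously), tool calls from a single requires_action step might be executed concurrently and their results gathered into one batch for submission:

```python
import concurrent.futures
import json

def execute_tool(tool_call):
    # Hypothetical local execution; the real system posts to the tool
    # gateway and receives results asynchronously.
    name = tool_call["name"]
    if name == "create_ticket":
        return {"ticket_id": "INC-00001"}
    if name == "send_email":
        return {"status": "queued"}
    raise ValueError(f"unknown tool: {name}")

def run_step_tools(tool_calls):
    """Execute all tool calls from one requires_action step in parallel,
    then return a single batch of canonical tool outputs."""
    outputs = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(execute_tool, tc): tc["tool_call_id"]
                   for tc in tool_calls}
        for fut in concurrent.futures.as_completed(futures):
            outputs.append({
                "tool_call_id": futures[fut],
                # output is carried as a JSON-encoded string
                "output": json.dumps(fut.result()),
            })
    # One submit_tool_outputs call would carry this whole batch.
    return sorted(outputs, key=lambda o: o["tool_call_id"])
```

The key property being illustrated is that a step with N tool calls produces exactly one submission containing N outputs, not N separate submissions.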

Implementation

The coordinator primarily uses polling for run status (with a short backoff) and submits tool outputs via the Assistants API submit_tool_outputs endpoint. Tool execution is handled by a separate tool gateway service; results are posted back asynchronously to the submission handler.
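A minimal sketch of that polling loop, assuming a generic client object (`client.get_run` and the status strings are illustrative stand-ins, not the actual API surface):

```python
import time

def poll_run(client, thread_id, run_id,
             base_delay=0.5, max_delay=5.0, timeout=120.0):
    """Poll run status with a short exponential backoff until the run
    leaves an in-flight state; returns the actionable/terminal run."""
    delay = base_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = client.get_run(thread_id, run_id)  # hypothetical call
        if run["status"] not in ("queued", "in_progress"):
            # e.g. requires_action, completed, failed
            return run
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError(f"run {run_id} still in flight after {timeout}s")
```

Note that a loop like this is where repeated requires_action observations for the same step can originate: every poll between the step becoming actionable and the outputs being submitted sees the same status.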

Concrete setup details:

  • Correlation model: {thread_id, run_id, step_id, tool_call_id} is expected to uniquely identify a tool output within a run.
  • The submission handler writes a dedup record keyed by (run_id, tool_call_id) with a 24h TTL before calling submit_tool_outputs.
  • Tool results are normalized into a canonical JSON shape: {tool_call_id, output} where output is a JSON-encoded string.
  • The coordinator enforces a “step ordering” rule: it only submits outputs for the latest known requires_action step for a run.
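The dedup-before-submit flow described above might look like the following sketch. An in-memory dict stands in for the shared idempotency store, and `submit_fn` stands in for the submit_tool_outputs call; all names are hypothetical:

```python
import json
import time

class DedupStore:
    """Stand-in for the shared idempotency store: set-if-absent with TTL."""
    def __init__(self):
        self._records = {}

    def put_if_absent(self, key, ttl_seconds):
        now = time.monotonic()
        expires = self._records.get(key)
        if expires is not None and expires > now:
            return False  # record exists within TTL -> dedup HIT
        self._records[key] = now + ttl_seconds
        return True       # first writer -> dedup MISS, proceed

def submit_once(store, submit_fn, run_id, tool_call_id, result,
                ttl=24 * 3600):
    """Write the dedup record keyed by (run_id, tool_call_id) before
    submitting; skip the submission if a record already exists."""
    key = f"{run_id}:{tool_call_id}"
    if not store.put_if_absent(key, ttl):
        return "SKIPPED_DUPLICATE"
    submit_fn([{"tool_call_id": tool_call_id, "output": json.dumps(result)}])
    return "SUBMITTED"
```

This mirrors the setup as stated: the dedup key is built from run_id and tool_call_id only, even though the correlation model also includes step_id, which is worth keeping in mind while reasoning through the artifacts.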

What Happened

On 2026-03-12, between 10:05 and 10:20 UTC, Support reported several customers receiving duplicate confirmation emails, and a small number of duplicate tickets were created for single chat requests. The issue was intermittent and clustered within that 15-minute window; after 10:20 UTC the rate returned to baseline without a rollback. During the evaluation, you may be asked to investigate this and similar scenarios.

Example Artifacts

Artifact 1 — Tool gateway audit log (external side effects)

2026-03-12T10:11:42.018Z tool-gw AUDIT action=create_ticket
  user_id=U19384 conversation_id=C8821
  payload_hash=9f2c...b17
  external_ticket_id=INC-55102 http_status=201 latency_ms=842

2026-03-12T10:11:49.907Z tool-gw AUDIT action=create_ticket
  user_id=U19384 conversation_id=C8821
  payload_hash=9f2c...b17
  external_ticket_id=INC-55103 http_status=201 latency_ms=801

Artifact 2 — Submission handler log (appears normal at first glance)

{
  "ts": "2026-03-12T10:11:44.120Z",
  "service": "tool-output-submitter",
  "level": "INFO",
  "thread_id": "thread_7QpK",
  "run_id": "run_Ba91",
  "step_id": "step_3",
  "tool_call_id": "call_X1",
  "dedup_key": "run_Ba91:call_X1",
  "dedup_result": "MISS",
  "submit_tool_outputs": {
    "tool_outputs_count": 1,
    "http_status": 200,
    "request_id": "req_01J8K..."
  }
}

Artifact 3 — Coordinator + serving layer excerpt (subtle anomaly)

10:11:40.332 coordinator INFO run_status thread_7QpK/run_Ba91 = requires_action step=step_3 tool_calls=[call_X1]
10:11:41.005 llm-serve WARN replica=serve-llm-2 event="worker restarted" reason="OOMKilled"
10:11:41.118 llm-serve INFO request path=/v1/threads/thread_7QpK/runs/run_Ba91 input_tokens=1820
10:11:41.119 llm-serve INFO request path=/v1/threads/thread_7QpK/runs/run_Ba91 input_tokens=1820
10:11:43.901 coordinator INFO run_status thread_7QpK/run_Ba91 = requires_action step=step_3 tool_calls=[call_X1]

Artifact 4 — Metric snapshot (system-level behavior)

10:10–10:20 UTC
- tool_gw.http_5xx_rate: 0.2% (baseline 0.2%)
- tool_gw.p95_latency_ms: 910 (baseline 880)
- submitter.submit_tool_outputs.success: 99.8% (baseline 99.9%)
- submitter.dedup.miss_rate: 7.4% (baseline 1.1%)
- llm_serve.replica_restarts: 6 (baseline 0–1)

Observed Signals

  • Duplicate downstream records (tickets/emails) share near-identical payloads and occur within seconds of each other for the same conversation.
  • The orchestrator’s run history for affected conversations shows a single logical tool step, while external systems show two successful creates.
  • A short-lived spike in LLM serving replica restarts overlaps with the time window of duplicate side effects.
  • The tool gateway does not show elevated error rates, but the submission handler’s dedup “MISS” rate increases during the incident window.
  • Some affected runs show repeated requires_action polling results for the same step_id and tool call list, with no obvious user re-submission.
  • A subset of tool invocations show a timeout warning in the orchestrator logs, followed by a second invocation with the same arguments shortly after.

Constraints

  • Customer impact is high: duplicate side effects (emails/tickets/refunds) require manual cleanup and can violate support SLAs.
  • Privacy constraints: full tool payloads and message content are not consistently logged; investigation relies on hashes, IDs, and metadata.
  • The system must maintain throughput; adding heavy synchronous locking in the hot path is discouraged.
  • The coordinator and submission handler are horizontally scaled; any fix must work correctly under concurrency and partial failures.
  • Changes should be deployable with minimal downtime; the team prefers incremental mitigations over a full redesign.

Your Role & What Happens Next

You’re stepping in as an engineer responsible for the reliability of the Assistant Thread/Run Coordinator and the Tool Output Submission Handler, with a focus on correctness around requires_action and tool-output submission.

In the evaluation, you’ll be asked to reason through the run/tool lifecycle, interpret the provided artifacts, identify what additional telemetry you’d want, and propose pragmatic changes (code, configuration, and/or operational safeguards) that reduce the chance of duplicate side effects while keeping the system performant and maintainable.