Assistant Thread/Run Coordination Reliability in Tool-Driven Agent Flows
Evaluation Overview
This evaluation focuses on reliability and correctness in an Assistants-style orchestration layer that manages threads, run lifecycles, and tool-result submission—especially around requires_action states and multi-step tool execution. The scenario emphasizes reasoning about distributed behavior, correlation/ordering, and safe handling of retries in systems that can produce real-world side effects.
You’ll be evaluated on your ability to:
- Trace end-to-end run state across thread/run lifecycle events and tool submissions
- Reason about idempotency, deduplication, and correlation IDs in multi-step agent workflows
- Debug ambiguous incidents using logs/metrics/audit trails without overfitting to one signal
- Identify correctness gaps between orchestrator state and external tool state
- Propose pragmatic mitigations (code + operational controls) under real constraints
- Communicate tradeoffs between throughput, consistency, and implementation complexity
System Overview
Architecture
- The product exposes an “agent” API that maps each user conversation to an Assistants-style thread, and each user turn to a run.
- A Thread/Run Coordinator owns run creation, status tracking, and resumption when a run enters requires_action.
- A Tool Output Submission Handler accepts tool execution results, validates/normalizes them, correlates them to the correct run step, and submits them back to the run to continue.
- Tools are executed out-of-process via a tool gateway (HTTP) that calls internal services (e.g., ticketing, email, billing sandbox) and returns structured JSON results.
- The coordinator uses a mix of polling (for run status) and event callbacks (tool completion events) to drive progress.
- An idempotency/deduplication layer exists for some internal events, but tool execution spans multiple services and queues.
Implementation
The coordinator and submission handler run as stateless services behind an internal load balancer. Run state is primarily provider-managed (thread/run state), with application-side correlation metadata stored in Redis and Postgres for observability and deduplication.
- Run status is polled every 2s while “active”; requires_action triggers tool dispatch and waits for tool completion.
- Tool results are normalized into a canonical schema and submitted via submit_tool_outputs with tool_call_id and a per-run correlation ID.
- The tool gateway uses HTTP timeouts (p95 target 3s, hard timeout 10s) and retries some requests on network errors.
- Submission handler processes tool completion events from a queue (at-least-once delivery), then submits to the provider API.
- Logs include thread_id, run_id, tool_call_id (when available), and an internal correlation_id generated at run start.
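The poll/dispatch cycle described above can be sketched roughly as follows. The client methods (get_run, dispatch_tool), the run-payload shape, and the dispatched-set guard are all illustrative assumptions, not the actual service code; only the 2s interval and the requires_action trigger come from the description.

```python
import time

POLL_INTERVAL_S = 2  # run status polled every 2s while active (per the description)

TERMINAL_STATUSES = ("completed", "failed", "cancelled", "expired")


def drive_run(client, thread_id, run_id, corr_id):
    """Illustrative coordinator loop: poll run status, dispatch tools when the
    run enters requires_action, and stop on a terminal state. All client APIs
    and field names are hypothetical."""
    dispatched = set()  # guard against re-dispatching a call while its output is pending
    while True:
        run = client.get_run(thread_id, run_id)
        if run["status"] == "requires_action":
            for call in run["required_tool_calls"]:
                if call["id"] in dispatched:
                    continue  # already in flight; the run stays requires_action until outputs arrive
                dispatched.add(call["id"])
                # Dispatch out-of-process via the tool gateway; results come back
                # asynchronously on a queue and are submitted separately.
                client.dispatch_tool(
                    name=call["name"],
                    tool_call_id=call["id"],
                    corr_id=corr_id,
                )
        elif run["status"] in TERMINAL_STATUSES:
            return run["status"]
        time.sleep(POLL_INTERVAL_S)
```

The dispatched-set guard matters here: without it, every poll that still sees requires_action would re-dispatch the same tool call.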
What Happened
On Tuesday between 10:14–10:27 UTC, several customers reported duplicate side effects from agent actions (e.g., two identical support tickets created, or the same “send confirmation email” action occurring twice) even though the UI showed a single assistant response for the turn. The issue was intermittent and clustered around a brief period of elevated latency in the agent backend. During the evaluation, you may be asked to investigate this and similar scenarios.
Example Artifacts
Artifact 1: Coordinator run lifecycle log (appears mostly normal)
2026-03-12T10:16:02.118Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a create_run model=gpt-4.1
2026-03-12T10:16:03.904Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=in_progress
2026-03-12T10:16:05.221Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=requires_action tool_calls=1
2026-03-12T10:16:05.233Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a dispatch_tool name=create_ticket tool_call_id=call_A1
2026-03-12T10:16:08.041Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=in_progress
2026-03-12T10:16:09.612Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=completed
Artifact 2: Tool gateway access log (subtle anomaly)
10:16:05.410Z POST /tools/create_ticket 200 2.1s
x-corr-id=cr_91a x-run-id=run_4c1 x-thread-id=thr_8f2
body_hash=9b3f... tool_call_id=call_A1
10:16:07.982Z POST /tools/create_ticket 200 1.9s
x-corr-id=cr_91a x-run-id=run_4c1 x-thread-id=thr_8f2
body_hash=9b3f... tool_call_id=call_A1
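One quick way to surface the anomaly in Artifact 2 is to group gateway log entries by (tool_call_id, body_hash) and flag keys seen more than once. A sketch, assuming the access log has already been parsed into dicts with those fields:

```python
from collections import Counter


def find_duplicate_dispatches(entries):
    """Flag (tool_call_id, body_hash) pairs appearing more than once in
    parsed tool-gateway log entries -- a sign the same tool call reached
    the gateway twice."""
    counts = Counter((e["tool_call_id"], e["body_hash"]) for e in entries)
    return [key for key, n in counts.items() if n > 1]
```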
Artifact 3: External system audit trail (ticketing)
[
{
"event": "ticket.created",
"ticket_id": "TCK-774201",
"created_at": "2026-03-12T10:16:07.531Z",
"source": "agent-tool-gateway",
"request_fingerprint": "9b3f..."
},
{
"event": "ticket.created",
"ticket_id": "TCK-774202",
"created_at": "2026-03-12T10:16:09.104Z",
"source": "agent-tool-gateway",
"request_fingerprint": "9b3f..."
}
]
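The request_fingerprint in the audit trail is a payload hash; a common construction is SHA-256 over a canonicalized JSON body. The exact scheme used by the gateway is not documented, so the following is only a plausible sketch:

```python
import hashlib
import json


def request_fingerprint(body: dict) -> str:
    """Hash a canonicalized JSON payload: sorted keys and compact separators
    ensure semantically identical bodies yield identical fingerprints."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Matching fingerprints on both tickets (as in Artifact 3) indicate byte-identical request bodies, i.e., a replayed request rather than two distinct user actions.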
Artifact 4: Submission handler log (looks reasonable, but incomplete)
2026-03-12T10:16:07.610Z submitter INFO corr=cr_91a run=run_4c1 tool_call_id=call_A1 normalize_ok bytes=842
2026-03-12T10:16:07.742Z submitter INFO corr=cr_91a run=run_4c1 tool_call_id=call_A1 submit_tool_outputs status=200
2026-03-12T10:16:08.003Z submitter WARN corr=cr_91a run=run_4c1 tool_call_id=call_A1 duplicate_event queue_msg_id=qm_5521 action=drop
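The WARN line above drops a duplicate keyed by queue_msg_id, which only catches redelivery of the same queue message; a retried dispatch arrives under a fresh message ID and slips through. A sketch of deduplicating on the tool call itself instead (in-memory here; a shared store such as Redis SET NX with a TTL would be the production analogue):

```python
class ToolCallDeduper:
    """Deduplicate submissions by (run_id, tool_call_id) rather than
    queue_msg_id, so retried dispatches carrying new message IDs are
    still caught. In-memory sketch only; stateless replicas would need
    a shared store (e.g., Redis SET NX with a TTL)."""

    def __init__(self):
        self._seen = set()

    def first_seen(self, run_id: str, tool_call_id: str) -> bool:
        """Return True exactly once per (run_id, tool_call_id)."""
        key = (run_id, tool_call_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```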
Observed Signals
- Duplicate downstream records appear within 1–3 minutes of each other with near-identical payload fingerprints, while the assistant UI shows only one completed run for the user turn.
- Tool gateway logs show two HTTP POSTs with the same body_hash and the same x-run-id/x-thread-id during the incident window; some entries include the same tool_call_id.
- Orchestrator tracing shows repeated spans for “tool dispatch” on a small subset of runs during a period of elevated backend latency; other runs in the same timeframe behave normally.
- Customer support reports include a few cases where users insist they clicked once; other cases include users who retried after seeing a spinner, creating ambiguity about client-side resubmission.
- The submission handler reports dropping some duplicate queue events, but external side effects still show duplicates for a subset of those runs.
- Infrastructure metrics show a brief spike in replica churn/restarts in the agent backend around 10:15–10:20 UTC, alongside increased p95 request latency.
Constraints
- Production impact is high: duplicated side effects (tickets/emails/billing sandbox actions) require manual cleanup and erode trust.
- Logging is partially redacted for privacy; full tool payloads are not stored, only hashes/fingerprints and selected headers.
- The system must maintain throughput (no global locks); p95 end-to-end turn latency target is 6 seconds.
- Changes should be incremental and safe to roll out behind feature flags; a full architecture rewrite is out of scope.
- External tools are not uniformly idempotent; some support idempotency keys, others do not.
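For the subset of tools that do accept idempotency keys, a stable key can be derived from identifiers the system already logs, so that any retry path (gateway retry, queue redelivery, coordinator re-dispatch) presents the same key. A hypothetical sketch:

```python
import hashlib


def idempotency_key(run_id: str, tool_call_id: str) -> str:
    """Derive a stable idempotency key from identifiers that are constant
    across retries of the same tool call. An idempotency-aware tool can
    then collapse duplicate requests server-side."""
    return hashlib.sha256(f"{run_id}:{tool_call_id}".encode("utf-8")).hexdigest()
```

For tools without idempotency support, the same key can still gate dispatch on the gateway side (first-writer-wins in a shared store) rather than relying on the downstream service.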
Your Role & What Happens Next
You’re stepping in as an engineer responsible for the Assistant Thread/Run Coordinator and the Tool Output Submission Handler, with the goal of improving correctness and operational reliability in multi-step tool-driven runs.
During the evaluation, you’ll be asked to reason through the run lifecycle, correlate artifacts across services, and propose changes that prevent stalling at requires_action while also reducing the chance of duplicated side effects—without sacrificing latency and throughput.