01 Rubric Dimensions
Every evaluation scores answers on two types of rubric dimensions:
Technical dimensions (3 visible + 1 hidden):
Evaluations operate in one of two modes. Each mode has its own set of technical dimensions:
| Mode | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 |
|---|---|---|---|---|
| Debug | Root Cause Isolation | Evidence Interpretation | Impact Assessment | Hidden (LLM-generated) |
| Design | Tradeoff Analysis | Gap Identification | Preventive Thinking | Hidden (LLM-generated) |
Candidates see dimensions 1–3 listed in the scenario post under “What You’ll Be Evaluated On.” Dimension 4 is generated by the LLM based on the specific component and is not shown in advance.
General dimensions (fixed, all modes):
Clarity, Prioritization, and Reasoning Quality. These apply to every evaluation regardless of mode.
02 Dimension Selection & Scoring Scale
The evaluator picks 3–6 of the most relevant dimensions per response from the full pool of 7 (4 technical + 3 general). Not every dimension is scored on every answer.
| Score | Label | Meaning |
|---|---|---|
| 1 | Insufficient | Missing, off-topic, or no understanding |
| 2 | Developing | Relevant but incomplete or contains errors |
| 3 | Competent | Adequate for the expected background level |
| 4 | Proficient | Clear, correct, good reasoning |
| 5 | Excellent | Exceptional insight, depth, or mastery |
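The two-mode dimension pool and the 1–5 scale above can be sketched as plain data. This is an illustrative model only; the names and structure are assumptions, not an actual AIVIA API:

```python
# Hypothetical sketch of the rubric pool described above.
# AIVIA's real data model is internal; identifiers here are illustrative.

TECHNICAL_DIMENSIONS = {
    "debug": ["Root Cause Isolation", "Evidence Interpretation",
              "Impact Assessment", "Hidden (LLM-generated)"],
    "design": ["Tradeoff Analysis", "Gap Identification",
               "Preventive Thinking", "Hidden (LLM-generated)"],
}

# Fixed general dimensions, applied in every mode.
GENERAL_DIMENSIONS = ["Clarity", "Prioritization", "Reasoning Quality"]

# The 1-5 scoring scale from the table above.
SCALE = {1: "Insufficient", 2: "Developing", 3: "Competent",
         4: "Proficient", 5: "Excellent"}

def dimension_pool(mode: str) -> list[str]:
    """Full pool of 7 dimensions (4 technical + 3 general) for a mode."""
    return TECHNICAL_DIMENSIONS[mode] + GENERAL_DIMENSIONS
```

For either mode, `dimension_pool` returns the pool of 7 from which the evaluator selects 3–6 per response.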
03 Coached Answer Cap
If a candidate provides an answer only after the evaluator hinted at it in a previous turn, that dimension is capped at 3/5. A score of 4 or 5 requires the candidate to lead the discovery independently.
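The cap reduces to a one-line rule. A minimal sketch, assuming a boolean "was coached" flag per dimension (the real scoring logic is internal to AIVIA):

```python
def apply_coached_cap(raw_score: int, was_coached: bool) -> int:
    """Cap a dimension score at 3 when the answer followed an evaluator hint.

    A 4 or 5 is only reachable when the candidate led the discovery
    independently (was_coached is False).
    """
    return min(raw_score, 3) if was_coached else raw_score
```

A coached answer that would otherwise merit a 5 is recorded as a 3; scores already at or below 3 are unaffected.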
04 Weighting
AIVIA applies two layers of weighting to scores:
Layer 1 — Follow-up depth (within each question thread):
| Response | Weight | Rationale |
|---|---|---|
| Initial answer | 1.0× | Could be memorized |
| Follow-up 1 answer | 1.2× | Probes actual understanding |
| Follow-up 2 answer | 1.5× | Deepest probe, hardest to fake |
Layer 2 — Recency (across question threads):
| Thread | Weight | Rationale |
|---|---|---|
| Thread 1 | 1.0× | Baseline |
| Thread 2 | 1.2× | Consistency check |
| Thread 3 | 1.5× | Ceiling test |
The highest-weighted single response is Thread 3 / Follow-up 2 — the deepest probe in the final thread. This design rewards candidates who improve under pressure and penalizes those who collapse under follow-up probing.
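The two layers can be sketched as a lookup per response. Multiplying the layer weights is an assumption, but it is consistent with the claim above: under any monotone combination, Thread 3 / Follow-up 2 carries the highest weight (here, 1.5 × 1.5 = 2.25×):

```python
# Sketch of the two weighting layers from the tables above.
# Multiplicative combination of the layers is an assumption.

FOLLOWUP_WEIGHT = {0: 1.0, 1: 1.2, 2: 1.5}   # 0 = initial answer
THREAD_WEIGHT = {1: 1.0, 2: 1.2, 3: 1.5}

def response_weight(thread: int, followup: int) -> float:
    """Combined weight of one response: thread weight x follow-up weight."""
    return THREAD_WEIGHT[thread] * FOLLOWUP_WEIGHT[followup]
```

Under this model the baseline initial answer in Thread 1 weighs 1.0× and the final follow-up in Thread 3 weighs 2.25×.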
05 Final Grade
The final grade is pattern-based, not average-based. It is determined by the distribution of technical dimension scores across all evaluator outputs:
| Grade | Pattern | Requirement |
|---|---|---|
| Beginner | >40% of scores are 1–2 | Regardless of occasional highs |
| Intermediate | ≥60% of scores are 3+ | Majority competent |
| Expert | ≥60% of scores are 4–5, with zero 1s | Majority proficient, no failures |
Additional factors:
- Improving trajectory across threads can push a grade up.
- Collapsing under follow-up pressure can push a grade down.
- Technical dimensions are weighted higher than general dimensions.
- The grade applies to the specific component evaluated, not to general engineering ability.
A weighted numeric score (on a 1–5 scale) is also computed and used internally for badge aggregation but is not displayed to candidates or hiring teams. A valid grade requires at least 6 scored responses. If too many responses go unscored due to technical issues, the result is marked “Insufficient Data.”
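The pattern rules above can be combined into one grading function. This is a sketch under stated assumptions: grades are checked highest first (Expert before Intermediate), trajectory adjustments and the internal weighted numeric score are not modeled, and fewer than 6 scored responses yields "Insufficient Data":

```python
def final_grade(technical_scores: list[int]) -> str:
    """Pattern-based grade from technical dimension scores (each 1-5).

    Sketch of the published thresholds only; trajectory adjustments and
    the internal weighted score are out of scope here.
    """
    n = len(technical_scores)
    if n < 6:
        return "Insufficient Data"
    frac_mid = sum(1 for s in technical_scores if s >= 3) / n   # 3+
    frac_high = sum(1 for s in technical_scores if s >= 4) / n  # 4-5
    if frac_high >= 0.6 and 1 not in technical_scores:
        return "Expert"
    if frac_mid >= 0.6:
        return "Intermediate"
    # Otherwise >40% of scores are 1-2 (the complement of >=60% at 3+).
    return "Beginner"
```

Note that a candidate with ≥60% scores at 4–5 but a single 1 falls back to Intermediate, matching the "zero 1s" requirement for Expert.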