〔AIVIA〕Evaluation Rubric & Scoring

HOW EVALUATIONS ARE SCORED, FROM INDIVIDUAL ANSWERS TO FINAL GRADE



01 Rubric Dimensions


Every evaluation scores answers on two types of rubric dimensions:


Technical dimensions (3 visible + 1 hidden):

Evaluations operate in one of two modes. Each mode has its own set of technical dimensions:

Mode | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4
Debug | Root Cause Isolation | Evidence Interpretation | Impact Assessment | Hidden (LLM-generated)
Design | Tradeoff Analysis | Gap Identification | Preventive Thinking | Hidden (LLM-generated)

Candidates see dimensions 1–3 listed in the scenario post under “What You’ll Be Evaluated On.” Dimension 4 is generated by the LLM based on the specific component and is not shown in advance.


General dimensions (fixed, all modes):

The three general dimensions are Clarity, Prioritization, and Reasoning Quality. They apply to every evaluation regardless of mode.
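The dimension pool described above can be sketched as a small lookup structure. This is only an illustration: the names `TECHNICAL_DIMENSIONS`, `GENERAL_DIMENSIONS`, and `dimension_pool` are assumptions, and the hidden dimension is passed in as a plain string because its actual content is generated per component.

```python
# Illustrative sketch only; identifiers are hypothetical, not AIVIA's schema.
TECHNICAL_DIMENSIONS = {
    "debug": ["Root Cause Isolation", "Evidence Interpretation", "Impact Assessment"],
    "design": ["Tradeoff Analysis", "Gap Identification", "Preventive Thinking"],
}
GENERAL_DIMENSIONS = ["Clarity", "Prioritization", "Reasoning Quality"]


def dimension_pool(mode: str, hidden_dimension: str) -> list[str]:
    """Return the full pool: 3 visible technical + 1 hidden + 3 general."""
    return TECHNICAL_DIMENSIONS[mode] + [hidden_dimension] + GENERAL_DIMENSIONS


# The hidden dimension name below is a made-up placeholder.
pool = dimension_pool("debug", "Hypothetical Hidden Dimension")
```

For either mode the resulting pool has seven entries, with the hidden dimension in fourth position.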



02 Dimension Selection & Scoring Scale


The evaluator picks 3–6 of the most relevant dimensions per response from the full pool of 7 (4 technical + 3 general). Not every dimension is scored on every answer.


Score | Label | Meaning
1 | Insufficient | Missing, off-topic, or no understanding
2 | Developing | Relevant but incomplete or contains errors
3 | Competent | Adequate for the expected background level
4 | Proficient | Clear, correct, good reasoning
5 | Excellent | Exceptional insight, depth, or mastery
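As a rough illustration of the selection rule and the scale, the check below validates one response's scores: between 3 and 6 dimensions, each drawn from the pool, each scored 1 to 5. The function name and the dictionary shape are assumptions made for this sketch.

```python
# Labels come from the scoring scale; everything else is a hypothetical sketch.
SCORE_LABELS = {
    1: "Insufficient",
    2: "Developing",
    3: "Competent",
    4: "Proficient",
    5: "Excellent",
}


def validate_response_scores(scores: dict[str, int], pool: list[str]) -> None:
    """Raise ValueError if a response's scores break the rubric's stated rules."""
    if not 3 <= len(scores) <= 6:
        raise ValueError("a response must be scored on 3 to 6 dimensions")
    for dimension, score in scores.items():
        if dimension not in pool:
            raise ValueError(f"unknown dimension: {dimension}")
        if score not in SCORE_LABELS:
            raise ValueError(f"score out of range for {dimension}: {score}")
```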



03 Coached Answer Cap


If a candidate provides an answer only after the evaluator hinted at it in a previous turn, that dimension is capped at 3/5. A score of 4 or 5 requires the candidate to lead the discovery independently.
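The cap amounts to clamping a dimension score whenever the answer followed a hint. A minimal sketch, assuming the evaluator records a boolean `was_hinted` flag per dimension (that flag is an assumption, not a documented field):

```python
COACHED_CAP = 3  # from the text: coached answers are capped at 3/5


def apply_coached_cap(raw_score: int, was_hinted: bool) -> int:
    """Clamp the score to the cap only when the answer followed a hint."""
    return min(raw_score, COACHED_CAP) if was_hinted else raw_score
```

A hinted answer that would otherwise earn a 5 is recorded as 3, while a hinted answer scored 2 stays at 2, since the cap only lowers scores.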



04 Weighting


AIVIA applies two layers of weighting to scores:


Layer 1 — Follow-up depth (within each question thread):

Response | Weight | Rationale
Initial answer | 1.0× | Could be memorized
Follow-up 1 answer | 1.2× | Probes actual understanding
Follow-up 2 answer | 1.5× | Deepest probe, hardest to fake

Layer 2 — Recency (across question threads):

Thread | Weight | Rationale
Thread 1 | 1.0× | Baseline
Thread 2 | 1.2× | Consistency check
Thread 3 | 1.5× | Ceiling test

The highest-weighted single response is Thread 3 / Follow-up 2 — the deepest probe in the final thread. This design rewards candidates who improve under pressure and penalizes those who collapse under follow-up probing.
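The text does not state how the two layers combine. The sketch below assumes they multiply, which is only one plausible reading; under that assumption Thread 3 / Follow-up 2 carries a weight of 1.5 × 1.5 = 2.25. The function and dictionary names are hypothetical.

```python
# Weights are taken from the two tables; keys are a convention of this sketch.
DEPTH_WEIGHTS = {0: 1.0, 1: 1.2, 2: 1.5}   # 0 = initial answer, 1 and 2 = follow-ups
THREAD_WEIGHTS = {1: 1.0, 2: 1.2, 3: 1.5}


def response_weight(thread: int, follow_up: int) -> float:
    """Combined weight of one response.

    ASSUMPTION: the layers are multiplied; the source does not specify this.
    """
    return THREAD_WEIGHTS[thread] * DEPTH_WEIGHTS[follow_up]


# Find the (thread, follow_up) pair with the largest combined weight.
heaviest = max(
    ((t, f) for t in THREAD_WEIGHTS for f in DEPTH_WEIGHTS),
    key=lambda pair: response_weight(*pair),
)
```

Here `heaviest` is `(3, 2)`, matching the statement that the final thread's second follow-up is the highest-weighted response. An additive combination would give the same ordering at the top, so the choice of multiplication affects magnitudes rather than which response matters most.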



05 Final Grade


The final grade is pattern-based, not average-based. It is determined by the distribution of technical dimension scores across all evaluator outputs:

Grade | Pattern | Requirement
Beginner | >40% of scores are 1–2 | Regardless of occasional highs
Intermediate | ≥60% of scores are 3+ | Majority competent
Expert | ≥60% of scores are 4–5, with zero 1s | Majority proficient, no failures

Additional factors:

Improving trajectory across threads can push a grade up. Collapsing under follow-up pressure can push a grade down. Technical dimensions are weighted higher than general dimensions. The grade applies to the specific component evaluated, not to general engineering ability.
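The pattern thresholds can be expressed as a short classifier. This sketch covers only the three table patterns and deliberately ignores the additional factors (trajectory, collapse under pressure), since the text gives no formula for them. It also assumes the Expert pattern is checked first, because any score set that qualifies as Expert also satisfies the Intermediate pattern. The function name is hypothetical.

```python
def classify_grade(technical_scores: list[int]) -> str:
    """Map a list of 1-5 technical dimension scores to a grade label.

    ASSUMPTION: patterns are checked from strictest to loosest.
    """
    total = len(technical_scores)
    if total == 0:
        raise ValueError("no scores to classify")

    competent = sum(1 for s in technical_scores if s >= 3)   # scores of 3+
    proficient = sum(1 for s in technical_scores if s >= 4)  # scores of 4-5
    has_ones = 1 in technical_scores

    # Integer comparison avoids floating-point rounding at the 60% boundary.
    if 10 * proficient >= 6 * total and not has_ones:
        return "Expert"
    if 10 * competent >= 6 * total:
        return "Intermediate"
    # Fewer than 60% at 3+ means more than 40% at 1-2.
    return "Beginner"
```

For example, `classify_grade([4, 5, 4, 1, 4])` returns `"Intermediate"`: 80% of the scores are 4–5, but the single 1 rules out Expert.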



A weighted numeric score (on a 1–5 scale) is also computed and used internally for badge aggregation, but it is not displayed to candidates or hiring teams.

A valid grade requires at least 6 scored responses. If fewer than 6 responses are scored, for example because of technical issues, the result is marked “Insufficient Data.”
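The exact formula for the internal numeric score is not given. One reading that keeps the result on the 1–5 scale is a weighted arithmetic mean, sketched below under two assumptions: each response contributes a single score (for instance the mean of its dimension scores), and the extra weight on technical dimensions is left out because no figure is stated. All names are hypothetical.

```python
from typing import Optional

MIN_SCORED_RESPONSES = 6  # from the text: fewer scored responses means no valid grade


def weighted_numeric_score(responses: list[tuple[float, float]]) -> Optional[float]:
    """Weighted mean of (score, weight) pairs.

    Returns None to stand in for an "Insufficient Data" result.
    ASSUMPTION: weighted arithmetic mean; the source does not give the formula.
    """
    if len(responses) < MIN_SCORED_RESPONSES:
        return None
    total_weight = sum(weight for _, weight in responses)
    return sum(score * weight for score, weight in responses) / total_weight
```

Because the result is a mean, it always falls between the lowest and highest input scores, so it cannot leave the 1–5 range.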