〔AIVIA〕Evaluation Rubric & Scoring

stemaway · April 24, 2026, 6:24pm

Rubric & Scoring

HOW EVALUATIONS ARE SCORED, FROM INDIVIDUAL ANSWERS TO FINAL GRADE

01 Rubric Dimensions

Every evaluation scores answers on two types of rubric dimensions:

Technical dimensions (3 visible + 1 hidden):

Evaluations operate in one of two modes. Each mode has its own set of technical dimensions:

Mode	Dimension 1	Dimension 2	Dimension 3	Dimension 4
Debug	Root Cause Isolation	Evidence Interpretation	Impact Assessment	Hidden (LLM-generated)
Design	Tradeoff Analysis	Gap Identification	Preventive Thinking	Hidden (LLM-generated)

Candidates see dimensions 1–3 listed in the scenario post under “What You’ll Be Evaluated On.” Dimension 4 is generated by the LLM based on the specific component and is not shown in advance.

General dimensions (fixed, all modes):

Clarity, Prioritization, and Reasoning Quality. These apply to every evaluation regardless of mode.
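To make the pool concrete, here is a minimal sketch of how the seven dimensions for one evaluation could be assembled. The names `TECHNICAL_DIMENSIONS`, `GENERAL_DIMENSIONS`, and `dimension_pool` are hypothetical, and the hidden dimension is passed in as a plain string because the post does not describe how the LLM generates it.

```python
# Visible technical dimensions per mode, copied from the table above.
TECHNICAL_DIMENSIONS = {
    "debug": ["Root Cause Isolation", "Evidence Interpretation", "Impact Assessment"],
    "design": ["Tradeoff Analysis", "Gap Identification", "Preventive Thinking"],
}

# General dimensions apply to every evaluation regardless of mode.
GENERAL_DIMENSIONS = ["Clarity", "Prioritization", "Reasoning Quality"]


def dimension_pool(mode, hidden_dimension):
    """Return the 7-dimension pool: 3 visible + 1 hidden technical, then 3 general.

    `hidden_dimension` stands in for the LLM-generated dimension 4 (assumption:
    it arrives as a plain string from elsewhere).
    """
    visible = TECHNICAL_DIMENSIONS[mode.lower()]
    return visible + [hidden_dimension] + GENERAL_DIMENSIONS
```

For example, `dimension_pool("Debug", "Cache Invalidation Awareness")` would list the three Debug dimensions, the made-up hidden one, and the three general ones.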

02 Dimension Selection & Scoring Scale

The evaluator picks 3–6 of the most relevant dimensions per response from the full pool of 7 (4 technical + 3 general). Not every dimension is scored on every answer.

Score	Label	Meaning
1	Insufficient	Missing, off-topic, or no understanding
2	Developing	Relevant but incomplete or contains errors
3	Competent	Adequate for the expected background level
4	Proficient	Clear, correct, good reasoning
5	Excellent	Exceptional insight, depth, or mastery
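The selection rule and scale above can be checked with a small validator. This is only a sketch: `validate_response_scores` and its arguments are hypothetical names, and it assumes scores are whole numbers from 1 to 5.

```python
# Labels for the 1-5 scale, taken from the table above.
SCORE_LABELS = {
    1: "Insufficient",
    2: "Developing",
    3: "Competent",
    4: "Proficient",
    5: "Excellent",
}


def validate_response_scores(scores, pool):
    """Check one response's scores and return the label for each dimension.

    `scores` maps dimension name -> integer score; `pool` is the list of the
    7 dimensions available for this evaluation.
    """
    # The evaluator picks 3-6 dimensions per response.
    if not 3 <= len(scores) <= 6:
        raise ValueError("a response must be scored on 3 to 6 dimensions")
    for dimension, score in scores.items():
        if dimension not in pool:
            raise ValueError(f"unknown dimension: {dimension}")
        if score not in SCORE_LABELS:
            raise ValueError(f"score out of range for {dimension}: {score}")
    return {dimension: SCORE_LABELS[score] for dimension, score in scores.items()}
```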

03 Coached Answer Cap

If a candidate provides an answer only after the evaluator hinted at it in a previous turn, that dimension is capped at 3/5. A score of 4 or 5 requires the candidate to lead the discovery independently.
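In code, the cap amounts to a simple clamp. A minimal sketch, assuming the evaluator already knows whether the answer followed a hint (`was_hinted` is a hypothetical flag, not something the post defines):

```python
COACHED_CAP = 3  # maximum score for a dimension answered only after a hint


def apply_coached_cap(raw_score, was_hinted):
    """Clamp a dimension score to 3 when the answer followed an evaluator hint."""
    if was_hinted:
        return min(raw_score, COACHED_CAP)
    return raw_score
```

The cap only ever lowers a score: a hinted answer that would have earned a 2 stays at 2, while a hinted 5 becomes a 3.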

04 Weighting

AIVIA applies two layers of weighting to scores:

Layer 1 — Follow-up depth (within each question thread):

Response	Weight	Rationale
Initial answer	1.0×	Could be memorized
Follow-up 1 answer	1.2×	Probes actual understanding
Follow-up 2 answer	1.5×	Deepest probe, hardest to fake

Layer 2 — Recency (across question threads):

Thread	Weight	Rationale
Thread 1	1.0×	Baseline
Thread 2	1.2×	Consistency check
Thread 3	1.5×	Ceiling test

The highest-weighted single response is Thread 3 / Follow-up 2 — the deepest probe in the final thread. This design rewards candidates who improve under pressure and penalizes those who collapse under follow-up probing.
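The post does not say how the two layers combine. The sketch below assumes they multiply, which is only one plausible reading. `combined_weight` and the two lookup tables are hypothetical names, and follow-up index 0 stands for the initial answer.

```python
# Layer 1: follow-up depth within a thread (0 = initial answer).
DEPTH_WEIGHTS = {0: 1.0, 1: 1.2, 2: 1.5}

# Layer 2: recency across threads.
THREAD_WEIGHTS = {1: 1.0, 2: 1.2, 3: 1.5}


def combined_weight(thread, follow_up):
    """Weight of one response, ASSUMING the two layers are multiplied."""
    return THREAD_WEIGHTS[thread] * DEPTH_WEIGHTS[follow_up]


# Find the heaviest response across all thread / follow-up combinations.
heaviest = max(
    ((t, f) for t in THREAD_WEIGHTS for f in DEPTH_WEIGHTS),
    key=lambda pair: combined_weight(*pair),
)
```

Under this assumption `heaviest` is `(3, 2)`, matching the statement above, and that response would carry 1.5 × 1.5 = 2.25 times the weight of the first initial answer.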

05 Final Grade

The final grade is pattern-based, not average-based. It is determined by the distribution of technical dimension scores across all evaluator outputs:

Grade	Pattern	Notes
Beginner	>40% of scores are 1–2	Regardless of occasional highs
Intermediate	≥60% of scores are 3+	Majority competent
Expert	≥60% of scores are 4–5, with zero 1s	Majority proficient, no failures

Additional factors:

Improving trajectory across threads can push a grade up.
Collapsing under follow-up pressure can push a grade down.
Technical dimensions are weighted higher than general dimensions.
The grade applies to the specific component evaluated, not to general engineering ability.
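The three patterns in the grade table can be sketched as a classifier over a flat list of scores. This covers only the table: the additional factors (trajectory, collapse, technical vs. general weighting) are not modeled because the post gives no numbers for them. `classify_grade` is a hypothetical name.

```python
def classify_grade(scores):
    """Map a list of 1-5 scores to Beginner / Intermediate / Expert.

    Integer arithmetic is used for the percentage checks so that exact
    boundaries (40%, 60%) are not affected by floating-point rounding.
    """
    if not scores:
        raise ValueError("no scores to classify")
    n = len(scores)
    low = sum(1 for s in scores if s <= 2)    # scores of 1-2
    high = sum(1 for s in scores if s >= 4)   # scores of 4-5

    # Expert: at least 60% of scores are 4-5, with zero 1s.
    if high * 10 >= 6 * n and 1 not in scores:
        return "Expert"
    # Beginner: more than 40% of scores are 1-2.
    if low * 10 > 4 * n:
        return "Beginner"
    # Otherwise at least 60% of scores are 3+, which is the Intermediate pattern.
    return "Intermediate"
```

For instance, `[4, 5, 4, 3, 2]` classifies as Expert, while swapping the 2 for a 1 drops it to Intermediate because of the zero-1s requirement.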

A weighted numeric score (on a 1–5 scale) is also computed and used internally for badge aggregation, but it is not displayed to candidates or hiring teams.

A valid grade requires at least 6 scored responses. If technical issues leave fewer than 6 responses scored, the result is marked “Insufficient Data.”
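The post does not give the formula for the internal numeric score. One natural guess is a weighted mean of response scores, sketched below together with the 6-response minimum. `weighted_score` is a hypothetical name, and unscored responses are represented here as `None`.

```python
MIN_SCORED_RESPONSES = 6  # from the validity rule above


def weighted_score(responses):
    """Weighted mean on the 1-5 scale, or None when data is insufficient.

    `responses` is a list of (score, weight) pairs; score is None for a
    response that went unscored. ASSUMPTION: the numeric score is a plain
    weighted mean. The real aggregation is not documented in the post.
    """
    scored = [(s, w) for s, w in responses if s is not None]
    if len(scored) < MIN_SCORED_RESPONSES:
        return None  # caller would report "Insufficient Data"
    total_weight = sum(w for _, w in scored)
    return sum(s * w for s, w in scored) / total_weight
```

Because it is a weighted mean, the result always stays between the lowest and highest individual scores, so it remains on the 1–5 scale.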