NetSpeek

FILE 07.03 / SCENARIO

SIMULATED · FOR CANDIDATE EVALUATION · NOT REPRESENTATIVE OF PRODUCTION SYSTEMS

Improve the RAG

A retrieval prompt with imperfect groundedness. Improve it, then propose the metric that would catch a regression in your rewrite.

ALIGNED ROLES / AI Engineer

FILE 07.03.1 / SETUP

Our RAG layer powers Lena's grounded reasoning. When an operator asks Lena a question — "why is conf room 7 dropping calls" — the retrieval layer pulls the most relevant chunks from operational telemetry, structured device state, and product documentation, and the model answers with citations.

In this scenario's sample evaluation snapshot — visible in the telemetry below — the prompt clears the groundedness bar 60% of the time. That number is curated for the exercise, not a live production claim: a deliberately mid-range result so the candidate has something real to push against.

The non-grounded responses break down into a mix (see the telemetry): retrieval misses the relevant doc, retrieval returns distractors, the model ignores what it retrieved, or the model confabulates from a partial doc. The mix is informative — it's not all one failure class. That's the shape of the work an AI engineer does on retrieval: read the failure decomposition, decide which lever moves the most, change one thing, measure again.

We're inviting candidates to take a swing at this. Two scopes:

  1. Improve the retrieval prompt itself. The current prompt is simple: system instructions, the question, the retrieved context, an "I don't know" branch. There's a lot of room to do better: how you constrain the model, how you handle refusal, whether you make the model cite back to specific chunks, and how you handle the case where retrieval is borderline. One possible shape for a hardened prompt is sketched after this list.

  2. Propose the eval metric that would catch a regression in your rewrite. Groundedness rate alone is too blunt: it catches the worst-case failures but misses partial-citation failures and over-refusal. Pick one new metric, define it precisely, and tell us the threshold you'd alarm on. A sketch of what "define it precisely" can look like follows the telemetry below.
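To make scope 1 concrete, here is a minimal, hedged sketch of one direction a rewrite could take. It assumes retrieved chunks arrive with stable IDs, source labels, and similarity scores; the names (SYSTEM_PROMPT, build_user_prompt) are illustrative, not the production prompt.

```python
# Sketch only: one way to harden the retrieval prompt.
# Assumes each retrieved chunk is a dict with "source", "score", and "text" keys.

SYSTEM_PROMPT = """You are NetSpeek's operations assistant.
Answer ONLY from the numbered context chunks below.
Rules:
- Cite every factual claim with its chunk id in brackets, e.g. [C3].
- If the chunks do not contain the answer, reply exactly: "I don't know based on the retrieved context."
- If the chunks only partially answer the question, answer the supported part,
  cite it, and state explicitly what is missing.
- Do not use knowledge that is not in the chunks."""

def build_user_prompt(question: str, chunks: list[dict]) -> str:
    """Render retrieved chunks with ids and scores so the model can cite them."""
    rendered = "\n\n".join(
        f"[C{i}] (source: {c['source']}, score: {c['score']:.2f})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"Question: {question}\n\nContext:\n{rendered}\n\nAnswer (with [Cn] citations):"
```

Numbering the chunks and forcing bracketed citations gives the eval judge something checkable; the explicit partial-answer rule targets the confabulation slice, and the exact refusal string makes over-refusal easy to count. Treat this as a starting point, not a prescription.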

This is the kind of evaluation-and-iteration loop we run. Walk us through what you'd do.

FILE 07.03.2 / TELEMETRY

What is on the operator's screen.

Real-shaped operational data. Anonymized device IDs, real-shaped timing. The same view an on-call engineer would see in the moment.

SOURCE · 01 / current_prompt · LIVE
system: You are NetSpeek's operations assistant. Use the retrieved context to answer the operator's question. If the context does not answer the question, say 'I don't know.'
user_template: Question: {question} Context: {context} Answer:
chunks_retrieved: 6
chunk_size_tokens: 512
retrieval_strategy: top-k vector similarity (cosine)
current retrieval prompt template
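For orientation, here is a minimal sketch of how a configuration shaped like SOURCE · 01 might be wired together, assuming a generic top-k cosine vector store. The search interface, the embed and llm callables, and all names are illustrative, not NetSpeek's actual retrieval API.

```python
# Illustrative sketch of the SOURCE · 01 flow, not production code.
# Assumes index.search(query_embedding, k) returns the k most cosine-similar chunks as dicts.

TOP_K = 6            # chunks_retrieved
CHUNK_TOKENS = 512   # chunk_size_tokens

SYSTEM = ("You are NetSpeek's operations assistant. Use the retrieved context "
          "to answer the operator's question. If the context does not answer "
          "the question, say 'I don't know.'")

def answer(question: str, index, embed, llm) -> str:
    """Current flow: embed the question, take top-k chunks, fill the flat template."""
    chunks = index.search(embed(question), k=TOP_K)        # top-k cosine similarity
    context = "\n\n".join(c["text"] for c in chunks)       # no ids, no scores surfaced
    user = f"Question: {question} Context: {context} Answer:"
    return llm(system=SYSTEM, user=user)
```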
SOURCE · 02 / eval_summary · LIVE
groundedness_rate: 0.60
hallucination_rate: 0.18
refusal_rate: 0.09
irrelevant_answer_rate: 0.13
questions_evaluated: 240
eval_judge: human + reasoning-llm double-pass
last 100 evaluation runs
SOURCE · 03 / failure_breakdown · LIVE
retrieval_missed_relevant_doc: 31% of failures
retrieval_returned_irrelevant_doc: 24% of failures
model_ignored_retrieved_doc: 22% of failures
model_confabulated_with_partial_doc: 18% of failures
context_truncated: 5% of failures
failure-mode breakdown of the 40% non-grounded responses
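As a worked example of the precision scope 2 asks for (not the answer we expect), here is a hedged sketch of one candidate metric pair: citation-support rate plus an over-refusal check. The record fields (claims, supported, refused, answerable) are assumed judge outputs, not fields our pipeline actually emits, and the thresholds are illustrative.

```python
# Hedged sketch: one way to define a regression-catching metric pair.
# Field names and thresholds are hypothetical, not taken from the live eval pipeline.

from dataclasses import dataclass

@dataclass
class EvalRecord:
    claims: int        # factual claims the judge found in the answer
    supported: int     # claims traceable to a cited chunk
    refused: bool      # model answered "I don't know"
    answerable: bool   # judge says the retrieved context did contain the answer

def citation_support_rate(records: list[EvalRecord]) -> float:
    """Share of factual claims backed by an explicit chunk citation."""
    claims = sum(r.claims for r in records if not r.refused)
    supported = sum(r.supported for r in records if not r.refused)
    return supported / claims if claims else 1.0

def over_refusal_rate(records: list[EvalRecord]) -> float:
    """Share of answerable questions the model refused anyway."""
    answerable = [r for r in records if r.answerable]
    return sum(r.refused for r in answerable) / len(answerable) if answerable else 0.0

# Illustrative alarm: flag a regression if citation_support_rate drops below 0.90
# or over_refusal_rate rises above 0.05 over a rolling 100-run window.
```

The pairing matters: a prompt rewrite that boosts groundedness by refusing more often would pass the blunt metric but trip the over-refusal check, which is exactly the regression mode scope 2 is asking you to catch.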

FILE 07.03.3 / YOUR RESPONSE

Show us how you would design it.

Short and specific beats long and vague. The next step is the application form — we save what you have written here so you do not lose it.

We're reading for: how you constrain the model, how you handle the 'I don't know' branch, and whether you cite chunks back to the operator.

ROLE / TARGET · REQUIRED

This scenario maps to one role. Pick the one you want your application attached to.

SKIP / TAKE FIELD NOTE PATH

Fill in each prompt to continue. The soft minimums are guidance, not gatekeeping.