NetSpeek

FILE 07.03 / SCENARIO

SIMULATED · FOR CANDIDATE EVALUATION · NOT REPRESENTATIVE OF PRODUCTION SYSTEMS

Improve the RAG

A retrieval prompt with imperfect groundedness. Improve it, then propose the metric that would catch a regression in your rewrite.

ALIGNED ROLES / AI Engineer

FILE 07.03.1 / SETUP

Our RAG layer powers Lena's grounded reasoning. When an operator asks Lena a question — "why is conf room 7 dropping calls" — the retrieval layer pulls the most relevant chunks from operational telemetry, structured device state, and product documentation, and the model answers with citations.

In this scenario's sample evaluation snapshot — visible in the telemetry below — the prompt clears the groundedness bar 60% of the time. That number is curated for the exercise, not a live production claim: a deliberately mid-range result so the candidate has something real to push against.

The non-grounded responses break down into a mix (see the telemetry): retrieval misses the relevant doc, retrieval returns distractors, the model ignores what it retrieved, or the model confabulates from a partial doc. The mix is informative — it's not all one failure class. That's the shape of the work an AI engineer does on retrieval: read the failure decomposition, decide which lever moves the most, change one thing, measure again.

We're inviting candidates to take a swing at this. Two scopes:

  1. Improve the retrieval prompt itself. The current prompt is simple: system instructions, the question, the retrieved context, an "I don't know" branch. There's a lot of room to do better: how you constrain the model, how you handle refusal, whether you make the model cite back to specific chunks, and how you handle the case where retrieval is borderline. One possible shape for a hardened prompt is sketched after this list.

  2. Propose the eval metric that would catch a regression in your rewrite. Groundedness rate alone is too blunt: it catches the worst-case failures but misses partial-citation failures and over-refusal. Pick one new metric, define it precisely, and tell us the threshold you'd alarm on. A sketch of what "define it precisely" can look like follows the telemetry below.
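To make scope 1 concrete, here is a minimal, hedged sketch of one direction a rewrite could take. It assumes retrieved chunks arrive with stable IDs, source labels, and similarity scores; the names (SYSTEM_PROMPT, build_user_prompt) are illustrative, not the production prompt.

```python
# Sketch only: one way to harden the retrieval prompt.
# Assumes each retrieved chunk is a dict with "source", "score", and "text" keys.

SYSTEM_PROMPT = """You are NetSpeek's operations assistant.
Answer ONLY from the numbered context chunks below.
Rules:
- Cite every factual claim with its chunk id in brackets, e.g. [C3].
- If the chunks do not contain the answer, reply exactly: "I don't know based on the retrieved context."
- If the chunks only partially answer the question, answer the supported part,
  cite it, and state explicitly what is missing.
- Do not use knowledge that is not in the chunks."""

def build_user_prompt(question: str, chunks: list[dict]) -> str:
    """Render retrieved chunks with ids and scores so the model can cite them."""
    rendered = "\n\n".join(
        f"[C{i}] (source: {c['source']}, score: {c['score']:.2f})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"Question: {question}\n\nContext:\n{rendered}\n\nAnswer (with [Cn] citations):"
```

Numbering the chunks and forcing bracketed citations gives the eval judge something checkable; the explicit partial-answer rule targets the confabulation slice, and the exact refusal string makes over-refusal easy to count. Treat this as a starting point, not a prescription.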

This is the kind of evaluation-and-iteration loop we run. Walk us through what you'd do.

FILE 07.03.2 / TELEMETRY

What is on the operator's screen.

Real-shaped operational data. Anonymized device IDs, real-shaped timing. The same view an on-call engineer would see in the moment.

SOURCE · 01 / current_prompt · LIVE
system: You are NetSpeek's operations assistant. Use the retrieved context to answer the operator's question. If the context does not answer the question, say 'I don't know.'
user_template: Question: {question} Context: {context} Answer:
chunks_retrieved: 6
chunk_size_tokens: 512
retrieval_strategy: top-k vector similarity (cosine)
current retrieval prompt template
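For orientation, here is a minimal sketch of how a configuration shaped like SOURCE · 01 might be wired together, assuming a generic top-k cosine vector store. The search interface, the embed and llm callables, and all names are illustrative, not NetSpeek's actual retrieval API.

```python
# Illustrative sketch of the SOURCE · 01 flow, not production code.
# Assumes index.search(query_embedding, k) returns the k most cosine-similar chunks as dicts.

TOP_K = 6            # chunks_retrieved
CHUNK_TOKENS = 512   # chunk_size_tokens

SYSTEM = ("You are NetSpeek's operations assistant. Use the retrieved context "
          "to answer the operator's question. If the context does not answer "
          "the question, say 'I don't know.'")

def answer(question: str, index, embed, llm) -> str:
    """Current flow: embed the question, take top-k chunks, fill the flat template."""
    chunks = index.search(embed(question), k=TOP_K)        # top-k cosine similarity
    context = "\n\n".join(c["text"] for c in chunks)       # no ids, no scores surfaced
    user = f"Question: {question} Context: {context} Answer:"
    return llm(system=SYSTEM, user=user)
```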
SOURCE · 02 / eval_summary · LIVE
groundedness_rate: 0.60
hallucination_rate: 0.18
refusal_rate: 0.09
irrelevant_answer_rate: 0.13
questions_evaluated: 240
eval_judge: human + reasoning-llm double-pass
last 100 evaluation runs
SOURCE · 03 / failure_breakdown · LIVE
retrieval_missed_relevant_doc: 31% of failures
retrieval_returned_irrelevant_doc: 24% of failures
model_ignored_retrieved_doc: 22% of failures
model_confabulated_with_partial_doc: 18% of failures
context_truncated: 5% of failures
failure-mode breakdown of the 40% non-grounded responses
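As a worked example of the precision scope 2 asks for (not the answer we expect), here is a hedged sketch of one candidate metric pair: citation-support rate plus an over-refusal check. The record fields (claims, supported, refused, answerable) are assumed judge outputs, not fields our pipeline actually emits, and the thresholds are illustrative.

```python
# Hedged sketch: one way to define a regression-catching metric pair.
# Field names and thresholds are hypothetical, not taken from the live eval pipeline.

from dataclasses import dataclass

@dataclass
class EvalRecord:
    claims: int        # factual claims the judge found in the answer
    supported: int     # claims traceable to a cited chunk
    refused: bool      # model answered "I don't know"
    answerable: bool   # judge says the retrieved context did contain the answer

def citation_support_rate(records: list[EvalRecord]) -> float:
    """Share of factual claims backed by an explicit chunk citation."""
    claims = sum(r.claims for r in records if not r.refused)
    supported = sum(r.supported for r in records if not r.refused)
    return supported / claims if claims else 1.0

def over_refusal_rate(records: list[EvalRecord]) -> float:
    """Share of answerable questions the model refused anyway."""
    answerable = [r for r in records if r.answerable]
    return sum(r.refused for r in answerable) / len(answerable) if answerable else 0.0

# Illustrative alarm: flag a regression if citation_support_rate drops below 0.90
# or over_refusal_rate rises above 0.05 over a rolling 100-run window.
```

The pairing matters: a prompt rewrite that boosts groundedness by refusing more often would pass the blunt metric but trip the over-refusal check, which is exactly the regression mode scope 2 is asking you to catch.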

FILE 07.03.3 / YOUR RESPONSE

Show us how you would design it.

Short and specific beats long and vague. The next step is the application form — we save what you have written here so you do not lose it.

We're reading for: how you constrain the model, how you handle the 'I don't know' branch, and whether you cite chunks back to the operator.

ROLE / TARGET · REQUIRED

This scenario maps to one role. Pick the one you want your application attached to.

SKIP / TAKE FIELD NOTE PATH

Fill in each prompt to continue. The soft minimums are guidance, not gatekeeping.