Coverage for src/local_deep_research/benchmarks/templates.py: 100%
3 statements
coverage.py v7.12.0, created at 2026-01-11 00:51 +0000
"""
Templates for benchmark evaluation.

This module contains prompt templates used for evaluating model outputs
against reference answers.
"""
# Template for SimpleQA grading
SIMPLEQA_GRADER_TEMPLATE = """
You are evaluating a question-answering system. Given a question, the correct answer,
and the model's response, determine if the model's answer is correct.

Question: {question}

Correct Answer: {correct_answer}

Model Response: {response}

First, extract the model's final answer, ignoring explanations or reasoning.
Then determine if the model's answer is correct by comparing it to the correct answer.
Allow for minor variations in wording but be strict about factual accuracy.

Provide your evaluation in this format:
Extracted Answer: [The model's final answer, extracted from the full response]
Reasoning: [Your explanation for why the answer is correct or incorrect]
Correct: [yes/no]
"""
# Template for BrowseComp grading
BROWSECOMP_GRADER_TEMPLATE = """
Judge whether the following [response] to [question] is correct or not based on
the precise and unambiguous [correct_answer] below.

[question]: {question}

[response]: {response}

Your judgement must be in the format and criteria specified below:

extracted_final_answer: The final exact answer extracted from the [response].
Put the extracted answer as 'None' if there is no exact, final answer to extract
from the response.

[correct_answer]: {correct_answer}

reasoning: Explain why the extracted_final_answer is correct or incorrect based
on [correct_answer], focusing only on whether there are meaningful differences between
[correct_answer] and the extracted_final_answer.

correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given
above, or is within a small margin of error for numerical problems. Answer 'no' otherwise.

confidence: The extracted confidence score between 0% and 100% from [response].
Put 100 if there is no confidence score available.
"""
# Template for formatted BrowseComp queries
BROWSECOMP_QUERY_TEMPLATE = """
{question}

Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}
"""
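As a usage sketch (not part of the module), these templates are plain `str.format` strings: single-brace fields like `{question}` are filled in by the caller, while the doubled braces in BROWSECOMP_QUERY_TEMPLATE are escapes that survive formatting as literal braces in the instructions shown to the model. The example question below is hypothetical.

```python
# Sketch of how the query template is filled in. The template is
# redefined here so the snippet is self-contained.
BROWSECOMP_QUERY_TEMPLATE = """
{question}

Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}
"""

# {question} is substituted; {{...}} collapses to literal {...}.
prompt = BROWSECOMP_QUERY_TEMPLATE.format(
    question="What is the capital of France?"
)
print(prompt)
```

The grader templates work the same way, with `question`, `correct_answer`, and `response` as the format fields.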