Coverage for src/local_deep_research/benchmarks/templates.py: 100%
3 statements
coverage.py v7.12.0, created at 2026-01-11 00:51 +0000
"""
Templates for benchmark evaluation.

This module contains prompt templates used for evaluating model outputs
against reference answers.
"""
# Template for SimpleQA grading
SIMPLEQA_GRADER_TEMPLATE = """
You are evaluating a question-answering system. Given a question, the correct answer,
and the model's response, determine if the model's answer is correct.

Question: {question}

Correct Answer: {correct_answer}

Model Response: {response}

First, extract the model's final answer, ignoring explanations or reasoning.
Then determine if the model's answer is correct by comparing it to the correct answer.
Allow for minor variations in wording but be strict about factual accuracy.

Provide your evaluation in this format:
Extracted Answer: [The model's final answer, extracted from the full response]
Reasoning: [Your explanation for why the answer is correct or incorrect]
Correct: [yes/no]
"""
# Template for BrowseComp grading
BROWSECOMP_GRADER_TEMPLATE = """
Judge whether the following [response] to [question] is correct or not based on
the precise and unambiguous [correct_answer] below.

[question]: {question}

[response]: {response}

Your judgement must be in the format and criteria specified below:

extracted_final_answer: The final exact answer extracted from the [response].
Put the extracted answer as 'None' if there is no exact, final answer to extract
from the response.

[correct_answer]: {correct_answer}

reasoning: Explain why the extracted_final_answer is correct or incorrect based
on [correct_answer], focusing only on whether there are meaningful differences between
[correct_answer] and the extracted_final_answer.

correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given
above, or is within a small margin of error for numerical problems. Answer 'no' otherwise.

confidence: The extracted confidence score between 0% and 100% from [response].
Put 100 if there is no confidence score available.
"""
# Template for formatted BrowseComp queries
BROWSECOMP_QUERY_TEMPLATE = """
{question}

Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}
"""
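As a usage sketch (not part of the module), these templates are plain `str.format` strings: single-brace fields like `{question}` are filled in by the caller, while the doubled braces in BROWSECOMP_QUERY_TEMPLATE are escapes that survive formatting as literal braces in the instructions shown to the model. The example question below is hypothetical.

```python
# Sketch of how the query template is filled in. The template is
# redefined here so the snippet is self-contained.
BROWSECOMP_QUERY_TEMPLATE = """
{question}

Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}
"""

# {question} is substituted; {{...}} collapses to literal {...}.
prompt = BROWSECOMP_QUERY_TEMPLATE.format(
    question="What is the capital of France?"
)
print(prompt)
```

The grader templates work the same way, with `question`, `correct_answer`, and `response` as the format fields.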