4/27/2023

In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Below are some of the criteria we look for in a good eval:

- Includes at least 100 high-quality examples. This means either a correct answer for Basic evals and the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
- Includes good signal around what the right behavior is.
- Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. For example, we can create an eval on cases where the model fails to reason about the physical world.
- Thematically consistent: we'd like to see a number of prompts all demonstrating some particular failure mode.

For reference, here is our NAACL 2022 paper. QuALITY has been a useful benchmark for scalable oversight research (Bowman et al., 2022) and other alignment research (e.g., Kadavath et al., 2022), as well as for long-context understanding (Shaham et al., 2022). All the articles are either CC-BY or in the public domain.
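To make the criteria above concrete, here is a minimal sketch of how a set of eval examples might be written out as JSONL, one JSON object per line. The `"input"`/`"ideal"` field names and the filename are illustrative assumptions modeled on common chat-style eval formats, not a confirmed schema.

```python
import json

# A hedged sketch: each sample pairs a chat-style prompt with an ideal
# answer. The "input"/"ideal" schema here is an assumption, not a
# confirmed specification of any particular eval framework.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "If I drop a glass onto a tile floor, is it likely to break?"},
        ],
        "ideal": "Yes",
    },
    # ...a real submission would include at least 100 examples, all
    # demonstrating the same failure mode (here: physical-world reasoning).
]

# Write one JSON object per line (JSONL).
with open("physical_reasoning.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The one-object-per-line layout keeps large example sets easy to stream, diff, and spot-check, which matters once a set grows past 100 examples.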