4/27/2023

In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Below are some of the criteria we look for in a good eval:

- Includes at least 100 high-quality examples. This means either a correct answer for Basic evals and the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
- Includes good signal around what the right behavior is.
- Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. For example, we can create an eval on cases where the model fails to reason about the physical world.
- Thematically consistent: we'd like to see a number of prompts all demonstrating some particular failure mode.

For reference, here is our NAACL 2022 paper. QuALITY has been a useful benchmark for scalable oversight research (Bowman et al., 2022) and other alignment research (e.g., Kadavath et al., 2022), as well as for long-context understanding (Shaham et al., 2022). All the articles are either CC-BY or in the public domain.
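To make the criteria above concrete, here is a minimal sketch of how a set of eval examples might be written out as JSONL, one JSON object per line. The `"input"`/`"ideal"` field names and the filename are illustrative assumptions modeled on common chat-style eval formats, not a confirmed schema.

```python
import json

# A hedged sketch: each sample pairs a chat-style prompt with an ideal
# answer. The "input"/"ideal" schema here is an assumption, not a
# confirmed specification of any particular eval framework.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "If I drop a glass onto a tile floor, is it likely to break?"},
        ],
        "ideal": "Yes",
    },
    # ...a real submission would include at least 100 examples, all
    # demonstrating the same failure mode (here: physical-world reasoning).
]

# Write one JSON object per line (JSONL).
with open("physical_reasoning.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The one-object-per-line layout keeps large example sets easy to stream, diff, and spot-check, which matters once a set grows past 100 examples.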