OpenAI is open-sourcing evals for the new GPT-4 Turbo
OpenAI today announced that it is open-sourcing a GitHub repository to run popular evals on various models including the new GPT-4 Turbo.
The company has improved the writing, math, logical reasoning, and coding capabilities of the new GPT-4 Turbo. The model's responses are more direct and less verbose, and they use more conversational language than its predecessor's.
The repository on GitHub contains a library for evaluating language models. It now includes the following evals:
- MMLU: Measuring Massive Multitask Language Understanding
- MATH: Measuring Mathematical Problem Solving With the MATH Dataset
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
- MGSM: Multilingual Grade School Math Benchmark, from Language Models are Multilingual Chain-of-Thought Reasoners
- HumanEval: Evaluating Large Language Models Trained on Code
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Evals are sensitive to prompting, and the formulations used in recent publications and libraries vary. Many of these formulations are carryovers from evaluating base models and from models that were worse at following instructions.
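To make the point about prompt sensitivity concrete, here is a minimal Python sketch. It is not code from OpenAI's repository: the question, the two prompt builders, the answer regex, and every function name are hypothetical and chosen only for illustration. It builds the same multiple-choice item in an instruction-style format and in a completion-style few-shot format, then scores two plausible responses with a single answer extractor.

```python
import re

# Hypothetical multiple-choice item, made up for illustration only.
QUESTION = "What is 2 + 3?"
CHOICES = {"A": "4", "B": "5", "C": "6", "D": "7"}
GOLD = "B"


def instruction_prompt(question, choices):
    """Instruction-style prompt that asks for a fixed final-answer line."""
    options = "\n".join(f"{key}) {text}" for key, text in choices.items())
    return (
        f"{question}\n{options}\n\n"
        "Think step by step, then finish with a line of the form "
        "'Answer: X', where X is one of A, B, C, D."
    )


def few_shot_prompt(question, choices, examples):
    """Completion-style prompt built from worked (question, answer) pairs."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    options = " ".join(f"({key}) {text}" for key, text in choices.items())
    return f"{shots}Q: {question} {options}\nA:"


# Assumed answer format: the extractor only recognises 'Answer: <letter>'.
ANSWER_RE = re.compile(r"Answer\s*:\s*([A-D])\b")


def extract_answer(response):
    """Return the chosen letter, or None if the expected format is missing."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else None


def score(response, gold):
    """1.0 if the extracted letter equals the gold letter, else 0.0."""
    return 1.0 if extract_answer(response) == gold else 0.0


if __name__ == "__main__":
    print(instruction_prompt(QUESTION, CHOICES))
    print(few_shot_prompt(QUESTION, CHOICES, [("What is 1 + 1? (A) 2 (B) 3", "(A) 2")]))
    # Both hypothetical responses pick the right option, but only the first
    # follows the format the extractor expects.
    print(score("2 + 3 equals 5.\nAnswer: B", GOLD))
    print(score(" (B) 5", GOLD))
```

In this sketch the second response names the correct option yet scores 0.0, because the extractor was written for the instruction-style format. A mismatch of this kind between prompt formulation and scoring is one way reported numbers can diverge across publications and libraries.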
OpenAI (@OpenAI), April 12, 2024: "For example, when writing with ChatGPT, responses will be more direct, less verbose, and use more conversational language." pic.twitter.com/PHxrmCtpyl