
OpenAI open-sourcing new GPT-4 Turbo evals

OpenAI today announced that it is open-sourcing a GitHub repository for running popular evals on various models, including the new GPT-4 Turbo.
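
For a concrete sense of what running an eval against GPT-4 Turbo involves, here is a minimal sketch that sends a single benchmark-style question to the model through the official OpenAI Python SDK. The model identifier, the question, and the scoring comment are illustrative assumptions and are not taken from OpenAI's repository.

```python
# Minimal sketch: send one benchmark-style question to GPT-4 Turbo.
# Assumes the official `openai` Python SDK (v1+) and an OPENAI_API_KEY
# in the environment; the model name and question are illustrative only.
from openai import OpenAI

client = OpenAI()

question = (
    "A train travels 60 km in 45 minutes. "
    "What is its average speed in km/h? "
    "Answer with a single number."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed model identifier
    messages=[{"role": "user", "content": question}],
    temperature=0,        # deterministic output makes scoring reproducible
)

answer = response.choices[0].message.content.strip()
print(answer)  # an eval harness would compare this against the reference answer, 80
```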

The new GPT-4 Turbo brings improved writing, math, logical reasoning, and coding capabilities. Its responses are more direct and less verbose, and they use more conversational language than those of its predecessor.


OpenAI GPT-4 Turbo (Image Credit: OpenAI)

The repository on GitHub contains a library of evals for language models. It currently includes the following benchmarks; a toy example of scoring one such item follows the list:

  • MMLU: Measuring Massive Multitask Language Understanding
  • MATH: Measuring Mathematical Problem Solving With the MATH Dataset
  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark
  • DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
  • MGSM: Multilingual Grade School Math Benchmark, from Language Models are Multilingual Chain-of-Thought Reasoners
  • HumanEval: Evaluating Large Language Models Trained on Code
  • MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
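
To illustrate how a library like this might score a single benchmark item, the sketch below formats a toy MMLU-style multiple-choice question into a prompt, parses the model's reply for a letter choice, and grades it by exact match. The question, prompt wording, and grading regex are invented for illustration and do not come from OpenAI's repository.

```python
# Toy sketch of scoring one MMLU-style multiple-choice item.
# The question, prompt wording, and grading regex are illustrative only
# and are not taken from OpenAI's repository.
import re
from openai import OpenAI

client = OpenAI()

item = {
    "question": "Which planet is closest to the Sun?",
    "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
    "answer": "B",
}

prompt = (
    item["question"]
    + "\n"
    + "\n".join(f"{letter}) {text}" for letter, text in item["choices"].items())
    + "\nAnswer with the letter of the correct choice."
)

reply = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed model identifier
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
).choices[0].message.content or ""

# Pull the first standalone A-D letter out of the reply and grade by exact match.
match = re.search(r"\b([ABCD])\b", reply)
predicted = match.group(1) if match else None
print("correct" if predicted == item["answer"] else "incorrect")
```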

Evals are sensitive to prompting, and there is significant variation in the formulations used in recent publications and libraries. Many of these prompting approaches are carryovers from evaluating base models and from models that were worse at following instructions.
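
As a rough illustration of that variation, the snippet below contrasts two invented prompt templates for the same question: a few-shot, completion-style prompt of the kind used with base models, and a plain zero-shot prompt that relies on instruction following. Neither template is taken from OpenAI's repository.

```python
# Two illustrative prompt styles for the same question (both templates
# are invented for illustration, not the repository's actual formulations).

# Few-shot, completion-style prompt, common when evaluating base models:
# worked examples are supplied so the model can imitate the pattern.
few_shot_prompt = """\
Q: What is 12 + 7?
A: 19

Q: What is 9 * 6?
A: 54

Q: What is 15 - 8?
A:"""

# Zero-shot, instruction-following style: the question is simply asked,
# relying on the model's ability to follow instructions directly.
zero_shot_prompt = "What is 15 - 8? Answer with a single number."

print(few_shot_prompt)
print(zero_shot_prompt)
```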
