Large language models (LLMs) such as ChatGPT have the potential to revolutionize
academic teaching in physics in much the same way as the electronic calculator,
the home computer or the internet did. AI models are patient, produce answers
tailored to a student’s needs and are accessible at any time. Those involved
in academic teaching are facing a number of questions: Just how reliable are
publicly accessible models in answering physics questions, how does the question’s
language affect the models’ performance, and how well do the models perform on more
difficult tasks beyond retrieval? To address these questions, we benchmark a number of
publicly available models on the mlphys101 dataset, a new set of 823 university-level MC5
questions and answers released alongside this work. While the original questions
are in English, we employ GPT-4 to translate them into various other languages;
the translations are then revised and refined by native speakers. Our findings indicate that
state-of-the-art models perform well on questions involving the replication of facts,
definitions, and basic concepts, but struggle with multi-step quantitative reasoning.
This aligns with existing literature that highlights the challenges LLMs face
in mathematical and logical reasoning tasks. We conclude that the most advanced
current LLMs are a valuable addition to the academic curriculum and that LLM-powered
translations are a viable method to increase the accessibility of materials, but
their utility for more difficult quantitative tasks remains limited.
The dataset is currently available here in English only and will be removed once the mlphys101 publication has been accepted and released to the public.