TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students

Daniel Weitekamp, Momin N. Siddiqui, and Christopher J. MacLellan
Georgia Institute of Technology
*Equal contribution
Figure: TutorGym is a testbed for evaluating AI agents as tutors and students.

Abstract

Recent improvements in large language model (LLM) performance on academic benchmarks, such as MATH and GSM8K, have emboldened their use as standalone tutors and as simulations of human learning. However, these new applications require more than the ability to generate problem solutions. To evaluate these applications more directly, we introduce TutorGym, a standardized interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITSs) that have been tested and refined in classroom studies, including CTAT tutors, Apprentice Tutors, and OATutors.

TutorGym is more than a simple problem-solution benchmark: it situates AI agents within the interactive interfaces of existing ITSs. At each step of problem-solving, AI agents are asked what they would do as a tutor or as a learner. As tutors, AI agents are prompted to provide tutoring support, such as generating examples, hints, and step-level correctness feedback, which can be evaluated directly against the adaptive step-by-step support provided by existing ITSs. As students, agents learn directly from ITS instruction, and their mistakes and learning trajectories can be compared to human student data.
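
To make the step-by-step interaction protocol concrete, here is a minimal sketch of a gym-style tutor loop. It is illustrative only: `ToyTutorEnv`, `reset`, and `step` are hypothetical stand-ins invented for this sketch, not TutorGym's actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: dict      # current interface state (fields filled so far)
    correct: bool    # step-level correctness feedback from the ITS
    done: bool       # whether the problem is complete

class ToyTutorEnv:
    """Toy stand-in for an ITS: one fraction-addition problem, graded step by step."""
    def __init__(self):
        self.answers = {"numerator": "5", "denominator": "6"}  # 1/2 + 1/3 = 5/6
        self.filled = {}

    def reset(self) -> dict:
        self.filled = {}
        return {"problem": "1/2 + 1/3", "filled": dict(self.filled)}

    def step(self, selection: str, value: str) -> Step:
        # Grade one interface action: which field was acted on, and with what value.
        correct = self.answers.get(selection) == value
        if correct:
            self.filled[selection] = value
        done = self.filled == self.answers
        return Step({"problem": "1/2 + 1/3", "filled": dict(self.filled)}, correct, done)

env = ToyTutorEnv()
state = env.reset()
# An agent (LLM, RL policy, or computational model of learning) would choose
# these actions; here they are hard-coded for illustration.
for selection, value in [("denominator", "6"), ("numerator", "5")]:
    fb = env.step(selection, value)
    print(f"{selection} <- {value}: {'correct' if fb.correct else 'incorrect'}")
print("problem complete:", fb.done)
```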

TutorGym establishes a common framework for training and evaluating diverse AI agents, including LLMs, computational models of learning, and reinforcement learning agents, within a growing suite of learning environments. Currently, TutorGym includes 223 different tutor domains. In an initial evaluation of LLM-based agents with TutorGym, we find that LLMs are relatively poor at tutoring in TutorGym's ITS interfaces, but show a remarkable ability to produce learning curves similar to human learning curves when trained against those ITSs with in-context learning.
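
For readers unfamiliar with learning-curve analysis, ITS research typically plots error rate against practice opportunity for each skill (knowledge component); a falling curve indicates learning. A minimal sketch of that aggregation, assuming an invented log format rather than TutorGym's actual schema:

```python
from collections import defaultdict

# Each record: (knowledge_component, opportunity_index, was_error)
# Hypothetical log entries for illustration only.
log = [
    ("add-fractions", 1, True), ("add-fractions", 2, True),
    ("add-fractions", 3, False), ("add-fractions", 4, False),
    ("find-lcd", 1, True), ("find-lcd", 2, False),
]

# Group error observations by (skill, opportunity count).
errors = defaultdict(list)
for kc, opp, err in log:
    errors[(kc, opp)].append(err)

# Mean error rate per skill and opportunity; comparing these curves to
# human student data is how agent learning trajectories are evaluated.
for (kc, opp) in sorted(errors):
    obs = errors[(kc, opp)]
    rate = sum(obs) / len(obs)
    print(f"{kc} opportunity {opp}: error rate {rate:.0%}")
```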

Results

Table 1: LLM accuracy at labeling correct and incorrect student actions and at generating next-step demonstrations (bottom-out hints) across the three tutor platforms.
| Tutor Platform | LLM Model | Correct Accuracy | Incorrect Accuracy | Demo Accuracy |
|---|---|---|---|---|
| CTAT Mathtutors (10 domains) | Sonnet-3.5 | 61.92% | 36.21% | 56.50% |
| | Haiku-3.5 | 81.06% | 25.05% | 36.49% |
| | GPT-4o | 28.11% | 42.63% | 38.89% |
| | DeepSeek-v2.5 | 59.96% | 30.84% | 39.33% |
| Apprentice (30 domains) | Sonnet-3.5 | 86.54% | 46.56% | 64.20% |
| | Haiku-3.5 | 88.36% | 22.34% | 49.73% |
| | GPT-4o | 74.61% | 49.80% | 70.75% |
| | DeepSeek-v2.5 | 82.00% | 48.92% | 58.35% |
| OATutor (183 domains) | Sonnet-3.5 | 79.59% | 38.85% | 52.10% |
| | Haiku-3.5 | 92.36% | 10.82% | 36.63% |
| | GPT-4o | 71.20% | 38.69% | 51.07% |
| | DeepSeek-v2.5 | 90.87% | 17.21% | 43.89% |
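
Our reading of the three metric columns: Correct Accuracy is the fraction of correct student actions the model also labels correct, Incorrect Accuracy is the fraction of incorrect actions it labels incorrect, and Demo Accuracy is the fraction of generated next steps that match a step the ITS would accept. A toy sketch under that assumption (the trial data below is invented for illustration):

```python
# Hypothetical per-trial outcomes; the real evaluation data format is assumed.
correct_trials   = [True, True, False]   # LLM judgments on correct student actions
incorrect_trials = [False, True, False]  # LLM judgments on incorrect student actions
demo_trials      = [True, False, True]   # did the LLM's demo match an ITS-accepted step?

# A hit on a correct action is labeling it correct; on an incorrect action,
# labeling it incorrect; on a demo trial, producing an accepted next step.
correct_acc   = sum(j is True for j in correct_trials) / len(correct_trials)
incorrect_acc = sum(j is False for j in incorrect_trials) / len(incorrect_trials)
demo_acc      = sum(demo_trials) / len(demo_trials)

print(f"Correct: {correct_acc:.0%}  Incorrect: {incorrect_acc:.0%}  Demo: {demo_acc:.0%}")
```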

Want to add your model or domain to TutorGym? Reach out to Danny or Momin!

Updates

April 2nd, 2025: We are going to AIED 2025! We will be presenting our work there.

BibTeX

@misc{weitekamp2025tutorgymtestbedevaluatingai,
      title={TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students}, 
      author={Daniel Weitekamp and Momin N. Siddiqui and Christopher J. MacLellan},
      year={2025},
      eprint={2505.01563},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.01563}, 
}

Collaborators