Research: Evaluating Generative AI Agents
- Ron Di Carlantonio
- May 26
Updated: Jun 9
Evaluating Generative AI Agents
From Academic Research to Real-World AI Evaluation
As organizations increasingly deploy AI Agents into products, services, vehicles, and operational systems, a critical challenge has emerged:
How do you measure whether an AI Agent is actually performing well?
While generative AI has made it easier than ever to build intelligent assistants, evaluating their quality, accuracy, completeness, and reliability remains one of the most difficult problems in the industry.
For iNAGO, this challenge has been the focus of a multi-year research collaboration with York University aimed at developing practical methods for evaluating AI Agents operating in real-world environments.

A Collaboration Between Industry and Academia
This research was conducted through a collaboration between iNAGO and York University under the leadership of Professor Aijun An, a leading researcher in artificial intelligence and machine learning.
The project brought together academic researchers and industry practitioners to explore how organizations can evaluate and improve AI Agents deployed within actual products, systems, and operational workflows.

Conference attendees discuss AI Agent evaluation methodologies with members of the research team.

Research Team
York University
Niloufar Beyranvand
Hamidreza Dastmalchi
Professor Aijun An
Heidar Davoudi
iNAGO
Winston Chan
Ron Di Carlantonio
The collaboration combined York University’s expertise in artificial intelligence research with iNAGO’s experience deploying AI technologies into real-world products and operational environments.
Supported by SmartTO
Early stages of this research were supported through the SmartTO Innovation Challenge Program.
SmartTO connects Ontario companies with leading academic researchers to solve practical technology challenges through collaborative innovation projects.
The program helped establish the partnership between iNAGO and York University and provided an opportunity to investigate one of the most important emerging questions in AI:
How can organizations objectively measure AI Agent quality?
The Challenge of AI Evaluation
Unlike traditional software systems, AI Agents do not always produce identical outputs.
Responses may vary depending on:
- User phrasing
- Context
- Knowledge sources
- Model selection
- Agent configuration
- External system integrations
As a result, organizations need new approaches for determining whether an AI Agent is performing reliably and consistently.
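One simple way to put a number on this variability is to send the same request to an agent several times and measure how much the responses agree with each other. The sketch below is only an illustration under stated assumptions: it is not the method from the EMNLP paper or from any iNAGO product, the function names are hypothetical, and word-overlap (Jaccard) similarity is a deliberately crude stand-in for a real semantic comparison.

```python
import re
from itertools import combinations


def token_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two responses, from 0.0 to 1.0."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0  # two empty responses are treated as identical
    return len(ta & tb) / len(ta | tb)


def consistency_score(responses):
    """Average pairwise similarity across repeated responses.

    `responses` is a list of strings collected by asking an agent the
    same question several times (hypothetical setup). A score near 1.0
    means the agent answered almost identically every time.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # zero or one response: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    runs = [
        "Set the tire pressure to 35 psi.",
        "Set the tire pressure to 35 psi.",
        "Check the owner's manual for the correct pressure.",
    ]
    print(round(consistency_score(runs), 2))
```

A low score does not by itself mean the agent is wrong, since two differently worded answers can both be correct. It only flags prompts where responses diverge enough to deserve a closer look.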
For applications in mobility, manufacturing, customer support, and enterprise operations, understanding AI quality becomes essential to building systems that users can trust.
Presented at EMNLP 2025
The results of this work were published and presented at EMNLP 2025 (Conference on Empirical Methods in Natural Language Processing), one of the world’s leading conferences for Natural Language Processing and Large Language Model research.
The paper explores practical methodologies for evaluating generative AI Agents using structured testing approaches and objective quality measurements.
The work contributes to a growing body of research focused on improving the reliability, trustworthiness, and deployment readiness of AI systems.

From Research to Evaluator
The ideas and methodologies developed through this research directly influenced the development of Evaluator, iNAGO’s AI Agent evaluation platform.
Evaluator helps organizations:
- Measure AI Agent quality
- Compare models and configurations
- Identify weaknesses and improvement opportunities
- Evaluate AI systems against manuals and domain-specific documentation
- Track performance improvements over time
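To make the idea of comparing configurations against documentation concrete, here is a minimal sketch. It does not describe how Evaluator works internally. It assumes each agent configuration can be called as a plain function from question to answer, and that each test case lists facts taken from a manual that a complete answer should mention. All names (`completeness`, `evaluate`, `compare`, the `required_facts` field) are hypothetical, and substring matching is a simplification of real answer grading.

```python
def completeness(answer, required_facts):
    """Fraction of required facts that appear in the answer (case-insensitive)."""
    if not required_facts:
        return 1.0  # nothing was required, so the answer is trivially complete
    text = answer.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)


def evaluate(agent, test_cases):
    """Average completeness of one agent over all test cases.

    `agent` is any callable taking a question string and returning an
    answer string (an assumption made for this sketch).
    """
    scores = [
        completeness(agent(case["question"]), case["required_facts"])
        for case in test_cases
    ]
    return sum(scores) / len(scores) if scores else 0.0


def compare(agents, test_cases):
    """Score several named agent configurations on the same test set."""
    return {name: evaluate(agent, test_cases) for name, agent in agents.items()}


if __name__ == "__main__":
    # Hypothetical test case derived from a vehicle manual.
    cases = [
        {
            "question": "What tire pressure should I use?",
            "required_facts": ["35 psi", "cold"],
        }
    ]
    # Stub agents standing in for two different configurations.
    agents = {
        "config_a": lambda q: "Inflate to 35 psi when the tires are cold.",
        "config_b": lambda q: "Inflate to 35 psi.",
    }
    print(compare(agents, cases))
```

Running the same test set on each new configuration and storing the scores would also give a simple way to track changes over time.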
The platform supports the complete AI lifecycle:
Build → Measure → Improve → Deploy
By combining academic research with practical deployment experience, Evaluator helps organizations develop AI systems that are more reliable, more controllable, and better suited for real-world use.
Looking Forward
As AI Agents move from demonstrations to production systems, evaluation will become an increasingly important capability.
Organizations need more than powerful AI models. They need confidence that those models perform correctly, consistently, and safely within real-world environments.
iNAGO continues to collaborate with academic and industry partners to advance the science of AI evaluation and bring research-driven methodologies into practical tools that organizations can use every day.
The research presented at EMNLP 2025 represents an important milestone in that journey and provides part of the foundation for the next generation of AI quality measurement and improvement technologies.
Research Paper
Evaluation of Generative AI Agents
Presented at EMNLP 2025
