Research: Evaluating Generative AI Agents

Ron Di Carlantonio
May 26

Updated: Jun 9

Evaluating Generative AI Agents

From Academic Research to Real-World AI Evaluation

As organizations increasingly deploy AI Agents into products, services, vehicles, and operational systems, a critical challenge has emerged:

How do you measure whether an AI Agent is actually performing well?

While generative AI has made it easier than ever to build intelligent assistants, evaluating their quality, accuracy, completeness, and reliability remains one of the most difficult problems in the industry.

For iNAGO, this challenge has been the focus of a multi-year research collaboration with York University aimed at developing practical methods for evaluating AI Agents operating in real-world environments.

A Collaboration Between Industry and Academia

This research was conducted through a collaboration between iNAGO and York University, led by Professor Aijun An, a leading researcher in artificial intelligence and machine learning.

The project brought together academic researchers and industry practitioners to explore how organizations can evaluate and improve AI Agents deployed within actual products, systems, and operational workflows.

Conference attendees discuss AI Agent evaluation methodologies with members of the research team.

Research Team

York University

Niloufar Beyranvand
Hamidreza Dastmalchi
Professor Aijun An
Heidar Davoudi

iNAGO

Winston Chan
Ron Di Carlantonio

The collaboration combined York University’s expertise in artificial intelligence research with iNAGO’s experience deploying AI technologies into real-world products and operational environments.

Supported by SmartTO

Early stages of this research were supported through the SmartTO Innovation Challenge Program.

SmartTO connects Ontario companies with leading academic researchers to solve practical technology challenges through collaborative innovation projects.

The program helped establish the partnership between iNAGO and York University and provided an opportunity to investigate one of the most important emerging questions in AI:

How can organizations objectively measure AI Agent quality?

The Challenge of AI Evaluation

Unlike traditional software systems, AI Agents do not always produce identical outputs for the same request.

Responses may vary depending on:

User phrasing
Context
Knowledge sources
Model selection
Agent configuration
External system integrations

As a result, organizations need new approaches for determining whether an AI Agent is performing reliably and consistently.
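One way to make "consistently" concrete is to ask an agent the same question in several phrasings and score how much its answers agree. The following is a minimal sketch under stated assumptions, not the method from the research or from any iNAGO product: the names `jaccard`, `consistency_score`, `evaluate_consistency`, and `toy_agent` are hypothetical, and word overlap is used only as a simple stand-in for a real quality measure.

```python
from itertools import combinations
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty responses count as identical
    return len(ta & tb) / len(ta | tb)


def consistency_score(responses: Sequence[str]) -> float:
    """Mean pairwise overlap across all responses (1.0 = same word sets)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # zero or one response: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def evaluate_consistency(
    agent: Callable[[str], str], paraphrases: Sequence[str]
) -> float:
    """Ask the same question in several phrasings and score answer agreement."""
    return consistency_score([agent(p) for p in paraphrases])


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call (assumption for this sketch)."""
    if "tire" in prompt.lower():
        return "Recommended tire pressure is 35 psi"
    return "I am not sure"


if __name__ == "__main__":
    questions = [
        "What is the tire pressure?",
        "How much air do my tires need?",
        "What PSI should the wheels be?",
    ]
    print(f"consistency: {evaluate_consistency(toy_agent, questions):.2f}")
```

Word overlap only detects changes in wording, so two answers that say the same thing differently would score low. A practical evaluation would likely replace `jaccard` with a semantic or reference-based comparison while keeping the same overall loop.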

For applications in mobility, manufacturing, customer support, and enterprise operations, understanding AI quality becomes essential to building systems that users can trust.

Presented at EMNLP 2025

The results of this work were published and presented at EMNLP 2025 (Conference on Empirical Methods in Natural Language Processing), one of the world’s leading conferences for Natural Language Processing and Large Language Model research.

The paper explores practical methodologies for evaluating generative AI Agents using structured testing approaches and objective quality measurements.

The work contributes to a growing body of research focused on improving the reliability, trustworthiness, and deployment readiness of AI systems.

Research findings on AI Agent evaluation presented at EMNLP 2025.

From Research to Evaluator

The ideas and methodologies developed through this research directly influenced the development of Evaluator, iNAGO’s AI Agent evaluation platform.

Evaluator helps organizations:

Measure AI Agent quality
Compare models and configurations
Identify weaknesses and improvement opportunities
Evaluate AI systems against manuals and domain-specific documentation
Track performance improvements over time

The platform supports the complete AI lifecycle:

Build → Measure → Improve → Deploy

By combining academic research with practical deployment experience, Evaluator helps organizations develop AI systems that are more reliable, more controllable, and better suited for real-world use.

Looking Forward

As AI Agents move from demonstrations to production systems, evaluation will become an increasingly important capability.

Organizations need more than powerful AI models. They need confidence that those models perform correctly, consistently, and safely within real-world environments.

iNAGO continues to collaborate with academic and industry partners to advance the science of AI evaluation and bring research-driven methodologies into practical tools that organizations can use every day.

The research presented at EMNLP 2025 represents an important milestone in that journey and provides part of the foundation for the next generation of AI quality measurement and improvement technologies.

Research Paper

Evaluation of Generative AI Agents

Presented at EMNLP 2025

