Evaluation

To evaluate your agent's performance, you can use LangSmith evaluations. You first need to define an evaluator function that judges the agent's results, such as its final output or trajectory. Depending on your evaluation technique, this may or may not involve a reference output:

<span id="__span-0-1">def evaluator(*, outputs: dict, reference_outputs: dict):
<span id="__span-0-2">    # compare agent outputs against reference outputs
<span id="__span-0-3">    output_messages = outputs["messages"]
<span id="__span-0-4">    reference_messages = reference["messages"]
<span id="__span-0-5">    score = compare_messages(output_messages, reference_messages)
<span id="__span-0-6">    return {"key": "evaluator_score", "score": score}

To get started, you can use the prebuilt evaluators from the AgentEvals package:

<span id="__span-1-1">pip install -U agentevals

Create an evaluator

A common way to evaluate agent performance is to compare the agent's trajectory (the order in which it calls tools) against a reference trajectory:

<span id="__span-2-1">import json
<span id="__span-2-2">from agentevals.trajectory.match import create_trajectory_match_evaluator
<span id="__span-2-3">
<span id="__span-2-4">outputs = [
<span id="__span-2-5">    {
<span id="__span-2-6">        "role": "assistant",
<span id="__span-2-7">        "tool_calls": [
<span id="__span-2-8">            {
<span id="__span-2-9">                "function": {
<span id="__span-2-10">                    "name": "get_weather",
<span id="__span-2-11">                    "arguments": json.dumps({"city": "san francisco"}),
<span id="__span-2-12">                }
<span id="__span-2-13">            },
<span id="__span-2-14">            {
<span id="__span-2-15">                "function": {
<span id="__span-2-16">                    "name": "get_directions",
<span id="__span-2-17">                    "arguments": json.dumps({"destination": "presidio"}),
<span id="__span-2-18">                }
<span id="__span-2-19">            }
<span id="__span-2-20">        ],
<span id="__span-2-21">    }
<span id="__span-2-22">]
<span id="__span-2-23">reference_outputs = [
<span id="__span-2-24">    {
<span id="__span-2-25">        "role": "assistant",
<span id="__span-2-26">        "tool_calls": [
<span id="__span-2-27">            {
<span id="__span-2-28">                "function": {
<span id="__span-2-29">                    "name": "get_weather",
<span id="__span-2-30">                    "arguments": json.dumps({"city": "san francisco"}),
<span id="__span-2-31">                }
<span id="__span-2-32">            },
<span id="__span-2-33">        ],
<span id="__span-2-34">    }
<span id="__span-2-35">]
<span id="__span-2-36">
<span id="__span-2-37"># Create the evaluator
<span id="__span-2-38">evaluator = create_trajectory_match_evaluator(
<span id="__span-2-39">    trajectory_match_mode="superset",  
<span id="__span-2-40">)
<span id="__span-2-41">
<span id="__span-2-42"># Run the evaluator
<span id="__span-2-43">result = evaluator(
<span id="__span-2-44">    outputs=outputs, reference_outputs=reference_outputs
<span id="__span-2-45">)

Next, learn how to customize the trajectory match evaluator.
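For example, the `superset` mode used above accepts extra tool calls in the agent's trajectory. A minimal sketch of tightening the comparison, assuming the match modes documented in AgentEvals ("strict", "unordered", "subset", "superset"):

# "strict" requires identical tool calls in identical order;
# "unordered" allows the same tool calls in any order.
strict_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

# With the outputs above, this scores False: the agent made an extra
# get_directions call that the reference trajectory does not contain.
result = strict_evaluator(
    outputs=outputs, reference_outputs=reference_outputs
)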

LLM-as-a-judge

You can also use an LLM-as-a-judge evaluator, which uses an LLM to compare the trajectory against the reference output and produce a score:

<span id="__span-3-1">import json
<span id="__span-3-2">from agentevals.trajectory.llm import (
<span id="__span-3-3">    create_trajectory_llm_as_judge,
<span id="__span-3-4">    TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE
<span id="__span-3-5">)
<span id="__span-3-6">
<span id="__span-3-7">evaluator = create_trajectory_llm_as_judge(
<span id="__span-3-8">    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
<span id="__span-3-9">    model="openai:o3-mini"
<span id="__span-3-10">)

Run the evaluator

To run the evaluator, you first need to create a LangSmith dataset. To use the prebuilt AgentEvals evaluators, the dataset should follow this schema:

  • Inputs: {"messages": [...]} — the input messages to invoke the agent with.
  • Outputs: {"messages": [...]} — the message history expected in the agent's output. For trajectory evaluation, you can choose to keep only the assistant messages (a sketch of creating such a dataset follows this list).
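As referenced above, here is a minimal sketch of creating a dataset with this schema; the dataset name and example messages are hypothetical placeholders:

from langsmith import Client

client = Client()

# Hypothetical dataset name; replace with your own.
dataset = client.create_dataset(dataset_name="agent-trajectory-eval")

# One example: the input messages the agent receives, and the
# assistant message (with its tool calls) expected in the output.
client.create_examples(
    inputs=[
        {"messages": [{"role": "user", "content": "What's the weather in SF?"}]}
    ],
    outputs=[
        {"messages": [{
            "role": "assistant",
            "tool_calls": [{
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "san francisco"}',
                }
            }],
        }]}
    ],
    dataset_id=dataset.id,
)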

API reference: create_react_agent

<span id="__span-4-1">from langsmith import Client
<span id="__span-4-2">from langgraph.prebuilt import create_react_agent
<span id="__span-4-3">from agentevals.trajectory.match import create_trajectory_match_evaluator
<span id="__span-4-4">
<span id="__span-4-5">client = Client()
<span id="__span-4-6">agent = create_react_agent(...)
<span id="__span-4-7">evaluator = create_trajectory_match_evaluator(...)
<span id="__span-4-8">
<span id="__span-4-9">experiment_results = client.evaluate(
<span id="__span-4-10">    lambda inputs: agent.invoke(inputs),
<span id="__span-4-11">    # replace with your dataset name
<span id="__span-4-12">    data="&lt;Name of your dataset&gt;",
<span id="__span-4-13">    evaluators=[evaluator]
<span id="__span-4-14">)