
Score Customer Support AI Responses with GPT-4 Judge Metrics

By Elvis Sarvia (@elvissaravia) on n8n.io

Score open-ended AI responses with a judge model. This template shows how to evaluate a customer support agent using a separate LLM that rates each response on correctness and helpfulness, going beyond what exact match scoring can capture.
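A minimal sketch of why exact-match scoring falls short for open-ended support answers. The two strings and the `exactMatch` helper are hypothetical illustrations, not part of the template or its Data Table.

```javascript
// Hypothetical test case: both strings are illustrative only.
const expected = "You can reset your password from Settings > Security.";
const response = "Head to Settings, open Security, and choose Reset password.";

// Exact-match scoring: 1 only when the normalized strings are identical.
function exactMatch(a, b) {
  const norm = (s) => s.trim().toLowerCase();
  return norm(a) === norm(b) ? 1 : 0;
}

// The wording differs, so exact match scores 0 even though the response
// points the customer to the same place as the expected answer.
console.log(exactMatch(response, expected)); // prints 0
```

A judge model that reads the question, the expected answer, and the response can instead grade on a 1-5 rubric, which is the approach this template takes.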

Category: AI & RAG · Trigger: Chat trigger · Nodes: 14 · Complexity: ★★★★☆ · AI nodes: yes
Integrations: Chat Trigger, Evaluation Trigger, Agent, OpenAI Chat, Evaluation, OpenAI

This workflow corresponds to n8n.io template #15134 — we link there as the canonical source.

This workflow follows the Agent → Chat Trigger recipe pattern.

The workflow JSON

Copy the full n8n JSON below, paste it into a new n8n workflow, add your credentials, and activate it.

{
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "nodes": [
    {
      "id": "0787733d-4c2f-43b4-a865-a8bf640982d9",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -448,
        -80
      ],
      "parameters": {
        "width": 668,
        "height": 832,
        "content": "# LLM-as-a-Judge Evaluation\n\n### How it works\n1. **Production path:** A Chat Trigger receives a customer question, the AI Agent generates a support response, and the result is returned via Return chat response.\n2. **Evaluation path:** The Evaluation Trigger reads test cases (question + expected answer) from a Data Table and feeds each one through the same AI Agent.\n3. **Routing:** The Evaluating? node sends production traffic downstream and evaluation traffic into the judge branch.\n4. **Judging:** A separate judge model (Judge - Score Response) evaluates each AI response on correctness and helpfulness, returning a 1-5 score for each dimension.\n5. **Recording:** Evaluation - Set Outputs and Set Metrics record scores in the Evaluations tab alongside token usage and execution time.\n\n### Setup\n1. Add credentials for the OpenAI Chat Model (production agent) and the judge model (evaluation).\n2. Create the Data Table with question + expected answer pairs that reflect real support scenarios.\n3. Open the Evaluations tab in this workflow and click Run Test to score the agent across your test cases.\n\n### Customization\n- Swap the judge model for any capable LLM (Claude, Gemini, GPT-5, etc.). Use a model at least as capable as the one being evaluated.\n- Replace the custom judge prompt with n8n's built-in Correctness or Helpfulness metrics for less setup.\n- Add domain-specific scoring criteria (tone, compliance, completeness) by extending the judge prompt.\n- Use comparative judging (compare prompt A vs prompt B) when iterating on prompts to get more consistent scores.\n\n---\nThis template is a learning companion to the **Production AI Playbook**, a series that explores strategies, shares best practices, and provides practical examples for building reliable AI systems in n8n."
      },
      "typeVersion": 1
    },
    {
      "id": "ec465cf9-b6ec-4ecd-8169-3767ddcb89d0",
      "name": "When chat message received",
      "type": "@n8n/n8n-nodes-langchain.chatTrigger",
      "position": [
        528,
        256
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 1.1
    },
    {
      "id": "fa18df09-357f-44fb-9c77-70e0551b4ff1",
      "name": "When fetching a dataset row",
      "type": "n8n-nodes-base.evaluationTrigger",
      "position": [
        304,
        64
      ],
      "parameters": {
        "source": "dataTable",
        "dataTableId": {
          "__rl": true,
          "mode": "list",
          "value": "VPCxS9mO1gPbvyRa",
          "cachedResultUrl": "/projects/5xhYaLjYeyMka6t9/datatables/VPCxS9mO1gPbvyRa",
          "cachedResultName": "Customer Support QA Test Cases"
        }
      },
      "typeVersion": 4.6
    },
    {
      "id": "d48695ff-3f43-4466-8daa-ad978ff3fa2f",
      "name": "Format eval input",
      "type": "n8n-nodes-base.code",
      "position": [
        528,
        64
      ],
      "parameters": {
        "jsCode": "const row = $input.first().json;\nreturn [{ json: { chatInput: row.input || row.question } }];"
      },
      "typeVersion": 2
    },
    {
      "id": "5a48c44b-8ece-41a8-81ef-d054d5eaa3d7",
      "name": "AI Agent",
      "type": "@n8n/n8n-nodes-langchain.agent",
      "position": [
        752,
        160
      ],
      "parameters": {
        "text": "={{ $json.chatInput }}",
        "options": {
          "systemMessage": "You are a friendly and knowledgeable customer support agent. Your role is to help customers with their questions about accounts, billing, subscriptions, and product features. Always be polite, provide clear and accurate information, and offer actionable next steps. If you are unsure about something, let the customer know and offer to escalate to a specialist."
        },
        "promptType": "define"
      },
      "typeVersion": 1.9
    },
    {
      "id": "9c4243b8-8f78-4276-a082-183c66f6a362",
      "name": "OpenAI Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        832,
        384
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-4o-mini",
          "cachedResultName": "GPT-4O-MINI"
        },
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "a45bd302-1f3d-43c6-a290-8d69d83fc4a8",
      "name": "Evaluating?",
      "type": "n8n-nodes-base.evaluation",
      "position": [
        1104,
        160
      ],
      "parameters": {
        "operation": "checkIfEvaluating"
      },
      "typeVersion": 4.6
    },
    {
      "id": "843103a0-454b-46cc-8227-bdcc3969b52f",
      "name": "Return chat response",
      "type": "n8n-nodes-base.noOp",
      "position": [
        1392,
        256
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "f398d26a-7c0f-4597-9557-19d6cb423fd8",
      "name": "Judge - Score Response",
      "type": "@n8n/n8n-nodes-langchain.openAi",
      "position": [
        1328,
        64
      ],
      "parameters": {
        "modelId": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-4o",
          "cachedResultName": "GPT-4O"
        },
        "options": {},
        "messages": {
          "values": [
            {
              "content": "=You are an expert evaluator assessing the quality of AI-generated customer support responses.\n\nEvaluate the following response on two dimensions:\n\n**Correctness (1-5):** Does the response contain accurate information? Does it align with the expected answer?\n- 5: Fully correct, matches expected answer\n- 4: Mostly correct, minor omissions\n- 3: Partially correct, some inaccuracies\n- 2: Mostly incorrect\n- 1: Completely wrong or hallucinated\n\n**Helpfulness (1-5):** Does the response actually help the user? Is it clear, actionable, and complete?\n- 5: Extremely helpful, clear next steps\n- 4: Helpful with minor gaps\n- 3: Somewhat helpful but vague\n- 2: Minimally helpful\n- 1: Not helpful at all\n\n---\n\n**User Question:** {{ $('When fetching a dataset row').first().json.input }}\n\n**Expected Answer:** {{ $('When fetching a dataset row').first().json.expected_output }}\n\n**AI Response:** {{ $json.output }}\n\n---\n\nRespond with ONLY valid JSON in this format:\n{\"correctness\": <1-5>, \"helpfulness\": <1-5>, \"correctness_justification\": \"<brief reason>\", \"helpfulness_justification\": \"<brief reason>\"}"
            }
          ]
        }
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.8
    },
    {
      "id": "b9d8be76-3b3f-40be-9a66-b1a5ab7fae88",
      "name": "Evaluation - Set Outputs",
      "type": "n8n-nodes-base.evaluation",
      "position": [
        1680,
        64
      ],
      "parameters": {
        "source": "dataTable",
        "outputs": {
          "values": [
            {
              "outputName": "correctness_score",
              "outputValue": "={{ JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).correctness }}"
            },
            {
              "outputName": "helpfulness_score",
              "outputValue": "={{ JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).helpfulness }}"
            },
            {
              "outputName": "avg_score",
              "outputValue": "={{ (JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).correctness + JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).helpfulness) / 2 }}"
            }
          ]
        },
        "dataTableId": {
          "__rl": true,
          "mode": "list",
          "value": "VPCxS9mO1gPbvyRa",
          "cachedResultUrl": "/projects/5xhYaLjYeyMka6t9/datatables/VPCxS9mO1gPbvyRa",
          "cachedResultName": "Customer Support QA Test Cases"
        }
      },
      "typeVersion": 4.6
    },
    {
      "id": "e1986910-eb71-4465-b501-8c6169dffa62",
      "name": "Set Metrics",
      "type": "n8n-nodes-base.evaluation",
      "position": [
        1904,
        64
      ],
      "parameters": {
        "metrics": {
          "assignments": [
            {
              "id": "m1",
              "name": "correctness",
              "type": "number",
              "value": "={{ JSON.parse($json.message.content.replace(/```json\\n?|\\n?```/g, '').trim()).correctness }}"
            },
            {
              "id": "m2",
              "name": "helpfulness",
              "type": "number",
              "value": "={{ JSON.parse($json.message.content.replace(/```json\\n?|\\n?```/g, '').trim()).helpfulness }}"
            }
          ]
        },
        "operation": "setMetrics"
      },
      "typeVersion": 4.6
    },
    {
      "id": "57639b04-68c1-4f42-922f-2ca1712a7eb8",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        240,
        -80
      ],
      "parameters": {
        "color": 7,
        "width": 448,
        "height": 640,
        "content": "## Receive Customer Query"
      },
      "typeVersion": 1
    },
    {
      "id": "c1115cdd-aa6d-42de-aeae-2c81859d377e",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        704,
        -80
      ],
      "parameters": {
        "color": 7,
        "width": 336,
        "height": 640,
        "content": "## Customer Query AI Reply"
      },
      "typeVersion": 1
    },
    {
      "id": "77114286-a214-46f3-80f7-41af85cc9ed0",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1056,
        -80
      ],
      "parameters": {
        "color": 7,
        "width": 1024,
        "height": 640,
        "content": "## Evaluate with LLM-as-a-Judge"
      },
      "typeVersion": 1
    }
  ],
  "connections": {
    "AI Agent": {
      "main": [
        [
          {
            "node": "Evaluating?",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Evaluating?": {
      "main": [
        [
          {
            "node": "Judge - Score Response",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Return chat response",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Format eval input": {
      "main": [
        [
          {
            "node": "AI Agent",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "AI Agent",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Judge - Score Response": {
      "main": [
        [
          {
            "node": "Evaluation - Set Outputs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Evaluation - Set Outputs": {
      "main": [
        [
          {
            "node": "Set Metrics",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "When chat message received": {
      "main": [
        [
          {
            "node": "AI Agent",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "When fetching a dataset row": {
      "main": [
        [
          {
            "node": "Format eval input",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
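
The Evaluation - Set Outputs and Set Metrics nodes above each strip Markdown code fences from the judge reply and call `JSON.parse` inside every expression. The sketch below shows the same logic written once as a plain function, for example for a Code node placed after the judge. It is an illustration under assumptions, not part of the template: `parseJudgeReply` is a hypothetical name, and the 1-5 range check is an added safeguard based on the rubric in the judge prompt.

```javascript
// Build the fence marker from single backticks.
const FENCE = "`".repeat(3);

// Strip optional Markdown code fences, parse the JSON once, validate the scores.
function parseJudgeReply(content) {
  const fencePattern = new RegExp(FENCE + "(?:json)?\\s*", "g");
  const cleaned = content.replace(fencePattern, "").trim();
  const parsed = JSON.parse(cleaned); // throws if the judge returned malformed JSON

  for (const key of ["correctness", "helpfulness"]) {
    const value = parsed[key];
    // Assumption: scores must be numbers in the rubric's 1-5 range.
    if (!Number.isFinite(value) || value < 1 || value > 5) {
      throw new Error(`Judge returned an invalid ${key}: ${value}`);
    }
  }

  return {
    correctness: parsed.correctness,
    helpfulness: parsed.helpfulness,
    avg_score: (parsed.correctness + parsed.helpfulness) / 2,
  };
}

// Example reply wrapped in a fenced block, as some models return it.
const reply = FENCE + 'json\n{"correctness": 4, "helpfulness": 5}\n' + FENCE;
console.log(parseJudgeReply(reply));
// prints { correctness: 4, helpfulness: 5, avg_score: 4.5 }
```

Parsing once and failing loudly on out-of-range values makes a malformed judge reply visible in the run instead of recording a silent `NaN` or `undefined` metric.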

Credentials you'll need

Each integration node will prompt for credentials when you import. We strip credential IDs before publishing — you'll add your own.
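
As a rough sketch, the nodes that will ask for credentials can be listed by scanning the workflow JSON for a `credentials` field. The `workflow` object below is a trimmed stand-in for the export above (only the fields the sketch reads), and `credentialNeeds` is a hypothetical helper name.

```javascript
// Trimmed stand-in for the exported workflow JSON.
const workflow = {
  nodes: [
    { name: "AI Agent", type: "@n8n/n8n-nodes-langchain.agent" },
    {
      name: "OpenAI Chat Model",
      type: "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      credentials: { openAiApi: { name: "<your credential>" } },
    },
    {
      name: "Judge - Score Response",
      type: "@n8n/n8n-nodes-langchain.openAi",
      credentials: { openAiApi: { name: "<your credential>" } },
    },
  ],
};

// Return one entry per node that declares credentials, with the credential types.
function credentialNeeds(wf) {
  return wf.nodes
    .filter((node) => node.credentials)
    .map((node) => ({ node: node.name, types: Object.keys(node.credentials) }));
}

console.log(credentialNeeds(workflow));
// Lists "OpenAI Chat Model" and "Judge - Score Response", both with type "openAiApi".
```

In this workflow both the production agent's chat model and the judge use an `openAiApi` credential, so a single OpenAI credential can serve both nodes.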



Source: https://n8n.io/workflows/15134/ (original creator credit).
