{
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "nodes": [
    {
      "id": "0787733d-4c2f-43b4-a865-a8bf640982d9",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -448,
        -80
      ],
      "parameters": {
        "width": 668,
        "height": 832,
        "content": "# LLM-as-a-Judge Evaluation\n\n### How it works\n1. **Production path:** A Chat Trigger receives a customer question, the AI Agent generates a support response, and the result is returned via Return chat response.\n2. **Evaluation path:** The Evaluation Trigger reads test cases (question + expected answer) from a Data Table and feeds each one through the same AI Agent.\n3. **Routing:** The Evaluating? node sends production traffic downstream and evaluation traffic into the judge branch.\n4. **Judging:** A separate judge model (Judge - Score Response) evaluates each AI response on correctness and helpfulness, returning a 1-5 score for each dimension.\n5. **Recording:** Evaluation - Set Outputs and Set Metrics record scores in the Evaluations tab alongside token usage and execution time.\n\n### Setup\n1. Add credentials for the OpenAI Chat Model (production agent) and the judge model (evaluation).\n2. Create the Data Table with question + expected answer pairs that reflect real support scenarios.\n3. Open the Evaluations tab in this workflow and click Run Test to score the agent across your test cases.\n\n### Customization\n- Swap the judge model for any capable LLM (Claude, Gemini, GPT-5, etc.). Use a model at least as capable as the one being evaluated.\n- Replace the custom judge prompt with n8n's built-in Correctness or Helpfulness metrics for less setup.\n- Add domain-specific scoring criteria (tone, compliance, completeness) by extending the judge prompt.\n- Use comparative judging (compare prompt A vs prompt B) when iterating on prompts to get more consistent scores.\n\n---\nThis template is a learning companion to the **Production AI Playbook**, a series that explores strategies, shares best practices, and provides practical examples for building reliable AI systems in n8n."
      },
      "typeVersion": 1
    },
    {
      "id": "ec465cf9-b6ec-4ecd-8169-3767ddcb89d0",
      "name": "When chat message received",
      "type": "@n8n/n8n-nodes-langchain.chatTrigger",
      "position": [
        528,
        256
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 1.1
    },
    {
      "id": "fa18df09-357f-44fb-9c77-70e0551b4ff1",
      "name": "When fetching a dataset row",
      "type": "n8n-nodes-base.evaluationTrigger",
      "position": [
        304,
        64
      ],
      "parameters": {
        "source": "dataTable",
        "dataTableId": {
          "__rl": true,
          "mode": "list",
          "value": "VPCxS9mO1gPbvyRa",
          "cachedResultUrl": "/projects/5xhYaLjYeyMka6t9/datatables/VPCxS9mO1gPbvyRa",
          "cachedResultName": "Customer Support QA Test Cases"
        }
      },
      "typeVersion": 4.6
    },
    {
      "id": "d48695ff-3f43-4466-8daa-ad978ff3fa2f",
      "name": "Format eval input",
      "type": "n8n-nodes-base.code",
      "position": [
        528,
        64
      ],
      "parameters": {
        "jsCode": "const row = $input.first().json;\nreturn [{ json: { chatInput: row.input || row.question } }];"
      },
      "typeVersion": 2
    },
    {
      "id": "5a48c44b-8ece-41a8-81ef-d054d5eaa3d7",
      "name": "AI Agent",
      "type": "@n8n/n8n-nodes-langchain.agent",
      "position": [
        752,
        160
      ],
      "parameters": {
        "text": "={{ $json.chatInput }}",
        "options": {
          "systemMessage": "You are a friendly and knowledgeable customer support agent. Your role is to help customers with their questions about accounts, billing, subscriptions, and product features. Always be polite, provide clear and accurate information, and offer actionable next steps. If you are unsure about something, let the customer know and offer to escalate to a specialist."
        },
        "promptType": "define"
      },
      "typeVersion": 1.9
    },
    {
      "id": "9c4243b8-8f78-4276-a082-183c66f6a362",
      "name": "OpenAI Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        832,
        384
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-4o-mini",
          "cachedResultName": "GPT-4O-MINI"
        },
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "a45bd302-1f3d-43c6-a290-8d69d83fc4a8",
      "name": "Evaluating?",
      "type": "n8n-nodes-base.evaluation",
      "position": [
        1104,
        160
      ],
      "parameters": {
        "operation": "checkIfEvaluating"
      },
      "typeVersion": 4.6
    },
    {
      "id": "843103a0-454b-46cc-8227-bdcc3969b52f",
      "name": "Return chat response",
      "type": "n8n-nodes-base.noOp",
      "position": [
        1392,
        256
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "f398d26a-7c0f-4597-9557-19d6cb423fd8",
      "name": "Judge - Score Response",
      "type": "@n8n/n8n-nodes-langchain.openAi",
      "position": [
        1328,
        64
      ],
      "parameters": {
        "modelId": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-4o",
          "cachedResultName": "GPT-4O"
        },
        "options": {},
        "messages": {
          "values": [
            {
              "content": "=You are an expert evaluator assessing the quality of AI-generated customer support responses.\n\nEvaluate the following response on two dimensions:\n\n**Correctness (1-5):** Does the response contain accurate information? Does it align with the expected answer?\n- 5: Fully correct, matches expected answer\n- 4: Mostly correct, minor omissions\n- 3: Partially correct, some inaccuracies\n- 2: Mostly incorrect\n- 1: Completely wrong or hallucinated\n\n**Helpfulness (1-5):** Does the response actually help the user? Is it clear, actionable, and complete?\n- 5: Extremely helpful, clear next steps\n- 4: Helpful with minor gaps\n- 3: Somewhat helpful but vague\n- 2: Minimally helpful\n- 1: Not helpful at all\n\n---\n\n**User Question:** {{ $('When fetching a dataset row').first().json.input }}\n\n**Expected Answer:** {{ $('When fetching a dataset row').first().json.expected_output }}\n\n**AI Response:** {{ $json.output }}\n\n---\n\nRespond with ONLY valid JSON in this format:\n{\"correctness\": <1-5>, \"helpfulness\": <1-5>, \"correctness_justification\": \"<brief reason>\", \"helpfulness_justification\": \"<brief reason>\"}"
            }
          ]
        }
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.8
    },
    {
      "id": "b9d8be76-3b3f-40be-9a66-b1a5ab7fae88",
      "name": "Evaluation - Set Outputs",
      "type": "n8n-nodes-base.evaluation",
      "position": [
        1680,
        64
      ],
      "parameters": {
        "source": "dataTable",
        "outputs": {
          "values": [
            {
              "outputName": "correctness_score",
              "outputValue": "={{ JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).correctness }}"
            },
            {
              "outputName": "helpfulness_score",
              "outputValue": "={{ JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).helpfulness }}"
            },
            {
              "outputName": "avg_score",
              "outputValue": "={{ (JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).correctness + JSON.parse($json.message.content.replace(/```json\\n?/g, '').replace(/```\\n?/g, '')).helpfulness) / 2 }}"
            }
          ]
        },
        "dataTableId": {
          "__rl": true,
          "mode": "list",
          "value": "VPCxS9mO1gPbvyRa",
          "cachedResultUrl": "/projects/5xhYaLjYeyMka6t9/datatables/VPCxS9mO1gPbvyRa",
          "cachedResultName": "Customer Support QA Test Cases"
        }
      },
      "typeVersion": 4.6
    },
    {
      "id": "e1986910-eb71-4465-b501-8c6169dffa62",
      "name": "Set Metrics",
      "type": "n8n-nodes-base.evaluation",
      "position": [
        1904,
        64
      ],
      "parameters": {
        "metrics": {
          "assignments": [
            {
              "id": "m1",
              "name": "correctness",
              "type": "number",
              "value": "={{ JSON.parse($json.message.content.replace(/```json\\n?|\\n?```/g, '').trim()).correctness }}"
            },
            {
              "id": "m2",
              "name": "helpfulness",
              "type": "number",
              "value": "={{ JSON.parse($json.message.content.replace(/```json\\n?|\\n?```/g, '').trim()).helpfulness }}"
            }
          ]
        },
        "operation": "setMetrics"
      },
      "typeVersion": 4.6
    },
    {
      "id": "57639b04-68c1-4f42-922f-2ca1712a7eb8",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        240,
        -80
      ],
      "parameters": {
        "color": 7,
        "width": 448,
        "height": 640,
        "content": "## Receive Customer Query"
      },
      "typeVersion": 1
    },
    {
      "id": "c1115cdd-aa6d-42de-aeae-2c81859d377e",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        704,
        -80
      ],
      "parameters": {
        "color": 7,
        "width": 336,
        "height": 640,
        "content": "## Customer Query AI Reply"
      },
      "typeVersion": 1
    },
    {
      "id": "77114286-a214-46f3-80f7-41af85cc9ed0",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1056,
        -80
      ],
      "parameters": {
        "color": 7,
        "width": 1024,
        "height": 640,
        "content": "## Evaluate with LLM-as-a-Judge"
      },
      "typeVersion": 1
    }
  ],
  "connections": {
    "AI Agent": {
      "main": [
        [
          {
            "node": "Evaluating?",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Evaluating?": {
      "main": [
        [
          {
            "node": "Judge - Score Response",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Return chat response",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Format eval input": {
      "main": [
        [
          {
            "node": "AI Agent",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "AI Agent",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Judge - Score Response": {
      "main": [
        [
          {
            "node": "Evaluation - Set Outputs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Evaluation - Set Outputs": {
      "main": [
        [
          {
            "node": "Set Metrics",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "When chat message received": {
      "main": [
        [
          {
            "node": "AI Agent",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "When fetching a dataset row": {
      "main": [
        [
          {
            "node": "Format eval input",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}