Automatically Discover and Extract Reports From Websites Using Gpt and…

Original n8n title: Automatically Discover and Extract Reports From Websites Using Gpt and Google Sheets

By Omer Fayyaz (@omerfayyaz) on n8n.io

AI-Powered Content Analysis - Uses language models (GPT-4/GPT-5.1) to understand page context and identify downloadable reports, even when links aren't explicitly labeled, including on pages with complex layouts and dynamic content.

Structured Output Parsing - Enforces JSON schema…

Category: AI & RAG · Trigger: Event · Nodes: 21 · Complexity: ★★★★☆ · AI nodes: yes
Integrations: Execute Workflow Trigger, Google Sheets, HTTP Request, Agent, Output Parser Structured, OpenAI Chat

This workflow corresponds to n8n.io template #11232 — we link there as the canonical source.

This workflow follows the Agent → Execute Workflow Trigger recipe pattern.

The workflow JSON

Copy or download the full n8n JSON below. Paste it into a new n8n workflow, add your credentials, and activate it.

{
  "id": "k2Tspf5WURvgp7Xj",
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "name": "AI-Powered Report Discovery Agent",
  "tags": [],
  "nodes": [
    {
      "id": "0d42d70c-807a-4a8b-9e9d-65fdb38826c9",
      "name": "Sticky Note - Introduction",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -352,
        80
      ],
      "parameters": {
        "width": 504,
        "height": 676,
        "content": "## AI-Powered Report Discovery Agent\n\nUse AI to browse publication websites and identify the latest relevant downloadable reports.\n\n### How it works  \n- **Trigger Sources**: Initiates from manual trigger, scheduled daily, or another workflow.  \n- **Source Data**: Reads active report sources from Google Sheets (e.g., \"Report Sources\").  \n- **Process Pages**: Loops over sources, fetches HTML publication pages, and converts to Markdown.  \n- **AI Extraction**: Uses AI to identify and extract the most recent, relevant downloadable report (PDF, DOCX, etc.).  \n- **Validation**: Verifies the report's validity (direct download link, correct format, etc.).  \n- **Save/Log**: Saves valid reports to \"Discovered Reports\" in Google Sheets. Logs no report found to \"Discovery Log\".  \n- **Completion Summary**: Records the number of sources processed and the timestamp of completion.\n\n### Setup steps  \n1. **Google Sheets Integration**: Connect your Google Sheets account and provide credentials.  \n2. **Configure Sheets**: Set sheet names for \"Report Sources\", \"Discovered Reports\", and \"Discovery Log\".  \n3. **Configure Trigger**: Choose how to trigger the workflow (manual, scheduled, or from another workflow).  \n4. **Run Workflow**: Activate and monitor the workflow for discovering reports and logging the results.\n"
      },
      "typeVersion": 1
    },
    {
      "id": "b7415f77-8fda-4849-8d1c-2a9331302bdc",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        224,
        176
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "fa175f95-4bb3-465b-a38a-c51b39077523",
      "name": "Schedule (Daily)",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [
        224,
        368
      ],
      "parameters": {
        "rule": {
          "interval": [
            {}
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "ef6fd252-30d6-4ece-863f-976aeed79a42",
      "name": "Called by Another Workflow",
      "type": "n8n-nodes-base.executeWorkflowTrigger",
      "position": [
        224,
        560
      ],
      "parameters": {
        "inputSource": "passthrough"
      },
      "typeVersion": 1.1
    },
    {
      "id": "65f6ead5-c16c-44c1-af09-7fc51341fa34",
      "name": "Read Active Sources",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        544,
        368
      ],
      "parameters": {
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Report Sources"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "961d9fa1-2cf5-447a-93c4-676708f6759c",
      "name": "Loop Over Sources",
      "type": "n8n-nodes-base.splitInBatches",
      "position": [
        816,
        368
      ],
      "parameters": {
        "options": {
          "reset": false
        }
      },
      "typeVersion": 3
    },
    {
      "id": "43b9d8cd-3ded-48a8-95a4-511da002cbe7",
      "name": "Fetch Publication Page",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueRegularOutput",
      "position": [
        1264,
        384
      ],
      "parameters": {
        "url": "={{ $json.Source_URL }}",
        "options": {
          "timeout": 30000
        },
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            {
              "name": "User-Agent",
              "value": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/0.0.0.0 Safari/537.36"
            },
            {
              "name": "Accept",
              "value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            },
            {
              "name": "Accept-Language",
              "value": "en-US,en;q=0.5"
            }
          ]
        }
      },
      "typeVersion": 4.2
    },
    {
      "id": "18b5b93b-572a-48c6-9f47-f9ba63fa51e5",
      "name": "Convert HTML to Markdown",
      "type": "n8n-nodes-base.markdown",
      "onError": "continueRegularOutput",
      "position": [
        1488,
        384
      ],
      "parameters": {
        "html": "={{ $json.data }}",
        "options": {}
      },
      "typeVersion": 1
    },
    {
      "id": "c34ea7e5-ca8d-425d-b233-14d6f1ac229d",
      "name": "AI Report Discovery Agent",
      "type": "@n8n/n8n-nodes-langchain.agent",
      "position": [
        1712,
        256
      ],
      "parameters": {
        "text": "={{ $json.data }}",
        "options": {
          "systemMessage": "# \ud83d\udd0d Report Discovery Agent\n\nYou are a **Report Discovery Agent** operating inside an n8n AI workflow.\nYour role is to analyze publication pages and identify the **latest downloadable reports, research papers, or data files**.\n\n---\n\n## \ud83c\udfaf OBJECTIVE\n\nAnalyze the provided page content and extract information about the **most recent and relevant downloadable report**.\n\n**You must return exactly ONE report** \u2014 the most recent and relevant one on the page.\n\n---\n\n## \ud83d\udccb EXTRACTION RULES\n\n### 1. **Identify Downloadable Content**\n- Look for links to: PDFs, Excel files, Word documents, PowerPoint presentations\n- Common patterns: \"Download\", \"Report\", \"Research\", \"Analysis\", \"White Paper\", \"Study\"\n- File extensions: `.pdf`, `.xlsx`, `.xls`, `.doc`, `.docx`, `.pptx`\n\n### 2. **Prioritize Recency**\n- Select the **most recently published** report\n- Look for date indicators: publication dates, issue numbers, version numbers\n- If multiple reports exist, choose the one at the top of the list (usually newest)\n\n### 3. **Validate Links**\n- Ensure the link is a **direct download URL** (not a landing page)\n- Convert relative URLs to absolute URLs using the base domain\n- Exclude navigation links, category pages, or signup pages\n\n### 4. **Extract Metadata**\n- **title**: The report's official title or headline\n- **link**: Full absolute URL to download the file\n- **file_type**: The file format (pdf, xlsx, doc, pptx)\n- **description**: Brief 1-2 sentence summary of what the report covers\n\n---\n\n## \u26a0\ufe0f IMPORTANT RULES\n\n1. **Always return valid JSON** matching the schema\n2. **Never return null for required fields** \u2014 use \"Unknown\" if not found\n3. **Links must be absolute URLs** starting with http:// or https://\n4. **Only return ONE report** \u2014 the best/newest match\n5. 
**If no valid report found**, return the schema with \"No report found\" as title\n\n---\n\n## \ud83d\udcd6 EXAMPLES\n\n**Good Output:**\n```json\n{\n  \"source\": \"Industry Research Corp\",\n  \"title\": \"Q4 2024 Market Analysis Report\",\n  \"link\": \"https://example.com/reports/q4-2024-analysis.pdf\",\n  \"file_type\": \"pdf\",\n  \"description\": \"Comprehensive analysis of Q4 2024 market trends and forecasts.\"\n}\n```\n\n**If No Report Found:**\n```json\n{\n  \"source\": \"Industry Research Corp\",\n  \"title\": \"No report found\",\n  \"link\": \"\",\n  \"file_type\": \"\",\n  \"description\": \"No downloadable reports were found on this page.\"\n}\n```"
        },
        "promptType": "define",
        "hasOutputParser": true
      },
      "typeVersion": 2.2
    },
    {
      "id": "5bb62ec3-d3e9-4523-8d9b-8459680a260b",
      "name": "Structured Output Parser",
      "type": "@n8n/n8n-nodes-langchain.outputParserStructured",
      "position": [
        1856,
        480
      ],
      "parameters": {
        "jsonSchemaExample": "{\n  \"source\": \"Publisher Name\",\n  \"title\": \"Report Title\",\n  \"link\": \"https://example.com/report.pdf\",\n  \"file_type\": \"pdf\",\n  \"description\": \"Brief description of the report content\"\n}"
      },
      "typeVersion": 1.3
    },
    {
      "id": "d6edbc05-c767-49c3-ab7b-cb93d9116036",
      "name": "Validate & Normalize Output",
      "type": "n8n-nodes-base.code",
      "position": [
        2064,
        256
      ],
      "parameters": {
        "jsCode": "// Extract and validate AI output\nconst results = [];\n\nfor (const item of items) {\n  const output = item.json.output || item.json;\n  const sourceData = $('Loop Over Sources').item.json;\n  \n  // Get values with fallbacks\n  const source = output.source || sourceData.Source_Name || \"Unknown\";\n  const title = output.title || \"No title\";\n  const link = output.link || \"\";\n  const fileType = output.file_type || \"\";\n  const description = output.description || \"\";\n  \n  // Validate the result\n  const isValid = link && \n                  link.startsWith(\"http\") && \n                  title !== \"No report found\" &&\n                  title !== \"\";\n  \n  // Determine status\n  let status = \"Discovered\";\n  if (!isValid) {\n    status = \"No Report Found\";\n  } else if (!link.includes(\".pdf\") && !link.includes(\".xlsx\") && !link.includes(\".doc\")) {\n    status = \"Link May Not Be Direct Download\";\n  }\n  \n  results.push({\n    json: {\n      source: source,\n      title: title,\n      link: link,\n      fileType: fileType,\n      description: description,\n      sourceUrl: sourceData.Source_URL || \"\",\n      category: sourceData.Category || \"General\",\n      discoveredAt: new Date().toISOString(),\n      status: status,\n      isValid: isValid\n    }\n  });\n}\n\nreturn results;"
      },
      "typeVersion": 2
    },
    {
      "id": "c065a81e-0f3b-49fa-87d5-6a189c3d43af",
      "name": "Valid Report Found?",
      "type": "n8n-nodes-base.if",
      "position": [
        2288,
        256
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "valid-check",
              "operator": {
                "type": "boolean",
                "operation": "equals"
              },
              "leftValue": "={{ $json.isValid }}",
              "rightValue": true
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "1759a157-62dd-4973-afa8-57547b808457",
      "name": "Save Discovered Report",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2768,
        480
      ],
      "parameters": {
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Discovered Reports"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "e6c67ab6-4c8e-4e27-b720-90a33b091c79",
      "name": "Log No Report Found",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2528,
        480
      ],
      "parameters": {
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Discovery Log"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "312b3ff5-753a-4f31-aadd-74484b799d0f",
      "name": "Completion Summary",
      "type": "n8n-nodes-base.set",
      "position": [
        1024,
        256
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "count",
              "name": "sourcesChecked",
              "type": "number",
              "value": "={{ $items().length }}"
            },
            {
              "id": "timestamp",
              "name": "completedAt",
              "type": "string",
              "value": "={{ $now.toISO() }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "5e572710-10e7-43ba-9cd3-acf97cb80c30",
      "name": "OpenAI GPT-5.1",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        1712,
        480
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-5.1",
          "cachedResultName": "gpt-5.1"
        },
        "options": {
          "temperature": 0.1
        }
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "6ce4884b-3b66-40de-9c4b-cdf858a81b1f",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        448,
        112
      ],
      "parameters": {
        "color": 7,
        "width": 272,
        "height": 560,
        "content": "## Read the Source URLs"
      },
      "typeVersion": 1
    },
    {
      "id": "e96b1aee-6d93-4596-82db-167abfa514ba",
      "name": "Sticky Note1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1216,
        112
      ],
      "parameters": {
        "color": 7,
        "width": 400,
        "height": 560,
        "content": "## Fetch the Publication and Convert to Markdown"
      },
      "typeVersion": 1
    },
    {
      "id": "255ff434-2189-4e76-8cbb-dde9d6d29bf0",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1664,
        112
      ],
      "parameters": {
        "color": 7,
        "width": 320,
        "height": 560,
        "content": "## Process the Publication using the LLM"
      },
      "typeVersion": 1
    },
    {
      "id": "b3495bdd-01ae-47f2-bde4-c64341692c0c",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2016,
        112
      ],
      "parameters": {
        "color": 7,
        "width": 432,
        "height": 560,
        "content": "## Validate the AI Output\n"
      },
      "typeVersion": 1
    },
    {
      "id": "9e6d8ea9-7de5-4225-bacf-b300611a6eb9",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2480,
        112
      ],
      "parameters": {
        "color": 7,
        "width": 448,
        "height": 560,
        "content": "## Log the Results in Google Sheets\n"
      },
      "typeVersion": 1
    }
  ],
  "active": false,
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "f3ec8d20-c4e7-4c50-9f3c-ae87725de7e6",
  "connections": {
    "Manual Trigger": {
      "main": [
        [
          {
            "node": "Read Active Sources",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI GPT-5.1": {
      "ai_languageModel": [
        [
          {
            "node": "AI Report Discovery Agent",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Schedule (Daily)": {
      "main": [
        [
          {
            "node": "Read Active Sources",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Loop Over Sources": {
      "main": [
        [
          {
            "node": "Completion Summary",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Fetch Publication Page",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Log No Report Found": {
      "main": [
        [
          {
            "node": "Loop Over Sources",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Read Active Sources": {
      "main": [
        [
          {
            "node": "Loop Over Sources",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Valid Report Found?": {
      "main": [
        [
          {
            "node": "Save Discovered Report",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Log No Report Found",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Fetch Publication Page": {
      "main": [
        [
          {
            "node": "Convert HTML to Markdown",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Save Discovered Report": {
      "main": [
        [
          {
            "node": "Loop Over Sources",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Convert HTML to Markdown": {
      "main": [
        [
          {
            "node": "AI Report Discovery Agent",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Structured Output Parser": {
      "ai_outputParser": [
        [
          {
            "node": "AI Report Discovery Agent",
            "type": "ai_outputParser",
            "index": 0
          }
        ]
      ]
    },
    "AI Report Discovery Agent": {
      "main": [
        [
          {
            "node": "Validate & Normalize Output",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Called by Another Workflow": {
      "main": [
        [
          {
            "node": "Read Active Sources",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Validate & Normalize Output": {
      "main": [
        [
          {
            "node": "Valid Report Found?",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
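
The system prompt asks the model to turn relative links into absolute URLs, but the agent only receives the page Markdown, not the page address. One possible safeguard is to resolve the link against the source row's `Source_URL` during validation. The following is a minimal sketch of that idea in plain JavaScript, not the template author's code: `resolveLink`, `hasDownloadExtension`, and `normalizeReport` are hypothetical helper names, and the status strings and column names are taken from the workflow JSON above.

```javascript
// Hypothetical helpers: a standalone variant of the "Validate & Normalize Output" logic.
// File extensions follow the list in the agent's system prompt.
const DOWNLOAD_EXTENSIONS = [".pdf", ".xlsx", ".xls", ".docx", ".doc", ".pptx"];

// Resolve a possibly relative link against the page it was found on.
function resolveLink(link, baseUrl) {
  if (!link) return "";
  try {
    // Absolute links pass through unchanged; relative ones need a usable base.
    return new URL(link, baseUrl || undefined).href;
  } catch (err) {
    return ""; // unparseable link, or relative link without a base
  }
}

// Check the URL path only, so query strings such as "?dl=1" do not interfere.
function hasDownloadExtension(url) {
  try {
    const path = new URL(url).pathname.toLowerCase();
    return DOWNLOAD_EXTENSIONS.some((ext) => path.endsWith(ext));
  } catch (err) {
    return false;
  }
}

function normalizeReport(output, sourceData) {
  const title = output.title || "No title";
  const link = resolveLink(output.link, sourceData.Source_URL);
  const isValid = /^https?:\/\//i.test(link) && title !== "No report found";

  let status = "Discovered";
  if (!isValid) {
    status = "No Report Found";
  } else if (!hasDownloadExtension(link)) {
    status = "Link May Not Be Direct Download";
  }

  return {
    source: output.source || sourceData.Source_Name || "Unknown",
    title,
    link,
    status,
    isValid,
  };
}

// Example: a relative link from the model, resolved against the source page.
const report = normalizeReport(
  { title: "Q4 2024 Market Analysis Report", link: "/reports/q4-2024-analysis.pdf" },
  { Source_Name: "Industry Research Corp", Source_URL: "https://example.com/publications" }
);
console.log(report.link, report.status);
```

Inside an n8n Code node the same functions could be called once per item, with `sourceData` read from the loop node as the original code does. Matching on the URL path rather than the whole string is a design choice that avoids false positives from extensions that only appear in a query string.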

Credentials you'll need

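Each integration node will prompt for credentials when you import. We strip credential IDs before publishing, so you'll add your own. The JSON above references two credential types: Google Sheets OAuth2 (`googleSheetsOAuth2Api`, used by the three Google Sheets nodes) and OpenAI (`openAiApi`, used by the chat model node).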
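
To see which credentials an export expects before importing it, you can scan each node's `credentials` object. The sketch below assumes the export has the shape shown in the workflow JSON above (a top-level `nodes` array whose entries may carry a `credentials` map). `listCredentialTypes` is a hypothetical helper, and `sample` is a trimmed stand-in for the full export.

```javascript
// Hypothetical helper: group node names by the credential type they reference.
function listCredentialTypes(workflow) {
  const byType = new Map();
  for (const node of workflow.nodes || []) {
    // Nodes such as triggers and sticky notes have no "credentials" key.
    for (const credType of Object.keys(node.credentials || {})) {
      if (!byType.has(credType)) byType.set(credType, []);
      byType.get(credType).push(node.name);
    }
  }
  return Object.fromEntries(byType);
}

// Trimmed stand-in for the export; in practice, JSON.parse the downloaded file.
const sample = {
  nodes: [
    { name: "Manual Trigger" },
    {
      name: "Read Active Sources",
      credentials: { googleSheetsOAuth2Api: { name: "<your credential>" } },
    },
    {
      name: "OpenAI GPT-5.1",
      credentials: { openAiApi: { name: "<your credential>" } },
    },
  ],
};

console.log(listCredentialTypes(sample));
```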

About this workflow

Source: https://n8n.io/workflows/11232/ (original creator credit).