
Discover Article URLs From Any Website with GPT-5-mini and Google Sheets

By Omer Fayyaz (@omerfayyaz) on n8n.io

AI-Powered Intelligence - Uses GPT-5-mini to understand webpage context and distinguish actual articles from navigation pages, reducing false positives
Browser Spoofing - Includes realistic User-Agent headers and request patterns to avoid bot detection on publisher sites
Smart URL…

AI & RAG · Trigger: Cron / scheduled · Nodes: 19 · Complexity: ★★★★☆ · AI nodes: yes
Integrations: Google Sheets, HTTP Request, Agent, OpenAI Chat

This workflow corresponds to n8n.io template #11211 — we link there as the canonical source.

This workflow follows the Agent → Google Sheets recipe pattern — see all workflows that pair these two integrations.

The workflow JSON

Copy the full n8n JSON below, paste it into a new n8n workflow, add your credentials, and activate it.

{
  "id": "cH8l9UbxggwcnShG",
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "name": "AI-Powered Multi-Source URL Discovery Engine",
  "tags": [],
  "nodes": [
    {
      "id": "bce47b0d-6d94-4dbf-9b27-ef7b1f1385e9",
      "name": "Sticky Note - Introduction",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -400,
        80
      ],
      "parameters": {
        "color": 5,
        "width": 460,
        "height": 540,
        "content": "## \ud83d\ude80 AI-Powered Multi-Source URL Discovery Engine\n\nThis workflow automatically discovers article URLs from any website using AI intelligence.\n\n### How it works:\n1. **Input URLs** \u2192 Read seed URLs from Google Sheets\n2. **Fetch HTML** \u2192 Download webpage content with browser spoofing\n3. **Convert to Markdown** \u2192 Clean HTML for AI processing\n4. **AI Extraction** \u2192 LLM identifies valid article links\n5. **Normalize & Parse** \u2192 Clean URLs, remove duplicates\n6. **Save Results** \u2192 Append to Google Sheets with source tracking\n\n### Quick Setup:\n1. Connect your Google Sheets credentials\n2. Connect your OpenAI API credentials\n3. Create a sheet with columns: `URL`, `Source`, `Status`\n4. Add seed URLs to the input sheet\n5. Run the workflow!"
      },
      "typeVersion": 1
    },
    {
      "id": "e30f2f04-4f9f-4ca9-8561-236c6a69f234",
      "name": "Sticky Note - Input",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        336,
        80
      ],
      "parameters": {
        "color": 4,
        "width": 280,
        "height": 536,
        "content": "## \ud83d\udce5 Input Stage\n\nReads seed URLs from your Google Sheets.\n\n**Required Sheet Columns:**\n- `URL` - The webpage to crawl\n\n**Tip:** Add multiple publisher homepages, blog indexes, or news feeds to discover all their articles."
      },
      "typeVersion": 1
    },
    {
      "id": "6822173c-3e3a-45a3-9b35-495d9d933766",
      "name": "Sticky Note - Loop",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        720,
        80
      ],
      "parameters": {
        "color": 3,
        "width": 424,
        "height": 536,
        "content": "## \ud83d\udd04 Processing Loop\n\nProcesses each URL with rate limiting to avoid being blocked.\n\n**Features:**\n- Batch processing (1 at a time)\n- Wait node prevents rate limiting\n- Error handling continues on failure"
      },
      "typeVersion": 1
    },
    {
      "id": "156a24f3-8e3f-4ffa-853d-2da7721e6f82",
      "name": "Sticky Note - Fetch",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1424,
        80
      ],
      "parameters": {
        "color": 6,
        "width": 440,
        "height": 536,
        "content": "## \ud83c\udf10 Web Fetching\n\nFetches HTML with browser User-Agent to avoid bot detection.\n\n**Key Settings:**\n- Custom User-Agent header\n- Error handling: continue on failure\n- Converts HTML \u2192 Markdown for AI"
      },
      "typeVersion": 1
    },
    {
      "id": "6e4911b5-c368-43e2-8398-5814e7a972b4",
      "name": "Sticky Note - AI",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1952,
        80
      ],
      "parameters": {
        "color": 2,
        "width": 440,
        "height": 708,
        "content": "## \ud83e\udd16 AI URL Extraction\n\nThe AI Agent analyzes page content and extracts valid article URLs.\n\n**What it identifies:**\n\u2705 Article/blog/news URLs\n\u2705 Multi-word slugs with dates\n\u2705 Content pages\n\n**What it excludes:**\n\u274c Navigation pages\n\u274c Category/tag pages\n\u274c PDFs and downloads\n\u274c About/contact pages"
      },
      "typeVersion": 1
    },
    {
      "id": "7ee53dbe-63b3-4903-9a67-3169f7bbd2db",
      "name": "Sticky Note - Parser",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2464,
        80
      ],
      "parameters": {
        "color": 5,
        "width": 280,
        "height": 536,
        "content": "## \ud83d\udd27 URL Parser & Normalizer\n\nCleans and validates AI output.\n\n**Processing:**\n- Strips markdown code fences\n- Parses JSON array\n- Normalizes URLs (removes tracking params, hash fragments, and trailing slashes)\n- Removes duplicates\n- Handles edge cases"
      },
      "typeVersion": 1
    },
    {
      "id": "923f5e7d-57ac-47d9-9863-7164d666dbe0",
      "name": "Sticky Note - Output",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2848,
        80
      ],
      "parameters": {
        "color": 4,
        "width": 328,
        "height": 716,
        "content": "## \ud83d\udcbe Output Stage\n\nSaves discovered URLs to Google Sheets.\n\n**Output Columns:**\n- `URL` - Discovered article URL\n- `Source` - Publisher/category\n- `Status` - Set to \"Pending\"\n\n**Deduplication:**\nUses URL as match key to avoid duplicates."
      },
      "typeVersion": 1
    },
    {
      "id": "daf17cd7-1dc4-4770-99a1-70ee9e0b27c7",
      "name": "Daily Schedule (6 AM)",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [
        128,
        304
      ],
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "cronExpression",
              "expression": "0 6 * * *"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "a0ee457a-f997-454f-b4ee-ec411d40688f",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        128,
        496
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "8763f6c9-f9e0-4b48-be2d-7489f1db81d9",
      "name": "Read Seed URLs",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        432,
        400
      ],
      "parameters": {
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Seed URLs"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET_NAME"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "69fe92b6-996f-4225-9bda-7785bda6daae",
      "name": "Loop Over URLs",
      "type": "n8n-nodes-base.splitInBatches",
      "position": [
        784,
        400
      ],
      "parameters": {
        "options": {
          "reset": false
        }
      },
      "typeVersion": 3
    },
    {
      "id": "7173466b-9bef-4683-8358-86cb7726d357",
      "name": "Rate Limit (3s)",
      "type": "n8n-nodes-base.wait",
      "position": [
        1008,
        416
      ],
      "parameters": {
        "amount": 3
      },
      "typeVersion": 1.1
    },
    {
      "id": "27de86ef-f8e0-4b22-85eb-3d67a73340f9",
      "name": "Fetch Webpage HTML",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueRegularOutput",
      "position": [
        1488,
        416
      ],
      "parameters": {
        "url": "={{ $json.URL }}",
        "options": {
          "timeout": 30000
        },
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            {
              "name": "User-Agent",
              "value": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/0.0.0.0 Safari/537.36"
            },
            {
              "name": "Accept",
              "value": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
            },
            {
              "name": "Accept-Language",
              "value": "en-US,en;q=0.5"
            }
          ]
        }
      },
      "typeVersion": 4.2
    },
    {
      "id": "6cbae844-1781-438d-b8e4-6e1ae000b836",
      "name": "HTML to Markdown",
      "type": "n8n-nodes-base.markdown",
      "onError": "continueRegularOutput",
      "position": [
        1712,
        416
      ],
      "parameters": {
        "html": "={{ $json.data }}",
        "options": {}
      },
      "typeVersion": 1
    },
    {
      "id": "1472c962-8dcf-4b43-bf84-14626ed71583",
      "name": "AI URL Extractor",
      "type": "@n8n/n8n-nodes-langchain.agent",
      "onError": "continueErrorOutput",
      "position": [
        2048,
        416
      ],
      "parameters": {
        "text": "={{ $json.data }}",
        "options": {
          "systemMessage": "=You are an AI agent that extracts article, blog, news, or report URLs from webpage content. Follow these rules:\n\n## ARTICLE IDENTIFICATION\n\n**Include URLs that are:**\n- Actual articles, blog posts, news stories, or reports\n- URLs with multi-word slugs (e.g., /2025/best-practices-for-automation)\n- URLs with dates in the path (e.g., /2025/01/15/article-title)\n- Content pages with substantial text\n\n**Exclude URLs that are:**\n- Navigation pages (about, contact, careers, terms, privacy)\n- Category or tag listing pages (e.g., /category/news, /tag/automation)\n- Author profile pages\n- Search result pages\n- PDF downloads or file links\n- External links to other domains\n- Single-word slugs that are clearly categories (e.g., /news, /blog)\n- Pagination links (e.g., /page/2)\n- Homepage or root URLs\n\n## URL PROCESSING\n\n1. Convert relative URLs to absolute URLs using the page's domain\n2. Remove tracking parameters (?utm_*, &ref=, etc.)\n3. Remove hash fragments (#section)\n4. Deduplicate - return each unique URL only once\n\n## SOURCE CATEGORIZATION\n\nAssign a `source` value based on the domain:\n- Use lowercase, no spaces\n- Format: `domain_category` if content has distinct categories\n- Examples: `techcrunch`, `nytimes_business`, `medium_tech`\n\n## OUTPUT FORMAT\n\nReturn ONLY a valid JSON array. No explanations or markdown formatting.\n\n```json\n[\n  {\n    \"url\": \"https://example.com/2024/article-title\",\n    \"source\": \"example\"\n  },\n  {\n    \"url\": \"https://example.com/blog/another-post\",\n    \"source\": \"example_blog\"\n  }\n]\n```\n\nIf no valid article URLs are found, return an empty array: `[]`"
        },
        "promptType": "define"
      },
      "typeVersion": 2.2
    },
    {
      "id": "6caf2efa-9170-4c17-8931-8d2725188a30",
      "name": "URL Parser & Normalizer",
      "type": "n8n-nodes-base.code",
      "position": [
        2560,
        400
      ],
      "parameters": {
        "jsCode": "// Parse AI output and return [{url, source}]\nlet results = [];\n\nfor (const item of items) {\n  let data = item.json.output ?? item.json;\n\n  // Clean markdown fences if present\n  if (typeof data === \"string\") {\n    let cleaned = data\n      .replace(/```json\\s*/gi, \"\")\n      .replace(/```\\s*/g, \"\")\n      .trim();\n    \n    try {\n      data = JSON.parse(cleaned);\n    } catch (e) {\n      // If it's just a raw URL string, wrap it\n      if (/^https?:\\/\\//i.test(cleaned)) {\n        results.push({ url: cleaned, source: \"unknown\" });\n      }\n      continue;\n    }\n  }\n\n  // If we got an array of objects\n  if (Array.isArray(data)) {\n    data.forEach(entry => {\n      if (entry && typeof entry === \"object\" && entry.url) {\n        const normalized = normalizeUrl(entry.url);\n        if (normalized) {\n          results.push({\n            url: normalized,\n            source: entry.source ?? \"unknown\"\n          });\n        }\n      }\n    });\n  }\n\n  // If it's a single object\n  if (data && typeof data === \"object\" && !Array.isArray(data) && data.url) {\n    const normalized = normalizeUrl(data.url);\n    if (normalized) {\n      results.push({\n        url: normalized,\n        source: data.source ?? \"unknown\"\n      });\n    }\n  }\n}\n\n// Helper: normalize URL - strip query params, hash, trailing slash\nfunction normalizeUrl(url) {\n  try {\n    let u = new URL(url);\n    \n    // Remove tracking parameters\n    const trackingParams = ['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'utm_term', 'ref', 'source', 'fbclid', 'gclid'];\n    trackingParams.forEach(param => u.searchParams.delete(param));\n    \n    // If no other params remain, clear search entirely\n    if (u.searchParams.toString() === '') {\n      u.search = '';\n    }\n    \n    u.hash = \"\";\n    return u.href.replace(/\\/$/, \"\");\n  } catch (e) {\n    return null;\n  }\n}\n\n// Deduplicate by URL\nconst seen = new Set();\nresults = results.filter(r => {\n  if (seen.has(r.url)) return false;\n  seen.add(r.url);\n  return true;\n});\n\nreturn results.map(r => ({ json: r }));"
      },
      "typeVersion": 2
    },
    {
      "id": "5af45284-92bc-4ae9-8b36-f078765b0a57",
      "name": "Save Discovered URLs",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2960,
        608
      ],
      "parameters": {
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Discovered URLs"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET_NAME"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "45d35d27-b373-480a-95c0-2d82fade93d0",
      "name": "Completion Summary",
      "type": "n8n-nodes-base.set",
      "position": [
        1216,
        224
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "url-count",
              "name": "urlsDiscovered",
              "type": "number",
              "value": "={{ $items().length }}"
            },
            {
              "id": "timestamp",
              "name": "completedAt",
              "type": "string",
              "value": "={{ $now.toISO() }}"
            },
            {
              "id": "status",
              "name": "status",
              "type": "string",
              "value": "completed"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "e5d9866e-c489-44ae-88bd-a79f3a7869c9",
      "name": "OpenAI Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        2096,
        624
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-5-mini",
          "cachedResultName": "gpt-5-mini"
        },
        "options": {},
        "builtInTools": {}
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.3
    }
  ],
  "active": false,
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "2258e740-b5ea-45ea-bdb7-ec15dc2a4e26",
  "connections": {
    "Loop Over URLs": {
      "main": [
        [
          {
            "node": "Completion Summary",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Rate Limit (3s)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Manual Trigger": {
      "main": [
        [
          {
            "node": "Read Seed URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Read Seed URLs": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Rate Limit (3s)": {
      "main": [
        [
          {
            "node": "Fetch Webpage HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "AI URL Extractor": {
      "main": [
        [
          {
            "node": "URL Parser & Normalizer",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTML to Markdown": {
      "main": [
        [
          {
            "node": "AI URL Extractor",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "AI URL Extractor",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Fetch Webpage HTML": {
      "main": [
        [
          {
            "node": "HTML to Markdown",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Save Discovered URLs": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Daily Schedule (6 AM)": {
      "main": [
        [
          {
            "node": "Read Seed URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "URL Parser & Normalizer": {
      "main": [
        [
          {
            "node": "Save Discovered URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
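
The parsing and normalization done by the `URL Parser & Normalizer` Code node in the JSON above can be tried outside n8n before importing. The following is a minimal sketch, assuming a recent Node.js with the global `URL` class; `parseModelOutput`, `TRACKING_PARAMS`, and `FENCE` are hypothetical names introduced here for illustration. It mirrors the node's fence stripping, JSON parsing, tracking-parameter removal, and deduplication, but it is a simplified stand-in rather than the workflow's exact code (for example, it omits the raw-URL-string fallback).

```javascript
// Standalone sketch of the "URL Parser & Normalizer" logic, runnable in Node.js.
// parseModelOutput, TRACKING_PARAMS and FENCE are illustrative names, not part of the workflow.

const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
  "ref", "source", "fbclid", "gclid",
];

// Matches a run of three backticks, optionally followed by "json" and whitespace.
const FENCE = new RegExp("`{3}(?:json)?\\s*", "gi");

// Remove tracking parameters, the hash fragment, and a trailing slash.
// Returns null when the input is not a parseable absolute URL.
function normalizeUrl(url) {
  try {
    const u = new URL(url);
    TRACKING_PARAMS.forEach((param) => u.searchParams.delete(param));
    u.hash = "";
    return u.href.replace(/\/$/, "");
  } catch (e) {
    return null;
  }
}

// Turn the model's text reply into a deduplicated list of { url, source }.
function parseModelOutput(text) {
  const cleaned = text.replace(FENCE, "").trim();

  let data;
  try {
    data = JSON.parse(cleaned);
  } catch (e) {
    return []; // Simplification: the workflow's node also accepts a bare URL string here.
  }

  const entries = Array.isArray(data) ? data : [data];
  const seen = new Set();
  const results = [];

  for (const entry of entries) {
    if (!entry || typeof entry !== "object" || !entry.url) continue;
    const url = normalizeUrl(entry.url);
    if (!url || seen.has(url)) continue;
    seen.add(url);
    results.push({ url, source: entry.source ?? "unknown" });
  }
  return results;
}

// Two entries that differ only by tracking parameter, hash, and trailing slash.
const sample = JSON.stringify([
  { url: "https://example.com/2024/article-title/?utm_source=x#top", source: "example" },
  { url: "https://example.com/2024/article-title", source: "example" },
]);

console.log(parseModelOutput(sample));
```

Both sample entries normalize to the same URL, so only one survives deduplication. Note that `source` is on the tracking list, so a site that uses `?source=` as a meaningful query parameter would lose it during normalization.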

Credentials you'll need

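Each integration node prompts for credentials when you import: Google Sheets OAuth2 (`googleSheetsOAuth2Api`) for the two Google Sheets nodes and OpenAI (`openAiApi`) for the chat model. Credential IDs are stripped before publishing, so you add your own.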
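
To see which credentials a workflow export expects before importing it, one option is to scan the `credentials` key of each node. The following is a minimal sketch; `listRequiredCredentials` and `workflowExcerpt` are hypothetical names introduced here, and the excerpt is a hand-trimmed copy of the JSON above that keeps only the fields the function reads.

```javascript
// Sketch: group node names by the credential type they reference in an n8n workflow export.
// listRequiredCredentials is an illustrative helper, not an n8n API.
function listRequiredCredentials(workflow) {
  const byType = new Map();
  for (const node of workflow.nodes ?? []) {
    // Nodes without a "credentials" key (triggers, sticky notes, ...) are skipped.
    for (const credType of Object.keys(node.credentials ?? {})) {
      if (!byType.has(credType)) byType.set(credType, []);
      byType.get(credType).push(node.name);
    }
  }
  return Object.fromEntries(byType);
}

// Hand-trimmed excerpt of the workflow JSON shown on this page.
const workflowExcerpt = {
  nodes: [
    { name: "Manual Trigger" },
    { name: "Read Seed URLs", credentials: { googleSheetsOAuth2Api: { name: "<your credential>" } } },
    { name: "Save Discovered URLs", credentials: { googleSheetsOAuth2Api: { name: "<your credential>" } } },
    { name: "OpenAI Chat Model", credentials: { openAiApi: { name: "<your credential>" } } },
  ],
};

console.log(listRequiredCredentials(workflowExcerpt));
```

For the full export, the same function could be called on the parsed file contents instead of the excerpt.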


Source: https://n8n.io/workflows/11211/ (original creator credit).
