{
  "id": "cH8l9UbxggwcnShG",
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "name": "AI-Powered Multi-Source URL Discovery Engine",
  "tags": [],
  "nodes": [
    {
      "id": "bce47b0d-6d94-4dbf-9b27-ef7b1f1385e9",
      "name": "Sticky Note - Introduction",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -400,
        80
      ],
      "parameters": {
        "color": 5,
        "width": 460,
        "height": 540,
        "content": "## \ud83d\ude80 AI-Powered Multi-Source URL Discovery Engine\n\nThis workflow discovers article URLs on the seed pages you list, using an LLM to pick out article links.\n\n### How it works:\n1. **Input URLs** \u2192 Read seed URLs from Google Sheets\n2. **Fetch HTML** \u2192 Download each page with a browser-like User-Agent header\n3. **Convert to Markdown** \u2192 Clean HTML for AI processing\n4. **AI Extraction** \u2192 LLM identifies valid article links\n5. **Parse & Normalize** \u2192 Parse the AI output, strip tracking parameters, remove duplicates\n6. **Save Results** \u2192 Append to Google Sheets with source tracking\n\n### Quick Setup:\n1. Connect your Google Sheets credentials\n2. Connect your OpenAI API credentials\n3. Create a `Seed URLs` sheet with a `URL` column and a `Discovered URLs` sheet with columns `URL`, `Source`, `Status`\n4. Add seed URLs to the `Seed URLs` sheet\n5. Run the workflow!"
      },
      "typeVersion": 1
    },
    {
      "id": "e30f2f04-4f9f-4ca9-8561-236c6a69f234",
      "name": "Sticky Note - Input",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        336,
        80
      ],
      "parameters": {
        "color": 4,
        "width": 280,
        "height": 536,
        "content": "## \ud83d\udce5 Input Stage\n\nReads seed URLs from your Google Sheets.\n\n**Required Sheet Columns:**\n- `URL` - The webpage to crawl\n\n**Tip:** Add multiple publisher homepages, blog indexes, or news feeds to discover all their articles."
      },
      "typeVersion": 1
    },
    {
      "id": "6822173c-3e3a-45a3-9b35-495d9d933766",
      "name": "Sticky Note - Loop",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        720,
        80
      ],
      "parameters": {
        "color": 3,
        "width": 424,
        "height": 536,
        "content": "## \ud83d\udd04 Processing Loop\n\nProcesses seed URLs one at a time, pausing between requests to avoid being blocked.\n\n**Features:**\n- Batch processing (1 at a time)\n- Wait node pauses 3 seconds before each fetch\n- Fetch and conversion errors do not stop the run"
      },
      "typeVersion": 1
    },
    {
      "id": "156a24f3-8e3f-4ffa-853d-2da7721e6f82",
      "name": "Sticky Note - Fetch",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1424,
        80
      ],
      "parameters": {
        "color": 6,
        "width": 440,
        "height": 536,
        "content": "## \ud83c\udf10 Web Fetching\n\nFetches HTML with a browser-like User-Agent to reduce the chance of being blocked as a bot.\n\n**Key Settings:**\n- Custom User-Agent, Accept, and Accept-Language headers\n- 30-second request timeout\n- Error handling: continue on failure\n- Converts HTML \u2192 Markdown for AI"
      },
      "typeVersion": 1
    },
    {
      "id": "6e4911b5-c368-43e2-8398-5814e7a972b4",
      "name": "Sticky Note - AI",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1952,
        80
      ],
      "parameters": {
        "color": 2,
        "width": 440,
        "height": 708,
        "content": "## \ud83e\udd16 AI URL Extraction\n\nThe AI Agent analyzes page content and extracts valid article URLs.\n\n**What it identifies:**\n\u2705 Article/blog/news URLs\n\u2705 Multi-word slugs or dated paths\n\u2705 Content pages\n\n**What it excludes:**\n\u274c Navigation pages\n\u274c Category/tag pages\n\u274c PDFs and downloads\n\u274c About/contact pages\n\u274c External links and pagination"
      },
      "typeVersion": 1
    },
    {
      "id": "7ee53dbe-63b3-4903-9a67-3169f7bbd2db",
      "name": "Sticky Note - Parser",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2464,
        80
      ],
      "parameters": {
        "color": 5,
        "width": 280,
        "height": 536,
        "content": "## \ud83d\udd27 URL Parser & Normalizer\n\nCleans and validates AI output.\n\n**Processing:**\n- Strips markdown code fences\n- Parses JSON array\n- Normalizes URLs (removes tracking params, hash fragments, trailing slashes)\n- Removes duplicates\n- Skips unparseable output and invalid URLs"
      },
      "typeVersion": 1
    },
    {
      "id": "923f5e7d-57ac-47d9-9863-7164d666dbe0",
      "name": "Sticky Note - Output",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2848,
        80
      ],
      "parameters": {
        "color": 4,
        "width": 328,
        "height": 716,
        "content": "## \ud83d\udcbe Output Stage\n\nSaves discovered URLs to Google Sheets.\n\n**Output Columns:**\n- `URL` - Discovered article URL\n- `Source` - Publisher/category\n- `Status` - Set to \"Pending\"\n\n**Deduplication:**\nUses URL as match key to avoid duplicates."
      },
      "typeVersion": 1
    },
    {
      "id": "daf17cd7-1dc4-4770-99a1-70ee9e0b27c7",
      "name": "Daily Schedule (6 AM)",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [
        128,
        304
      ],
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "cronExpression",
              "expression": "0 6 * * *"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "a0ee457a-f997-454f-b4ee-ec411d40688f",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        128,
        496
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "8763f6c9-f9e0-4b48-be2d-7489f1db81d9",
      "name": "Read Seed URLs",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        432,
        400
      ],
      "parameters": {
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Seed URLs"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET_NAME"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "69fe92b6-996f-4225-9bda-7785bda6daae",
      "name": "Loop Over URLs",
      "type": "n8n-nodes-base.splitInBatches",
      "position": [
        784,
        400
      ],
      "parameters": {
        "options": {
          "reset": false
        }
      },
      "typeVersion": 3
    },
    {
      "id": "7173466b-9bef-4683-8358-86cb7726d357",
      "name": "Rate Limit (3s)",
      "type": "n8n-nodes-base.wait",
      "position": [
        1008,
        416
      ],
      "parameters": {
        "amount": 3
      },
      "typeVersion": 1.1
    },
    {
      "id": "27de86ef-f8e0-4b22-85eb-3d67a73340f9",
      "name": "Fetch Webpage HTML",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueRegularOutput",
      "position": [
        1488,
        416
      ],
      "parameters": {
        "url": "={{ $json.URL }}",
        "options": {
          "timeout": 30000
        },
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            {
              "name": "User-Agent",
              "value": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/0.0.0.0 Safari/537.36"
            },
            {
              "name": "Accept",
              "value": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
            },
            {
              "name": "Accept-Language",
              "value": "en-US,en;q=0.5"
            }
          ]
        }
      },
      "typeVersion": 4.2
    },
    {
      "id": "6cbae844-1781-438d-b8e4-6e1ae000b836",
      "name": "HTML to Markdown",
      "type": "n8n-nodes-base.markdown",
      "onError": "continueRegularOutput",
      "position": [
        1712,
        416
      ],
      "parameters": {
        "html": "={{ $json.data }}",
        "options": {}
      },
      "typeVersion": 1
    },
    {
      "id": "1472c962-8dcf-4b43-bf84-14626ed71583",
      "name": "AI URL Extractor",
      "type": "@n8n/n8n-nodes-langchain.agent",
      "onError": "continueErrorOutput",
      "position": [
        2048,
        416
      ],
      "parameters": {
        "text": "={{ $json.data }}",
        "options": {
          "systemMessage": "=You are an AI agent that extracts article, blog, news, or report URLs from webpage content. Follow these rules:\n\n## ARTICLE IDENTIFICATION\n\n**Include URLs that are:**\n- Actual articles, blog posts, news stories, or reports\n- URLs with multi-word slugs (e.g., /2025/best-practices-for-automation)\n- URLs with dates in the path (e.g., /2025/01/15/article-title)\n- Content pages with substantial text\n\n**Exclude URLs that are:**\n- Navigation pages (about, contact, careers, terms, privacy)\n- Category or tag listing pages (e.g., /category/news, /tag/automation)\n- Author profile pages\n- Search result pages\n- PDF downloads or file links\n- External links to other domains\n- Single-word slugs that are clearly categories (e.g., /news, /blog)\n- Pagination links (e.g., /page/2)\n- Homepage or root URLs\n\n## URL PROCESSING\n\n1. Convert relative URLs to absolute URLs using the page's domain\n2. Remove tracking parameters (?utm_*, &ref=, etc.)\n3. Remove hash fragments (#section)\n4. Deduplicate - return each unique URL only once\n\n## SOURCE CATEGORIZATION\n\nAssign a `source` value based on the domain:\n- Use lowercase, no spaces\n- Format: `domain_category` if content has distinct categories\n- Examples: `techcrunch`, `nytimes_business`, `medium_tech`\n\n## OUTPUT FORMAT\n\nReturn ONLY a valid JSON array. No explanations or markdown formatting.\n\n```json\n[\n  {\n    \"url\": \"https://example.com/2024/article-title\",\n    \"source\": \"example\"\n  },\n  {\n    \"url\": \"https://example.com/blog/another-post\",\n    \"source\": \"example_blog\"\n  }\n]\n```\n\nIf no valid article URLs are found, return an empty array: `[]`"
        },
        "promptType": "define"
      },
      "typeVersion": 2.2
    },
    {
      "id": "6caf2efa-9170-4c17-8931-8d2725188a30",
      "name": "URL Parser & Normalizer",
      "type": "n8n-nodes-base.code",
      "position": [
        2560,
        400
      ],
      "parameters": {
        "jsCode": "// Parse AI output and return [{url, source}]\nlet results = [];\n\nfor (const item of items) {\n  let data = item.json.output ?? item.json;\n\n  // Clean markdown fences if present\n  if (typeof data === \"string\") {\n    let cleaned = data\n      .replace(/```json\\s*/gi, \"\")\n      .replace(/```\\s*/g, \"\")\n      .trim();\n    \n    try {\n      data = JSON.parse(cleaned);\n    } catch (e) {\n      // If it's just a raw URL string, normalize and wrap it\n      if (/^https?:\\/\\//i.test(cleaned)) {\n        const normalized = normalizeUrl(cleaned);\n        if (normalized) {\n          results.push({ url: normalized, source: \"unknown\" });\n        }\n      }\n      continue;\n    }\n  }\n\n  // If we got an array of objects\n  if (Array.isArray(data)) {\n    data.forEach(entry => {\n      if (entry && typeof entry === \"object\" && entry.url) {\n        const normalized = normalizeUrl(entry.url);\n        if (normalized) {\n          results.push({\n            url: normalized,\n            source: entry.source ?? \"unknown\"\n          });\n        }\n      }\n    });\n  }\n\n  // If it's a single object\n  if (data && typeof data === \"object\" && !Array.isArray(data) && data.url) {\n    const normalized = normalizeUrl(data.url);\n    if (normalized) {\n      results.push({\n        url: normalized,\n        source: data.source ?? \"unknown\"\n      });\n    }\n  }\n}\n\n// Helper: normalize URL - strip tracking params, hash, trailing slash\n// (function declarations are hoisted, so the loop above can call it)\nfunction normalizeUrl(url) {\n  try {\n    let u = new URL(url);\n    \n    // Remove tracking parameters; other query params are kept\n    const trackingParams = ['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'utm_term', 'ref', 'source', 'fbclid', 'gclid'];\n    trackingParams.forEach(param => u.searchParams.delete(param));\n    \n    // If no other params remain, clear search entirely\n    if (u.searchParams.toString() === '') {\n      u.search = '';\n    }\n    \n    u.hash = \"\";\n    return u.href.replace(/\\/$/, \"\");\n  } catch (e) {\n    // Relative or malformed URLs cannot be parsed; drop them\n    return null;\n  }\n}\n\n// Deduplicate by URL\nconst seen = new Set();\nresults = results.filter(r => {\n  if (seen.has(r.url)) return false;\n  seen.add(r.url);\n  return true;\n});\n\nreturn results.map(r => ({ json: r }));"
      },
      "typeVersion": 2
    },
    {
      "id": "5af45284-92bc-4ae9-8b36-f078765b0a57",
      "name": "Save Discovered URLs",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2960,
        608
      ],
      "parameters": {
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Discovered URLs"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET_NAME"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "45d35d27-b373-480a-95c0-2d82fade93d0",
      "name": "Completion Summary",
      "type": "n8n-nodes-base.set",
      "position": [
        1216,
        224
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "url-count",
              "name": "urlsDiscovered",
              "type": "number",
              "value": "={{ $items().length }}"
            },
            {
              "id": "timestamp",
              "name": "completedAt",
              "type": "string",
              "value": "={{ $now.toISO() }}"
            },
            {
              "id": "status",
              "name": "status",
              "type": "string",
              "value": "completed"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "e5d9866e-c489-44ae-88bd-a79f3a7869c9",
      "name": "OpenAI Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        2096,
        624
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-5-mini",
          "cachedResultName": "gpt-5-mini"
        },
        "options": {},
        "builtInTools": {}
      },
      "credentials": {
        "openAiApi": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 1.3
    }
  ],
  "active": false,
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "2258e740-b5ea-45ea-bdb7-ec15dc2a4e26",
  "connections": {
    "Loop Over URLs": {
      "main": [
        [
          {
            "node": "Completion Summary",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Rate Limit (3s)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Manual Trigger": {
      "main": [
        [
          {
            "node": "Read Seed URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Read Seed URLs": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Rate Limit (3s)": {
      "main": [
        [
          {
            "node": "Fetch Webpage HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "AI URL Extractor": {
      "main": [
        [
          {
            "node": "URL Parser & Normalizer",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTML to Markdown": {
      "main": [
        [
          {
            "node": "AI URL Extractor",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "AI URL Extractor",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Fetch Webpage HTML": {
      "main": [
        [
          {
            "node": "HTML to Markdown",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Save Discovered URLs": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Daily Schedule (6 AM)": {
      "main": [
        [
          {
            "node": "Read Seed URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "URL Parser & Normalizer": {
      "main": [
        [
          {
            "node": "Save Discovered URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}