
Build a Multi-site Content Aggregator with Google Sheets & Custom Extraction…

Original n8n title: Build a Multi-site Content Aggregator with Google Sheets & Custom Extraction Logic

By Omer Fayyaz (@omerfayyaz) on n8n.io

Intelligent Source Routing - Uses a Switch node to route URLs to specialized extractors based on source identifier, enabling custom CSS selectors per publisher for maximum accuracy.

Universal Fallback Parser - Advanced regex-based extractor handles unknown sources automatically,…
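
The routing idea can be sketched outside n8n in a few lines of JavaScript. This is a minimal illustration under stated assumptions, not the template's actual nodes: `firstMatch`, `fallbackExtractor`, `siteExtractors`, and `routeAndExtract` are hypothetical names, and the regexes are simplified stand-ins for the CSS selectors the workflow uses.

```javascript
// Hypothetical sketch: route a page to a source-specific extractor,
// falling back to a generic regex-based parser for unknown sources.

// Return the first non-empty capture among the patterns, with inner tags stripped.
function firstMatch(html, patterns) {
  for (const pattern of patterns) {
    const match = html.match(pattern);
    if (match) {
      const text = match[1].replace(/<[^>]+>/g, "").trim();
      if (text) return text;
    }
  }
  return null;
}

// Generic parser used when no source-specific extractor is registered.
function fallbackExtractor(html) {
  return {
    extractor: "fallback",
    title: firstMatch(html, [
      /<h1[^>]*>([\s\S]*?)<\/h1>/i,
      /<title[^>]*>([\s\S]*?)<\/title>/i,
    ]),
  };
}

// One entry per known source identifier (the "Source" value stored with each URL).
const siteExtractors = new Map([
  ["Site A", (html) => ({
    extractor: "Site A",
    title: firstMatch(html, [
      /<h1[^>]*class=["']article__title["'][^>]*>([\s\S]*?)<\/h1>/i,
    ]),
  })],
]);

// Pick the registered extractor for this source, or the fallback if none exists.
function routeAndExtract(source, html) {
  const extractor = siteExtractors.get(source) ?? fallbackExtractor;
  return extractor(html);
}

console.log(routeAndExtract("Site A", '<h1 class="article__title">Hello</h1>'));
console.log(routeAndExtract("Unknown Blog", "<title>Some page</title>"));
```

Keeping the extractors in a lookup table means a new source is one more entry, which mirrors adding one more output to the Switch node.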

Category: Data & Sheets · Trigger: Event · Nodes: 27 · Complexity: ★★★★☆ · Integrations: Google Sheets, HTTP Request

This workflow corresponds to n8n.io template #11224 — we link there as the canonical source.

This workflow follows the Google Sheets → HTTP Request recipe pattern.
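
In rough terms, the pairing reads rows from a sheet and fetches each row's URL with a pause between requests. The sketch here assumes rows shaped like `{ URL, Source }` (the field names the workflow's expressions reference). `processRows`, `fetchHtml`, and `sleep` are hypothetical names, and the fetcher is injected so the loop can be exercised without network access. Skipping the pause before the first request is a choice made for this sketch only.

```javascript
// Hypothetical sketch of the "sheet rows -> HTTP request" loop with a fixed delay.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// rows: array of { URL, Source } objects, as read from the sheet.
// fetchHtml: async function (url) => HTML string; supplied by the caller.
// delayMs: pause between consecutive requests (the workflow waits 3 seconds).
async function processRows(rows, fetchHtml, delayMs = 3000) {
  const results = [];
  for (const [index, row] of rows.entries()) {
    if (index > 0) await sleep(delayMs); // no wait before the first request
    try {
      const html = await fetchHtml(row.URL);
      results.push({ ...row, status: "fetched", html });
    } catch (error) {
      // Record the failure and keep going, so one bad URL does not stop the run.
      results.push({ ...row, status: "failed", error: String(error) });
    }
  }
  return results;
}
```

On a runtime with a global `fetch` (Node 18 or later, for example), a caller could pass `async (url) => (await fetch(url)).text()` as `fetchHtml`.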

The workflow JSON

Copy or download the full n8n JSON below. Paste it into a new n8n workflow, add your credentials, and activate it.
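
Before pasting, a quick structural check can catch a file that was truncated or mangled in copying. The helper here is a sketch, not part of n8n: `summarizeWorkflow` is a hypothetical name, and it relies only on the top-level `name`, `nodes`, and `connections` keys visible in the JSON on this page.

```javascript
// Hypothetical helper: parse exported workflow JSON and report basic facts.
function summarizeWorkflow(jsonText) {
  const workflow = JSON.parse(jsonText); // throws SyntaxError on malformed JSON
  const nodes = Array.isArray(workflow.nodes) ? workflow.nodes : [];
  const nodeNames = new Set(nodes.map((node) => node.name));

  // Collect connection endpoints that do not correspond to any node name.
  const unknownTargets = [];
  for (const [from, outputs] of Object.entries(workflow.connections ?? {})) {
    if (!nodeNames.has(from)) unknownTargets.push(from);
    for (const branch of outputs.main ?? []) {
      for (const link of branch ?? []) {
        if (!nodeNames.has(link.node)) unknownTargets.push(link.node);
      }
    }
  }

  return {
    name: workflow.name ?? null,
    nodeCount: nodes.length,
    unknownTargets,
  };
}
```

In Node, the function could be fed with `require("node:fs").readFileSync("workflow.json", "utf8")`, where `workflow.json` stands for wherever the download was saved. A non-empty `unknownTargets` list suggests a connection refers to a node that is missing from the file.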

{
  "id": "WhZiGSdO9IICm2Y5",
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "name": "Multi-Site Web Scraper with Source Routing",
  "tags": [],
  "nodes": [
    {
      "id": "0bbe2f45-e7a6-485c-8fea-1df1d31239c7",
      "name": "Sticky Note - Introduction",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -416,
        368
      ],
      "parameters": {
        "width": 504,
        "height": 904,
        "content": "## Multi-Site Web Scraper with Source Routing\n\nIntelligent web scraper that routes URLs to different extraction logic based on source domain.\n\n### How it works  \n- **Trigger**: Starts manually, on a schedule (every 4 hours), or from another workflow.  \n- **Read URLs**: Fetches URLs from Google Sheets (\"URLs to Process\") with source identifiers.  \n- **Rate Limiting**: Adds a 3-second delay between requests to avoid overwhelming servers.  \n- **Source Routing**: Routes each URL to a specific extraction logic based on the source (e.g., Site A, Site B).  \n- **Extraction**: Extracts content using site-specific CSS selectors or fallback logic.  \n- **Freshness Filter**: Validates article age (defaults to 45 days), marks outdated articles as \"Outdated\".  \n- **Normalization**: Cleans and standardizes the extracted data.  \n- **Save & Log**: Saves extracted data to the \"Article Feed\" and updates URL status in Google Sheets.  \n- **Status Updates**: Tracks success or failure per URL and updates the status accordingly.\n\n### Setup steps  \n1. **Google Sheets Integration**: Connect your Google Sheets account.  \n2. **Configure Sheets**: Set sheet names for \"URLs to Process\" and \"Article Feed\".  \n3. **Customize Extraction**: Define CSS selectors for each site's extractor.  \n4. **Configure Freshness Filter**: Set the article age threshold (default: 45 days).  \n5. **Run Workflow**: Trigger manually or set a schedule to scrape data regularly.\n\n### Adding New Sources:\n1. Add a new output to the Switch node\n2. Create an HTML or Code node with site-specific selectors\n3. Connect to the Freshness Filter"
      },
      "typeVersion": 1
    },
    {
      "id": "41d85cb2-a70f-4f83-8e84-4e56467c554c",
      "name": "Sticky Note - Input",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        384,
        640
      ],
      "parameters": {
        "color": 7,
        "width": 228,
        "height": 648,
        "content": "## Reads URLs with source identifiers."
      },
      "typeVersion": 1
    },
    {
      "id": "24d476ac-54cf-4271-9464-3d4196a33b24",
      "name": "Sticky Note - Router",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1408,
        640
      ],
      "parameters": {
        "color": 7,
        "width": 296,
        "height": 644,
        "content": "## Source Router (Switch)"
      },
      "typeVersion": 1
    },
    {
      "id": "63470f2c-d372-46ca-b2f7-3a944e02bddf",
      "name": "Sticky Note - Extractors",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1856,
        448
      ],
      "parameters": {
        "color": 7,
        "width": 616,
        "height": 832,
        "content": "## Site-Specific Extractors\nCustom CSS selectors per publisher."
      },
      "typeVersion": 1
    },
    {
      "id": "3d735620-b4c4-4ab8-85a7-4921a4d80a52",
      "name": "Sticky Note - Freshness",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2640,
        640
      ],
      "parameters": {
        "color": 7,
        "width": 280,
        "height": 640,
        "content": "## Freshness Filter\nFilters articles by publication date."
      },
      "typeVersion": 1
    },
    {
      "id": "7fa04235-06b1-474b-bfa6-9d8fbbdb323d",
      "name": "Sticky Note - Output",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3280,
        640
      ],
      "parameters": {
        "color": 7,
        "width": 488,
        "height": 640,
        "content": "## Output & Status Tracking\nSaves extracted data and updates source status."
      },
      "typeVersion": 1
    },
    {
      "id": "894dc531-b7d0-44ca-b031-7c22c925adf1",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        144,
        928
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "df99475b-e36e-4d8b-b8e8-6e252ced9f42",
      "name": "Schedule (Every 4 Hours)",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [
        144,
        1120
      ],
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "hours",
              "hoursInterval": 4
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "a0b9999e-860a-4d1a-8f52-c942aa3fcc81",
      "name": "Read Pending URLs",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        448,
        1024
      ],
      "parameters": {
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "URLs to Process"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "7219c3b5-785c-4ee9-bc72-20c382d8f6d2",
      "name": "Loop Over URLs",
      "type": "n8n-nodes-base.splitInBatches",
      "position": [
        720,
        1024
      ],
      "parameters": {
        "options": {
          "reset": false
        }
      },
      "typeVersion": 3
    },
    {
      "id": "72bdf7e2-bb39-4de3-bb71-634c7163d5be",
      "name": "Rate Limit (3s)",
      "type": "n8n-nodes-base.wait",
      "position": [
        944,
        800
      ],
      "parameters": {
        "amount": 3
      },
      "typeVersion": 1.1
    },
    {
      "id": "896b1bb9-4d3c-4ea5-93fd-d3a1e019c029",
      "name": "Fetch HTML",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueRegularOutput",
      "position": [
        1184,
        800
      ],
      "parameters": {
        "url": "={{ $json.URL }}",
        "options": {
          "timeout": 30000,
          "response": {
            "response": {
              "fullResponse": true,
              "responseFormat": "text"
            }
          }
        },
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            {
              "name": "User-Agent",
              "value": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/0.0.0.0 Safari/537.36"
            },
            {
              "name": "Accept",
              "value": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
            },
            {
              "name": "Accept-Language",
              "value": "en-US,en;q=0.5"
            }
          ]
        }
      },
      "typeVersion": 4.2
    },
    {
      "id": "aa9fac82-5076-45ac-90d1-da65faf5c206",
      "name": "Source Router",
      "type": "n8n-nodes-base.switch",
      "position": [
        1504,
        752
      ],
      "parameters": {
        "rules": {
          "values": [
            {
              "outputKey": "Site A",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "dd700c7f-06c4-4a76-93ba-adaa51b1814e",
                    "operator": {
                      "type": "string",
                      "operation": "equals"
                    },
                    "leftValue": "={{ $('Loop Over URLs').item.json.Source }}",
                    "rightValue": "Site A"
                  }
                ]
              },
              "renameOutput": true
            },
            {
              "outputKey": "Site B",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "6da5c75a-2057-414e-ac16-cee607861b83",
                    "operator": {
                      "type": "string",
                      "operation": "equals"
                    },
                    "leftValue": "={{ $('Loop Over URLs').item.json.Source }}",
                    "rightValue": "Site B"
                  }
                ]
              },
              "renameOutput": true
            },
            {
              "outputKey": "Site C",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "7d437ee9-fb8a-48be-8d00-2ae46934a8eb",
                    "operator": {
                      "type": "string",
                      "operation": "equals"
                    },
                    "leftValue": "={{ $('Loop Over URLs').item.json.Source }}",
                    "rightValue": "Site C"
                  }
                ]
              },
              "renameOutput": true
            },
            {
              "outputKey": "Site D",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "a2abaacb-55c7-4304-b92c-ef2714d10557",
                    "operator": {
                      "type": "string",
                      "operation": "equals"
                    },
                    "leftValue": "={{ $('Loop Over URLs').item.json.Source }}",
                    "rightValue": "Site D"
                  }
                ]
              },
              "renameOutput": true
            },
            {
              "outputKey": "fallback",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "14362d61-d2e8-4d40-8edf-12eccbea7b00",
                    "operator": {
                      "type": "string",
                      "operation": "exists"
                    },
                    "leftValue": "={{ $('Loop Over URLs').item.json.Source }}",
                    "rightValue": ""
                  }
                ]
              },
              "renameOutput": true
            }
          ]
        },
        "options": {
          "allMatchingOutputs": false
        }
      },
      "typeVersion": 3.2
    },
    {
      "id": "4bbec03a-1f1f-4e1a-9b1d-2d14ae6620c3",
      "name": "Extract: Fallback (Universal)",
      "type": "n8n-nodes-base.code",
      "position": [
        2064,
        1136
      ],
      "parameters": {
        "jsCode": "// Universal fallback extractor for unknown sources\nconst results = [];\n\nfor (const item of items) {\n  const html = item.json.data || item.json.body || \"\";\n  const requestUrl = item.json.URL || item.json.url || \"\";\n  const source = item.json.Source || \"unknown\";\n  \n  let title = null;\n  let description = null;\n  let author = null;\n  let datePublished = null;\n  let imageUrl = null;\n  let canonicalUrl = null;\n\n  // --- TITLE ---\n  const titlePatterns = [\n    /<h1[^>]*>([\\s\\S]*?)<\\/h1>/i,\n    /<meta[^>]+property=[\"']og:title[\"'][^>]+content=[\"']([^\"']+)[\"']/i,\n    /<title[^>]*>([\\s\\S]*?)<\\/title>/i\n  ];\n  for (const pattern of titlePatterns) {\n    const match = html.match(pattern);\n    if (match) {\n      title = match[1].replace(/<[^>]+>/g, \"\").trim();\n      if (title) break;\n    }\n  }\n\n  // --- DESCRIPTION ---\n  const descPatterns = [\n    /<meta[^>]+name=[\"']description[\"'][^>]+content=[\"']([^\"']+)[\"']/i,\n    /<meta[^>]+property=[\"']og:description[\"'][^>]+content=[\"']([^\"']+)[\"']/i\n  ];\n  for (const pattern of descPatterns) {\n    const match = html.match(pattern);\n    if (match) {\n      description = match[1].trim();\n      if (description) break;\n    }\n  }\n  \n  if (!description) {\n    const paragraphs = [...html.matchAll(/<p[^>]*>([\\s\\S]*?)<\\/p>/gi)]\n      .map(m => m[1].replace(/<[^>]+>/g, \"\").trim())\n      .filter(t => t && t.length > 50);\n    if (paragraphs.length) {\n      description = paragraphs.slice(0, 2).join(\" \").substring(0, 500);\n    }\n  }\n\n  // --- AUTHOR ---\n  const authorPatterns = [\n    /<meta[^>]+name=[\"']author[\"'][^>]+content=[\"']([^\"']+)[\"']/i,\n    /by\\s+([A-Z][a-z]+\\s+[A-Z][a-z]+)/i,\n    /<a[^>]+rel=[\"']author[\"'][^>]*>([^<]+)<\\/a>/i\n  ];\n  for (const pattern of authorPatterns) {\n    const match = html.match(pattern);\n    if (match) {\n      author = match[1].trim();\n      if (author) break;\n    }\n  }\n  \n  if (!author) {\n    const ldMatch = html.match(/<script[^>]+application\\/ld\\+json[^>]*>([\\s\\S]*?)<\\/script>/i);\n    if (ldMatch) {\n      try {\n        const ld = JSON.parse(ldMatch[1]);\n        const a = ld.author;\n        if (a) {\n          author = typeof a === \"string\" ? a : (a.name || (Array.isArray(a) ? a[0].name : null));\n        }\n      } catch (e) {}\n    }\n  }\n\n  // --- DATE PUBLISHED ---\n  const datePatterns = [\n    /<time[^>]+datetime=[\"']([^\"']+)[\"']/i,\n    /<meta[^>]+property=[\"']article:published_time[\"'][^>]+content=[\"']([^\"']+)[\"']/i,\n    /(\\d{4}-\\d{2}-\\d{2})/\n  ];\n  for (const pattern of datePatterns) {\n    const match = html.match(pattern);\n    if (match) {\n      datePublished = match[1].trim();\n      if (datePublished) break;\n    }\n  }\n  \n  if (!datePublished) {\n    const ldMatch = html.match(/<script[^>]+application\\/ld\\+json[^>]*>([\\s\\S]*?)<\\/script>/i);\n    if (ldMatch) {\n      try {\n        const ld = JSON.parse(ldMatch[1]);\n        datePublished = ld.datePublished || ld.dateCreated || null;\n      } catch (e) {}\n    }\n  }\n\n  // --- IMAGE URL ---\n  const imgPatterns = [\n    /<meta[^>]+property=[\"']og:image[\"'][^>]+content=[\"']([^\"']+)[\"']/i,\n    /<meta[^>]+name=[\"']twitter:image[\"'][^>]+content=[\"']([^\"']+)[\"']/i\n  ];\n  for (const pattern of imgPatterns) {\n    const match = html.match(pattern);\n    if (match) {\n      imageUrl = match[1].trim();\n      if (imageUrl) break;\n    }\n  }\n\n  // --- CANONICAL URL ---\n  const canonicalMatch = html.match(/<link[^>]+rel=[\"']canonical[\"'][^>]+href=[\"']([^\"']+)[\"']/i);\n  if (canonicalMatch) {\n    canonicalUrl = canonicalMatch[1].trim();\n  } else {\n    const ogUrlMatch = html.match(/<meta[^>]+property=[\"']og:url[\"'][^>]+content=[\"']([^\"']+)[\"']/i);\n    if (ogUrlMatch) canonicalUrl = ogUrlMatch[1].trim();\n  }\n\n  results.push({\n    json: {\n      title,\n      description,\n      author,\n      datePublished,\n      imageUrl,\n      canonicalUrl: canonicalUrl || requestUrl,\n      source,\n      sourceUrl: requestUrl\n    }\n  });\n}\n\nreturn results;"
      },
      "typeVersion": 2
    },
    {
      "id": "63b8461f-a33a-4b58-8c67-6eea1f4ef492",
      "name": "Normalize Extracted Data",
      "type": "n8n-nodes-base.code",
      "position": [
        2336,
        784
      ],
      "parameters": {
        "jsCode": "// Normalize extracted data from site-specific extractors\nconst results = [];\n\nfor (const item of items) {\n  const input = item.json;\n  const loopData = $('Loop Over URLs').item.json;\n  \n  results.push({\n    json: {\n      title: (input.title && input.title.trim()) ? input.title.trim() : null,\n      description: (input.description && input.description.trim()) ? input.description.trim().substring(0, 1000) : null,\n      author: (input.author && input.author.trim()) ? input.author.trim() : null,\n      datePublished: (input.datePublished && input.datePublished.trim()) ? input.datePublished.trim() : null,\n      imageUrl: (input.imageUrl && input.imageUrl.trim()) ? input.imageUrl.trim() : null,\n      canonicalUrl: (input.canonicalUrl && input.canonicalUrl.trim()) ? input.canonicalUrl.trim() : loopData.URL,\n      source: loopData.Source || \"unknown\",\n      sourceUrl: loopData.URL\n    }\n  });\n}\n\nreturn results;"
      },
      "typeVersion": 2
    },
    {
      "id": "6e67b344-df32-47c5-b04e-a626b607ee30",
      "name": "Freshness Filter (45 days)",
      "type": "n8n-nodes-base.if",
      "position": [
        2736,
        912
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "loose"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "freshness-check",
              "operator": {
                "type": "boolean",
                "operation": "equals"
              },
              "leftValue": "={{ (function() {\n  var dateStr = $json.datePublished;\n  if (!dateStr) return true;\n  \n  var date;\n  if (dateStr.match(/^\\d{4}-\\d{2}-\\d{2}/)) {\n    date = new Date(dateStr);\n  } else {\n    var cleaned = dateStr.replace(/(\\d+)(st|nd|rd|th)/, '$1');\n    date = new Date(cleaned);\n  }\n  \n  if (isNaN(date.getTime())) return true;\n  \n  var cutoffDate = new Date();\n  cutoffDate.setDate(cutoffDate.getDate() - 45);\n  return date >= cutoffDate;\n})() }}",
              "rightValue": true
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "24a813c5-0431-4151-ae17-fd81d4854b44",
      "name": "Calculate Tier & Status",
      "type": "n8n-nodes-base.code",
      "position": [
        3040,
        832
      ],
      "parameters": {
        "jsCode": "// Calculate tier based on article age\nconst results = [];\n\nfor (const item of items) {\n  const input = item.json;\n  let tier = \"Unknown\";\n  let freshnessStatus = \"Fresh\";\n  \n  if (input.datePublished) {\n    const dateStr = input.datePublished;\n    let articleDate;\n    \n    if (dateStr.match(/^\\d{4}-\\d{2}-\\d{2}/)) {\n      articleDate = new Date(dateStr);\n    } else {\n      const cleaned = dateStr.replace(/(\\d+)(st|nd|rd|th)/, '$1');\n      articleDate = new Date(cleaned);\n    }\n    \n    if (!isNaN(articleDate.getTime())) {\n      const now = new Date();\n      const daysDiff = Math.floor((now - articleDate) / (1000 * 60 * 60 * 24));\n      \n      if (daysDiff <= 7) {\n        tier = \"Tier 1\";\n        freshnessStatus = \"Fresh\";\n      } else if (daysDiff <= 14) {\n        tier = \"Tier 2\";\n        freshnessStatus = \"Fresh\";\n      } else if (daysDiff <= 30) {\n        tier = \"Tier 3\";\n        freshnessStatus = \"Fresh\";\n      } else {\n        tier = \"Archive\";\n        freshnessStatus = \"Fresh\";\n      }\n    }\n  }\n  \n  results.push({\n    json: {\n      ...input,\n      tier,\n      freshnessStatus,\n      extractedAt: new Date().toISOString()\n    }\n  });\n}\n\nreturn results;"
      },
      "typeVersion": 2
    },
    {
      "id": "6641b358-6d38-4a9a-8bb3-1e00c83b12f7",
      "name": "Mark as Outdated",
      "type": "n8n-nodes-base.set",
      "position": [
        3392,
        1120
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "status-outdated",
              "name": "freshnessStatus",
              "type": "string",
              "value": "Outdated"
            },
            {
              "id": "reason",
              "name": "reason",
              "type": "string",
              "value": "Article older than 45 days"
            },
            {
              "id": "sourceUrl",
              "name": "sourceUrl",
              "type": "string",
              "value": "={{ $json.sourceUrl }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "e43b32e6-0b16-4b0c-a51f-8684d6509817",
      "name": "Save to Article Feed",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        3392,
        832
      ],
      "parameters": {
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "Article Feed"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "9e0a7cac-7b50-4c9e-a1fb-7ce47a9b1355",
      "name": "Update URL Status",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        3600,
        1120
      ],
      "parameters": {
        "operation": "update",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "URLs to Process"
        },
        "documentId": {
          "__rl": true,
          "mode": "list",
          "value": "",
          "cachedResultUrl": "",
          "cachedResultName": "YOUR_SPREADSHEET"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "af00d978-f194-4a53-a5ec-49861ea66452",
      "name": "Completion Summary",
      "type": "n8n-nodes-base.set",
      "position": [
        944,
        1104
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "count",
              "name": "articlesProcessed",
              "type": "number",
              "value": "={{ $items().length }}"
            },
            {
              "id": "timestamp",
              "name": "completedAt",
              "type": "string",
              "value": "={{ $now.toISO() }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "ec75d39d-4457-4e09-9163-9e4169b0dfdc",
      "name": "Extract: Site B",
      "type": "n8n-nodes-base.html",
      "onError": "continueRegularOutput",
      "position": [
        2064,
        704
      ],
      "parameters": {
        "options": {},
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "h1, article h1"
            },
            {
              "key": "description",
              "attribute": "content",
              "cssSelector": "meta[name='description']",
              "returnValue": "attribute"
            },
            {
              "key": "author",
              "cssSelector": "a[data-testid='authorName'], .author-info a, a[rel='author']"
            },
            {
              "key": "datePublished",
              "cssSelector": "span[data-testid='storyPublishDate'], time[datetime]"
            },
            {
              "key": "imageUrl",
              "attribute": "content",
              "cssSelector": "meta[property='og:image']",
              "returnValue": "attribute"
            },
            {
              "key": "canonicalUrl",
              "attribute": "href",
              "cssSelector": "link[rel='canonical']",
              "returnValue": "attribute"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "91ad55da-5b9d-4f47-b622-2cc674247b47",
      "name": "Extract: Site C",
      "type": "n8n-nodes-base.html",
      "onError": "continueRegularOutput",
      "position": [
        2064,
        848
      ],
      "parameters": {
        "options": {},
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "h1.entry-title, .post-title, h1.wp-block-post-title"
            },
            {
              "key": "description",
              "attribute": "content",
              "cssSelector": "meta[name='description']",
              "returnValue": "attribute"
            },
            {
              "key": "author",
              "cssSelector": ".author-name, .byline a, span.author a, .entry-author a"
            },
            {
              "key": "datePublished",
              "attribute": "datetime",
              "cssSelector": "time.entry-date, .post-date time, time[datetime]",
              "returnValue": "attribute"
            },
            {
              "key": "imageUrl",
              "attribute": "content",
              "cssSelector": "meta[property='og:image']",
              "returnValue": "attribute"
            },
            {
              "key": "canonicalUrl",
              "attribute": "href",
              "cssSelector": "link[rel='canonical']",
              "returnValue": "attribute"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "5e9d8787-971b-45b1-8b4d-450c59e4e5c3",
      "name": "Extract: Site D",
      "type": "n8n-nodes-base.html",
      "onError": "continueRegularOutput",
      "position": [
        2064,
        992
      ],
      "parameters": {
        "options": {},
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "#hs_cos_wrapper_name, h1.blog-post__title, .post-header h1"
            },
            {
              "key": "description",
              "cssSelector": "#hs_cos_wrapper_post_body p:first-of-type, meta[name='description']"
            },
            {
              "key": "author",
              "cssSelector": "p[data-hubspot-name='Blog Author'] a, .author-info a, .blog-author__name"
            },
            {
              "key": "datePublished",
              "cssSelector": "span.blog--single--meta--date, time[datetime], .post-date"
            },
            {
              "key": "imageUrl",
              "attribute": "content",
              "cssSelector": "meta[property='og:image']",
              "returnValue": "attribute"
            },
            {
              "key": "canonicalUrl",
              "attribute": "href",
              "cssSelector": "link[rel='canonical']",
              "returnValue": "attribute"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "cdd20c9c-cd91-4ccf-8e6e-7ea95e4289ab",
      "name": "Sticky Note - Input1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1120,
        640
      ],
      "parameters": {
        "color": 7,
        "width": 246,
        "height": 648,
        "content": "## Fetch the HTML content"
      },
      "typeVersion": 1
    },
    {
      "id": "bd580a4c-9c60-4b1c-8f55-f2d11973fc79",
      "name": "Extract: Site A",
      "type": "n8n-nodes-base.html",
      "onError": "continueRegularOutput",
      "position": [
        2064,
        560
      ],
      "parameters": {
        "options": {},
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "h1.article__title, h1[data-testid='ContentHeader'], .post-title h1"
            },
            {
              "key": "description",
              "attribute": "content",
              "cssSelector": "meta[name='description']",
              "returnValue": "attribute"
            },
            {
              "key": "author",
              "cssSelector": ".article__byline a, a[rel='author'], .author-card__name"
            },
            {
              "key": "datePublished",
              "attribute": "datetime",
              "cssSelector": "time[datetime]",
              "returnValue": "attribute"
            },
            {
              "key": "imageUrl",
              "attribute": "content",
              "cssSelector": "meta[property='og:image']",
              "returnValue": "attribute"
            },
            {
              "key": "canonicalUrl",
              "attribute": "href",
              "cssSelector": "link[rel='canonical']",
              "returnValue": "attribute"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "bb20578e-1c3d-4848-b857-3082264298c7",
      "name": "Sticky Note - Freshness1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2944,
        640
      ],
      "parameters": {
        "color": 7,
        "width": 280,
        "height": 640,
        "content": "## Tier Status\nCalculates a tier from each article's age."
      },
      "typeVersion": 1
    }
  ],
  "active": false,
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "87886f32-8c5a-4dcd-93aa-2e2109b99cee",
  "connections": {
    "Fetch HTML": {
      "main": [
        [
          {
            "node": "Source Router",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Source Router": {
      "main": [
        [
          {
            "node": "Extract: Site A",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Extract: Site B",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Extract: Site C",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Extract: Site D",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Extract: Fallback (Universal)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Loop Over URLs": {
      "main": [
        [
          {
            "node": "Completion Summary",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Rate Limit (3s)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Manual Trigger": {
      "main": [
        [
          {
            "node": "Read Pending URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract: Site A": {
      "main": [
        [
          {
            "node": "Normalize Extracted Data",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract: Site B": {
      "main": [
        [
          {
            "node": "Normalize Extracted Data",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract: Site C": {
      "main": [
        [
          {
            "node": "Normalize Extracted Data",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract: Site D": {
      "main": [
        [
          {
            "node": "Normalize Extracted Data",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Rate Limit (3s)": {
      "main": [
        [
          {
            "node": "Fetch HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Mark as Outdated": {
      "main": [
        [
          {
            "node": "Update URL Status",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Read Pending URLs": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Update URL Status": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Save to Article Feed": {
      "main": [
        [
          {
            "node": "Update URL Status",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Calculate Tier & Status": {
      "main": [
        [
          {
            "node": "Save to Article Feed",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Normalize Extracted Data": {
      "main": [
        [
          {
            "node": "Freshness Filter (45 days)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Schedule (Every 4 Hours)": {
      "main": [
        [
          {
            "node": "Read Pending URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Freshness Filter (45 days)": {
      "main": [
        [
          {
            "node": "Calculate Tier & Status",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Mark as Outdated",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract: Fallback (Universal)": {
      "main": [
        [
          {
            "node": "Freshness Filter (45 days)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
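
The `connections` object above maps each node name to a list of outputs, and each output to a list of target nodes. As a rough sketch of how to read it, the Python snippet below walks a hand-copied subset of that object. The helper names `targets` and `reachable` are illustrative and not part of n8n.

```python
from collections import deque

# Subset of the "connections" object from the workflow JSON, copied by hand.
CONNECTIONS = {
    "Manual Trigger": {"main": [[{"node": "Read Pending URLs", "type": "main", "index": 0}]]},
    "Read Pending URLs": {"main": [[{"node": "Loop Over URLs", "type": "main", "index": 0}]]},
    "Loop Over URLs": {"main": [
        [{"node": "Completion Summary", "type": "main", "index": 0}],
        [{"node": "Rate Limit (3s)", "type": "main", "index": 0}],
    ]},
    "Rate Limit (3s)": {"main": [[{"node": "Fetch HTML", "type": "main", "index": 0}]]},
    "Fetch HTML": {"main": [[{"node": "Source Router", "type": "main", "index": 0}]]},
}


def targets(connections, node):
    """Return one list of target node names per output of `node`."""
    outputs = connections.get(node, {}).get("main", [])
    return [[link["node"] for link in output] for output in outputs]


def reachable(connections, start):
    """Breadth-first list of nodes reachable from `start`, each visited once."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for output in targets(connections, node):
            for name in output:
                if name not in seen:  # guards against the loop back to "Loop Over URLs"
                    seen.add(name)
                    queue.append(name)
    return order


if __name__ == "__main__":
    print(targets(CONNECTIONS, "Loop Over URLs"))
    print(reachable(CONNECTIONS, "Manual Trigger"))
```

Read this way, "Loop Over URLs" has two outputs: the first leads to "Completion Summary" and the second feeds each URL into "Rate Limit (3s)" and then "Fetch HTML". The `seen` set matters on the full graph, because "Update URL Status" connects back to "Loop Over URLs".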

Credentials you'll need

Each integration node will prompt for credentials when you import. We strip credential IDs before publishing — you'll add your own.

About this workflow

Intelligent Source Routing - Uses a Switch node to route URLs to specialized extractors based on source identifier, enabling custom CSS selectors per publisher for maximum accuracy.

Universal Fallback Parser - Advanced regex-based extractor handles unknown sources automatically,…
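
The routing-plus-fallback idea can be sketched outside n8n in a few lines of Python. This is only an illustration under stated assumptions: the regex patterns, the `publishedAt` key, the `site_a` identifier, and the function names are hypothetical, and the workflow's own fallback node may use different logic. The patterns target the same elements as the extractor selectors in the JSON (`time[datetime]`, `meta[property='og:image']`, `link[rel='canonical']`).

```python
import re

# Illustrative patterns only; they assume the attribute order shown here
# (e.g. property before content) and would miss markup written the other way round.
FALLBACK_PATTERNS = {
    "publishedAt": r"<time[^>]+datetime=['\"]([^'\"]+)['\"]",
    "imageUrl": r"<meta[^>]+property=['\"]og:image['\"][^>]*content=['\"]([^'\"]+)['\"]",
    "canonicalUrl": r"<link[^>]+rel=['\"]canonical['\"][^>]*href=['\"]([^'\"]+)['\"]",
}


def extract_fallback(html):
    """Regex-based extraction used when no site-specific extractor is registered."""
    result = {}
    for key, pattern in FALLBACK_PATTERNS.items():
        match = re.search(pattern, html, re.IGNORECASE)
        result[key] = match.group(1) if match else None
    return result


def extract_site_a(html):
    """Hypothetical site-specific extractor; a real one would use that site's selectors."""
    data = extract_fallback(html)
    data["extractor"] = "site_a"
    return data


# Source identifier -> extractor, mirroring the outputs of a Switch node.
EXTRACTORS = {"site_a": extract_site_a}


def route(source, html):
    """Pick the extractor for `source`, falling back to the universal parser."""
    extractor = EXTRACTORS.get(source, extract_fallback)
    return extractor(html)


SAMPLE_HTML = (
    '<link rel="canonical" href="https://example.com/a">'
    '<meta property="og:image" content="https://example.com/i.png">'
    '<time datetime="2025-01-02T00:00:00Z">Jan 2</time>'
)

if __name__ == "__main__":
    print(route("site_a", SAMPLE_HTML))
    print(route("unknown_site", SAMPLE_HTML))
```

Registering a new publisher then means adding one entry to `EXTRACTORS`, which is the code-level analogue of adding an output to the Switch node and wiring it to a new extractor.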

Source: https://n8n.io/workflows/11224/ (original creator credit)