This workflow corresponds to n8n.io template #16352 — we link there as the canonical source.
This workflow follows the documentDefaultDataLoader → OpenAI Embeddings recipe pattern.
The workflow JSON
Copy or download the full n8n JSON below. Paste it into a new n8n workflow, add your credentials, and activate it.
{
"id": "BYBLI1Ib8gzNOOfx",
"name": "BytezTech Chat RAG Web Scrapt FAQ",
"tags": [],
"nodes": [
{
"id": "1675b46e-b5e1-4ced-b5c2-8870d197f777",
"name": "\ud83d\udccb Workflow Overview (Read Me First)",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1024,
144
],
"parameters": {
"width": 576,
"height": 680,
"content": "## Website FAQ Sync to Pinecone RAG Pipeline\n\n### Required Credentials\n* **OpenAI API:** Required for GPT-4o Q&A generation and text-embedding-3-small.\n* **Pinecone API:** Requires a pre-created Pinecone index (1536 dimensions).\n\n### Required Node Configurations\n1. **Get Sitemap Index:** Replace the placeholder URL with your actual XML sitemap URL.\n2. **Upsert FAQ Chunks to Pinecone:** Set your Pinecone index name and namespace.\n3. **OpenAI / Pinecone Nodes:** Select your connected credentials in all respective API nodes.\n4. *(Optional)* **Schedule Trigger:** Adjust the cron expression to change the sync frequency.\n5. *(Optional)* **Build GPT Request:** Edit the system prompt to match your desired output formatting.\n6. *(Optional)* **Flatten & Filter All URLs:** Modify the `skipList` variable to exclude specific website paths."
},
"typeVersion": 1
},
{
"id": "2de5cd7f-acd0-44f5-8091-020ae2c5129e",
"name": "\ud83d\udce1 Phase 1 \u2013 Sitemap Discovery",
"type": "n8n-nodes-base.stickyNote",
"position": [
-272,
-16
],
"parameters": {
"color": 7,
"width": 868,
"height": 448,
"content": "## Phase 1 \u2013 Sitemap Discovery\n\nFetches the XML sitemap index from your website, parses it into JSON, then extracts and iterates through each sub-sitemap URL to collect all individual page URLs.\n\n**Configure:** Update the URL in ** Get Sitemap Index** to point to your own sitemap (e.g. `https://yoursite.com/sitemap_index.xml`)."
},
"typeVersion": 1
},
{
"id": "5c934bfe-6744-45dd-9740-0908bacc0cbe",
"name": "\ud83d\udd17 Phase 2 \u2013 URL Filtering & Batching",
"type": "n8n-nodes-base.stickyNote",
"position": [
672,
0
],
"parameters": {
"color": 7,
"width": 888,
"height": 432,
"content": "## \ud83d\udd17 Phase 2 \u2013 URL Filtering & Batch Loop\n\nAll collected page URLs are merged, deduplicated, and filtered to remove assets, admin paths, CDN files, and third-party links. Clean URLs are then passed into a batch loop (10 at a time) for efficient scraping.\n\n\u27a1\ufe0f **Tip:** Increase `batchSize` in ** Loop URLs in Batches** if your site has 100+ pages and your API rate limits allow it."
},
"typeVersion": 1
},
{
"id": "46cd10f8-1774-4f89-b78c-1a0d12237cf2",
"name": "\ud83d\udd77\ufe0f Phase 3 \u2013 Scraping & Extraction",
"type": "n8n-nodes-base.stickyNote",
"position": [
-304,
464
],
"parameters": {
"color": 7,
"width": 672,
"height": 500,
"content": "## Phase 3 \u2013 Page Scraping & Content Extraction\n\nEach URL is fetched as raw HTML. Script, style, nav, footer, and other non-content tags are stripped out. The node then extracts: page title, meta description, H1\u2013H3 headings, body paragraphs, and list items \u2014 assembling clean text up to 5,000 characters.\n\nPages returning fewer than 100 characters of content are skipped automatically."
},
"typeVersion": 1
},
{
"id": "1c391945-4489-473b-82ff-5b1cf2684899",
"name": "\ud83e\udde0 Phase 4 \u2013 AI FAQ Generation",
"type": "n8n-nodes-base.stickyNote",
"position": [
416,
464
],
"parameters": {
"color": 7,
"width": 860,
"height": 528,
"content": "## Phase 4 \u2013 AI FAQ Generation (GPT-4o)\n\nThe cleaned page text is sent to GPT-4o with a structured prompt that requires:\n- At least one FAQ per major heading/section for full coverage\n- Zero duplicate or overlapping questions\n- Structured JSON output: `{ question, answer, topic, author }`\n\nEach FAQ is assigned a deterministic `chunk_id` (based on URL + index) so that weekly re-runs safely overwrite existing vectors without creating duplicates in Pinecone.\n\n **Tip:** Edit the system prompt inside ** Build GPT Request ** to match your brand tone and company name."
},
"typeVersion": 1
},
{
"id": "25f900f2-c38c-46ce-8fe8-99a162dc6455",
"name": "\ud83d\udce6 Phase 5 \u2013 Embedding & Upsert",
"type": "n8n-nodes-base.stickyNote",
"position": [
1584,
416
],
"parameters": {
"color": 7,
"width": 736,
"height": 648,
"content": "## Phase 5 \u2013 Embedding & Pinecone Upsert\n\nEach validated FAQ chunk is embedded using **text-embedding-3-small** (1536 dimensions) and upserted into Pinecone using its deterministic `chunk_id` as the vector ID.\n\nThis ensures idempotent weekly syncs \u2014 re-running never creates duplicate vectors.\n\nA **2-second wait** between each batch prevents OpenAI and Pinecone API rate-limit errors.\n\n **Configure:** Set your Pinecone **index name** and **namespace** inside ** Upsert FAQ Chunks to Pinecone**."
},
"typeVersion": 1
},
{
"id": "6a4f5cfd-7b49-4de1-9616-acf835bcaabe",
"name": "Schedule Trigger \u2013 Every Monday Midnight IST",
"type": "n8n-nodes-base.scheduleTrigger",
"notes": "Cron: 30 18 * * 0 = Every Sunday 18:30 UTC = Monday 00:00 IST",
"position": [
-224,
224
],
"parameters": {
"rule": {
"interval": [
{
"field": "cronExpression",
"expression": "30 18 * * 0"
}
]
}
},
"typeVersion": 1.1
},
{
"id": "902e83fe-610e-49a6-98ff-d4c9e7f4fd83",
"name": "Get Sitemap Index",
"type": "n8n-nodes-base.httpRequest",
"position": [
0,
224
],
"parameters": {
"url": "https://example.com/sitemap_index.xml",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "9e07caf6-425e-49e0-bae4-6a46aa572901",
"name": "Parse Sitemap Index XML",
"type": "n8n-nodes-base.xml",
"position": [
224,
224
],
"parameters": {
"options": {}
},
"typeVersion": 1
},
{
"id": "f0830f70-428e-4fe6-ab54-c3e60aeb5e60",
"name": "Extract Sub-Sitemap URLs",
"type": "n8n-nodes-base.code",
"position": [
448,
224
],
"parameters": {
"jsCode": "const sitemaps = $input.first().json.sitemapindex?.sitemap || [];\nconst sitemapArray = Array.isArray(sitemaps) ? sitemaps : [sitemaps];\n\nreturn sitemapArray.map(s => ({\n json: { loc: typeof s.loc === 'string' ? s.loc : (s.loc._text || s.loc['#text'] || '') }\n})).filter(item => item.json.loc);"
},
"typeVersion": 2
},
{
"id": "d07352fe-8fb3-4674-99bf-4c1f3f5aaf2b",
"name": "Scrape Page HTML",
"type": "n8n-nodes-base.httpRequest",
"onError": "continueRegularOutput",
"position": [
-256,
768
],
"parameters": {
"url": "={{ $json.loc }}",
"options": {
"timeout": 20000,
"response": {
"response": {
"responseFormat": "text"
}
},
"allowUnauthorizedCerts": true
}
},
"typeVersion": 4.2
},
{
"id": "099c229c-f93f-4623-902d-ecc688c7fa21",
"name": "Extract Text, Headings & Metadata",
"type": "n8n-nodes-base.code",
"position": [
-32,
768
],
"parameters": {
"jsCode": "const SKIP_LIST = [\n \"/wp-\", \"/wp-content\", \"/wp-admin\", \"/cdn-cgi\",\n \".png\", \".jpg\", \".jpeg\", \".gif\", \".svg\", \".webp\",\n \".pdf\", \".css\", \".js\", \".xml\", \".ico\", \".woff\",\n \"google.com\", \"facebook.com\", \"shortpixel\"\n];\n\nfunction stripTags(html) {\n let result = \"\";\n let inside = false;\n for (const ch of html) {\n if (ch === \"<\") inside = true;\n else if (ch === \">\") inside = false;\n else if (!inside) result += ch;\n }\n return result;\n}\n\nfunction decodeEntities(text) {\n return text\n .replace(/&/g, \"&\")\n .replace(/</g, \"<\")\n .replace(/>/g, \">\")\n .replace(/ /g, \" \")\n .replace(/"/g, '\"')\n .replace(/'/g, \"'\")\n .replace(/–/g, \"-\")\n .replace(/—/g, \"-\")\n .replace(/–/g, \"-\")\n .replace(/—/g, \"-\")\n .replace(/‘/g, \"'\")\n .replace(/’/g, \"'\")\n .replace(/“/g, '\"')\n .replace(/”/g, '\"');\n}\n\nfunction collapseSpace(text) {\n let t = text;\n while (t.includes(\" \")) t = t.replace(/ /g, \" \");\n while (t.includes(\"\\n\\n\\n\")) t = t.replace(/\\n\\n\\n/g, \"\\n\\n\");\n return t.trim();\n}\n\nfunction extractBetween(html, openTag, closeTag) {\n const results = [];\n const lower = html.toLowerCase();\n const openL = openTag.toLowerCase();\n const closeL = closeTag.toLowerCase();\n let pos = 0;\n const length = html.length;\n\n while (pos < length) {\n const start = lower.indexOf(openL, pos);\n if (start === -1) break;\n const tagEnd = lower.indexOf(\">\", start);\n if (tagEnd === -1) break;\n const contentStart = tagEnd + 1;\n const end = lower.indexOf(closeL, contentStart);\n if (end === -1) break;\n results.push(html.slice(contentStart, end));\n pos = end + closeL.length;\n }\n return results;\n}\n\nfunction removeTagBlocks(html, tag) {\n const openT = \"<\" + tag;\n const closeT = \"</\" + tag + \">\";\n const lower = html.toLowerCase();\n let result = \"\";\n let pos = 0;\n const length = html.length;\n\n while (pos < length) {\n const idx = lower.indexOf(openT, pos);\n if 
(idx === -1) {\n result += html.slice(pos);\n break;\n }\n result += html.slice(pos, idx);\n const end = lower.indexOf(closeT, idx);\n if (end === -1) break;\n pos = end + closeT.length;\n }\n return result;\n}\n\nfunction extractMeta(html, name) {\n const lower = html.toLowerCase();\n let needle = `name=\"${name}\"`;\n let idx = lower.indexOf(needle);\n if (idx === -1) {\n needle = `name='${name}'`;\n idx = lower.indexOf(needle);\n }\n if (idx === -1) return \"\";\n\n const blockStart = lower.lastIndexOf(\"<meta\", idx);\n if (blockStart === -1) return \"\";\n const blockEnd = lower.indexOf(\">\", blockStart);\n if (blockEnd === -1) return \"\";\n\n const block = html.slice(blockStart, blockEnd);\n const lowerBlock = block.toLowerCase();\n\n for (const q of ['\"', \"'\"]) {\n const key = \"content=\" + q;\n const ci = lowerBlock.indexOf(key);\n if (ci !== -1) {\n const start = ci + key.length;\n const end = block.indexOf(q, start);\n if (end !== -1) return block.slice(start, end).trim();\n }\n }\n return \"\";\n}\n\nfunction extractOg(html, propertyName) {\n const lower = html.toLowerCase();\n let needle = `property=\"og:${propertyName}\"`;\n let idx = lower.indexOf(needle);\n if (idx === -1) {\n needle = `property='og:${propertyName}'`;\n idx = lower.indexOf(needle);\n }\n if (idx === -1) return \"\";\n\n const blockStart = lower.lastIndexOf(\"<meta\", idx);\n if (blockStart === -1) return \"\";\n const blockEnd = lower.indexOf(\">\", blockStart);\n if (blockEnd === -1) return \"\";\n\n const block = html.slice(blockStart, blockEnd);\n const lowerBlock = block.toLowerCase();\n\n for (const q of ['\"', \"'\"]) {\n const key = \"content=\" + q;\n const ci = lowerBlock.indexOf(key);\n if (ci !== -1) {\n const start = ci + key.length;\n const end = block.indexOf(q, start);\n if (end !== -1) return block.slice(start, end).trim();\n }\n }\n return \"\";\n}\n\nfunction cleanText(raw) {\n let text = stripTags(raw);\n text = decodeEntities(text);\n text = 
collapseSpace(text);\n return text;\n}\n\nconst result = [];\n\nfor (const item of $input.all()) {\n let html = item.json.data || \"\";\n const url = item.json.url || \"\";\n const section = item.json.section || \"\";\n\n if (html.length < 200) continue;\n\n for (const tag of [\"script\", \"style\", \"nav\", \"footer\", \"header\", \"noscript\", \"iframe\", \"svg\", \"button\", \"form\"]) {\n html = removeTagBlocks(html, tag);\n }\n\n // Title\n let titleRaw = \"\";\n const lowerHtml = html.toLowerCase();\n const tStart = lowerHtml.indexOf(\"<title\");\n if (tStart !== -1) {\n const tEnd = lowerHtml.indexOf(\"</title>\", tStart);\n if (tEnd !== -1) {\n const tagClose = html.indexOf(\">\", tStart);\n titleRaw = html.slice(tagClose + 1, tEnd);\n }\n }\n let title = cleanText(titleRaw);\n if (title.includes(\" | \")) title = title.slice(0, title.indexOf(\" | \")).trim();\n if (title.includes(\" - \")) title = title.slice(0, title.indexOf(\" - \")).trim();\n\n // Description\n let description = extractMeta(html, \"description\");\n if (!description) description = extractOg(html, \"description\");\n description = decodeEntities(description);\n\n // Author\n let author = extractMeta(html, \"author\");\n if (!author) author = \"Content Team\";\n\n // OG Image\n const ogImage = extractOg(html, \"image\");\n\n // Headings\n let headings = [];\n for (const tag of [\"h1\", \"h2\", \"h3\"]) {\n const chunks = extractBetween(html, \"<\" + tag, \"</\" + tag + \">\");\n for (const chunk of chunks.slice(0, 6)) {\n const h = cleanText(chunk);\n if (h && h.length > 3 && h.length < 200) headings.push(h);\n }\n }\n headings = headings.slice(0, 10);\n\n // Paragraphs\n const paraChunks = extractBetween(html, \"<p\", \"</p>\");\n let paragraphs = [];\n for (const chunk of paraChunks) {\n const p = cleanText(chunk);\n if (p.length > 40) paragraphs.push(p);\n }\n paragraphs = paragraphs.slice(0, 20);\n\n // List items\n const liChunks = extractBetween(html, \"<li\", \"</li>\");\n let 
listItems = [];\n for (const chunk of liChunks) {\n const li = cleanText(chunk);\n if (li.length > 15 && li.length < 300) listItems.push(\"- \" + li);\n }\n listItems = listItems.slice(0, 20);\n\n // Build full text\n const parts = [];\n if (title) parts.push(\"PAGE: \" + title);\n if (description) parts.push(\"DESCRIPTION: \" + description);\n if (headings.length) parts.push(\"HEADINGS: \" + headings.join(\" | \"));\n if (paragraphs.length) parts.push(\"CONTENT: \" + paragraphs.join(\" \"));\n if (listItems.length) parts.push(\"POINTS: \" + listItems.join(\" \"));\n\n let fullText = parts.join(\" \");\n fullText = fullText.slice(0, 5000);\n\n if (fullText.length < 100) continue;\n\n result.push({\n json: {\n url,\n title,\n author,\n description,\n og_image: ogImage,\n section,\n headings: headings.slice(0, 5).join(\" | \"),\n text: fullText,\n char_count: fullText.length\n }\n });\n}\n\nreturn result;"
},
"typeVersion": 2
},
{
"id": "42e6dbb9-a863-4b3e-a8b6-3884e4bd6fc0",
"name": "Filter: Has Enough Content",
"type": "n8n-nodes-base.filter",
"position": [
192,
768
],
"parameters": {
"options": {},
"conditions": {
"options": {
"version": 1,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "c1",
"operator": {
"type": "string",
"operation": "notEmpty"
},
"leftValue": "={{ $json.text }}",
"rightValue": ""
},
{
"id": "c2",
"operator": {
"type": "number",
"operation": "gt"
},
"leftValue": "={{ $json.char_count }}",
"rightValue": 100
}
]
}
},
"typeVersion": 2
},
{
"id": "8c54d35c-6fe5-4241-80d5-1ec585096899",
"name": "GPT-4o: Generate FAQs",
"type": "n8n-nodes-base.httpRequest",
"position": [
688,
800
],
"parameters": {
"url": "https://api.openai.com/v1/chat/completions",
"method": "POST",
"options": {
"response": {
"response": {
"responseFormat": "json"
}
}
},
"jsonBody": "={{ $json.gpt_body }}",
"sendBody": true,
"specifyBody": "json",
"authentication": "predefinedCredentialType",
"nodeCredentialType": "openAiApi"
},
"typeVersion": 4.2
},
{
"id": "42c0fc18-4931-4887-b2cb-d648229ad1f7",
"name": "Build GPT Request",
"type": "n8n-nodes-base.code",
"position": [
464,
800
],
"parameters": {
"jsCode": "// Builds a fully escaped GPT-4o request body using JSON.stringify.\n// Also carries all page metadata forward so the Parse FAQs node can access it.\n\nconst items_out = [];\n\nfor (const item of $input.all()) {\n const url = item.json.url || '';\n const title = item.json.title || '';\n const author = item.json.author || 'Content Team';\n const section = item.json.section || '';\n const description = item.json.description || '';\n const headings = item.json.headings || '';\n const text = item.json.text || '';\n const og_image = item.json.og_image || '';\n\n const headingCount = headings ? headings.split('|').length : 0;\n\n const systemPrompt = `You are a professional FAQ generator. Your goal is to help website visitors quickly find answers to any question they might have after reading a page.\n\nTASK: Given the content of ONE webpage, generate a COMPLETE set of FAQs covering EVERY distinct section, heading, feature, benefit, statistic, and concept mentioned \u2014 so that a visitor's question on ANY part of this page can be answered.\n\nRULES:\n1. Coverage: For EACH major heading/section, generate AT LEAST ONE FAQ capturing its core point. Generate separate FAQs for distinct facts within the same section. No upper limit \u2014 generate as many as needed for full coverage.\n2. No repetition: Each FAQ must cover a DIFFERENT point. No two FAQs should overlap in topic or answer.\n3. Specificity: Every question must be answerable using ONLY the provided content.\n4. Topic tag: Include a \"topic\" field \u2014 a short 2\u20135 word category label matching the section it came from.\n5. Answers: 2\u20134 sentences, factual, based only on the given content. No hallucination.\n6. Author field: Always include an \"author\" field copied exactly from the input Author value.\n7. Low-content pages: If the page has very little real content (contact page, legal page), return [].\n8. Output: Return ONLY a valid JSON array. 
No preamble, no markdown, no code fences.\n\nOutput format: [{\"question\": \"...\", \"answer\": \"...\", \"topic\": \"...\", \"author\": \"...\"}, ...]\nIf no good FAQs can be generated, return: []`;\n\n const userPrompt = 'Page URL: ' + url + '\\nPage Title: ' + title + '\\nSection: ' + section + '\\nAuthor: ' + author + '\\nMeta Description: ' + description + '\\nKey Headings (' + headingCount + ' total): ' + headings + '\\n\\n--- Page Content ---\\n' + text + '\\n--- End Content ---\\n\\nThis page has ' + headingCount + ' major headings/sections. Generate FAQs covering ALL of them plus the intro/conclusion, with no duplicate or overlapping questions. Include \"author\": \"' + author + '\" in every FAQ object. Return as a JSON array with question, answer, topic, and author fields. Return [] only if the page has no substantive content:';\n\n const gpt_body = JSON.stringify({\n model: 'gpt-4o',\n max_tokens: 4096,\n temperature: 0.2,\n messages: [\n { role: 'system', content: systemPrompt },\n { role: 'user', content: userPrompt }\n ]\n });\n\n items_out.push({\n json: {\n url, title, author, section, description, headings, og_image,\n heading_count: headingCount,\n gpt_body\n }\n });\n}\n\nreturn items_out;"
},
"typeVersion": 2
},
{
"id": "53c01e8e-5438-42a1-b7d1-d1d0be2d339d",
"name": "Parse FAQs & Build Chunks",
"type": "n8n-nodes-base.code",
"position": [
912,
800
],
"parameters": {
"jsCode": "// Parses the GPT-4o response and builds individual FAQ chunk items.\n// Retrieves page metadata from the paired upstream node (\ud83d\udd27 Build GPT Request).\n\nfunction findJsonArray(text) {\n const start = text.indexOf('[');\n const end = text.lastIndexOf(']');\n if (start === -1 || end === -1 || end <= start) return '';\n return text.slice(start, end + 1);\n}\n\nfunction makeChunkId(url, index) {\n let slug = url\n .replace(/https?:\\/\\//, '')\n .replace(/www\\./, '')\n .replace(/[^a-zA-Z0-9]/g, '_')\n .replace(/_+/g, '_')\n .replace(/^_+|_+$/g, '')\n .slice(0, 70);\n return slug + '__faq_' + String(index).padStart(2, '0');\n}\n\nconst items_out = [];\n\nfor (const item of $input.all()) {\n\n const choices = item.json.choices || [];\n if (!choices.length) continue;\n\n const raw = (choices[0]?.message?.content || '').trim();\n if (!raw) continue;\n\n // Get metadata from the paired upstream node\n const meta = $('Build GPT Request').item.json;\n\n const arrayStr = findJsonArray(raw);\n if (!arrayStr) continue;\n\n let faqs = [];\n try {\n faqs = JSON.parse(arrayStr);\n } catch(e) {\n continue;\n }\n\n if (!Array.isArray(faqs)) continue;\n\n for (let i = 0; i < faqs.length; i++) {\n const q = (faqs[i].question || '').trim();\n const a = (faqs[i].answer || '').trim();\n\n if (!q || !a || q.length < 10 || a.length < 10) continue;\n\n items_out.push({\n json: {\n chunk_id : makeChunkId(meta.url, i),\n chunk_text : 'Q: ' + q + ' A: ' + a,\n question : q,\n answer : a,\n url : meta.url,\n title : meta.title,\n author : meta.author,\n section : meta.section,\n og_image : meta.og_image,\n faq_index : i\n }\n });\n }\n}\n\nreturn items_out;"
},
"typeVersion": 2
},
{
"id": "f8eab7fc-bca8-4807-88d2-d65afb514ccd",
"name": "Filter: Valid FAQ Chunks Only",
"type": "n8n-nodes-base.filter",
"position": [
1136,
800
],
"parameters": {
"options": {},
"conditions": {
"options": {
"version": 1,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "c3",
"operator": {
"type": "string",
"operation": "notEmpty"
},
"leftValue": "={{ $json.chunk_text }}",
"rightValue": ""
},
{
"id": "c4",
"operator": {
"type": "string",
"operation": "notEquals"
},
"leftValue": "={{ $json.chunk_id }}",
"rightValue": "error"
}
]
}
},
"typeVersion": 2
},
{
"id": "7550a92d-0d71-4855-89c8-8a01d7cc4411",
"name": "Fetch Sub-Sitemap",
"type": "n8n-nodes-base.httpRequest",
"position": [
736,
224
],
"parameters": {
"url": "={{ $json.loc }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "1edaf4b5-9f55-49e9-b8fc-403a8aa1ae4b",
"name": "Parse Sub-Sitemap XML",
"type": "n8n-nodes-base.xml",
"position": [
960,
224
],
"parameters": {
"options": {}
},
"typeVersion": 1
},
{
"id": "da306605-d8c9-434f-b2dd-c48d149d10db",
"name": "Flatten & Filter All URLs",
"type": "n8n-nodes-base.code",
"position": [
1184,
224
],
"parameters": {
"jsCode": "const skipList = [\n \"/wp-\", \"/wp-content\", \"/wp-admin\", \"/cdn-cgi\",\n \".png\", \".jpg\", \".jpeg\", \".gif\", \".svg\", \".webp\",\n \".pdf\", \".css\", \".js\", \".xml\", \".ico\", \".woff\",\n \"google.com\", \"facebook.com\", \"shortpixel\"\n];\n\nlet allUrls = [];\n\n// Gather all URLs from all sub-sitemaps\nfor (const item of $input.all()) {\n const urls = item.json?.urlset?.url || [];\n const urlArray = Array.isArray(urls) ? urls : [urls];\n\n for (const u of urlArray) {\n if (!u || !u.loc) continue;\n const locString = typeof u.loc === 'string' ? u.loc : (u.loc._text || u.loc['#text'] || '');\n if (locString) allUrls.push(locString);\n }\n}\n\n// Filter out asset/admin URLs and deduplicate\nconst cleanUrls = [...new Set(allUrls)].filter(url => {\n const lowerUrl = url.toLowerCase();\n return !skipList.some(skip => lowerUrl.includes(skip));\n});\n\n// Output as individual items for the Loop node\nreturn cleanUrls.map(url => ({ json: { loc: url } }));"
},
"typeVersion": 2
},
{
"id": "dcc34f66-1052-4932-9d11-a387b7f9bb3c",
"name": "Loop URLs in Batches",
"type": "n8n-nodes-base.splitInBatches",
"position": [
1408,
224
],
"parameters": {
"options": {},
"batchSize": 10
},
"typeVersion": 3
},
{
"id": "14bf0d92-9e31-40b7-9598-8962c406ee6b",
"name": "Upsert FAQ Chunks to Pinecone",
"type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
"position": [
1712,
688
],
"parameters": {
"mode": "insert",
"options": {
"pineconeNamespace": "rag-context"
},
"pineconeIndex": {
"__rl": true,
"mode": "list",
"value": "your-pinecone-index-name",
"cachedResultName": "your-pinecone-index-name"
}
},
"typeVersion": 1.3
},
{
"id": "23cb9cc5-6ae9-415b-a752-8cb6b75dc9c8",
"name": "OpenAI Text Embeddings (text-embedding-3-small)",
"type": "@n8n/n8n-nodes-langchain.embeddingsOpenAi",
"position": [
1728,
912
],
"parameters": {
"options": {
"dimensions": 1536
}
},
"typeVersion": 1.2
},
{
"id": "7faf6bad-9bf0-49d7-8e4c-ba58fda3fddf",
"name": "Load FAQ Chunk Text",
"type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
"position": [
1856,
912
],
"parameters": {
"options": {},
"jsonData": "={{ $json.chunk_text }}",
"jsonMode": "expressionData"
},
"typeVersion": 1.1
},
{
"id": "e71a5320-741b-48ab-b955-92ead2905a7f",
"name": "Wait 2s Between Batches",
"type": "n8n-nodes-base.wait",
"position": [
2064,
688
],
"parameters": {
"amount": 2
},
"typeVersion": 1.1
}
],
"active": false,
"settings": {
"binaryMode": "separate",
"availableInMCP": false,
"executionOrder": "v1"
},
"versionId": "283871cd-cfd8-4f13-ac56-5f90d0816455",
"nodeGroups": [],
"connections": {
"Scrape Page HTML": {
"main": [
[
{
"node": "Extract Text, Headings & Metadata",
"type": "main",
"index": 0
}
]
]
},
"Build GPT Request": {
"main": [
[
{
"node": "GPT-4o: Generate FAQs",
"type": "main",
"index": 0
}
]
]
},
"Fetch Sub-Sitemap": {
"main": [
[
{
"node": "Parse Sub-Sitemap XML",
"type": "main",
"index": 0
}
]
]
},
"Get Sitemap Index": {
"main": [
[
{
"node": "Parse Sitemap Index XML",
"type": "main",
"index": 0
}
]
]
},
"Load FAQ Chunk Text": {
"ai_document": [
[
{
"node": "Upsert FAQ Chunks to Pinecone",
"type": "ai_document",
"index": 0
}
]
]
},
"Loop URLs in Batches": {
"main": [
[],
[
{
"node": "Scrape Page HTML",
"type": "main",
"index": 0
}
]
]
},
"GPT-4o: Generate FAQs": {
"main": [
[
{
"node": "Parse FAQs & Build Chunks",
"type": "main",
"index": 0
}
]
]
},
"Parse Sub-Sitemap XML": {
"main": [
[
{
"node": "Flatten & Filter All URLs",
"type": "main",
"index": 0
}
]
]
},
"Parse Sitemap Index XML": {
"main": [
[
{
"node": "Extract Sub-Sitemap URLs",
"type": "main",
"index": 0
}
]
]
},
"Wait 2s Between Batches": {
"main": [
[
{
"node": "Loop URLs in Batches",
"type": "main",
"index": 0
}
]
]
},
"Extract Sub-Sitemap URLs": {
"main": [
[
{
"node": "Fetch Sub-Sitemap",
"type": "main",
"index": 0
}
]
]
},
"Flatten & Filter All URLs": {
"main": [
[
{
"node": "Loop URLs in Batches",
"type": "main",
"index": 0
}
]
]
},
"Parse FAQs & Build Chunks": {
"main": [
[
{
"node": "Filter: Valid FAQ Chunks Only",
"type": "main",
"index": 0
}
]
]
},
"Filter: Has Enough Content": {
"main": [
[
{
"node": "Build GPT Request",
"type": "main",
"index": 0
}
]
]
},
"Filter: Valid FAQ Chunks Only": {
"main": [
[
{
"node": "Upsert FAQ Chunks to Pinecone",
"type": "main",
"index": 0
}
]
]
},
"Upsert FAQ Chunks to Pinecone": {
"main": [
[
{
"node": "Wait 2s Between Batches",
"type": "main",
"index": 0
}
]
]
},
"Extract Text, Headings & Metadata": {
"main": [
[
{
"node": "Filter: Has Enough Content",
"type": "main",
"index": 0
}
]
]
},
"Schedule Trigger \u2013 Every Monday Midnight IST": {
"main": [
[
{
"node": "Get Sitemap Index",
"type": "main",
"index": 0
}
]
]
},
"OpenAI Text Embeddings (text-embedding-3-small)": {
"ai_embedding": [
[
{
"node": "Upsert FAQ Chunks to Pinecone",
"type": "ai_embedding",
"index": 0
}
]
]
}
}
}
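Before importing, it can help to confirm that every entry under `connections` points at a node that actually exists in `nodes`. The sketch below is one way to do that in plain JavaScript. `findDanglingConnections` and the `sample` object are hypothetical names introduced here for illustration; they are not part of n8n or of this workflow, and the sample is a deliberately incomplete stand-in for the export above.

```javascript
// Hypothetical helper: lists connections whose source or target is not in `nodes`.
function findDanglingConnections(workflow) {
  const names = new Set((workflow.nodes || []).map((n) => n.name));
  const problems = [];
  for (const [source, outputs] of Object.entries(workflow.connections || {})) {
    if (!names.has(source)) problems.push("unknown source: " + source);
    // Each output type ("main", "ai_document", ...) holds an array of branches.
    for (const branches of Object.values(outputs)) {
      for (const branch of branches) {
        for (const target of branch || []) {
          if (!names.has(target.node)) problems.push(source + " -> " + target.node);
        }
      }
    }
  }
  return problems;
}

// Stand-in workflow with one target deliberately missing from `nodes`.
const sample = {
  nodes: [{ name: "Get Sitemap Index" }, { name: "Parse Sitemap Index XML" }],
  connections: {
    "Get Sitemap Index": {
      main: [[{ node: "Parse Sitemap Index XML", type: "main", index: 0 }]],
    },
    "Parse Sitemap Index XML": {
      main: [[{ node: "Extract Sub-Sitemap URLs", type: "main", index: 0 }]],
    },
  },
};

const danglingConnections = findDanglingConnections(sample);
console.log(danglingConnections);
// prints [ 'Parse Sitemap Index XML -> Extract Sub-Sitemap URLs' ]
```

To check the real export, parse the copied JSON with `JSON.parse` and pass the result in place of `sample`; an empty array means every connection resolves.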
About this workflow
This workflow runs weekly and crawls your website sitemap, scrapes each page, generates page-specific FAQs with OpenAI GPT-4o, embeds the Q&A content using OpenAI text-embedding-3-small, and upserts the vectors into a Pinecone index to keep a RAG knowledge base in sync. A weekly…
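The workflow's sticky notes describe each FAQ as getting a deterministic `chunk_id` built from the page URL and the FAQ index, so that a re-run writes to the same key instead of adding a new one. The sketch below copies the `makeChunkId` logic from the "Parse FAQs & Build Chunks" node and demonstrates the idea against a plain `Map`. The `store` and `upsert` names and the example URL are hypothetical stand-ins; this does not call Pinecone, and whether the Pinecone node uses `chunk_id` as the vector ID depends on how that node is configured.

```javascript
// Same slug logic as the workflow's "Parse FAQs & Build Chunks" node.
function makeChunkId(url, index) {
  const slug = url
    .replace(/https?:\/\//, "")
    .replace(/www\./, "")
    .replace(/[^a-zA-Z0-9]/g, "_")
    .replace(/_+/g, "_")
    .replace(/^_+|_+$/g, "")
    .slice(0, 70);
  return slug + "__faq_" + String(index).padStart(2, "0");
}

// Hypothetical in-memory store standing in for an index keyed by vector ID.
const store = new Map();
function upsert(id, text) {
  store.set(id, text); // same ID overwrites, a new ID inserts
}

const pageUrl = "https://www.example.com/pricing/";
upsert(makeChunkId(pageUrl, 0), "Q: first run A: ...");
upsert(makeChunkId(pageUrl, 0), "Q: second run A: ...");

const firstId = makeChunkId(pageUrl, 0);
console.log(firstId, store.size);
// prints example_com_pricing__faq_00 1
```

Because the ID depends only on the URL and the FAQ's position, it stays stable across runs only as long as the model returns FAQs in a comparable order; a page that yields fewer FAQs on a later run would leave the higher-index entries from the earlier run in place.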
Source: https://n8n.io/workflows/16352/ (original creator credit).