Website to AI-Ready Markdown in Google Sheets

Original n8n title: Web Crawler: Convert Websites to Ai-ready Markdown in Google Sheets

By Daniel Nkencho (@daniel-automates) on n8n.io

Transform any website into a structured knowledge repository with this intelligent crawler that extracts hyperlinks from the homepage, intelligently filters images and content pages, and aggregates full Markdown-formatted content—perfect for fueling AI agents or building…

Category: Web Scraping · Trigger: Event · Nodes: 22 · Complexity: ★★★★☆ · Integrations: HTTP Request, Google Sheets

This workflow corresponds to n8n.io template #9594 — we link there as the canonical source.

This workflow follows the Google Sheets → HTTP Request recipe pattern — see all workflows that pair these two integrations.

The workflow JSON

Copy the full n8n JSON below, paste it into a new n8n workflow, add your credentials, and activate it.

{
  "meta": {
    "templateCredsSetupCompleted": true
  },
  "nodes": [
    {
      "id": "349e50cf-75b8-432c-818e-63f1ff3ead34",
      "name": "Overview Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1696,
        3104
      ],
      "parameters": {
        "color": 4,
        "width": 600,
        "height": 1112,
        "content": "# Automated Website Crawler for AI Knowledge Bases\n\n## \ud83d\udccb What This Template Does\nThis workflow crawls a website's homepage to extract all sublinks, filters images from content pages, scrapes and converts textual content to Markdown, then aggregates everything into Google Sheets\u2014ideal for building AI-ready knowledge bases or company dossiers.\n\n## \ud83d\udd27 Prerequisites\n- Google account with Sheets access\n- n8n instance\n\n## \ud83d\udd11 Required Credentials\n\n### Google Sheets OAuth2 API Setup\n1. Go to console.cloud.google.com \u2192 APIs & Services \u2192 Credentials\n2. Create OAuth client ID for Web application\n3. Add n8n redirect URI: https://your-n8n-instance.com/rest/oauth2-credential/callback\n4. Add to n8n as Google Sheets OAuth2 API and grant Sheets scopes\n\n## \u2699\ufe0f Configuration Steps\n1. Import JSON into n8n\n2. Set target URL in Set Website node\n3. Assign Google credential to Sheet nodes\n4. Update documentId and sheetName to your spreadsheet\n5. Ensure sheet has columns: Website, Links, Scraped Content, Images\n6. Test manually\n\n## \ud83c\udfaf Use Cases\n- Crawl company sites for knowledge base building\n- Extract content for AI agent training datasets\n- Gather competitor intel for market analysis\n- Archive dynamic sites for compliance\n\n## \u26a0\ufe0f Troubleshooting\n- No links: Check homepage <a> tags and test URL\n- Sheet errors: Verify columns and permissions\n- Truncated content: Adjust slice limit or split rows\n- Rate limits: Add Wait node after scraping"
      },
      "typeVersion": 1
    },
    {
      "id": "eb43d67c-01fc-4d83-bb2c-099938a57468",
      "name": "Note: Trigger and Setup",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2512,
        3072
      ],
      "parameters": {
        "color": 6,
        "width": 556,
        "height": 176,
        "content": "## \ud83d\uddb1\ufe0f Trigger & Setup Nodes\n\n**Purpose:** Manual Trigger starts the workflow; Set Website configures the target URL.\n\n**Note:** Update website_url in Set Website for your site; use Schedule Trigger for automation."
      },
      "typeVersion": 1
    },
    {
      "id": "3c8581cb-46cd-4f25-af5a-c52bc2f463c6",
      "name": "Set Website",
      "type": "n8n-nodes-base.set",
      "position": [
        2688,
        3296
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "a652f57e-210e-421e-b20b-781d6f4dc240",
              "name": "website_url",
              "type": "string",
              "value": "https://example.com"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "18201858-7764-4a14-9f6b-12e36eaf158b",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        2496,
        3296
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "b7435481-bed3-439f-933c-1c5e0142ad5c",
      "name": "Scrape Homepage",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueRegularOutput",
      "position": [
        2880,
        3296
      ],
      "parameters": {
        "url": "={{ $json.website_url }}",
        "options": {
          "redirect": {
            "redirect": {}
          },
          "allowUnauthorizedCerts": false
        }
      },
      "executeOnce": false,
      "typeVersion": 4.2,
      "alwaysOutputData": false
    },
    {
      "id": "ce13710d-24ca-47d4-a25c-8890c1592947",
      "name": "Note: Homepage Scraping",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3168,
        3488
      ],
      "parameters": {
        "color": 5,
        "width": 396,
        "height": 192,
        "content": "## \ud83c\udf10 Homepage Scraping Nodes\n\n**Purpose:** Scrape Homepage fetches HTML; Extract Links pulls hrefs from <a> tags; Split Links breaks array into items.\n\n**Note:** Handles redirects; targets all links for discovery."
      },
      "typeVersion": 1
    },
    {
      "id": "61a60f2c-f032-4b46-83ba-405df0ce05df",
      "name": "Extract Links from HTML",
      "type": "n8n-nodes-base.html",
      "position": [
        3088,
        3296
      ],
      "parameters": {
        "options": {
          "trimValues": true,
          "cleanUpText": true
        },
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "links",
              "attribute": "href",
              "cssSelector": "a",
              "returnArray": true,
              "returnValue": "attribute"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "582eeae0-fec0-4548-9c78-7c05ac5aaebc",
      "name": "Split Links",
      "type": "n8n-nodes-base.splitOut",
      "position": [
        3296,
        3296
      ],
      "parameters": {
        "options": {},
        "fieldToSplitOut": "links"
      },
      "typeVersion": 1
    },
    {
      "id": "17d59531-4d51-4494-8ae9-e91b81851a0b",
      "name": "Remove Duplicate Links",
      "type": "n8n-nodes-base.removeDuplicates",
      "position": [
        3520,
        3296
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 2
    },
    {
      "id": "d50fa2a9-1a58-4dad-8bd0-cfbd31aeae91",
      "name": "Filter Real Hyperlinks",
      "type": "n8n-nodes-base.filter",
      "position": [
        3696,
        3296
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "bd6c6da6-8af7-4809-b6cd-01a38d71953b",
              "operator": {
                "type": "string",
                "operation": "startsWith"
              },
              "leftValue": "={{ $json.links }}",
              "rightValue": "https://"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "cb121b70-a14a-4cbd-a54c-e55c6fc235b7",
      "name": "Note: Link Processing",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3216,
        3056
      ],
      "parameters": {
        "color": 2,
        "width": 556,
        "height": 224,
        "content": "## \ud83d\udd04 Link Processing Nodes\n\n**Purpose:** Remove Duplicate Links cleans list; Filter Real Hyperlinks keeps HTTPS; Separate Images and Links routes via regex.\n\n**Note:** Switch output 0: Images, 1: Content links; adjust regex for custom extensions."
      },
      "typeVersion": 1
    },
    {
      "id": "d69c0dc2-2c4c-474b-ba11-3d79e1390b12",
      "name": "Separate Images and Links",
      "type": "n8n-nodes-base.switch",
      "position": [
        2480,
        3680
      ],
      "parameters": {
        "rules": {
          "values": [
            {
              "outputKey": "Images",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "16724958-4eea-489d-b494-3d76a3ba2562",
                    "operator": {
                      "type": "string",
                      "operation": "regex"
                    },
                    "leftValue": "={{ $json.links }}",
                    "rightValue": "=^https?:\\/\\/.*\\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\\?.*)?$"
                  }
                ]
              },
              "renameOutput": true
            },
            {
              "outputKey": "Links",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "816392f0-96db-4134-8bee-4b74688ff929",
                    "operator": {
                      "type": "string",
                      "operation": "notRegex"
                    },
                    "leftValue": "={{ $json.links }}",
                    "rightValue": "=^https?:\\/\\/.*\\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\\?.*)?$"
                  }
                ]
              },
              "renameOutput": true
            }
          ]
        },
        "options": {}
      },
      "typeVersion": 3.2
    },
    {
      "id": "23896343-575e-4956-8e95-3b5e6e4c8ae7",
      "name": "Aggregate Images",
      "type": "n8n-nodes-base.aggregate",
      "position": [
        2736,
        3504
      ],
      "parameters": {
        "options": {},
        "fieldsToAggregate": {
          "fieldToAggregate": [
            {
              "fieldToAggregate": "links"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "fcad347b-60d7-4fa2-9b02-e96c2f27116d",
      "name": "Aggregate Links",
      "type": "n8n-nodes-base.aggregate",
      "position": [
        2736,
        3696
      ],
      "parameters": {
        "options": {},
        "fieldsToAggregate": {
          "fieldToAggregate": [
            {
              "fieldToAggregate": "links"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "fc5d6ce1-1765-4768-a9c7-de3677e8109d",
      "name": "Scrape Content Links",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        2736,
        3872
      ],
      "parameters": {
        "url": "={{ $json.links }}",
        "options": {}
      },
      "typeVersion": 4.2
    },
    {
      "id": "0d4b6a4e-b6cb-4e6c-9a22-bd0dc6a72027",
      "name": "Note: Content Scraping",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2320,
        3984
      ],
      "parameters": {
        "color": 5,
        "width": 428,
        "height": 224,
        "content": "## \ud83d\udcc4 Content Scraping & Aggregation Nodes\n\n**Purpose:** Scrape Content Links fetches pages; Convert to Markdown formats HTML; Aggregate Images/Links/Content combines outputs.\n\n**Note:** Markdown preserves structure for AI; slice content if exceeding sheet limits."
      },
      "typeVersion": 1
    },
    {
      "id": "349e5f7c-c81b-467b-a59b-ea40a47226f0",
      "name": "Convert to Markdown",
      "type": "n8n-nodes-base.markdown",
      "position": [
        2944,
        3872
      ],
      "parameters": {
        "html": "={{ $json.data }}",
        "options": {}
      },
      "typeVersion": 1
    },
    {
      "id": "24f22a31-03a3-4faf-81f4-3c38c0956ee4",
      "name": "Aggregate Scraped Content",
      "type": "n8n-nodes-base.aggregate",
      "position": [
        3136,
        3872
      ],
      "parameters": {
        "options": {},
        "fieldsToAggregate": {
          "fieldToAggregate": [
            {
              "fieldToAggregate": "data"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "a4d34aab-1af2-4196-85f5-1a2d832969dd",
      "name": "Add Images to Sheet",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2944,
        3504
      ],
      "parameters": {
        "columns": {
          "value": {
            "Images": "={{ $json.links.join('\\n\\n') }}",
            "Website": "={{ $('Set Website').item.json.website_url }}"
          },
          "schema": [
            {
              "id": "Website",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Website",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Links",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Links",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped Content",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Scraped Content",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Images",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Images",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "Website"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": "your-sheet-name",
        "documentId": "your-document-id"
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "6afbfad8-b80f-4a0d-81b4-9138cc2af46a",
      "name": "Add Links to Sheet",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2944,
        3696
      ],
      "parameters": {
        "columns": {
          "value": {
            "Links": "={{ $json.links.join('\\n\\n') }}",
            "Website": "={{ $('Set Website').item.json.website_url }}"
          },
          "schema": [
            {
              "id": "Website",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Website",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Links",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Links",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped Content",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Scraped Content",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Images",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Images",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "Website"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": "your-sheet-name",
        "documentId": "your-document-id"
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "35ae2c30-a93a-4fd2-82b6-07d2f4c56c88",
      "name": "Add Scraped Content to Sheet",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        3344,
        3872
      ],
      "parameters": {
        "columns": {
          "value": {
            "Website": "={{ $('Set Website').item.json.website_url }}",
            "Scraped Content": "={{ $json.data.join('\\n\\n').slice(0, 50000) }}"
          },
          "schema": [
            {
              "id": "Website",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Website",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Links",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Links",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped Content",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Scraped Content",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Images",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Images",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "Website"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": "your-sheet-name",
        "documentId": "your-document-id"
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "c3f7b022-db11-400c-baaa-77392acfb991",
      "name": "Note: Sheet Integration",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3232,
        4048
      ],
      "parameters": {
        "color": 3,
        "width": 444,
        "height": 176,
        "content": "## \ud83d\udcca Sheet Integration Nodes\n\n**Purpose:** Add Images/Links/Scraped Content to Sheet appends aggregated data to Google Sheets.\n\n**Note:** Matches on 'Website' column; update documentId/sheetName for your sheet."
      },
      "typeVersion": 1
    }
  ],
  "connections": {
    "Set Website": {
      "main": [
        [
          {
            "node": "Scrape Homepage",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Split Links": {
      "main": [
        [
          {
            "node": "Remove Duplicate Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Manual Trigger": {
      "main": [
        [
          {
            "node": "Set Website",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Aggregate Links": {
      "main": [
        [
          {
            "node": "Add Links to Sheet",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Scrape Homepage": {
      "main": [
        [
          {
            "node": "Extract Links from HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Aggregate Images": {
      "main": [
        [
          {
            "node": "Add Images to Sheet",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Convert to Markdown": {
      "main": [
        [
          {
            "node": "Aggregate Scraped Content",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Scrape Content Links": {
      "main": [
        [
          {
            "node": "Convert to Markdown",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Filter Real Hyperlinks": {
      "main": [
        [
          {
            "node": "Separate Images and Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Remove Duplicate Links": {
      "main": [
        [
          {
            "node": "Filter Real Hyperlinks",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract Links from HTML": {
      "main": [
        [
          {
            "node": "Split Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Aggregate Scraped Content": {
      "main": [
        [
          {
            "node": "Add Scraped Content to Sheet",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Separate Images and Links": {
      "main": [
        [
          {
            "node": "Aggregate Images",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Aggregate Links",
            "type": "main",
            "index": 0
          },
          {
            "node": "Scrape Content Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
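
The link-handling nodes in the JSON above (Extract Links from HTML, Remove Duplicate Links, Filter Real Hyperlinks, Separate Images and Links) can be approximated outside n8n as a rough sketch. The regular expression is copied from the switch node; the class and function names (`HrefCollector`, `classify_links`) and the sample HTML are illustrative assumptions, not part of the workflow.

```python
import re
from html.parser import HTMLParser

# Same pattern as the "Separate Images and Links" switch node.
IMAGE_URL = re.compile(r"^https?:\/\/.*\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\?.*)?$")


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def classify_links(html):
    """Return (images, content_links), mirroring the node order in the workflow."""
    collector = HrefCollector()
    collector.feed(html)
    # Remove Duplicate Links: keep the first occurrence of each href.
    unique = list(dict.fromkeys(collector.hrefs))
    # Filter Real Hyperlinks: only strings starting with "https://" pass.
    absolute = [url for url in unique if url.startswith("https://")]
    # Separate Images and Links: regex match goes to images, the rest to content.
    images = [url for url in absolute if IMAGE_URL.search(url)]
    content = [url for url in absolute if not IMAGE_URL.search(url)]
    return images, content


# Hypothetical homepage fragment used only for illustration.
sample = """
<a href="https://example.com/about">About</a>
<a href="/contact">Contact</a>
<a href="https://example.com/logo.png?v=2">Logo</a>
<a href="https://example.com/about">About again</a>
<a href="mailto:hi@example.com">Mail</a>
"""
images, content = classify_links(sample)
print(images)
print(content)
```

In this sketch the relative link `/contact` and the `mailto:` link are dropped, because the filter only keeps values that start with `https://`. If a target site uses mostly relative links, the filter node would likely need to resolve them against `website_url` first.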

Credentials you'll need

Each integration node will prompt for credentials when you import. We strip credential IDs before publishing — you'll add your own.
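
Before activating an imported copy, it may help to confirm that no placeholder values remain. The following is a minimal sketch, assuming the placeholder strings that appear in the JSON above (`your-document-id`, `your-sheet-name`, `<your credential>`, `https://example.com`); the helper name `find_placeholders` and the trimmed sample workflow are hypothetical.

```python
import json

# Placeholder strings as they appear in the published workflow JSON.
PLACEHOLDERS = {
    "your-document-id",
    "your-sheet-name",
    "<your credential>",
    "https://example.com",
}


def find_placeholders(workflow):
    """Return sorted (node name, placeholder) pairs still present in a workflow dict."""
    found = set()

    def walk(value, node_name):
        # Recurse through nested dicts and lists, checking every string value.
        if isinstance(value, dict):
            for child in value.values():
                walk(child, node_name)
        elif isinstance(value, list):
            for child in value:
                walk(child, node_name)
        elif isinstance(value, str) and value in PLACEHOLDERS:
            found.add((node_name, value))

    for node in workflow.get("nodes", []):
        walk(node, node.get("name", "?"))
    return sorted(found)


# Trimmed, illustrative excerpt of the workflow; a real check would load the full file.
workflow = json.loads("""
{"nodes": [
  {"name": "Set Website", "parameters": {"assignments": {"assignments": [
      {"name": "website_url", "type": "string", "value": "https://example.com"}]}}},
  {"name": "Add Links to Sheet",
   "parameters": {"sheetName": "your-sheet-name", "documentId": "your-document-id"},
   "credentials": {"googleSheetsOAuth2Api": {"name": "<your credential>"}}}
]}
""")
for node_name, value in find_placeholders(workflow):
    print(f"{node_name}: replace {value!r}")
```

An empty result only means the known placeholders are gone; it does not verify that the credential or spreadsheet actually works.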

About this workflow

Transform any website into a structured knowledge repository with this intelligent crawler that extracts hyperlinks from the homepage, intelligently filters images and content pages, and aggregates full Markdown-formatted content—perfect for fueling AI agents or building…
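
The aggregation step can be illustrated with a small sketch. In the workflow, each Google Sheets node joins its aggregated array with blank lines, the scraped content is cut to 50,000 characters, and rows are matched on the `Website` column via `appendOrUpdate`. The function names `build_cell` and `upsert_row` and the in-memory list of rows are assumptions made for illustration; they are not n8n or Google Sheets APIs.

```python
CELL_LIMIT = 50_000  # mirrors slice(0, 50000) in "Add Scraped Content to Sheet"


def build_cell(pages, limit=CELL_LIMIT):
    """Join per-page Markdown with blank lines and truncate to the limit."""
    return "\n\n".join(pages)[:limit]


def upsert_row(rows, website, column, value):
    """Approximate appendOrUpdate: update the row whose Website matches, else append."""
    for row in rows:
        if row.get("Website") == website:
            row[column] = value
            return rows
    rows.append({"Website": website, column: value})
    return rows


# Three separate writes, as in the workflow, end up in one row per website.
rows = []
site = "https://example.com"
upsert_row(rows, site, "Images", "https://example.com/logo.png")
upsert_row(rows, site, "Links", "https://example.com/about")
upsert_row(rows, site, "Scraped Content", build_cell(["# About", "# Contact"]))
print(rows)
```

Because the truncation is applied after joining, content from later pages is the first to be lost on large sites. Python slices by code point while the workflow's JavaScript `slice` counts UTF-16 code units, so the cut-off position can differ slightly for text containing emoji or other non-BMP characters.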

Source: https://n8n.io/workflows/9594/ (original creator credit).
