{
  "nodes": [
    {
      "id": "63d513b3-4054-456f-910b-cd4a765df79a",
      "name": "Sticky Note16",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1712,
        -960
      ],
      "parameters": {
        "width": 1024,
        "height": 400,
        "content": "![Logo Growth AI](https://cdn.prod.website-files.com/6825df5b20329ba581df4914/68d413c43f8729fa336568a6_Logo_horizontal.png)"
      },
      "typeVersion": 1
    },
    {
      "id": "15175f62-b287-4455-94ef-2f08914ff656",
      "name": "Sticky Note17",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1712,
        -528
      ],
      "parameters": {
        "color": 7,
        "width": 1024,
        "height": 240,
        "content": "## Need more advanced automation solutions? Contact us for custom enterprise workflows!\n\n# Growth-AI.fr\n\n## https://www.linkedin.com/in/allanvaccarizi/\n## https://www.linkedin.com/in/hugo-marinier-%F0%9F%A7%B2-6537b633/"
      },
      "typeVersion": 1
    },
    {
      "id": "1a314b56-840f-40a1-a148-065c789654cb",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        768,
        -512
      ],
      "parameters": {
        "width": 480,
        "height": 672,
        "content": "## Batch scraping\n\n### How it works\n\n1. A chat message containing the Google Sheet URL triggers the workflow.\n2. Rows are read from the 'Page to doc' sheet and filtered to keep those that have a URL and an empty 'Scraped' column.\n3. The remaining URLs are processed in a loop, one batch at a time, to avoid overloading the scraper.\n4. Each URL is scraped with Firecrawl, and a code node cleans the returned Markdown by stripping links, bare URLs, and skip-to-content text.\n5. The cleaned Markdown is saved to Google Drive, and the matching row in Google Sheets is marked 'OK' in the 'Scraped' column before the loop continues. If a scrape fails, the loop moves on to the next URL.\n\n### Setup steps\n\n- [ ] Configure Google Sheets credentials and make sure the spreadsheet has a sheet named 'Page to doc' with the URLs to scrape.\n- [ ] Configure Google Drive credentials and specify the destination folder for the Markdown files.\n- [ ] Add your Firecrawl API credentials to the 'Scrape URL with Firecrawl' node.\n- [ ] Ensure the sheet has 'URL' and 'Scraped' columns ('Scraped' is written by 'Update Scraped Status in Sheets').\n- [ ] Set the desired batch size in the 'Loop Over URLs' node to control throughput.\n\n### Customization\n\nAdjust the batch size to control scraping speed and API usage. Modify the 'Transform Scraped Content' code node to reformat, clean, or enrich the scraped Markdown before saving."
      },
      "typeVersion": 1
    },
    {
      "id": "04d083f5-ce2f-4cf8-b949-6df65b9424dc",
      "name": "Sticky Note1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1312,
        -224
      ],
      "parameters": {
        "color": 7,
        "width": 640,
        "height": 320,
        "content": "## Trigger and fetch URLs\n\nThe workflow starts when a chat message containing the Google Sheet URL is received. Rows are then read from that sheet and filtered to keep only those with a URL that has not been scraped yet."
      },
      "typeVersion": 1
    },
    {
      "id": "7799c5f1-49a2-4c03-a2df-963ee04d4ea1",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2032,
        -240
      ],
      "parameters": {
        "color": 7,
        "width": 464,
        "height": 352,
        "content": "## Batch loop and scrape\n\nValid URLs are fed into a batch loop that iterates over each item. Each URL is scraped using Firecrawl, and the loop cycles until all URLs are processed. Failed scrapes are routed straight back to the loop so the remaining URLs still run."
      },
      "typeVersion": 1
    },
    {
      "id": "a301031e-e3a9-4dfa-96db-73c2fd8cdea9",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2576,
        -224
      ],
      "parameters": {
        "color": 7,
        "width": 688,
        "height": 320,
        "content": "## Process, save, and update status\n\nThe scraped Markdown is cleaned by a code node (links, bare URLs, and skip-to-content text are stripped) and saved as a file in Google Drive. The source row in Google Sheets is then marked as scraped before the loop resumes."
      },
      "typeVersion": 1
    },
    {
      "id": "096b6879-d342-41d0-a9e4-ec7c0e0cc777",
      "name": "When Chat Message Received",
      "type": "@n8n/n8n-nodes-langchain.chatTrigger",
      "position": [
        1360,
        -64
      ],
      "parameters": {
        "mode": "webhook",
        "public": true,
        "options": {
          "responseMode": "responseNode"
        }
      },
      "typeVersion": 1.1
    },
    {
      "id": "4d20bae7-5cf5-4f44-bb3f-9195c85809f6",
      "name": "Scrape URL with Firecrawl",
      "type": "@mendable/n8n-nodes-firecrawl.firecrawl",
      "onError": "continueErrorOutput",
      "position": [
        2352,
        -48
      ],
      "parameters": {
        "url": "={{ $json.URL }}",
        "operation": "scrape",
        "requestOptions": {}
      },
      "credentials": {
        "firecrawlApi": {
          "name": "<your credential>"
        }
      },
      "retryOnFail": true,
      "typeVersion": 1
    },
    {
      "id": "3fa1b84b-2cb4-4184-9e81-be306cb224b0",
      "name": "Read URLs from Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        1568,
        -64
      ],
      "parameters": {
        "options": {},
        "sheetName": {
          "__rl": true,
          "mode": "name",
          "value": "Page to doc"
        },
        "documentId": {
          "__rl": true,
          "mode": "url",
          "value": "={{ $json.chatInput }}"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.5
    },
    {
      "id": "799f9382-38a0-409f-8ab7-d499b35a6931",
      "name": "Filter Non-Empty Rows",
      "type": "n8n-nodes-base.filter",
      "position": [
        1808,
        -64
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "48acd975-5041-455b-8e47-3b7eef32b483",
              "operator": {
                "type": "string",
                "operation": "exists",
                "singleValue": true
              },
              "leftValue": "={{ $json.URL }}",
              "rightValue": ""
            },
            {
              "id": "3d28d877-11fb-455d-b328-572c8492ea03",
              "operator": {
                "type": "string",
                "operation": "empty",
                "singleValue": true
              },
              "leftValue": "={{ $json.Scraped }}",
              "rightValue": ""
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "3d34488e-0eaf-4695-89dd-3c1ed61f67bd",
      "name": "Save Markdown to Google Drive",
      "type": "n8n-nodes-base.googleDrive",
      "position": [
        2896,
        -64
      ],
      "parameters": {
        "name": "={{ $('Scrape URL with Firecrawl').item.json.data.metadata.url }}",
        "content": "={{ $json.markdown_clean }}",
        "driveId": {
          "__rl": true,
          "mode": "list",
          "value": "0ADUfRaRT2rWIUk9PVA",
          "cachedResultUrl": "https://drive.google.com/drive/folders/0ADUfRaRT2rWIUk9PVA",
          "cachedResultName": "Growth AI"
        },
        "options": {},
        "folderId": {
          "__rl": true,
          "mode": "list",
          "value": "18HHNuVxjYGKv3YHnzIrBxwr_a5Sn1B9_",
          "cachedResultUrl": "https://drive.google.com/drive/folders/18HHNuVxjYGKv3YHnzIrBxwr_a5Sn1B9_",
          "cachedResultName": "Contenu scrap\u00e9"
        },
        "operation": "createFromText"
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 3
    },
    {
      "id": "dbdced1c-ba11-4216-a851-5b3b74c0dd20",
      "name": "Update Scraped Status in Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        3120,
        -64
      ],
      "parameters": {
        "columns": {
          "value": {
            "URL": "={{ $('Loop Over URLs').item.json.URL }}",
            "Scraped": "OK"
          },
          "schema": [
            {
              "id": "URL",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "URL",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Scraped",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "row_number",
              "type": "number",
              "display": true,
              "removed": true,
              "readOnly": true,
              "required": false,
              "displayName": "row_number",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "URL"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "update",
        "sheetName": {
          "__rl": true,
          "mode": "name",
          "value": "Page to doc"
        },
        "documentId": {
          "__rl": true,
          "mode": "url",
          "value": "={{ $('When Chat Message Received').item.json.chatInput }}"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "name": "<your credential>"
        }
      },
      "typeVersion": 4.6,
      "alwaysOutputData": true
    },
    {
      "id": "e0833704-4d0a-4afc-90ea-1e8740cdbc0e",
      "name": "Transform Scraped Content",
      "type": "n8n-nodes-base.code",
      "onError": "continueRegularOutput",
      "position": [
        2624,
        -64
      ],
      "parameters": {
        "jsCode": "// Code for the n8n \"Code\" node\n// Cleans the Markdown by removing links, URLs, and unwanted text\n\n// Read the Markdown from the input item\nconst markdown = $input.item.json.data.markdown;\n\n// Helper that cleans the Markdown\nfunction cleanMarkdown(text) {\n  if (!text) return '';\n\n  let cleaned = text;\n\n  // 1. Remove the French skip links \"Passer au contenu principal\" and \"Aller au contenu\" (case-insensitive)\n  cleaned = cleaned.replace(/passer au contenu principal/gi, '');\n  cleaned = cleaned.replace(/aller au contenu/gi, '');\n\n  // 2. Convert Markdown links [text](url) to plain text\n  // Keeps the text between [], drops the [] and the (url)\n  cleaned = cleaned.replace(/\\[([^\\]]+)\\]\\([^\\)]+\\)/g, '$1');\n\n  // 3. Remove remaining brackets [] and keep their content\n  cleaned = cleaned.replace(/\\[([^\\]]+)\\]/g, '$1');\n\n  // 4. Remove standalone URLs (http://, https://, www.)\n  cleaned = cleaned.replace(/https?:\\/\\/[^\\s)]+/g, '');\n  cleaned = cleaned.replace(/www\\.[^\\s)]+/g, '');\n\n  // 5. Remove parentheses that still contain URL remnants\n  cleaned = cleaned.replace(/\\([^)]*(?:http|www)[^)]*\\)/g, '');\n\n  // 6. Collapse the multiple spaces left by the removals\n  cleaned = cleaned.replace(/  +/g, ' ');\n\n  // 7. Collapse runs of blank lines\n  cleaned = cleaned.replace(/\\n{3,}/g, '\\n\\n');\n\n  // 8. Trim whitespace at the start and end of each line\n  cleaned = cleaned.split('\\n').map(line => line.trim()).join('\\n');\n\n  // 9. Trim whitespace at the start and end of the text\n  cleaned = cleaned.trim();\n\n  return cleaned;\n}\n\n// Apply the cleaning\nconst cleanedMarkdown = cleanMarkdown(markdown);\n\n// IMPORTANT: return an ARRAY containing the item\n// This preserves the pairing with the preceding items\nreturn [{\n  json: {\n    markdown_clean: cleanedMarkdown,\n  }\n}];"
      },
      "typeVersion": 2,
      "alwaysOutputData": false
    },
    {
      "id": "5c6cc015-69bc-4471-bea7-d6d81097729f",
      "name": "Loop Over URLs",
      "type": "n8n-nodes-base.splitInBatches",
      "position": [
        2080,
        -64
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 3
    },
    {
      "id": "52c112a0-0e92-4c56-8d24-d8eb6ece74b6",
      "name": "Sticky Note18",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1664,
        208
      ],
      "parameters": {
        "color": 4,
        "width": 1120,
        "height": 144,
        "content": "# Google Sheets template \n\n## https://docs.google.com/spreadsheets/d/1vgNAV6P3cvBtTUax1rKrhzCLmEBAQu5sgeUxbU5_--0"
      },
      "typeVersion": 1
    }
  ],
  "connections": {
    "Loop Over URLs": {
      "main": [
        [],
        [
          {
            "node": "Scrape URL with Firecrawl",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Filter Non-Empty Rows": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Read URLs from Sheets": {
      "main": [
        [
          {
            "node": "Filter Non-Empty Rows",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Scrape URL with Firecrawl": {
      "main": [
        [
          {
            "node": "Transform Scraped Content",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Transform Scraped Content": {
      "main": [
        [
          {
            "node": "Save Markdown to Google Drive",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "When Chat Message Received": {
      "main": [
        [
          {
            "node": "Read URLs from Sheets",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Save Markdown to Google Drive": {
      "main": [
        [
          {
            "node": "Update Scraped Status in Sheets",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Update Scraped Status in Sheets": {
      "main": [
        [
          {
            "node": "Loop Over URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}