Methodology — how the AutomationFlows catalog is built

How AutomationFlows finds, deduplicates, scores, and publishes the n8n workflow catalog. Written so you can reproduce it, audit it, or push back on it.

Where workflows come from

The miner runs every 12 hours and pulls from ten sources of public, openly shared n8n content:

GitHub Topics — repository search by topic:n8n* with deep tree-walk into each repo
GitHub code search — 117 query variants × 3 sort orders covering integration names (Slack, OpenAI, Stripe…), trigger types (webhook, cron, chat…), LangChain primitives, and content words in 20 languages (Chinese, Japanese, Korean, Spanish, Portuguese, German, French, Italian, Russian, Polish, Turkish, Dutch, Vietnamese, Indonesian, Thai, Arabic, Hindi, Czech, Hebrew, Romanian)
"Awesome" curated lists — README scrapes of restyler/awesome-n8n, Zie619/n8n-workflows, enescingoz/awesome-n8n-templates, n8n-io/awesome-n8n
GitLab — project search + tree walk
Codeberg — Gitea API project search + tree walk
Reddit — r/n8n, r/automation, r/selfhosted JSON listings; raw GitHub URLs harvested from posts
dev.to — articles tagged n8n, automation, lowcode; raw GitHub URLs harvested from article HTML
n8n community forum — latest.json topic feed + per-topic post HTML
Web search — DDG HTML scrape over 78 queries, including Asian and Russian dev-platform dorks (Qiita, Zenn, Juejin, CSDN, Tistory, Velog, Habr)
User submissions — Netlify form at /#submit for any public GitHub repo URL

State persists at ~/.automationflows_miner_state.json, so the twice-daily runs are incremental: already-indexed workflows aren't re-fetched. Author fan-out and LLM-generated query expansion (Haiku, persisted across runs) extend the search frontier each pass.
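
A minimal sketch of the incremental idea, not the miner's actual code. It assumes the state file is a JSON object holding a list of already-indexed URLs under a hypothetical key, seen_urls; the real schema is not documented on this page, and the function names are illustrative.

```python
import json
from pathlib import Path


def load_seen(state_path: Path) -> set[str]:
    """Return the URLs indexed by earlier runs (empty set on the first run)."""
    if not state_path.exists():
        return set()
    data = json.loads(state_path.read_text(encoding="utf-8"))
    return set(data.get("seen_urls", []))  # "seen_urls" is a hypothetical key


def save_seen(state_path: Path, seen: set[str]) -> None:
    """Persist the seen set; sorted so the file is stable between runs."""
    payload = {"seen_urls": sorted(seen)}
    state_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def filter_new(candidates: list[str], seen: set[str]) -> list[str]:
    """Keep only URLs that no earlier run (or earlier candidate) has indexed."""
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)  # mark immediately so in-run repeats are skipped too
            fresh.append(url)
    return fresh
```

A run would call load_seen, pass each source's candidate URLs through filter_new, fetch only what comes back, then call save_seen.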

Privacy strip — applied to every workflow before publish

Workflow JSONs in upstream archives often contain real test data, credential UUIDs, and active webhook URLs. We remove all of it before any workflow is published:

pinData — test-run inputs/outputs may contain real API responses (PII, tokens, payment data)
meta.instanceId — fingerprint of the original n8n install
webhookId on trigger nodes — would leak the upstream owner's active webhook URL
credentials.{type}.id — credential UUIDs scoped to the original instance
Any node parameter matching the secret regex — sk-, sk_live_, Bearer …

Implemented in scripts/automation_miner.py::strip_privacy(). Not bypassable. If you find a workflow that still leaks something, flag it and we'll re-strip + republish.
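
For readers auditing the strip, here is a hedged sketch of what such a pass could look like. It is not the code in strip_privacy(): the function name strip_privacy_sketch, the regex, and the choice to redact matching substrings (rather than drop the whole parameter) are assumptions made for illustration.

```python
import copy
import re

# Illustrative pattern only; the real secret regex is not reproduced here.
SECRET_RE = re.compile(r"sk-[A-Za-z0-9_-]+|sk_live_[A-Za-z0-9]+|Bearer\s+\S+")


def _scrub(value):
    """Recursively redact secret-looking substrings inside node parameters."""
    if isinstance(value, str):
        return SECRET_RE.sub("[REDACTED]", value)
    if isinstance(value, list):
        return [_scrub(v) for v in value]
    if isinstance(value, dict):
        return {k: _scrub(v) for k, v in value.items()}
    return value


def strip_privacy_sketch(workflow: dict) -> dict:
    """Return a copy of the workflow with the fields listed above removed."""
    wf = copy.deepcopy(workflow)  # never mutate the fetched original
    wf.pop("pinData", None)
    if isinstance(wf.get("meta"), dict):
        wf["meta"].pop("instanceId", None)
    for node in wf.get("nodes", []):
        node.pop("webhookId", None)
        for cred in (node.get("credentials") or {}).values():
            if isinstance(cred, dict):
                cred.pop("id", None)  # keep the credential name, drop the UUID
        if "parameters" in node:
            node["parameters"] = _scrub(node["parameters"])
    return wf
```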

Deduplication

The same workflow often shows up in three places: the upstream Zie619 archive, a forked GitHub repo, and a forum post linking the raw URL. We deduplicate on the SHA-256 hash of the canonical workflow content (nodes + connections + name, with privacy fields stripped first). Identical hashes collapse to one record; the first-seen source_repo stays as the attribution.

Near-duplicates with minor parameter changes are kept as separate entries — the workflow graph itself is the unit of identity.
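
The hashing step can be sketched as follows. This is an illustration under stated assumptions, not the miner's code: it assumes the privacy strip has already run, that canonical serialisation means JSON with sorted keys and compact separators, and that each record is a dict with hypothetical keys workflow and source_repo.

```python
import hashlib
import json


def canonical_hash(workflow: dict) -> str:
    """SHA-256 over nodes + connections + name, serialised deterministically."""
    canonical = {
        "nodes": workflow.get("nodes", []),
        "connections": workflow.get("connections", {}),
        "name": workflow.get("name", ""),
    }
    # sort_keys makes the digest independent of key order in the source file
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def dedupe(records: list[dict]) -> list[dict]:
    """Collapse identical hashes; the first-seen record keeps the attribution."""
    first_seen: dict[str, dict] = {}
    for rec in records:
        first_seen.setdefault(canonical_hash(rec["workflow"]), rec)
    return list(first_seen.values())
```

In this sketch any changed parameter value produces a different digest, which matches the rule that near-duplicates stay separate. Node order inside the nodes list also affects the digest here; whether the real canonical form sorts nodes is not stated.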

Tagging

Every workflow gets normalised metadata derived from its node graph:

integrations — extracted from node types (n8n-nodes-base.slack → slack, @n8n/n8n-nodes-langchain.lmChatOpenAi → lmChatOpenAi)
trigger_type — webhook / cron / manual / event / poll / chat (one per workflow)
has_ai — true if any LangChain / OpenAI / Anthropic / HuggingFace node is present
complexity_score — 1-5, derived from node count and branching depth
category — one of 13 use-case buckets (AI & RAG, marketing, email, social, messaging, data, scraping, ecommerce, CRM, devops, finance, content, general)
subcategory — finer-grained bucket within each category; a subcategory page is emitted only when ≥5 workflows share it
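
As an illustration of the first and third tags, the sketch below derives integrations and has_ai from node types. The two example mappings come from the list above; the rule "take the segment after the last dot" and the AI_MARKERS substrings are assumptions that happen to reproduce those examples, not a description of the production tagger.

```python
AI_MARKERS = ("langchain", "openai", "anthropic", "huggingface")  # assumed substrings


def integration_name(node_type: str) -> str:
    """Take the segment after the last dot: 'n8n-nodes-base.slack' -> 'slack'."""
    return node_type.rsplit(".", 1)[-1]


def extract_integrations(workflow: dict) -> list[str]:
    """Unique integration names across all nodes, sorted for stable output."""
    names = {
        integration_name(node["type"])
        for node in workflow.get("nodes", [])
        if node.get("type")
    }
    return sorted(names)


def has_ai(workflow: dict) -> bool:
    """True if any node type mentions one of the assumed AI markers."""
    for node in workflow.get("nodes", []):
        node_type = node.get("type", "").lower()
        if any(marker in node_type for marker in AI_MARKERS):
            return True
    return False
```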

The Pro QualityScore

Pro / Lifetime users see a multi-signal QualityScore (0-100) on every workflow. The current formula:

quality_pro = rescale(0.80 × structural + 0.20 × metadata, [10, 75] → [50, 100])

Structural (0-100) blends six axes scored from the node graph alone:

Graph topology (0-25) — node count, branching depth, connectivity
Branching / error discipline (0-20) — IF/SWITCH usage, error trigger nodes, retry config
Documentation density (0-20) — sticky-note count + per-node notes field coverage
Naming discipline (0-10) — penalises auto-generated node names like "Set 1", "HTTP Request 4"
Parameter completeness (0-15) — checks that node parameters are configured, not left at default
Metadata depth (0-10) — workflow-level name, tags, versionId

Metadata (0-100) is a smaller signal covering schema correctness and integration breadth, blended at 20%.
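
The blend and rescale can be sketched in a few lines. The weights and the [10, 75] → [50, 100] ranges are taken from the formula above; reading rescale as a linear map and clamping out-of-range blends to the source interval are assumptions, since the page does not say how values below 10 or above 75 are handled.

```python
def rescale(value: float, src=(10.0, 75.0), dst=(50.0, 100.0)) -> float:
    """Linearly map value from the src interval onto the dst interval."""
    lo, hi = src
    out_lo, out_hi = dst
    clamped = min(max(value, lo), hi)  # assumption: out-of-range blends are clamped
    return out_lo + (clamped - lo) * (out_hi - out_lo) / (hi - lo)


def quality_pro(structural: float, metadata: float) -> float:
    """Blend the two 0-100 sub-scores at 80/20, then rescale."""
    blend = 0.80 * structural + 0.20 * metadata
    return rescale(blend)
```

Under these assumptions a raw blend of 10 or lower maps to 50, a blend of 75 or higher maps to 100, and the midpoint 42.5 maps to 75.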

Validation: we hand-rated 20 stratified samples and compared the ratings with the computed scores; Spearman ρ = +0.941, Pearson r = +0.950. The harness at scripts/run_quality_spike.py re-runs after any weight change.
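
To reproduce the two coefficients from your own hand ratings, a dependency-free sketch follows. It is not the harness itself; the function names are illustrative, and Spearman is computed the standard way, as Pearson correlation over average ranks.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j across the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson r; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Pass the hand ratings as one list and the computed scores as the other, in the same workflow order.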

Popularity signals (GitHub stars, recency, install / uninstall behaviour) are NOT in the blend. They were originally, but we removed them because they measure popularity, not quality, and they unfairly penalise non-GitHub workflows (Codeberg, GitLab, gists, n8n.io templates, dev.to embeds). The content-only formula applies the same scoring to every source.

What we don't republish

If a workflow's meta.templateId matches a template on n8n.io's official template gallery, the AutomationFlows page links to n8n.io with rel="canonical" — we surface the template for discovery but don't compete with the canonical home.

Update cadence

The miner runs at 04:00 and 16:00 local time daily via launchd. Each run discovers new workflows, regenerates the sitemap, refreshes the Atom feed, and pushes the diff to GitHub; Netlify auto-deploys. The "What's new" page at /recent/ shows the 100 most recent additions.

Take-down

Every workflow links back to its original source_repo with full attribution. If you're the original creator and want a workflow removed, email acreatorstore@translatea.com with the URL — same-day removal is the standard.

Reproducibility

The miner code is in scripts/automation_miner.py; the page generator is scripts/generate_pages.py; the QualityScore implementation is scripts/workflow_quality.py with unit tests in scripts/test_workflow_quality.py. The full data feed is at /data/workflows.json and per-workflow stripped JSONs at /data/workflow_jsons/ — both CDN-served, both publicly readable.
