How AutomationFlows finds, deduplicates, scores, and publishes the n8n workflow catalog. Written so you can reproduce it, audit it, or push back on it.
Where workflows come from
The miner runs every 12 hours and pulls from ten sources of public, openly shared n8n content:
- GitHub Topics — repository search by topic:n8n* with deep tree-walk into each repo
- GitHub code search — 117 query variants × 3 sort orders covering integration names (Slack, OpenAI, Stripe…), trigger types (webhook, cron, chat…), LangChain primitives, and 20 native-language content words (Chinese, Japanese, Korean, Spanish, Portuguese, German, French, Italian, Russian, Polish, Turkish, Dutch, Vietnamese, Indonesian, Thai, Arabic, Hindi, Czech, Hebrew, Romanian)
- "Awesome" curated lists — README scrapes of restyler/awesome-n8n, Zie619/n8n-workflows, enescingoz/awesome-n8n-templates, n8n-io/awesome-n8n
- GitLab — project search + tree walk
- Codeberg — Gitea API project search + tree walk
- Reddit — r/n8n, r/automation, r/selfhosted JSON listings; raw GitHub URLs harvested from posts
- dev.to — articles tagged n8n, automation, lowcode; raw GitHub URLs harvested from article HTML
- n8n community forum — latest.json topic feed + per-topic post HTML
- Web search — DDG HTML scrape over 78 queries, including Asian and Russian dev-platform dorks (Qiita, Zenn, Juejin, CSDN, Tistory, Velog, Habr)
- User submissions — Netlify form at /#submit for any public GitHub repo URL
State persists at ~/.automationflows_miner_state.json, so each run is incremental: already-indexed workflows aren't re-fetched. Author fan-out and LLM-generated query expansion (Haiku, persisted across runs) extend the search frontier on each pass.
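A minimal sketch of how an incremental run could use that state file. Only the file path comes from the text above; the JSON layout (a "seen" list of URLs) and the helper names load_seen, save_seen, and filter_new are assumptions made for illustration, not the miner's actual schema.

```python
import json
from pathlib import Path

# Path is from the text above; the {"seen": [...]} layout is an assumption.
STATE_PATH = Path.home() / ".automationflows_miner_state.json"


def load_seen(path=STATE_PATH):
    """Return the set of already-indexed workflow URLs (empty on a first run)."""
    if not path.exists():
        return set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return set(data.get("seen", []))


def save_seen(seen, path=STATE_PATH):
    """Persist the seen set so the next run can skip these URLs."""
    path.write_text(json.dumps({"seen": sorted(seen)}), encoding="utf-8")


def filter_new(candidates, seen):
    """Keep only URLs not indexed before, preserving order; updates `seen` in place."""
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A run would call load_seen() once, pass every discovered URL through filter_new(), fetch only what comes back, and finish with save_seen().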
Privacy strip — applied to every workflow before publish
Workflow JSONs in upstream archives often contain real test data, credential UUIDs, and active webhook URLs. We remove all of it before any workflow is published:
- pinData — test-run inputs/outputs may contain real API responses (PII, tokens, payment data)
- meta.instanceId — fingerprint of the original n8n install
- webhookId on trigger nodes — would leak the upstream owner's active webhook URL
- credentials.{type}.id — credential UUIDs scoped to the original instance
- Any node parameter matching the secret regex — sk-, sk_live_, Bearer …
Implemented in scripts/automation_miner.py::strip_privacy(). Not bypassable. If you find a workflow that still leaks something, flag it and we'll re-strip + republish.
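The removals listed above can be sketched roughly as follows. This is not the code in strip_privacy(): the function name strip_privacy_sketch, the "[REDACTED]" placeholder, and the regex are assumptions, since the page only names the prefixes the real pattern matches.

```python
import copy
import re

# Assumed pattern built from the prefixes named above; the real regex is not shown.
SECRET_RE = re.compile(r"\b(?:sk-|sk_live_|Bearer\s+)\S+")


def _redact(value):
    """Recursively replace secret-looking substrings inside node parameters."""
    if isinstance(value, str):
        return SECRET_RE.sub("[REDACTED]", value)
    if isinstance(value, list):
        return [_redact(v) for v in value]
    if isinstance(value, dict):
        return {k: _redact(v) for k, v in value.items()}
    return value


def strip_privacy_sketch(workflow):
    """Return a copy of the workflow with the listed fields removed."""
    wf = copy.deepcopy(workflow)  # never mutate the fetched original
    wf.pop("pinData", None)
    if isinstance(wf.get("meta"), dict):
        wf["meta"].pop("instanceId", None)
    for node in wf.get("nodes", []):
        node.pop("webhookId", None)
        for cred in (node.get("credentials") or {}).values():
            if isinstance(cred, dict):
                cred.pop("id", None)  # drop the UUID, keep the display name
        if "parameters" in node:
            node["parameters"] = _redact(node["parameters"])
    return wf
```

Working on a deep copy keeps the upstream JSON available for hashing or debugging while guaranteeing the published copy is the stripped one.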
Deduplication
The same workflow often shows up in three places: the upstream Zie619 archive, a forked GitHub repo, and a forum post linking the raw URL. We deduplicate on the SHA-256 hash of the canonical workflow content (nodes + connections + name, with privacy fields stripped first). Identical hashes collapse to one record; the first-seen source_repo stays as the attribution.
Near-duplicates with minor parameter changes hash differently and are kept as separate entries; the workflow graph itself is the unit of identity.
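A sketch of content-hash deduplication as described above, assuming the input workflows have already been privacy-stripped. The canonical serialisation (sorted keys, compact separators) and the names content_hash and deduplicate are assumptions; the page does not say how the miner canonicalises JSON.

```python
import hashlib
import json


def content_hash(workflow):
    """SHA-256 over nodes + connections + name, serialised deterministically."""
    canonical = {
        "nodes": workflow.get("nodes", []),
        "connections": workflow.get("connections", {}),
        "name": workflow.get("name", ""),
    }
    # Sorted keys and fixed separators are an assumed canonical form, so that
    # key order and whitespace in the source file do not change the hash.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(workflows):
    """Collapse identical hashes; the first-seen record (and its source_repo) wins."""
    unique = {}
    for wf in workflows:
        unique.setdefault(content_hash(wf), wf)
    return list(unique.values())
```

Because source_repo is left out of the hashed payload, the same workflow found in a fork and in a forum post collapses to one record, while any change inside a node's parameters yields a new hash and a separate entry.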
Tagging
Every workflow gets normalised metadata derived from its node graph:
- integrations — extracted from node types (n8n-nodes-base.slack → slack, @n8n/n8n-nodes-langchain.lmChatOpenAi → lmChatOpenAi)
- trigger_type — webhook / cron / manual / event / poll / chat (one per workflow)
- has_ai — true if any LangChain / OpenAI / Anthropic / HuggingFace node is present
- complexity_score — 1-5, derived from node count and branching depth
- category — one of 13 use-case buckets (AI & RAG, marketing, email, social, messaging, data, scraping, ecommerce, CRM, devops, finance, content, general)
- subcategory — finer-grained bucket within each category; a subcategory page is generated only when ≥5 workflows share it
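As an illustration of the first and third fields, the sketch below derives integrations and has_ai from node types. The suffix rule reproduces the two examples given above; the AI_MARKERS substring list and the function names are assumptions, and a real tagger would presumably also filter out utility nodes, which is not shown here.

```python
# Assumed markers for "LangChain / OpenAI / Anthropic / HuggingFace" nodes.
AI_MARKERS = ("langchain", "openai", "anthropic", "huggingface")


def integration_from_type(node_type):
    """Take the part after the last dot: 'n8n-nodes-base.slack' -> 'slack'."""
    return node_type.rsplit(".", 1)[-1]


def tag_workflow(workflow):
    """Derive integrations and has_ai from the node graph alone."""
    types = [node.get("type", "") for node in workflow.get("nodes", [])]
    integrations = sorted({integration_from_type(t) for t in types if t})
    has_ai = any(marker in t.lower() for t in types for marker in AI_MARKERS)
    return {"integrations": integrations, "has_ai": has_ai}
```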
The Pro QualityScore
Pro / Lifetime users see a multi-signal QualityScore (0-100) on every workflow. The current formula:
quality_pro = rescale(0.80 × structural + 0.20 × metadata, [10, 75] → [50, 100])
Structural (0-100) blends six axes scored from the node graph alone:
- Graph topology (0-25) — node count, branching depth, connectivity
- Branching / error discipline (0-20) — IF/SWITCH usage, error trigger nodes, retry config
- Documentation density (0-20) — sticky-note count + per-node notes field coverage
- Naming discipline (0-10) — penalises auto-generated node names like "Set 1", "HTTP Request 4"
- Parameter completeness (0-15) — checks that node parameters are configured, not left at default
- Metadata depth (0-10) — workflow-level name, tags, versionId
Metadata (0-100) is a smaller signal covering schema correctness and integration breadth, blended at 20%.
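The blend and rescale in the formula above can be sketched as below. The weights and both ranges come from the formula; treating rescale as a linear map that clamps inputs outside [10, 75] is an assumption, since the page does not say what happens to out-of-range values.

```python
def rescale(x, src=(10.0, 75.0), dst=(50.0, 100.0)):
    """Linearly map x from the src range onto the dst range."""
    lo, hi = src
    out_lo, out_hi = dst
    clamped = min(max(x, lo), hi)  # clamping is an assumption, not stated in the text
    return out_lo + (clamped - lo) * (out_hi - out_lo) / (hi - lo)


def quality_pro(structural, metadata):
    """Blend the two 0-100 sub-scores at 80/20, then rescale to 50-100."""
    raw = 0.80 * structural + 0.20 * metadata
    return rescale(raw)
```

For example, structural = 60 and metadata = 40 give a raw blend of 56, which rescales to roughly 85.4. Under the clamping assumption, any raw blend at or below 10 is shown as 50 and any at or above 75 as 100.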
Validation: we hand-rated 20 stratified samples and compared the ratings against this scoring; Spearman ρ = +0.941, Pearson r = +0.950. The harness at scripts/run_quality_spike.py re-runs after any weight change.
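For readers who want to repeat this kind of check on their own ratings, here is a dependency-free sketch of the two correlation measures. It is not the contents of scripts/run_quality_spike.py; the function names are assumptions, and Spearman is computed the standard way, as Pearson correlation over average ranks.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson r for two equal-length sequences with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Passing the hand ratings as xs and the computed scores as ys yields both figures; Spearman only asks whether the ordering agrees, which is why it is the more forgiving check when the two scales differ.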
Popularity signals (GitHub stars, recency, install / uninstall behaviour) are NOT in the blend. They originally were; we removed them because they're popularity proxies, not quality measures, and they unfairly penalise non-GitHub workflows (Codeberg, GitLab, gists, n8n.io templates, dev.to embeds). A content-only formula scores all sources identically.
What we don't republish
If a workflow's meta.templateId matches a template on n8n.io's official template gallery, the AutomationFlows page links to n8n.io with rel="canonical" — we surface the template for discovery but don't compete with the canonical home.
Update cadence
The miner runs at 04:00 and 16:00 local time daily via launchd. Each run discovers new workflows, regenerates the sitemap, refreshes the Atom feed, and pushes the diff to GitHub; Netlify auto-deploys. The "What's new" page at /recent/ shows the 100 most recent additions.
Take-down
Every workflow links back to its original source_repo with full attribution. If you're the original creator and want a workflow removed, email acreatorstore@translatea.com with the URL — same-day removal is the standard.
Reproducibility
The miner code is in scripts/automation_miner.py; the page generator is scripts/generate_pages.py; the QualityScore implementation is scripts/workflow_quality.py with unit tests in scripts/test_workflow_quality.py. The full data feed is at /data/workflows.json and per-workflow stripped JSONs at /data/workflow_jsons/ — both CDN-served, both publicly readable.