Epistemic Gap Finder
A conceptual cartography instrument. Feed it a corpus of descriptions from any categorisable domain and it maps the semantic space those concepts occupy, identifies the low-density regions — the deserts — and generates ranked candidate descriptions for what could inhabit those gaps.
Epistemic Gap Finder on GitHub: https://github.com/datasculptures/epistemic-gap-finder
What it does
EGF is not a search engine and not a recommendation system. It is a strategic positioning instrument for anyone who wants to know what is structurally absent from a space before committing to a direction. The pipeline runs in six stages:
Embed
Embeds a directory of plain-text or Markdown descriptions using a local
all-MiniLM-L6-v2 sentence-transformer model (~90 MB, downloaded
once then fully offline). Produces 384-dimensional vectors — one per file.
Reduce
Reduces to 2D and 3D using UMAP. Assesses topology preservation with trustworthiness, continuity, and LCMC metrics. Warns if the 2D layout is unreliable.
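The trustworthiness check described above can be sketched with scikit-learn's built-in metric. This is an illustrative stand-in, not EGF's actual code: the random matrix plays the role of the sentence embeddings, and PCA substitutes for UMAP purely to keep the sketch dependency-light; the 0.75 floor matches the report's default warning threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 384)).astype(np.float32)  # stand-in for 384-dim embeddings

# EGF uses UMAP for the reduction; PCA here is just a lightweight stand-in.
X2 = PCA(n_components=2, random_state=0).fit_transform(X)

# Trustworthiness asks: are 2D neighbours also high-dimensional neighbours?
# 1.0 is perfect preservation.
score = trustworthiness(X, X2, n_neighbors=5)
print(f"trustworthiness: {score:.3f}")
if score < 0.75:  # default warning floor
    print("warning: 2D layout may be unreliable")
```

Continuity is the symmetric question (are high-dimensional neighbours also 2D neighbours?); scikit-learn ships only trustworthiness, so the other two scores would need a separate implementation.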
Density
Estimates density across the 2D space using k-NN radius density, then smooths the surface to separate genuine sparse regions from edge artefacts.
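A k-NN radius density estimate of the kind described can be sketched in a few lines: each point's density is k divided by the area of the disc reaching its k-th nearest neighbour. This is a minimal illustration of the technique, not EGF's implementation; the random points stand in for the 2D UMAP layout.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
pts = rng.uniform(size=(30, 2))  # stand-in for the 2D layout

# k points fall inside the k-th-neighbour radius r_k, so density = k / (pi * r_k^2).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(pts)  # +1: each point is its own 0-distance neighbour
dists, _ = nn.kneighbors(pts)
r_k = dists[:, -1]
density = k / (np.pi * r_k ** 2)
print(density.min(), density.max())
```

Smoothing this raw estimate (e.g. with a Gaussian kernel over a grid) is what separates genuine deserts from the artificially low density that always appears at the edge of a point cloud.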
Detect gaps
Identifies low-density regions via local minima on the smoothed density surface. Each gap is scored by isolation — how absent it is from the corpus density, from 0 (fully occupied) to 1 (completely empty).
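One way to find local minima on a smoothed density surface is a minimum filter: a grid cell is a candidate gap if it equals the minimum of its own neighbourhood. The sketch below is an assumption about the general technique, not EGF's code; the toy surface carves one "desert" into an otherwise uniform grid, and the isolation formula shown (one minus density relative to the peak) is a plausible reading of the 0-to-1 score, not the documented one.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, minimum_filter

# Toy density surface: uniform, with one carved-out low-density "desert".
grid = np.ones((40, 40))
grid[10:18, 22:30] = 0.05
smoothed = gaussian_filter(grid, sigma=2)

# A cell is a local minimum if it equals the minimum of its 5x5 neighbourhood;
# the mean cutoff discards the flat high-density plateau.
is_min = smoothed == minimum_filter(smoothed, size=5)
ys, xs = np.nonzero(is_min & (smoothed < smoothed.mean()))

# Hypothetical isolation score: 1 at an empty cell, 0 at the densest cell.
isolation = 1 - smoothed[ys, xs] / smoothed.max()
print(list(zip(ys, xs, np.round(isolation, 2))))
```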
Generate candidates
Produces ranked candidate descriptions for each gap — from vocabulary projection (offline, always available) or a local LLM via ollama (optional, richer language).
Report
Renders a standalone HTML report with an interactive Plotly scatter map, a gap table ranked by isolation score, and candidate cards with confidence scores and generation-mode badges.
Domains
EGF is domain-agnostic. The --domain flag sets report labels and shapes the LLM prompt.
- concept — Default; neutral labels, works for anything
- software-tool — Developer tooling, CLIs, libraries
- philosophy — Schools of thought, philosophical positions
- genre — Musical or literary genres
- discipline — Academic research fields
- custom:<noun> — Any noun you choose, e.g. custom:tabletop RPG
Writing your corpus
EGF requires at least 7 .md or .txt files, each at least
50 characters long. Ten to twenty is the sweet spot. One file per concept, named
after the concept. Every description uses a four-sentence format:
- What it is or does. Primary function or identity. Active voice, plain language.
- What it operates on. Inputs, subject matter, or domain it engages with.
- What it produces. Output, result, or effect.
- The boundary condition. What it explicitly does not do, cover, or include. This is the most important sentence — it is what precisely positions the concept in the space.
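As an illustration (a hypothetical example, not taken from the EGF docs), a software-tool corpus file named grep.md following the four-sentence format might read:

```
Searches plain-text input line by line for patterns given as regular expressions.
It operates on files, directory trees, or standard input streams.
It produces the matching lines, optionally with filenames, line numbers, and context.
It does not edit files, parse syntax, or search binary formats meaningfully.
```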
Run egf analyse <dir> --domain <domain> --describe-format to print the template for your domain and exit.
Install
git clone https://github.com/datasculptures/epistemic-gap-finder.git
cd epistemic-gap-finder
python -m venv .venv
.venv\Scripts\Activate.ps1 # Windows
source .venv/bin/activate # macOS / Linux
pip install -e ".[dev]"
The first run downloads the all-MiniLM-L6-v2 model (~90 MB) to
.cache/. Every subsequent run is fully offline. Requires Python
3.10, 3.11, or 3.12.
Usage
# Get the description template for a domain (no analysis runs)
egf analyse my_corpus --domain software-tool --describe-format
# Basic run — auto-opens report in browser
egf analyse my_corpus --domain concept --open
# Custom domain
egf analyse my_corpus --domain "custom:tabletop RPG" --open
# With LLM-enhanced candidates (requires ollama running locally)
egf analyse my_corpus --domain "custom:tabletop RPG" --llm --llm-model llama3.2 --open
# Tune UMAP for a small corpus
egf analyse my_corpus --domain concept --n-neighbors 5 --open
# Automatic isolation threshold selection
egf analyse my_corpus --domain concept --isolation-min auto --open
# Write output to a named directory
egf analyse my_corpus --domain concept --output my_run_01 --open
Key options
- --domain — Domain template; sets report labels and LLM prompt (default: concept)
- --output — Output directory (default: ./egf_output)
- --isolation-min — Minimum isolation score for gap detection (default: 0.1; auto for adaptive)
- --max-gaps — Maximum gap regions to report (default: 7)
- --n-neighbors — UMAP n_neighbors; lower for small corpora (default: 15)
- --quality-threshold — Trustworthiness warning floor (default: 0.75)
- --llm — Enable LLM candidate generation via ollama (off by default)
- --llm-model — Ollama model name (default: llama3)
- --open — Open the report in a browser after generation
- --verbose, -v — Verbose output
Output files
Each run creates a new timestamped HTML report; the data files below are overwritten in place.
- report_YYYYMMDD_HHMMSS.html — Standalone interactive HTML report
- embeddings.npy — float32 array, shape (n, 384); raw sentence embeddings
- reduced_2d.npy — UMAP 2D positions, shape (n, 2)
- reduced_3d.npy — UMAP 3D positions, shape (n, 3)
- quality.json — Trustworthiness, continuity, and LCMC scores, plus the warning flag
- gaps.json — Detected gap regions ranked by isolation score
- candidates.json — Generated candidate descriptions ranked by confidence
Reading the report
Reduction quality
Three scores assess how faithfully the 2D map preserves the high-dimensional structure.
- Trustworthiness — ≥ 0.85 good, ≥ 0.75 acceptable
- Continuity — ≥ 0.85 good, ≥ 0.70 acceptable
- LCMC — ≥ 0.50 good, ≥ 0.20 acceptable
Low trustworthiness is common with small corpora (< 15 items). Try --n-neighbors 5 or add more items.
Isolation score
How absent a gap region is from the corpus density.
- 0.8 – 1.0 — Sharply isolated; unambiguous gap
- 0.5 – 0.8 — Moderate; real but not dominant
- 0.2 – 0.5 — Weak; on the edge of the corpus
- < 0.2 — Marginal; probably noise
Semantic map
Interactive Plotly scatter. Blue dots are corpus items, orange circles are gap regions. Hover for names and isolation scores. Zoom and pan with mouse.
Candidate cards
One card per gap. Each shows a generated name, function summary, positioning statement relative to bounding items, confidence score, and a generation-mode badge: vocab, llm, or llm→vocab (LLM attempted, fell back).
LLM-enhanced candidates
Vocabulary candidates are assembled from TF-IDF term projections — sparse but always offline. The LLM path sends each gap's bounding items and vocabulary terms to a local ollama instance and produces readable, paragraph-quality descriptions. If ollama is not running, EGF falls back to vocabulary mode automatically.
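The try-then-fall-back pattern can be sketched against ollama's documented REST API (POST /api/generate with stream set to false; the reply carries the text in a "response" field). The function name, prompt, and vocabulary terms below are hypothetical; only the endpoint and payload shape come from ollama itself.

```python
import json
import urllib.request

def candidate_text(prompt, vocab_terms, model="llama3.2",
                   url="http://localhost:11434/api/generate"):
    """Ask a local ollama instance for a description; fall back to vocabulary terms."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp)["response"], "llm"
    except OSError:
        # ollama not running or unreachable: degrade to vocabulary mode.
        return ", ".join(vocab_terms), "llm→vocab"

text, mode = candidate_text("Describe a tool between X and Y.",
                            ["traversal", "schema"])
print(mode, text)
```

With no ollama server listening, the connection error is caught and the badge would read llm→vocab, mirroring the automatic fallback described above.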
# Pull a model once (~2 GB)
ollama pull llama3.2
# Keep this running in a separate terminal
ollama serve
# Warm up the model before running EGF
ollama run llama3.2 "Hello"
# Run EGF with LLM candidates
egf analyse my_corpus --domain "custom:tabletop RPG" --llm --llm-model llama3.2 --open
Related
- Latent Language Explorer V2 — companion tool exploring the same embedding deserts through 36,125 concepts
- Reduction Quality Bench — standardised quality reports for dimensionality reduction
- All Tools