Sovereign Document Intelligence Platform
SDIP ingests any document — PDFs, Word files, transcripts, code, configs, research — and transforms it into a searchable, graph-mapped knowledge base with automatic sensitivity classification. Every byte stays on your hardware.
SDIP runs entirely on your infrastructure. Document chunking, sensitivity scanning, and LLM classification all happen locally — no API calls to OpenAI, Google, or any third party. Your competitive intelligence, client data, legal documents, and proprietary research never leave your building. This isn't a toggle in a settings menu. It's how the system is built.
Each layer operates independently and can be run on its own schedule. The pipeline is idempotent: re-running any stage processes only what has changed.
Walk any directory or vault. Detect file type, compute content hash, register in PostgreSQL with full metadata. Incremental mode skips unchanged files. Supports Markdown, JSON, HTML, Python, shell, SQL, YAML, DOCX, and 20+ formats.
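As a rough illustration of how hash-based incremental ingest can work, here is a minimal sketch. It assumes SHA-256 as the content hash and uses an in-memory dictionary in place of the PostgreSQL registry; the names `content_hash` and `incremental_scan` are hypothetical and are not SDIP's actual API.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def incremental_scan(root: Path, registry: dict) -> list:
    """Walk `root`, update `registry` in place, and return new or changed paths.

    `registry` maps a relative path to the last seen content hash. It is a
    stand-in (assumption) for a database table keyed by path.
    """
    changed = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        key = str(path.relative_to(root))
        digest = content_hash(path)
        if registry.get(key) != digest:  # unseen file or modified content
            registry[key] = digest
            changed.append(key)
    return changed
```

Running the scan twice over an unchanged directory returns an empty list the second time, which is the property that lets a stage be re-run safely.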
Intelligent semantic chunking — not arbitrary character splits. Markdown splits on headers, then paragraphs, then sentences. Python splits on function and class definitions. JSON splits on top-level keys. Small files stay whole. Every chunk gets a parent heading, word count, and position index.
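The header-first strategy for Markdown might look something like the sketch below. It covers only the first split level (headers) and leaves out the paragraph and sentence fallbacks; the `Chunk` fields mirror the metadata listed above, but the class and function names are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# ATX-style Markdown header: one to six '#' characters, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    position: int                  # index of the chunk within the document
    parent_heading: Optional[str]  # nearest header above the chunk, if any
    text: str
    word_count: int


def chunk_markdown(source: str) -> List[Chunk]:
    """Split Markdown into one chunk per header-delimited section."""
    chunks: List[Chunk] = []
    heading: Optional[str] = None
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as a chunk under the current heading.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(len(chunks), heading, text, len(text.split())))
        buffer.clear()

    for line in source.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

This simplified version does not recognise fenced code blocks, so a `#` comment inside a fence would be treated as a header; a fuller chunker would need to track fence state.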
Two-layer sensitivity scanning. Layer 1: regex patterns for SSNs, API keys, credentials, financial data, medical terms — with false-positive filtering. Layer 2: local LLM classification for context-dependent sensitivity. Every finding recorded with method, confidence score, and redacted sample.
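To make the first layer concrete, here is a small sketch of regex scanning with a false-positive filter and a redacted sample. The two patterns, the confidence values, and every name in the block are illustrative assumptions rather than SDIP's real rule set; the LLM layer is not shown.

```python
import re
from dataclasses import dataclass
from typing import List

# Pattern name -> (compiled regex, illustrative confidence score).
PATTERNS = {
    "ssn": (re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b"), 0.9),
    "aws_access_key": (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), 0.95),
}


@dataclass
class Finding:
    kind: str
    method: str
    confidence: float
    redacted_sample: str


def _plausible_ssn(match: re.Match) -> bool:
    """Reject number ranges that are never issued as US SSNs."""
    area, group, serial = match.groups()
    return (
        area not in ("000", "666")
        and not area.startswith("9")
        and group != "00"
        and serial != "0000"
    )


def _redact(value: str) -> str:
    """Mask everything except the last four characters."""
    return "*" * (len(value) - 4) + value[-4:]


def scan_text(text: str) -> List[Finding]:
    """Return one Finding per regex hit that survives filtering."""
    findings = []
    for kind, (pattern, confidence) in PATTERNS.items():
        for match in pattern.finditer(text):
            if kind == "ssn" and not _plausible_ssn(match):
                continue  # false-positive filter
            findings.append(
                Finding(kind, "regex", confidence, _redact(match.group(0)))
            )
    return findings
```

Storing only the redacted sample means the findings table itself does not become a second copy of the sensitive values.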
Neo4j knowledge graph. Documents become nodes. Topics are extracted and linked. Systems are identified and connected. Cross-references between documents become edges. Sensitivity propagates through the graph: if a restricted document references another document, that relationship is tracked.
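One possible reading of sensitivity propagation can be sketched in plain Python, without Neo4j. The sketch assumes that a document's exposure is the highest label among itself and any document that references it, directly or indirectly; the function name and the adjacency-dictionary representation are hypothetical.

```python
from collections import deque
from typing import Dict, List

LEVELS = {"PUBLIC": 0, "INTERNAL": 1, "SENSITIVE": 2, "RESTRICTED": 3}


def inbound_exposure(
    references: Dict[str, List[str]], labels: Dict[str, str]
) -> Dict[str, str]:
    """Push each document's label along its outgoing reference edges.

    `references[a]` lists the documents that `a` refers to. The result maps
    every document to its exposure; the original labels are left unchanged.
    """
    exposure = dict(labels)
    queue = deque(labels)
    while queue:
        source = queue.popleft()
        for target in references.get(source, []):
            if LEVELS[exposure[source]] > LEVELS[exposure[target]]:
                exposure[target] = exposure[source]
                queue.append(target)  # re-visit so the change travels further
    return exposure
```

Because a label can only rise and there are four levels, the loop terminates even when documents reference each other in a cycle. Keeping exposure separate from the document's own label records the relationship without reclassifying the document.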
FastAPI membrane with clearance-based access control. Four tiers: PUBLIC, INTERNAL, SENSITIVE, RESTRICTED. Under-clearanced requests get automatic redaction. Natural language query endpoint. Full audit log on every content access — who, what, when, served or blocked.
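The clearance check and audit record can be illustrated with a framework-free sketch. It assumes the four tiers are ordered from PUBLIC (lowest) to RESTRICTED (highest) and that redaction replaces the whole chunk; `StoredChunk`, `serve_chunk`, and the in-memory `AUDIT_LOG` are hypothetical names, and the FastAPI routing layer is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum
from typing import Dict, List


class Clearance(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    RESTRICTED = 3


@dataclass
class StoredChunk:
    chunk_id: str
    text: str
    level: Clearance


# Stand-in (assumption) for a persistent audit table.
AUDIT_LOG: List[Dict[str, str]] = []


def serve_chunk(user: str, clearance: Clearance, chunk: StoredChunk) -> str:
    """Return the chunk text or a redaction marker, and log the access."""
    allowed = clearance >= chunk.level
    AUDIT_LOG.append({
        "who": user,
        "what": chunk.chunk_id,
        "when": datetime.now(timezone.utc).isoformat(),
        "outcome": "served" if allowed else "blocked",
    })
    if allowed:
        return chunk.text
    return f"[REDACTED: requires {chunk.level.name}]"
```

Writing the audit entry before returning means blocked requests are logged as reliably as served ones.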
Production-grade open source. Every component self-hosted, every dependency auditable, every data path on your network.
Client documents, contracts, and case files searchable without leaving your network. Automatic PII detection. Audit trail on every access. Sensitivity propagation catches indirect exposure.
Research archives, interview transcripts, source documents — indexed, cross-referenced, and queryable in natural language. Know what you have before you duplicate the work.
Codebases, architecture docs, runbooks, post-mortems — chunked and graph-mapped so tribal knowledge becomes searchable institutional knowledge. New hires find answers instead of asking.
Papers, datasets, field notes, grant applications — topic extraction surfaces connections across projects. The graph sees relationships that keyword search misses.
| Capability | Notion AI / ChatGPT | Enterprise RAG (cloud) | SDIP |
|---|---|---|---|
| Data stays on your hardware | No | No | Always |
| Zero cloud API calls | No | No | By design |
| Automatic sensitivity detection | No | Partial | Regex + LLM |
| Knowledge graph mapping | No | Rare | Neo4j topology |
| Clearance-based redaction | No | Some | 4 tiers + audit |
| Sensitivity propagation | No | No | Graph-based |
| Cross-document references | No | Limited | Auto-detected |
| Runs on a single server | No | No | Full stack |
| Per-seat pricing | $10–20/mo | $$$ | None — you own it |
SDIP is deployed on your infrastructure, configured for your document ecosystem, and handed over with full documentation. No ongoing license. No per-seat fees. The system is yours.