SanTOK is a comprehensive, production-ready text processing and cognitive reasoning framework that goes far beyond simple tokenization. It provides a complete toolkit for text analysis, semantic understanding, model training, vector storage, API deployment, and deterministic reasoning.
- Quick Start (5 Minutes)
- Overview
- Key Features
- Architecture
- Model Learning & Development Process
- How SanTOK Components Help Your Model
- Installation
- Quick Start
- Core Components
- Usage Examples
- Advanced Examples & Use Cases
- API Documentation
- CLI Usage
- Deployment
- Project Structure
- Testing
- Comparison with Alternatives
- Known Limitations
- Best Practices & Recommendations
- Roadmap & Future Plans
- Version History
- Migration Guide
- Quick Reference / Cheat Sheet
- Industry Use Cases & Applications
- FAQ
- Contributing
- License
- Author
- Additional Resources
- Support & Troubleshooting
Get SanTOK running in 5 minutes:
# 1. Install
pip install -r requirements.txt
# 2. Basic tokenization
python -c "from santok import TextTokenizationEngine; engine = TextTokenizationEngine(); print(engine.tokenize('Hello World', 'word'))"
# 3. Start API server
python run.py
# 4. Test API
curl http://localhost:8000/api/v1/health
Or use the CLI:
python santok_cli.py tokenize "Hello World" --method word
For detailed setup, see the Installation section.
SanTOK is a multi-layered framework consisting of three main components:
- SanTOK Core - Advanced text tokenization engine with multiple methods, mathematical analysis, and statistical features
- SanTOK Cognitive - Deterministic reasoning substrate for LLM-based systems with knowledge graphs and symbolic reasoning
- SanTOK Complete - Comprehensive production system with embeddings, vector stores, training, and API servers
- Multiple Tokenization Methods: 9+ tokenization strategies (word, character, subword, grammar-based, byte-level, etc.)
- Deterministic Processing: Same input always produces the same output with reproducible UIDs
- Mathematical Analysis: Advanced algorithms using digital roots, weighted sums, and 9-centric mathematics
- Semantic Embeddings: Multiple embedding strategies (feature-based, hash-based, semantic, hybrid)
- Vector Database Integration: Support for ChromaDB, FAISS, and Weaviate
- Cognitive Reasoning: Knowledge graphs, symbolic reasoning, and constraint enforcement
- Production-Ready APIs: FastAPI-based servers with WebSocket support
- Training Capabilities: Custom semantic model training on your data
- ✅ 9+ Tokenization Methods: Space, word, character, grammar, subword (BPE, frequency, syllable), byte-level
- ✅ Mathematical Properties: Frontend digits, backend numbers, global IDs, digital roots
- ✅ Deterministic UIDs: Xorshift64* based unique identifiers
- ✅ Statistical Features: Length factors, balance indices, entropy calculations
- ✅ Preprocessing Options: Case normalization, punctuation removal, repetition collapsing
- ✅ Multiple Embedding Strategies: Feature-based, hash-based, semantic, hybrid
- ✅ Semantic Training: Train custom embeddings on your corpus
- ✅ Enhanced Trainer: Multi-stream hierarchical learning with temporal awareness
- ✅ Vector Stores: ChromaDB, FAISS, Weaviate integration
- ✅ Inference Pipeline: Production-ready embedding inference
- ✅ Knowledge Graphs: 15+ relation types (IS_A, PART_OF, CAUSES, USES, etc.)
- ✅ Symbolic Reasoning: 20+ inference rules (transitivity, inheritance, symmetry)
- ✅ Knowledge Trees: Hierarchical organization and taxonomies
- ✅ Unified Memory: Persistent memory system with graph linking
- ✅ Constraint Enforcement: LLM output validation against verified facts
- ✅ Full Explainability: Complete reasoning traces for every answer
- ✅ FastAPI Servers: Production-ready RESTful APIs
- ✅ WebSocket Support: Real-time tokenization and streaming
- ✅ Interactive Documentation: Auto-generated API docs at /docs
- ✅ Job Management: Async job processing with status tracking
- ✅ Authentication: JWT-based security
- ✅ Cross-Platform: Windows, Linux, macOS support
- ✅ Vocabulary Building: Custom vocabulary construction
- ✅ Language Model Training: Train language models on your data
- ✅ Small Language Models (SLM): Lightweight transformer-based models
- ✅ Dataset Management: Download and process training datasets
SanTOK follows a modular architecture with clear separation of concerns. Below is a detailed breakdown of each component's architecture.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Framework │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ SanTOK Core │ │ SanTOK │ │ SanTOK │ │
│ │ │ │ Cognitive │ │ Complete │ │
│ │ Tokenization │ │ Reasoning │ │ Production │ │
│ │ Engine │ │ Substrate │ │ System │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
│ ┌──────────────────▼──────────────────┐ │
│ │ API Servers & CLI Tools │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────▼──────────────────┐ │
│ │ Vector Stores & Integrations │ │
│ └──────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
This section provides a clean, detailed view of each major component's architecture, showing how they work internally and how they integrate with the rest of the system.
SanTOK's tokenization system is the foundation of all text processing. It provides 9 different tokenization methods, each producing deterministic, mathematically-rich token representations.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Tokenization Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Input Layer │ │
│ │ - Raw text string │ │
│ │ - Optional: source tag, language hint │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Preprocessing Pipeline │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Normalize │→ │ Remove │ │ │
│ │ │ Case │ │ Punctuation │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Normalize │→ │ Detect │ │ │
│ │ │ Whitespace │ │ Language │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Parallel Tokenization (9 Methods) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Space │ │ Word │ │ Char │ │ │
│ │ │Tokenizer │ │Tokenizer │ │Tokenizer │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Grammar │ │ Subword │ │ Byte │ │ │
│ │ │Tokenizer │ │Tokenizer │ │Tokenizer │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ BPE │ │ Syllable │ │Frequency │ │ │
│ │ │Tokenizer │ │Tokenizer │ │Tokenizer │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Mathematical Enrichment Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ UID │→ │ Frontend │ │ │
│ │ │ Generation │ │ Digit Calc │ │ │
│ │ │(Xorshift64*) │ │(9-centric) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Backend │→ │ Global ID │ │ │
│ │ │ Number │ │ Assignment │ │ │
│ │ │ Composition │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Neighbor │→ │ Content ID │ │ │
│ │ │ UID Linking │ │ Generation │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Statistical Analysis │ │
│ │ - Length Factor (token count % 10) │ │
│ │ - Balance Index (mean of frontend digits) │ │
│ │ - Entropy Index (variance of frontend digits) │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ TokenStream Output │ │
│ │ - TokenRecord objects with all properties │ │
│ │ - Organized by tokenization method │ │
│ │ - Ready for embedding generation │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Input: "Hello World"
↓
[Preprocessing]
normalize_case() → "hello world"
normalize_whitespace() → "hello world"
detect_language() → "en"
↓
[Tokenization - Word Method]
tokenize_word() → ["Hello", "World"]
↓
[UID Assignment]
assign_uids(seed=42) →
"Hello" → UID: 12345678901234567890
"World" → UID: 98765432109876543210
↓
[Mathematical Properties]
Frontend Digits:
"Hello" → 5 (9-centric calculation)
"World" → 6
Backend Numbers:
"Hello" → 12345 (composite calculation)
"World" → 67890
Global IDs:
"Hello" → 1001
"World" → 1002
↓
[Neighbor Linking]
"Hello".prev_uid = None
"Hello".next_uid = 98765432109876543210
"World".prev_uid = 12345678901234567890
"World".next_uid = None
↓
[TokenRecord Creation]
TokenRecord(
text="Hello",
uid=12345678901234567890,
index=0,
frontend_digit=5,
backend_number=12345,
global_id=1001,
prev_uid=None,
next_uid=98765432109876543210,
content_id=hash("Hello")
)
↓
[TokenStream]
TokenStream(
name="word",
tokens=[TokenRecord("Hello"), TokenRecord("World")]
)
TokenRecord:
text: str # Original token text
uid: int # Unique identifier (64-bit)
index: int # Position in sequence
frontend_digit: int # 9-centric digit (1-9)
backend_number: int # Composite number
global_id: int # Global sequence ID
content_id: int # Content-based hash
prev_uid: Optional[int] # Previous token UID
next_uid: Optional[int] # Next token UID
stream_type: str # Tokenization method
metadata: dict # Additional properties
- Deterministic: Same input + seed = same output
- Parallel Processing: All 9 methods run simultaneously
- Mathematical Richness: Every token has 5+ mathematical properties
- Multi-language: Automatic language detection
- Source Tracking: Optional source tags for provenance
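The exact UID and frontend-digit formulas are internal to SanTOK; the sketch below is a minimal illustration, assuming the UID generator is a standard xorshift64* step over a seeded 64-bit state and the frontend digit is the 9-centric digital root referenced above. Function names are illustrative, not SanTOK's API.
def xorshift64star(state: int) -> int:
    """One step of the xorshift64* PRNG (deterministic 64-bit UID)."""
    mask = (1 << 64) - 1
    state ^= state >> 12
    state ^= (state << 25) & mask
    state ^= state >> 27
    return (state * 0x2545F4914F6CDD1D) & mask

def frontend_digit(n: int) -> int:
    """9-centric digital root: maps any positive integer onto 1..9."""
    return 1 + (n - 1) % 9

def stream_stats(frontend_digits: list[int]) -> dict:
    """Statistical features listed above: length, balance, entropy indices."""
    mean = sum(frontend_digits) / len(frontend_digits)
    var = sum((d - mean) ** 2 for d in frontend_digits) / len(frontend_digits)
    return {
        "length_factor": len(frontend_digits) % 10,  # token count % 10
        "balance_index": mean,                       # mean of frontend digits
        "entropy_index": var,                        # variance of frontend digits
    }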
SanTOK's embedding system converts tokenized text into dense vector representations suitable for machine learning, similarity search, and semantic analysis.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Embeddings Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ TokenRecord Input │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Strategy Router │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Feature │ │ Semantic │ │ Hash │ │ │
│ │ │ Based │ │ (Trained)│ │ Based │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ ┌──────────┐ │ │
│ │ │ Hybrid │ │ │
│ │ │(Combined)│ │ │
│ │ └──────────┘ │ │
│ └──────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Feature Extraction (All Strategies) │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ UID Features │ │ Text Features│ │ │
│ │ │ - 64-bit → 8 │ │ - Length │ │ │
│ │ │ bytes │ │ - Char freq │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Math Features│ │ Stream Features│ │ │
│ │ │ - Frontend │ │ - Type (one-hot)│ │ │
│ │ │ - Backend │ │ - Position │ │ │
│ │ │ - Global ID │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Strategy-Specific Processing │ │
│ │ │ │
│ │ [Feature-Based] │ │
│ │ Features → Concatenate → Project → Normalize │ │
│ │ │ │
│ │ [Semantic] │ │
│ │ UID → Lookup in trained model → Embedding │ │
│ │ │ │
│ │ [Hash-Based] │ │
│ │ Text+UID → Hash → Vector → Normalize │ │
│ │ │ │
│ │ [Hybrid] │ │
│ │ Text Embedding + Feature Embedding → Weighted │ │
│ │ │ │
│ └──────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Dimension Projection │ │
│ │ - Project to target dimension (default: 768) │ │
│ │ - L2 normalization │ │
│ │ - Type conversion (float32) │ │
│ └──────────────┬──────────────────────────────────────┘ │
│ ↓ │
│ Embedding Vector (numpy.ndarray, shape: (embedding_dim,)) │
│ │
└─────────────────────────────────────────────────────────────┘
TokenRecord
↓
Extract Features:
- UID bytes: [0.12, 0.34, 0.56, ...] (8 floats)
- Frontend digit: [0.56] (1 float, normalized)
- Backend number: [0.78] (1 float, normalized)
- Global ID: [0.90] (1 float, normalized)
- Text length: [0.05] (1 float, normalized)
- Character frequencies: [0.1, 0.2, ...] (26 floats)
- Stream type: [0, 1, 0, ...] (9 floats, one-hot)
↓
Concatenate → Feature Vector (47 floats)
↓
Linear Projection Matrix (47 × 768)
↓
Embedding Vector (768 floats)
↓
L2 Normalize
↓
Final Embedding
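A minimal numpy sketch of the feature-based path above: concatenate the features, project through a fixed matrix, and L2-normalize. The seeded random projection is an illustrative stand-in for SanTOK's internal projection matrix.
import numpy as np

def feature_embedding(features: np.ndarray, dim: int = 768) -> np.ndarray:
    """Project a 47-float feature vector to `dim` and L2-normalize."""
    rng = np.random.default_rng(42)              # fixed seed -> deterministic
    projection = rng.normal(size=(features.shape[0], dim))
    emb = features @ projection                  # linear projection (47 x 768)
    return (emb / np.linalg.norm(emb)).astype(np.float32)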
TokenRecord
↓
Extract UID: 12345678901234567890
↓
Lookup in Trained Model:
vocab[uid] → index: 42
↓
Retrieve Embedding:
embeddings[42] → [0.1, 0.2, ..., 0.9] (768 floats)
↓
Final Embedding (already normalized from training)
TokenRecord
↓
Combine: text + str(uid)
↓
Hash Function (SHA256)
↓
Convert to Vector:
hash_bytes → [0-255] → normalize to [0-1]
↓
Repeat/Interpolate to 768 dimensions
↓
L2 Normalize
↓
Final Embedding
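A minimal sketch of the hash-based path: SHA256 over text+UID, bytes mapped to [0, 1], repeated out to the target dimension, then L2-normalized. The helper name is illustrative.
import hashlib
import numpy as np

def hash_embedding(text: str, uid: int, dim: int = 768) -> np.ndarray:
    """Deterministic hash-based embedding: SHA256 bytes tiled to `dim`."""
    digest = hashlib.sha256(f"{text}{uid}".encode()).digest()   # 32 bytes
    base = np.frombuffer(digest, dtype=np.uint8) / 255.0        # -> [0, 1]
    emb = np.resize(base, dim)                                  # repeat to 768
    return (emb / np.linalg.norm(emb)).astype(np.float32)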
TokenRecord
↓
├─→ Text Embedding (optional, from sentence-transformers)
│ └─→ [0.1, 0.2, ..., 0.9] (768 floats)
│
└─→ Feature Embedding (always)
└─→ [0.2, 0.3, ..., 0.8] (768 floats)
↓
Weighted Combination:
embedding = α × text_emb + (1-α) × feature_emb
(default: α = 0.5)
↓
L2 Normalize
↓
Final Embedding
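The weighted combination above in two lines, as a hedged sketch:
import numpy as np

def hybrid_embedding(text_emb: np.ndarray, feature_emb: np.ndarray,
                     alpha: float = 0.5) -> np.ndarray:
    """Blend: alpha * text + (1 - alpha) * features, then L2-normalize."""
    emb = alpha * text_emb + (1.0 - alpha) * feature_emb
    return emb / np.linalg.norm(emb)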
List[TokenRecord]
↓
[Parallel Processing]
Split into chunks
Process chunks in parallel (multiprocessing)
↓
[Feature Extraction]
Extract features for all tokens
↓
[Vector Generation]
Generate embeddings for all tokens
↓
[Stacking]
Stack into matrix: (N, embedding_dim)
↓
Output: numpy.ndarray
SanTOK's semantic system learns meaningful representations from token co-occurrence patterns, context relationships, and mathematical properties without requiring pre-trained models.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Semantic System Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Training Corpus │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Tokenization Phase │ │
│ │ - TextTokenizer.build() │ │
│ │ - Multiple token streams │ │
│ │ - TokenRecord creation │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Vocabulary Building │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Count │→ │ Filter │ │ │
│ │ │ Frequencies │ │ (min_count) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Sort by │→ │ Create │ │ │
│ │ │ Frequency │ │ UID→Index │ │ │
│ │ └──────────────┘ │ Mapping │ │ │
│ │ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Co-occurrence Matrix Construction │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Build │→ │ Context │ │ │
│ │ │ Context │ │ Windows │ │ │
│ │ │ Windows │ │ (size=5) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Count │→ │ Create │ │ │
│ │ │ Co-occurrence│ │ Sparse │ │ │
│ │ │ Pairs │ │ Matrix │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Embedding Initialization │ │
│ │ - Random initialization (normal distribution) │ │
│ │ - Token embeddings: (vocab_size, embedding_dim) │ │
│ │ - Context embeddings: (vocab_size, embedding_dim) │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Loop (Epochs) │ │
│ │ For each epoch: │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Sample Training Pairs │ │ │
│ │ │ - Positive: co-occurring tokens │ │ │
│ │ │ - Negative: random tokens │ │ │
│ │ └──────────────┬──────────────────────┘ │ │
│ │ ↓ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Forward Pass │ │ │
│ │ │ - Dot product: token_emb · ctx_emb │ │ │
│ │ │ - Apply sigmoid │ │ │
│ │ │ - Compute loss (binary cross-entropy)│ │ │
│ │ └──────────────┬──────────────────────┘ │ │
│ │ ↓ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Backward Pass │ │ │
│ │ │ - Compute gradients │ │ │
│ │ │ - Update embeddings (SGD) │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Model Saving │ │
│ │ - Save token embeddings │ │
│ │ - Save vocabulary mapping │ │
│ │ - Save metadata (dim, vocab_size, etc.) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Multiple Token Streams
├─ char stream: [c, h, e, l, l, o, ...]
├─ subword stream: [hel, lo, wor, ld, ...]
└─ word stream: [hello, world, ...]
↓
[Multi-Stream Learning]
Learn embeddings at all granularities simultaneously
Cross-stream alignment:
- Align char → subword → word
- Hierarchical semantic relationships
↓
[Temporal Awareness]
Position-dependent embeddings:
- Early tokens: different semantics
- Middle tokens: context-aware
- Late tokens: summary semantics
↓
[Content-ID Clustering]
Group tokens by content_id:
- Deterministic semantic clusters
- Similar content → similar embeddings
↓
[Mathematical Property Integration]
Incorporate frontend/backend/global_id:
- Mathematical relationships → semantic signals
- UID-based semantic graph
↓
Enhanced Multi-Granularity Embeddings
- Co-occurrence: Tokens appearing together
- Context: Neighbor relationships
- Content Similarity: Same content_id → similar meaning
- Temporal: Position-dependent semantics
- Hierarchical: Char → Subword → Word relationships
- Mathematical: UID-based relationships
SanTOK models learn through self-supervised learning using SanTOK's unique mathematical properties. The learning process is transparent, deterministic, and explainable.
Purpose: Create a deterministic vocabulary from your text corpus using SanTOK tokenization.
Step-by-Step Process:
1. Text Corpus Input
└─ Raw text files (any size)
↓
2. SanTOK Tokenization
└─ Tokenize using SanTOK (word/char/subword)
└─ Extract UIDs, frontend digits, backend numbers
└─ Build TokenRecord objects
↓
3. Token Counting & Frequency Analysis
└─ Count occurrences of each unique token
└─ Track token metadata (UID, frontend, backend, content_id)
└─ Filter by minimum frequency threshold
↓
4. Vocabulary Construction
└─ Select top N tokens (default: 60,000)
└─ Assign sequential IDs to tokens
└─ Create token_to_id and id_to_token mappings
└─ Store special tokens (<PAD>, <UNK>, <BOS>, <EOS>, <MASK>)
↓
5. Vocabulary Persistence
└─ Save vocabulary to disk (pickle + JSON)
└─ Ready for model training
What the Model Learns at This Stage:
- ✅ Token frequency distributions
- ✅ Token relationships (through UIDs)
- ✅ Mathematical properties (frontend/backend numbers)
- ✅ Content similarity (content_id clustering)
Example Output:
Building SanTOK Vocabulary (60K)
============================================================
[Pass 1] Tokenizing text and counting vocabulary tokens...
✓ Found 1,234,567 unique tokens
Total token occurrences: 45,678,901
After filtering (min_freq=2): 987,654 tokens
[Pass 2] Creating 60K vocabulary from token frequencies...
✓ Vocabulary built!
Total vocabulary size: 60,000
Special tokens: 5
Regular tokens: 59,995
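A minimal sketch of the two-pass vocabulary build described above, assuming a plain iterator of token strings; `build_vocab` and its signature are illustrative, not SanTOK's actual API.
from collections import Counter

SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "<MASK>"]

def build_vocab(tokens, max_size=60_000, min_freq=2):
    """Count frequencies, filter rare tokens, keep the top-N by frequency."""
    counts = Counter(tokens)                                  # Pass 1: count
    kept = [t for t, c in counts.most_common() if c >= min_freq]
    vocab = SPECIAL_TOKENS + kept[: max_size - len(SPECIAL_TOKENS)]
    token_to_id = {tok: i for i, tok in enumerate(vocab)}     # Pass 2: assign IDs
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token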
Purpose: Train embeddings that capture semantic relationships between tokens.
Step-by-Step Learning Process:
1. Token Stream Preparation
└─ Load tokenized data with TokenRecords
└─ Extract UIDs, neighbors (prev_uid, next_uid)
└─ Group by stream type (char, subword, word)
↓
2. Vocabulary Building for Embeddings
└─ Create UID-based vocabulary
└─ Filter by minimum count
└─ Initialize random embeddings (vocab_size × embedding_dim)
↓
3. Co-occurrence Matrix Construction
└─ Build context windows (default: ±5 tokens)
└─ Track which tokens appear together
└─ Use SanTOK's neighbor structure:
• prev_uid → immediate predecessor
• next_uid → immediate successor
• content_id → semantic similarity
• Same stream → contextual relationships
└─ Create sparse/dense co-occurrence matrix
↓
4. Training Loop (Epochs)
For each epoch:
├─ Positive Sampling
│ └─ Sample co-occurring token pairs
│ └─ Update embeddings to increase similarity
│ └─ Use gradient descent:
│ • Compute dot product (similarity)
│ • Apply sigmoid activation
│ • Calculate loss (cross-entropy)
│ • Update embeddings: emb += lr * gradient
│
├─ Negative Sampling
│ └─ Sample random non-co-occurring pairs
│ └─ Update embeddings to decrease similarity
│ └─ 5 negative samples per positive
│
└─ Embedding Normalization
└─ L2 normalize embeddings every 2 epochs
└─ Maintain unit vectors
↓
5. Model Convergence
└─ Loss decreases over epochs
└─ Embeddings capture semantic relationships
└─ Similar tokens have similar embeddings
What the Model Learns:
- ✅ Semantic Similarity: Tokens with similar meanings cluster together
- ✅ Contextual Relationships: Tokens that appear together get closer embeddings
- ✅ Hierarchical Structure: Multi-stream learning captures different granularities
- ✅ Temporal Patterns: Position-dependent semantics
- ✅ Content Clustering: Tokens with similar content_id cluster together
Learning Metrics:
Epoch 1/10: Loss = 2.3456 # High loss - random embeddings
Epoch 2/10: Loss = 1.8923 # Learning patterns
Epoch 3/10: Loss = 1.4567 # Improving
Epoch 4/10: Loss = 1.1234 # Good progress
...
Epoch 10/10: Loss = 0.5678 # Converged - model learned!
Visual Learning Progress:
Initial State (Random):
"cat" → [0.12, -0.45, 0.78, ...] (random)
"dog" → [-0.23, 0.67, -0.12, ...] (random)
"car" → [0.34, -0.56, 0.89, ...] (random)
After Training:
"cat" → [0.45, 0.23, 0.12, ...] (learned)
"dog" → [0.42, 0.25, 0.15, ...] (similar to cat!)
"car" → [-0.12, 0.78, -0.34, ...] (different from cat/dog)
Similarity Scores:
cat-dog: 0.87 (high - learned they're similar!)
cat-car: 0.12 (low - learned they're different)
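A minimal numpy sketch of the positive/negative sampling update described above (dot product, sigmoid, cross-entropy gradient, SGD step). Array names and the learning rate are illustrative, not SanTOK internals.
import numpy as np

rng = np.random.default_rng(42)
vocab_size, dim, lr = 10_000, 768, 0.025
token_emb = rng.normal(0, 0.1, (vocab_size, dim))   # token embeddings
ctx_emb = rng.normal(0, 0.1, (vocab_size, dim))     # context embeddings

def train_pair(center: int, context: int, negatives: list[int]) -> None:
    """One SGD step: pull a co-occurring pair together, push negatives apart."""
    for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = 1.0 / (1.0 + np.exp(-token_emb[center] @ ctx_emb[target]))
        grad = score - label                        # d(cross-entropy)/d(logit)
        g_center = grad * ctx_emb[target]
        ctx_emb[target] -= lr * grad * token_emb[center]
        token_emb[center] -= lr * g_center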
Purpose: Train a GPT-2 style language model to predict next tokens.
Step-by-Step Learning Process:
1. Data Preparation
└─ Encode text to token IDs using vocabulary
└─ Create sequences of fixed length (default: 512)
└─ Split into input/target pairs:
Input: [token_1, token_2, ..., token_n]
Target: [token_2, token_3, ..., token_n+1]
↓
2. Model Architecture Initialization
└─ Token embeddings (vocab_size × embedding_dim)
└─ Position embeddings (max_seq_length × embedding_dim)
└─ Transformer layers (12 layers, 12 heads each):
• Self-attention (Q, K, V projections)
• Feed-forward networks
• Layer normalization
└─ Output projection (embedding_dim × vocab_size)
↓
3. Training Loop (Epochs)
For each epoch:
├─ Batch Creation
│ └─ Shuffle training sequences
│ └─ Create batches (default: 32 sequences)
│
├─ Forward Pass
│ └─ Embed tokens: token_emb + pos_emb
│ └─ Pass through transformer layers:
│ • Multi-head self-attention
│ • Feed-forward networks
│ • Residual connections
│ • Layer normalization
│ └─ Generate logits (vocab_size probabilities)
│
├─ Loss Calculation
│ └─ Cross-entropy loss:
│ loss = -log(prob[target_token])
│ └─ Average over all positions in sequence
│
└─ Weight Updates
└─ Compute gradients (backpropagation)
└─ Update all weights:
• Embedding weights
• Attention weights (Q, K, V, O)
• Feed-forward weights
• Layer norm parameters
• Output projection
↓
4. Model Checkpointing
└─ Save model every N epochs
└─ Store all weights and hyperparameters
└─ Enable resuming training
↓
5. Convergence & Validation
└─ Loss decreases: 6.0 → 2.0 → 1.5 → 1.2
└─ Perplexity decreases (measure of uncertainty)
└─ Model learns language patterns
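A minimal sketch of the input/target shift from step 1 above: each target sequence is the input shifted left by one token. The helper name is illustrative.
import numpy as np

def make_lm_pairs(token_ids: list[int], seq_len: int = 512):
    """Slice an ID stream into (input, target) pairs shifted by one position."""
    n_seqs = (len(token_ids) - 1) // seq_len
    inputs, targets = [], []
    for i in range(n_seqs):
        chunk = token_ids[i * seq_len : i * seq_len + seq_len + 1]
        inputs.append(chunk[:-1])      # [token_1, ..., token_n]
        targets.append(chunk[1:])      # [token_2, ..., token_n+1]
    return np.array(inputs), np.array(targets)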
What the Model Learns:
- ✅ Next Token Prediction: Learns to predict likely next tokens
- ✅ Language Patterns: Grammar, syntax, semantics
- ✅ Context Understanding: Uses previous tokens to predict next
- ✅ Long-range Dependencies: Attention mechanism captures relationships
- ✅ Domain Knowledge: Learns from training corpus
Learning Progress Example:
Epoch 1/10:
Loss: 6.2345 (High - model is guessing randomly)
Perplexity: 510.2 (Very uncertain)
Sample: "The cat sat on the [random_word]"
Epoch 5/10:
Loss: 2.1234 (Learning patterns)
Perplexity: 8.4 (More confident)
Sample: "The cat sat on the mat" (Better!)
Epoch 10/10:
Loss: 1.4567 (Converged)
Perplexity: 4.3 (Confident predictions)
Sample: "The cat sat on the mat and purred" (Coherent!)
Purpose: Verify the model learned correctly and is ready for use.
Testing Process:
1. Reconstruction Testing
└─ Test: Tokenize → Reconstruct
└─ Verify: Original text == Reconstructed text
└─ Metric: 100% accuracy required
↓
2. Embedding Quality Testing
└─ Test: Similar tokens have similar embeddings
└─ Verify: Cosine similarity > 0.7 for related tokens
└─ Metric: Semantic alignment score
↓
3. Language Model Testing
├─ Perplexity Testing
│ └─ Measure model's uncertainty
│ └─ Lower = better (model is confident)
│
├─ Generation Quality
│ └─ Generate text from prompts
│ └─ Check: Coherence, grammar, relevance
│
└─ Next Token Prediction Accuracy
└─ Test on held-out validation set
└─ Measure: Top-1, Top-5, Top-10 accuracy
↓
4. Performance Testing
└─ Speed: Tokens/second
└─ Memory: RAM usage
└─ Accuracy: Reconstruction rate
↓
5. Model Readiness Checklist
✅ Loss converged (< 2.0 for LM, < 1.0 for embeddings)
✅ Perplexity reasonable (< 10 for good models)
✅ Reconstruction accuracy = 100%
✅ Embedding similarity makes sense
✅ Generation quality acceptable
✅ Performance meets requirements
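The embedding-quality test in step 2 boils down to cosine similarity. A minimal sketch, where `emb` stands in for whatever lookup returns a token's embedding as a numpy array:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related tokens should score high, unrelated tokens low:
#   cosine_similarity(emb("cat"), emb("dog")) > 0.7  -> pass
#   cosine_similarity(emb("cat"), emb("car")) < 0.3  -> pass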
Test Results Example:
=== Model Testing Results ===
1. Reconstruction Test:
✓ Accuracy: 100.0% (Perfect!)
✓ All tokens correctly reconstructed
2. Embedding Quality:
✓ cat-dog similarity: 0.87 (High - correct!)
✓ cat-car similarity: 0.12 (Low - correct!)
✓ Semantic alignment: 0.82 (Good!)
3. Language Model:
✓ Perplexity: 4.3 (Excellent!)
✓ Top-1 accuracy: 45.2%
✓ Top-5 accuracy: 78.9%
✓ Top-10 accuracy: 89.3%
4. Performance:
✓ Speed: 1,234 tokens/second
✓ Memory: 2.3 GB
✓ Latency: 0.8ms per token
✅ MODEL READY FOR DEPLOYMENT!
How to Know Your Model is Learning:
- Loss Decreasing:
  Epoch 1:  Loss = 6.23 ❌ (Random)
  Epoch 3:  Loss = 3.45 ⚠️ (Learning)
  Epoch 5:  Loss = 2.12 ✅ (Good progress)
  Epoch 10: Loss = 1.45 ✅ (Converged!)
- Perplexity Decreasing:
  Epoch 1:  Perplexity = 510.2 ❌ (Very uncertain)
  Epoch 5:  Perplexity = 8.4 ✅ (Confident)
  Epoch 10: Perplexity = 4.3 ✅ (Very confident!)
- Embedding Similarity Makes Sense:
  cat-dog: 0.87 ✅ (High - they're similar!)
  cat-car: 0.12 ✅ (Low - they're different!)
  python-code: 0.82 ✅ (High - related!)
- Generation Quality Improving:
  Epoch 1: "The cat sat on the [random_word]"
  Epoch 5: "The cat sat on the mat"
  Epoch 10: "The cat sat on the mat and purred contentedly"
- Reconstruction Accuracy:
  Always: 100% ✅ (SanTOK guarantees perfect reconstruction)
Your model is ready when:
- ✅ Loss converged: Loss < 2.0 (language model) or < 1.0 (embeddings)
- ✅ Perplexity reasonable: < 10 for good models, < 5 for excellent
- ✅ Reconstruction perfect: 100% accuracy (always true for SanTOK)
- ✅ Embedding quality: Similarity scores make semantic sense
- ✅ Generation coherent: Generated text is grammatically correct
- ✅ Performance acceptable: Meets speed/memory requirements
- ✅ Validation passed: All tests pass
Model NOT ready if:
- ❌ Loss = 0.0000 (trivial memorization - dataset too small)
- ❌ Loss not decreasing (learning rate too high/low)
- ❌ Perplexity > 100 (model is guessing randomly)
- ❌ Generation is gibberish
- ❌ Embeddings don't capture semantics
Enhanced Semantic Trainer (Multi-Stream Learning):
Multiple Token Streams
├─ char stream (character-level)
├─ subword stream (subword-level)
└─ word stream (word-level)
↓
[Multi-Stream Learning]
├─ Learn at all granularities simultaneously
├─ Cross-stream alignment
├─ Hierarchical semantics
└─ Unified embeddings
↓
Enhanced Embeddings
└─ Capture semantics at multiple levels
└─ Better generalization
└─ Richer representations
Key Learning Mechanisms:
- Co-occurrence Learning: Tokens appearing together get similar embeddings
- Negative Sampling: Random tokens get pushed apart
- Gradient Descent: Iteratively improve embeddings
- Normalization: Maintain unit vectors for stability
- Multi-Stream Alignment: Align semantics across granularities
- Temporal Patterns: Learn position-dependent semantics
- Content Clustering: Group tokens by content_id
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Model Development Workflow │
└─────────────────────────────────────────────────────────────┘
PHASE 1: Data Preparation
──────────────────────────
Raw Text Corpus
↓
[SanTOK Tokenization]
├─ Extract tokens
├─ Generate UIDs
├─ Calculate features
└─ Create TokenRecords
↓
Tokenized Dataset
↓
PHASE 2: Vocabulary Building
─────────────────────────────
Tokenized Dataset
↓
[Vocabulary Builder]
├─ Count token frequencies
├─ Filter by min_frequency
├─ Select top 60K tokens
├─ Assign token IDs
└─ Save vocabulary
↓
Vocabulary File (60K tokens)
↓
PHASE 3A: Semantic Embedding Training
───────────────────────────────────────
Tokenized Dataset + Vocabulary
↓
[Semantic Trainer]
├─ Build co-occurrence matrix
│ └─ Use prev_uid, next_uid
│ └─ Context windows
│ └─ Content_id similarity
├─ Initialize random embeddings
├─ Training Loop (10 epochs):
│ ├─ Positive sampling
│ ├─ Negative sampling
│ ├─ Gradient updates
│ └─ Embedding normalization
└─ Save trained embeddings
↓
Trained Embeddings Model
↓
PHASE 3B: Language Model Training
───────────────────────────────────
Tokenized Dataset + Vocabulary
↓
[Language Model Trainer]
├─ Encode to token IDs
├─ Create sequences
├─ Initialize transformer model
├─ Training Loop (10 epochs):
│ ├─ Forward pass
│ ├─ Loss calculation
│ ├─ Backpropagation
│ └─ Weight updates
└─ Save trained model
↓
Trained Language Model
↓
PHASE 4: Testing & Validation
──────────────────────────────
Trained Models
↓
[Testing Suite]
├─ Reconstruction test
├─ Embedding quality test
├─ Generation quality test
├─ Performance benchmarks
└─ Validation metrics
↓
Test Results Report
↓
PHASE 5: Model Deployment
─────────────────────────
Validated Models
↓
[Deployment]
├─ Load models
├─ Initialize inference pipeline
├─ API server (optional)
└─ Production ready!
↓
🚀 DEPLOYED MODEL
This section showcases exactly how each SanTOK component contributes to making your models better, faster, and more reliable.
What SanTOK Tokenization Provides:
- ✅ Deterministic UIDs (same token = same UID always)
- ✅ Mathematical properties (frontend digits, backend numbers)
- ✅ Multiple granularities (char, subword, word)
- ✅ Perfect reversibility (100% reconstruction)
- ✅ Statistical features (entropy, balance, variance)
How It Helps Your Model:
Without SanTOK:
"cat" → random ID (different each time)
Model confusion: Same word, different IDs
With SanTOK:
"cat" → UID: 12345 (always the same)
Model benefit: Consistent representation = better learning
Impact: Models learn faster because tokens have stable identities. No confusion from changing IDs.
SanTOK provides:
- Frontend digit (1-9): Semantic category
- Backend number: Positional encoding
- Global ID: Full context signature
Model uses these for:
- Better feature engineering
- Mathematical relationships between tokens
- Deterministic clustering
Impact: Models can leverage mathematical properties for better understanding, not just raw text.
SanTOK provides 3 streams simultaneously:
- Character level: "c", "a", "t"
- Subword level: "cat"
- Word level: "cat"
Model benefits:
- Learn at all levels simultaneously
- Better handling of rare words
- Richer representations
Impact: Models understand text at multiple levels, improving generalization.
SanTOK guarantee:
Tokenize → Reconstruct = 100% accuracy
Model benefit:
- No information loss
- Can verify correctness
- Trustworthy pipeline
Impact: Models built on SanTOK are reliable and verifiable.
Real Example:
# Without SanTOK: Standard tokenizer
text = "Hello world"
tokens = ["Hello", "world"] # Lost capitalization, punctuation info
# With SanTOK: Rich tokenization
text = "Hello world"
tokens = [
TokenRecord(text="Hello", uid=12345, frontend=5, backend=678, ...),
TokenRecord(text="world", uid=23456, frontend=6, backend=789, ...)
]
# Model gets: text + UID + mathematical properties + neighbors
Result: Model has 10x more information per token, leading to better learning.
What SanTOK Embeddings Provide:
- ✅ Feature-based embeddings (from SanTOK properties)
- ✅ Semantic embeddings (self-trained, no external models)
- ✅ Hash-based embeddings (fast, deterministic)
- ✅ Hybrid embeddings (combines multiple strategies)
How It Helps Your Model:
Without SanTOK:
Model needs: BERT, Word2Vec, Sentence Transformers
Problems: Large models, slow, requires internet
With SanTOK:
Model gets: Self-trained embeddings from your data
Benefits: Fast, lightweight, works offline
Impact: Models can be trained and deployed anywhere, no external dependencies.
SanTOK embeddings learn from YOUR data:
- Medical text → medical embeddings
- Code → code embeddings
- Your domain → your embeddings
Standard embeddings:
- Generic (trained on Wikipedia)
- May not fit your domain
Impact: Models perform better on domain-specific tasks because embeddings match the domain.
SanTOK embeddings include:
- UID relationships
- Frontend/backend numbers
- Content_id clustering
- Neighbor relationships
Standard embeddings:
- Only text similarity
- No mathematical structure
Impact: Models can use mathematical relationships for better reasoning.
SanTOK provides 4 strategies:
1. Feature-based: Fast, deterministic
2. Semantic: Learned from data
3. Hash-based: Ultra-fast, no training
4. Hybrid: Best of all worlds
Model can choose based on:
- Speed requirements
- Accuracy needs
- Resource constraints
Impact: Models can optimize for speed or accuracy as needed.
Real Example:
# Standard embedding: Generic, slow
embedding = sentence_transformer.encode("cat")
# Time: 50ms, Size: 384 dim, Generic semantics
# SanTOK embedding: Domain-specific, fast
embedding = santok_embedding.generate("cat")
# Time: 2ms, Size: 768 dim, Your domain semantics
# Includes: UID relationships, mathematical properties
Result: Models get faster, more accurate, domain-specific embeddings.
What SanTOK Semantics Provide:
- ✅ Self-supervised semantic learning
- ✅ Co-occurrence patterns
- ✅ Context windows
- ✅ Content-based clustering
- ✅ Multi-stream alignment
How It Helps Your Model:
SanTOK semantics:
- Analyzes YOUR text corpus
- Learns relationships in YOUR domain
- Captures YOUR terminology
Standard semantics:
- Pre-trained on generic data
- May not understand your domain
Impact: Models understand your specific domain better.
SanTOK learns:
- Which tokens appear together
- Context windows (neighbors)
- Sequential patterns
- Temporal relationships
Model uses for:
- Better next-token prediction
- Understanding context
- Capturing dependencies
Impact: Models understand context and relationships, not just individual tokens.
SanTOK provides:
- Character-level semantics
- Subword-level semantics
- Word-level semantics
- Cross-stream alignment
Model benefits:
- Understands at all granularities
- Better handling of rare words
- Richer semantic understanding
Impact: Models have deeper semantic understanding across multiple levels.
SanTOK creates:
- Persistent semantic relationships
- UID-based semantic graph
- Content_id clusters
- Temporal patterns
Model uses for:
- Consistent semantic understanding
- Better generalization
- Explainable semantics
Impact: Models have consistent, explainable semantic understanding.
Real Example:
# Standard semantics: Generic
"Python" → generic programming language embedding
# SanTOK semantics: Domain-aware
"Python" → embedding that includes:
- Co-occurrence with "code", "programming", "language"
- Content_id cluster (programming concepts)
- Temporal patterns (appears with "develop", "script")
- Multi-stream alignment (char/subword/word levels)
Result: Models have richer, domain-specific semantic understanding.
What SanTOK Vectors Provide:
- ✅ Unified vector store interface
- ✅ Multiple backends (ChromaDB, FAISS, Weaviate)
- ✅ Efficient similarity search
- ✅ Metadata management
- ✅ Batch operations
How It Helps Your Model:
SanTOK vectors:
- Optimized similarity search
- Sub-millisecond queries
- Scales to millions of vectors
Standard approach:
- Linear search (slow)
- No optimization
- Doesn't scale
Impact: Models can quickly find similar examples for few-shot learning, retrieval, etc.
SanTOK supports:
- ChromaDB: Persistent, disk-based
- FAISS: Fast, in-memory
- Weaviate: Cloud-native, scalable
Model can choose:
- Development: FAISS (fast)
- Production: Weaviate (scalable)
- Local: ChromaDB (simple)
Impact: Models can use the best storage for their use case.
SanTOK vectors store:
- Embeddings
- Token metadata (UID, frontend, backend)
- Source information
- Timestamps
- Custom tags
Model uses for:
- Filtered searches
- Source tracking
- Temporal queries
Impact: Models can do sophisticated queries beyond simple similarity.
SanTOK vectors:
- Batch insert (thousands at once)
- Batch search (multiple queries)
- Efficient updates
Standard approach:
- One-by-one operations
- Slow for large datasets
Impact: Models can efficiently work with large datasets.
Real Example:
# Standard approach: Slow linear search
similar = find_similar(embedding) # O(n) - scans all vectors
# SanTOK vectors: Fast indexed search
similar = vector_store.search(embedding, top_k=10) # O(log n) - indexed
# Returns: Similar vectors + metadata + source info
Result: Models get 10-100x faster similarity search with rich metadata.
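A minimal FAISS sketch of an indexed search like the one above, assuming faiss-cpu is installed and embeddings are float32 numpy arrays; SanTOK's unified vector-store interface wraps backends such as this one.
import faiss
import numpy as np

dim = 768
index = faiss.IndexFlatIP(dim)            # inner product == cosine on unit vectors
vectors = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(vectors)               # L2-normalize in place
index.add(vectors)                        # batch insert

query = vectors[:1]                       # one query vector
scores, ids = index.search(query, 10)     # top-10 nearest neighbors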
What SanTOK Trees, Graphs & Reasoning Provide:
- ✅ Knowledge graphs (nodes, edges, relations)
- ✅ Hierarchical trees (concepts, documents)
- ✅ Symbolic reasoning (20+ inference rules)
- ✅ Contradiction detection
- ✅ Confidence propagation
How It Helps Your Model:
SanTOK provides:
- Knowledge graph: Relationships between concepts
- Trees: Hierarchical organization
- Reasoning: Logical inference
Model uses for:
- Understanding relationships
- Hierarchical concepts
- Logical reasoning
Impact: Models can reason about structured knowledge, not just text.
SanTOK reasoning:
- Shows inference steps
- Explains conclusions
- Tracks confidence
Standard models:
- Black box predictions
- No explanation
- Unclear reasoning
Impact: Models can explain their reasoning, crucial for trust and debugging.
SanTOK detects:
- Conflicting information
- Logical inconsistencies
- Confidence conflicts
Model uses for:
- Validating outputs
- Preventing hallucinations
- Ensuring consistency
Impact: Models can catch and prevent errors before they happen.
SanTOK provides:
- Transitivity: A→B, B→C → A→C
- Inheritance: IS_A relationships
- Symmetry: Bidirectional relations
- And 17+ more rules
Model uses for:
- Logical inference
- Knowledge expansion
- Relationship discovery
Impact: Models can make logical inferences, expanding knowledge automatically.
Real Example:
# Standard model: Text-only
Question: "Is Python a programming language?"
Answer: "Yes" (but can't explain why)
# SanTOK reasoning: Structured knowledge
Question: "Is Python a programming language?"
Reasoning:
1. Python IS_A programming language (fact)
2. Therefore: Yes
3. Confidence: 1.0 (certain)
4. Explanation: Direct IS_A relationship
Result: Models can reason logically and explain their answers.
What SanTOK Cognitive Provides:
- ✅ Deterministic reasoning substrate
- ✅ Knowledge representation
- ✅ Symbolic inference
- ✅ Constraint enforcement
- ✅ Full explainability
How It Helps Your Model:
SanTOK Cognitive = System 2 (deliberate, correct)
LLMs = System 1 (fast, intuitive, error-prone)
Combined:
- LLM generates fast responses
- SanTOK validates and corrects
- Best of both worlds
Impact: Models get the speed of LLMs with the correctness of symbolic reasoning.
SanTOK Cognitive:
- Validates against knowledge graph
- Checks for contradictions
- Enforces constraints
Standard models:
- Can hallucinate
- No validation
- Unreliable outputs
Impact: Models produce reliable, validated outputs.
SanTOK Cognitive:
- Enforces domain constraints
- Validates against rules
- Prevents invalid outputs
Model uses for:
- Safe generation
- Compliance
- Quality assurance
Impact: Models stay within safe, valid boundaries.
SanTOK Cognitive:
- Shows reasoning trace
- Explains every step
- Provides confidence scores
Standard models:
- Black box
- No explanation
- Unclear reasoning
Impact: Models are trustworthy and debuggable.
Real Example:
# Standard LLM: Can hallucinate
Question: "What is the capital of France?"
Answer: "Paris" (correct, but can't explain)
# SanTOK Cognitive: Validated and explained
Question: "What is the capital of France?"
Answer: "Paris"
Reasoning:
- France HAS_CAPITAL Paris (fact in knowledge graph)
- Confidence: 1.0
- Source: Knowledge graph node #12345
- Validation: ✓ No contradictions found
Result: Models are reliable, explainable, and validated.
What SanTOK SLM Provides:
- ✅ 100% SanTOK-native (no external AI)
- ✅ Constraint-grounded generation
- ✅ No hallucination
- ✅ Full explainability
- ✅ Lightweight and fast
How It Helps Your Model:
SanTOK SLM:
- Uses only SanTOK components
- No BERT, GPT, or other models
- Pure SanTOK tokenization + embeddings
Standard SLMs:
- Require external models
- Large dependencies
- Complex setup
Impact: Models are self-contained and easy to deploy.
SanTOK SLM:
- Generates within constraints
- Validates against knowledge graph
- Prevents invalid outputs
Standard SLMs:
- Can generate anything
- No validation
- Unreliable
Impact: Models generate safe, valid outputs.
SanTOK SLM:
- Only generates from learned knowledge
- Validates against facts
- No made-up information
Standard SLMs:
- Can hallucinate
- Make up facts
- Unreliable
Impact: Models are trustworthy and factual.
SanTOK SLM:
- Small model size
- Fast inference
- Low memory usage
Standard SLMs:
- Large models
- Slow inference
- High memory
Impact: Models can run on edge devices, mobile, etc.
Real Example:
# Standard SLM: Large, slow, can hallucinate
model = load_pretrained_slm() # 500MB, slow, unreliable
# SanTOK SLM: Small, fast, reliable
model = SanTOKSLMModel() # 50MB, fast, validated
# Generates: Constraint-grounded, explainable, no hallucination
Result: Models are lightweight, fast, and reliable.
What SanTOK Memory Provides:
- ✅ Unified memory system
- ✅ Vector + Graph + Tree storage
- ✅ Cross-store linking
- ✅ Temporal tracking
- ✅ Source awareness
How It Helps Your Model:
SanTOK Memory:
- Vector store: Similarity search
- Graph store: Relationships
- Tree store: Hierarchies
- All linked together
Standard approach:
- Separate stores
- No integration
- Fragmented knowledge
Impact: Models have unified, integrated knowledge.
SanTOK Memory:
- Search by similarity (vector)
- Search by relationship (graph)
- Search by hierarchy (tree)
- Combined searches
Standard approach:
- Single search type
- Limited retrieval
Impact: Models can retrieve knowledge in multiple ways.
SanTOK Memory:
- Tracks when knowledge was added
- Temporal relationships
- Freshness scoring
Model uses for:
- Recent information priority
- Temporal reasoning
- Knowledge evolution
Impact: Models understand time and can prioritize recent information.
SanTOK Memory:
- Tracks source of each fact
- Source-aware queries
- Source-based filtering
Model uses for:
- Attribution
- Source verification
- Quality control
Impact: Models can cite sources and verify information.
Real Example:
# Standard memory: Single store
memory = vector_store # Only similarity search
# SanTOK Memory: Unified system
memory = UnifiedMemory()
memory.add("Python is a language", source="wikipedia")
# Stored in: Vector store + Graph store + Tree store
# Linked together, with source tracking
Result: Models have unified, multi-modal, source-aware memory.
What SanTOK Interpretation Provides:
- ✅ Real-time data interpretation
- ✅ Semantic relationship discovery
- ✅ Concept exploration
- ✅ Knowledge discovery
- ✅ Weaviate integration
How It Helps Your Model:
SanTOK Interpretation:
- Interprets data as it arrives
- Finds semantic relationships
- Discovers concepts
Standard approach:
- Batch processing
- No real-time understanding
Impact: Models can understand and interpret data in real-time.
SanTOK Interpretation:
- Finds related concepts
- Discovers relationships
- Builds knowledge graph
Model uses for:
- Understanding context
- Relationship discovery
- Knowledge expansion
Impact: Models can discover and understand relationships automatically.
SanTOK Interpretation:
- Explores concepts deeply
- Multi-level exploration
- Hierarchical understanding
Model uses for:
- Deep understanding
- Concept hierarchies
- Knowledge navigation
Impact: Models can explore and understand concepts at multiple levels.
SanTOK Interpretation:
- Discovers new knowledge
- Finds patterns
- Builds understanding
Standard approach:
- Static knowledge
- No discovery
Impact: Models can discover new knowledge from data.
Real Example:
# Standard approach: Static processing
data = "Machine learning uses neural networks"
result = process(data) # Basic processing
# SanTOK Interpretation: Dynamic understanding
data = "Machine learning uses neural networks"
result = interpreter.interpret(data)
# Discovers:
# - "machine learning" IS_A "AI technique"
# - "neural networks" USES "machine learning"
# - Related concepts: "deep learning", "training"
# - Builds knowledge graph automatically
Result: Models can discover and understand knowledge dynamically.
The Complete SanTOK Advantage:
┌─────────────────────────────────────────────────────────────┐
│ How SanTOK Components Help Your Model │
└─────────────────────────────────────────────────────────────┘
1. Tokenization
└─ Provides: Deterministic foundation, mathematical properties
└─ Helps Model: Stable learning, rich features
2. Embeddings
└─ Provides: Domain-specific, fast embeddings
└─ Helps Model: Better representations, no external deps
3. Semantics
└─ Provides: Self-learned semantic understanding
└─ Helps Model: Domain-aware, contextual understanding
4. Vectors
└─ Provides: Fast similarity search, multiple backends
└─ Helps Model: Efficient retrieval, scalable storage
5. Trees & Graphs
└─ Provides: Structured knowledge, reasoning
└─ Helps Model: Logical inference, explainability
6. Cognitive
└─ Provides: Validation, constraint enforcement
└─ Helps Model: Reliable outputs, no hallucination
7. SLM
└─ Provides: Lightweight, constraint-grounded generation
└─ Helps Model: Fast, reliable text generation
8. Memory
└─ Provides: Unified, multi-modal knowledge storage
└─ Helps Model: Integrated knowledge, source tracking
9. Interpretation
└─ Provides: Real-time understanding, knowledge discovery
└─ Helps Model: Dynamic learning, relationship discovery
COMBINED RESULT:
✅ Faster training (deterministic, rich features)
✅ Better accuracy (domain-specific, validated)
✅ More reliable (no hallucination, constraints)
✅ Fully explainable (reasoning traces, sources)
✅ Production-ready (scalable, efficient)
✅ Self-contained (no external dependencies)
SanTOK provides comprehensive training and testing systems for semantic models, language models, and performance evaluation.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Training System Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Data Preparation │ │
│ │ - Corpus loading │ │
│ │ - Text preprocessing │ │
│ │ - Dataset splitting (train/val/test) │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Tokenization │ │
│ │ - TextTokenizer.build() │ │
│ │ - Multiple streams │ │
│ │ - TokenRecord creation │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Type Selection │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Semantic │ │ Language │ │ │
│ │ │ Embedding │ │ Model │ │ │
│ │ │ Training │ │ Training │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Enhanced │ │ Vocabulary │ │ │
│ │ │ Semantic │ │ Building │ │ │
│ │ │ Training │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Training Execution │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ For each epoch: │ │ │
│ │ │ 1. Sample batch │ │ │
│ │ │ 2. Forward pass │ │ │
│ │ │ 3. Compute loss │ │ │
│ │ │ 4. Backward pass │ │ │
│ │ │ 5. Update parameters │ │ │
│ │ │ 6. Validation (if applicable) │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Model Evaluation │ │
│ │ - Loss curves │ │
│ │ - Embedding quality metrics │ │
│ │ - Similarity evaluation │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Model Persistence │ │
│ │ - Save embeddings │ │
│ │ - Save vocabulary │ │
│ │ - Save metadata │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Testing System Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Test Categories │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Unit Tests │ │ Integration │ │ │
│ │ │ - Functions │ │ Tests │ │ │
│ │ │ - Classes │ │ - Pipelines │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Performance │ │ Accuracy │ │ │
│ │ │ Tests │ │ Tests │ │ │
│ │ │ - Speed │ │ - Correctness│ │ │
│ │ │ - Memory │ │ - Quality │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Test Execution │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ 1. Tokenization Tests │ │ │
│ │ │ - All 9 methods │ │ │
│ │ │ - Determinism │ │ │
│ │ │ - Mathematical properties │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ 2. Embedding Tests │ │ │
│ │ │ - All 4 strategies │ │ │
│ │ │ - Dimension correctness │ │ │
│ │ │ - Normalization │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ 3. Training Tests │ │ │
│ │ │ - Vocabulary building │ │ │
│ │ │ - Training convergence │ │ │
│ │ │ - Model saving/loading │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ 4. Performance Benchmarks │ │ │
│ │ │ - Speed measurements │ │ │
│ │ │ - Memory usage │ │ │
│ │ │ - Scalability │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Test Reporting │ │
│ │ - Pass/Fail status │ │
│ │ - Performance metrics │ │
│ │ - Coverage reports │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. Data Collection
↓
2. Preprocessing
↓
3. Tokenization
↓
4. Vocabulary Building
↓
5. Co-occurrence Matrix
↓
6. Model Initialization
↓
7. Training Loop
├─ Epoch 1 → Loss: 0.5
├─ Epoch 2 → Loss: 0.4
├─ Epoch 3 → Loss: 0.3
└─ ...
↓
8. Validation
↓
9. Model Saving
↓
10. Evaluation
SanTOK Cognitive provides a complete knowledge representation and reasoning system using trees, graphs, and symbolic inference.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Knowledge Trees Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Tree Structure │ │
│ │ │ │
│ │ Root Node │ │
│ │ (depth=0) │ │
│ │ │ │ │
│ │ ┌────────┼────────┐ │ │
│ │ │ │ │ │ │
│ │ Child 1 Child 2 Child 3 │ │
│ │ (depth=1) (depth=1) (depth=1) │ │
│ │ │ │ │ │ │
│ │ ┌───┴───┐ │ ┌───┴───┐ │ │
│ │ │ │ │ │ │ │ │
│ │ Leaf 1 Leaf 2│ Leaf 3 Leaf 4 │ │
│ │ │ │ │
│ │ Leaf 5 │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ TreeNode Properties: │
│ - node_id: Unique identifier │
│ - content: Text/label │
│ - parent_id: Parent node reference │
│ - children_ids: List of child node IDs │
│ - depth: Hierarchical depth │
│ - metadata: Additional properties │
│ - embedding_ref: Link to vector store │
│ - graph_node_ref: Link to graph store │
│ │
│ Operations: │
│ - add_node(): Add new node │
│ - remove_node(): Remove node (recursive) │
│ - get_path(): Get path from root to node │
│ - traverse_dfs(): Depth-first traversal │
│ - traverse_bfs(): Breadth-first traversal │
│ - get_subtree(): Extract subtree │
│ │
└─────────────────────────────────────────────────────────────┘
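A minimal sketch of the node structure and DFS traversal listed above; the dataclass is illustrative, not SanTOK's actual TreeStore types.
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    node_id: str
    content: str
    depth: int = 0
    children: list["TreeNode"] = field(default_factory=list)

def traverse_dfs(node: TreeNode):
    """Depth-first traversal: yield the node, then recurse into children."""
    yield node
    for child in node.children:
        yield from traverse_dfs(child)

root = TreeNode("root", "Concepts")
root.children.append(TreeNode("c1", "Programming", depth=1))
for n in traverse_dfs(root):
    print("  " * n.depth + n.content)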
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Knowledge Graph Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Graph Structure │ │
│ │ │ │
│ │ Node A ──IS_A──→ Node B │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ PART_OF USES │ │
│ │ │ │ │ │
│ │ ↓ ↓ │ │
│ │ Node C ──CAUSES──→ Node D │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ LOCATED_IN RELATED_TO │ │
│ │ │ │ │ │
│ │ └───────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ GraphNode Properties: │
│ - id: Unique integer ID │
│ - label: Human-readable label │
│ - properties: Dictionary of properties │
│ - edges: List of outgoing edges │
│ │
│ GraphEdge Properties: │
│ - source_id: Source node ID │
│ - target_id: Target node ID │
│ - relation: RelationType (15+ types) │
│ - confidence: Confidence score (0-1) │
│ - metadata: Additional properties │
│ │
│ Relation Types (15+): │
│ - IS_A, PART_OF, CAUSES, USES │
│ - LOCATED_IN, RELATED_TO, PRECEDES │
│ - OPPOSITE_OF, SIMILAR_TO, CONTAINS │
│ - ... (and more) │
│ │
│ Operations: │
│ - add_node(): Add new node │
│ - add_edge(): Add relation │
│ - get_neighbors(): Get connected nodes │
│ - find_path(): Find path between nodes │
│ - query(): Complex graph queries │
│ │
└─────────────────────────────────────────────────────────────┘
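A minimal sketch of the node/edge structures listed above, with confidence-carrying edges; field names mirror the diagram but the classes themselves are illustrative.
from dataclasses import dataclass, field

@dataclass
class GraphEdge:
    source_id: int
    target_id: int
    relation: str          # e.g. "IS_A", "PART_OF", "CAUSES"
    confidence: float = 1.0

@dataclass
class GraphNode:
    id: int
    label: str
    edges: list[GraphEdge] = field(default_factory=list)

class GraphStore:
    def __init__(self):
        self.nodes: dict[int, GraphNode] = {}

    def add_node(self, node_id: int, label: str) -> GraphNode:
        self.nodes[node_id] = GraphNode(node_id, label)
        return self.nodes[node_id]

    def add_edge(self, src: int, dst: int, relation: str, confidence: float = 1.0):
        self.nodes[src].edges.append(GraphEdge(src, dst, relation, confidence))

    def get_neighbors(self, node_id: int) -> list[int]:
        return [e.target_id for e in self.nodes[node_id].edges]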
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Reasoning Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Query: "What is machine learning?" │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Query Parser │ │
│ │ - Parse natural language │ │
│ │ - Extract key concepts │ │
│ │ - Build structured query │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Knowledge Retrieval │ │
│ │ - Search in GraphStore │ │
│ │ - Search in TreeStore │ │
│ │ - Search in UnifiedMemory │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Inference Engine │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Apply Inference Rules (20+) │ │ │
│ │ │ - Transitivity: A→B, B→C → A→C │ │ │
│ │ │ - Inheritance: IS_A relationships │ │ │
│ │ │ - Symmetry: A↔B │ │ │
│ │ │ - Inverse: A→B → B←A │ │ │
│ │ │ - ... (16+ more rules) │ │ │
│ │ └──────────────┬──────────────────────┘ │ │
│ │ ↓ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Rule Chaining │ │ │
│ │ │ - Chain multiple rules │ │ │
│ │ │ - Propagate confidence │ │ │
│ │ │ - Track reasoning path │ │ │
│ │ └──────────────┬──────────────────────┘ │ │
│ │ ↓ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Generate Inferred Facts │ │ │
│ │ │ - New relationships │ │ │
│ │ │ - Confidence scores │ │ │
│ │ │ - Reasoning traces │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Path Finding │ │
│ │ - Find reasoning paths │ │
│ │ - Calculate path confidence │ │
│ │ - Rank paths by relevance │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Contradiction Detection │ │
│ │ - Check for conflicting facts │ │
│ │ - Flag contradictions │ │
│ │ - Resolve conflicts (if possible) │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Explanation Generation │ │
│ │ - Build reasoning trace │ │
│ │ - Format explanation │ │
│ │ - Include confidence scores │ │
│ │ - Link to source facts │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ Answer with Full Reasoning Trace │
│ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Unified Memory Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ MemoryObject │ │
│ │ - content: Text/fact │ │
│ │ - type: "fact", "concept", "rule", etc. │ │
│ │ - metadata: Additional properties │ │
│ │ - graph_node_ref: Link to graph │ │
│ │ - tree_node_ref: Link to tree │ │
│ │ - embedding_ref: Link to vector store │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Auto-Linking │ │
│ │ When auto_link_graph=True: │ │
│ │ - Extract entities from content │ │
│ │ - Create graph nodes │ │
│ │ - Create relations (IS_A, PART_OF, etc.) │ │
│ │ - Link memory object to graph │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Storage Integration │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ GraphStore │ │ TreeStore │ │ │
│ │ │ (Relations) │ │ (Hierarchy) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │
│ │ │ VectorStore │ │ │
│ │ │ (Embeddings) │ │ │
│ │ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Query: "What is Python?"
↓
[Query Parser]
Extract: "Python" (entity)
Query type: DEFINITION
↓
[Knowledge Retrieval]
Find in GraphStore:
Node: "Python" (id=1)
Edge: Python --IS_A--> Programming Language
Find in Memory:
MemoryObject: "Python is a programming language"
↓
[Inference Engine]
Apply rules:
- IS_A transitivity
- Inheritance
Generate inferred facts:
Python IS_A Programming Language
Programming Language IS_A Software Tool
→ Python IS_A Software Tool (inferred)
↓
[Path Finding]
Find paths:
Path 1: Python → Programming Language (confidence: 1.0)
Path 2: Python → Programming Language → Software Tool (confidence: 0.9)
↓
[Explanation Generation]
Build trace:
Facts used: 2
Rules applied: transitive_is_a
Path: Python → Programming Language
Confidence: 95%
↓
Answer: "Python is a type of programming language."
Explanation: [Full reasoning trace]
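The transitivity step in this trace is easy to state in plain Python. The sketch below works on bare (subject, relation, object, confidence) tuples and is deliberately independent of SanTOK's internal classes:

def apply_is_a_transitivity(facts):
    """Derive (a, IS_A, c) from (a, IS_A, b) and (b, IS_A, c); confidences multiply."""
    inferred = []
    for a, r1, b, c1 in facts:
        for b2, r2, c, c2 in facts:
            if r1 == r2 == "IS_A" and b == b2 and a != c:
                inferred.append((a, "IS_A", c, c1 * c2))
    return inferred

facts = [
    ("Python", "IS_A", "Programming Language", 1.0),
    ("Programming Language", "IS_A", "Software Tool", 0.9),
]
print(apply_is_a_transitivity(facts))
# [('Python', 'IS_A', 'Software Tool', 0.9)]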
SanTOK provides production-ready API servers built on FastAPI, supporting REST endpoints, WebSocket connections, file uploads, and async job processing.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK API Server Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Client Layer │ │
│ │ - HTTP Clients (REST) │ │
│ │ - WebSocket Clients │ │
│ │ - File Upload Clients │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ FastAPI Application │ │
│ │ - FastAPI app instance │ │
│ │ - Route registration │ │
│ │ - Middleware stack │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Middleware Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ CORS │→ │ Authentication│ │ │
│ │ │ Handler │ │ (JWT) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Request │→ │ Error │ │ │
│ │ │ Validation │ │ Handling │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Route Handlers │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ /api/v1/ │ │ /api/v1/ │ │ │
│ │ │ tokenize │ │ embed │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ /api/v1/ │ │ /api/v1/ │ │ │
│ │ │ train │ │ search │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ /api/v1/ │ │ /ws │ │ │
│ │ │ upload │ │ (WebSocket) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Business Logic Layer │ │
│ │ - TextTokenizer │ │
│ │ - EmbeddingGenerator │ │
│ │ - VectorStore │ │
│ │ - JobManager (async) │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Response Layer │ │
│ │ - JSON responses │ │
│ │ - Streaming responses │ │
│ │ - WebSocket messages │ │
│ │ - File downloads │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
HTTP Request
↓
[FastAPI Router]
Parse route → Select handler
↓
[CORS Middleware]
Add CORS headers
↓
[Authentication Middleware]
Validate JWT token (if required)
↓
[Request Validation]
Validate request body (Pydantic)
↓
[Route Handler]
Execute business logic:
- Tokenize text
- Generate embeddings
- Search vectors
- etc.
↓
[Response Serialization]
Convert to JSON
↓
[Response Middleware]
Add headers, status codes
↓
HTTP Response
WebSocket Connection
↓
[Connection Handler]
Accept connection
↓
[Message Loop]
While connected:
Receive message
↓
[Message Router]
Route to handler:
- tokenize
- train
- stream
↓
[Processing]
Execute operation
↓
[Streaming Response]
Send progress updates
↓
Send final result
↓
[Connection Close]
Cleanup resources
┌─────────────────────────────────────────────────────────────┐
│ Async Job Management │
├─────────────────────────────────────────────────────────────┤
│ │
│ POST /api/v1/jobs │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Job Creation │ │
│ │ - Generate job_id │ │
│ │ - Create job record │ │
│ │ - Status: PENDING │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Background Task │ │
│ │ - Execute in thread pool │ │
│ │ - Update status: RUNNING │ │
│ │ - Process request │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Status Updates │ │
│ │ - Update progress │ │
│ │ - Store intermediate results │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Job Completion │ │
│ │ - Status: COMPLETED or FAILED │ │
│ │ - Store final results │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ GET /api/v1/jobs/{job_id} │
│ ↓ │
│ Return job status and results │
│ │
└─────────────────────────────────────────────────────────────┘
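A client drives this flow with two endpoints: submit, then poll. A minimal sketch with requests (the request body and the response field names "job_id" and "status" are assumptions; consult the server's interactive docs for the exact schema):

import time
import requests

BASE = "http://localhost:8000/api/v1"

resp = requests.post(f"{BASE}/jobs", json={"task": "tokenize", "text": "Hello World"})
job_id = resp.json()["job_id"]          # assumed response field

while True:
    job = requests.get(f"{BASE}/jobs/{job_id}").json()
    if job["status"] in ("COMPLETED", "FAILED"):   # terminal states per the diagram above
        break
    time.sleep(1.0)                     # poll interval

print(job)                              # final results stored with the job record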
Core Endpoints:
- POST /api/v1/tokenize - Tokenize text
- POST /api/v1/embed - Generate embeddings
- POST /api/v1/analyze - Comprehensive analysis
Training Endpoints:
- POST /api/v1/train - Train semantic model
- GET /api/v1/training/jobs - List training jobs
- GET /api/v1/training/jobs/{id} - Get job status
File Operations:
- POST /api/v1/upload - Upload file
- POST /api/v1/tokenize/file - Tokenize file
- GET /api/v1/download/{id} - Download results
Search & Retrieval:
- POST /api/v1/search - Vector search
- GET /api/v1/health - Health check
- GET /api/v1/info - System information
WebSocket:
- WS /ws - Real-time tokenization
- WS /ws/train - Training progress
- WS /ws/execute - Code execution
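As a quick smoke test of the REST surface above (the JSON body shape is an assumption; FastAPI apps typically publish the exact schema at /docs):

curl -X POST http://localhost:8000/api/v1/tokenize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello World", "method": "word"}'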
SanTOK provides a unified interface to multiple vector database backends, allowing seamless switching between ChromaDB, FAISS, and Weaviate.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Vector Store Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Unified Interface │ │
│ │ SanTOKVectorStore (Abstract Base Class) │ │
│ │ - add_tokens() │ │
│ │ - search() │ │
│ │ - get_token_embedding() │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Backend Selection │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ ChromaDB │ │ FAISS │ │ │
│ │ │ VectorStore │ │ VectorStore │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Weaviate │ │ In-Memory │ │ │
│ │ │ VectorStore │ │ VectorStore │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Backend-Specific Implementation │ │
│ │ Each backend implements: │ │
│ │ - Storage mechanism │ │
│ │ - Index structure │ │
│ │ - Search algorithm │ │
│ │ - Metadata handling │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
ChromaVectorStore
↓
[Initialization]
Create PersistentClient
Get or create collection
↓
[Add Tokens]
Convert embeddings to list
Extract metadata from TokenRecords
Add to collection with IDs
↓
[Search]
Query collection with embedding
Use similarity search
Return top_k results with metadata
↓
[Retrieve]
Get by ID from collection
Return embedding vector
FAISSVectorStore
↓
[Initialization]
Create IndexFlatL2 (L2 distance)
Initialize token mapping
↓
[Add Tokens]
Add embeddings to FAISS index
Store TokenRecord mapping
↓
[Search]
Query FAISS index
Get top_k indices
Map indices to TokenRecords
Return results with distances
↓
[Retrieve]
Get embedding from index
Return vector
WeaviateVectorStore
↓
[Initialization]
Connect to Weaviate cluster
Create or get class (collection)
Define schema
↓
[Add Tokens]
Create objects with:
- Vector (embedding)
- Properties (metadata)
Batch insert
↓
[Search]
Use GraphQL query
Vector similarity search
Filter by metadata (optional)
Return results
↓
[Retrieve]
Get object by ID
Extract vector and metadata
Return embedding
| Feature | ChromaDB | FAISS | Weaviate |
|---|---|---|---|
| Speed | Fast | Very Fast | Fast |
| Memory | Medium | Low | Medium |
| Persistence | Built-in | Manual | Cloud |
| Metadata | Good | Limited | Excellent |
| Scalability | Medium | High | Very High |
| Use Case | Development | Production | Enterprise |
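For orientation, the FAISS column above corresponds to the standard faiss API; a minimal standalone sketch, independent of SanTOK's wrapper classes:

import numpy as np
import faiss

dim = 768
index = faiss.IndexFlatL2(dim)                 # exact L2-distance index, as in the flow above

embeddings = np.random.rand(100, dim).astype("float32")
index.add(embeddings)                          # [Add Tokens]; keep your own row→TokenRecord map

query = np.random.rand(1, dim).astype("float32")
distances, indices = index.search(query, 5)    # [Search]: top_k=5 row indices + L2 distances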
SanTOK includes a complete Small Language Model (SLM) implementation that uses only SanTOK components - no external AI frameworks.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK SLM Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Input Layer │ │
│ │ - Text prompt │ │
│ │ - Context (optional) │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Tokenization │ │
│ │ - SanTOK TextTokenizer │ │
│ │ - Convert text to TokenRecords │ │
│ │ - Extract UIDs │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Embedding Layer │ │
│ │ - SanTOK EmbeddingGenerator │ │
│ │ - Convert tokens to embeddings │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Model Architecture │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Transformer │→ │ Attention │ │ │
│ │ │ Encoder │ │ Mechanism │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Positional │→ │ Feed-Forward │ │ │
│ │ │ Encoding │ │ Network │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Constraint Engine │ │
│ │ - Knowledge graph constraints │ │
│ │ - Fact validation │ │
│ │ - No hallucination guarantee │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Decoder │ │
│ │ - Constrained decoding │ │
│ │ - Token generation │ │
│ │ - Sequence optimization │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Output │ │
│ │ - Generated text │ │
│ │ - Confidence scores │ │
│ │ - Reasoning trace │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Training Facts
↓
[Knowledge Integration]
Add facts to UnifiedMemory
Build knowledge graph
↓
[Tokenization]
Tokenize all facts
Build vocabulary
↓
[Embedding Training]
Train semantic embeddings
Learn token relationships
↓
[Model Training]
Train transformer layers
Learn sequence patterns
↓
[Constraint Learning]
Learn graph constraints
Build constraint rules
↓
Trained Model
Input Prompt
↓
[Tokenization]
Tokenize prompt
↓
[Embedding]
Convert to embeddings
↓
[Encoding]
Pass through encoder
Generate context
↓
[Constraint Checking]
Query knowledge graph
Get valid tokens
↓
[Decoding]
Generate tokens one by one
Apply constraints
Optimize sequence
↓
[Output]
Generated text
Confidence scores
- No Hallucination: Only generates facts from knowledge graph
- Fact Validation: Every token checked against constraints
- Reasoning Trace: Full explanation of generation
- Confidence Scores: Reliability for each token
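The constraint step reduces to intersecting the decoder's candidate distribution with the set of graph-licensed tokens; schematically (a sketch, not SanTOK's actual decoder):

def constrained_decode_step(candidates, allowed_tokens):
    """candidates: {token: score}; allowed_tokens: set licensed by the knowledge graph."""
    valid = {tok: s for tok, s in candidates.items() if tok in allowed_tokens}
    if not valid:
        return None                  # nothing licensed → abstain instead of hallucinating
    return max(valid, key=valid.get)

next_tok = constrained_decode_step(
    {"language": 0.7, "animal": 0.2, "tool": 0.1},
    allowed_tokens={"language", "tool"},
)
# → "language"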
SanTOK provides comprehensive command-line interfaces for all operations, from tokenization to training to system management.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK CLI Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Command: python santok_cli.py <command> [options] │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Argument Parser │ │
│ │ - Parse command-line arguments │ │
│ │ - Validate inputs │ │
│ │ - Set defaults │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Command Router │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ tokenize │ │ train │ │ │
│ │ │ - Text │ │ - Model │ │ │
│ │ │ - File │ │ - Corpus │ │ │
│ │ │ - URL │ │ - Enhanced │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ embed │ │ test │ │ │
│ │ │ - Generate │ │ - Quick │ │ │
│ │ │ - Strategy │ │ - Full │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │
│ │ │ info │ │ │
│ │ │ - System │ │ │
│ │ │ - Features │ │ │
│ │ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Execution Layer │ │
│ │ - Initialize components │ │
│ │ - Execute operation │ │
│ │ - Handle errors │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Output Formatting │ │
│ │ - JSON output │ │
│ │ - Pretty print │ │
│ │ - File output │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
santok_cli.py
├─ tokenize
│ ├─ --text <text>
│ ├─ --file <path>
│ ├─ --url <url>
│ ├─ --method <method>
│ ├─ --output <path>
│ └─ --format <json|txt>
│
├─ train
│ ├─ --file <corpus>
│ ├─ --model-path <path>
│ ├─ --embedding-dim <dim>
│ ├─ --epochs <n>
│ └─ --enhanced
│
├─ embed
│ ├─ --text <text>
│ ├─ --model-path <path>
│ ├─ --strategy <strategy>
│ └─ --output <path>
│
├─ test
│ └─ --quick
│
└─ info
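Typical invocations built from this option tree (file and model paths are placeholders):

# Tokenize a string and print JSON
python santok_cli.py tokenize --text "Hello World" --method word --format json

# Tokenize a file and write results to disk
python santok_cli.py tokenize --file corpus.txt --method subword --output tokens.json

# Train an enhanced semantic model
python santok_cli.py train --file corpus.txt --model-path models/custom --embedding-dim 768 --epochs 10 --enhanced

# Quick self-test and system info
python santok_cli.py test --quick
python santok_cli.py info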
SanTOK provides integration modules to connect with external systems, adapt vocabularies, and bridge between different components.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Integration Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Integration Modules │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Vocabulary │ │ Source Map │ │ │
│ │ │ Adapter │ │ Integration │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Convert │ │ - Track │ │ │
│ │ │ between │ │ sources │ │ │
│ │ │ systems │ │ - Map tokens │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Cognitive │ │ Vector │ │ │
│ │ │ Pipeline │ │ Bridge │ │ │
│ │ │ │ │ │ │ │
│ │ │ - End-to-end │ │ - Connect │ │ │
│ │ │ processing │ │ to stores │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Token │ │ Embedding │ │ │
│ │ │ Bridge │ │ Bridge │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Link │ │ - Convert │ │ │
│ │ │ systems │ │ formats │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
External System
↓
[Integration Module]
- Receive input
- Convert format
- Validate
↓
[SanTOK Processing]
- Tokenize
- Generate embeddings
- Process
↓
[Output Conversion]
- Convert to external format
- Add metadata
↓
External System
SanTOK includes comprehensive performance optimization features including parallel processing, caching, and efficient algorithms.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Performance Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Performance Strategies │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Parallel │ │ Caching │ │ │
│ │ │ Processing │ │ System │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Threading │ │ - Result │ │ │
│ │ │ - Multiproc │ │ caching │ │ │
│ │ │ - Auto-detect│ │ - Embedding │ │ │
│ │ └──────────────┘ │ cache │ │ │
│ │ ┌──────────────┐ └──────────────┘ │ │
│ │ │ Memory │ ┌──────────────┐ │ │
│ │ │ Optimization │ │ Algorithm │ │ │
│ │ │ │ │ Efficiency │ │ │
│ │ │ - Streaming │ │ │ │ │
│ │ │ - Chunking │ │ - Sparse │ │ │
│ │ │ - Lazy eval │ │ matrices │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Input Text (Large)
↓
[Size Detection]
Check text length
↓
[Threshold Check]
If > 50KB:
→ Use parallel processing
Else:
→ Use sequential processing
↓
[Chunking]
Split text into chunks (50KB each)
↓
[Parallel Execution]
┌─────────────┬─────────────┬─────────────┐
│ Chunk 1 │ Chunk 2 │ Chunk 3 │
│ (Thread 1) │ (Thread 2) │ (Thread 3) │
└──────┬──────┴──────┬──────┴──────┬──────┘
│ │ │
└────────────┼────────────┘
↓
[Result Aggregation]
Merge all results
Maintain order
↓
Final TokenStream
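The same fan-out/fan-in can be reproduced with the standard library alone. This sketch uses a thread pool and a placeholder tokenizer, and preserves chunk order:

from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 50 * 1024                 # 50 KB threshold, as above

def tokenize_chunk(chunk):             # placeholder for the real tokenizer
    return chunk.split()

def tokenize_large(text):
    if len(text) <= CHUNK_SIZE:
        return tokenize_chunk(text)    # sequential path for small inputs
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(tokenize_chunk, chunks)   # map() preserves input order
    return [tok for chunk_tokens in results for tok in chunk_tokens]

A production implementation would additionally snap chunk boundaries to whitespace so no token is cut in half.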
1. Automatic Parallel Processing:
- Detects text size automatically
- Uses threading for I/O-bound tasks
- Uses multiprocessing for CPU-bound tasks
- Optimal worker count based on CPU cores
2. Memory Optimization:
- Streaming processing for large files
- Chunked processing to avoid memory overflow
- Lazy evaluation where possible
- Efficient data structures
3. Caching:
- Tokenization result caching
- Embedding caching
- Vocabulary caching
- Model caching
4. Algorithm Efficiency:
- Sparse matrices for large vocabularies
- Efficient hash-based lookups
- Optimized mathematical operations
- Vectorized operations (NumPy)
| Text Size | Sequential | Threaded | Multiprocess | Speedup (best vs. sequential) |
|---|---|---|---|---|
| 1 KB | 0.001s | 0.002s | 0.005s | 0.5x |
| 10 KB | 0.01s | 0.008s | 0.012s | 1.25x |
| 100 KB | 0.1s | 0.05s | 0.04s | 2.5x |
| 1 MB | 1.0s | 0.3s | 0.2s | 5x |
| 10 MB | 10s | 2s | 1.5s | 6.7x |
SanTOK implements comprehensive error handling and validation to ensure robust operation and prevent failures.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Error Handling Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input/Request │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Input Validation Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Type │→ │ Value │ │ │
│ │ │ Validation │ │ Validation │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Range │→ │ Format │ │ │
│ │ │ Validation │ │ Validation │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Processing Layer │ │
│ │ - Try-catch blocks │ │
│ │ - Graceful degradation │ │
│ │ - Fallback mechanisms │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Error Classification │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Validation │ │ Processing │ │ │
│ │ │ Errors │ │ Errors │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ System │ │ External │ │ │
│ │ │ Errors │ │ Errors │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Error Response │ │
│ │ - User-friendly messages │ │
│ │ - Detailed logs (server-side) │ │
│ │ - Error codes │ │
│ │ - Recovery suggestions │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. Input Validation:
validate_text_input(text) # Type and format check
validate_port(port) # Range validation
validate_file_path(path) # Path validation
validate_tokenization_method()   # Method validation
2. Processing Validation:
- Token count limits
- Memory usage checks
- Timeout handling
- Resource availability
3. Output Validation:
- Result format validation
- Data integrity checks
- Consistency verification
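Illustrative implementations of the input validators named above (sketches of the checks, not SanTOK's actual code):

def validate_text_input(text, max_len=1_000_000):
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    if not text.strip():
        raise ValueError("text must be non-empty")
    if len(text) > max_len:
        raise ValueError(f"text exceeds {max_len} characters")
    return text

def validate_port(port):
    if not isinstance(port, int) or not (1 <= port <= 65535):
        raise ValueError(f"invalid port: {port!r}")
    return port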
Error Occurs
↓
[Error Classification]
- ValidationError
- ProcessingError
- SystemError
- ExternalError
↓
[Error Context]
- Capture stack trace
- Log context
- Identify recovery options
↓
[Error Response]
Production:
- Generic user message
- Detailed server logs
Development:
- Detailed error message
- Stack trace
- Debug information
↓
[Recovery Attempt]
- Fallback methods
- Retry logic
- Graceful degradation
- Information Disclosure Prevention: Detailed errors only in development
- Input Sanitization: All inputs validated and sanitized
- Resource Limits: Prevent DoS attacks
- Authentication: JWT-based security
- CORS Configuration: Configurable origins
This section shows the complete end-to-end data flow through the entire SanTOK system.
┌─────────────────────────────────────────────────────────────┐
│ Complete SanTOK Data Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [1] Input Sources │ │
│ │ - Text string │ │
│ │ - File upload │ │
│ │ - URL fetch │ │
│ │ - API request │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [2] Preprocessing │ │
│ │ - Text normalization │ │
│ │ - Language detection │ │
│ │ - Encoding detection │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [3] Tokenization (9 methods in parallel) │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │Space │ │Word │ │Char │ │Gram │ │ │
│ │ └──────┘ └──────┘ └──────┘ └──────┘ │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │Subw │ │BPE │ │Syll │ │Freq │ │ │
│ │ └──────┘ └──────┘ └──────┘ └──────┘ │ │
│ │ ┌──────┐ │ │
│ │ │Byte │ │ │
│ │ └──────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [4] Mathematical Enrichment │ │
│ │ - UID assignment │ │
│ │ - Frontend digits │ │
│ │ - Backend numbers │ │
│ │ - Global IDs │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [5] Branch Point │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Embedding │ │ Cognitive │ │ │
│ │ │ Path │ │ Reasoning │ │ │
│ │ │ │ │ Path │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ │ │
│ │ │ │ │ │
│ │ ↓ ↓ │ │
│ │ [5a] Embedding [5b] Knowledge │ │
│ │ Generation Graph Building │ │
│ │ │ │ │ │
│ │ ↓ ↓ │ │
│ │ [5c] Vector Store [5d] Reasoning │ │
│ │ │ │ │ │
│ │ └────────┬────────┘ │ │
│ │ ↓ │ │
│ │ [6] Integration │ │
│ └─────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [7] Output Generation │ │
│ │ - Formatted results │ │
│ │ - Metadata │ │
│ │ - Explanations │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [8] Response Delivery │ │
│ │ - JSON response │ │
│ │ - File download │ │
│ │ - WebSocket stream │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. INPUT
Text: "Hello World"
Source: API/File/URL
↓
2. PREPROCESSING
normalize_case() → "hello world"
normalize_whitespace() → "hello world"
detect_language() → "en"
↓
3. TOKENIZATION (Parallel)
Space: ["hello", "world"]
Word: ["hello", "world"]
Char: ["h", "e", "l", ...]
Grammar: ["hello", ",", "world"]
Subword: ["hel", "lo", "wor", "ld"]
... (9 methods)
↓
4. MATHEMATICAL ANALYSIS
For each token:
- Generate UID (Xorshift64*)
- Calculate frontend digit (9-centric)
- Compose backend number
- Assign global ID
- Link neighbors
↓
5. EMBEDDING GENERATION
Strategy: feature_based/semantic/hash/hybrid
Extract features → Generate vector (768-dim)
↓
6. STORAGE/REASONING
Option A: Vector Store
- Add to ChromaDB/FAISS/Weaviate
- Index for search
Option B: Cognitive Reasoning
- Add to knowledge graph
- Build relations
- Enable reasoning
↓
7. OUTPUT FORMATTING
{
"tokens": [...],
"embeddings": [...],
"metadata": {...},
"reasoning": {...}
}
↓
8. RESPONSE
JSON/File/Stream
SanTOK supports multiple deployment scenarios from local development to cloud production.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Deployment Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Deployment Options │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Local │ │ Cloud │ │ │
│ │ │ Development │ │ Production │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Python │ │ - Railway │ │ │
│ │ │ script │ │ - Heroku │ │ │
│ │ │ - CLI │ │ - AWS │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Docker │ │ Kubernetes │ │ │
│ │ │ Container │ │ Cluster │ │ │
│ │ │ │ │ │ │ │
│  │  │ - compose    │  │ - Auto-scale │                    │ │
│ │ │ - Dockerfile │ │ - Load bal. │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. Local Development:
python run.py
↓
[Development Server]
- Hot reload
- Debug mode
- Local storage
- Port: 8000
2. Production (Railway/Heroku):
Procfile: web: python start.py
↓
[Platform Detection]
- Auto-detect Python
- Set PORT from env
- Configure logging
↓
[Production Server]
- Optimized settings
- Error handling
- Logging
- Health checks
3. Docker Deployment:
docker-compose up
↓
[Docker Container]
- Isolated environment
- Volume mounts
- Network config
- Environment variables
4. Kubernetes:
kubectl apply -f k8s/
↓
[K8s Cluster]
- Pods
- Services
- Ingress
- Auto-scaling
Development:
- DEBUG=True
- LOG_LEVEL=DEBUG
- CORS_ORIGINS=*
- PORT=8000
Production:
- DEBUG=False
- LOG_LEVEL=INFO
- CORS_ORIGINS=https://yourdomain.com
- PORT=${PORT}
- WEAVIATE_URL=${WEAVIATE_URL}
- WEAVIATE_API_KEY=${WEAVIATE_API_KEY}
SanTOK implements multiple security layers to protect the system and user data.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Security Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Security Layers │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Input │→ │ Authentication│ │ │
│ │ │ Validation │ │ (JWT) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Sanitization │→ │ Authorization│ │ │
│ │ │ │ │ (Roles) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Rate │→ │ Error │ │ │
│ │ │ Limiting │ │ Masking │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. Authentication:
- JWT-based token authentication
- Token expiration
- Secure token storage
- Refresh tokens
2. Input Validation:
- Type checking
- Range validation
- Format validation
- Sanitization
3. Resource Protection:
- Rate limiting
- File size limits
- Memory limits
- Timeout protection
4. Error Security:
- No information disclosure in production
- Detailed errors only in development
- Secure logging
- Error masking
5. CORS Configuration:
- Configurable origins
- Production restrictions
- Development flexibility
SanTOK includes comprehensive monitoring and logging for observability and debugging.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Monitoring Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Logging System │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Application │ │ Error │ │ │
│ │ │ Logs │ │ Logs │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Performance │ │ Access │ │ │
│ │ │ Logs │ │ Logs │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Metrics Collection │ │
│ │ - Request count │ │
│ │ - Response times │ │
│ │ - Error rates │ │
│ │ - Resource usage │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
- DEBUG: Detailed debugging information
- INFO: General informational messages
- WARNING: Warning messages
- ERROR: Error messages
- CRITICAL: Critical errors
- GET /api/v1/health - Basic health check
- GET /api/v1/info - System information
- GET /api/v1/metrics - Performance metrics
SanTOK includes text compression algorithms based on mathematical properties and 9-centric numerology.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Compression Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input Text │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Compression Strategies │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Numerology │ │ Weighted │ │ │
│ │ │ Based │ │ Sum Based │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Digital Root │ │ Backend │ │ │
│ │ │ Folding │ │ Number │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Compression Process │ │
│ │ 1. Calculate numerology values │ │
│ │ 2. Compute weighted character sums │ │
│ │ 3. Apply digital root folding │ │
│ │ 4. Generate compressed representation │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ Compressed Output │
│ │
└─────────────────────────────────────────────────────────────┘
1. Numerology-Based Compression:
Text: "Hello"
↓
[Character Numerology]
H → 8 (letter position folded to 1-9: ((position - 1) % 9) + 1)
e → 5
l → 3
l → 3
o → 6
↓
[Sum Calculation]
Total: 8 + 5 + 3 + 3 + 6 = 25
↓
[Digital Root]
dr(25) = 1 + ((25-1) mod 9) = 7
↓
Compressed: 7
2. Weighted Sum Compression:
Text: "Hello"
↓
[Weighted Sum]
H: ord('H') × 1 = 72 × 1 = 72
e: ord('e') × 2 = 101 × 2 = 202
l: ord('l') × 3 = 108 × 3 = 324
l: ord('l') × 4 = 108 × 4 = 432
o: ord('o') × 5 = 111 × 5 = 555
↓
Total: 72 + 202 + 324 + 432 + 555 = 1585
↓
[Digital Root]
dr(1585) = 1 + ((1585-1) mod 9) = 1
↓
Compressed: 1
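Both worked examples follow directly from the digital-root formula dr(n) = 1 + ((n - 1) mod 9); a compact sketch:

def digital_root(n):
    return 1 + (n - 1) % 9 if n > 0 else 0

def numerology_compress(text):
    # letter position folded into 1-9: a→1 ... i→9, j→1, ...
    total = sum(1 + (ord(c.lower()) - ord('a')) % 9 for c in text if c.isalpha())
    return digital_root(total)

def weighted_sum_compress(text):
    total = sum(ord(c) * i for i, c in enumerate(text, start=1))
    return digital_root(total)

print(numerology_compress("Hello"))    # 7  (8+5+3+3+6 = 25 → dr 7)
print(weighted_sum_compress("Hello"))  # 1  (1585 → dr 1)

Note that both strategies produce a single-digit digest rather than a reversible encoding; they are useful as compact fingerprints, not for lossless decompression.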
SanTOK's source map integration tracks the provenance of tokens and embeddings, enabling source-aware processing and multi-source merging.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Source Map Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Source Map Structure │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Source │ │ Algorithm │ │ │
│ │ │ Metadata │ │ Mapping │ │ │
│ │ │ │ │ │ │ │
│ │ │ - source_id │ │ - algorithm │ │ │
│ │ │ - source_tag │ │ → tokens │ │ │
│ │ │ - timestamp │ │ - tokens → │ │ │
│ │ │ - metadata │ │ source │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Source-Aware Tokenization │ │
│ │ - Tag tokens with source │ │
│ │ - Track algorithm used │ │
│ │ - Maintain provenance │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Source-Aware Embedding Generation │ │
│ │ - Embeddings linked to source │ │
│ │ - Source metadata in embeddings │ │
│ │ - Multi-source merging │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Input Text + Source Tag
↓
[Source Map Lookup]
Get source metadata:
- source_id
- source_tag (e.g., "wikipedia", "arxiv")
- algorithm_id
↓
[Tokenization with Source]
Tokenize text
Tag each token with:
- source_id
- source_tag
- algorithm_id
↓
[Embedding Generation with Source]
Generate embeddings
Link embeddings to source
Add source metadata
↓
[Source-Aware Storage]
Store with source tags
Enable source-based queries
↓
[Multi-Source Merging]
Merge embeddings from multiple sources
Combine metadata
Weighted combination
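A minimal sketch of the tagging step using plain dictionaries (the field names mirror the flow above; SanTOK's actual record types may differ):

def tag_tokens(tokens, source_id, source_tag, algorithm_id):
    return [{"text": tok,
             "source_id": source_id,
             "source_tag": source_tag,       # e.g. "wikipedia", "arxiv"
             "algorithm_id": algorithm_id}
            for tok in tokens]

tagged = tag_tokens(["hello", "world"], source_id=42,
                    source_tag="wikipedia", algorithm_id="word")
# Later: filter or merge by source
wiki_tokens = [t for t in tagged if t["source_tag"] == "wikipedia"]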
SanTOK's data interpretation system uses embeddings and vector stores to provide real-time insights and interpretations of data.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Data Interpretation Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input Text: "Sales dropped 20% last month" │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Token Extraction │ │
│ │ - Tokenize input │ │
│ │ - Extract key tokens │ │
│ │ - Identify important terms │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Embedding Generation │ │
│ │ - Generate embeddings for tokens │ │
│ │ - Create query embedding │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Vector Search │ │
│ │ - Search in Weaviate/ChromaDB/FAISS │ │
│ │ - Find related concepts │ │
│ │ - Retrieve top-k results │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Interpretation Generation │ │
│ │ - Analyze relationships │ │
│ │ - Generate insights │ │
│ │ - Provide recommendations │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ Output: "Analyze customer behavior and marketing changes" │
│ │
└─────────────────────────────────────────────────────────────┘
Input: "Sales dropped 20% last month"
↓
[Token Extraction]
Key tokens: ["Sales", "dropped", "20%", "last", "month"]
↓
[Embedding Generation]
Generate embeddings for each token
Create combined query embedding
↓
[Vector Search]
Search in knowledge base:
- Find "Sales" related concepts
- Find "dropped" related concepts
- Find "20%" related concepts
↓
[Concept Retrieval]
Related concepts:
- "customer behavior"
- "marketing changes"
- "trend analysis"
- "improvement strategies"
↓
[Interpretation Generation]
Combine concepts:
"Analyze customer behavior and
marketing changes to find the cause."
↓
Output with confidence scores
SanTOK includes several custom algorithms for ranking, scoring, similarity, and graph operations.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Custom Algorithms Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Algorithm Categories │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Ranking │ │ Scoring │ │ │
│ │ │ Algorithms │ │ Algorithms │ │ │
│ │ │ │ │ │ │ │
│ │ │ - SanTOK │ │ - 9-Scorer │ │ │
│ │ │ Ranker │ │ - Confidence │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Similarity │ │ Graph │ │ │
│ │ │ Algorithms │ │ Algorithms │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Semantic │ │ - Graph │ │ │
│ │ │ Similarity │ │ Walker │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Pattern │ │ Query │ │ │
│ │ │ Matching │ │ Parsing │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Pattern │ │ - NL → Query │ │ │
│ │ │ Matcher │ │ Parser │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Query + Candidates
↓
[Component Score Calculation]
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Relevance │ Connectivity │ Hierarchy │ Freshness │
│ │ │ │ │
│ - Token │ - Graph │ - Tree │ - Temporal │
│ overlap │ centrality │ depth │ decay │
│ - Position │ - Relation │ - Sibling │ - Access │
│ boost │ strength │ penalty │ frequency │
│ - Digital │ - Path │ - Parent │ - Mod time │
│ root │ distance │ inheritance│ │
└──────────────┴──────────────┴──────────────┴──────────────┘
↓
[Weighted Combination]
score = α·Relevance + β·Connectivity + γ·Hierarchy + δ·Freshness
(default: α=0.4, β=0.3, γ=0.2, δ=0.1)
↓
[9-Centric Folding]
Apply digital root transformation
↓
Ranked Results
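The weighted combination and the folding step reduce to a few lines; the component scorers are stubbed out here, and the folding variant shown is one plausible reading of the diagram:

WEIGHTS = {"relevance": 0.4, "connectivity": 0.3, "hierarchy": 0.2, "freshness": 0.1}

def digital_root(n):
    return 1 + (n - 1) % 9 if n > 0 else 0

def rank(candidates, scorers):
    """scorers: {component_name: callable(candidate) -> score in [0, 1]}."""
    scored = []
    for cand in candidates:
        score = sum(WEIGHTS[name] * scorers[name](cand) for name in WEIGHTS)
        folded = digital_root(round(score * 100))    # 9-centric folding (assumed variant)
        scored.append((cand, score, folded))
    return sorted(scored, key=lambda item: item[1], reverse=True)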
Input Value
↓
[9-Centric Calculation]
Apply digital root: dr(n) = 1 + ((n-1) mod 9)
↓
[Score Normalization]
Map to [0, 1] range
↓
[Confidence Score]
Final score (0-1)
Token A + Token B
↓
[Feature Extraction]
Extract features from both tokens
↓
[Similarity Calculation]
- Character overlap
- UID distance
- Frontend digit similarity
- Backend number proximity
↓
[Combined Similarity]
Weighted combination of features
↓
Similarity Score (0-1)
Start Node
↓
[Energy-Based Traversal]
- Calculate node energy
- Follow high-energy paths
- Avoid low-energy nodes
↓
[Path Exploration]
- BFS/DFS traversal
- Depth limits
- Energy thresholds
↓
[Path Ranking]
Rank paths by:
- Total energy
- Path length
- Relation strength
↓
Top-K Paths
Input Text
↓
[Pattern Library]
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Lexical │ Structural │ Copula │ Possessive │
│ Patterns │ Patterns │ Patterns │ Patterns │
│ │ │ │ │
│ - IS_A │ - Position │ - "X is Y" │ - "X's Y" │
│ - PART_OF │ - Distance │ - "X are Y" │ - "Y of X" │
│ - HAS_PART │ - Context │ │ │
│ - CAUSES │ │ │ │
│ - USES │ │ │ │
└──────────────┴──────────────┴──────────────┴──────────────┘
↓
[Pattern Matching]
- Apply regex patterns
- Extract subject/object
- Identify relation type
- Calculate confidence
↓
[9-Centric Scoring]
Apply digital root to confidence
↓
[Relation Extraction]
Output: (subject, relation, object, confidence)
Pattern Types:
- Lexical Patterns: Word-based regex patterns (e.g., "X is Y" → IS_A)
- Structural Patterns: Position-based extraction
- Copula Patterns: "X is Y", "X are Y" → IS_A relation
- Possessive Patterns: "X's Y", "Y of X" → PART_OF/HAS_PART
- Causal Patterns: "X causes Y", "because of X" → CAUSES
- Temporal Patterns: "X before Y", "after X" → TEMPORAL
Example:
matcher = SanTOKPatternMatcher()
text = "Python is a programming language. It uses dynamic typing."
matches = matcher.extract(text)
# Output: [
# (Python, IS_A, programming language, 0.9),
# (Python, USES, dynamic typing, 0.8)
# ]
Natural Language Query
↓
[Query Type Detection]
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Definition │ Relation │ List │ Boolean │
│ "What is X?" │ "How X→Y?" │ "Parts of X" │ "Is X a Y?" │
└──────────────┴──────────────┴──────────────┴──────────────┘
┌──────────────┬──────────────┬──────────────┐
│ Comparison │ Process │ Count │
│ "X vs Y?" │ "How X works"│ "How many X?"│
└──────────────┴──────────────┴──────────────┘
↓
[Entity Extraction]
- Extract subject
- Extract object (if relation query)
- Extract modifiers (negation, quantifiers)
↓
[Query Structure]
{
"type": "definition|relation|list|...",
"subject": "extracted entity",
"object": "extracted entity (optional)",
"negated": false,
"quantifier": null,
"confidence": 0.95
}
↓
[Structured Query]
Ready for execution against knowledge base
Supported Query Types:
- Definition: "What is X?", "Define X", "Tell me about X"
- Relation: "How is X related to Y?", "What's the relationship between X and Y?"
- List: "What are the parts of X?", "List all X", "What does X contain?"
- Boolean: "Is X a Y?", "Does X have Y?"
- Comparison: "What's the difference between X and Y?", "Compare X and Y"
- Process: "How does X work?", "Explain how X operates"
- Count: "How many X?", "Count the number of Y"
- Cause: "Why does X happen?", "What causes Y?"
Example:
parser = SanTOKQueryParser()
query = parser.parse("What is machine learning?")
# Output:
# type: DEFINITION
# subject: "machine learning"
# confidence: 0.95
Text A + Text B
↓
[Tokenization]
Tokenize both texts
↓
[Multi-Component Analysis]
┌─────────────────────────────────────────────────────────┐
│ Component 1: Lexical Similarity │
│ - Jaccard coefficient (token overlap) │
│ - Dice coefficient │
│ - Common tokens identification │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Component 2: N-gram Similarity │
│ - Character n-gram extraction (default: trigrams) │
│ - N-gram overlap calculation │
│ - Position-aware matching │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Component 3: Position-Weighted Similarity │
│ - Token position matching │
│ - Order preservation scoring │
│ - Distance-based weighting │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Component 4: Graph-Based Similarity (Optional) │
│ - Path distance in knowledge graph │
│ - Relation strength │
│ - Common neighbors │
└─────────────────────────────────────────────────────────┘
↓
[Weighted Combination]
score = α·Lexical + β·Ngram + γ·Position + δ·Graph
(default: α=0.35, β=0.25, γ=0.20, δ=0.20)
↓
[9-Centric Harmonization]
Apply digital root transformation
Normalize to [0, 1] range
↓
SimilarityResult
- Combined score (0-1)
- Digital root (1-9)
- Component breakdown
- Common tokens/ngrams
Similarity Formula:
sim(a, b) = α·Jaccard(a, b) + β·Ngram(a, b) + γ·Position(a, b) + δ·Graph(a, b)
Where:
- Jaccard(a, b) = |A ∩ B| / |A ∪ B|
- Ngram(a, b) = |N-grams(a) ∩ N-grams(b)| / |N-grams(a) ∪ N-grams(b)|
- Position(a, b) = Weighted position alignment score
- Graph(a, b) = Path distance / max_path_distance (if graph available)
Example:
similarity = SanTOKSimilarity(graph=knowledge_graph)
result = similarity.compute("machine learning", "deep learning")
# Output:
# score: 0.67
# digital_root: 4
# lexical_score: 0.60
# ngram_score: 0.75
# position_score: 0.55
# graph_score: 0.70
SanTOK includes comprehensive configuration management and utility systems for system-wide settings and operations.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Configuration Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Configuration Sources │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Environment │ │ Config Files │ │ │
│ │ │ Variables │ │ (.env, yaml) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Default │ │ Runtime │ │ │
│ │ │ Values │ │ Overrides │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Configuration Manager │ │
│ │ - Load configurations │ │
│ │ - Merge sources │ │
│ │ - Validate settings │ │
│ │ - Provide defaults │ │
│ └──────────────────┬──────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Configuration Categories │ │
│ │ - Server settings (port, host, CORS) │ │
│ │ - Tokenization settings (seed, methods) │ │
│ │ - Embedding settings (dim, strategy) │ │
│ │ - Vector store settings (backend, connection) │ │
│ │ - Logging settings (level, format) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. Runtime Overrides (highest priority)
- Command-line arguments
- Function parameters
↓
2. Environment Variables
- PORT, LOG_LEVEL, etc.
↓
3. Config Files
- .env file
- config.yaml
↓
4. Default Values (lowest priority)
- Built-in defaults
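The precedence chain maps onto a layered lookup; a standard-library sketch (file parsing elided):

import os

DEFAULTS = {"PORT": 8000, "LOG_LEVEL": "INFO"}    # 4. built-in defaults
FILE_CONFIG = {}                                  # 3. loaded from .env / config.yaml

def get_setting(key, override=None):
    if override is not None:                      # 1. runtime override wins
        return override
    if key in os.environ:                         # 2. environment variable
        return os.environ[key]
    return FILE_CONFIG.get(key, DEFAULTS.get(key))  # 3 → 4

port = int(get_setting("PORT"))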
1. Validation Utilities:
- Input type validation
- Range validation
- Format validation
- Path validation
2. Logging Utilities:
- Structured logging
- Log levels
- File/console output
- Log rotation
3. Unique Identifier Utilities:
- UID generation (Xorshift64*)
- ID management
- Collision detection
4. Formatting Utilities:
- Output formatting
- Data serialization
- Pretty printing
SanTOK implements efficient memory management for handling large datasets and long-running processes.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Memory Management Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Memory Strategies │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Streaming │ │ Chunking │ │ │
│ │ │ Processing │ │ Strategy │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Process │ │ - Split into │ │ │
│ │ │ in chunks │ │ chunks │ │ │
│ │ │ - Release │ │ - Process │ │ │
│ │ │ memory │ │ separately │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Caching │ │ Lazy │ │ │
│ │ │ Strategy │ │ Evaluation │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Result │ │ - Compute │ │ │
│ │ │ caching │ │ on demand │ │ │
│ │ │ - Embedding │ │ - Defer │ │ │
│ │ │ caching │ │ computation│ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
1. Streaming Processing:
- Process data in streams
- Release memory after each chunk
- Avoid loading entire dataset
2. Chunking:
- Split large texts into chunks
- Process chunks independently
- Aggregate results
3. Caching:
- Cache frequently used results
- LRU eviction policy
- Memory-bounded cache
4. Lazy Evaluation:
- Compute only when needed
- Defer expensive operations
- Generator-based processing
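The memory-bounded cache described above is a classic LRU; a self-contained sketch on OrderedDict:

from collections import OrderedDict

class LRUCache:
    def __init__(self, max_items=10_000):
        self._data = OrderedDict()
        self._max = max_items

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)   # evict least recently used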
Core Tokenization Engine - The foundation of all text processing.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Core Tokenization Engine │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input Text │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Preprocessing Layer │ │
│ │ - Case normalization │ │
│ │ - Punctuation handling │ │
│ │ - Whitespace normalization │ │
│ │ - Language detection │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Tokenization Methods (9 types) │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │Space │ │ Word │ │ Char │ │ │
│ │ └──────┘ └──────┘ └──────┘ │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│  │  │ Gram │  │ Subw │  │ Byte │    │          │
│ │ └──────┘ └──────┘ └──────┘ │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│  │  │ BPE  │  │ Syll │  │ Freq │    │          │
│ │ └──────┘ └──────┘ └──────┘ │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Mathematical Analysis Layer │ │
│ │ - UID Generation (Xorshift64*) │ │
│ │ - Frontend Digit Calculation │ │
│ │ - Backend Number Composition │ │
│ │ - Global ID Assignment │ │
│ │ - Digital Root Computation │ │
│ │ - Neighbor UID Linking │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Statistical Features │ │
│ │ - Length Factor │ │
│ │ - Balance Index │ │
│ │ - Entropy Index │ │
│ │ - Mean & Variance │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ TokenStream Objects (with TokenRecord instances) │
│ │
└─────────────────────────────────────────────────────────────┘
Token Processing Pipeline:
Text Input
↓
[Preprocessing]
├─ normalize_case()
├─ remove_punctuation()
├─ normalize_whitespace()
└─ detect_language()
↓
[Tokenization] (Parallel execution for 9 methods)
├─ tokenize_space() → Space tokens
├─ tokenize_word() → Word tokens
├─ tokenize_char() → Character tokens
├─ tokenize_grammar() → Grammar tokens
├─ tokenize_subword() → Subword tokens
├─ tokenize_subword_bpe() → BPE tokens
├─ tokenize_subword_syllable() → Syllable tokens
├─ tokenize_subword_frequency() → Frequency tokens
└─ tokenize_bytes() → Byte tokens
↓
[UID Assignment]
├─ assign_uids(seed) → Xorshift64* based UIDs
└─ neighbor_uids() → Link prev/next UIDs
↓
[Mathematical Properties]
├─ frontend_digit → 9-centric digit (1-9)
├─ backend_number → Composite number
├─ global_id → Unique global identifier
└─ content_id → Content-based ID
↓
[TokenStream Creation]
└─ TokenStream with TokenRecord objects
Key Classes:
- TextTokenizer - Main orchestrator
- TokenStream - Container for tokenized results
- TokenRecord - Individual token with all properties
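Putting the pipeline together (a sketch: the build() entry point follows the training section below, but attribute names like stream.tokens are assumptions about the installed API):

# Sketch only: verify names against your installed version.
from santok import TextTokenizer

tokenizer = TextTokenizer(seed=42)           # fixed seed → deterministic Xorshift64* UIDs
stream = tokenizer.build("Hello World")      # → TokenStream
for record in stream.tokens:                 # → TokenRecord instances (assumed attribute)
    print(record.text, record.uid, record.frontend_digit)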
Cognitive Reasoning System - Deterministic reasoning substrate.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Cognitive System │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Knowledge Storage Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ GraphStore │ │ TreeStore │ │ │
│ │ │ │ │ │ │ │
│ │ │ - Nodes │ │ - Root Nodes │ │ │
│ │ │ - Edges │ │ - Children │ │ │
│ │ │ - Relations │ │ - Hierarchy │ │ │
│ │ │ (15+ types) │ │ - Taxonomies │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ │ │
│ │ │ │ │ │
│ │ └──────────┬──────────┘ │ │
│ │ ↓ │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ UnifiedMemory │ │ │
│ │ │ - MemoryObjects │ │ │
│ │ │ - Graph linking │ │ │
│ │ │ - Auto-relations │ │ │
│ │ └──────────┬───────────┘ │ │
│ └────────────────────┼─────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Reasoning Engine │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Inference │ │ Query Engine │ │ │
│ │ │ Engine │ │ │ │ │
│ │ │ │ │ - Parsing │ │ │
│ │ │ - 20+ Rules │ │ - Execution │ │ │
│ │ │ - Chaining │ │ - Results │ │ │
│ │ │ - Validation │ └──────┬───────┘ │ │
│ │ └──────┬───────┘ │ │ │
│ │ │ │ │ │
│ │ └──────────┬──────────┘ │ │
│ │ ↓ │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ PathFinder │ │ │
│ │ │ - Graph traversal │ │ │
│ │ │ - Path discovery │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ ↓ │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ Contradiction │ │ │
│ │ │ Detector │ │ │
│ │ └──────────┬───────────┘ │ │
│ └────────────────────┼─────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Explanation Layer │ │
│ │ - Reasoning traces │ │
│ │ - Confidence scores │ │
│ │ - Source attribution │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Knowledge Graph Structure:
GraphNode
├─ id: str
├─ label: str
├─ properties: dict
└─ edges: List[GraphEdge]
GraphEdge
├─ source: str (node id)
├─ target: str (node id)
├─ relation: RelationType (15+ types)
│ ├─ IS_A
│ ├─ PART_OF
│ ├─ CAUSES
│ ├─ USES
│ ├─ LOCATED_IN
│ └─ ... (10+ more)
└─ confidence: float (0-1)
Reasoning Flow:
Query Input
↓
[Query Parser]
└─ Parse natural language → Structured query
↓
[Query Engine]
├─ Find relevant nodes in graph
├─ Extract relations
└─ Build query plan
↓
[Inference Engine]
├─ Apply inference rules (20+)
│ ├─ Transitivity
│ ├─ Inheritance
│ ├─ Symmetry
│ ├─ Inverse
│ └─ ... (16+ more)
├─ Rule chaining
└─ Confidence propagation
↓
[Path Finder]
└─ Find reasoning paths
↓
[Contradiction Detector]
└─ Validate consistency
↓
[Explainer]
├─ Generate reasoning trace
├─ Calculate confidence
└─ Format explanation
↓
Answer with full trace
Semantic Embedding Generation - Multiple strategies for vector generation.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Embedding System │
├─────────────────────────────────────────────────────────────┤
│ │
│ TokenRecord Input │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Strategy Selection │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Feature │ │ Semantic │ │ │
│ │ │ Based │ │ (Trained)│ │ │
│ │ └──────────┘ └──────────┘ │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Hash │ │ Hybrid │ │ │
│ │ │ Based │ │ (Combined)│ │ │
│ │ └──────────┘ └──────────┘ │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Feature Extraction │ │
│ │ - UID (64-bit → 8 bytes) │ │
│ │ - Frontend digit (1-9) │ │
│ │ - Backend number │ │
│ │ - Global ID │ │
│ │ - Text length │ │
│ │ - Character frequencies │ │
│ │ - Stream type (one-hot) │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Embedding Generation │ │
│ │ [Feature-based] │ │
│ │ └─ Direct feature → vector │ │
│ │ [Semantic] │ │
│ │ └─ Trained model lookup │ │
│ │ [Hash-based] │ │
│ │ └─ Hash → normalized vector │ │
│ │ [Hybrid] │ │
│ │ ├─ Text embedding (optional) │ │
│ │ └─ Feature embedding │ │
│ │ └─ Weighted combination │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Dimension Projection │ │
│ │ - Project to target dimension │ │
│ │ - Normalize vector │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ Embedding Vector (float32 array) │
│ │
└─────────────────────────────────────────────────────────────┘
Embedding Strategies:
Feature-Based Strategy:
TokenRecord
↓
Extract Features:
- UID bytes (8 floats)
- Frontend digit (1 float)
- Backend number (1 float)
- Global ID (1 float)
- Text length (1 float)
- Stream type (9 floats, one-hot)
- Character stats (N floats)
↓
Concatenate → Feature vector
↓
Project to embedding_dim (768 default)
↓
Normalize
↓
Embedding
Semantic Strategy:
TokenRecord
↓
Lookup UID in trained model
↓
Retrieve learned embedding
↓
Embedding
Hash-Based Strategy:
TokenRecord
↓
Hash text + UID
↓
Convert to vector
↓
Normalize
↓
Embedding
Hybrid Strategy:
TokenRecord
↓
┌─────────────┬─────────────┐
│ Text Embed │ Feature Emb │
│ (optional) │ (always) │
└──────┬──────┴──────┬──────┘
│ │
└─────┬──────┘
↓
Weighted Combination
↓
Embedding
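The feature-based path is concatenate → project → normalize; a NumPy sketch (the fixed random projection is one simple way to reach the target dimension deterministically; SanTOK's actual projection may differ):

import numpy as np

def feature_embedding(uid, frontend_digit, backend_number, global_id,
                      text, stream_index, dim=768, seed=0):
    uid_bytes = np.frombuffer(np.uint64(uid).tobytes(), dtype=np.uint8)  # UID → 8 bytes
    one_hot = np.zeros(9)
    one_hot[stream_index] = 1.0                      # stream type (one-hot, 9 floats)
    features = np.concatenate([
        uid_bytes.astype(np.float64),
        [frontend_digit, backend_number % 1_000_000, global_id, len(text)],
        one_hot,
    ])
    rng = np.random.default_rng(seed)                # fixed seed → deterministic projection
    projection = rng.standard_normal((features.size, dim))
    vec = features @ projection                      # project to target dimension
    return (vec / np.linalg.norm(vec)).astype(np.float32)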
Semantic Model Training - Train custom embeddings on your corpus.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Training System │
├─────────────────────────────────────────────────────────────┤
│ │
│ Training Corpus │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Tokenization Phase │ │
│ │ - TextTokenizer.build() │ │
│ │ - Multiple streams │ │
│ │ - TokenRecord creation │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Vocabulary Building │ │
│ │ - Collect unique tokens │ │
│ │ - Build token → index mapping │ │
│ │ - Calculate frequencies │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Co-occurrence Matrix │ │
│ │ - Build context windows │ │
│ │ - Count co-occurrences │ │
│ │ - Create sparse matrix │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Training Loop │ │
│ │ For each epoch: │ │
│ │ - Sample training pairs │ │
│ │ - Forward pass │ │
│ │ - Calculate loss │ │
│ │ - Backward pass │ │
│ │ - Update embeddings │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Model Saving │ │
│ │ - Save embeddings │ │
│ │ - Save vocabulary │ │
│ │ - Save metadata │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
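The co-occurrence phase is conceptually just windowed counting; a minimal sketch:

from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (center, context) pairs within a symmetric window."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts

print(cooccurrence_counts("the cat sat on the mat".split())[("cat", "sat")])  # 1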
Enhanced Training (Multi-Stream):
Multiple Token Streams
├─ char stream
├─ subword stream
└─ word stream
↓
[Multi-Stream Learning]
├─ Learn at all granularities
├─ Cross-stream alignment
└─ Hierarchical semantics
↓
[Temporal Awareness]
├─ Position-dependent embeddings
└─ Sequence modeling
↓
[Content-ID Clustering]
├─ Deterministic grouping
└─ Semantic clusters
↓
[Mathematical Properties]
├─ Frontend/backend integration
└─ UID-based relationships
↓
Enhanced Embeddings
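For the temporal-awareness stage, one standard way to make embeddings position-dependent is sinusoidal position encoding. The sketch below shows that general technique; it is not necessarily the encoding the enhanced trainer uses.
import numpy as np
def positional_encoding(position, dim=768):
    # Alternate sine and cosine at geometrically spaced frequencies
    enc = np.zeros(dim, dtype=np.float32)
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc[i] = np.sin(angle)
        if i + 1 < dim:
            enc[i + 1] = np.cos(angle)
    return enc
# A position-dependent embedding is then token_vec + positional_encoding(index, token_vec.size)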
FastAPI Server - Production-ready RESTful API.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK API Server Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Request Layer │ │
│ │ - HTTP Requests (REST) │ │
│ │ - WebSocket Connections │ │
│ │ - File Uploads │ │
│ └──────────────┬────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Middleware Layer │ │
│ │ - CORS handling │ │
│ │ - Authentication (JWT) │ │
│ │ - Request validation │ │
│ │ - Error handling │ │
│ └──────────────┬────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Route Handlers │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Tokenize │ │ Embed │ │ Train │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Upload │ │ Search │ │ Jobs │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └──────────────┬────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Processing Layer │ │
│ │ - TextTokenizer │ │
│ │ - EmbeddingGenerator │ │
│ │ - VectorStore │ │
│ │ - JobManager (async) │ │
│ └──────────────┬────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Response Layer │ │
│ │ - JSON responses │ │
│ │ - Streaming responses │ │
│ │ - WebSocket messages │ │
│ │ - File downloads │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
API Endpoint Structure:
/api/v1/
├─ POST /tokenize
│ └─ Text → Tokens
├─ POST /embed
│ └─ Text → Embeddings
├─ POST /train
│ └─ Corpus → Model
├─ POST /upload
│ └─ File → Processing
├─ GET /search
│ └─ Query → Results
├─ GET /jobs/{id}
│ └─ Job status
└─ WebSocket /ws
└─ Real-time streaming
Vector Database Integration - Multiple backend support.
┌─────────────────────────────────────────────────────────────┐
│ SanTOK Vector Store Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Embeddings + Metadata │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Vector Store Interface │ │
│ │ (Abstract base) │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Backend Selection │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │ ChromaDB │ │ FAISS │ │ │
│ │ └──────────┘ └──────────┘ │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Weaviate │ │ In-Memory│ │ │
│ │ └──────────┘ └──────────┘ │ │
│ └──────────────┬──────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Storage Operations │ │
│ │ - add(embedding, metadata) │ │
│ │ - search(query, top_k) │ │
│ │ - get(id) │ │
│ │ - delete(id) │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
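As a reference point for the interface above, here is a toy in-memory backend implementing the four storage operations with cosine similarity. It is a sketch for illustration, not the shipped In-Memory backend.
import numpy as np
class InMemoryVectorStore:
    def __init__(self):
        self._vectors = {}   # id -> unit-length vector
        self._metadata = {}  # id -> metadata dict
        self._next_id = 0
    def add(self, embedding, metadata=None):
        vec = np.asarray(embedding, dtype=np.float32)
        vec = vec / (np.linalg.norm(vec) + 1e-12)  # store normalized
        doc_id, self._next_id = self._next_id, self._next_id + 1
        self._vectors[doc_id] = vec
        self._metadata[doc_id] = metadata or {}
        return doc_id
    def search(self, query, top_k=10):
        q = np.asarray(query, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-12)
        # On normalized vectors the dot product is the cosine similarity
        scored = sorted(((float(v @ q), i) for i, v in self._vectors.items()), reverse=True)
        return [{"id": i, "score": s, "metadata": self._metadata[i]} for s, i in scored[:top_k]]
    def get(self, doc_id):
        return self._vectors.get(doc_id), self._metadata.get(doc_id)
    def delete(self, doc_id):
        self._vectors.pop(doc_id, None)
        self._metadata.pop(doc_id, None)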
End-to-End Processing Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ Complete SanTOK Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input: Text/String │
│ ↓ │
│ [1] Preprocessing │
│ ├─ Normalize case │
│ ├─ Clean punctuation │
│ └─ Detect language │
│ ↓ │
│ [2] Tokenization (9 methods in parallel) │
│ ├─ Space, Word, Char │
│ ├─ Grammar, Subword │
│ └─ BPE, Syllable, Frequency, Byte │
│ ↓ │
│ [3] Mathematical Analysis │
│ ├─ UID assignment (Xorshift64*) │
│ ├─ Frontend digit (9-centric) │
│ ├─ Backend number │
│ └─ Global ID │
│ ↓ │
│ [4] Embedding Generation │
│ ├─ Feature extraction │
│ ├─ Strategy selection │
│ └─ Vector generation │
│ ↓ │
│ [5] Storage/Reasoning (Optional) │
│ ├─ Vector Store (ChromaDB/FAISS/Weaviate) │
│ └─ Cognitive Reasoning (Knowledge Graph) │
│ ↓ │
│ Output: Tokens + Embeddings + Metadata │
│ │
└─────────────────────────────────────────────────────────────┘
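Step [3] assigns UIDs with Xorshift64*. The scrambler below is the standard Xorshift64* step (shifts 12/25/27 and the usual multiplier); how SanTOK mixes token bytes into the state is not specified here, so token_uid is an illustrative assumption.
MASK64 = (1 << 64) - 1
def xorshift64_star(state):
    # One step of the xorshift64* generator; state must be nonzero
    state &= MASK64
    state ^= state >> 12
    state ^= (state << 25) & MASK64
    state ^= state >> 27
    return (state * 0x2545F4914F6CDD1D) & MASK64
def token_uid(text, seed=42):
    state = seed or 1
    for b in text.encode("utf-8"):
        state = xorshift64_star((state + b + 1) & MASK64)
    return state
# Same text and seed always yield the same 64-bit UID
assert token_uid("Hello") == token_uid("Hello")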
┌──────────────┐
│ User/API │
└──────┬───────┘
│
↓
┌──────────────────┐
│ API Server │
│ (FastAPI) │
└──────┬───────────┘
│
├──────────────┐
│ │
↓ ↓
┌──────────────┐ ┌──────────────┐
│ TextTokenizer│ │ EmbeddingGen │
└──────┬───────┘ └──────┬───────┘
│ │
├─────────┐ │
│ │ │
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│Cognitive │ │ Vector │ │ Training │
│Reasoning │ │ Store │ │ System │
└──────────┘ └──────────┘ └──────────┘
- Modularity: Each component is independent and can be used separately
- Determinism: Same input always produces same output
- Extensibility: Easy to add new tokenization methods or embedding strategies
- Performance: Parallel processing where possible
- Scalability: Supports large-scale processing
- Explainability: Full traceability of all operations
- Python: 3.11 or higher
- pip: Python package installer
- RAM: 4GB minimum, 8GB recommended
- Disk Space: 2GB free space
Linux/Mac:
git clone <repository-url>
cd SanTOK-Code-Only
chmod +x setup.sh # if setup script exists
./setup.sh
Windows:
git clone <repository-url>
cd SanTOK-Code-Only
.\setup.bat # if setup script exists
- Clone the repository:
git clone <repository-url>
cd SanTOK-Code-Only
- Create virtual environment:
python -m venv venv
# Activate virtual environment
# Linux/Mac:
source venv/bin/activate
# Windows:
venv\Scripts\activate
- Install dependencies:
pip install --upgrade pip
pip install -r requirements.txt
- Install the package (optional):
pip install -e .
Or run everything with Docker:
docker-compose up
Verify the installation:
python check_system.py
Or test in Python:
from santok import TextTokenizationEngine
engine = TextTokenizationEngine()
result = engine.tokenize("Hello World", "whitespace")
print(result['tokens'])  # Should print: ['Hello', 'World']
from santok import TextTokenizationEngine
# Create engine
engine = TextTokenizationEngine(
random_seed=12345,
normalize_case=True,
remove_punctuation=False
)
# Tokenize text
text = "Hello World! This is SanTOK."
result = engine.tokenize(text, tokenization_method="whitespace")
print(f"Tokens: {result['tokens']}")
print(f"Frontend Digits: {result['frontend_digits']}")
print(f"Features: {result['features']}")
from src.core.core_tokenizer import TextTokenizer
# Create tokenizer
tokenizer = TextTokenizer(seed=42, embedding_bit=False)
# Build token streams (multiple methods at once)
streams = tokenizer.build("Hello World! This is SanTOK.")
# Access different tokenization methods
word_tokens = streams["word"].tokens
char_tokens = streams["char"].tokens
subword_tokens = streams["subword"].tokens
# Each token has: text, uid, index, content_id, frontend_digit, backend_number, global_id
for token in word_tokens[:5]:
print(f"Text: {token.text}, UID: {token.uid}, Frontend: {token.frontend_digit}")
# Tokenize text
python santok_cli.py tokenize --text "Hello world" --method word
# Tokenize file
python santok_cli.py tokenize --file data.txt --output tokens.json
# Train embeddings
python santok_cli.py train --file corpus.txt --model-path model.pkl
# Generate embeddings
python santok_cli.py embed --text "Hello world" --model-path model.pkl
# Show system information
python santok_cli.py info
# Option 1: Using run script (recommended)
python run.py
# Option 2: Direct start
python start.py
# Option 3: For Railway/Heroku deployment
python main.py # Auto-detects app from src.servers.main_server
# Server will be available at http://localhost:8000
# Interactive API docs at http://localhost:8000/docs
# Alternative docs at http://localhost:8000/redoc
# Tokenize via API
curl -X POST "http://localhost:8000/api/v1/tokenize" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "method": "word"}'
# Generate embeddings
curl -X POST "http://localhost:8000/api/v1/embed" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "strategy": "feature_based"}'
# Health check
curl http://localhost:8000/api/v1/health
import asyncio
import websockets
import json
async def tokenize_websocket():
uri = "ws://localhost:8000/ws"
async with websockets.connect(uri) as websocket:
# Send tokenize request
await websocket.send(json.dumps({
"action": "tokenize",
"text": "Hello world",
"method": "word"
}))
# Receive results
result = await websocket.recv()
print(json.loads(result))
asyncio.run(tokenize_websocket())
Main Classes:
- TextTokenizationEngine - Main tokenization engine
- TextTokenizer - Core tokenizer with multiple methods
- BaseTokenizer - Base class for custom tokenizers
- ParallelTokenizer - Parallel processing support
Tokenization Methods:
- space/whitespace - Split by whitespace characters
- word - Word-based tokenization (alphabetic characters)
- char/character - Character-level tokenization
- grammar - Grammar-aware tokenization with punctuation handling
- subword - Basic subword tokenization
- subword_bpe - Byte-Pair Encoding (BPE) subword tokenization
- subword_frequency - Frequency-based subword tokenization
- subword_syllable - Syllable-based subword tokenization
- byte - Byte-level tokenization (ord-based)
Multi-language Support:
- Automatic language detection
- Support for CJK (Chinese, Japanese, Korean)
- Arabic, Cyrillic, Hebrew, Thai, Devanagari support
- Language-specific word boundary detection
Components:
- SanTOKEmbeddingGenerator - Generate embeddings from text
- SanTOKVectorStore - Store and search embeddings
- SanTOKSemanticTrainer - Train semantic models
- SanTOKInferencePipeline - Inference pipeline
- EnhancedSanTOKSemanticTrainer - Enhanced training with multi-stream learning
Embedding Strategies:
- feature_based - Mathematical feature-based embeddings
- hash_based - Hash-based embeddings
- semantic - Trained semantic embeddings
- hybrid - Combination of multiple strategies
Supported Databases:
- ChromaDB - Lightweight vector database
- FAISS - Facebook AI Similarity Search
- Weaviate - Cloud-native vector database
Components:
- UnifiedMemory - Unified memory system
- SanTOKReasoner - Symbolic reasoning engine
- GraphStore - Knowledge graph storage
- TreeStore - Hierarchical tree storage
- InferenceEngine - Inference rule engine
Features:
- 15+ relation types
- 20+ inference rules (see the transitivity sketch after this list)
- Contradiction detection
- Confidence propagation
- Full explainability
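As a concrete example of one inference rule, here is a minimal forward-chaining sketch of transitivity over (subject, relation, object) triples. The triple representation is assumed for illustration and is not the engine's internal format.
def apply_transitivity(triples, relation="IS_A"):
    # Derive (a, IS_A, c) from (a, IS_A, b) and (b, IS_A, c) until a fixpoint
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(facts):
            for (b2, r2, c) in list(facts):
                if r1 == r2 == relation and b2 == b and (a, relation, c) not in facts:
                    facts.add((a, relation, c))
                    changed = True
    return facts
facts = apply_transitivity({
    ("python", "IS_A", "programming language"),
    ("programming language", "IS_A", "formal language"),
})
# Derives ("python", "IS_A", "formal language")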
Available Servers:
- main_server.py - Full-featured FastAPI server
- lightweight_server.py - Lightweight API server
- simple_server.py - Simple HTTP server
- api_server.py - Alternative API implementation
Features:
- RESTful API endpoints
- WebSocket support
- File upload/download
- Job management
- Authentication (JWT)
- Interactive documentation
Components:
- SanTOKVocabularyBuilder - Build vocabularies
- SanTOKLanguageModelTrainer - Train language models
- EnhancedSanTOKSemanticTrainer - Enhanced semantic training
- DatasetDownloader - Download training datasets
Components:
- Transformer-based small language models
- Training scripts
- Model loading and inference
- Vocabulary expansion
Components:
- VocabularyAdapter - Adapt vocabularies between systems
- SourceMapIntegration - Source map integration
- CognitivePipeline - Integration with cognitive reasoning
Components:
- Config - Configuration management
- LoggingConfig - Logging setup
- Validation - Input validation
- UniqueIdentifier - UID generation
from santok import TextTokenizationEngine
engine = TextTokenizationEngine()
# Analyze with all methods
text = "SanTOK is an advanced text processing framework."
analysis = engine.analyze_text(text)
# Access results for each method
for method, result in analysis.items():
print(f"{method}: {len(result['tokens'])} tokens")
print(f" Frontend Digits: {result['frontend_digits']}")
print(f" Features: {result['features']}")
from src.core.core_tokenizer import TextTokenizer
from src.embeddings.semantic_trainer import SanTOKSemanticTrainer
# Tokenize corpus
tokenizer = TextTokenizer(seed=42)
streams = tokenizer.build(your_corpus_text)
# Train semantic embeddings
trainer = SanTOKSemanticTrainer(
embedding_dim=768,
epochs=10,
window_size=5
)
# Collect all tokens
all_tokens = []
for stream in streams.values():
all_tokens.extend(stream.tokens)
# Build vocabulary and train
trainer.build_vocab(all_tokens)
trainer.build_cooccurrence(all_tokens)
trainer.train(all_tokens)
# Save model
trainer.save("model.pkl")
# Get embedding for a token
embedding = trainer.get_embedding(token_uid)
from enhanced_semantic_trainer import EnhancedSanTOKSemanticTrainer
from src.core.core_tokenizer import TextTokenizer
# Tokenize
tokenizer = TextTokenizer()
streams = tokenizer.build(your_text)
# Train with enhanced features
trainer = EnhancedSanTOKSemanticTrainer(
embedding_dim=768,
epochs=10,
window_size=5,
use_multi_stream=True,
use_temporal=True,
use_content_id_clustering=True,
use_math_properties=True
)
trainer.train(streams)
trainer.save("enhanced_model.pkl")
from santok_cognitive import UnifiedMemory, SanTOKReasoner
# Create memory
memory = UnifiedMemory()
# Add knowledge
memory.add("Python is a programming language", "fact", auto_link_graph=True)
memory.add("Programming languages are used for software development", "fact", auto_link_graph=True)
# Create reasoner
reasoner = SanTOKReasoner(memory)
# Ask question
answer = reasoner.ask("What is Python?")
print(answer.text)
print(answer.explain())  # Full reasoning trace
from src.embeddings.embedding_generator import SanTOKEmbeddingGenerator
from src.embeddings.vector_store import SanTOKVectorStore
# Generate embeddings
generator = SanTOKEmbeddingGenerator(strategy="feature_based")
embedding = generator.generate("Hello world")
# Store in vector database
store = SanTOKVectorStore()
doc_id = store.add(embedding, metadata={"text": "Hello world", "id": 1})
# Search
query_embedding = generator.generate("greeting")
results = store.search(query_embedding, top_k=5)
for result in results:
print(f"Score: {result['score']}, Metadata: {result['metadata']}")
from fastapi import FastAPI
from santok import TextTokenizationEngine
app = FastAPI()
engine = TextTokenizationEngine()
@app.post("/tokenize")
async def tokenize(text: str, method: str = "whitespace"):
result = engine.tokenize(text, method)
return result
# Run with: uvicorn main:app --reload
When the server is running, visit http://localhost:8000/docs for interactive Swagger documentation or http://localhost:8000/redoc for ReDoc documentation.
Core Endpoints:
- POST /api/v1/tokenize - Tokenize text with multiple methods
- POST /api/v1/embed - Generate embeddings from text
- POST /api/v1/train - Train semantic embedding model
- GET /api/v1/health - Health check endpoint
- GET /api/v1/info - System information
- POST /api/v1/analyze - Comprehensive text analysis
File Operations:
- POST /api/v1/upload - Upload file for processing
- POST /api/v1/tokenize/file - Tokenize uploaded file
- GET /api/v1/download/{file_id} - Download processed results
WebSocket:
- WebSocket /ws - Real-time streaming tokenization
- WebSocket /ws/train - Real-time training progress
Job Management:
- POST /api/v1/jobs - Create async job
- GET /api/v1/jobs/{job_id} - Get job status
- GET /api/v1/jobs/{job_id}/result - Get job result
Example Requests:
Tokenize Text:
curl -X POST "http://localhost:8000/api/v1/tokenize" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello world",
"method": "word",
"compute_features": true,
"seed": 42
}'
Response:
{
"tokens": ["Hello", "world"],
"frontend_digits": [5, 6],
"backend_numbers": [123, 456],
"global_ids": [789, 101],
"features": {
"length_factor": 2,
"balance_index": 5,
"entropy_index": 0,
"mean": 5.5,
"variance": 0.25
},
"method": "word",
"token_count": 2
}
Generate Embeddings:
curl -X POST "http://localhost:8000/api/v1/embed" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello world",
"strategy": "feature_based",
"model_path": "model.pkl"
}'
Upload and Process File:
curl -X POST "http://localhost:8000/api/v1/upload" \
-F "file=@document.txt" \
-F "method=word"
WebSocket Example:
import asyncio
import websockets
import json
async def tokenize_stream():
uri = "ws://localhost:8000/ws"
async with websockets.connect(uri) as websocket:
# Send request
await websocket.send(json.dumps({
"action": "tokenize",
"text": "Hello world",
"method": "word"
}))
# Receive streaming results
while True:
result = await websocket.recv()
data = json.loads(result)
if data.get("done"):
break
print(f"Token: {data.get('token')}")
asyncio.run(tokenize_stream())
# Tokenize text
python santok_cli.py tokenize --text "Hello world" --method word
# Tokenize file
python santok_cli.py tokenize --file data.txt --output tokens.json --format json
# Tokenize from URL
python santok_cli.py tokenize --url https://example.com/text.txt
# Train basic model
python santok_cli.py train --file corpus.txt --model-path model.pkl
# Train with enhanced trainer
python santok_cli.py train --file corpus.txt --model-path model.pkl --enhanced
# Custom training parameters
python santok_cli.py train --file corpus.txt \
--model-path model.pkl \
--embedding-dim 768 \
--epochs 20 \
--window-size 5
# Generate embeddings
python santok_cli.py embed --text "Hello world" --model-path model.pkl
# Generate with different strategy
python santok_cli.py embed --text "Hello world" \
--strategy feature_based \
--output embeddings.npy
# Run tests
python santok_cli.py test
# Quick tests
python santok_cli.py test --quick
# Show system information
python santok_cli.py info
# After installation: pip install -e .
santok "Hello world" --method whitespace
santok "Hello world" --analyze --output results.json
# Start development server
python run.py
# Or use uvicorn directly
uvicorn src.servers.main_server:app --reload --host 0.0.0.0 --port 8000
Using Railway:
# Railway auto-detects start.py
# Set PORT environment variable
railway up
Using Docker:
docker-compose up -d
Using systemd (Linux):
# Create service file
sudo nano /etc/systemd/system/santok.service
# Start service
sudo systemctl start santok
sudo systemctl enable santok
Environment variables:
- PORT - Server port (default: 8000)
- LOG_LEVEL - Logging level (default: INFO)
- WEAVIATE_URL - Weaviate server URL (optional)
- WEAVIATE_API_KEY - Weaviate API key (optional)
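A minimal sketch of consuming these variables at startup, assuming the app path shown earlier (src.servers.main_server:app):
import os
import uvicorn
PORT = int(os.environ.get("PORT", 8000))               # server port
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")        # logging level
WEAVIATE_URL = os.environ.get("WEAVIATE_URL")          # optional
WEAVIATE_API_KEY = os.environ.get("WEAVIATE_API_KEY")  # optional
if __name__ == "__main__":
    uvicorn.run("src.servers.main_server:app", host="0.0.0.0",
                port=PORT, log_level=LOG_LEVEL.lower())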
SanTOK-Code-Only/
├── santok/ # Core tokenization package
│ ├── __init__.py # Package initialization
│ ├── santok.py # Main TextTokenizationEngine class
│ ├── cli.py # CLI interface (argparse-based)
│ └── utils/ # Utility modules
│ ├── config.py # Configuration management
│ ├── logging_config.py # Logging setup
│ └── validation.py # Input validation
│
├── santok_cognitive/ # Cognitive reasoning system
│ ├── __init__.py
│ ├── README.md # Cognitive system documentation
│ ├── ARCHITECTURE.md # Architecture documentation
│ ├── WHITEPAPER.md # Technical whitepaper
│ ├── graph/ # Knowledge graph implementation
│ │ ├── graph_node.py # Graph node class
│ │ ├── graph_edge.py # Graph edge class
│ │ ├── graph_store.py # Graph storage
│ │ └── relation_extractor.py # Relation extraction
│ ├── trees/ # Hierarchical tree structures
│ │ ├── tree.py # Tree implementation
│ │ ├── tree_node.py # Tree node class
│ │ └── tree_store.py # Tree storage
│ ├── memory/ # Unified memory system
│ │ ├── unified_memory.py # Main memory class
│ │ └── memory_object.py # Memory object representation
│ ├── reasoning/ # Inference and reasoning
│ │ ├── santok_reasoner.py # Main reasoner
│ │ ├── inference_engine.py # Inference rule engine
│ │ ├── query_engine.py # Query processing
│ │ ├── path_finder.py # Path finding algorithms
│ │ ├── contradiction_detector.py # Contradiction detection
│ │ └── explainer.py # Explanation generation
│ ├── algorithms/ # Custom SanTOK algorithms
│ │ ├── santok_ranker.py # Hybrid relevance ranking
│ │ ├── nine_scorer.py # 9-centric confidence scoring
│ │ ├── semantic_similarity.py # Semantic similarity
│ │ ├── graph_walker.py # Graph traversal algorithms
│ │ └── pattern_matcher.py # Pattern matching
│ ├── slm/ # Small Language Models
│ │ ├── santok_slm_model.py # SLM model implementation
│ │ ├── tiny_slm.py # Tiny transformer model
│ │ ├── slm_trainer.py # Training scripts
│ │ └── [multiple training scripts]
│ └── integration/ # Integration modules
│ ├── cognitive_pipeline.py # Cognitive processing pipeline
│ ├── vector_bridge.py # Vector store bridge
│ └── token_bridge.py # Token bridge
│
├── santok_complete/ # Complete production system
│ ├── core/ # Core tokenization
│ ├── embeddings/ # Embedding generation
│ ├── training/ # Model training
│ ├── servers/ # API servers
│ └── vector_stores/ # Vector database integrations
│
├── src/ # Main source code
│ ├── core/ # Core tokenization engines
│ │ ├── core_tokenizer.py # Main tokenizer (9 methods)
│ │ ├── base_tokenizer.py # Base tokenizer class
│ │ └── parallel_tokenizer.py # Parallel processing
│ ├── embeddings/ # Embedding systems
│ │ ├── embedding_generator.py # Embedding generation
│ │ ├── semantic_trainer.py # Semantic model training
│ │ ├── vector_store.py # Vector storage
│ │ ├── weaviate_vector_store.py # Weaviate integration
│ │ └── inference_pipeline.py # Inference pipeline
│ ├── servers/ # API servers
│ │ ├── main_server.py # Full-featured FastAPI server
│ │ ├── lightweight_server.py # Lightweight server
│ │ ├── simple_server.py # Simple HTTP server
│ │ ├── api_server.py # Alternative API implementation
│ │ ├── job_manager.py # Async job management
│ │ └── error_handling.py # Error handling utilities
│ ├── training/ # Training modules
│ │ ├── vocabulary_builder.py # Vocabulary construction
│ │ ├── language_model_trainer.py # Language model training
│ │ └── dataset_downloader.py # Dataset management
│ ├── integration/ # Integration modules
│ │ ├── vocabulary_adapter.py # Vocabulary adaptation
│ │ └── source_map_integration.py # Source map integration
│ ├── compression/ # Compression algorithms
│ │ └── compression_algorithms.py # Text compression
│ ├── interpretation/ # Text interpretation
│ │ └── data_interpreter.py # Data interpretation
│ ├── performance/ # Performance testing
│ │ ├── test_accuracy.py # Accuracy tests
│ │ └── comprehensive_performance_test.py # Full benchmarks
│ ├── cli/ # CLI tools
│ │ └── main.py # CLI main entry
│ └── utils/ # Utilities
│ └── unique_identifier.py # UID generation
│
├── backend/ # Backend-specific code
│ ├── santok/ # Backend tokenization package
│ ├── src/ # Backend source (mirror of src/)
│ └── Architecture_Docs/ # Architecture documentation
│
├── enhanced_semantic_trainer/ # Enhanced semantic training
│ ├── enhanced_trainer.py # Enhanced trainer implementation
│ ├── example_train.py # Training examples
│ ├── example_use.py # Usage examples
│ └── examples/ # Additional examples
│
├── examples/ # Example scripts and demos
│ ├── embedding_example.py # Embedding examples
│ ├── vector_store examples # Vector store usage
│ ├── training examples # Training examples
│ └── integration examples # Integration examples
│
├── docs/ # Comprehensive documentation
│ ├── api/ # API documentation
│ ├── backend/ # Backend documentation
│ ├── examples/ # Example documentation
│ ├── guides/ # User guides
│ ├── integration/ # Integration guides
│ └── performance/ # Performance documentation
│
├── weaviate_codes/ # Weaviate integration
│ ├── weaviate_vector_store.py # Weaviate vector store
│ └── README.md # Weaviate setup guide
│
├── main.py # Main entry point (Railway/Heroku)
├── run.py # Cross-platform run script
├── start.py # Server startup script
├── santok_cli.py # Main CLI interface
├── check_system.py # System verification script
├── requirements.txt # Python dependencies
├── setup.py # Package setup configuration
├── Procfile # Heroku/Railway process file
├── runtime.txt # Python version specification
└── README.md # This comprehensive documentation
SanTOK provides comprehensive example scripts demonstrating various use cases:
File: examples/embedding_example.py
- Basic tokenization and embedding generation
- Token-by-token embedding visualization
- Document-level embeddings
- Vector store integration
File: examples/train_semantic_embeddings.py
- Training semantic embeddings from scratch
- Vocabulary building
- Model persistence
File: examples/comprehensive_vector_store_example.py
- Unified example combining ALL vector store capabilities
- Weaviate, FAISS, and ChromaDB integration
- Semantic search with filtering
- Concept exploration and clustering
- Context fusion embeddings
- Batch processing for large datasets
File: examples/use_vector_store.py
- Loading vector stores from disk
- Interactive search mode
- Cluster analysis
- Similarity comparisons
File: examples/search_examples.py
- Advanced search patterns
- Multi-level concept exploration
- Related concept finding
- Token comparison utilities
File: examples/test_full_workflow_500k.py
- Complete workflow for 500K+ token datasets
- Batch processing with disk saving
- Resume capability
- Memory-efficient embedding generation
- Wikipedia data integration
File: santok_cognitive/demo.py
- Knowledge graph construction
- Tree-based hierarchical organization
- Symbolic reasoning demonstrations
- Inference rule applications
- Full pipeline examples
File: santok_cognitive/showcase.py
- Advanced cognitive features
- Query answering with explanations
- Contradiction detection
- Confidence propagation
File: examples/integrate_source_map_workflow.py
- Source map integration
- Metadata tracking
- Railway compute workflows
File: examples/integration_with_transformers.py
- Integration with external transformer models
- Hybrid embedding strategies
- Model comparison
File: examples/quick_start_integration.py
- Quick integration guide
- Common integration patterns
File: examples/santok_with_tiny_slm.py
- SanTOK-native SLM usage
- Constraint-grounded generation
- No external AI dependencies
File: examples/simple_tiny_slm.py
- Basic SLM implementation
- Training and inference
File: examples/eval_embedding_quality.py
- Embedding quality assessment
- Probe token evaluation
- Semantic alignment testing
File: examples/compare_neighbors.py
- Comparison between different stores/strategies
- Overlap analysis
- Performance benchmarking
File: examples/test_data_interpreter.py
- Real-time data interpretation
- Weaviate-based knowledge discovery
- Semantic relationship extraction
# 1. Tokenize documents
from src.core.core_tokenizer import TextTokenizer
tokenizer = TextTokenizer(method="word", seed=42)
tokens = tokenizer.tokenize_text("Your document text here...")
# 2. Generate embeddings
from src.embeddings.embedding_generator import SanTOKEmbeddingGenerator
generator = SanTOKEmbeddingGenerator(strategy="hybrid")
embeddings = generator.generate_embeddings(tokens)
# 3. Store in vector database
from src.embeddings.vector_store import ChromaVectorStore
store = ChromaVectorStore(collection_name="documents")
store.add_tokens(tokens, embeddings)
# 4. Semantic search
results = store.search(embeddings[0], top_k=10)
# 1. Build knowledge graph
from santok_cognitive.memory.unified_memory import UnifiedMemory
memory = UnifiedMemory()
# 2. Add facts
obj1 = memory.add("Python is a programming language", "fact")
obj2 = memory.add("Python uses dynamic typing", "fact")
# 3. Create relationships
memory.add_relation(obj1.uid, obj2.uid, RelationType.RELATED_TO)
# 4. Query with reasoning
from santok_cognitive.reasoning.reasoner import SanTOKReasoner
reasoner = SanTOKReasoner(memory.graph)
answer = reasoner.answer("What is Python?")
print(answer.explanation)
# 1. Set up inference pipeline
from src.embeddings.inference_pipeline import SanTOKInferencePipeline
pipeline = SanTOKInferencePipeline(
embedding_strategy="semantic",
vector_store="chroma"
)
# 2. Process incoming text
result = pipeline.process_text(
"Machine learning is a subset of artificial intelligence",
store=True
)
# 3. Find similar concepts
similar = pipeline.similarity_search(
"deep learning",
top_k=5
)
# 1. Build vocabulary
from src.training.vocabulary_builder import SanTOKVocabularyBuilder
builder = SanTOKVocabularyBuilder()
vocab = builder.build_vocabulary("corpus.txt")
# 2. Train language model
from src.training.language_model_trainer import SanTOKLanguageModelTrainer
trainer = SanTOKLanguageModelTrainer(vocab)
model = trainer.train("corpus.txt", epochs=10)
# 3. Generate text
generated = model.generate("The future of AI", max_length=100)
# Start API server
python run.py
# Use REST API
import requests
# Tokenize
response = requests.post("http://localhost:8000/api/v1/tokenize", json={
"text": "Hello world",
"method": "word"
})
tokens = response.json()
# Generate embeddings
response = requests.post("http://localhost:8000/api/v1/embed", json={
"text": "Hello world",
"strategy": "feature_based"
})
embeddings = response.json()
File: src/performance/comprehensive_performance_test.py
- Performance benchmarking
- Tokenizer comparison
- Reconstruction accuracy testing
- Speed optimization strategies
To run any example:
# Navigate to examples directory
cd examples
# Run specific example
python comprehensive_vector_store_example.py
# Or run from project root
python examples/embedding_example.py
Most examples generate:
- Token files: JSON/CSV files with tokenized data
- Embedding files: NumPy arrays or pickle files
- Vector store files: Persistent database files
- Report files: Markdown/JSON reports with results
- Visualization files: PNG/SVG charts and graphs
# Quick smoke tests via CLI
python santok_cli.py test --quick
# Full test suite (if pytest tests exist)
python -m pytest tests/
# With coverage report
python -m pytest tests/ --cov=santok --cov-report=html
# Test specific module
python -m pytest tests/test_tokenization.py -v
# Check system setup and dependencies
python check_system.py
# This verifies:
# - Python version
# - Installed dependencies
# - File structure
# - Basic functionality
Test Tokenization:
from santok import TextTokenizationEngine
engine = TextTokenizationEngine()
result = engine.tokenize("Hello World", "whitespace")
assert len(result['tokens']) == 2
assert result['tokens'][0] == 'Hello'
print("✓ Tokenization test passed")
Test Core Tokenizer:
from src.core.core_tokenizer import TextTokenizer
tokenizer = TextTokenizer(seed=42)
streams = tokenizer.build("Hello World")
assert "word" in streams
assert len(streams["word"].tokens) > 0
print("✓ Core tokenizer test passed")
Test Embeddings:
from src.embeddings.embedding_generator import SanTOKEmbeddingGenerator
generator = SanTOKEmbeddingGenerator(strategy="feature_based")
embedding = generator.generate("Hello world")
assert embedding is not None
assert len(embedding) > 0
print("✓ Embedding generation test passed")
Test API Server:
# Start server
python start.py &
# Test health endpoint
curl http://localhost:8000/api/v1/health
# Test tokenize endpoint
curl -X POST "http://localhost:8000/api/v1/tokenize" \
-H "Content-Type: application/json" \
-d '{"text": "test", "method": "word"}'
# Run performance benchmarks
from src.performance.comprehensive_performance_test import run_performance_tests
results = run_performance_tests()
print(results)
Check the examples/ directory for comprehensive test examples:
- test_full_workflow_500k.py - Large-scale workflow test
- eval_embedding_quality.py - Embedding quality evaluation
- test_data_interpreter.py - Data interpretation tests
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch
git checkout -b feature/your-feature-name
- Make your changes
- Run tests
python santok_cli.py test
- Commit your changes
git commit -m "Add your feature description"
- Push to your fork
git push origin feature/your-feature-name
- Submit a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings to all functions and classes
- Include tests for new features
- Update documentation as needed
- Keep commits atomic and well-described
This project is licensed under the MIT License - see the LICENSE file for details.
Santosh Chavala
- GitHub: @chavalasantosh
- Repository: SanTOK
- DeepWiki: SanTOK Documentation
- Built with Python 3.11+
- Uses FastAPI for API servers
- Integrates with Weaviate, ChromaDB, and FAISS
- Thanks to all contributors and the open-source community
- Total Files: 300+ Python files
- Lines of Code: 50,000+
- Components: 15+ major modules
- Tokenization Methods: 9+
- Supported Python Versions: 3.11+
- API Endpoints: 20+
- Inference Rules: 20+ (Cognitive)
- SanTOK Cognitive Documentation - Cognitive reasoning system
- SanTOK Cognitive Architecture - Detailed architecture
- SanTOK Cognitive Whitepaper - Technical whitepaper
- SanTOK Complete Documentation - Complete system docs
- Enhanced Trainer Documentation - Enhanced training
- Weaviate Integration Guide - Weaviate setup and usage
- API Swagger Docs: http://localhost:8000/docs (when server is running)
- API ReDoc: http://localhost:8000/redoc (when server is running)
- docs/api/ - API documentation
- docs/backend/Architecture_Docs/ - Backend architecture
- docs/examples/ - Example documentation
- docs/guides/ - User guides
- docs/integration/ - Integration guides
- docs/performance/ - Performance documentation
Check the examples/ directory for:
- Embedding examples
- Vector store usage
- Training workflows
- Integration examples
- Performance benchmarks
Port already in use:
# Change port
PORT=8001 python run.py
Python version too old:
- Install Python 3.11+ from python.org
Dependencies fail to install:
pip install --upgrade pip
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
Import errors:
- Ensure you're in the project root directory
- Activate the virtual environment
- Run python check_system.py to diagnose issues
Getting Help:
- Check the documentation in docs/
- Run python check_system.py to verify installation
- Check server logs for error messages
- Review examples in the examples/ directory
- Open an issue on GitHub
Tokenization:
from santok import TextTokenizationEngine
engine = TextTokenizationEngine(seed=42)
result = engine.tokenize("Hello World", method="word")
Embeddings:
from src.embeddings.embedding_generator import SanTOKEmbeddingGenerator
generator = SanTOKEmbeddingGenerator(strategy="feature_based")
embeddings = generator.generate_embeddings(token_records)
Vector Store:
from src.embeddings.vector_store import ChromaVectorStore
store = ChromaVectorStore(collection_name="docs")
store.add_tokens(tokens, embeddings)
results = store.search(query_embedding, top_k=10)
Cognitive Reasoning:
from santok_cognitive.memory.unified_memory import UnifiedMemory
memory = UnifiedMemory()
obj = memory.add("Python is a language", "fact")
answer = memory.search_by_content("What is Python?")
API Server:
# Start server
python run.py
# Tokenize via API
curl -X POST http://localhost:8000/api/v1/tokenize \
-H "Content-Type: application/json" \
-d '{"text": "Hello World", "method": "word"}'
| Parameter | Description | Default |
|---|---|---|
| seed | Random seed for reproducibility | 42 |
| method | Tokenization method | "word" |
| strategy | Embedding strategy | "feature_based" |
| embedding_dim | Embedding dimension | 768 |
| top_k | Number of results | 10 |
| Component | Location |
|---|---|
| Core Tokenizer | src/core/core_tokenizer.py |
| Embeddings | src/embeddings/ |
| Vector Stores | src/embeddings/vector_store.py |
| Cognitive | santok_cognitive/ |
| API Server | src/servers/main_server.py |
| Examples | examples/ |
| CLI | santok_cli.py |
Use Case: Explainable medical diagnosis support
- Challenge: Medical AI must be explainable and auditable
- SanTOK Solution:
- Deterministic reasoning with full explainability
- Knowledge graphs for medical relationships
- Constraint enforcement for safety
- Benefits:
- Traceable decisions
- Regulatory compliance
- No hallucination in critical medical information
Example:
# Medical knowledge base
memory = UnifiedMemory()
memory.add("Aspirin reduces inflammation", source="medical_literature")
memory.add("Patient has inflammation", source="patient_record")
# Query with explanation
result = memory.query("Should patient take aspirin?")
# Returns: Answer + Full reasoning trace + Confidence + Sources
Use Case: Auditable financial decision systems
- Challenge: Financial decisions must be traceable and compliant
- SanTOK Solution:
- Full audit trails
- Contradiction detection
- Source tracking
- Benefits:
- Regulatory compliance
- Risk management
- Fraud detection
Example:
# Financial rules engine
memory = UnifiedMemory()
memory.add("High risk requires approval", relation=RelationType.RULE)
memory.add("Transaction is high risk", source="risk_engine")
# Automated decision with audit trail
decision = memory.reason("Should transaction be approved?")
# Returns: Decision + Complete audit trail + Rule chain
Use Case: Legal document analysis and reasoning
- Challenge: Legal reasoning must be precise and explainable
- SanTOK Solution:
- Symbolic reasoning for legal logic
- Knowledge graphs for case law
- Full explainability
- Benefits:
- Precise legal analysis
- Case law relationships
- Explainable conclusions
Example:
# Legal knowledge base
memory = UnifiedMemory()
memory.add("Contract breach requires damages", relation=RelationType.IMPLIES)
memory.add("Party A breached contract", source="evidence")
# Legal reasoning
conclusion = memory.reason("What are the legal consequences?")
# Returns: Conclusion + Legal reasoning chain + Precedents
Use Case: Internal knowledge bases with guarantees
- Challenge: Enterprise knowledge must be reliable and searchable
- SanTOK Solution:
- Unified memory (vector + graph + tree)
- Source tracking
- Temporal awareness
- Benefits:
- Reliable knowledge retrieval
- Source attribution
- Knowledge evolution tracking
Example:
# Enterprise knowledge base
memory = UnifiedMemory()
memory.add("Product X uses technology Y", source="engineering_team", date="2024-01-15")
memory.add("Technology Y is deprecated", source="tech_lead", date="2024-03-20")
# Temporal-aware query
results = memory.search("What technology does Product X use?")
# Returns: Current answer + Historical changes + Source timeline
Use Case: AI-powered customer support with full audit trails
- Challenge: Support responses must be accurate and traceable
- SanTOK Solution:
- Constraint-grounded generation
- Knowledge validation
- Full audit trails
- Benefits:
- Accurate responses
- Source attribution
- Quality assurance
Example:
# Support knowledge base
memory = UnifiedMemory()
memory.add("Feature X requires subscription Y", source="product_docs")
memory.add("Customer has subscription Y", source="customer_db")
# Support query
response = memory.query("Can customer use Feature X?")
# Returns: Answer + Knowledge sources + Confidence + Audit trail
Use Case: Research paper analysis and knowledge extraction
- Challenge: Extract and reason about research findings
- SanTOK Solution:
- Semantic embeddings for paper similarity
- Knowledge graphs for research relationships
- Citation tracking
- Benefits:
- Research discovery
- Citation networks
- Knowledge synthesis
Example:
# Research knowledge base
memory = UnifiedMemory()
memory.add("Study A shows X causes Y", source="paper_123", citation=True)
memory.add("Study B contradicts Study A", source="paper_456", citation=True)
# Research query
findings = memory.query("What is the relationship between X and Y?")
# Returns: Findings + Contradictions + Citations + Confidence
Use Case: Code analysis and documentation
- Challenge: Understand code relationships and generate documentation
- SanTOK Solution:
- Code tokenization (supports any file type)
- Semantic embeddings for code similarity
- Knowledge graphs for code relationships
- Benefits:
- Code understanding
- Documentation generation
- Refactoring support
Example:
# Code knowledge base
memory = UnifiedMemory()
memory.add("Function X calls Function Y", source="codebase", relation=RelationType.CALLS)
memory.add("Function Y is deprecated", source="changelog")
# Code analysis
analysis = memory.query("What functions does Function X depend on?")
# Returns: Dependencies + Status + Recommendations
Q: What makes SanTOK different from other tokenizers? A: SanTOK provides deterministic UIDs, mathematical properties (frontend/backend numbers), perfect reversibility, and integrates tokenization with embeddings, vector stores, and cognitive reasoning - all in one framework.
Q: Do I need external models (BERT, GPT, etc.) to use SanTOK? A: No! SanTOK is self-contained. You can train your own embeddings and models using only SanTOK components. External models are optional for hybrid strategies.
Q: Is SanTOK production-ready? A: Yes! SanTOK includes production-ready APIs, error handling, logging, monitoring, and deployment configurations for platforms like Railway and Heroku.
Q: What file types does SanTOK support? A: SanTOK supports ANY file type - text, images, videos, audio, binary files, executables, archives, and more. It's a universal tokenization system.
Q: How fast is SanTOK? A: SanTOK is optimized for performance with parallel processing, caching, and efficient algorithms. See the Performance Benchmarks section for detailed metrics.
Q: What is a deterministic UID? A: A deterministic UID is a unique identifier that is always the same for the same token. Same input = same UID, every time. This enables reproducible results.
Q: What are frontend and backend numbers? A: Frontend digits (1-9) represent semantic categories, while backend numbers provide positional encoding. These mathematical properties help models understand relationships.
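The 9-centric arithmetic behind frontend digits is the classic digital root: repeatedly sum a number's decimal digits until a single digit remains. How SanTOK derives the digit from a specific token is internal to the engine; the sketch below only shows the underlying math.
def digital_root(n):
    # Closed form of repeated digit summing; maps any positive integer to 1-9
    return 0 if n == 0 else 1 + (n - 1) % 9
print(digital_root(12345))  # 1+2+3+4+5 = 15 -> 1+5 = 6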
Q: Can I use SanTOK with existing models? A: Yes! SanTOK provides adapters and integration tools to work with external models, transformers, and other NLP tools.
Q: How do I choose between embedding strategies? A:
- Feature-based: Fast, deterministic, no training needed
- Semantic: Best quality, requires training on your data
- Hash-based: Ultra-fast, good for large-scale applications (see the sketch after this list)
- Hybrid: Combines multiple strategies for best results
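A minimal sketch of the hash-based idea: hash the text, seed a deterministic RNG with the digest, and emit a unit vector. The helper name and the SHA-256 choice are assumptions for illustration.
import hashlib
import numpy as np
def hash_based_embedding(text, dim=768):
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "little")  # derive a 64-bit seed
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim).astype(np.float32)
    return vec / np.linalg.norm(vec)  # unit length, deterministic, training-free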
Q: What vector store should I use? A:
- FAISS: Fast, in-memory, good for development
- ChromaDB: Persistent, disk-based, good for local deployment
- Weaviate: Cloud-native, scalable, best for production
Q: How much data do I need to train embeddings? A: Minimum 100K tokens recommended, but 1M+ tokens produces better results. The more domain-specific data, the better.
Q: How long does training take? A: Depends on dataset size and hardware. Typical training: 10-30 minutes for 1M tokens on modern CPUs, faster with GPUs.
Q: Can I resume training? A: Yes! SanTOK supports checkpointing and resuming training from saved models.
Q: How do I know if my model is ready? A: Check the Model Readiness Checklist in the documentation. Key indicators: loss converged, perplexity reasonable, generation quality acceptable.
Q: Can I deploy SanTOK to cloud platforms? A: Yes! SanTOK includes configurations for Railway, Heroku, and other platforms. See the Deployment section for details.
Q: What are the system requirements? A: Python 3.11+, 2GB+ RAM recommended, more for large datasets. No GPU required (but helps for training).
Q: Is SanTOK secure? A: Yes! SanTOK includes JWT authentication, input validation, safe error handling, and security best practices.
| Feature | Standard Tokenizers | SanTOK |
|---|---|---|
| Deterministic UIDs | ❌ | ✅ |
| Mathematical Properties | ❌ | ✅ |
| Perfect Reversibility | ❌ | ✅ |
| Multiple Granularities | Limited | ✅ (9+ methods) |
| Embedding Integration | ❌ | ✅ |
| Vector Store Integration | ❌ | ✅ |
| Cognitive Reasoning | ❌ | ✅ |
| Self-Contained | ❌ | ✅ |
| Feature | RAG | SanTOK |
|---|---|---|
| Structured Knowledge | ❌ | ✅ |
| Inference Rules | ❌ | ✅ (20+) |
| Constraint Enforcement | ❌ | ✅ |
| Explainability | ❌ | ✅ Full |
| No Hallucination | ❌ | ✅ |
| Deterministic | ❌ | ✅ |
| Feature | Knowledge Graphs | SanTOK |
|---|---|---|
| Natural Language Output | ❌ | ✅ |
| Inference Rules | Limited | ✅ (20+) |
| Constraint Enforcement | ❌ | ✅ |
| Full Explainability | Partial | ✅ |
| Integration with LLMs | ❌ | ✅ |
| Feature | Standard Embeddings | SanTOK Embeddings |
|---|---|---|
| External Dependencies | ✅ Required | ❌ Optional |
| Domain-Specific | ❌ Generic | ✅ Your domain |
| Mathematical Properties | ❌ | ✅ |
| Training Required | ❌ Pre-trained | ✅ Self-trained |
| Speed | Slow (50ms+) | Fast (2ms) |
- Large Vocabulary Training: Training embeddings on vocabularies >100K tokens may require significant memory. Use sparse representations or batch processing.
- Language Support: SanTOK works best with English text. Other languages may require additional preprocessing.
- GPU Acceleration: While SanTOK can use GPUs, it is primarily optimized for CPU usage. GPU support is optional.
- Real-time Processing: Very large files (>10GB) may require chunked processing rather than real-time handling (see the sketch after the workaround list).
- Vector Store Scaling: FAISS and ChromaDB have practical limits. For very large scale (>100M vectors), consider Weaviate.
Workarounds:
- Large Vocabularies: Use the max_vocab_size parameter to limit vocabulary size
- Memory Issues: Enable batch processing and disk saving
- Performance: Use parallel processing for large datasets
- Scaling: Use Weaviate for cloud-native, scalable vector storage
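For the chunked-processing workaround, a simple generator that reads a large file in fixed-size pieces is usually enough. This sketch is generic Python, not a SanTOK utility:
def iter_chunks(path, chunk_chars=1_000_000):
    # Yield the file one million characters at a time instead of loading it all
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk
# for chunk in iter_chunks("huge_corpus.txt"):
#     engine.tokenize(chunk, "word")  # process one chunk at a time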
- Choose the Right Method:
  - Use word for general text
  - Use subword for code or technical text
  - Use char for character-level analysis
- Set a Seed: Always use a consistent seed for reproducible results:
tokenizer = TextTokenizer(seed=42)
- Enable Features: Compute features for better embeddings:
result = tokenizer.tokenize_text(text, compute_features=True)
- Train on Your Domain: Don't rely on generic embeddings - train on your specific domain data.
- Use Hybrid Strategy: For best results, use hybrid embeddings combining feature-based and semantic.
- Normalize Embeddings: Always normalize embeddings for consistent similarity calculations (see the sketch after this list).
- Batch Processing: For large datasets, use batch processing to avoid memory issues.
- Sufficient Data: Use at least 100K tokens, preferably 1M+ for good results.
- Multiple Epochs: Train for at least 10 epochs, more for complex domains.
- Validation: Always validate on held-out data to prevent overfitting.
- Checkpointing: Save models regularly to enable resuming training.
- Use Production Servers: Use main_server.py for production, not simple_server.py.
- Enable Authentication: Use JWT authentication for production APIs.
- Monitor Performance: Enable logging and monitoring for production deployments.
- Use Vector Stores: For production, use persistent vector stores (ChromaDB or Weaviate).
- Parallel Processing: Enable parallel processing for large datasets.
- Caching: Enable caching for frequently accessed data.
- Batch Operations: Use batch operations for vector stores.
- Choose the Right Backend: Use FAISS for speed, Weaviate for scale.
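For the normalization practice above, a minimal sketch: L2-normalize vectors once, after which cosine similarity reduces to a plain dot product.
import numpy as np
def l2_normalize(vec):
    # Scale to unit length; the epsilon guards against zero vectors
    return vec / (np.linalg.norm(vec) + 1e-12)
def cosine_similarity(a, b):
    return float(l2_normalize(a) @ l2_normalize(b))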
- Enhanced GPU acceleration support
- Additional language support (multilingual tokenization)
- More vector store backends (Pinecone, Qdrant)
- Improved documentation and tutorials
- Performance optimizations
- Advanced model compression techniques
- Distributed training support
- Enhanced API features (GraphQL support)
- Web UI for model management
- More inference rules for cognitive reasoning
- Full multilingual support
- Advanced model architectures
- Integration with more LLM providers
- Enterprise features (SSO, RBAC)
- Advanced analytics and monitoring
Features:
- Complete tokenization system (9+ methods)
- Semantic embedding training
- Vector store integration (ChromaDB, FAISS, Weaviate)
- Cognitive reasoning system
- Production-ready APIs
- Comprehensive documentation
For detailed changelog, see: CHANGELOG.md (if available)
- Replace tokenizer calls:
# Old
tokens = tokenizer.tokenize(text)
# New
from santok import TextTokenizationEngine
engine = TextTokenizationEngine()
result = engine.tokenize(text, method="word")
tokens = result['tokens']
- Update to use UIDs:
# Old: Random IDs
token_ids = [random_id(t) for t in tokens]
# New: Deterministic UIDs
token_ids = [token.uid for token in token_records]
- Migrate embeddings:
# Old: External embeddings
embeddings = bert_model.encode(tokens)
# New: SanTOK embeddings
from src.embeddings.embedding_generator import SanTOKEmbeddingGenerator
generator = SanTOKEmbeddingGenerator(strategy="semantic")
embeddings = generator.generate_embeddings(token_records)
- Replace vector store:
# Old: Generic vector store
store = VectorStore()
# New: SanTOK vector store
from src.embeddings.vector_store import ChromaVectorStore
store = ChromaVectorStore(collection_name="documents")
- Add cognitive reasoning:
# Old: Simple retrieval
results = store.search(query)
# New: Cognitive reasoning
from santok_cognitive.memory.unified_memory import UnifiedMemory
memory = UnifiedMemory()
results = memory.search_by_content(query)
# Includes: Reasoning, validation, explainability
SanTOK - Your complete solution for text processing, from tokenization to cognitive reasoning and production deployment.