# Selective Contextual Reasoning (SCR)
Selective Contextual Reasoning is Membria’s method for delivering fast, context-aware answers without overwhelming Tiny Language Models. SCR retrieves only the most relevant knowledge fragments, confirms them, and injects them into the prompt in a structured way.
# Core Steps
1. **Semantic Retrieval.** The agent searches its local KV cache and the nearest gateway index for high-similarity chunks related to the current query.
2. **Filtering and Deduplication.** Retrieved chunks are ranked by freshness, confidence, and semantic distance. Duplicates and low-scoring items are discarded.
3. **Context Assembly.** The agent builds a compact prompt buffer composed of:
   - static facts (long-term)
   - dynamic facts (query-specific)
   - short recall from recent turns
4. **Local Inference.** The Tiny LM receives the trimmed context and generates a response. Because the prompt is lean, latency stays sub-200 ms even on modest hardware.
5. **Fallback to DoD.** If the model's confidence is below threshold or a cache miss occurs, the agent triggers a Distillation-on-Demand (DoD) cycle with a Big LLM (an end-to-end sketch of these steps follows the list).
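Taken together, the five steps form one short control loop. The sketch below is a minimal illustration, not Membria's actual code: the `Chunk` shape, the `cache.search`, `tiny_lm.generate`, and `big_llm.distill_on_demand` interfaces, and the 0.6 confidence threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # combined similarity / freshness / confidence score
    kind: str     # "static", "dynamic", or "recall"


def scr_answer(query, cache, tiny_lm, big_llm, top_k=16, min_confidence=0.6):
    """Selective Contextual Reasoning: retrieve, filter, assemble, infer, fall back."""
    # 1. Semantic retrieval from the local KV cache / nearest gateway index.
    candidates = cache.search(query, top_k=top_k)

    # 2. Filtering and deduplication: keep the best-scoring unique chunks.
    seen, kept = set(), []
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        if chunk.text not in seen and chunk.score > 0.3:
            seen.add(chunk.text)
            kept.append(chunk)

    # 3. Context assembly: static facts, then dynamic facts, then recent recall.
    order = {"static": 0, "dynamic": 1, "recall": 2}
    kept.sort(key=lambda c: order.get(c.kind, 3))
    prompt = "\n".join(c.text for c in kept) + f"\n\nQuestion: {query}"

    # 4. Local inference on the Tiny LM (assumed to return text plus a confidence).
    answer, confidence = tiny_lm.generate(prompt)

    # 5. Fallback to Distillation-on-Demand when confidence is low or nothing was retrieved.
    if confidence < min_confidence or not kept:
        answer = big_llm.distill_on_demand(query, context=prompt)
    return answer
```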
# Segmented KV Buffer
The prompt buffer built during context assembly is divided into segments (static facts, dynamic facts, and recent-turn recall). A scheduler prioritizes tokens from the highest-value segments until the model's context limit is reached.
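A minimal sketch of such a scheduler, assuming three segments with fixed priorities, a whitespace-based token estimate, and an illustrative 1024-token budget; none of these values come from Membria's specification.

```python
# Fill a fixed token budget segment by segment, highest-value segment first.
# Token counting is a crude whitespace split; a real tokenizer would differ.

SEGMENT_PRIORITY = ["static", "dynamic", "recall"]  # highest value first


def build_prompt(segments: dict[str, list[str]], token_budget: int = 1024) -> str:
    used = 0
    selected: list[str] = []
    for name in SEGMENT_PRIORITY:
        for chunk in segments.get(name, []):
            cost = len(chunk.split())        # approximate token count
            if used + cost > token_budget:   # budget exhausted: remaining chunks are paged out
                return "\n".join(selected)
            selected.append(chunk)
            used += cost
    return "\n".join(selected)


# Toy usage:
prompt = build_prompt({
    "static": ["User prefers metric units."],
    "dynamic": ["Today's meeting starts at 14:00."],
    "recall": ["Previous turn: user asked about travel time."],
})
```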
# Performance Advantages
- **Latency**: Retrieval and inference complete in < 200 ms on edge devices.
- **Token savings**: Up to 80 percent fewer Big-LLM calls versus naive RAG.
- **Personalization**: Local cache keeps user-specific facts close to the model.
- **Privacy**: Only distilled context leaves the device when fallback is needed.
# Implementation Notes
- Embedding search uses an HNSW index (for example via Faiss) for sub-millisecond nearest-neighbor lookups (see the sketch after these notes).
- Filtering relies on cosine similarity plus freshness weighting.
- Prompt budgets are enforced by a hard token cap; excess chunks are paged out.
- SCR runs in a separate thread so UI latency remains low.
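As a hedged illustration of the first two notes, the snippet below builds a Faiss HNSW index over L2-normalized vectors (so inner product equals cosine similarity) and re-ranks hits with an exponential freshness weight. The 384-dimensional embeddings, the HNSW parameter M = 32, and the 0.1 decay rate are assumptions for the example, not Membria defaults.

```python
import faiss
import numpy as np

DIM = 384  # assumed embedding size for a small sentence-embedding model

# HNSW index over inner product; with L2-normalized vectors this equals cosine similarity.
index = faiss.IndexHNSWFlat(DIM, 32, faiss.METRIC_INNER_PRODUCT)

chunk_vectors = np.random.rand(1000, DIM).astype("float32")  # stand-in chunk embeddings
chunk_age_hours = np.random.rand(1000) * 48                  # hours since each chunk was written
faiss.normalize_L2(chunk_vectors)
index.add(chunk_vectors)


def search(query_vec: np.ndarray, k: int = 16, decay: float = 0.1):
    """Return (chunk_id, score) pairs re-ranked by cosine similarity times a freshness weight."""
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    sims, ids = index.search(q, k)
    results = []
    for sim, idx in zip(sims[0], ids[0]):
        if idx == -1:                                       # fewer than k neighbors found
            continue
        freshness = np.exp(-decay * chunk_age_hours[idx])   # newer chunks weigh more
        results.append((int(idx), float(sim) * float(freshness)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```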
Selective Contextual Reasoning turns Tiny LMs into efficient, privacy-preserving assistants capable of answering most queries without expensive cloud inference.