nilpotent explanation and practical usage (#310)

jclaggett · Gorm · Copilot · web-flow · commit c53bf97ddb79 · 2026-01-29T10:02:01.000-05:00
* Add Practical Usage section, nilpotent explanation, and alternatives explored

- Added new 'Practical Usage' section with:
  - Recommended configuration (64-bit cells)
  - Low-entropy detection guidance
  - Strategies for handling problematic data (compression, salting, chunking)

- Replaced AI-generated appendix with proper mathematical explanation:
  - Why upper triangular matrices form a nilpotent group
  - How nilpotency causes the low-entropy failure (2^k coefficient analysis)
  - Why this is fundamental to the chosen algebraic structure
  - Trade-offs and alternative structures (GL(n, F_p))

- Added 'Appendix B: Alternatives Explored' documenting:
  - Quaternion multiplication experiments (same nilpotent behavior)
  - Modified fusing operations that broke associativity
  - Why UTM remains the best practical choice despite limitations

- Fixed typo: 'fasted' -> 'fastest' in Conclusion

* feat: add disclaimer to appendix

I'm not able to construct such a well written argument but I can at
least read along to make sure it makes sense. The math is valid at
least.

also: remove extra verbiage

* Update src/math/hashing/hashfusing.clj

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/math/hashing/hashfusing.clj

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Gorm <gorm@clawd.bot>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
diff --git a/src/math/hashing/hashfusing.clj b/src/math/hashing/hashfusing.clj
@@ -660,28 +660,26 @@
 ;; represents a doubling of the number of fuses of the same value, folding
 ;; shows that fusing can _not_ reliably represent long runs of repeated values.
 
-;; ## Conclusion
+;; ## Practical Usage
 
-;; This article described a method for fusing hashes together using upper
-;; triangular matrix multiplication. The method is associative and
-;; non-commutative.
+;; Based on the experiments above, here are recommendations for using hash
+;; fusing in practice.
 
-;; Experiment 1 showed that high entropy data can be well represented by fusing
-;; hashes together and that the size of cell fields, from 8 to 64 bits, all
-;; performed well with high entropy data. Since 64 bit cells fit in the
-;; smallest matrix, this is the fasted to compute.
+;; ### Recommended Configuration
 
-;; Experiment 2 showed that low entropy data, specifically long runs of
-;; repeated values, can _not_ be well represented by fusing hashes together
-;; since they all approach a zero hash. The rate of approach is determined by
-;; the bit count in the cells. 64 bit cells were the most able to handle
-;; repeated values.
+;; Use 64-bit cells (4x4 matrix). This configuration offers:
+;; - The smallest matrix size, making it the fastest to compute
+;; - The highest tolerance for repeated values (64 folds before zero)
+;; - No observed collisions with high-entropy data
 
-;; In practice, hash fusing with upper triangular matrix multiplication is
-;; okay to use if a check is made after each fuse to ensure the lower bits
-;; (e.g., the lowest 32 bits) are not all zero. If this check fails, a
-;; LowEntropyDataError can be raised to indicate the data being hashed has
-;; insufficient entropy.
+;; ### Detecting Low-Entropy Failures
+
+;; The key insight from Experiment 2 is that low-entropy data causes the lower
+;; bits to become zero progressively. We can detect this condition and fail
+;; fast rather than silently producing degenerate hashes.
+
+;; The following function fuses two hashes and raises an error if the result
+;; shows signs of low-entropy degeneration (lower 32 bits all zero):
 
 (defn high-entropy-fuse
   "Fuse two 256-bit hashes together via upper triangular matrix multiplication
@@ -709,21 +707,163 @@
   (catch Exception e
          (.getMessage e)))
 
-;; ## Appendix: Why Low Entropy Data Fails (AI Explanation)
-
-;; _The following explanation was generated by Copilot in response to the
-;; question: "Why does low entropy data fail to be represented by hash fusing
-;; via upper triangular matrix multiplication?"_
-
-;; The reason low entropy data fails to be represented by hash fusing is that
-;; the multiplication of upper triangular matrices causes the lower bits of the
-;; resulting matrix to rapidly approach zero when the same matrix is repeatedly
-;; multiplied by itself. This is because the multiplication operation involves
-;; summing products of the matrix elements, and when the same matrix is used
-;; repeatedly, the contributions to the lower bits become increasingly
-;; negligible compared to the higher bits. As a result, after a certain number
-;; of multiplications, the lower bits of the resulting matrix become zero,
-;; leading to a loss of information and the inability to distinguish between
-;; different inputs that produce the same low-entropy pattern. This phenomenon
-;; highlights the limitations of using hash fusing for low-entropy data, as it
-;; fails to provide unique identifiers for such data.
+;; ### Handling Low-Entropy Data
+
+;; When working with data that may contain long runs of repeated values,
+;; consider these strategies:
+;;
+;; 1. **Compression**: Run-length encode or compress the data before hashing.
+;;    This converts low-entropy patterns into high-entropy representations.
+;;
+;; 2. **Salting**: Inject position-dependent noise into the hash sequence.
+;;    For example, XOR each element's hash with a hash of its index.
+;;
+;; 3. **Chunking**: Break long sequences into smaller chunks and hash each
+;;    chunk separately before fusing. This limits the damage from repetition.
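+
+;; As a rough illustration of the compression strategy, the sketch below
+;; run-length encodes a sequence so that each run becomes one [value count]
+;; pair. `run-length-encode` is a hypothetical helper, not part of this
+;; namespace; it assumes the pairs, rather than the raw elements, would then
+;; be hashed and fused.
+
+(defn run-length-encode
+  "Collapse each run of consecutive equal values into a [value run-length] pair."
+  [xs]
+  (mapv (juxt first count) (partition-by identity xs)))
+
+(run-length-encode [:a :a :a :a :b :c :c])
+;; => [[:a 4] [:b 1] [:c 2]]
+
+;; A long run of identical values then contributes a single pair to the
+;; fused sequence instead of many identical hashes.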
+
+;; ## Conclusion
+
+;; This article described a method for fusing hashes together using upper
+;; triangular matrix multiplication. The method is associative and
+;; non-commutative.
+
+;; Experiment 1 showed that high-entropy data can be well represented by fusing
+;; hashes together and that all cell sizes, from 8 to 64 bits, performed well
+;; with high-entropy data. Since 64-bit cells fit in the smallest matrix, this
+;; is the fastest configuration to compute.
+
+;; Experiment 2 showed that low-entropy data, specifically long runs of
+;; repeated values, can _not_ be well represented by fusing hashes together
+;; since such runs all approach a zero hash. The rate of approach is determined
+;; by the bit count in the cells. 64-bit cells tolerated repeated values best.
+
+;; The Practical Usage section provides a working implementation with
+;; low-entropy detection, along with strategies for handling problematic data.
+
+;; ## Appendix: Why Low Entropy Data Fails — Nilpotent Group Structure
+
+;; _The following explanation, based on interactions with a Claude-based AI
+;; assistant, is very well written and accurate._
+
+;; The low-entropy failure observed in Experiment 2 is not an accident of
+;; implementation — it is a fundamental consequence of the algebraic structure
+;; we chose. Understanding this helps clarify both the limitations and the
+;; design space for alternatives.
+
+;; ### Upper Triangular Matrices Form a Nilpotent Group
+
+;; An n×n upper triangular matrix with 1s on the diagonal can be written as
+;; I + N, where I is the identity matrix and N is _strictly_ upper triangular
+;; (all diagonal entries are 0). The key property of strictly upper triangular
+;; matrices is that they are **nilpotent**: repeated multiplication eventually
+;; yields zero.
+
+;; Specifically, for an n×n strictly upper triangular matrix N:
+;;
+;;   N^n = 0 (the zero matrix)
+;;
+;; This happens because each multiplication "shifts" the non-zero entries one
+;; diagonal further from the main diagonal. After n-1 multiplications, all
+;; entries have been shifted off the matrix.
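+
+;; The shift can be checked directly. The sketch below uses plain Clojure
+;; vectors and an arbitrary 4x4 strictly upper triangular matrix; the names
+;; `nilpotent-demo-mul` and `nilpotent-demo-n` are illustrative only and the
+;; nested-vector representation is an assumption made for readability.
+
+(defn nilpotent-demo-mul
+  "Multiply two square matrices given as vectors of row vectors."
+  [a b]
+  (let [n (count a)]
+    (mapv (fn [i]
+            (mapv (fn [j]
+                    (reduce + (map (fn [k] (* (get-in a [i k]) (get-in b [k j])))
+                                   (range n))))
+                  (range n)))
+          (range n))))
+
+(def nilpotent-demo-n
+  [[0 1 2 3]
+   [0 0 4 5]
+   [0 0 0 6]
+   [0 0 0 0]])
+
+;; Successive powers N, N^2, N^3, N^4:
+(take 4 (iterate #(nilpotent-demo-mul % nilpotent-demo-n) nilpotent-demo-n))
+;; N   = [[0 1 2 3]  [0 0 4 5]  [0 0 0 6] [0 0 0 0]]
+;; N^2 = [[0 0 4 17] [0 0 0 24] [0 0 0 0] [0 0 0 0]]
+;; N^3 = [[0 0 0 24] [0 0 0 0]  [0 0 0 0] [0 0 0 0]]
+;; N^4 = [[0 0 0 0]  [0 0 0 0]  [0 0 0 0] [0 0 0 0]]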
+
+;; ### How This Affects Hash Fusing
+
+;; When we fuse a hash with itself (A × A), we are computing (I + N) × (I + N):
+;;
+;;   (I + N)² = I + 2N + N²
+;;
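+;; With our bit masking over w-bit cells, arithmetic is effectively done
+;; modulo 2^w. Multiplying by 2 shifts all bits left, so the 2N term has a
+;; least significant bit of 0 in every cell; only the N² term can still
+;; contribute an odd value, and only to cells off the first superdiagonal.
+;; Because I and N commute, the binomial theorem gives, after k squarings:
+;;
+;;   (I + N)^(2^k) = I + 2^k · N + C(2^k, 2) · N² + ... + C(2^k, n-1) · N^(n-1)
+;;
+;; The coefficient 2^k forces the lowest k bits of every cell on the first
+;; superdiagonal to zero under mod 2^w. The remaining coefficients C(2^k, j)
+;; are divisible by 2^(k - v), where 2^v is the largest power of 2 dividing
+;; j, so cells farther from the diagonal lag by at most floor(log2(n-1))
+;; bits (one bit for a 4x4 matrix). This is the pattern Experiment 2
+;; observed: each fold (squaring) zeros out one more bit, and after roughly
+;; cell-size folds, all bits are zero.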
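+
+;; The bit-by-bit decay can be reproduced with small cells. The sketch below
+;; assumes 8-bit cells (w = 8) and an arbitrary 4x4 matrix with odd entries;
+;; `fold-demo-mul`, `fold-demo-folds`, and `fold-demo-m` are illustrative
+;; names, and the nested-vector representation is an assumption made for
+;; readability. Cells next to the diagonal lose one bit per fold; cells
+;; farther away can lag by a bit because the N² term carries the coefficient
+;; C(2^k, 2) rather than 2^k.
+
+(defn fold-demo-mul
+  "Multiply two square matrices, masking every cell back to w bits."
+  [w a b]
+  (let [n    (count a)
+        mask (dec (bit-shift-left 1 w))]
+    (mapv (fn [i]
+            (mapv (fn [j]
+                    (bit-and mask
+                             (reduce + (map (fn [k] (* (get-in a [i k])
+                                                       (get-in b [k j])))
+                                            (range n)))))
+                  (range n)))
+          (range n))))
+
+(defn fold-demo-folds
+  "Lazy sequence m, m^2, m^4, ... where each step fuses the value with itself."
+  [w m]
+  (iterate #(fold-demo-mul w % %) m))
+
+(def fold-demo-m
+  [[1 3 5 7]
+   [0 1 9 11]
+   [0 0 1 13]
+   [0 0 0 1]])
+
+;; Cell next to the diagonal: one more trailing zero bit per fold.
+(map #(get-in % [0 1]) (take 10 (fold-demo-folds 8 fold-demo-m)))
+;; => (3 6 12 24 48 96 192 128 0 0)
+
+;; Cell two steps from the diagonal: same pattern, one fold behind.
+(map #(get-in % [0 2]) (take 10 (fold-demo-folds 8 fold-demo-m)))
+;; => (5 37 182 28 248 240 224 192 128 0)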
+
+;; ### Why This Is Fundamental
+
+;; The nilpotent structure is intrinsic to upper triangular matrices. Any
+;; associative, non-commutative composition based on this structure will
+;; exhibit the same behavior. This is not a bug — it's a mathematical
+;; property of the group we chose.
+
+;; ### The Trade-off
+
+;; We chose upper triangular matrices because they provide:
+;; - **Associativity**: Essential for Finger Trees
+;; - **Non-commutativity**: Essential for ordered sequences
+;; - **Efficient representation**: The matrix structure maps cleanly to hash bits
+;;
+;; The cost is nilpotency, which manifests as the low-entropy failure mode.
+;; This trade-off is acceptable for high-entropy data (which is the common
+;; case for content hashes) and can be mitigated with detection and
+;; preprocessing for edge cases.
+
+;; ### Alternative Structures
+
+;; For applications where low-entropy data is common, one could explore
+;; non-nilpotent groups such as GL(n, F_p) — the general linear group over
+;; a finite field. However, this would sacrifice the sparse matrix efficiency
+;; and require more complex encoding. The upper triangular approach remains
+;; practical for most content-addressed data structure applications.
+
+;; ## Appendix B: Alternatives Explored
+
+;; Before settling on upper triangular matrices, several alternative approaches
+;; were investigated. This section documents those explorations and explains
+;; why they were not suitable.
+
+;; ### Quaternion Multiplication
+
+;; Quaternions are a natural candidate: they form a non-commutative algebra
+;; and have efficient 4-component representations. However, experiments with
+;; quaternion-based fusing revealed the same nilpotent-like behavior observed
+;; with upper triangular matrices.
+;;
+;; When quaternions are encoded similarly to the UTM approach, with values
+;; close to the identity element (1 + εi + εj + εk, where ε represents small
+;; perturbations), repeated self-multiplication causes the perturbation terms
+;; to degenerate. While unit quaternions form a proper group (isomorphic
+;; to SU(2)), the encoding required for hash values reintroduces the same
+;; fundamental problem.
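+
+;; One way to see the degeneration, assuming quaternion components are stored
+;; as w-bit integers with wrapping arithmetic (an assumption; the encoding
+;; used in the original experiments is not shown here): squaring a + v gives
+;; (a² - |v|²) + 2a·v, so the imaginary part picks up a factor of 2 on every
+;; fold. `quat-demo-mul` and `quat-demo-folds` are illustrative names.
+
+(defn quat-demo-mul
+  "Hamilton product of quaternions [a b c d], each component masked to w bits."
+  [w [a1 b1 c1 d1] [a2 b2 c2 d2]]
+  (let [mask (dec (bit-shift-left 1 w))]
+    (mapv #(bit-and mask %)
+          [(- (* a1 a2) (* b1 b2) (* c1 c2) (* d1 d2))
+           (+ (* a1 b2) (* b1 a2) (- (* c1 d2) (* d1 c2)))
+           (+ (* a1 c2) (* c1 a2) (- (* d1 b2) (* b1 d2)))
+           (+ (* a1 d2) (* d1 a2) (- (* b1 c2) (* c1 b2)))])))
+
+(defn quat-demo-folds
+  "Lazy sequence q, q^2, q^4, ... with w-bit components."
+  [w q]
+  (iterate #(quat-demo-mul w % %) q))
+
+(nth (quat-demo-folds 8 [1 3 5 2]) 1)
+;; => [219 6 10 4]
+
+;; After 8 folds with 8-bit components the imaginary part is gone.
+(rest (nth (quat-demo-folds 8 [1 3 5 2]) 8))
+;; => (0 0 0)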
+
+;; ### Modified Fusing Operations
+
+;; Various modifications to the basic fusing operation were attempted,
+;; including:
+;; - XOR combined with rotations
+;; - Nonlinear mixing functions
+;; - Polynomial operations over finite fields
+;;
+;; The consistent finding was that **associativity is surprisingly fragile**.
+;; Most intuitive "mixing" operations that might improve entropy preservation
+;; break associativity, which is essential for the Finger Tree use case.
+;;
+;; Matrix multiplication is special precisely because it is one of the few
+;; operations that provides both associativity and non-commutativity naturally.
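+
+;; A small counterexample shows how easily associativity is lost. The
+;; rotate-then-XOR step below is a hypothetical stand-in for this family of
+;; mixing operations, not one of the operations actually tried; `rotl8` and
+;; `xor-rot-mix` are illustrative names working on 8-bit values.
+
+(defn rotl8
+  "Rotate an 8-bit value left by one bit."
+  [x]
+  (bit-and 0xFF (bit-or (bit-shift-left x 1) (bit-shift-right x 7))))
+
+(defn xor-rot-mix
+  "Hypothetical mixing step: rotate the left operand, then XOR in the right."
+  [a b]
+  (bit-xor (rotl8 a) b))
+
+(xor-rot-mix (xor-rot-mix 1 2) 3) ;; => 3
+(xor-rot-mix 1 (xor-rot-mix 2 3)) ;; => 5
+
+;; The two groupings disagree because the left operand is rotated twice in
+;; the first grouping and only once in the second.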
+
+;; ### Why Upper Triangular Matrices Were Chosen
+
+;; Despite the low-entropy limitation, upper triangular matrices remain the
+;; best practical choice because:
+;;
+;; 1. **Guaranteed associativity**: Matrix multiplication is inherently
+;;    associative — this cannot be broken by implementation choices.
+;;
+;; 2. **Guaranteed non-commutativity**: For n ≥ 3, multiplication of upper
+;;    triangular matrices with unit diagonal is non-commutative, preserving
+;;    sequence order.
+;;
+;; 3. **Efficient sparse encoding**: The upper triangular structure means
+;;    only n(n-1)/2 cells carry information, mapping cleanly to hash bits.
+;;
+;; 4. **Predictable failure mode**: The low-entropy degeneration is
+;;    detectable and follows a known pattern (one bit per fold), making
+;;    it possible to implement safeguards.
+;;
+;; The alternatives explored either broke associativity (disqualifying them
+;; for Finger Trees) or exhibited the same degeneration behavior without
+;; the other benefits of the matrix approach.