Commit c53bf97

jclaggett, Gorm, and Copilot authored
nilpotent explanation and practical usage (#310)
* Add Practical Usage section, nilpotent explanation, and alternatives explored

  - Added new 'Practical Usage' section with:
    - Recommended configuration (64-bit cells)
    - Low-entropy detection guidance
    - Strategies for handling problematic data (compression, salting, chunking)
  - Replaced AI-generated appendix with proper mathematical explanation:
    - Why upper triangular matrices form a nilpotent group
    - How nilpotency causes the low-entropy failure (2^k coefficient analysis)
    - Why this is fundamental to the chosen algebraic structure
    - Trade-offs and alternative structures (GL(n, F_p))
  - Added 'Appendix B: Alternatives Explored' documenting:
    - Quaternion multiplication experiments (same nilpotent behavior)
    - Modified fusing operations that broke associativity
    - Why UTM remains the best practical choice despite limitations
  - Fixed typo: 'fasted' -> 'fastest' in Conclusion

* feat: add disclaimer to appendix

  I'm not able to construct such a well written argument but I can at least
  read along to make sure it makes sense. The math is valid at least.

  also: remove extra verbiage

* Update src/math/hashing/hashfusing.clj

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/math/hashing/hashfusing.clj

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Gorm <gorm@clawd.bot>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 3619f26 commit c53bf97

1 file changed: +176 -36 lines changed

src/math/hashing/hashfusing.clj

Lines changed: 176 additions & 36 deletions
@@ -660,28 +660,26 @@
 ;; represents a doubling of the number of fuses of the same value, folding
 ;; shows that fusing can _not_ reliably represent long runs of repeated values.

-;; ## Conclusion
+;; ## Practical Usage

-;; This article described a method for fusing hashes together using upper
-;; triangular matrix multiplication. The method is associative and
-;; non-commutative.
+;; Based on the experiments above, here are recommendations for using hash
+;; fusing in practice.

-;; Experiment 1 showed that high entropy data can be well represented by fusing
-;; hashes together and that the size of cell fields, from 8 to 64 bits, all
-;; performed well with high entropy data. Since 64 bit cells fit in the
-;; smallest matrix, this is the fasted to compute.
+;; ### Recommended Configuration

-;; Experiment 2 showed that low entropy data, specifically long runs of
-;; repeated values, can _not_ be well represented by fusing hashes together
-;; since they all approach a zero hash. The rate of approach is determined by
-;; the bit count in the cells. 64 bit cells were the most able to handle
-;; repeated values.
+;; Use 64-bit cells (4x4 matrix). This configuration offers:
+;; - The smallest matrix size, making it the fastest to compute
+;; - The highest tolerance for repeated values (64 folds before zero)
+;; - No observed collisions with high-entropy data

-;; In practice, hash fusing with upper triangular matrix multiplication is
-;; okay to use if a check is made after each fuse to ensure the lower bits
-;; (e.g., the lowest 32 bits) are not all zero. If this check fails, a
-;; LowEntropyDataError can be raised to indicate the data being hashed has
-;; insufficient entropy.
+;; ### Detecting Low-Entropy Failures
+
+;; The key insight from Experiment 2 is that low-entropy data causes the lower
+;; bits to become zero progressively. We can detect this condition and fail
+;; fast rather than silently producing degenerate hashes.
+
+;; The following function fuses two hashes and raises an error if the result
+;; shows signs of low-entropy degeneration (lower 32 bits all zero):

 (defn high-entropy-fuse
 "Fuse two 256-bit hashes together via upper triangular matrix multiplication
@@ -709,21 +707,163 @@
 (catch Exception e
 (.getMessage e)))

-;; ## Appendix: Why Low Entropy Data Fails (AI Explanation)
-
-;; _The following explanation was generated by Copilot in response to the
-;; question: "Why does low entropy data fail to be represented by hash fusing
-;; via upper triangular matrix multiplication?"_
-
-;; The reason low entropy data fails to be represented by hash fusing is that
-;; the multiplication of upper triangular matrices causes the lower bits of the
-;; resulting matrix to rapidly approach zero when the same matrix is repeatedly
-;; multiplied by itself. This is because the multiplication operation involves
-;; summing products of the matrix elements, and when the same matrix is used
-;; repeatedly, the contributions to the lower bits become increasingly
-;; negligible compared to the higher bits. As a result, after a certain number
-;; of multiplications, the lower bits of the resulting matrix become zero,
-;; leading to a loss of information and the inability to distinguish between
-;; different inputs that produce the same low-entropy pattern. This phenomenon
-;; highlights the limitations of using hash fusing for low-entropy data, as it
-;; fails to provide unique identifiers for such data.
+;; ### Handling Low-Entropy Data
+
+;; When working with data that may contain long runs of repeated values,
+;; consider these strategies:
+;;
+;; 1. **Compression**: Run-length encode or compress the data before hashing.
+;;    This converts low-entropy patterns into high-entropy representations.
+;;
+;; 2. **Salting**: Inject position-dependent noise into the hash sequence.
+;;    For example, XOR each element's hash with a hash of its index.
+;;
+;; 3. **Chunking**: Break long sequences into smaller chunks and hash each
+;;    chunk separately before fusing. This limits the damage from repetition.
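
;; As a minimal sketch of the compression strategy, run-length encoding can be
;; written with core sequence functions. The name `run-length-encode` is
;; hypothetical and not part of this namespace; the resulting pairs would still
;; need to be hashed and fused by the caller.

```clojure
;; Hypothetical helper for illustration only (not defined in hashfusing.clj).
(defn run-length-encode
  "Collapse each run of equal adjacent values into a [value run-length] pair."
  [xs]
  (map (juxt first count) (partition-by identity xs)))

(run-length-encode [:a :a :a :a :b :c :c])
;; => ([:a 4] [:b 1] [:c 2])
```

;; A long run of one value becomes a single pair, so the fused sequence no
;; longer contains the repetition that drives the cells toward zero.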
+
+;; ## Conclusion
+
+;; This article described a method for fusing hashes together using upper
+;; triangular matrix multiplication. The method is associative and
+;; non-commutative.
+
+;; Experiment 1 showed that high entropy data can be well represented by fusing
+;; hashes together and that the size of cell fields, from 8 to 64 bits, all
+;; performed well with high entropy data. Since 64 bit cells fit in the
+;; smallest matrix, this is the fastest to compute.
+
+;; Experiment 2 showed that low entropy data, specifically long runs of
+;; repeated values, can _not_ be well represented by fusing hashes together
+;; since they all approach a zero hash. The rate of approach is determined by
+;; the bit count in the cells. 64 bit cells were the most able to handle
+;; repeated values.
+
+;; The Practical Usage section provides a working implementation with
+;; low-entropy detection, along with strategies for handling problematic data.
+
+;; ## Appendix: Why Low Entropy Data Fails (Nilpotent Group Structure)
+
+;; _The following explanation, based on interactions with a Claude-based AI
+;; assistant, is very well written and accurate._
+
+;; The low-entropy failure observed in Experiment 2 is not an accident of
+;; implementation; it is a fundamental consequence of the algebraic structure
+;; we chose. Understanding this helps clarify both the limitations and the
+;; design space for alternatives.
+
+;; ### Upper Unitriangular Matrices Form a Nilpotent Group
+
+;; An n×n upper triangular matrix with 1s on the diagonal (a unitriangular
+;; matrix) can be written as I + N, where I is the identity matrix and N is
+;; _strictly_ upper triangular (all diagonal entries are 0). The key property
+;; of strictly upper triangular matrices is that they are **nilpotent**:
+;; repeated multiplication eventually yields zero.
+
+;; Specifically, for an n×n strictly upper triangular matrix N:
+;;
+;;     N^n = 0   (the zero matrix)
+;;
+;; This happens because each multiplication by N "shifts" the non-zero entries
+;; one diagonal further from the main diagonal. After n-1 multiplications, all
+;; entries have been shifted off the matrix.
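
;; The diagonal "shift" can be checked by hand on a small case. The sketch
;; below uses plain vectors of rows; `mat-mul` and `N` are illustrative names
;; and are not taken from hashfusing.clj.

```clojure
;; Illustrative helper, not the matrix code used by the article.
(defn mat-mul
  "Multiply two square matrices given as vectors of row vectors."
  [a b]
  (let [n (count a)]
    (vec (for [i (range n)]
           (vec (for [j (range n)]
                  (reduce + (for [k (range n)]
                              (* (get-in a [i k]) (get-in b [k j]))))))))))

;; A strictly upper triangular 4x4 matrix (zeros on and below the diagonal).
(def N [[0 1 2 3]
        [0 0 4 5]
        [0 0 0 6]
        [0 0 0 0]])

(def N2 (mat-mul N N))
;; => [[0 0 4 17] [0 0 0 24] [0 0 0 0] [0 0 0 0]]

(def N3 (mat-mul N2 N))
;; => [[0 0 0 24] [0 0 0 0] [0 0 0 0] [0 0 0 0]]

(def N4 (mat-mul N3 N))
;; => [[0 0 0 0] [0 0 0 0] [0 0 0 0] [0 0 0 0]]
```

;; Each product pushes the surviving entries one diagonal toward the top-right
;; corner, and the fourth power of this 4x4 example is the zero matrix.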
+
+;; ### How This Affects Hash Fusing
+
+;; When we fuse a hash with itself (A × A), we are computing (I + N) × (I + N):
+;;
+;;     (I + N)² = I + 2N + N²
+;;
+;; With our bit masking over w-bit cells, arithmetic is effectively done
+;; modulo 2^w. Multiplying by 2 shifts all bits left and, after masking back
+;; to w bits, forces the least significant bit of each cell of 2N to 0. Since I
+;; and N commute, the binomial theorem gives, after k squarings:
+;;
+;;     (I + N)^(2^k) = I + 2^k · N + C(2^k, 2) · N² + ... + C(2^k, n-1) · N^(n-1)
+;;
+;; The coefficient 2^k forces the lowest k bits of the N term to zero under
+;; mod 2^w / bitmasking. The higher powers lag only slightly: C(2^k, j) is
+;; divisible by 2^(k-v), where 2^v is the largest power of 2 dividing j.
+;; This matches what Experiment 2 observed: each fold (squaring) zeros out
+;; one more bit, and after about cell-size folds, all bits are zero.
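
;; The bit-by-bit collapse can be reproduced on a toy case. The sketch below
;; assumes a hypothetical 8-bit cell width and a 3x3 matrix of plain vectors;
;; `w`, `mask`, `mat-mul-mod`, `square`, and `A` are illustrative names, not the
;; article's implementation. The cells next to the diagonal carry the 2^k
;; coefficient and reach zero after w folds. The corner cell also receives the
;; N² term, whose binomial coefficient has one fewer factor of 2, so it takes
;; one fold more.

```clojure
(def w 8)                              ; assumed toy cell width, in bits
(def mask (dec (bit-shift-left 1 w)))  ; 0xFF

(defn mat-mul-mod
  "Multiply square matrices (vectors of row vectors), masking each cell to w bits."
  [a b]
  (let [n (count a)]
    (vec (for [i (range n)]
           (vec (for [j (range n)]
                  (bit-and mask
                           (reduce + (for [k (range n)]
                                       (* (get-in a [i k]) (get-in b [k j])))))))))))

(defn square
  "One fold: fuse a matrix with itself."
  [m]
  (mat-mul-mod m m))

;; I + N with every cell above the diagonal set to 1.
(def A [[1 1 1]
        [0 1 1]
        [0 0 1]])

(def folds (iterate square A))

(nth folds 1) ;; => [[1 2 3] [0 1 2] [0 0 1]]
(nth folds 3) ;; => [[1 8 36] [0 1 8] [0 0 1]]
(nth folds 8) ;; => [[1 0 128] [0 1 0] [0 0 1]]
(nth folds 9) ;; => [[1 0 0] [0 1 0] [0 0 1]]
```

;; After nine folds the result is the identity matrix, so every longer run of
;; this value fuses to the same degenerate hash.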
+
+;; ### Why This Is Fundamental
+
+;; The nilpotent structure is intrinsic to upper triangular matrices with 1s
+;; on the diagonal. Any associative, non-commutative composition based on this
+;; structure will exhibit the same behavior. This is not a bug; it is a
+;; mathematical property of the group we chose.
+
+;; ### The Trade-off
+
+;; We chose upper triangular matrices because they provide:
+;; - **Associativity**: Essential for Finger Trees
+;; - **Non-commutativity**: Essential for ordered sequences
+;; - **Efficient representation**: The matrix structure maps cleanly to hash bits
+;;
+;; The cost is nilpotency, which manifests as the low-entropy failure mode.
+;; This trade-off is acceptable for high-entropy data (which is the common
+;; case for content hashes) and can be mitigated with detection and
+;; preprocessing for edge cases.
+
+;; ### Alternative Structures
+
+;; For applications where low-entropy data is common, one could explore
+;; non-nilpotent groups such as GL(n, F_p), the general linear group over
+;; a finite field. However, this would sacrifice the sparse matrix efficiency
+;; and require more complex encoding. The upper triangular approach remains
+;; practical for most content-addressed data structure applications.
+
+;; ## Appendix B: Alternatives Explored
+
+;; Before settling on upper triangular matrices, several alternative approaches
+;; were investigated. This section documents those explorations and explains
+;; why they were not suitable.
+
+;; ### Quaternion Multiplication
+
+;; Quaternions are a natural candidate: they form a non-commutative algebra
+;; and have efficient 4-component representations. However, experiments with
+;; quaternion-based fusing revealed the same nilpotent-like behavior observed
+;; with upper triangular matrices.
+;;
+;; When quaternions are encoded similarly to the UTM approach, with values
+;; close to the identity element (1 + εi + εj + εk where ε represents small
+;; perturbations), repeated self-multiplication causes the perturbation terms
+;; to degenerate. While unit quaternions form a proper group (isomorphic
+;; to SU(2)), the encoding required for hash values reintroduces the same
+;; fundamental problem.
+
+;; ### Modified Fusing Operations
+
+;; Various modifications to the basic fusing operation were attempted,
+;; including:
+;; - XOR combined with rotations
+;; - Nonlinear mixing functions
+;; - Polynomial operations over finite fields
+;;
+;; The consistent finding was that **associativity is surprisingly fragile**.
+;; Most intuitive "mixing" operations that might improve entropy preservation
+;; break associativity, which is essential for the Finger Tree use case.
+;;
+;; Matrix multiplication is special precisely because it is one of the few
+;; operations that provide both associativity and non-commutativity naturally.
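
;; How easily associativity breaks can be seen with a toy mixer. The function
;; `rot-xor` below is a hypothetical example of "XOR combined with rotations";
;; it is not one of the operations actually tried for this article.

```clojure
;; Hypothetical mixer for illustration: rotate the left operand by one bit,
;; then XOR in the right operand.
(defn rot-xor
  [a b]
  (bit-xor (Long/rotateLeft a 1) b))

;; Grouping to the left rotates the first value twice...
(rot-xor (rot-xor 1 0) 0) ;; => 4

;; ...while grouping to the right rotates it only once.
(rot-xor 1 (rot-xor 0 0)) ;; => 2
```

;; The two groupings disagree, so a tree that re-brackets its elements (as a
;; Finger Tree does) would produce different hashes for the same sequence.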
+
+;; ### Why Upper Triangular Matrices Were Chosen
+
+;; Despite the low-entropy limitation, upper triangular matrices remain the
+;; best practical choice because:
+;;
+;; 1. **Guaranteed associativity**: Matrix multiplication is inherently
+;;    associative; this cannot be broken by implementation choices.
+;;
+;; 2. **Guaranteed non-commutativity**: For n ≥ 3, multiplication of these
+;;    matrices is non-commutative, preserving sequence order.
+;;
+;; 3. **Efficient sparse encoding**: The upper triangular structure means
+;;    only n(n-1)/2 cells carry information, mapping cleanly to hash bits.
+;;
+;; 4. **Predictable failure mode**: The low-entropy degeneration is
+;;    detectable and follows a known pattern (one bit per fold), making
+;;    it possible to implement safeguards.
+;;
+;; The alternatives explored either broke associativity (disqualifying them
+;; for Finger Trees) or exhibited the same degeneration behavior without
+;; the other benefits of the matrix approach.
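
;; The order sensitivity can be checked on small matrices with 1s on the
;; diagonal. This sketch uses plain vectors of rows; `mat-mul`, `A`, `B`, `P`,
;; and `Q` are illustrative names, not the article's implementation. It also
;; shows that the 2x2 case still commutes, which is why at least a 3x3 matrix
;; is needed for the product to record order.

```clojure
;; Illustrative helper, not the matrix code used by the article.
(defn mat-mul
  "Multiply two square matrices given as vectors of row vectors."
  [a b]
  (let [n (count a)]
    (vec (for [i (range n)]
           (vec (for [j (range n)]
                  (reduce + (for [k (range n)]
                              (* (get-in a [i k]) (get-in b [k j]))))))))))

(def A [[1 1 0]
        [0 1 0]
        [0 0 1]])

(def B [[1 0 0]
        [0 1 1]
        [0 0 1]])

(mat-mul A B) ;; => [[1 1 1] [0 1 1] [0 0 1]]
(mat-mul B A) ;; => [[1 1 0] [0 1 1] [0 0 1]]

;; With 2x2 matrices of the same shape, the single off-diagonal cells just add.
(def P [[1 2] [0 1]])
(def Q [[1 5] [0 1]])

(mat-mul P Q) ;; => [[1 7] [0 1]]
(mat-mul Q P) ;; => [[1 7] [0 1]]
```

;; In the 3x3 case only the top-right cell differs between the two orders; that
;; cell is where the product records which operand came first.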
