|
660 | 660 | ;; represents a doubling of the number of fuses of the same value, folding |
661 | 661 | ;; shows that fusing can _not_ reliably represent long runs of repeated values. |
662 | 662 |
|
663 | | -;; ## Conclusion |
| 663 | +;; ## Practical Usage |
664 | 664 |
|
665 | | -;; This article described a method for fusing hashes together using upper |
666 | | -;; triangular matrix multiplication. The method is associative and |
667 | | -;; non-commutative. |
| 665 | +;; Based on the experiments above, here are recommendations for using hash |
| 666 | +;; fusing in practice. |
668 | 667 |
|
669 | | -;; Experiment 1 showed that high entropy data can be well represented by fusing |
670 | | -;; hashes together and that the size of cell fields, from 8 to 64 bits, all |
671 | | -;; performed well with high entropy data. Since 64 bit cells fit in the |
672 | | -;; smallest matrix, this is the fasted to compute. |
| 668 | +;; ### Recommended Configuration |
673 | 669 |
|
674 | | -;; Experiment 2 showed that low entropy data, specifically long runs of |
675 | | -;; repeated values, can _not_ be well represented by fusing hashes together |
676 | | -;; since they all approach a zero hash. The rate of approach is determined by |
677 | | -;; the bit count in the cells. 64 bit cells were the most able to handle |
678 | | -;; repeated values. |
| 670 | +;; Use 64-bit cells (4x4 matrix). This configuration offers: |
| 671 | +;; - The smallest matrix size, making it the fastest to compute |
| 672 | +;; - The highest tolerance for repeated values (64 folds before zero) |
| 673 | +;; - No observed collisions with high-entropy data |
679 | 674 |
|
680 | | -;; In practice, hash fusing with upper triangular matrix multiplication is |
681 | | -;; okay to use if a check is made after each fuse to ensure the lower bits |
682 | | -;; (e.g., the lowest 32 bits) are not all zero. If this check fails, a |
683 | | -;; LowEntropyDataError can be raised to indicate the data being hashed has |
684 | | -;; insufficient entropy. |
| 675 | +;; ### Detecting Low-Entropy Failures |
| 676 | + |
| 677 | +;; The key insight from Experiment 2 is that low-entropy data causes the lower |
| 678 | +;; bits to become zero progressively. We can detect this condition and fail |
| 679 | +;; fast rather than silently producing degenerate hashes. |
| 680 | + |
| 681 | +;; The following function fuses two hashes and raises an error if the result |
| 682 | +;; shows signs of low-entropy degeneration (lower 32 bits all zero): |
685 | 683 |
|
686 | 684 | (defn high-entropy-fuse |
687 | 685 | "Fuse two 256-bit hashes together via upper triangular matrix multiplication |
|
709 | 707 | (catch Exception e |
710 | 708 | (.getMessage e))) |
711 | 709 |
|
712 | | -;; ## Appendix: Why Low Entropy Data Fails (AI Explanation) |
713 | | - |
714 | | -;; _The following explanation was generated by Copilot in response to the |
715 | | -;; question: "Why does low entropy data fail to be represented by hash fusing |
716 | | -;; via upper triangular matrix multiplication?"_ |
717 | | - |
718 | | -;; The reason low entropy data fails to be represented by hash fusing is that |
719 | | -;; the multiplication of upper triangular matrices causes the lower bits of the |
720 | | -;; resulting matrix to rapidly approach zero when the same matrix is repeatedly |
721 | | -;; multiplied by itself. This is because the multiplication operation involves |
722 | | -;; summing products of the matrix elements, and when the same matrix is used |
723 | | -;; repeatedly, the contributions to the lower bits become increasingly |
724 | | -;; negligible compared to the higher bits. As a result, after a certain number |
725 | | -;; of multiplications, the lower bits of the resulting matrix become zero, |
726 | | -;; leading to a loss of information and the inability to distinguish between |
727 | | -;; different inputs that produce the same low-entropy pattern. This phenomenon |
728 | | -;; highlights the limitations of using hash fusing for low-entropy data, as it |
729 | | -;; fails to provide unique identifiers for such data. |
| 710 | +;; ### Handling Low-Entropy Data |
| 711 | + |
| 712 | +;; When working with data that may contain long runs of repeated values, |
| 713 | +;; consider these strategies: |
| 714 | +;; |
| 715 | +;; 1. **Compression**: Run-length encode or compress the data before hashing. |
| 716 | +;; This converts low-entropy patterns into high-entropy representations. |
| 717 | +;; |
| 718 | +;; 2. **Salting**: Inject position-dependent noise into the hash sequence. |
| 719 | +;; For example, XOR each element's hash with a hash of its index. |
| 720 | +;; |
| 721 | +;; 3. **Chunking**: Break long sequences into smaller chunks and hash each |
| 722 | +;; chunk separately before fusing. This limits the damage from repetition. |
| 723 | + |
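;; As a minimal sketch of the compression strategy above, a run of repeated
;; values can be collapsed into [value run-length] pairs before hashing. The
;; name `run-length-encode` is hypothetical and not part of this article's
;; code; hashing and fusing the resulting pairs is left out.

```clojure
(defn run-length-encode
  "Collapse each run of equal adjacent values into a [value run-length] pair."
  [xs]
  (mapv (juxt first count) (partition-by identity xs)))

(run-length-encode [:a :a :a :a :b :c :c])
;; => [[:a 4] [:b 1] [:c 2]]
```

;; A run of any length becomes a single pair, so the fused sequence no longer
;; contains the same hash many times in a row.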
| 724 | +;; ## Conclusion |
| 725 | + |
| 726 | +;; This article described a method for fusing hashes together using upper |
| 727 | +;; triangular matrix multiplication. The method is associative and |
| 728 | +;; non-commutative. |
| 729 | + |
| 730 | +;; Experiment 1 showed that high entropy data can be well represented by fusing |
| 731 | +;; hashes together and that the size of cell fields, from 8 to 64 bits, all |
| 732 | +;; performed well with high entropy data. Since 64 bit cells fit in the |
| 733 | +;; smallest matrix, this is the fastest to compute. |
| 734 | + |
| 735 | +;; Experiment 2 showed that low entropy data, specifically long runs of |
| 736 | +;; repeated values, can _not_ be well represented by fusing hashes together |
| 737 | +;; since they all approach a zero hash. The rate of approach is determined by |
| 738 | +;; the bit count in the cells. 64 bit cells were the most able to handle |
| 739 | +;; repeated values. |
| 740 | + |
| 741 | +;; The Practical Usage section provides a working implementation with |
| 742 | +;; low-entropy detection, along with strategies for handling problematic data. |
| 743 | + |
| 744 | +;; ## Appendix A: Why Low Entropy Data Fails (Nilpotent Group Structure)
| 745 | + |
| 746 | +;; _The following explanation, based on interactions with a Claude-based AI |
| 747 | +;; assistant, is very well written and accurate._ |
| 748 | + |
| 749 | +;; The low-entropy failure observed in Experiment 2 is not an accident of |
| 750 | +;; implementation — it is a fundamental consequence of the algebraic structure |
| 751 | +;; we chose. Understanding this helps clarify both the limitations and the |
| 752 | +;; design space for alternatives. |
| 753 | + |
| 754 | +;; ### Upper Triangular Matrices Form a Nilpotent Group |
| 755 | + |
| 756 | +;; An n×n upper triangular matrix with 1s on the diagonal can be written as |
| 757 | +;; I + N, where I is the identity matrix and N is _strictly_ upper triangular |
| 758 | +;; (all diagonal entries are 0). The key property of strictly upper triangular |
| 759 | +;; matrices is that they are **nilpotent**: repeated multiplication eventually |
| 760 | +;; yields zero. |
| 761 | + |
| 762 | +;; Specifically, for an n×n strictly upper triangular matrix N: |
| 763 | +;; |
| 764 | +;; N^n = 0 (the zero matrix) |
| 765 | +;; |
| 766 | +;; This happens because each multiplication "shifts" the non-zero entries one |
| 767 | +;; diagonal further from the main diagonal. After n-1 multiplications, all |
| 768 | +;; entries have been shifted off the matrix. |
| 769 | + |
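;; The nilpotency claim above can be checked directly. This is a small sketch
;; with plain integer arithmetic and no bit masking; `mat-mul`,
;; `strictly-upper`, and `powers` are hypothetical names, and the matrix
;; entries are arbitrary.

```clojure
(defn mat-mul
  "Multiply two square matrices given as vectors of row vectors."
  [a b]
  (let [n (count a)]
    (vec (for [i (range n)]
           (vec (for [j (range n)]
                  (reduce + (for [k (range n)]
                              (* (get-in a [i k]) (get-in b [k j]))))))))))

(def strictly-upper
  "An arbitrary 4x4 strictly upper triangular matrix (zero diagonal)."
  [[0 1 2 3]
   [0 0 4 5]
   [0 0 0 6]
   [0 0 0 0]])

(def powers
  "The sequence N, N^2, N^3, N^4 for N = strictly-upper."
  (take 4 (iterate #(mat-mul % strictly-upper) strictly-upper)))

(nth powers 1) ;; N^2 => [[0 0 4 17] [0 0 0 24] [0 0 0 0] [0 0 0 0]]
(nth powers 2) ;; N^3 => [[0 0 0 24] [0 0 0 0] [0 0 0 0] [0 0 0 0]]
(nth powers 3) ;; N^4 => [[0 0 0 0] [0 0 0 0] [0 0 0 0] [0 0 0 0]]
```

;; Each power pushes the non-zero entries one diagonal further from the main
;; diagonal, and the fourth power of this 4x4 matrix is the zero matrix.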
| 770 | +;; ### How This Affects Hash Fusing |
| 771 | + |
| 772 | +;; When we fuse a hash with itself (A × A), we are computing (I + N) × (I + N): |
| 773 | +;; |
| 774 | +;; (I + N)² = I + 2N + N² |
| 775 | +;; |
| 776 | +;; With our bit masking over w-bit cells, arithmetic is effectively done |
| 777 | +;; modulo 2^w. Multiplying by 2 shifts all bits left and, after masking back |
| 778 | +;; to w bits, forces the least significant bit of each cell to 0. After k squarings: |
| 779 | +;; |
| 780 | +;; (I + N)^(2^k) = I + 2^k · N + (higher powers of N) |
| 781 | +;; |
| 782 | +;; The coefficient 2^k forces the lowest k bits of the N term to zero under
| 783 | +;; mod 2^w masking, and the binomial coefficients on the higher powers of N
| 784 | +;; carry nearly as many factors of 2. This is what Experiment 2 observed:
| 785 | +;; each fold (squaring) zeros about one more bit, so roughly cell-size folds reach zero.
| 786 | + |
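;; The squaring argument above can be checked numerically. The following is a
;; sketch under stated assumptions: a 4x4 matrix with 8-bit cells and arbitrary
;; entries, chosen for illustration rather than taken from the article's
;; encoding. The names `mat-mul-mod`, `unitriangular`, `folds`, and
;; `superdiagonal` are hypothetical.

```clojure
(defn mat-mul-mod
  "Multiply two square matrices (vectors of row vectors), masking each cell
  to `bits` bits, i.e. reducing it modulo 2^bits."
  [bits a b]
  (let [n    (count a)
        mask (dec (bit-shift-left 1 bits))]
    (vec (for [i (range n)]
           (vec (for [j (range n)]
                  (bit-and mask
                           (reduce + (for [k (range n)]
                                       (* (get-in a [i k])
                                          (get-in b [k j])))))))))))

(def unitriangular
  "An arbitrary 4x4 upper triangular matrix with 1s on the diagonal."
  [[1 1 3 5]
   [0 1 7 9]
   [0 0 1 11]
   [0 0 0 1]])

(def folds
  "Element k is `unitriangular` squared k times with 8-bit cells."
  (iterate #(mat-mul-mod 8 % %) unitriangular))

(defn superdiagonal
  "The cells directly above the main diagonal."
  [m]
  (mapv #(get-in m [% (inc %)]) (range (dec (count m)))))

(mapv superdiagonal (take 5 folds))
;; => [[1 7 11] [2 14 22] [4 28 44] [8 56 88] [16 112 176]]

(= (nth folds 9) [[1 0 0 0] [0 1 0 0] [0 0 1 0] [0 0 0 1]])
;; => true
```

;; Each fold doubles the first superdiagonal, so one more low bit is zero each
;; time. Cells further from the diagonal pick up binomial coefficients such as
;; C(2^k, 2), which carry one fewer factor of 2, so in this sketch the matrix
;; only collapses fully to the identity at the ninth fold.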
| 787 | +;; ### Why This Is Fundamental |
| 788 | + |
| 789 | +;; The nilpotent structure is intrinsic to upper triangular matrices. Any |
| 790 | +;; associative, non-commutative composition based on this structure will |
| 791 | +;; exhibit the same behavior. This is not a bug — it's a mathematical |
| 792 | +;; property of the group we chose. |
| 793 | + |
| 794 | +;; ### The Trade-off |
| 795 | + |
| 796 | +;; We chose upper triangular matrices because they provide: |
| 797 | +;; - **Associativity**: Essential for Finger Trees |
| 798 | +;; - **Non-commutativity**: Essential for ordered sequences |
| 799 | +;; - **Efficient representation**: The matrix structure maps cleanly to hash bits |
| 800 | +;; |
| 801 | +;; The cost is nilpotency, which manifests as the low-entropy failure mode. |
| 802 | +;; This trade-off is acceptable for high-entropy data (which is the common |
| 803 | +;; case for content hashes) and can be mitigated with detection and |
| 804 | +;; preprocessing for edge cases. |
| 805 | + |
| 806 | +;; ### Alternative Structures |
| 807 | + |
| 808 | +;; For applications where low-entropy data is common, one could explore |
| 809 | +;; non-nilpotent groups such as GL(n, F_p) — the general linear group over |
| 810 | +;; a finite field. However, this would sacrifice the sparse matrix efficiency |
| 811 | +;; and require more complex encoding. The upper triangular approach remains |
| 812 | +;; practical for most content-addressed data structure applications. |
| 813 | + |
| 814 | +;; ## Appendix B: Alternatives Explored |
| 815 | + |
| 816 | +;; Before settling on upper triangular matrices, several alternative approaches |
| 817 | +;; were investigated. This section documents those explorations and explains |
| 818 | +;; why they were not suitable. |
| 819 | + |
| 820 | +;; ### Quaternion Multiplication |
| 821 | + |
| 822 | +;; Quaternions are a natural candidate: they form a non-commutative algebra |
| 823 | +;; and have efficient 4-component representations. However, experiments with |
| 824 | +;; quaternion-based fusing revealed the same nilpotent-like behavior observed |
| 825 | +;; with upper triangular matrices. |
| 826 | +;; |
| 827 | +;; When quaternions are encoded similarly to the UTM approach — with values |
| 828 | +;; close to the identity element (1 + εi + εj + εk where ε represents small |
| 829 | +;; perturbations) — repeated self-multiplication causes the perturbation terms |
| 830 | +;; to degenerate. While unit quaternions form a proper group (isomorphic
| 831 | +;; to SU(2)), the encoding required for hash values reintroduces the same |
| 832 | +;; fundamental problem. |
| 833 | + |
| 834 | +;; ### Modified Fusing Operations |
| 835 | + |
| 836 | +;; Various modifications to the basic fusing operation were attempted, |
| 837 | +;; including: |
| 838 | +;; - XOR combined with rotations |
| 839 | +;; - Nonlinear mixing functions |
| 840 | +;; - Polynomial operations over finite fields |
| 841 | +;; |
| 842 | +;; The consistent finding was that **associativity is surprisingly fragile**. |
| 843 | +;; Most intuitive "mixing" operations that might improve entropy preservation |
| 844 | +;; break associativity, which is essential for the Finger Tree use case. |
| 845 | +;; |
| 846 | +;; Matrix multiplication is special precisely because it is one of the few |
| 847 | +;; operations that provides both associativity and non-commutativity naturally. |
| 848 | + |
| 849 | +;; ### Why Upper Triangular Matrices Were Chosen |
| 850 | + |
| 851 | +;; Despite the low-entropy limitation, upper triangular matrices remain the |
| 852 | +;; best practical choice because: |
| 853 | +;; |
| 854 | +;; 1. **Guaranteed associativity**: Matrix multiplication is inherently |
| 855 | +;; associative — this cannot be broken by implementation choices. |
| 856 | +;; |
| 857 | +;; 2. **Guaranteed non-commutativity**: For n ≥ 3, multiplication of these
| 858 | +;; unit upper triangular matrices is non-commutative, preserving order.
| 859 | +;; |
| 860 | +;; 3. **Efficient sparse encoding**: The upper triangular structure means |
| 861 | +;; only n(n-1)/2 cells carry information, mapping cleanly to hash bits. |
| 862 | +;; |
| 863 | +;; 4. **Predictable failure mode**: The low-entropy degeneration is |
| 864 | +;; detectable and follows a known pattern (one bit per fold), making |
| 865 | +;; it possible to implement safeguards. |
| 866 | +;; |
| 867 | +;; The alternatives explored either broke associativity (disqualifying them |
| 868 | +;; for Finger Trees) or exhibited the same degeneration behavior without |
| 869 | +;; the other benefits of the matrix approach. |
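;; The first two properties in the list above can be checked on small
;; examples. This sketch uses plain integer arithmetic without masking; the
;; names `mat-mul`, `ut-a`, `ut-b`, and `ut-c` are hypothetical and the
;; entries are arbitrary.

```clojure
(defn mat-mul
  "Multiply two square matrices given as vectors of row vectors."
  [a b]
  (let [n (count a)]
    (vec (for [i (range n)]
           (vec (for [j (range n)]
                  (reduce + (for [k (range n)]
                              (* (get-in a [i k]) (get-in b [k j]))))))))))

;; Three 3x3 upper triangular matrices with 1s on the diagonal.
(def ut-a [[1 1 0] [0 1 0] [0 0 1]])
(def ut-b [[1 0 0] [0 1 1] [0 0 1]])
(def ut-c [[1 2 3] [0 1 4] [0 0 1]])

;; Order matters: the two products differ in the top-right cell.
(mat-mul ut-a ut-b) ;; => [[1 1 1] [0 1 1] [0 0 1]]
(mat-mul ut-b ut-a) ;; => [[1 1 0] [0 1 1] [0 0 1]]

;; Grouping does not matter.
(= (mat-mul (mat-mul ut-a ut-b) ut-c)
   (mat-mul ut-a (mat-mul ut-b ut-c)))
;; => true
```

;; The order of the operands is recorded in the cells away from the first
;; superdiagonal, which is where the sequence order of fused hashes survives.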