
Compile and run MNIST with Lattigo #2614

Closed

j2kun wants to merge 6 commits into google:main from j2kun:go-mnist

Conversation

j2kun (Collaborator) commented Feb 6, 2026

Overall we're finding the performance of OpenFHE bootstrap to be unmanageable. So we're looking to do the HEIR paper benchmarks entirely against lattigo, which means we need to compile models and then load the torch stuff in go directly.

j2kun (Collaborator, Author) commented Feb 7, 2026

So the existing go pipeline won't compile mnist: when it tries to generate primes to configure lattigo (we use openfhe to generate primes), it hits

terminate called after throwing an instance of 'lbcrypto::OpenFHEException'
what(): external/openfhe+/src/core/include/math/nbtheory-impl.h:l.354:LastPrime(): LastPrime: Requested bit length 172 exceeds maximum allowed length 60

So now I'm working on rolling the central loops. I will need to also add lattigo bootstrap ops and configure that properly in the pipeline.

j2kun (Collaborator, Author) commented Feb 11, 2026

Quick update here: I am able to compile mnist to lattigo (fully unrolled) and it runs (about 1 minute per inference), but the outputs are wrong, even after rebasing over @asraa's change that disables the known-buggy in-place transform. So now I'm comparing the compiled code to openfhe, as well as the data inputs, since the go code has a manually written loader for the mnist weights and inputs.

j2kun (Collaborator, Author) commented Feb 12, 2026

Confirmed that the input vectors and weights loaded from the test data are the same (this was not entirely clear because I had to roll my own data loader for loading torch weights and inputs from go).
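
A sketch of the kind of elementwise check this involves, with hypothetical names (the thread doesn't show the actual comparison code): flatten both sides to float slices and compare within a tolerance.

```go
package main

import (
	"fmt"
	"math"
)

// approxEqual reports whether two tensors (flattened to float64
// slices) agree elementwise within eps. Hypothetical helper, not
// the PR's loader code.
func approxEqual(a, b []float64, eps float64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if math.Abs(a[i]-b[i]) > eps {
			return false
		}
	}
	return true
}

func main() {
	fromGoLoader := []float64{0.1, -0.25, 0.5}
	fromTorchDump := []float64{0.1, -0.25, 0.5000001}
	fmt.Println(approxEqual(fromGoLoader, fromTorchDump, 1e-6)) // true
}
```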

j2kun (Collaborator, Author) commented Feb 13, 2026

More debugging notes:

Overriding the default lattigo config (from HEIR) with

  param, err1168 := ckks.NewParametersFromLiteral(ckks.ParametersLiteral{
    LogN: 15,
    LogQ: []int{55, 55, 55, 55, 55, 55, 55, 55, 55},
    LogP: []int{60, 60, 60},
    LogDefaultScale: 60,
  })

And encrypting the input with scale

pt.Scale = param.NewScale(math.Pow(2, 60))

does in fact cause the program to produce a different inference. So the working theory is that our lattigo configuration is insufficient, not that the circuit is wrong.

j2kun (Collaborator, Author) commented Feb 13, 2026

Modifying the openfhe reference and the lattigo implementation to output the raw logits as well:

OpenFHE

CPU time used: 124981.92 ms
max_id: 7, label: 7
logits:
0, -1.114529
1, -4.291929
2, -4.677432
3, -1.066077
4, -3.428311
5, -2.115121
6, -2.479473
7, 6.087118
8, -2.045212
9, 2.146153
CPU time used: 123455.96 ms
max_id: 2, label: 2
logits:
0, -0.421192
1, -2.335960
2, 10.809712
3, -5.756362
4, -11.330153
5, 2.796868
6, -9.413978
7, -7.641119
8, 6.577792
9, -2.776271
CPU time used: 123824.07 ms
max_id: 1, label: 1
logits:
0, 0.011383
1, 6.587174
2, 0.821547
3, -7.520698
4, 4.171490
5, -3.873448
6, 0.209216
7, 1.469247
8, -0.095055
9, -1.110379

Lattigo

    mnist_test.go:279: Sample 0 took 1m48.627856981s
    mnist_test.go:296: Sample 0: predicted 2, actual 7
    mnist_test.go:297: logits:
    mnist_test.go:299: 0, -0.100946
    mnist_test.go:299: 1, -0.042298
    mnist_test.go:299: 2, 0.160055
    mnist_test.go:299: 3, -0.093670
    mnist_test.go:299: 4, 0.051379
    mnist_test.go:299: 5, -0.120866
    mnist_test.go:299: 6, -0.076338
    mnist_test.go:299: 7, 0.020700
    mnist_test.go:299: 8, 0.109448
    mnist_test.go:299: 9, 0.074945
    mnist_test.go:279: Sample 1 took 1m47.867796948s
    mnist_test.go:296: Sample 1: predicted 2, actual 2
    mnist_test.go:297: logits:
    mnist_test.go:299: 0, -0.100946
    mnist_test.go:299: 1, -0.042298
    mnist_test.go:299: 2, 0.160055
    mnist_test.go:299: 3, -0.093671
    mnist_test.go:299: 4, 0.051379
    mnist_test.go:299: 5, -0.120866
    mnist_test.go:299: 6, -0.076338
    mnist_test.go:299: 7, 0.020700
    mnist_test.go:299: 8, 0.109448
    mnist_test.go:299: 9, 0.074945
    mnist_test.go:279: Sample 2 took 1m49.516022234s
    mnist_test.go:296: Sample 2: predicted 2, actual 1
    mnist_test.go:297: logits:
    mnist_test.go:299: 0, -0.100946
    mnist_test.go:299: 1, -0.042298
    mnist_test.go:299: 2, 0.160055
    mnist_test.go:299: 3, -0.093671
    mnist_test.go:299: 4, 0.051380
    mnist_test.go:299: 5, -0.120866
    mnist_test.go:299: 6, -0.076338
    mnist_test.go:299: 7, 0.020701
    mnist_test.go:299: 8, 0.109448
    mnist_test.go:299: 9, 0.074945

Obviously, not only are the logits wrong, but they're the same for all three samples.

asraa (Collaborator) left a comment

are the image pixels one for one compared with openfhe and actually different per i? edit: oh okay, yes, you said weights AND inputs.

ok - then i would think there totally might be something wrong with in-place, but it works w/o in place too?!?!?! what!

j2kun (Collaborator, Author) commented Feb 13, 2026

So we checked that compiling a matvec to lattigo works, next I will truncate the mnist example down to its first layer and compare that with openfhe. I will also try running a ReLU in isolation and compare between lattigo and openfhe.

j2kun added the pull_ready label (indicates whether a PR is ready to pull; the copybara worker will import for internal testing) on Feb 13, 2026
j2kun (Collaborator, Author) commented Feb 13, 2026

Marking this as pull_ready so I can get it into google3 and do some more testing.

j2kun removed the pull_ready label on Feb 13, 2026
j2kun (Collaborator, Author) commented Feb 18, 2026

Some notes:

  • I was able to verify that doing a relu approximation in isolation works in lattigo (though the accuracy is not great with the default degree, it should still agree with openfhe in principle)
  • I was able to verify that the lattigo output is still incorrect on a first-layer-only MNIST.
  • The first step at which the OpenFHE compilation and the Lattigo compilation differ (on the "1 layer mnist" test) is 68_annotate-mgmt, at which point the lattigo path thinks the IR only needs 7 levels while the openfhe path thinks it needs 14 (the program is otherwise identical and only uses 7 levels in both cases). This difference is eliminated after 71_annotate-mgmt when both IRs get annotated with 7 levels.
  • After pass 68, the lattigo path chooses these params:
#ckks.scheme_param<logN = 15, Q = [36028797017456641, 35184376545281, 35184367828993, 35184373989377, 35184368025601, 35184373006337, 35184368877569, 35184372744193], P = [1152921504608747521, 1152921504614055937, 1152921504615628801], logDefaultScale = 45, encryptionTechnique = extended>

While openfhe path chooses

#ckks.scheme_param<logN = 16, Q = [36028797014376449, 35184351772673, 35184380870657, 35184353083393, 35184379035649, 35184355704833, 35184378511361, 35184358850561, 35184377331713, 35184363569153, 35184376545281, 35184365273089, 35184373006337, 35184368025601, 35184372744193], P = [1152921504614055937, 1152921504615628801, 1152921504616808449, 1152921504618381313], logDefaultScale = 45>

In particular, lattigo chooses a smaller logN, 7 vs 15 standard primes (corresponding to the initial choice of 14 levels in openfhe), one fewer special prime, and the extended encryption technique.

  • The two pipelines then continue equivalently (except for parameter selection) until 87_tensor-ext-to-tensor. The openfhe pipeline inserts an extra canonicalizer pass, then they continue as before and branch apart at lwe-to-openfhe/lwe-to-lattigo.

j2kun (Collaborator, Author) commented Feb 18, 2026

Another thing is that I will often see this warning in the lattigo path, but not in the openfhe path:

warning: Range Analysis indicate that the first modulus must be larger than the scaling modulus by at least 127 bits.

j2kun (Collaborator, Author) commented Feb 19, 2026

Making progress here: even just the first matvec differs in output between openfhe and lattigo.

j2kun (Collaborator, Author) commented Feb 20, 2026

With help from @ZenithalHourlyRate, the bug has been identified: the input vector is not cyclically replicated properly. The pass pipeline requests 1024 slots, but the parameter selection requires at least 4096. In OpenFHE we have special codegen that does additional cyclic repetition to ensure the slots are filled and rotations are semantically correct. In Lattigo we removed that code a while back, and no test depended on it, even though removing it breaks Halevi-Shoup matvec.

I'm going to put the fix into a different PR and try to fix it in a way that applies to all backends, moving this logic out of the codegen step and into the pass pipeline proper.

j2kun closed this Feb 25, 2026