Hello,
I have been reproducing the results for AdEMAMix, but I am observing a consistent discrepancy compared to the performance reported in the paper.
In my runs, shown in the figure below and compared with Figure 3b in the paper, AdEMAMix achieves consistently better performance at every iteration budget. For example, at 32k iterations, Figure 3b reports a validation loss of approximately 3.3, whereas my run reaches around 3.27. Similar improvements appear at the other iteration points as well.
Because the difference is consistent across the entire training curve, I am concerned that I may have misconfigured something or unintentionally deviated from the intended setup.
For reference, here is the exact configuration I used:
torchrun --nproc_per_node=4 ./src/main.py --config_format base --model llama --distributed_backend nccl \
--n_embd 768 --n_head 12 --n_layer 12 \
--batch_size 64 --sequence_length 512 --acc_steps 4 \
--dataset fineweb --datasets_dir DATASETS_DIR --iterations {16000, 32000, 48000, 64000} \
--dropout 0.0 --warmup_steps 2000 --grad_clip 0.5 --seed 0 \
--opt ademamix --lr 1e-3 --weight_decay 0.1 --scheduler cos \
--beta1 0.9 --beta2 0.999 \
--adema_beta3 0.999 --adema_alpha 8.0 \
--adema_beta3_warmup {16000, 32000, 48000, 64000} --adema_alpha_warmup {16000, 32000, 48000, 64000} \
--wandb --wandb_project WANDB_PROJECT --wandb_entity WANDB_ENTITY \
--eval_interval 115 --latest_ckpt_interval 1000
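For clarity on which update rule I believe these flags configure, here is my understanding of a single AdEMAMix step as a minimal sketch, based on the algorithm in the paper rather than the repo's exact implementation. The scalar form, parameter names, and `eps` default are my assumptions; the beta3/alpha warmup schedules from `--adema_beta3_warmup` / `--adema_alpha_warmup` are omitted and final values are used throughout:

```python
import math

def ademamix_step(p, g, m1, m2, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.999, alpha=8.0, eps=1e-8, weight_decay=0.1):
    """One AdEMAMix update for a single scalar parameter (sketch).

    m1: fast EMA of gradients (beta1), bias-corrected as in Adam.
    m2: slow EMA of gradients (beta3), not bias-corrected in the paper.
    v : EMA of squared gradients (beta2), bias-corrected.
    t : 1-indexed step count.
    Note: in the real runs, beta3 and alpha are warmed up over training
    via --adema_beta3_warmup / --adema_alpha_warmup; here they are fixed.
    """
    m1 = beta1 * m1 + (1 - beta1) * g
    m2 = beta3 * m2 + (1 - beta3) * g
    v = beta2 * v + (1 - beta2) * g * g
    m1_hat = m1 / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled (AdamW-style) weight decay, then the mixed-momentum step.
    p = p - lr * weight_decay * p
    p = p - lr * (m1_hat + alpha * m2) / (math.sqrt(v_hat) + eps)
    return p, m1, m2, v
```

If my reading of the slow-EMA term `alpha * m2` (added to the bias-corrected fast EMA before the second-moment normalization) is off, that alone might account for the shifted curves.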
Could you please let me know whether this setup matches the intended experimental configuration, or if there are any subtle implementation details that might explain this discrepancy?
I would greatly appreciate any clarification. Thank you for releasing the code and the paper.
Best regards.