Discrepancy between reported and reproduced AdEMAMix results #2

@baekrok

Hello,

I have been trying to reproduce the AdEMAMix results, but my runs show a consistent discrepancy from the performance reported in the paper.

In my runs (shown in the figure below, for comparison with Figure 3b in the paper), AdEMAMix achieves a substantially lower validation loss at all iterations. For example, at 32k iterations Figure 3b reports a validation loss of approximately 3.3, whereas my run reaches about 3.27. Similar gaps appear at the other iteration points as well.

Image

Because the difference is consistent across the entire training curve, I am concerned that I may have misconfigured something or unintentionally deviated from the intended setup.

For reference, here is the exact configuration I used:

torchrun --nproc_per_node=4 ./src/main.py --config_format base --model llama --distributed_backend nccl \
    --n_embd 768 --n_head 12 --n_layer 12 \
    --batch_size 64 --sequence_length 512 --acc_steps 4 \
    --dataset fineweb --datasets_dir DATASETS_DIR --iterations {16000, 32000, 48000, 64000} \
    --dropout 0.0 --warmup_steps 2000 --grad_clip 0.5 --seed 0 \
    --opt ademamix --lr 1e-3 --weight_decay 0.1 --scheduler cos \
    --beta1 0.9 --beta2 0.999 \
    --adema_beta3 0.999 --adema_alpha 8.0 \
    --adema_beta3_warmup {16000, 32000, 48000, 64000} --adema_alpha_warmup {16000, 32000, 48000, 64000} \
    --wandb --wandb_project WANDB_PROJECT --wandb_entity WANDB_ENTITY \
    --eval_interval 115 --latest_ckpt_interval 1000
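
The `{16000, 32000, 48000, 64000}` notation in the command above is not literal shell syntax. It is read here as shorthand for four separate runs, one per iteration budget. The sketch below spells that reading out as a loop. It assumes that, in each run, `--adema_beta3_warmup` and `--adema_alpha_warmup` are set to the same value as `--iterations`; that pairing is an interpretation of the command, not something it states. `DRY_RUN`, `prefix`, and `run_one` are hypothetical helper names introduced for illustration and are not part of the repository. `DATASETS_DIR`, `WANDB_PROJECT`, and `WANDB_ENTITY` remain placeholders, as in the original.

```shell
#!/bin/sh
# Sketch: one run per iteration budget, expanding the {16000, ...} shorthand.
# Assumption: both AdEMAMix warmup lengths equal --iterations in each run.
# DRY_RUN is a hypothetical helper variable (not a flag of the repository):
#   1 (default) only prints each command, 0 actually launches it.
if [ "${DRY_RUN:-1}" = "1" ]; then prefix="echo"; else prefix=""; fi

run_one() {
    n="$1"  # iteration budget for this run
    # $prefix is left unquoted on purpose so that an empty value disappears.
    $prefix torchrun --nproc_per_node=4 ./src/main.py --config_format base --model llama --distributed_backend nccl \
        --n_embd 768 --n_head 12 --n_layer 12 \
        --batch_size 64 --sequence_length 512 --acc_steps 4 \
        --dataset fineweb --datasets_dir DATASETS_DIR --iterations "$n" \
        --dropout 0.0 --warmup_steps 2000 --grad_clip 0.5 --seed 0 \
        --opt ademamix --lr 1e-3 --weight_decay 0.1 --scheduler cos \
        --beta1 0.9 --beta2 0.999 \
        --adema_beta3 0.999 --adema_alpha 8.0 \
        --adema_beta3_warmup "$n" --adema_alpha_warmup "$n" \
        --wandb --wandb_project WANDB_PROJECT --wandb_entity WANDB_ENTITY \
        --eval_interval 115 --latest_ckpt_interval 1000
}

for iters in 16000 32000 48000 64000; do
    run_one "$iters"
done
```

With the default `DRY_RUN=1` the loop only prints the four commands, which makes it easy to check that the warmup lengths track the iteration budget before launching anything.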

Could you please let me know whether this setup matches the intended experimental configuration, or if there are any subtle implementation details that might explain this discrepancy?

I would greatly appreciate any clarification. Thank you for releasing the code and the paper.

Best regards.
