
Add Question 189: Compute Direct Preference Optimization Loss #583

Open

zhenhuan-yang wants to merge 1 commit into Open-Deep-ML:main from zhenhuan-yang:zhy-dpo

Conversation

@zhenhuan-yang

Summary

This PR adds a new medium-difficulty Deep Learning question on computing Direct Preference Optimization (DPO) loss for language model alignment.

Question Details

  • ID: 189
  • Title: Compute Direct Preference Optimization Loss
  • Difficulty: Medium
  • Category: Deep Learning

Implementation

  • ✅ Complete solution with proper numerical stability using np.log1p
  • ✅ Comprehensive educational content covering DPO theory and Bradley-Terry model
  • ✅ Mathematical formulation with LaTeX
  • ✅ 4 diverse test cases with varying parameters
  • ✅ Example with detailed reasoning

Validation

  • ✅ Build successful
  • ✅ Schema validation passed
  • ✅ All test cases pass

Educational Value

Covers an important modern technique for LLM alignment that's simpler and more stable than traditional RLHF, making it highly relevant for current ML practitioners.
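To make the question's target concrete, here is a minimal sketch of the per-example DPO loss the PR computes: loss = -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]). The function name and signature below are assumptions for illustration, not the PR's actual interface:

```python
import numpy as np

def dpo_loss(chosen_logps, rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss (hypothetical signature).

    Each argument is a log-probability (scalar or array) of the chosen
    or rejected response under the policy or the frozen reference model.
    """
    # Implicit reward margin, scaled by the temperature beta
    logits = beta * ((chosen_logps - ref_chosen_logps)
                     - (rejected_logps - ref_rejected_logps))
    # -log(sigmoid(logits)), computed stably as log(1 + exp(-logits))
    return np.logaddexp(0.0, -logits)
```

With all four log-probabilities equal, the margin is zero and the loss is log 2, which is a handy sanity check for a test case.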


# Compute loss using log-sigmoid for numerical stability
# Loss = -log(sigmoid(logits)) = log(1 + exp(-logits))
losses = np.log1p(np.exp(-logits))
Collaborator
This overflows to inf when logits is large and negative, since np.exp(-logits) overflows before log1p is applied.
losses = np.logaddexp(0, -logits)
This might be more stable.

Collaborator

I would also add a test case to exploit this
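The overflow the reviewer describes is easy to reproduce. This sketch (standalone, not the PR's test code) compares the two formulations on an extreme input:

```python
import numpy as np

logits = np.array([-1000.0, 0.0, 1000.0])

# Naive form: np.exp(-logits) overflows to inf when logits = -1000,
# so the resulting loss is inf even though the true value is finite.
with np.errstate(over="ignore"):
    naive = np.log1p(np.exp(-logits))

# logaddexp computes log(exp(0) + exp(-logits)) = log(1 + exp(-logits))
# in a shifted form that never materializes exp(1000).
stable = np.logaddexp(0.0, -logits)
```

For logits = -1000 the true loss is essentially 1000 (the loss approaches -logits for very negative logits); the naive form returns inf while the logaddexp form returns 1000 exactly, which is the kind of edge case a new test could pin down.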
