
DeepSeek-R1-Paper-Implementations

This repository provides hands-on implementations and explanations of core algorithms and evaluation metrics introduced in the DeepSeek-R1 paper. Using Python and NumPy, explore novel reinforcement learning techniques like Group Relative Policy Optimization (GRPO), efficient knowledge distillation methods, and robust metrics for evaluating reasoning models. Ideal for researchers and practitioners who want to understand and reproduce the contributions of DeepSeek-R1.


Group Relative Policy Optimization

Introduction

Group Relative Policy Optimization (GRPO) is a novel reinforcement learning technique used to train large language models such as DeepSeek R1. GRPO improves upon traditional methods by efficiently calculating advantage scores that guide model training without requiring an additional critic network. This technique allows models to learn better instruction-following and text generation capabilities.

Background: Reinforcement Learning for Language Models

Reinforcement learning is a powerful approach where models learn optimal behaviours by receiving feedback in the form of rewards. For language models, reinforcement learning assists in refining outputs by indicating which generated texts are desirable or not. This feedback mechanism is essential for improving instruction adherence and output quality in complex language tasks.

Traditional Method: Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm. It seeks to improve the model (policy) to maximize the expected rewards by updating it towards actions that yield higher returns. A critical component of PPO is the advantage function, which measures how much better or worse an action is compared to the expected outcome. PPO typically uses a separate critic network to estimate this expected reward (value function), doubling computational and memory costs and making training more complicated.

At its core, PPO uses the probability ratio between the new policy pi_theta and the old policy pi_theta_old for an action a_t taken in state s_t:

r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)

This ratio measures how much the policy has changed. PPO clips r_t(theta) to stay within a small range around 1, preventing overly large updates that can destabilize training.

The clipped objective is:

L_CLIP(theta) = E_t[ min( r_t(theta) * A_hat_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_hat_t ) ]

where A_hat_t is the advantage estimate. The objective takes the minimum of the unclipped and clipped terms, ensuring the update improves the policy without moving it too far from the previous version. This balance leads to stable and reliable learning across many reinforcement learning problems.

PPO Pseudocode (Python-style)

for each training iteration:
    for each timestep t in rollout:
        # Probability ratio between the new and old policies
        r_t = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)

        # Clip the ratio to [1 - epsilon, 1 + epsilon]
        clipped_r_t = clip(r_t, 1 - epsilon, 1 + epsilon)

        # Surrogate objectives (unclipped and clipped)
        surrogate1 = r_t * advantage[t]
        surrogate2 = clipped_r_t * advantage[t]

        # Take the minimum; negate because optimizers minimize
        loss_t = -min(surrogate1, surrogate2)

    # Update policy parameters by minimizing the mean loss over the batch
    optimize(policy_parameters, loss=mean(loss_t over t))
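The pseudocode above can be sketched as a runnable NumPy function. This is a minimal illustration, not the paper's implementation: the function name is my own, and working with log-probabilities (so the ratio is an exponentiated difference) is a common numerical-stability choice.

```python
import numpy as np

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate PPO loss (negated so an optimizer can minimize it)."""
    # r_t(theta) = pi_theta / pi_theta_old, computed from log-probabilities
    ratio = np.exp(log_probs_new - log_probs_old)
    # Keep the ratio within [1 - epsilon, 1 + epsilon]
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Minimum of unclipped and clipped surrogate objectives
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```

When the policies match, the ratio is 1 and the loss reduces to the negative mean advantage; when the new policy moves too far, the clipped term caps the update.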

GRPO: A New Approach

GRPO stands out by eliminating the need for a separate critic network. Instead, it estimates the baseline reward for a given prompt by averaging the rewards of a group of outputs generated by the current policy. This group-relative comparison provides a direct, meaningful feedback signal, making GRPO more efficient and scalable for massive language models.

GRPO Core Algorithm

For each prompt, sample a group of outputs (size G) from the language model. Assign rewards r_i to each output based on correctness or quality. Calculate each output's advantage A_i using the formula:

A_i = (r_i - mean(r)) / std(r)

Here, the numerator r_i−mean(r) represents how much the reward deviates from the group average, while the denominator normalizes this deviation by the standard deviation of the group's rewards. This normalization ensures that the advantage is scale-invariant, allowing fair comparison across outputs regardless of the reward scale.
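This group-relative normalization is a one-liner in NumPy. A minimal sketch follows; the function name and the small epsilon guard against zero variance (when all rewards in a group are equal) are my additions:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward by the group's stats."""
    rewards = np.asarray(rewards, dtype=float)
    # Deviation from the group mean, scaled by the group standard deviation
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

For example, a group with rewards [1, 0, 1, 0] yields advantages of roughly [+1, -1, +1, -1]: correct outputs are pushed up and incorrect ones pushed down, regardless of the absolute reward scale.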

Knowledge Distillation

What is Knowledge Distillation?

Knowledge distillation is a technique for transferring knowledge from a large, complex model (the 'teacher') to a smaller, more efficient model (the 'student'). The student learns from the teacher's outputs without replicating the full training process, which makes it possible to deploy lightweight models that retain most of the teacher's accuracy while running faster and requiring fewer resources.

Soft Labels and Temperature Scaling

Traditional hard labels use one-hot encoding (the correct class is marked 1 and every other class 0). Knowledge distillation instead uses soft labels: probability distributions over all classes that reflect the teacher's confidence in each. For example, instead of declaring 'dog' the correct class, the teacher might output 90% dog, 5% cat, and 5% bird, revealing relationships between classes.

To create these soft labels, the teacher's output logits are passed through a temperature-scaled softmax function. The temperature parameter T controls how softened or peaked the output probabilities are. A higher temperature means a softer, more uniform distribution that reveals more detailed information for the student to learn from.

Softmax with temperature T used in knowledge distillation:

p_i = exp(z_i / T) / sum_j exp(z_j / T)
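A minimal NumPy sketch of this temperature-scaled softmax; subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()          # shift for numerical stability (does not change result)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```

Raising T flattens the distribution: the same logits produce a smaller maximum probability at T=10 than at T=1, exposing more of the teacher's relative preferences.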

Distillation Loss Function

The student model is trained to mimic the teacher's softened output distribution using a loss function based on the Kullback-Leibler (KL) divergence. KL divergence measures how different two probability distributions are. In this case, it calculates the difference between the teacher's soft labels and the student's prediction.

To ensure stability and effective learning, the KL divergence is scaled by the square of the temperature, T^2. This compensates for the scaling that the temperature introduces into the gradients during backpropagation, keeping their magnitude consistent regardless of the temperature chosen.

Mathematically, the distillation loss is given by the formula:

L_KD = T^2 * KL( sigma(z_t / T) || sigma(z_s / T) )

where z_t and z_s are the logits from the teacher and student models respectively, and sigma represents the softmax function.
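Putting the pieces together, here is a hedged NumPy sketch of the distillation loss. The helper and function names are illustrative, and the KL divergence is computed directly as sum p * (log p - log q):

```python
import numpy as np

def _softmax(z):
    """Plain softmax with the usual max-shift for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = _softmax(np.asarray(teacher_logits, dtype=float) / T)  # teacher soft labels
    q = _softmax(np.asarray(student_logits, dtype=float) / T)  # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))                   # KL divergence
    return T**2 * kl
```

The loss is zero when the student exactly matches the teacher's logits and strictly positive otherwise, so minimizing it pulls the student's softened distribution toward the teacher's.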

Challenges in Evaluating Reasoning Models

Reasoning models produce multiple possible solutions via different reasoning paths rather than a single answer, so responses vary from sample to sample. This variability makes it difficult to assess performance accurately with traditional metrics: a single-sample evaluation rarely reflects a model's true abilities. Robust evaluation metrics are therefore needed to account for the variation and provide a reliable measure of reasoning quality.

Evaluation Metrics Overview

Pass@1: Baseline Accuracy

Pass@1 measures the probability that a single response the model generates is correct. It is calculated by averaging the correctness over all responses. This simple metric is efficient and serves as a baseline for comparing different models or configurations. Pass@1 reflects the expected accuracy when only one response per problem is considered.

Pass@1 = (1 / n) * sum_{i=1}^{n} correct_i

where correct_i is 1 if response i is correct and 0 otherwise, and n is the number of sampled responses.
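Pass@1 is just the mean of per-response correctness flags. A one-line sketch, assuming correctness is supplied as 0/1 values:

```python
import numpy as np

def pass_at_1(is_correct):
    """Average correctness over sampled responses (0/1 flags)."""
    return float(np.mean(is_correct))
```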

Majority Voting (Consensus): Improving Accuracy

Majority voting exploits the fact that correct answers tend to be consistent across multiple generated responses, whereas incorrect answers are more scattered. By choosing the most frequently occurring answer among multiple samples, it filters out inconsistent errors and can significantly improve accuracy on problems with a single unambiguous solution. This approach shines in domains like math problem-solving.

final_answer = mode(answer_1, ..., answer_n)

where mode returns the most frequent answer among the n sampled responses.
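A minimal sketch using Python's collections.Counter, assuming the sampled answers have been normalized to comparable (hashable) values such as strings:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]
```

For instance, if three samples answer "42", "41", "42", the vote selects "42" even though one reasoning path went astray.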

Pass@k: Probability of Success in k Attempts

Pass@k measures the probability that at least one out of k generated responses is correct. This metric becomes useful in scenarios like code generation, where any correct output among multiple attempts counts as success. Pass@k is computed using an unbiased estimator involving combinations. Efficient computational methods avoid dealing with large factorials, ensuring numerical stability.

Pass@k = E[ 1 - C(n - c, k) / C(n, k) ]

n is the total number of samples, c is the number of correct samples, and k is the number of attempts. This formula calculates the complement of the probability that all k samples are incorrect.
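The ratio of binomial coefficients can be computed stably as a running product, 1 - prod_{i=n-c+1}^{n} (1 - k/i), which avoids large factorials entirely. A sketch under that standard rewriting:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), computed stably."""
    if n - c < k:
        # Fewer incorrect samples than attempts: some attempt must be correct
        return 1.0
    # Product form of C(n-c, k) / C(n, k); the range has exactly c terms
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n=4 samples, c=2 correct, and k=2 attempts, the probability that both drawn samples are incorrect is C(2,2)/C(4,2) = 1/6, so pass@2 = 5/6.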

KL Divergence in RLHF

KL divergence measures the difference between two probability distributions; here, between the current policy and the original pre-trained policy. It acts as a penalty that prevents the model from drifting too far during RLHF. However, the standard KL divergence formula requires summing over all possible outputs, which is infeasible for language models with vast output spaces.

The KL divergence between two policies pi_theta and pi_ref is:

D_KL(pi_theta || pi_ref) = sum_o pi_theta(o | q) * log( pi_theta(o | q) / pi_ref(o | q) )

GRPO KL Divergence Estimator

The Group Relative Policy Optimization algorithm introduces an unbiased, per-sample estimator for KL divergence. Instead of summing over all outputs, it calculates divergence based on the ratio of probabilities for a single sampled output.

GRPO uses a per-sample KL divergence estimator given by:

D_KL(pi_theta || pi_ref) ≈ r - log(r) - 1

where

r = pi_ref(o | q) / pi_theta(o | q)

pi_theta is the current policy and pi_ref is the reference policy, o is the output and q is the input.
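A per-sample sketch in NumPy, working in log-probabilities for numerical stability (the function name is illustrative):

```python
import numpy as np

def grpo_kl_estimate(logprob_current, logprob_ref):
    """Per-sample KL estimator r - log(r) - 1, with r = pi_ref / pi_theta."""
    log_r = logprob_ref - logprob_current   # log of the probability ratio
    return np.exp(log_r) - log_r - 1.0      # always >= 0, zero iff r == 1
```

The estimate is zero exactly when the two policies assign the same probability to the sampled output and grows as they diverge, which makes it a well-behaved per-sample penalty term.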

Conclusion

This README highlighted GRPO’s efficient reward estimation without a critic network, enabling scalable training of large language models. I covered knowledge distillation techniques for transferring expertise from teacher to student models using soft labels and temperature scaling. Key evaluation metrics like Pass@1, Majority Voting, and Pass@k were presented to reliably measure reasoning models’ performance despite output variability. Finally, KL divergence methods, including GRPO’s per-sample estimator, help maintain stable and feasible policy updates. Together, these techniques offer a solid foundation for developing and evaluating advanced reasoning-capable AI models.

References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  3. OpenAI Spinning Up. (n.d.). Proximal Policy Optimization (PPO). https://spinningup.openai.com/en/latest/algorithms/ppo.html
  4. Deep ML. (n.d.). 101 Problems in Machine Learning. https://www.deep-ml.com/problems/101?from=DeepSeek%20R1
  5. DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. https://arxiv.org/abs/2501.12948

About

I am going to learn and code various components of the popular DeepSeek-R1 paper and share my findings.
