← back

Direct Preference Optimization (DPO) Loss

#382 · LLM · Medium

⊣ Solve on deep-ml.com

Problem

Implement the Direct Preference Optimization (DPO) loss function. Given pairs of preferred and dispreferred responses, compute the loss that directly optimizes the policy to match human preferences without training a separate reward model.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import numpy as np

def dpo_loss(
    pi_logprobs_chosen: np.ndarray,
    pi_logprobs_rejected: np.ndarray,
    ref_logprobs_chosen: np.ndarray,
    ref_logprobs_rejected: np.ndarray,
    beta: float = 0.1
) -> float:
    # Compute log-ratios
    log_ratio_chosen = pi_logprobs_chosen - ref_logprobs_chosen
    log_ratio_rejected = pi_logprobs_rejected - ref_logprobs_rejected

    # DPO loss: -E[log sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))]
    logits = beta * (log_ratio_chosen - log_ratio_rejected)
    # Numerically stable log-sigmoid
    loss = -np.mean(np.where(
        logits >= 0,
        -np.log(1 + np.exp(-logits)),
        logits - np.log(1 + np.exp(logits))
    ))
    return float(loss)

def dpo_gradients(
    pi_logprobs_chosen: np.ndarray,
    pi_logprobs_rejected: np.ndarray,
    ref_logprobs_chosen: np.ndarray,
    ref_logprobs_rejected: np.ndarray,
    beta: float = 0.1
) -> dict:
    log_ratio_chosen = pi_logprobs_chosen - ref_logprobs_chosen
    log_ratio_rejected = pi_logprobs_rejected - ref_logprobs_rejected
    logits = beta * (log_ratio_chosen - log_ratio_rejected)
    sigmoid = 1.0 / (1.0 + np.exp(-logits))

    # Gradient: push up chosen, push down rejected
    grad_chosen = -beta * (1 - sigmoid) / len(logits)
    grad_rejected = beta * (1 - sigmoid) / len(logits)

    return {
        "loss": float(dpo_loss(pi_logprobs_chosen, pi_logprobs_rejected,
                               ref_logprobs_chosen, ref_logprobs_rejected, beta)),
        "grad_chosen": grad_chosen,
        "grad_rejected": grad_rejected,
        "mean_reward_margin": float(np.mean(log_ratio_chosen - log_ratio_rejected)),
    }

Explanation

  1. Compute the log-probability ratio between the policy and reference model for both chosen and rejected responses.
  2. The DPO loss is -log(sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))), derived from the Bradley-Terry preference model.
  3. Beta controls how far the policy can deviate from the reference model — higher beta means stronger KL penalty.
  4. The gradient increases the log-probability of chosen responses and decreases it for rejected responses, weighted by how "surprised" the model is (1 - sigmoid).

Complexity

  • Time: O(n) where n is the number of preference pairs
  • Space: O(n) for intermediate computations