```
Prompt: "Explain quantum entanglement"
Response A: [technical explanation]
Response B: [simpler explanation with analogy]
Human preference: B > A
```

A reward model r is trained on such comparisons using the Bradley-Terry model, which maps a reward difference to a preference probability:

```
P(A > B) = sigmoid(r(A) - r(B))
L = -log(sigmoid(r(chosen) - r(rejected)))
```

See reference/reward-modeling.md.

The policy π is then optimized to maximize the learned reward, with a KL penalty keeping it close to the reference model π_ref:

```
maximize E[R(x, y)] - β * KL(π || π_ref)
```

See reference/policy-optimization.md.

Direct alignment methods such as DPO skip the explicit reward model and train the policy on preference pairs directly:

```
L = -log sigmoid(β * (log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))
```

See reference/direct-alignment.md.
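As a minimal sketch of the Bradley-Terry loss, here it is as a plain Python function over scalar rewards (in practice the rewards come from a learned network scoring whole responses; the function name is illustrative, not from the source):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one,
    under P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) rewritten as log(1 + exp(-m)) for numerical stability
    return math.log1p(math.exp(-margin))

# Loss shrinks as the reward model separates the pair more confidently,
# and is large when the ranking is inverted.
print(bradley_terry_loss(2.0, 0.0))
print(bradley_terry_loss(0.0, 2.0))
```

With equal rewards the loss is log 2, the model's "coin flip" baseline; training pushes r(chosen) above r(rejected) to drive it lower.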
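The KL-regularized objective can be illustrated on a toy discrete action space (an assumption for the sketch; real RLHF optimizes this over token sequences with an RL algorithm such as PPO, and the function name is hypothetical):

```python
import math

def kl_regularized_objective(rewards, pi, pi_ref, beta=0.1):
    """E_pi[R] - beta * KL(pi || pi_ref) over a small discrete action set.

    rewards, pi, pi_ref are parallel lists indexed by action.
    """
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Uniform reference policy; the trained policy tilts toward the
# higher-reward action, paying a KL cost for moving away from pi_ref.
rewards = [1.0, 0.0]
pi_ref = [0.5, 0.5]
pi = [0.8, 0.2]
print(kl_regularized_objective(rewards, pi, pi_ref, beta=0.1))
```

Raising β pulls the optimum back toward π_ref; β → 0 recovers pure reward maximization.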
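The DPO loss for a single preference pair can be sketched the same way, taking per-response log-probabilities as inputs (in practice these are summed token log-probs from the policy and the frozen reference model; the function name is illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (delta_w - delta_l)) for one preference pair,
    where delta = log pi(y|x) - log pi_ref(y|x)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable form of -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, every delta is 0,
# and the loss sits at log 2; upweighting the winner relative to the
# reference drives it down.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

Note the structural match with the Bradley-Terry loss above: DPO uses β times the log-ratio as an implicit reward, which is why no separate reward model is needed.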