neuralnoise.com


Homepage of Dr Pasquale Minervini
Researcher/Faculty at the University of Edinburgh, School of Informatics
Co-Founder and CTO at Miniml.AI
ELLIS Scholar, Edinburgh Unit


Some Notes on Gradient Estimation

Let

\[J(\theta) = \mathbb{E}_{x \sim p_{\theta}(x)}[f(x)],\]

where $f(x)$ is any scalar quantity of interest, such as a reward or a loss.

REINFORCE-Based Estimators

REINFORCE

Assuming $f$ depends on $\theta$ only through the sample $x$,

\[\begin{aligned} \nabla_{\theta} J(\theta) &= \nabla_{\theta} \int p_{\theta}(x) f(x)\, dx \\ &= \int f(x)\, \nabla_{\theta} p_{\theta}(x)\, dx \\ &= \int f(x)\, p_{\theta}(x)\, \nabla_{\theta} \log p_{\theta}(x)\, dx \\ &= \mathbb{E}_{x \sim p_{\theta}(x)} \left[ f(x)\, \nabla_{\theta} \log p_{\theta}(x) \right]. \end{aligned}\]

This is the REINFORCE, or score-function, gradient estimator (Williams, 1992).
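As a concrete check, the identity can be simulated for a one-dimensional Gaussian, where the true gradient is available in closed form. The distribution $\mathcal{N}(\theta, 1)$, the test function $f(x) = x^{2}$, and the sample size below are illustrative choices, not from the text:

```python
import numpy as np

def reinforce_grad(theta, f, n_samples=100_000, seed=0):
    # Score-function estimator for p_theta = N(theta, 1), whose score is
    # grad_theta log p_theta(x) = x - theta.
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n_samples)
    score = x - theta
    return np.mean(f(x) * score)

# f(x) = x^2 gives J(theta) = theta^2 + 1, so the true gradient is 2*theta.
grad = reinforce_grad(1.5, lambda x: x**2)
```

Note that the estimator only ever evaluates $f$ at sampled points; it never differentiates through $f$, which is why REINFORCE applies to black-box rewards.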

REINFORCE + Baseline

For any baseline $b$ that does not depend on the sampled variable $x$,

\[\mathbb{E}_{x \sim p_{\theta}(x)} \left[ \left( f(x) - b \right) \nabla_{\theta} \log p_{\theta}(x) \right] = \nabla_{\theta} J(\theta),\]

because

\[\mathbb{E}_{x \sim p_{\theta}(x)} \left[ b \nabla_{\theta} \log p_{\theta}(x) \right] = b \nabla_{\theta} \int p_{\theta}(x)\, dx = 0.\]

So subtracting such a baseline preserves unbiasedness. For a scalar parameter, writing $s(x) = \nabla_{\theta} \log p_{\theta}(x)$, the variance is

\[\operatorname{Var}\left[(f(x) - b)s(x)\right] = \mathbb{E}\left[(f(x) - b)^{2} s(x)^{2}\right] - \left(\nabla_{\theta} J(\theta)\right)^{2}.\]

As a function of $b$, this is a quadratic:

\[\mathbb{E}[s(x)^{2}]\, b^{2} - 2 \mathbb{E}[f(x) s(x)^{2}]\, b + \text{const},\]

so it is minimized at

\[b^{\star} = \frac{\mathbb{E}[f(x) s(x)^{2}]}{\mathbb{E}[s(x)^{2}]}.\]

Therefore an appropriate baseline reduces variance, and the optimal constant baseline is $b^{\star}$.
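The variance reduction can be seen empirically in the same Gaussian toy model (again an illustrative setup, with $b^{\star}$ estimated from the samples themselves):

```python
import numpy as np

# x ~ N(theta, 1), f(x) = x^2, score s(x) = x - theta.
rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)
f, s = x**2, x - theta

b_star = np.mean(f * s**2) / np.mean(s**2)   # empirical optimal baseline
var_plain = np.var(f * s)                    # variance without a baseline
var_base = np.var((f - b_star) * s)          # variance with b = b_star
```

Both estimators have the same mean ($2\theta = 3$ here), but the baselined one has markedly lower variance.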

Proximal Policy Optimization

Let $p_{\mathrm{old}}$ be the sampling policy and let $A(x)$ be an advantage estimate treated as constant with respect to $\theta$. The unclipped policy-gradient surrogate is

\[L_{\mathrm{PG}}(\theta) = \mathbb{E}_{x \sim p_{\mathrm{old}}(x)} \left[ r_{\theta}(x)\, A(x) \right], \qquad r_{\theta}(x) = \frac{p_{\theta}(x)}{p_{\mathrm{old}}(x)}.\]

Then

\[\begin{aligned} \nabla_{\theta} L_{\mathrm{PG}}(\theta) &= \mathbb{E}_{x \sim p_{\mathrm{old}}(x)} \left[ A(x)\, \nabla_{\theta} r_{\theta}(x) \right] \\ &= \mathbb{E}_{x \sim p_{\mathrm{old}}(x)} \left[ A(x)\, r_{\theta}(x)\, \nabla_{\theta} \log p_{\theta}(x) \right]. \end{aligned}\]

So PPO is just an importance-weighted REINFORCE estimator. Its clipped objective replaces $r_{\theta}(x) A(x)$ by

\[\min \left( r_{\theta}(x) A(x), \operatorname{clip}(r_{\theta}(x), 1-\varepsilon, 1+\varepsilon) A(x) \right),\]

which keeps the same REINFORCE form when the ratio is unclipped and truncates the update when it is too large (Schulman et al., 2017).

Group Relative Policy Optimization

For the same prompt $q$, let $o_{1}, \ldots, o_{G} \sim \pi_{\mathrm{old}}(\cdot \mid q)$ be a group of sampled outputs with rewards $R_{1}, \ldots, R_{G}$. GRPO uses the group-normalized advantage

\[\hat{A}_{i} = \frac{R_{i} - \operatorname{mean}(R_{1}, \ldots, R_{G})} {\operatorname{std}(R_{1}, \ldots, R_{G})}.\]

Ignoring clipping and the reference-policy KL term, the surrogate objective is

\[L_{\mathrm{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_{i}|} \sum_{t=1}^{|o_{i}|} r_{i,t}(\theta)\, \hat{A}_{i} \right],\]

with token-level importance ratio

\[r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} {\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})}.\]

Therefore

\[\nabla_{\theta} L_{\mathrm{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_{i}|} \sum_{t=1}^{|o_{i}|} \hat{A}_{i}\, r_{i,t}(\theta)\, \nabla_{\theta} \log \pi_{\theta}(o_{i,t} \mid q, o_{i,<t}) \right].\]

So GRPO is again REINFORCE with importance weighting, but the baseline is estimated from the rewards of the other samples in the same group rather than from a critic; the full GRPO objective then adds PPO-style clipping and a KL penalty to a reference policy (Shao et al., 2024).
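The group-normalized advantage is a one-liner; the sketch below uses the population standard deviation, though implementations differ on this detail:

```python
import numpy as np

def group_advantages(rewards):
    # Normalize each reward by the mean and std of its own group of G samples.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

adv = group_advantages([1.0, 2.0, 3.0])
```

By construction the advantages of a group sum to zero, so within each group roughly half of the samples are pushed up and half pushed down.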

REINFORCE Leave-One-Out

For $K$ independent samples $x_{1}, \ldots, x_{K} \sim p_{\theta}(x)$, define the leave-one-out baseline

\[b_{k} = \frac{1}{K-1} \sum_{j \neq k} f(x_{j}).\]

Then

\[\widehat{\nabla_{\theta} J} = \frac{1}{K} \sum_{k=1}^{K} \left[ \left( f(x_{k}) - b_{k} \right) \nabla_{\theta} \log p_{\theta}(x_{k}) \right].\]

This is unbiased because, conditional on $x_{-k}$, the baseline $b_{k}$ does not depend on $x_{k}$, so

\[\mathbb{E} \left[ b_{k} \nabla_{\theta} \log p_{\theta}(x_{k}) \mid x_{-k} \right] = b_{k}\, \mathbb{E} \left[ \nabla_{\theta} \log p_{\theta}(x_{k}) \right] = 0.\]

This is the leave-one-out control-variate idea underlying REINFORCE Leave-One-Out (RLOO) and VIMCO (Mnih and Rezende, 2016; Ahmadian et al., 2024).
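The leave-one-out baselines for all $K$ samples can be computed in one vectorized pass, since $b_{k} = (\sum_{j} f(x_{j}) - f(x_{k})) / (K - 1)$:

```python
import numpy as np

def rloo_baselines(f_vals):
    # b_k = mean of the other K-1 values: (sum - f_k) / (K - 1).
    f = np.asarray(f_vals, dtype=float)
    K = len(f)
    return (f.sum() - f) / (K - 1)

b = rloo_baselines([1.0, 2.0, 3.0, 6.0])
```

Each sample is then weighted by $f(x_{k}) - b_{k}$, so a sample is only reinforced if it outperforms the average of its siblings.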

Augment-REINFORCE-Merge / DisARM

For a Bernoulli random variable $z \sim \mathrm{Bernoulli}(\sigma(\phi))$, the Augment-REINFORCE-Merge (ARM) estimator is

\[\nabla_{\phi} \mathbb{E}[f(z)] = \mathbb{E}_{u \sim \mathrm{Uniform}(0, 1)} \left[ \left( f(\mathbf{1}[u > \sigma(-\phi)]) - f(\mathbf{1}[u < \sigma(\phi)]) \right) \left( u - \frac{1}{2} \right) \right].\]

DisARM integrates out the continuous augmentation and, for Bernoulli logits $\alpha_{\theta}$ and an antithetic pair $(b, \tilde{b})$, yields

\[\widehat{\nabla_{\theta} J} = \sum_{i} \left[ \frac{1}{2} \left( f(b) - f(\tilde{b}) \right) (-1)^{\tilde{b}_{i}} \mathbf{1}[b_{i} \neq \tilde{b}_{i}] \sigma(|(\alpha_{\theta})_{i}|) \nabla_{\theta} (\alpha_{\theta})_{i} \right].\]

These are unbiased binary estimators based on antithetic sampling (Yin and Zhou, 2019; Dong et al., 2020).
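The ARM identity can be checked numerically for a single Bernoulli variable, where the true gradient $\sigma(\phi)(1 - \sigma(\phi))(f(1) - f(0))$ is available in closed form (the test function and sample size are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def arm_grad(phi, f, n_samples=200_000, seed=0):
    # ARM: a single uniform u drives an antithetic pair of Bernoulli draws.
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n_samples)
    z1 = (u > sigmoid(-phi)).astype(float)
    z2 = (u < sigmoid(phi)).astype(float)
    return np.mean((f(z1) - f(z2)) * (u - 0.5))

phi = 0.7
est = arm_grad(phi, lambda z: (z - 0.3)**2)
# Closed form: sigma(phi) * (1 - sigma(phi)) * (f(1) - f(0)).
true = sigmoid(phi) * (1.0 - sigmoid(phi)) * ((1 - 0.3)**2 - (0 - 0.3)**2)
```

Because the two antithetic draws agree whenever $u$ falls between $\sigma(-\phi)$ and $\sigma(\phi)$, many terms vanish exactly, which is the source of the variance reduction.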

REBAR

Let $b = H(z)$ with $z \sim p(z \mid \theta)$ and let $\tilde{z} \sim p(z \mid b, \theta)$ be a conditional relaxed sample. With the control variate $c(z) = \eta f(\sigma_{\lambda}(z))$, REBAR uses the unbiased estimator

\[\widehat{\nabla_{\theta} J} = \left[ f(b) - \eta f(\sigma_{\lambda}(\tilde{z})) \right] \nabla_{\theta} \log p(b \mid \theta) + \eta \nabla_{\theta} f(\sigma_{\lambda}(z)) - \eta \nabla_{\theta} f(\sigma_{\lambda}(\tilde{z})).\]

So REBAR is REINFORCE plus a debiased continuous-relaxation control variate (Tucker et al., 2017).

RELAX

RELAX replaces the hand-designed REBAR control variate by a learned differentiable surrogate $c_{\phi}$. With the same notation as above,

\[\widehat{\nabla_{\theta} J} = \left[ f(b) - c_{\phi}(\tilde{z}) \right] \nabla_{\theta} \log p(b \mid \theta) + \nabla_{\theta} c_{\phi}(z) - \nabla_{\theta} c_{\phi}(\tilde{z}).\]

This estimator is unbiased for any $c_{\phi}$; the point is to learn $c_{\phi}$ so as to reduce variance (Grathwohl et al., 2018).

Pathwise Derivative / Reparameterization Trick

If we can write the random variable as a differentiable transformation $x = g_{\theta}(\varepsilon)$ with $\varepsilon \sim p(\varepsilon)$ independent of $\theta$, then

\[\begin{aligned} \nabla_{\theta} J(\theta) &= \nabla_{\theta} \mathbb{E}_{\varepsilon \sim p(\varepsilon)} \left[ f(g_{\theta}(\varepsilon)) \right] \\ &= \mathbb{E}_{\varepsilon \sim p(\varepsilon)} \left[ \nabla_{\theta} f(g_{\theta}(\varepsilon)) \right] \\ &= \mathbb{E}_{\varepsilon \sim p(\varepsilon)} \left[ \frac{\partial f}{\partial x} \frac{\partial g_{\theta}(\varepsilon)}{\partial \theta} \right]. \end{aligned}\]

This is the pathwise, or reparameterization, gradient estimator (Kingma and Welling, 2013; Rezende et al., 2014).
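For the same Gaussian toy model as above, the pathwise estimator is a direct transcription of the identity; here the derivative $\partial f / \partial x = 2x$ is written out by hand rather than via autodiff:

```python
import numpy as np

def pathwise_grad(theta, n_samples=100_000, seed=0):
    # Reparameterize x ~ N(theta, 1) as x = theta + eps with eps ~ N(0, 1).
    rng = np.random.default_rng(seed)
    x = theta + rng.normal(size=n_samples)
    # For f(x) = x^2: df/dx = 2x and dx/dtheta = 1, so the per-sample
    # pathwise gradient is 2x.
    return np.mean(2.0 * x)

grad = pathwise_grad(1.5)   # true gradient of J(theta) = theta^2 + 1 is 3
```

Compared with the score-function version, the pathwise estimator uses the derivative of $f$ and typically has much lower variance, but it requires $f$ to be differentiable.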

Evolution Strategies

For a Gaussian search distribution over parameters,

\[J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}[f(\theta + \sigma \varepsilon)] = \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^{2} I)}[f(x)].\]

Applying the same identity,

\[\begin{aligned} \nabla_{\theta} J(\theta) &= \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^{2} I)} \left[ f(x)\, \nabla_{\theta} \log \mathcal{N}(x; \theta, \sigma^{2} I) \right] \\ &= \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^{2} I)} \left[ f(x)\, \frac{x - \theta}{\sigma^{2}} \right] \\ &= \frac{1}{\sigma}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \left[ f(\theta + \sigma \varepsilon)\, \varepsilon \right]. \end{aligned}\]

Equivalently, if we define the scaled perturbation $\delta = \sigma \varepsilon$, then $\delta \sim \mathcal{N}(0, \sigma^{2} I)$ and

\[\nabla_{\theta} J(\theta) = \frac{1}{\sigma^{2}}\, \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^{2} I)} \left[ f(\theta + \delta)\, \delta \right].\]

This is the same estimator: it is only a change of variables. Indeed, substituting $\delta = \sigma \varepsilon$ gives

\[\frac{1}{\sigma}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \left[ f(\theta + \sigma \varepsilon)\, \varepsilon \right] = \frac{1}{\sigma^{2}}\, \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^{2} I)} \left[ f(\theta + \delta)\, \delta \right].\]

So Evolution Strategies are just REINFORCE applied to a Gaussian distribution over parameters; the standardized-noise form $\varepsilon \sim \mathcal{N}(0, I)$ is the one used by (Salimans et al., 2017), while the scaled-perturbation form $\delta \sim \mathcal{N}(0, \sigma^{2} I)$ is the notation used by (Huszár, 2017) and (Barber, 2017).
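The standardized-noise form translates directly into code; the quadratic objective, noise scale, and sample count below are illustrative choices (practical ES implementations also use antithetic pairs and rank-based fitness shaping):

```python
import numpy as np

def es_grad(theta, f, sigma=0.5, n_samples=50_000, seed=0):
    # Standardized-noise form: (1/sigma) * E[f(theta + sigma * eps) * eps].
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(n_samples, theta.size))
    fx = f(theta + sigma * eps)          # vectorized over perturbations
    return (fx[:, None] * eps).mean(axis=0) / sigma

theta = np.array([1.0, -2.0])
grad = es_grad(theta, lambda x: np.sum(x**2, axis=-1))  # true gradient 2*theta
```

As with REINFORCE, only function evaluations of $f$ are needed, which is why ES parallelizes so well.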

Straight-Through Estimation

Let $z = H(a_{\theta}(\xi))$ be a hard discrete decision, where $H$ is a step function or an $\arg\max$. The exact pathwise derivative is unusable because $\partial H / \partial a$ is zero almost everywhere or undefined. The Straight-Through Estimator (STE) replaces the backward derivative by that of a smooth surrogate $\tilde{H}$:

\[\frac{\partial z}{\partial a} \; := \; \frac{\partial \tilde{H}(a)}{\partial a}.\]

Therefore, for $J(\theta) = \mathbb{E}_{\xi}[L(z)]$,

\[\nabla_{\theta} J(\theta) \approx \mathbb{E}_{\xi} \left[ \frac{\partial L}{\partial z} \frac{\partial \tilde{H}(a_{\theta}(\xi))}{\partial a} \frac{\partial a_{\theta}(\xi)}{\partial \theta} \right].\]

This is not an exact identity: STE is a biased surrogate-gradient estimator (Bengio et al., 2013).
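A minimal single-variable sketch, using a sigmoid as the smooth surrogate $\tilde{H}$ and a squared loss; the toy pre-activation $a_{\theta}(\xi) = \theta \xi$ and target $y$ are illustrative:

```python
import numpy as np

def ste_grad(theta, xi, y):
    # Forward pass: hard decision z = H(a) with a = theta * xi.
    a = theta * xi
    z = float(a > 0)
    dL_dz = 2.0 * (z - y)               # L(z) = (z - y)^2
    # Backward pass: replace dH/da (zero a.e.) by the sigmoid derivative.
    sig = 1.0 / (1.0 + np.exp(-a))
    return dL_dz * (sig * (1.0 - sig)) * xi

g = ste_grad(1.0, 2.0, 0.0)
```

The forward computation still uses the hard decision $z$; only the backward pass is replaced, which is exactly why the resulting gradient is biased.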

Implicit Maximum Likelihood Estimation

Let $\operatorname{MAP}(\theta)$ denote the maximum a posteriori solution, i.e. the feasible discrete structure that maximizes the score induced by $\theta$. In practice this can be top-$k$ selection, shortest-path inference via Dijkstra, or an integer linear programming solver. If we add random perturbations $\varepsilon \sim \rho$ to the scores and solve the perturbed optimization problem, we obtain

\[z_{\theta}(\varepsilon) = \operatorname{MAP}(\theta + \varepsilon), \qquad \varepsilon \sim \rho.\]

The map $\varepsilon \mapsto \operatorname{MAP}(\theta + \varepsilon)$ pushes the perturbation law $\rho$ forward to an implicit distribution over feasible solutions, \(p_{\mathrm{PM}}(z; \theta) = \mathbb{P}_{\varepsilon \sim \rho} \left[ z = \operatorname{MAP}(\theta + \varepsilon) \right].\) Sampling from this implicit distribution by drawing $\varepsilon \sim \rho$ and solving one perturbed MAP problem is the perturb-and-MAP construction introduced by (Papandreou and Yuille, 2011).

Let $g = \nabla_{z} f(z_{\theta}(\varepsilon))$ be the downstream gradient and define target parameters

\[\theta^\prime = \theta - \lambda g.\]

Here $\theta^\prime$ is chosen so as to move in a direction expected to reduce the downstream loss. This is the key idea behind perturbation-based implicit differentiation: in the smooth-marginal setting, perturbing parameters in the negative downstream-gradient direction yields a lower-loss solution (Domke, 2010), and I-MLE adapts this idea to the discrete perturb-and-MAP setting.

For a constrained exponential-family distribution

\[p(z; \theta) = \exp(\langle z, \theta \rangle - A(\theta)),\]

the negative log-likelihood of one observation $\hat{z}$ is

\[\ell(\theta; \hat{z}) = A(\theta) - \langle \hat{z}, \theta \rangle,\]

so its gradient is

\[\nabla_{\theta} \ell(\theta; \hat{z}) = \nabla_{\theta} A(\theta) - \hat{z} = \mu(\theta) - \hat{z},\]

where $\mu(\theta) = \mathbb{E}_{p(z; \theta)}[z]$. More generally, if the target is itself a distribution $q(z; \theta^\prime)$ from the same exponential family, the implicit maximum-likelihood objective is

\[L(\theta, \theta^\prime) = -\mathbb{E}_{z \sim q(z; \theta^\prime)}[\log p(z; \theta)],\]

and its gradient is

\[\nabla_{\theta} L(\theta, \theta^\prime) = \mu(\theta) - \mathbb{E}_{z \sim q(z; \theta^\prime)}[z] = \mu(\theta) - \mu(\theta^\prime).\]

Implicit Maximum Likelihood Estimation (I-MLE) chooses the target distribution $q$ through $\theta^\prime$ and then approximates these marginals by perturb-and-MAP samples:

\[\widehat{\nabla_{\theta} J} \propto z_{\theta}(\varepsilon) - z_{\theta^\prime}(\varepsilon) = \operatorname{MAP}(\theta + \varepsilon) - \operatorname{MAP}(\theta - \lambda g + \varepsilon).\]

So the I-MLE estimator is obtained by taking the maximum-likelihood gradient \(\mu(\theta) - \mu(\theta^\prime)\) for exponential-family distributions and replacing the intractable marginals by perturb-and-MAP approximations \(\mu(\theta) \approx \mathbb{E}_{\varepsilon}[z_{\theta}(\varepsilon)], \qquad \mu(\theta^\prime) \approx \mathbb{E}_{\varepsilon}[z_{\theta^\prime}(\varepsilon)].\)

With multiple independent perturbations, one uses

\[\widehat{\nabla_{\theta} J} = \frac{1}{S} \sum_{i=1}^{S} \left[ \operatorname{MAP}(\theta + \varepsilon_{i}) - \operatorname{MAP}(\theta - \lambda g_{i} + \varepsilon_{i}) \right], \qquad \varepsilon_{i} \sim \rho, \qquad g_{i} = \nabla_{z} f(z_{\theta}(\varepsilon_{i})),\]

with the factor $1 / \lambda$ often absorbed into the learning rate. This is a biased finite-difference estimator based on an implicit target distribution (Niepert et al., 2021).
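A sketch of the estimator with top-$k$ selection as the MAP solver; the Gaussian perturbations are a simplification (the paper favors Sum-of-Gamma noise), and the toy downstream loss $f(z) = -z_{2}$, which simply prefers item 2 to be selected, is illustrative:

```python
import numpy as np

def map_topk(scores, k):
    # MAP solver for a top-k constraint: one-hot mask of the k largest scores.
    z = np.zeros_like(scores)
    z[np.argsort(scores)[-k:]] = 1.0
    return z

def imle_grad(theta, grad_z, k=2, lam=10.0, n_samples=100, seed=0):
    # I-MLE: average of MAP(theta + eps) - MAP(theta - lam * g + eps).
    rng = np.random.default_rng(seed)
    est = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = rng.normal(scale=0.1, size=theta.shape)
        z = map_topk(theta + eps, k)
        g = grad_z(z)                    # downstream gradient at the sample
        est += z - map_topk(theta - lam * g + eps, k)
    return est / n_samples

theta = np.array([2.0, 1.0, 0.0, -1.0])
# f(z) = -z[2], so grad_z f is the constant vector [0, 0, -1, 0].
est = imle_grad(theta, lambda z: np.array([0.0, 0.0, -1.0, 0.0]))
```

Descending along the estimate increases the score of item 2 and decreases the score of the currently selected but unhelpful item 1, as desired. Note that the estimate is zero whenever the two MAP calls agree, so too small a $\lambda$ yields mostly zero gradients.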

Adaptive I-MLE (AIMLE) extends this by adapting the perturbation scale $\lambda$ during training, explicitly trading off sparsity and bias in the estimate, and by also considering centered finite differences (Minervini et al., 2022).

Gumbel-Softmax / Gumbel-Sigmoid

For a categorical variable with logits $\ell_{\theta}$, sample i.i.d. Gumbel noise $g_{i} = - \log(-\log u_{i})$ with $u_{i} \sim \mathrm{Uniform}(0, 1)$, and define

\[y_{i} = \frac{\exp((\ell_{\theta, i} + g_{i}) / \tau)} {\sum_{j} \exp((\ell_{\theta, j} + g_{j}) / \tau)}.\]

Then $y$ is a differentiable function of $\theta$ and noise $g$, so the pathwise gradient applies:

\[\nabla_{\theta} J(\theta) = \mathbb{E}_{g} \left[ \nabla_{\theta} f(y_{\theta}(g)) \right] = \mathbb{E}_{g} \left[ \frac{\partial f}{\partial y} \frac{\partial y_{\theta}(g)}{\partial \theta} \right].\]

This is the pathwise, or reparameterization, gradient estimator underlying Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017).
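Sampling a relaxed one-hot vector is a few lines; the logits and temperature below are illustrative, and the softmax is stabilized by subtracting the maximum:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, seed=0):
    # Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau).
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))              # standard Gumbel noise
    t = (logits + g) / tau
    t = t - t.max()                      # stabilize the softmax
    y = np.exp(t)
    return y / y.sum()

y = gumbel_softmax(np.array([1.0, 0.5, -0.5]))
```

As $\tau \to 0$ the samples approach exact one-hot vectors (recovering the Gumbel-max trick), but the gradient variance grows, so $\tau$ trades off bias against variance.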

In the binary case, often called Gumbel-Sigmoid or Binary Concrete, sample logistic noise $\lambda = \log u - \log(1-u)$ with $u \sim \mathrm{Uniform}(0, 1)$ and define

\[y = \sigma \left( \frac{a_{\theta} + \lambda}{\tau} \right).\]

Then the same reparameterization argument gives

\[\nabla_{\theta} \mathbb{E}_{\lambda}[f(y_{\theta}(\lambda))] = \mathbb{E}_{\lambda} \left[ \frac{\partial f}{\partial y} \frac{\partial y_{\theta}(\lambda)}{\partial \theta} \right].\]
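The binary case is correspondingly simple; the logit and temperature below are illustrative:

```python
import numpy as np

def gumbel_sigmoid(a, tau=0.5, seed=0):
    # Binary Concrete sample: sigmoid((a + logistic noise) / tau).
    rng = np.random.default_rng(seed)
    u = rng.uniform()
    lam = np.log(u) - np.log(1.0 - u)    # Logistic(0, 1) noise
    return 1.0 / (1.0 + np.exp(-(a + lam) / tau))

y = gumbel_sigmoid(0.8)
```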

Measure-Valued Derivatives

Suppose the derivative of the measure can be decomposed as

\[\nabla_{\theta_{i}} p(x; \theta) = c_{\theta_{i}} \left[ p_{i}^{+}(x; \theta) - p_{i}^{-}(x; \theta) \right].\]

Then

\[\begin{aligned} \nabla_{\theta_{i}} J(\theta) &= \int f(x)\, \nabla_{\theta_{i}} p(x; \theta)\, dx \\ &= c_{\theta_{i}} \left( \mathbb{E}_{p_{i}^{+}(x; \theta)}[f(x)] - \mathbb{E}_{p_{i}^{-}(x; \theta)}[f(x)] \right). \end{aligned}\]

So the gradient becomes a scaled difference of two expectations under the positive and negative parts of the weak derivative (Mohamed et al., 2020).
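The simplest instance is a Bernoulli variable: for $z \sim \mathrm{Bernoulli}(p)$, the weak derivative of the measure with respect to $p$ decomposes with $c = 1$, $p^{+} = \delta_{1}$, and $p^{-} = \delta_{0}$, so no sampling is needed at all in this one-dimensional case:

```python
def mvd_grad_bernoulli(f):
    # d/dp E[f(z)] = f(1) - f(0): the difference of expectations under the
    # positive part (delta_1) and negative part (delta_0) of the derivative.
    return f(1.0) - f(0.0)

g = mvd_grad_bernoulli(lambda z: (z - 0.3)**2)
```

In higher dimensions each parameter $\theta_{i}$ contributes its own positive/negative pair, so the cost of a measure-valued estimate grows with the number of parameters, which is its main practical limitation.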
