Optimization in PRISTINE¶
This document provides a mathematical and algorithmic breakdown of the optimizer used in the PRISTINE framework. The optimizer is designed for maximum-likelihood inference tasks where model parameters are fitted by minimizing a differentiable loss function, typically derived from the negative log-likelihood of observed data under a probabilistic model.
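For example, if \(D\) denotes the observed data and \(p(D \mid \theta)\) the model likelihood, a loss of this kind takes the generic form

\[
\mathcal{L}(\theta) = -\log p(D \mid \theta),
\]

so that minimizing \(\mathcal{L}\) is equivalent to maximum-likelihood estimation of \(\theta\).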
Overview¶
The `Optimizer` class in `optimize.py` implements robust gradient-based parameter optimization using the Adam algorithm with adaptive learning rate control and backtracking. It targets the minimization of a differentiable loss function \(\mathcal{L}(\theta)\), where \(\theta \in \mathbb{R}^n\) is the vector of free model parameters. The objective is:

\[
\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta).
\]
The optimizer operates in discrete steps and includes:

- Adam update rule with momentum and adaptive variance normalization
- Learning rate backtracking when \(\mathcal{L}(\theta_{t+1}) > \mathcal{L}(\theta_t) + \varepsilon\)
- Early stopping on convergence or a too-small learning rate
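In pseudocode, the overall loop can be sketched as follows; all names (`loss_fn`, `grad_fn`, `adam_step`) and the default constants are placeholders rather than the actual PRISTINE API, and Adam's moment bookkeeping is elided. The individual steps are detailed in the sections below.

```python
# Schematic only: placeholder names, Adam moment state omitted for brevity.
def optimize(theta, loss_fn, grad_fn, adam_step,
             alpha=1e-2, rho=0.5, eps=1e-10, alpha_min=1e-8, max_iter=1000):
    loss = loss_fn(theta)
    for _ in range(max_iter):
        candidate = adam_step(theta, grad_fn(theta), alpha)  # Adam update
        new_loss = loss_fn(candidate)
        if new_loss > loss + eps:            # loss got worse: backtrack
            alpha *= rho                     # shrink the learning rate ...
            if alpha < alpha_min:
                break                        # ... and stop if it is too small
        else:
            theta, loss = candidate, new_loss   # accept the step
    return theta, loss
```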
Mathematical Formulation and Inner Workings¶
1. Objective¶
Given a differentiable model loss \(\mathcal{L}(\theta)\), compute the gradient at the current iterate \(\theta_t\):

\[
g_t = \nabla_{\theta}\,\mathcal{L}(\theta_t).
\]
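With PyTorch (which the implementation notes below indicate is used), this gradient is obtained by automatic differentiation. A minimal sketch, with a stand-in quadratic loss in place of an actual PRISTINE model loss:

```python
import torch

def loss_fn(theta):
    # Stand-in quadratic loss; a real PRISTINE loss would be a model NLL.
    return (theta ** 2).sum()

theta = torch.randn(5, requires_grad=True)   # free parameters, theta in R^n
loss = loss_fn(theta)                        # L(theta_t)
(g_t,) = torch.autograd.grad(loss, theta)    # g_t = grad_theta L(theta_t)
```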
2. Adam Update Rule¶
Maintain exponential moving averages of the gradient and the squared gradient:

\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.
\]

Bias-corrected estimates:

\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}.
\]

Parameter update:

\[
\theta_{t+1} = \theta_t - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon_{\mathrm{Adam}}},
\]

where \(\alpha\) is the learning rate and \(\epsilon_{\mathrm{Adam}}\) is a small constant for numerical stability (not to be confused with the backtracking tolerance \(\varepsilon\) below).
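A direct transcription of these three equations into code looks like the following sketch; it is independent of the actual `Optimizer` class, works on either PyTorch tensors or NumPy arrays, and the default hyperparameter values are the commonly used ones rather than PRISTINE's.

```python
def adam_step(theta, g, m, v, t, alpha=1e-2,
              beta1=0.9, beta2=0.999, eps_adam=1e-8):
    """One Adam update of theta given gradient g at (1-based) step t."""
    m = beta1 * m + (1 - beta1) * g          # first-moment moving average m_t
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment moving average v_t
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps_adam)
    return theta, m, v
```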
3. Loss Evaluation and Backtracking¶
After updating parameters:
- Evaluate \(\mathcal{L}(\theta_{t+1})\)
- If \(\mathcal{L}(\theta_{t+1}) > \mathcal{L}(\theta_t) + \varepsilon\), reduce learning rate \(\alpha \leftarrow \rho \alpha\), revert update, retry
Repeat until either:

- the improvement is sufficient (the step is accepted), or
- the learning rate falls below \(\alpha_{\min}\), in which case optimization stops.
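One way to express this accept/backtrack logic is sketched below, assuming the optimizer state (the Adam moments) can be snapshotted and restored; `step_fn` and the constants are placeholders, not the PRISTINE API.

```python
import copy

def backtracking_step(theta, loss, loss_fn, step_fn, state,
                      alpha, rho=0.5, eps=1e-10, alpha_min=1e-8):
    """Propose an update; shrink the learning rate and retry while the loss worsens."""
    while alpha >= alpha_min:
        saved_theta, saved_state = theta.clone(), copy.deepcopy(state)
        theta_new, state = step_fn(theta, state, alpha)   # proposed Adam update
        new_loss = loss_fn(theta_new)
        if new_loss <= loss + eps:                        # sufficient improvement
            return theta_new, new_loss, state, alpha      # accept
        theta, state = saved_theta, saved_state           # revert the update
        alpha *= rho                                      # backtrack: shrink LR
    return theta, loss, state, alpha                      # LR too small: give up
```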
Practical Safeguards¶
- Gradient sanity checks: abort if NaNs or infinities are found in parameters or gradients (see the sketch after this list)
- Adaptive LR recovery: the learning rate is increased again once updates stabilize
- Stopping conditions:
- Max iterations
- Loss convergence
- Learning rate decay limit
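The gradient sanity check can be expressed with `torch.isfinite`; the snippet below is a generic sketch rather than the exact check used in PRISTINE.

```python
import torch

def all_finite(*tensors):
    """True only if no tensor contains NaN or +/- infinity."""
    return all(torch.isfinite(t).all() for t in tensors)

# Inside the optimization loop (sketch):
#     if not all_finite(theta, grad):
#         raise RuntimeError("non-finite parameters or gradients; aborting")
```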
Python Implementation Notes¶
- Uses `torch.optim.Adam`
- Backtracking loop wraps optimizer step with retry logic
- Convergence threshold \(\varepsilon\) and backtrack decay \(\rho\) are configurable
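Putting these notes together, a backtracking wrapper around `torch.optim.Adam` can be sketched as follows. This is not the actual `Optimizer` class: `loss_fn` is a placeholder, the tolerance and decay values are illustrative, and the adaptive LR recovery mentioned above is omitted.

```python
import copy
import torch

def fit(theta0, loss_fn, alpha=1e-2, rho=0.5, eps=1e-10,
        alpha_min=1e-8, max_iter=1000):
    theta = theta0.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=alpha)
    loss = loss_fn(theta).item()
    for _ in range(max_iter):
        # Snapshot parameters and Adam state so the step can be reverted.
        theta_backup = theta.detach().clone()
        opt_backup = copy.deepcopy(opt.state_dict())

        opt.zero_grad()
        loss_fn(theta).backward()
        opt.step()

        new_loss = loss_fn(theta).item()
        if new_loss > loss + eps:                  # worse: backtrack
            with torch.no_grad():
                theta.copy_(theta_backup)          # revert parameters
            opt.load_state_dict(opt_backup)        # revert Adam moments (and lr)
            for group in opt.param_groups:
                group["lr"] *= rho                 # shrink the learning rate
            if opt.param_groups[0]["lr"] < alpha_min:
                break                              # learning rate too small: stop
        else:
            loss = new_loss                        # accept the step
    return theta.detach(), loss
```

Restoring the full optimizer `state_dict` on a rejected step also resets Adam's moment estimates, so the retried update is consistent with the reduced learning rate.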
Applicability¶
This optimizer is well suited to:

- Fitting parameters in models with complex loss surfaces
- Phylogenetic likelihood inference
- Problems requiring numerical stability in curved or sloppy loss landscapes