Topic 01: Project — Optimization for Advanced Regression¶
1. Objective¶
Apply advanced optimization techniques (custom optimizers and second-order approximations) to a high-dimensional regression task: the Kaggle 'House Prices: Advanced Regression Techniques' dataset. You will investigate how second-order information (Gauss-Newton) affects convergence speed and stability compared to standard first-order methods (Adam, SGD).
2. Dataset Setup¶
- Source: Kaggle House Prices
- Task: Predict `SalePrice` based on 79 explanatory variables.
- Challenges: High dimensionality, categorical features, non-linear relationships.
Data Exploration Tips¶
- Skewness: The target `SalePrice` is often right-skewed. Apply a \(\log(1+x)\) transformation to normalize it.
- Missing Values: Many features have `NaN` values. Some signify "None" (e.g., `PoolQC`), while others are truly missing data.
- Multicollinearity: Features like `GarageCars` and `GarageArea` are highly correlated. Consider dropping one or using PCA.
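A minimal pandas sketch of these checks, assuming the competition's `train.csv` has been downloaded locally (the file path is an assumption; the column names follow the Kaggle data dictionary):

```python
import numpy as np
import pandas as pd

# Load the Kaggle training split (path is an assumption; adjust to your setup).
df = pd.read_csv("train.csv")

# Right-skewed target: log(1 + x) compresses the long tail of expensive houses.
df["SalePrice"] = np.log1p(df["SalePrice"])

# NaN in PoolQC means "no pool" rather than missing data, so fill it with an explicit category.
df["PoolQC"] = df["PoolQC"].fillna("None")

# Inspect the correlation between the two garage features before dropping one.
print(df[["GarageCars", "GarageArea"]].corr())
```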
3. Step-by-Step Implementation Guide¶
Step 1: Preprocessing Pipeline¶
Use pandas and scikit-learn to prepare the data for PyTorch.
- Fill numerical `NaN`s with the median.
- Use `OneHotEncoder` for categorical variables (handle "unknown" categories).
- Crucial: Use `StandardScaler` on all input features; second-order methods are sensitive to feature scaling.
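One way to assemble such a pipeline with scikit-learn's `ColumnTransformer` (a sketch, assuming scikit-learn ≥ 1.2 for the `sparse_output` argument; `build_preprocessor`, `num_cols`, and `cat_cols` are illustrative names, not from the original):

```python
import torch
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(num_cols, cat_cols):
    # Numerical columns: median imputation, then standardisation
    # (second-order methods are sensitive to feature scaling).
    num_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Categorical columns: fill NaN with an explicit "None" category, then one-hot encode,
    # ignoring categories that only appear at prediction time.
    cat_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="None")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer([("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)])

# Usage (X is the DataFrame of 79 features):
# pre = build_preprocessor(num_cols, cat_cols)
# X_train = torch.tensor(pre.fit_transform(X), dtype=torch.float32)
```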
Step 2: The Model¶
Implement a 3-layer MLP in PyTorch.
```python
import torch.nn as nn

class HousePriceMLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)
```
Step 3: Custom Gauss-Newton Optimizer¶
For a residual vector \(\mathbf{r}(w) = \mathbf{y}_{pred} - \mathbf{y}_{true}\), the damped Gauss-Newton update is:

\[
w_{t+1} = w_t - \left(\mathbf{J}^\top \mathbf{J} + \lambda \mathbf{I}\right)^{-1} \mathbf{J}^\top \mathbf{r}(w_t),
\]

where \(\mathbf{J} = \frac{\partial \mathbf{y}_{pred}}{\partial w}\) and \(\lambda\) is the damping term.
Implementation Detail: Use torch.func.jacrev to compute the Jacobian for a mini-batch. Note that \(\mathbf{J}^\top \mathbf{J}\) is an \(M \times M\) matrix where \(M\) is the number of parameters. For small networks, this is feasible. For larger ones, use block-diagonal approximations.
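A minimal sketch of one damped Gauss-Newton step built on `torch.func` (PyTorch ≥ 2.0). The helper name `gauss_newton_step` and the full-batch, dense \(\mathbf{J}^\top \mathbf{J}\) solve are assumptions for illustration and only practical for small networks, as noted above:

```python
import torch
from torch.func import functional_call, jacrev

def gauss_newton_step(model, X, y, lam=0.1):
    """Apply one damped Gauss-Newton update to all parameters of a small model."""
    names = [n for n, _ in model.named_parameters()]
    shapes = [p.shape for p in model.parameters()]
    sizes = [p.numel() for p in model.parameters()]
    flat = torch.cat([p.detach().reshape(-1) for p in model.parameters()])  # w, length M

    def predict(flat_w):
        # Rebuild the parameter dict from the flat vector and run the model functionally.
        chunks = torch.split(flat_w, sizes)
        params = {n: c.reshape(s) for n, c, s in zip(names, chunks, shapes)}
        return functional_call(model, params, (X,)).squeeze(-1)

    J = jacrev(predict)(flat)                 # Jacobian of predictions w.r.t. w, shape (B, M)
    r = predict(flat) - y                     # residual r(w) = y_pred - y_true
    M = flat.numel()
    # Solve the damped normal equations (JᵀJ + λI) δ = Jᵀ r.
    delta = torch.linalg.solve(J.T @ J + lam * torch.eye(M, device=flat.device), J.T @ r)
    with torch.no_grad():                     # apply the update w ← w - δ in place
        for p, c in zip(model.parameters(), torch.split(flat - delta, sizes)):
            p.copy_(c.reshape(p.shape))
```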
Step 4: Training and Monitoring¶
Run three training loops (a driver sketch follows the list):
- SGD (Learning rate \(10^{-2}\)).
- Adam (Learning rate \(10^{-3}\)).
- Gauss-Newton (\(\lambda = 0.1\)).
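A minimal full-batch driver for the three configurations (a sketch; `train_one` is an illustrative name, the Gauss-Newton branch reuses the `gauss_newton_step` helper sketched in Step 3, and in practice you would add mini-batching and a validation split):

```python
import torch

def train_one(model, method, X, y, epochs=200, lam=0.1):
    """Train with one of the three configurations and record the training loss per epoch."""
    loss_fn = torch.nn.MSELoss()
    opt = None
    if method == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    elif method == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = []
    for _ in range(epochs):
        if method == "gauss-newton":
            gauss_newton_step(model, X, y, lam=lam)    # custom second-order step (Step 3)
        else:
            opt.zero_grad()
            loss = loss_fn(model(X).squeeze(-1), y)
            loss.backward()
            opt.step()
        with torch.no_grad():                          # log the loss after the update
            history.append(loss_fn(model(X).squeeze(-1), y).item())
    return history
```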
4. Expected Results and Analysis¶
What to look for?¶
- Loss Curves: The Gauss-Newton optimizer should show much sharper initial drops in loss compared to SGD. Does it reach a lower plateau?
- Hessian Trace: Use the Hutchinson Trace Estimator (sample \(v \sim \mathcal{N}(0, I)\), compute \(v^\top \mathbf{H} v\) via two backprops) to monitor the curvature; a sketch of this estimator follows the list. Second-order methods often "ride" along narrow valleys.
- Overfitting: Does the faster convergence of Gauss-Newton lead to worse generalization on the validation set? Compare the gap \(|Loss_{train} - Loss_{val}|\).
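A sketch of the Hutchinson estimator for a PyTorch loss (the function name `hutchinson_trace` and the choice of 10 probe vectors are assumptions):

```python
import torch

def hutchinson_trace(model, loss_fn, X, y, n_probes=10):
    """Estimate tr(H) of the loss Hessian as the average of vᵀHv over random probes."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(X).squeeze(-1), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)       # first backprop
    estimate = 0.0
    for _ in range(n_probes):
        vs = [torch.randn_like(p) for p in params]                     # v ~ N(0, I)
        grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vs))
        Hv = torch.autograd.grad(grad_dot_v, params, retain_graph=True)  # second backprop: Hv
        estimate += sum((h * v).sum() for h, v in zip(Hv, vs)).item()
    return estimate / n_probes
```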
Table: Performance Comparison¶
| Optimizer | Final RMSE (log scale) | Epochs to Converge | Max Eigenvalue \(\lambda_{max}\) |
|---|---|---|---|
| SGD | | | |
| Adam | | | |
| Gauss-Newton | | | |
5. Extensions (Optional)¶
- Damping Schedule: Implement a Levenberg-Marquardt schedule for \(\lambda\): increase \(\lambda\) whenever the loss increases and decrease it whenever the loss decreases (a sketch follows the list).
- K-FAC: Implement a Kronecker-factored version of the second-order update to handle wider layers more efficiently.
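For the damping schedule, a minimal sketch (the factor of 10 and the helper name `update_damping` are arbitrary illustrative choices; a stricter Levenberg-Marquardt variant would also reject the parameter update when the loss increases):

```python
def update_damping(lam, prev_loss, new_loss, factor=10.0):
    """Levenberg-Marquardt style schedule: raise λ after a bad step, lower it after a good one."""
    if new_loss > prev_loss:
        return lam * factor   # loss went up: lean toward plain gradient descent
    return lam / factor       # loss went down: lean toward the pure Gauss-Newton step
```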