This report outlines the technical implementation of a multi-task Convolutional Neural Network (CNN) designed for 24-hour atmospheric forecasting using multi-year meteorological datasets (2018–2021).
The model, WeatherCNN, uses a deep spatial feature extractor coupled with a multi-task prediction head.
The backbone consists of sequential blocks designed to downsample spatial resolution while increasing feature depth:
- Input Stage: Accepts a 42-channel tensor representing various atmospheric variables (e.g., temperature, pressure, wind).
- Layer 1: `ConvLayer` (42 $\rightarrow$ 64 filters, $3 \times 3$ kernel) followed by BatchNorm2d and ReLU.
- Pooling: `MaxPool2d` ($2 \times 2$ kernel, stride 2) reduces the spatial dimensions.
- Layer 2: `ConvLayer` (64 $\rightarrow$ 128 filters, $3 \times 3$ kernel) followed by BatchNorm2d and ReLU.
- Global Pooling: A second `MaxPool2d` layer further abstracts the spatial features.
The model uses a `LazyLinear` layer in the `FCLayer` class, which automatically infers the flattened input size from the convolutional backbone and outputs a 7-dimensional vector:
- Indices 0–5: Continuous variables (Regression).
- Index 6: Event probability (Classification).
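The architecture described above can be sketched in PyTorch as follows. Padding, input resolution, and the exact `ConvLayer`/`FCLayer` wrappers are assumptions; the report only fixes the channel counts, kernel sizes, pooling, and the 7-dimensional multi-task output.

```python
import torch
import torch.nn as nn

class WeatherCNN(nn.Module):
    """Sketch of the backbone + multi-task head described in the report."""

    def __init__(self, in_channels: int = 42, out_dim: int = 7):
        super().__init__()
        self.backbone = nn.Sequential(
            # Layer 1: 42 -> 64 filters, 3x3 kernel (padding=1 is an assumption)
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Layer 2: 64 -> 128 filters, 3x3 kernel
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # LazyLinear infers the flattened feature size on the first forward pass,
        # so the head is independent of the input's spatial resolution.
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output: indices 0-5 = regression targets, index 6 = event logit
        return self.head(self.backbone(x))

model = WeatherCNN()
out = model(torch.randn(2, 42, 32, 32))  # spatial size 32x32 is illustrative
print(out.shape)  # torch.Size([2, 7])
```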
Meteorological data often contains missing values that can destabilize training. This pipeline employs a three-tier cleaning strategy:
- Scanning: A utility identifies "dirty" files containing NaNs across 42 channels.
- Specific Logic: Channel 5 (DSWRF at surface) contains legitimate NaNs during nighttime. These are explicitly imputed with `0.0` within the `WeatherDataset` class.
- Filtering: Any file containing NaNs in non-exempt channels is excluded from the training indices.
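A minimal sketch of the per-sample cleaning logic, assuming samples arrive as (42, H, W) arrays with DSWRF at channel index 5. The function name and array layout are illustrative, not the report's actual `WeatherDataset` code.

```python
import numpy as np

DSWRF_CHANNEL = 5  # legitimate nighttime NaNs here; imputed rather than filtered

def clean_sample(x: np.ndarray):
    """Impute nighttime NaNs in the DSWRF channel with 0.0; report whether
    the sample is usable (i.e. no NaNs remain in any non-exempt channel)."""
    x = x.copy()
    x[DSWRF_CHANNEL] = np.nan_to_num(x[DSWRF_CHANNEL], nan=0.0)
    usable = not np.isnan(x).any()
    return x, usable

# Example: a (42, H, W) sample whose only NaN is a nighttime DSWRF value
sample = np.ones((42, 4, 4))
sample[DSWRF_CHANNEL, 0, 0] = np.nan
cleaned, ok = clean_sample(sample)  # imputed to 0.0, sample stays usable
```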
To ensure stable gradient descent, both inputs and targets are scaled:
- Inputs: Per-channel Z-score normalization using pre-computed means and standard deviations from the training set.
- Targets: Regression targets are normalized. During training, the model predicts in the normalized space, while "real-unit" MSE and RMSE are tracked for interpretability.
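The scaling described above amounts to the following helpers (hypothetical names and made-up statistics; the real per-channel means and standard deviations are pre-computed from the training set):

```python
import numpy as np

def zscore(x: np.ndarray, mean: np.ndarray, std: np.ndarray, eps: float = 1e-6):
    """Per-channel Z-score for a (C, H, W) input; mean/std are (C,) arrays."""
    return (x - mean[:, None, None]) / (std[:, None, None] + eps)

def denormalize(y_norm: np.ndarray, mean: np.ndarray, std: np.ndarray):
    """Map normalized regression outputs back to real units for RMSE reporting."""
    return y_norm * std + mean

# Illustrative 2-channel example (the real model uses 42 channels)
x = np.full((2, 3, 3), 10.0)
mean = np.array([10.0, 0.0])
std = np.array([2.0, 5.0])
x_norm = zscore(x, mean, std)  # channel 0 centers to ~0; channel 1 scales to ~2
```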
The model was trained on a single NVIDIA GPU using a multi-task loss approach.
| Component | Specification |
|---|---|
| Optimizer | AdamW |
| Learning Rate | |
| Batch Size | 128 |
| Gradient Clipping | |
| Lead Time | 24 Hours |
The training objective minimizes a combined loss:
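The report does not spell out the combined objective. A plausible formulation, assuming MSE on the six regression outputs and binary cross-entropy on the event logit, with an equal (tunable) weighting:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def combined_loss(pred: torch.Tensor, target: torch.Tensor, cls_weight: float = 1.0):
    """pred/target: (batch, 7); indices 0-5 are regression, index 6 classification.
    cls_weight is an assumed knob; the report does not state the task weighting."""
    reg_loss = mse(pred[:, :6], target[:, :6])
    cls_loss = bce(pred[:, 6], target[:, 6])
    return reg_loss + cls_weight * cls_loss
```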
Training logs below combine work from two setups: the original 42-channel report (Darius) and Subhanga's 6-feature CNN training on the Tufts CS 137 cluster.
The model showed strong convergence within the first three epochs, particularly in the classification task.
| Epoch | Avg Reg Loss (Norm) | Avg Cls Loss |
|---|---|---|
| 0 | 1.1367 | 0.1628 |
| 1 | 0.6316 | 0.0801 |
| 2 | 0.4956 | 0.0589 |
| 3 | 0.6304 | 0.0496 |
Analysis:
- Rapid convergence: The model effectively learned the primary variance of the weather patterns within two epochs, cutting the real-unit RMSE by more than half.
- Stability: BatchNorm and AdamW prevented gradient spikes despite the high dynamic range of the input weather data.
- Best model: The lowest total loss ($0.5546$) was achieved at Epoch 2, after which the model was saved as the production checkpoint.

Checkpoint: `checkpoints/best_model.pt`.
| Resource | Path |
|---|---|
| Repo (on cluster) | /cluster/tufts/c26sp1cs0137/supadh03/SkyOracle/ |
| Dataset | /cluster/tufts/c26sp1cs0137/data/assignment2_data/dataset |
| Checkpoints (general) | /cluster/tufts/c26sp1cs0137/supadh03/SkyOracle/checkpoints |
| This run (Subhanga) | /cluster/tufts/c26sp1cs0137/supadh03/SkyOracle/checkpoints/subhanga |
- GPU: NVIDIA A100-PCIE-40GB, CUDA 12.9
- PyTorch (module): `module load pytorch/2.8.0-cuda12.9-cudnn9`
- Environment: `conda activate base` on the HPC; training is driven from that stack instead of a one-off venv (avoids missing-package issues on compute nodes).
Training started ~9:13 AM targeting the A100 + CUDA 12.9 stack. Notable batch attempts:
- Job 386265 (~9:27 AM) — virtual environment not installed on the node.
- Node 386279 (~10:02 AM) — switched to conda after fixes; job 386281 (~10:08 AM).
- Job 386303 (~10:37 AM).
- `torch.OutOfMemoryError` on GPU (requested ~1.54 GiB; ~44 GiB already in use by PyTorch). Node 386627, job 386631; this led to batch/memory tuning (the PyTorch docs suggest `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` when fragmentation is an issue).
- Timeout on node 387630.
- Job 387680 — further retry.
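Putting the notes above together, a minimal batch-script sketch of the eventually working setup. The job name, GPU request, time limit, and entry point are assumptions; the module, conda environment, and allocator setting come from the notes above.

```shell
#!/bin/bash
#SBATCH --job-name=skyoracle-train   # assumed name
#SBATCH --gres=gpu:a100:1            # A100 per the hardware notes
#SBATCH --time=04:00:00              # assumed limit; one earlier attempt timed out

module load pytorch/2.8.0-cuda12.9-cudnn9
conda activate base                  # HPC stack instead of a one-off venv

# Mitigates fragmentation-driven OOMs, per the PyTorch CUDA memory docs
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python train.py                      # assumed entry point
```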
Final successful log excerpt below matches a completed run (e.g. `subhanga/logs/train_388842.out`); best weights written to `checkpoints/subhanga/best.pt`.
Python 3.11.13
torch 2.8.0+cu129
cuda available True
DATASET_DIR=/cluster/tufts/c26sp1cs0137/data/assignment2_data/dataset
CHECKPOINT_DIR=/cluster/tufts/c26sp1cs0137/supadh03/SkyOracle/checkpoints/subhanga
Device: cuda
Total time steps: 47952
Total usable inputs (t with t+24 in range): 47928
Train inputs: 21672 years=(2018, 2019, 2020)
Val inputs : 8760 years=(2021,)
Targets shape: (47952, 6)
Binary label positives: 988/47952
Train inputs after NaN-target filter: 21144
Val inputs after NaN-target filter : 8760
Loaded channel normalization stats from channel_stats.pt
Mixed precision (autocast): enabled, dtype=torch.bfloat16
| Epoch | train_rmse (norm) | val_rmse (norm) | Best checkpoint (val MSE) |
|---|---|---|---|
| 0 | 0.7985 | 0.8810 | 0.776150 |
| 1 | 0.7631 | 0.8706 | 0.757902 |
| 2 | 0.7515 | 0.8697 | 0.756311 |
| 3 | 0.7409 | 0.8527 | 0.727078 |
| 4 | 0.7316 | 0.8559 | — |
| 5 | 0.7255 | 0.8477 | 0.718675 |
| 6 | 0.7195 | 0.8571 | — |
| 7 | 0.7135 | 0.8681 | — |
| 8 | 0.7096 | 0.8628 | — |
| 9 | 0.7024 | 0.8671 | — |
Training completed with "Done." Job finished: Thu Apr 2 02:58:42 PM EDT 2026.
Takeaways: mixed precision (bfloat16 autocast), filtered NaN targets (21144 train / 8760 val), and channel normalization stats loaded from `channel_stats.pt`. Best validation MSE in this run was 0.718675 at epoch 5 (`checkpoints/subhanga/best.pt`).
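A sketch of one training step under bfloat16 autocast, matching the run's "Mixed precision (autocast): enabled, dtype=torch.bfloat16" log line. The function signature and the clipping norm are assumptions (the report's table leaves the clipping value blank); unlike float16, bfloat16 autocast needs no GradScaler.

```python
import torch

def train_step(model, inputs, targets, optimizer, loss_fn, device="cuda"):
    """One optimization step with bfloat16 autocast and gradient clipping."""
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs eligible ops in bfloat16; no GradScaler required
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)
    loss.backward()
    # max_norm=1.0 is an assumed value; the report does not state it
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```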