Description
I noticed that in the SimpleLinear implementation, the weights are initialized using:

```python
self.weight = torch.randn(out_features, in_features) * (1 / math.sqrt(in_features))
```
Analysis
This formula corresponds to Xavier-style initialization (more precisely, the LeCun/fan-in-only variant, std = 1/√in_features), which is tuned for symmetric activation functions like tanh. However, since many problems in this repo (like GPT blocks or standard MLPs) use ReLU/GELU, Kaiming (He) initialization would be more appropriate.
According to He et al. (2015), the standard deviation should instead be $\sqrt{2 / n}$, where $n$ is the fan-in, to compensate for the variance halving caused by ReLU zeroing out roughly half of its inputs.
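The variance-halving fact is easy to check empirically. A minimal sketch (standalone NumPy, not code from the repo): for zero-mean Gaussian pre-activations, ReLU discards the negative half of the distribution, so the second moment of the output is half the input variance.

```python
import numpy as np

# For x ~ N(0, 1), ReLU zeroes the negative half symmetrically,
# so E[relu(x)^2] = Var(x) / 2 — the factor He init's sqrt(2/n) undoes.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

relu_x = np.maximum(x, 0.0)
second_moment = np.mean(relu_x ** 2)

print(f"Var(x)       = {np.var(x):.3f}")          # ~1.0
print(f"E[relu(x)^2] = {second_moment:.3f}")      # ~0.5
```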
Proposed Change
Update the initialization to:

```python
self.weight = torch.randn(out_features, in_features) * math.sqrt(2.0 / in_features)
```
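To illustrate why this matters in depth, here is a hypothetical NumPy sketch (layer width, depth, and batch size are made up for illustration, not taken from the repo): pushing unit-scale inputs through a stack of linear + ReLU layers, the current 1/√n scale makes the activation magnitude decay roughly as 2^(−depth), while the √(2/n) scale keeps it O(1).

```python
import numpy as np

def mean_square_after_stack(scale_fn, n=256, depth=20, batch=1024, seed=0):
    """Mean squared activation after `depth` linear+ReLU layers of width n."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, n))          # unit-scale inputs
    for _ in range(depth):
        w = rng.normal(size=(n, n)) * scale_fn(n)
        h = np.maximum(h @ w.T, 0.0)         # linear layer followed by ReLU
    return np.mean(h ** 2)

current = mean_square_after_stack(lambda n: 1.0 / np.sqrt(n))    # 1/sqrt(n)
proposed = mean_square_after_stack(lambda n: np.sqrt(2.0 / n))   # sqrt(2/n)

print(f"E[h^2] after 20 layers, 1/sqrt(n) scale: {current:.2e}")   # shrinks ~2^-20
print(f"E[h^2] after 20 layers, sqrt(2/n) scale: {proposed:.2f}")  # stays O(1)
```

(PyTorch also ships this as a built-in: `torch.nn.init.kaiming_normal_` with `nonlinearity='relu'`.)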
What do you think? Is the current Xavier-style initialization a conscious design choice or an oversight?