Description
I noticed that in the SimpleLinear implementation, the weights are initialized using:

```python
self.weight = torch.randn(out_features, in_features) * (1 / math.sqrt(in_features))
```
Analysis
This formula corresponds to Xavier-style initialization (more precisely, the LeCun/fan-in-only variant, std = 1/√in_features), which is tuned for symmetric activation functions like tanh. However, since many problems in this repo (like GPT blocks or standard MLPs) use ReLU/GELU, Kaiming (He) initialization would be more appropriate.
According to He et al. (2015), the standard deviation should instead be $\sqrt{2 / n}$, where $n$ is the fan-in, to compensate for the variance halving caused by ReLU zeroing out roughly half of its inputs.
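The variance-halving fact is easy to check empirically. A minimal sketch (standalone NumPy, not code from the repo): for zero-mean Gaussian pre-activations, ReLU discards the negative half of the distribution, so the second moment of the output is half the input variance.

```python
import numpy as np

# For x ~ N(0, 1), ReLU zeroes the negative half symmetrically,
# so E[relu(x)^2] = Var(x) / 2 — the factor He init's sqrt(2/n) undoes.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

relu_x = np.maximum(x, 0.0)
second_moment = np.mean(relu_x ** 2)

print(f"Var(x)       = {np.var(x):.3f}")          # ~1.0
print(f"E[relu(x)^2] = {second_moment:.3f}")      # ~0.5
```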
Proposed Change
Update the initialization to:

```python
self.weight = torch.randn(out_features, in_features) * math.sqrt(2.0 / in_features)
```
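To illustrate why this matters in depth, here is a hypothetical NumPy sketch (layer width, depth, and batch size are made up for illustration, not taken from the repo): pushing unit-scale inputs through a stack of linear + ReLU layers, the current 1/√n scale makes the activation magnitude decay roughly as 2^(−depth), while the √(2/n) scale keeps it O(1).

```python
import numpy as np

def mean_square_after_stack(scale_fn, n=256, depth=20, batch=1024, seed=0):
    """Mean squared activation after `depth` linear+ReLU layers of width n."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, n))          # unit-scale inputs
    for _ in range(depth):
        w = rng.normal(size=(n, n)) * scale_fn(n)
        h = np.maximum(h @ w.T, 0.0)         # linear layer followed by ReLU
    return np.mean(h ** 2)

current = mean_square_after_stack(lambda n: 1.0 / np.sqrt(n))    # 1/sqrt(n)
proposed = mean_square_after_stack(lambda n: np.sqrt(2.0 / n))   # sqrt(2/n)

print(f"E[h^2] after 20 layers, 1/sqrt(n) scale: {current:.2e}")   # shrinks ~2^-20
print(f"E[h^2] after 20 layers, sqrt(2/n) scale: {proposed:.2f}")  # stays O(1)
```

(PyTorch also ships this as a built-in: `torch.nn.init.kaiming_normal_` with `nonlinearity='relu'`.)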
What do you think? Is the current Xavier-style initialization a conscious design choice or an oversight?