This repository contains experimental code for comparing standard and novel non-linear activation functions within Transformer models, specifically using a GPT-2 backbone. The project investigates how different activations affect training dynamics, performance, and stability.
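As a rough illustration of the idea, a GPT-2-style feed-forward block can be written with a pluggable activation so that standard and candidate non-linearities are compared under identical conditions. The sketch below is a minimal PyTorch example, not this repository's actual code; the names `ConfigurableMLP`, `d_model`, and `activation` are illustrative assumptions.

```python
# Minimal sketch (assumed, not the repo's implementation) of a GPT-2-style
# MLP block with a swappable activation function.
import torch
import torch.nn as nn


class ConfigurableMLP(nn.Module):
    """GPT-2-style feed-forward block with a pluggable activation."""

    def __init__(self, d_model: int, activation: nn.Module):
        super().__init__()
        self.fc_in = nn.Linear(d_model, 4 * d_model)   # 4x expansion, as in GPT-2
        self.act = activation                          # e.g. nn.GELU() or a novel activation
        self.fc_out = nn.Linear(4 * d_model, d_model)  # project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(self.act(self.fc_in(x)))


# Run standard vs. alternative activations through the same block shape.
for act in (nn.GELU(), nn.ReLU(), nn.SiLU()):
    mlp = ConfigurableMLP(d_model=768, activation=act)  # 768 = GPT-2 small width
    out = mlp(torch.randn(2, 16, 768))                  # (batch, seq, d_model)
    print(type(act).__name__, out.shape)
```

Keeping the activation as a constructor argument means every other part of the block (widths, initialization, training loop) stays fixed, so observed differences in training dynamics can be attributed to the activation itself.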