Current artificial intelligence models rely on billions of trainable parameters to tackle challenging tasks. However, this large parameter count comes with a hefty cost: training and deploying such models demands immense memory and compute, which can only be provided by hangar-sized data centers in processes that consume as much energy as a midsized city.
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity is defined primarily along two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these…
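As a rough illustration of these two dimensions (not taken from the passage above), a common rule of thumb is that a dense Transformer spends roughly 2N FLOPs per token on a forward pass and roughly 6N per token during training, where N is the parameter count. The sketch below assumes those rule-of-thumb constants and shows how, for a dense model, compute per example grows in lockstep with parameters.

```python
# Minimal sketch, assuming the common approximations
# FLOPs_forward ~= 2 * N and FLOPs_train ~= 6 * N per token,
# where N is the (non-embedding) parameter count of a dense Transformer.

def flops_per_token(num_params: int, training: bool = False) -> float:
    """Rough FLOPs spent processing a single token."""
    return (6 if training else 2) * num_params

# For a dense model, compute per example scales directly with parameter count:
for n in (1e9, 10e9, 70e9):
    gflops = flops_per_token(int(n)) / 1e9
    print(f"{n / 1e9:>4.0f}B params -> ~{gflops:.0f} GFLOPs per token (forward)")
```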
Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameters, the outliers, are disproportionately important to model quality. Because LLMs contain billions of parameters, even a fraction as small as 0.01% translates to hundreds of thousands of parameters. In this work, we present…
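To make the arithmetic concrete, 0.01% of a 7-billion-parameter model is 700,000 parameters. The sketch below is a hypothetical illustration of selecting such a fraction by weight magnitude; the works referenced above may identify outliers by other criteria (for example, activation statistics), so the toy model and the magnitude criterion here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: mark the top 0.01% of weights by absolute magnitude
# as candidate "outlier" parameters. The model and the criterion are
# illustrative assumptions, not the method of any particular paper.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
total = all_weights.numel()
k = max(1, int(total * 1e-4))                      # 0.01% of all parameters
threshold = torch.topk(all_weights, k).values.min()

print(f"{total:,} parameters; keeping the {k:,} largest-magnitude ones")
print(f"magnitude threshold: {threshold.item():.4f}")
```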
The “weights” of a neural network are referred to as “parameters” in PyTorch code, and they are tuned by the optimizer during training. In contrast, hyperparameters are the settings of a neural network that are fixed by design and not tuned by training. Examples are the number of hidden layers and…
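A minimal PyTorch sketch of this distinction follows; the layer sizes and learning rate are illustrative choices, not values from the text. The tensors returned by model.parameters() are the trainable weights and biases the optimizer updates at every step, while the hyperparameters are chosen by hand and never changed by training.

```python
import torch
import torch.nn as nn

hidden_layers = 2          # hyperparameter: fixed by design, not trained
hidden_size = 128          # hyperparameter
learning_rate = 1e-3       # hyperparameter

# Build a small feed-forward network from the hyperparameters above.
layers = [nn.Linear(784, hidden_size), nn.ReLU()]
for _ in range(hidden_layers - 1):
    layers += [nn.Linear(hidden_size, hidden_size), nn.ReLU()]
layers.append(nn.Linear(hidden_size, 10))
model = nn.Sequential(*layers)

# The trainable parameters (weights and biases) are registered on the module
# and handed to the optimizer, which updates them during training.
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
print(sum(p.numel() for p in model.parameters()), "trainable parameters")
```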