The Generalist Language Model
About Google GLaM
Google GLaM is a language model built on a mixture-of-experts (MoE) architecture. Rather than one monolithic network, it contains many submodels (experts), each tailored to different kinds of input. A gating network manages the experts, deciding which of them to activate for the data being processed: for each token (usually a word or subword), it selects the two most appropriate experts to handle it. In its full version, GLaM has 1.2T parameters in total, distributed across 32 MoE layers with 64 experts each. During inference, however, only a fraction of these parameters, roughly 97B (about 8% of 1.2T), is activated for each token prediction.
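The top-2 routing described above can be sketched in a few lines. The following is a minimal toy illustration, not GLaM's actual implementation: the expert and gate sizes are made up for brevity (GLaM uses 64 experts per MoE layer; here we use 4), and each "expert" is just a single weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2  # toy sizes; GLaM uses 64 experts with top-2 routing

# Each expert is reduced to a single weight matrix for brevity.
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate_weights = rng.normal(size=(d_model, n_experts))

def moe_layer(token):
    """Route one token vector through the top-2 experts chosen by the gate."""
    logits = token @ gate_weights                # one gating score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the two highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # renormalize over the chosen pair
    # Only the selected experts run; the others stay idle (sparse activation),
    # which is why far fewer parameters are touched per token than exist in total.
    return sum(w * (token @ expert_weights[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape)
```

Because only `top_k` of the `n_experts` expert matrices are multiplied per token, the compute per token scales with the chosen experts, not the total parameter count, mirroring GLaM's 97B-of-1.2T activation ratio.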