Peter Izsak, Moshe Berchansky, Omer Levy

Abstract
While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| linguistic-acceptability-on-cola | 24hBERT | Accuracy: 57.1 |
| natural-language-inference-on-multinli | 24hBERT | Matched: 84.4 Mismatched: 83.8 |
| natural-language-inference-on-qnli | 24hBERT | Accuracy: 90.6 |
| natural-language-inference-on-rte | 24hBERT | Accuracy: 57.7 |
| question-answering-on-quora-question-pairs | 24hBERT | Accuracy: 70.7 |
| semantic-textual-similarity-on-mrpc | 24hBERT | Accuracy: 87.5 |
| semantic-textual-similarity-on-sts-benchmark | 24hBERT | Pearson Correlation: 0.820 |
| sentiment-analysis-on-sst-2-binary | 24hBERT | Accuracy: 93.0 |