BERT
References
- BERT explained: concepts, principles, and applications (_StarryNight_'s blog, CSDN)
- Hugging Face's explainer on BERT: BERT 101 - State Of The Art NLP Model Explained
Hugging Face model-training workflow
- GitHub - datawhalechina/learn-nlp-with-transformers: we want to create a repo…
- a series of learning materials on transformers
Training resources
Multi-GPU training
BERT_BASE
Training
Hardware and time
Original text

Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.
Pretraining corpus
Original text

The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.
- BooksCorpus (800M words)
- Wikipedia (2,500M words)
- 2.5B + 0.8B = 3.3B
- BooksCorpus consists mainly of fiction (novel-style) text
Training hyperparameters
Original text
We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.
- batch size: 256 sequences (≈128,000 tokens/batch)
- steps: 1,000,000 (≈40 epochs over the corpus)
- corpus size: 3.3B words
- optimizer: Adam (β1 = 0.9, β2 = 0.999, L2 weight decay 0.01)
- learning rate: 1e-4, linear warmup over the first 10,000 steps, then linear decay
- dropout 0.1 on all layers; GeLU activation instead of ReLU
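The learning-rate schedule quoted above (linear warmup over the first 10,000 steps, then linear decay to zero over the 1,000,000 training steps) can be sketched in plain Python; the function name and defaults are ours, taken from the numbers in the quote:

```python
def bert_pretrain_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr over the first `warmup_steps`,
    then linear decay down to 0 at `total_steps`."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, halfway through warmup (step 5,000) the rate is 5e-5, it peaks at 1e-4 at step 10,000, and it reaches 0 at step 1,000,000.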
Model architecture and layer counts
Original text

SciBERT
Parameters
- based on the BERT_BASE parameters (weights)
- trained with the original BERT codebase
Practical notes
- training on long sequences is slow
How SciBERT handles long sequences
- first train on sequences of length <= 128
- once the train loss stops decreasing, continue training on sequences of length 128-512
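The "train at length 128 until the loss stops decreasing" rule needs a concrete stopping criterion; a minimal sketch of one (the window size and tolerance are our assumptions, not from the paper):

```python
def loss_plateaued(losses, window=3, tol=1e-3):
    """Return True once the train loss has improved by less than `tol`
    over the last `window` recorded evaluations."""
    if len(losses) <= window:
        return False
    return losses[-window - 1] - losses[-1] < tol
```

Once this returns True for the max-length-128 phase, the data pipeline switches to sequences of length 128-512 and training continues from the same checkpoint.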
Training
Hardware
Original text
We use a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512). The BASEVOCAB models take 2 fewer days of training because they aren't trained from scratch.
- a single TPU v3 with 8 cores
Time
Training the SCIVOCAB models:
- from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512)
The BASEVOCAB models:
- take 2 fewer days of training because they aren't trained from scratch
- i.e., training initialized from BERT_BASE took 5 days
Training data
Pretraining
1.14M papers from Semantic Scholar

Training hyperparameters
Finetuning
In all settings, we apply a dropout of 0.1 and optimize cross entropy loss using Adam (Kingma and Ba, 2015). We finetune for 2 to 5 epochs using a batch size of 32 and a learning rate of 5e-6, 1e-5, 2e-5, or 5e-5 with a slanted triangular schedule (Howard and Ruder, 2018) which is equivalent to the linear warmup followed by linear decay (Devlin et al., 2019). For each dataset and BERT variant, we pick the best learning rate and number of epochs on the development set and report the corresponding test results.
We found the setting that works best across most datasets and models is 2 or 4 epochs and a learning rate of 2e-5. While task-dependent, optimal hyperparameters for each task are often the same across BERT variants.
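The dev-set selection described above is a small grid search over learning rate and epoch count; a sketch, where `dev_score` stands in for a full finetune-and-evaluate run (a hypothetical callable, not a transformers API):

```python
from itertools import product

LEARNING_RATES = [5e-6, 1e-5, 2e-5, 5e-5]  # grid from the paper
EPOCH_OPTIONS = [2, 3, 4, 5]               # "2 to 5 epochs"

def pick_best(dev_score):
    """Try every (lr, epochs) pair and keep the one with the best dev-set score."""
    return max(product(LEARNING_RATES, EPOCH_OPTIONS),
               key=lambda cfg: dev_score(*cfg))
```

In practice each `dev_score(lr, epochs)` call is itself a finetuning run, so this grid costs 16 runs per dataset and BERT variant.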
Training command
Vocabulary file
Training BERT with Hugging Face transformers
Why multi-GPU training can be slower than single-GPU
Cause
Multi-GPU training in Hugging Face transformers involves data-parallel communication
- under some conditions this communication overhead becomes the bottleneck on training speed
Speedup
Use the DDP (DistributedDataParallel) mode instead of DP (DataParallel): DP runs in a single process and scatters inputs and gathers outputs through one GPU on every step, while DDP runs one process per GPU and only all-reduces gradients.
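With transformers' Trainer the training script itself needs no DDP-specific code; what changes is the launcher. A sketch assuming torch and transformers are installed (the model name and `train_dataset` are placeholders):

```python
# Launch with one process per GPU so Trainer uses DDP:
#   torchrun --nproc_per_node=4 train.py
# Running plain `python train.py` on a multi-GPU machine instead falls back to
# single-process DataParallel, whose per-step scatter/gather is the overhead
# discussed above.
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,    # batch size per GPU process under DDP
    ddp_find_unused_parameters=False,  # avoid an extra graph scan each step
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```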
Author
Last updated 2024-07-16 (7f33ae8)