BERT base

Training

Machines and training time

  • Original text

    img/bert_training_machine_usage2023-07-18_10-20-58_screenshot.png

    Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.

Training corpus (pretraining data)

  • Original text

    img/2023-07-18_10-32-30_screenshot.png

    The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

    • BooksCorpus (800M words)
    • Wikipedia (2,500M words)
    • 2.5B + 0.8B = 3.3B
  • BooksCorpus consists mainly of fiction text

Training hyperparameters

  • Original text

    We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

    • batch size: 256
    • epochs: 40
    • training tokens (words): 3.3B
    • optimizer: Adam
    • learning rate: 1e-4
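The quoted numbers fully determine the learning-rate schedule: linear warmup over the first 10,000 steps, then linear decay to zero at step 1,000,000. A minimal sketch (assuming the decay is purely linear down to zero, as in the original BERT code):

```python
def bert_lr(step, base_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to base_lr, then linear decay to zero, per the quoted setup."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Sanity check on the "approximately 40 epochs" claim:
# 128,000 tokens/batch * 1,000,000 steps / 3.3e9 words ≈ 38.8 epochs.
```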

Model architecture and number of layers

  • Original text

    img/2023-07-18_10-43-07_screenshot.png

SciBERT

Parameters

  1. Based on the BERT-base parameters (weights)
  2. Trained with the original BERT code

Lessons learned

  1. Training on long sentences is slow
  2. How SciBERT handles the long-sentence problem

    • First train on sentences of length <= 128
    • Once the train loss stops decreasing, continue training on sentences of length 128–512
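The two-stage length curriculum above can be sketched as a simple plateau rule. This is a toy illustration, not SciBERT's actual training code; `plateaued`, `curriculum`, and the loss stream are all hypothetical names:

```python
def plateaued(losses, window=3, tol=1e-3):
    """True once the train loss has stopped improving over the last `window` steps."""
    if len(losses) < window + 1:
        return False
    return losses[-window - 1] - losses[-1] < tol

def curriculum(loss_stream):
    """Stage 1 trains with max length 128 until the loss plateaus, then stage 2 uses 512."""
    stage, max_len, losses, schedule = 1, 128, [], []
    for loss in loss_stream:
        losses.append(loss)
        if stage == 1 and plateaued(losses):
            stage, max_len, losses = 2, 512, []
        schedule.append(max_len)
    return schedule
```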

Training

Machines

  • Original text

    We use a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512). The BASEVOCAB models take 2 fewer days of training because they aren’t trained from scratch.

    • a single TPU v3 with 8 cores

Training time

  • Training the SCIVOCAB models:

    • from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512).
  • The BASEVOCAB models

    • take 2 fewer days of training because they aren’t trained from scratch
    • i.e., training starting from the BERT-base weights took 5 days

Training data

  • pretrain

    • 1.14M papers from Semantic Scholar

      img/2023-07-18_11-07-45_screenshot.png

Training hyperparameters

  • finetune

    In all settings, we apply a dropout of 0.1 and optimize cross entropy loss using Adam (Kingma and Ba, 2015). We finetune for 2 to 5 epochs using a batch size of 32 and a learning rate of 5e-6, 1e-5, 2e-5, or 5e-5 with a slanted triangular schedule (Howard and Ruder, 2018) which is equivalent to the linear warmup followed by linear decay (Devlin et al., 2019). For each dataset and BERT variant, we pick the best learning rate and number of epochs on the development set and report the corresponding test results.

    We found the setting that works best across most datasets and models is 2 or 4 epochs and a learning rate of 2e-5. While task-dependent, optimal hyperparameters for each task are often the same across BERT variants.
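The selection procedure described above — a grid over learning rates and epoch counts, picked on the dev set — can be sketched as follows. `dev_score` is a hypothetical callback that finetunes one configuration and returns the dev-set metric:

```python
import itertools

LEARNING_RATES = [5e-6, 1e-5, 2e-5, 5e-5]  # grid quoted in the paper
EPOCH_COUNTS = [2, 3, 4, 5]

def pick_best(dev_score):
    """Return the (lr, epochs) pair with the highest dev-set score."""
    return max(itertools.product(LEARNING_RATES, EPOCH_COUNTS),
               key=lambda cfg: dev_score(*cfg))
```

Per the notes, this search most often lands on 2e-5 with 2 or 4 epochs.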

Training commands

# Run BERT training for sequences of length 128
python3 run_pretraining.py \
  --input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_128/*.tfrecord \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  --do_train=True --do_eval=True \
  --bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
  --train_batch_size=256 --max_seq_length=128 --max_predictions_per_seq=20 \
  --num_train_steps=500000 --num_warmup_steps=1000 --learning_rate=1e-4 \
  --use_tpu=True --tpu_name=node-3 --tpu_zone=us-central1-a \
  --max_eval_steps=2000 --eval_batch_size 256 \
  --init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128

# Run BERT training for sequences of length 512
python3 run_pretraining.py \
  --input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_512/*.tfrecord \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_512_finetune128 \
  --do_train=True --do_eval=True \
  --bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
  --train_batch_size=64 --max_seq_length=512 --max_predictions_per_seq=75 \
  --num_train_steps=800000 --num_warmup_steps=100 --learning_rate=1e-5 \
  --use_tpu=True --tpu_name=node-1 --tpu_zone=us-central1-a \
  --max_eval_steps=2000 --eval_batch_size 64 \
  --init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_512_finetune128

Vocabulary file

Training BERT with Hugging Face Transformers

Why multi-GPU training can be slower than single-GPU

Causes

  1. Multi-GPU training in Hugging Face Transformers involves data-parallel communication between GPUs

    • Under certain conditions this overhead becomes the bottleneck for training speed

Speedup methods

  1. Use DistributedDataParallel (DDP) instead of DataParallel (DP)
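For PyTorch-based training (including the Transformers Trainer), DP runs a single process that replicates the model and gathers outputs on GPU 0 every step, while DDP runs one process per GPU and overlaps gradient all-reduce with the backward pass. A sketch of the launch difference, assuming a hypothetical training script `train.py`:

```shell
# DataParallel (slower): a single process; the model is replicated to each
# GPU every step and outputs are gathered back on GPU 0.
python3 train.py

# DistributedDataParallel (faster): one process per GPU; gradients are
# all-reduced in buckets, overlapped with backward. Launch on 4 GPUs:
torchrun --nproc_per_node=4 train.py
```

When launched via torchrun, the Transformers Trainer detects the distributed environment and wraps the model in DDP automatically.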