BERT base

Training

Machines and training time

  • Original text

    img/bert_training_machine_usage2023-07-18_10-20-58_screenshot.png

    Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.

Training corpus (pretraining data)

  • Original text

    img/2023-07-18_10-32-30_screenshot.png

    The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

    • BooksCorpus (800M words)
    • Wikipedia (2,500M words)
    • 2.5B + 0.8B = 3.3B
  • BooksCorpus consists mainly of fiction text

Training hyperparameters

  • Original text

    We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

    • batch size: 256
    • epochs: 40
    • training tokens (words): 3.3B
    • optimizer: Adam
    • learning rate: 1e-4
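The quoted numbers fully determine the learning-rate schedule: linear warmup over the first 10,000 steps, then linear decay to zero at step 1,000,000. A minimal sketch (assuming the decay is purely linear down to zero, as in the original BERT code):

```python
def bert_lr(step, base_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to base_lr, then linear decay to zero, per the quoted setup."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Sanity check on the "approximately 40 epochs" claim:
# 128,000 tokens/batch * 1,000,000 steps / 3.3e9 words ≈ 38.8 epochs.
```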

Model architecture and number of layers

  • Original text

    img/2023-07-18_10-43-07_screenshot.png

SciBERT

Parameters

  1. Based on the BERT-base parameters (weights)
  2. Trained with the original BERT code

Lessons learned

  1. Training on long sentences is slow
  2. How SciBERT handles the long-sentence problem

    • First train on sentences of length <= 128
    • Once the train loss stops decreasing, continue training on sentences of length 128–512
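The two-stage length curriculum above can be sketched as a simple plateau rule. This is a toy illustration, not SciBERT's actual training code; `plateaued`, `curriculum`, and the loss stream are all hypothetical names:

```python
def plateaued(losses, window=3, tol=1e-3):
    """True once the train loss has stopped improving over the last `window` steps."""
    if len(losses) < window + 1:
        return False
    return losses[-window - 1] - losses[-1] < tol

def curriculum(loss_stream):
    """Stage 1 trains with max length 128 until the loss plateaus, then stage 2 uses 512."""
    stage, max_len, losses, schedule = 1, 128, [], []
    for loss in loss_stream:
        losses.append(loss)
        if stage == 1 and plateaued(losses):
            stage, max_len, losses = 2, 512, []
        schedule.append(max_len)
    return schedule
```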

Training

Machines

  • Original text

    We use a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512). The BASEVOCAB models take 2 fewer days of training because they aren’t trained from scratch.

    • a single TPU v3 with 8 cores

Training time

  • Training the SCIVOCAB models:

    • from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512).
  • The BASEVOCAB models

    • take 2 fewer days of training because they aren’t trained from scratch
    • i.e., training starting from the BERT-base weights took 5 days

Training data

  • pretrain

    • 1.14M papers from Semantic Scholar

      img/2023-07-18_11-07-45_screenshot.png

Training hyperparameters

  • finetune

    In all settings, we apply a dropout of 0.1 and optimize cross entropy loss using Adam (Kingma and Ba, 2015). We finetune for 2 to 5 epochs using a batch size of 32 and a learning rate of 5e-6, 1e-5, 2e-5, or 5e-5 with a slanted triangular schedule (Howard and Ruder, 2018) which is equivalent to the linear warmup followed by linear decay (Devlin et al., 2019). For each dataset and BERT variant, we pick the best learning rate and number of epochs on the development set and report the corresponding test results.

    We found the setting that works best across most datasets and models is 2 or 4 epochs and a learning rate of 2e-5. While task-dependent, optimal hyperparameters for each task are often the same across BERT variants.
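The selection procedure described above — a grid over learning rates and epoch counts, picked on the dev set — can be sketched as follows. `dev_score` is a hypothetical callback that finetunes one configuration and returns the dev-set metric:

```python
import itertools

LEARNING_RATES = [5e-6, 1e-5, 2e-5, 5e-5]  # grid quoted in the paper
EPOCH_COUNTS = [2, 3, 4, 5]

def pick_best(dev_score):
    """Return the (lr, epochs) pair with the highest dev-set score."""
    return max(itertools.product(LEARNING_RATES, EPOCH_COUNTS),
               key=lambda cfg: dev_score(*cfg))
```

Per the notes, this search most often lands on 2e-5 with 2 or 4 epochs.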

Training commands

# Run BERT training for sequences of length 128
python3 run_pretraining.py \
  --input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_128/*.tfrecord \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  --do_train=True --do_eval=True \
  --bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
  --train_batch_size=256 --max_seq_length=128 --max_predictions_per_seq=20 \
  --num_train_steps=500000 --num_warmup_steps=1000 --learning_rate=1e-4 \
  --use_tpu=True --tpu_name=node-3 --tpu_zone=us-central1-a \
  --max_eval_steps=2000 --eval_batch_size 256 \
  --init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128

# Run BERT training for sequences of length 512
python3 run_pretraining.py \
  --input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_512/*.tfrecord \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_512_finetune128 \
  --do_train=True --do_eval=True \
  --bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
  --train_batch_size=64 --max_seq_length=512 --max_predictions_per_seq=75 \
  --num_train_steps=800000 --num_warmup_steps=100 --learning_rate=1e-5 \
  --use_tpu=True --tpu_name=node-1 --tpu_zone=us-central1-a \
  --max_eval_steps=2000 --eval_batch_size 64 \
  --init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_512_finetune128

Vocabulary file

Training BERT with Hugging Face Transformers

Why multi-GPU training can be slower than single-GPU

Causes

  1. Multi-GPU training in Hugging Face Transformers involves data-parallel communication between GPUs

    • Under certain conditions this overhead becomes the bottleneck for training speed

Speedup methods

  1. Use DistributedDataParallel (DDP) instead of DataParallel (DP)
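For PyTorch-based training (including the Transformers Trainer), DP runs a single process that replicates the model and gathers outputs on GPU 0 every step, while DDP runs one process per GPU and overlaps gradient all-reduce with the backward pass. A sketch of the launch difference, assuming a hypothetical training script `train.py`:

```shell
# DataParallel (slower): a single process; the model is replicated to each
# GPU every step and outputs are gathered back on GPU 0.
python3 train.py

# DistributedDataParallel (faster): one process per GPU; gradients are
# all-reduced in buckets, overlapped with backward. Launch on 4 GPUs:
torchrun --nproc_per_node=4 train.py
```

When launched via torchrun, the Transformers Trainer detects the distributed environment and wraps the model in DDP automatically.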