Pretraining BERT from scratch on OpenWebText data on a single GPU using Docker
Hey everyone! Welcome to my blog. This post covers detailed instructions for building a Docker image to pretrain ELECTRA. I can hear you asking: why do I need to build a Docker image? Here it goes. Running ELECTRA requires TensorFlow 1.15.5 (an older version), which in turn requires CUDA 10. If you have a different CUDA version installed on your machine, such as CUDA 11, and you want to pretrain ELECTRA from scratch, Docker is the cleanest way to run the older TensorFlow 1.15.5.
Git clone ELECTRA
Let us first clone the ELECTRA code to our local machine with `git clone https://github.com/google-research/electra.git`, then run `cd electra` to get into that directory. To pretrain ELECTRA from scratch, we are going to use the OpenWebText corpus, as explained in the repo's "Quickstart: Pre-train a small ELECTRA model" section.
Creating data folder and downloading vocab.txt
Let us create a folder named `data` within the `electra` folder using `mkdir data`, change into it with `cd data`, and download the vocabulary file with `wget https://storage.googleapis.com/electra-data/vocab.txt`.
Downloading dataset with gdown
Next, let us download the OpenWebText dataset. It is possible to download the dataset inside the Docker container, but for simplicity we download it onto the local machine first. Install gdown with `pip install gdown`, then run:

```shell
gdown https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx
```

This downloads `openwebtext.tar.xz` into the `data/` folder.
Building Docker image using Dockerfile
This blog does not cover the installation of Docker; please refer to the documentation here for installation on Ubuntu. Once Docker is installed on your system, we need to create a `Dockerfile`. Create the file in the `electra` folder using `nano Dockerfile`, type in the following, and save it:

```dockerfile
FROM tensorflow/tensorflow:1.15.5-gpu-py3
RUN /usr/bin/python3 -m pip install -U pip
ADD requirements.txt /tmp/
RUN /usr/bin/python3 -m pip install -r /tmp/requirements.txt
ADD . /app
WORKDIR /app
# bash
CMD ["/bin/bash"]
```
We need TensorFlow 1.15.5 with GPU support and Python 3. The line `FROM tensorflow/tensorflow:1.15.5-gpu-py3` builds our Docker image on top of the official TensorFlow 1.15.5 GPU base image with Python 3.
`RUN /usr/bin/python3 -m pip install -U pip` upgrades pip. `ADD requirements.txt /tmp/` copies the `requirements.txt` file into the `/tmp/` folder.
`RUN /usr/bin/python3 -m pip install -r /tmp/requirements.txt` installs all the dependencies listed in `requirements.txt`. `ADD . /app` copies all the files in the current directory (in our case the `electra` folder) into the `/app` folder of the image, and `WORKDIR /app` sets `/app` as the working directory of the image.
Create a `.dockerignore` file and add the line `data/openwebtext/*.xz`. This ensures these large archive files are excluded from the build context and are not copied into the image.
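For reference, the resulting `.dockerignore` can be as small as a single pattern (this assumes the archive lives under the path given above):

```
# .dockerignore: keep the large OpenWebText archive out of the build context
data/openwebtext/*.xz
```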
Building the docker image
To build an image, make sure you are in the directory where the Dockerfile is present. Ideally, this folder contains all the data and files; in our case it is the `electra` folder. We will use the `docker build` command: the `-t` flag tags the new image with the name `electra-image`, and the trailing `.` indicates the current directory, which contains all the data, files, and the Dockerfile.

```shell
docker build -t electra-image .
```
After you have successfully built the image, you can run `docker images` to list all the images. You should see your image named `electra-image`.
Since we need to get the models out of the container, we are going to mount the `data` folder inside the `electra` folder as a volume (note: please give the absolute path to the `data` folder). The following command creates and starts a container from our previously built `electra-image`:

```shell
docker run --rm --gpus 0 -v /path/electra/data:/app/data -it electra-image
After running this command, you are dropped into a shell inside the container, as shown in the image below. You can see that the `/app` folder has all the files from the `electra` folder.
To preprocess the OpenWebText dataset for ELECTRA, go to the `/app` folder and run:

```shell
python3 build_openwebtext_pretraining_dataset.py --data-dir data/ --num-processes 32
```

This pre-processes/tokenizes the data and writes the examples as tfrecord files under `data/pretrain_tfrecords`. The tfrecords require roughly 30 GB of disk space. My machine has 32 cores available, hence `--num-processes 32`; please check the number of cores available on your machine and choose the value accordingly.
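A quick way to check the core count and pick a `--num-processes` value is a short Python one-off (this snippet is just a convenience, not part of the ELECTRA repo):

```python
import os

# Number of CPU cores visible to Python; a reasonable upper bound for the
# --num-processes flag of build_openwebtext_pretraining_dataset.py.
# os.cpu_count() can return None on exotic platforms, so fall back to 1.
cores = os.cpu_count() or 1
print(f"Use --num-processes {cores}")
```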
Pretraining the model
Run:

```shell
python3 run_pretraining.py --data-dir data/ --model-name electra_small_owt
```

For the small model, we used the hyperparameters set in `configure_pretraining.py`, except for `electra_objective`: since we are training a BERT model, we set `electra_objective` to `false`. The model and our config files get written to the `data/models` directory.
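One way to set `electra_objective` to false without editing `configure_pretraining.py` is to pass the override through the `--hparams` flag, which accepts a JSON string. A minimal sketch of assembling that command (the paths and model name mirror the command above):

```python
import json
import shlex

# Hyperparameter override passed via --hparams as a JSON string. Setting
# electra_objective to false trains with the plain masked-LM (BERT)
# objective instead of ELECTRA's replaced-token-detection objective.
overrides = {"electra_objective": False}

cmd = (
    "python3 run_pretraining.py --data-dir data/ "
    "--model-name electra_small_owt "
    f"--hparams {shlex.quote(json.dumps(overrides))}"
)
print(cmd)
```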
Finetuning BERT on GLUE data
Download the GLUE data from here and follow the steps given under "Finetune ELECTRA on a GLUE task". Make sure all the downloaded GLUE datasets are under `data/finetuning_data`. You will have to edit the `.dockerignore` file to exclude the models and the GLUE dataset; refer to the `.dockerignore` file in the GitHub repo. Since we have mounted the `data` folder as a volume, we will be able to access the model and the GLUE datasets inside the container. Build a new Docker image and run the container:

```shell
docker build -t electra-image-finetune .
docker run --rm --gpus 0 -v /path/electra/data:/app/data -it electra-image-finetune
```
To finetune on the CoLA dataset, run the command given below.

```shell
python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"model_size": "small", "task_names": ["cola"]}'
```

If you want to experiment with different learning rates, pass `"learning_rate": 3e-5` to `--hparams`.
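As a sketch, the learning-rate override is simply merged into the same JSON dictionary before it is passed to `--hparams` (the 3e-5 value is just the example rate mentioned above):

```python
import json

# Base hyperparameters from the finetuning command above.
hparams = {"model_size": "small", "task_names": ["cola"]}

# Merge in an experimental learning rate.
hparams["learning_rate"] = 3e-5

# This JSON string is what goes after --hparams on the command line.
print(json.dumps(hparams))
```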
In case you are looking to clone the repo with the `Dockerfile`, `.dockerignore`, and `requirements.txt`, you can find it here:
To delete your image, you can run `docker image rm electra-image -f`. This command forcefully deletes `electra-image`. It is good practice to delete unwanted images to free up disk space :) Please leave a comment or open an issue in the GitHub repo if you have any questions.