Pretraining BERT from scratch on openwebtext data on a single GPU using Docker.

Bhuvana Kundumani
Published in Analytics Vidhya
4 min read · Oct 5, 2021

Hey everyone! Welcome to my blog. This post gives detailed instructions for building a Docker image to pretrain with the ELECTRA codebase, which we will use here to train a BERT model. I can hear you asking: why do I need to build a Docker image? Here is the reason. ELECTRA requires TensorFlow 1.15.5, an older release that needs CUDA 10. If you have a different version of CUDA installed on your machine, such as CUDA 11, and you want to pretrain from scratch, Docker is the simplest way to run the older TensorFlow 1.15.5.

Git clone ELECTRA

Let us first clone the ELECTRA code to our local machine using the command git clone https://github.com/google-research/electra.git and then use the command cd electra to move into that directory. To pretrain from scratch, we are going to use the openwebtext corpus, as explained in Quickstart: Pre-train a small ELECTRA model.

Creating data folder and downloading vocab.txt

Let us create a folder named ‘data’ inside the electra folder using the command mkdir data. Change into that directory using the command cd data and download vocab.txt using the command wget https://storage.googleapis.com/electra-data/vocab.txt .
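If you want to make sure the download is not truncated or empty, a quick check like the one below can help. This is only a sketch: missing_special_tokens is a hypothetical helper, and it assumes vocab.txt is a BERT-style WordPiece vocabulary with one token per line that contains the usual special tokens.

```python
# Special tokens we assume a BERT-style WordPiece vocabulary contains.
SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def missing_special_tokens(vocab_path):
    """Return the special tokens that are NOT present in the vocab file."""
    with open(vocab_path, encoding="utf-8") as f:
        tokens = {line.rstrip("\n") for line in f}
    return [tok for tok in SPECIAL_TOKENS if tok not in tokens]
```

Calling missing_special_tokens("data/vocab.txt") from the electra folder should return an empty list if the file looks as assumed.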

Downloading dataset with gdown

Next, let us download the openwebtext dataset. It is possible to download the dataset within the Docker container; however, for simplicity, let us download it onto the local machine first. We install gdown with pip install gdown and then, from the data folder, run the command:

gdown https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx

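This downloads openwebtext.tar.xz into the data/ folder. As the Quickstart describes, extract it with tar xf openwebtext.tar.xz and make sure the extracted corpus sits at data/openwebtext.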
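Downloads from Google Drive sometimes save an HTML page instead of the real file, so it is worth checking that the archive really is an xz file before going further. The sketch below compares the first bytes of the file with the xz magic number; looks_like_xz is a hypothetical helper name, and it does not verify the whole archive.

```python
# The first six bytes of every .xz file: FD 37 7A 58 5A 00.
XZ_MAGIC = b"\xfd7zXZ\x00"


def looks_like_xz(path):
    """Return True if the file starts with the xz magic number."""
    with open(path, "rb") as f:
        return f.read(len(XZ_MAGIC)) == XZ_MAGIC
```

For example, looks_like_xz("data/openwebtext.tar.xz") should be True for a good download.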

Building Docker image using Dockerfile

This blog does not cover the installation of Docker. Please refer to the Docker documentation for installation on Ubuntu. Once Docker is installed on your system, we need to create a Dockerfile. Create the file in the electra folder using nano Dockerfile, type the following instructions into it and save it.

FROM tensorflow/tensorflow:1.15.5-gpu-py3
RUN /usr/bin/python3 -m pip install -U pip
ADD requirements.txt /tmp/
RUN /usr/bin/python3 -m pip install -r /tmp/requirements.txt
ADD . /app
WORKDIR /app
# bash
CMD ["/bin/bash"]

We need TensorFlow 1.15.5 with GPU support and Python 3. The instruction FROM tensorflow/tensorflow:1.15.5-gpu-py3 builds our Docker image on top of the TensorFlow 1.15.5 GPU base image with Python 3.

RUN /usr/bin/python3 -m pip install -U pip

The above command upgrades pip to the latest version. ADD requirements.txt /tmp/ copies the requirements.txt file into the /tmp folder of the image.

RUN /usr/bin/python3 -m pip install -r /tmp/requirements.txt

The above command installs all the dependencies listed in requirements.txt. ADD . /app copies all the files in the current directory (in our case the electra folder) to the /app folder in the Docker image. WORKDIR /app sets /app as the current working directory in the image.

Create a .dockerignore file in the electra folder and add the line data/openwebtext/*.xz to it. This ensures these files are excluded from the build context, so the large corpus archives are not copied into the Docker image.
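To get a feel for how much a .dockerignore entry saves, you can estimate the size of the build context with and without it. The sketch below is only a rough approximation: is_ignored and build_context_bytes are hypothetical helpers, and the matching is a simplified segment-by-segment glob, not Docker's real matcher (which also supports things like ** and ! exceptions).

```python
import fnmatch
import os


def is_ignored(rel_path, patterns):
    """Approximate .dockerignore matching for simple patterns.

    Pattern and path are compared segment by segment, so "*" never
    crosses a "/". Docker's own rules are richer than this.
    """
    path_parts = rel_path.split("/")
    for pattern in patterns:
        pattern_parts = pattern.strip("/").split("/")
        if len(pattern_parts) != len(path_parts):
            continue
        if all(fnmatch.fnmatchcase(part, pat)
               for part, pat in zip(path_parts, pattern_parts)):
            return True
    return False


def build_context_bytes(root, patterns):
    """Sum the sizes of all files under root that are not ignored."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if not is_ignored(rel, patterns):
                total += os.path.getsize(full)
    return total
```

Comparing build_context_bytes(".", []) with build_context_bytes(".", ["data/openwebtext/*.xz"]) from the electra folder gives an idea of how much data stays out of the image.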

Building the docker image

We will build the image first and then mount a volume when we run the container.

To build an image, make sure you are in the directory that contains the Dockerfile. Ideally, this folder contains all the data and files; in our case it is the electra folder. We use the docker build command: the -t flag tags the new image with the name electra-image, and the trailing . indicates the current directory, which holds all the data, files and the Dockerfile.

docker build -t electra-image .

After you have successfully built the image, you can run docker images to list all the images. You should see your image named electra-image.

Since we need to get the trained models out of the container, we are going to mount the data folder inside the electra folder as a volume. (Note: please give the absolute path to the data folder.) The following command creates and starts a container from our previously built electra-image.

docker run --rm --gpus 0 -v /path/electra/data:/app/data -it electra-image

After running this command, you are dropped into a shell inside the container. You can see that the /app folder has all the files from the electra folder.
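Since the volume path has to be absolute, a small Python helper can resolve it for you. This is only a sketch: docker_run_command is a hypothetical helper, and it simply reproduces the flags from the docker run command above with the host path filled in.

```python
import os
import shlex


def docker_run_command(data_dir, image="electra-image", gpus="0"):
    """Build the docker run command with an absolute host path for the volume."""
    host_path = os.path.abspath(data_dir)  # Docker needs an absolute path here
    args = [
        "docker", "run", "--rm", "--gpus", gpus,
        "-v", f"{host_path}:/app/data",
        "-it", image,
    ]
    # Quote each argument so the printed command is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Run from the electra folder, where the data folder lives.
    print(docker_run_command("data"))
```

Running it from the electra folder prints a command you can paste into the terminal.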

To preprocess the openwebtext dataset for ELECTRA, go to the /app folder and run the command python3 build_openwebtext_pretraining_dataset.py --data-dir data/ --num-processes 32. It preprocesses and tokenizes the data and writes the examples as tfrecord files under data/pretrain_tfrecords. The tfrecords require roughly 30G of disk space. Our machine has 32 cores, hence we use 32 processes. Please check the number of cores available on your machine and choose the number accordingly.
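If you are not sure how many cores your machine has, Python can report it. The sketch below builds the preprocessing command from that number; preprocessing_command is a hypothetical helper name, and since os.cpu_count() may return None, it falls back to a single process in that case.

```python
import os


def preprocessing_command(data_dir="data/", num_processes=None):
    """Build the preprocessing command, defaulting to one process per core."""
    if num_processes is None:
        num_processes = os.cpu_count() or 1  # cpu_count() can return None
    return (
        "python3 build_openwebtext_pretraining_dataset.py "
        f"--data-dir {data_dir} --num-processes {num_processes}"
    )


if __name__ == "__main__":
    print(preprocessing_command())
```

You may prefer to pass a smaller num_processes if the machine is shared or short on memory.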

Pretraining the model

Run python3 run_pretraining.py --data-dir data/ --model-name electra_small_owt to start pretraining. For the small model, we used the hyperparameters set in configure_pretraining.py, except for electra_objective. Since we are training a BERT model, we set electra_objective to false. The model and our config files are written to the data/models directory.
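One way to change electra_objective without editing configure_pretraining.py is to pass it as a JSON dict. The sketch below assumes that run_pretraining.py accepts a --hparams option in the same way run_finetuning.py does; pretraining_command is a hypothetical helper that only builds the command string.

```python
import json
import shlex


def pretraining_command(model_name="electra_small_owt", data_dir="data/", **hparams):
    """Build the pretraining command, adding --hparams only when overrides are given."""
    cmd = ["python3", "run_pretraining.py",
           "--data-dir", data_dir,
           "--model-name", model_name]
    if hparams:
        # json.dumps turns Python False into JSON false.
        cmd += ["--hparams", json.dumps(hparams)]
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(pretraining_command(electra_objective=False))
```

The printed command ends with --hparams '{"electra_objective": false}', quoted so that the shell passes the JSON through unchanged.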

Finetuning BERT on glue data

Download GLUE data from here. Follow the steps given under Finetune ELECTRA on a GLUE task. Make sure all the downloaded GLUE datasets are under data/finetuning_data. You will have to edit the .dockerignore file to exclude the models and GLUE dataset. Refer to the .dockerignore file in the github repo. Since we have mounted the data folder as a volume, we will be able to access the model and GLUE datasets in the Docker container. Build a Docker image and run the container with the commands below.

docker build -t electra-image-finetune .
docker run --rm --gpus 0 -v /path/electra/data:/app/data -it electra-image-finetune
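Once you are inside the container, it can help to confirm that the mounted data folder has the layout the finetuning script looks for. The sketch below assumes each task lives at finetuning_data/<task_name>/ with train.tsv and dev.tsv files inside the data folder; that layout and the helper name missing_glue_files are assumptions, so adjust them if your setup differs.

```python
import os


def missing_glue_files(data_dir, task_name, splits=("train", "dev")):
    """Return the expected .tsv files for a task that do not exist yet."""
    task_dir = os.path.join(data_dir, "finetuning_data", task_name)
    expected = [os.path.join(task_dir, split + ".tsv") for split in splits]
    return [path for path in expected if not os.path.isfile(path)]


if __name__ == "__main__":
    # Inside the container the mounted folder is /app/data.
    for path in missing_glue_files("data", "cola"):
        print("missing:", path)
```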

To finetune on the CoLA dataset, run the command given below, where $DATA_DIR points to the data folder (data/ in our setup).

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"model_size": "small", "task_names": ["cola"]}'

If you want to experiment with different learning rates, add an entry such as "learning_rate": 3e-5 to the JSON dict passed to --hparams.
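If you plan to try several learning rates, generating the --hparams value avoids hand-editing the JSON each time. This is a minimal sketch: hparams_with_learning_rate is a hypothetical helper, the base dict mirrors the CoLA command above, and the learning rates in the loop are only example values.

```python
import json

# Same hyperparameters as in the CoLA finetuning command.
BASE_HPARAMS = {"model_size": "small", "task_names": ["cola"]}


def hparams_with_learning_rate(learning_rate, base=BASE_HPARAMS):
    """Return the --hparams JSON string with learning_rate added."""
    hparams = dict(base)  # copy so the base dict is left untouched
    hparams["learning_rate"] = learning_rate
    return json.dumps(hparams)


if __name__ == "__main__":
    for lr in (1e-4, 3e-5):  # example values only
        print(f"--hparams '{hparams_with_learning_rate(lr)}'")
```

Each printed line can be appended to the run_finetuning.py command in place of the original --hparams argument.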

In case you would like to clone the repo with the Dockerfile, .dockerignore and requirements.txt, you can find it here:

To delete your image, you can run the command docker image rm electra-image -f. This command forcefully deletes electra-image. It is good practice to delete unwanted images to free up disk space :) Please leave a comment or open an issue in the github repo if you have any questions.
