TFRecords for offline data processing in TensorFlow#

Why TFRecords?#

TFRecords are a more efficient storage format: they take less space on disk and can be partitioned into multiple files. They can be read very fast using parallel I/O operations that TPUs or CS systems can take advantage of, and the format is self-contained. Essentially, by using binary files you make the data easier to distribute and better aligned for efficient reading. Another advantage is efficient storage of sequence data, such as word encodings or time series.

TFRecord#

One of the best ways to work with large datasets is to store the data in a binary file format. It takes less space on disk, takes less time to copy, and can be read much more efficiently from disk. Additionally, TFRecords integrate well with TensorFlow, which makes it much easier to combine datasets and to use the provided data import and preprocessing functionality. With TFRecords, only the data that is required at the time (i.e., a minibatch) is loaded from disk and then processed.

The main challenge is to convert your data to this format. A TFRecord file stores data as a sequence of binary strings, and there are two components for specifying the structure of the data: tf.train.Example and tf.train.SequenceExample.

Keep in mind these two are protocol buffers, which were developed by Google to serialize structured data in an efficient way.

You must store each sample of your data in one of these structures, then serialize it and use tf.io.TFRecordWriter to write it to disk.

Every feature must be stored in a list. The supported list types are tf.train.BytesList, tf.train.FloatList, and tf.train.Int64List. Keep in mind that Python strings must be converted to bytes (e.g., my_string.encode('utf-8')) before they are stored in a tf.train.BytesList.

Once each feature is converted into one of these three list types, wrap it in a tf.train.Feature so TensorFlow can understand it. Then use tf.train.Features to collect the named features into one collection: you pass in a dictionary of tf.train.Feature objects, and then you pass the tf.train.Features into a tf.train.Example.
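For example, a minimal sketch of this wrapping (the feature names and values here are hypothetical, chosen just for illustration):

import tensorflow as tf

feature = {
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    'text': tf.train.Feature(bytes_list=tf.train.BytesList(value=['hello'.encode('utf-8')])),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))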

TFRecordWriter is a Python class that accepts a file path and creates a writer object that works just like any other file object; it has write, flush, and close methods. The write method takes a string as a parameter and writes it to disk, but the data needs to be serialized first, for which we use the SerializeToString method.
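A minimal sketch of the write step, assuming the example built above and a hypothetical output path:

with tf.io.TFRecordWriter('data.tfrecords') as writer:
    writer.write(example.SerializeToString())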

It is clearly important that the type of a feature is the same across all samples in the dataset. This is a great Jupyter notebook showing how to write features to a TFRecord file.

Sometimes your data is a sequence and cannot be simply represented using tf.train.Example; instead, we use tf.train.SequenceExample, which does not store a list of bytes, floats, or int64s, but a list of lists of bytes, floats, or int64s. It has two attributes:

  • context of type tf.train.Features

  • feature_lists of type tf.train.FeatureLists

tf.train.FeatureList is different from tf.train.Features in two ways:

  1. All of the features in the list must have the same internal list type.

  2. tf.train.Features is a dictionary containing (unordered) named features; tf.train.FeatureList is a list containing ordered unnamed features.

A typical example of data stored in a tf.train.FeatureList would be a time series where each tf.train.Feature in the list is a time step of the sequence. This is a good example of such a FeatureList.
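As a hedged sketch, a toy time series could be stored in a tf.train.SequenceExample like this (the feature names and values are hypothetical, chosen just for illustration):

import tensorflow as tf

# Per-sequence metadata goes into `context`.
context = tf.train.Features(feature={
    'series_id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'sensor_1'])),
})

# Each tf.train.Feature in the FeatureList is one time step of the sequence.
values = [0.1, 0.2, 0.3]
feature_list = tf.train.FeatureList(feature=[
    tf.train.Feature(float_list=tf.train.FloatList(value=[v])) for v in values
])

sequence_example = tf.train.SequenceExample(
    context=context,
    feature_lists=tf.train.FeatureLists(feature_list={'values': feature_list}),
)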

Example using Cerebras Model Zoo BERT#

Now that you know how TFRecords work, we will go through an example on Cerebras systems. If you follow this README, you first download the dataset as a tar.xz archive and then extract it. When that is done, there is a directory with multiple subdirectories, each containing a collection of *.txt files of raw data (one document per .txt file).

Allocate subsets for training and validation#

In the next step, you will create two subsets of extracted .txt files, one for training and the second for validation. These two subsets are then used to create TFRecords that will be used for pre-training.

Create TFRecords#

create_tfrecords.py is a TensorFlow Record generator for BERT pre-training from raw text documents. The command-line syntax to run the Python utility create_tfrecords.py is:

python create_tfrecords.py --metadata_files /path/to/metadata_file.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file /path/to/vocab.txt --do_lower_case

where:

  • metadata_file.txt is a metadata file containing a list of paths to documents (see the sketch after this list), and

  • /path/to/vocab.txt contains a vocabulary file to map WordPieces to word IDs.
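For illustration only, a metadata file might simply list one document path per line; the exact paths, and whether they are relative to --input_files_prefix, depend on your dataset layout (these entries are hypothetical):

openwebtext/0/doc_001.txt
openwebtext/0/doc_002.txt
openwebtext/1/doc_003.txt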

This script takes care of the sentence segmentation needed for BERT pre-training using the NLP toolkit spaCy. Also, when you have a large input file, you should shard it instead of keeping the entire input file in memory.

TFRecords with image data#

  1. Use tf.python_io.TFRecordWriter to open the TFRecord file and start writing.

  2. Before writing into the TFRecord file, convert the image data and the label data into the proper data types (bytes, int64, float).

  3. Wrap the converted values in tf.train.Feature.

  4. Finally, create an Example protocol buffer using tf.train.Example and put the converted features into it. Serialize the Example using the SerializeToString method.

  5. Write the serialized Example.

Writing to TFRecord#

First, you should have some utility functions converting the values into features, such as:

For numeric values:

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

For string/char values:

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
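The snippets in this tutorial only use int64 and bytes features; if you also need floating-point values, an analogous helper (an assumption, not shown in the original snippets) would be:

def _float_feature(value):
    # Hypothetical helper, mirroring the two above.
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))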

Once you have your image and its label loaded like this:

from PIL import Image

img = Image.open(image)  # `image` is the path to the image file
label = 0

Then you simply do the following:

feature = {'label': _int64_feature(label),
           'image': _bytes_feature(img.tostring())}

Then you initiate the writer:

writer = tf.python_io.TFRecordWriter(tfrecord_filename)

Then you create an example protocol buffer as shown below:

example = tf.train.Example(features=tf.train.Features(feature=feature))

Then you write the serialized example:

writer.write(example.SerializeToString())

Then you close the writer:

writer.close()
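To read these records back, a minimal sketch using the tf.data API might look like the following (the feature names match the dictionary above; reshaping is left out because the raw bytes carry no shape information):

def _parse(serialized):
    features = {
        'label': tf.io.FixedLenFeature([], tf.int64),
        'image': tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    image = tf.io.decode_raw(parsed['image'], tf.uint8)  # flat uint8 buffer
    return image, parsed['label']

dataset = tf.data.TFRecordDataset(tfrecord_filename).map(_parse)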

Keep in mind that TensorFlow provides image format support for JPEG, PNG, and GIF in the computation graph.

If you don't want to use TFRecords, you can simply convert the image into TensorFlow tensors as follows (again, we reiterate that it is highly recommended to convert into TFRecords to get optimal performance on CS systems):

tensor = tf.placeholder(tf.uint8)
encode_jpeg = tf.image.encode_jpeg(tensor)
jpeg_bytes = session.run(encode_jpeg, feed_dict={tensor: image})  # `session` is an existing tf.Session

with open('freedom.jpg', 'wb') as f:
    f.write(jpeg_bytes)

The following is an example of writing PNG or JPEG images into a TFRecord Example.

my_example = tf.train.Example(features=tf.train.Features(feature={
    'png_bytes': tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes]))
}))

my_example_str = my_example.SerializeToString()
with tf.python_io.TFRecordWriter('my_example.tfrecords') as writer:
    writer.write(my_example_str)

reader = tf.python_io.tf_record_iterator('my_example.tfrecords')
those_examples = [tf.train.Example().FromString(example_str)
                  for example_str in reader]
same_example = those_examples[0]

same_png_bytes = same_example.features.feature['png_bytes'].bytes_list.value[0]

When same_png_bytes is decoded by tf.image.decode_image, or by tf.image.decode_png directly, you'll get back a tensor with the correct dimensions, because PNG (and JPEG) include that information in their encodings. Again, keep in mind that you have access to other image encodings and decodings, such as BMP and GIF.
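For instance, a short sketch of that decode step in the same session-based style as the snippets above (assuming `session` is an existing tf.Session):

png_tensor = tf.image.decode_png(same_png_bytes)  # dimensions come from the PNG header
png_array = session.run(png_tensor)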

If you want to save dense matrix representations in TFRecords, here is an example:

image_bytes = image.tostring()
image_shape = image.shape

my_example = tf.train.Example(features=tf.train.Features(feature={
  'image_bytes': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
  'image_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=image_shape))
}))

my_example_str = my_example.SerializeToString()
with tf.python_io.TFRecordWriter('my_example.tfrecords') as writer:
  writer.write(my_example_str)

reader = tf.python_io.tf_record_iterator('my_example.tfrecords')
those_examples = [tf.train.Example().FromString(example_str)
            for example_str in reader]
same_example = those_examples[0]

same_image_bytes = same_example.features.feature['image_bytes'].bytes_list.value[0]
same_image_shape = list(
  same_example.features.feature['image_shape'].int64_list.value)

With the information recovered from TFRecord form, it’s easy to use NumPy to put the image back together.

same_image = np.fromstring(same_image_bytes, dtype=np.uint8)
same_image.shape = same_image_shape

You can do the same using TensorFlow.

bytes_in = tf.placeholder(tf.string)  # placeholder for the raw image bytes
shape = tf.placeholder(tf.int32)
new_image = tf.reshape(tf.decode_raw(bytes_in, tf.uint8), shape)
same_image = session.run(new_image, feed_dict={bytes_in: same_image_bytes,
                                               shape: same_image_shape})

In this example, however, the parsing of the Example was already done outside the TensorFlow graph, so there isn’t a strong reason to stay inside the graph here.

One last note: you may be able to handle TIFF images through an experimental TensorFlow I/O function, tfio.experimental.image.decode_tiff.

An example would look like this:

import tensorflow as tf
import tensorflow_io as tfio
...
def parse_image(img_path: str) -> dict:
    image = tf.io.read_file(img_path)
    image = tfio.experimental.image.decode_tiff(image)  # decode_tiff lives in the tensorflow-io package, not tf.io
    return {'image': image}
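As a usage sketch (assuming a hypothetical directory of .tiff files), parse_image could then be mapped over a tf.data pipeline:

dataset = tf.data.Dataset.list_files('/path/to/images/*.tiff')
dataset = dataset.map(parse_image)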

Note

Many of the code snippets in this tutorial are taken from here: