.. _basics-of-tfr-records:

Basics of TFRecords
===================

Why TFRecords?
--------------

TFRecord files store data more efficiently: they take less space on disk and can be partitioned into multiple files. They can be read quickly using parallel I/O operations, which TPUs and CS systems can take advantage of. Each file is also self-contained. Because the format is binary, the data is easier to distribute and better aligned for efficient reading. TFRecords are also well suited to storing sequence data, such as word encodings and time series.

`TFRecord <https://www.tensorflow.org/tutorials/load_data/tfrecord>`_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the best ways to work with large datasets is to store the data in a binary file format. Binary data takes less space on disk, takes less time to copy, and can be read much more efficiently from disk. With TensorFlow, the TFRecord format also makes it easier to combine multiple datasets and integrates with the data import and preprocessing functionality that the library provides. With TFRecords, only the data that is required at the time (for example, a minibatch) is loaded from disk and then processed.

The main challenge is to convert your data to this format. A TFRecord file stores data as a sequence of binary strings and there are two components to specify the structure of data:

  - `tf.train.Example <https://www.tensorflow.org/api_docs/python/tf/train/Example>`_

  - `tf.train.SequenceExample <https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample>`_
  
  Keep in mind these two are `protocol buffers <https://en.wikipedia.org/wiki/Protocol_Buffers>`_, which were developed by Google to serialize structured data in an efficient way. 
  
You must store each sample of your data in one of these structures, serialize it, and then use `tf.io.TFRecordWriter <https://www.tensorflow.org/api_docs/python/tf/io/TFRecordWriter>`_ to write it to disk.

Every feature must be stored as a list of values. The supported list types are ``tf.train.BytesList``, ``tf.train.FloatList``, and ``tf.train.Int64List``. Keep in mind that Python strings must be converted to bytes (for example, ``my_string.encode('utf-8')``) before they are stored in a ``tf.train.BytesList``.

Once each feature has been converted into one of these three types, wrap it in a `tf.train.Feature <https://www.tensorflow.org/api_docs/python/tf/train/Feature>`_ so that TensorFlow can understand it. Then use `tf.train.Features <https://www.tensorflow.org/api_docs/python/tf/train/Features>`_ to collect the named features: pass in a dictionary that maps feature names to ``tf.train.Feature`` objects, and pass the resulting ``tf.train.Features`` into `tf.train.Example <https://www.tensorflow.org/api_docs/python/tf/train/Example>`_.
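
The wrapping steps described above can be sketched as follows. This is a minimal illustration, not a required layout: the feature names (``name``, ``scores``, ``label``) and their values are hypothetical and chosen only to show one feature of each list type.

  .. code-block:: python

       import tensorflow as tf

       # Hypothetical sample values, used only for illustration.
       name = "sample-0"        # Python str, must be encoded to bytes
       scores = [0.25, 0.75]    # floats
       label = 3                # int

       # Each value is placed in a typed list and wrapped in a tf.train.Feature.
       feature = {
           "name": tf.train.Feature(
               bytes_list=tf.train.BytesList(value=[name.encode("utf-8")])),
           "scores": tf.train.Feature(
               float_list=tf.train.FloatList(value=scores)),
           "label": tf.train.Feature(
               int64_list=tf.train.Int64List(value=[label])),
       }

       # The named features are collected in tf.train.Features and passed to Example.
       example = tf.train.Example(features=tf.train.Features(feature=feature))

Note that a single value such as ``label`` still has to be passed as a one-element list.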

``tf.io.TFRecordWriter`` is a Python class that accepts a file *path* and creates a writer object that behaves much like a file object, with *write*, *flush*, and *close* methods. The *write* method takes a byte string and writes it to disk as one record, so each ``tf.train.Example`` must first be serialized with its *SerializeToString* method.
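
As a rough sketch of the writer in use, the following writes three small examples and reads them back. It assumes TensorFlow 2.x with eager execution; the temporary file path and the single ``label`` feature are hypothetical choices made for this illustration.

  .. code-block:: python

       import os
       import tempfile

       import tensorflow as tf

       # Hypothetical output location used only for this sketch.
       path = os.path.join(tempfile.mkdtemp(), "samples.tfrecord")

       # One Example per sample; here each sample is a single integer label.
       with tf.io.TFRecordWriter(path) as writer:
           for label in [0, 1, 2]:
               example = tf.train.Example(features=tf.train.Features(feature={
                   "label": tf.train.Feature(
                       int64_list=tf.train.Int64List(value=[label])),
               }))
               writer.write(example.SerializeToString())

       # Read the raw records back and parse them into Example messages.
       labels = []
       for raw_record in tf.data.TFRecordDataset(path):
           parsed = tf.train.Example.FromString(raw_record.numpy())
           labels.append(parsed.features.feature["label"].int64_list.value[0])

Using the writer as a context manager closes the file automatically, so no explicit *close* call is needed in this form.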

The type of a feature must be the same across all samples in the dataset. `This <https://gist.github.com/tgamauf/5eb04f59becc045c88cba29fcd168d24#file-write_tfrecord_with_example-ipynb>`_ Jupyter notebook shows how to write features to a TFRecord file.

Sometimes your data is a sequence that cannot be represented simply with `tf.train.Example <https://www.tensorflow.org/api_docs/python/tf/train/Example>`_. In that case, use `tf.train.SequenceExample <https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample>`_, which stores not a list of bytes, floats, or int64s, but a list of lists of bytes, floats, or int64s. It has two attributes:

  - ``context`` of type ``tf.train.Features``

  - ``feature_lists`` of type ``tf.train.FeatureLists``

A ``tf.train.FeatureList`` differs from ``tf.train.Features`` in two ways:

  1. All of the features in the list must have the same internal list type.

  2. ``tf.train.Features`` is a dictionary containing (unordered) named features; ``tf.train.FeatureList`` is a list containing ordered unnamed features.

A typical example of data stored in a ``tf.train.FeatureList`` would be a time series where each ``tf.train.Feature`` in the list is a time step of the sequence. `This is a good example <https://gist.github.com/tgamauf/69ec66a1257c00ee6765841c4286c1dd#file-tf_featurelist-py>`_ of such a FeatureList. 
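
A minimal sketch of such a time series is shown below. The feature names ``series_id`` and ``readings`` and the sample values are hypothetical; only the ``tf.train`` message types come from the TensorFlow API.

  .. code-block:: python

       import tensorflow as tf

       # Hypothetical time series: three time steps with two float readings each.
       time_steps = [[0.5, 1.5], [2.5, 3.5], [4.5, 5.5]]

       # The context holds features that describe the whole sequence.
       context = tf.train.Features(feature={
           "series_id": tf.train.Feature(
               int64_list=tf.train.Int64List(value=[7])),
       })

       # Each time step becomes one unnamed Feature in an ordered FeatureList.
       readings = tf.train.FeatureList(feature=[
           tf.train.Feature(float_list=tf.train.FloatList(value=step))
           for step in time_steps
       ])

       sequence_example = tf.train.SequenceExample(
           context=context,
           feature_lists=tf.train.FeatureLists(feature_list={"readings": readings}),
       )
       serialized = sequence_example.SerializeToString()

The serialized bytes can be written with ``tf.io.TFRecordWriter`` in the same way as a serialized ``tf.train.Example``.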

Walk Through BERT dataset
-------------------------

Now that you know how TFRecords work, we will walk through an example on Cerebras systems. Follow `this README <https://github.com/Cerebras/modelzoo/transformers/tf/bert>`_ to download the dataset as a ``tar.xz`` archive and extract it. The extracted directory contains multiple subdirectories, each holding a collection of ``*.txt`` files of raw data (one document per ``.txt`` file).

Allocate subsets for training and validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the next step, you will create two subsets of extracted ``.txt`` files, one for training and the second for validation. These two subsets are then used to create TFRecords that will be used for pre-training.
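
One possible way to make this split is sketched below. The helper names ``split_documents`` and ``write_metadata``, the 10% validation fraction, and the one-path-per-line metadata layout are all assumptions made for illustration; check the README for the exact format that ``create_tfrecords.py`` expects.

  .. code-block:: python

       import random
       from pathlib import Path


       def split_documents(data_dir, val_fraction=0.1, seed=0):
           """Return (train_paths, val_paths) for all .txt files under data_dir."""
           paths = sorted(str(p) for p in Path(data_dir).rglob("*.txt"))
           # A seeded shuffle keeps the split reproducible between runs.
           random.Random(seed).shuffle(paths)
           num_val = int(len(paths) * val_fraction)
           return paths[num_val:], paths[:num_val]


       def write_metadata(paths, metadata_path):
           """Write one document path per line (assumed metadata layout)."""
           Path(metadata_path).write_text("\n".join(paths) + "\n")

A typical call would be ``train, val = split_documents("/path/to/raw/data/openwebtext")``, followed by one ``write_metadata`` call per subset.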

Create TFRecords
~~~~~~~~~~~~~~~~

`create_tfrecords.py <https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/tf/bert/input/scripts/create_tfrecords.py>`_ is a TFRecord generator for BERT pretraining from raw text documents. The command-line syntax to run this Python utility is:

  .. code-block:: bash

       python create_tfrecords.py --metadata_files /path/to/metadata_file.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file /path/to/vocab.txt --do_lower_case
       
where:

  - ``metadata_file.txt`` is a metadata file containing a list of paths to documents, and

  - ``/path/to/vocab.txt`` is a vocabulary file that maps WordPieces to word IDs.

This script takes care of the sentence segmentation needed for BERT pretraining using the NLP toolkit ``spaCy``. When the input is large, shard it into multiple files instead of keeping the entire input in memory.
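
The general idea of sharding can be sketched with a small generator that streams the input in fixed-size pieces. This is only an illustration of the technique, not how ``create_tfrecords.py`` shards its input; the helper name ``iter_shards`` and the shard size are hypothetical.

  .. code-block:: python

       from itertools import islice


       def iter_shards(items, shard_size):
           """Yield lists of at most shard_size items, one shard at a time."""
           iterator = iter(items)
           while True:
               # Only one shard is held in memory at any moment.
               shard = list(islice(iterator, shard_size))
               if not shard:
                   return
               yield shard

Because an open file object is itself an iterator over lines, passing it to ``iter_shards`` processes a large text file without loading it completely.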

TFRecords with Image data
-------------------------

Writing image data to a TFRecord file involves the following steps:

  1. Use ``tf.io.TFRecordWriter`` (``tf.python_io.TFRecordWriter`` in TensorFlow 1.x) to open the TFRecord file and start writing.

  2. Before writing to the TFRecord file, convert the image data and label data into the proper data types (``bytes``, ``int``, ``float``).

  3. Wrap the converted values in ``tf.train.Feature`` objects.

  4. Create an ``Example`` protocol buffer with ``tf.train.Example`` from the converted features, and serialize it with its ``SerializeToString()`` method.

  5. Write the serialized ``Example``.
  
Writing to TFRecord
~~~~~~~~~~~~~~~~~~~

First, you should have some utility functions converting the values into features, such as:

For numeric values:

  .. code-block:: python

       import tensorflow as tf

       def _int64_feature(value):
           return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
       
For string/char values:

  .. code-block:: python

       def _bytes_feature(value):
           return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
       
Once you have your image and its label loaded like this:

  .. code-block:: python

       from PIL import Image  # assumes the Pillow imaging library

       img = Image.open(image)  # image is the path to the image file
       label = 0
       
Then build the feature dictionary:

  .. code-block:: python

       # tobytes() replaces the tostring() method removed from recent Pillow releases.
       feature = {'label': _int64_feature(label),
                  'image': _bytes_feature(img.tobytes())}
       
Then create the writer:

  .. code-block:: python

       writer = tf.io.TFRecordWriter(tfrecord_filename)

Then you create an example protocol buffer as shown below:

  .. code-block:: python

       example = tf.train.Example(features=tf.train.Features(feature=feature))

Then you write the serialized example: 

  .. code-block:: python

       writer.write(example.SerializeToString())

Then close the writer:

  .. code-block:: python

       writer.close()

Keep in mind that TensorFlow provides image format support for JPEG, PNG, and GIF in the computation graph.

If you do not want to use TFRecords, you can instead encode an image array with TensorFlow and write it to a regular image file, as in the following example, which assumes TensorFlow 2.x eager execution. (We reiterate that converting your data into TFRecords is highly recommended to get optimal performance on CS systems.)

  .. code-block:: python

       # image is assumed to be a uint8 NumPy array of shape (height, width, channels).
       jpeg_bytes = tf.image.encode_jpeg(image).numpy()
       with open('freedom.jpg', 'wb') as f:
           f.write(jpeg_bytes)
       
The following example stores an encoded PNG (or JPEG) image in a TFRecord ``Example`` and reads it back:

  .. code-block:: python

       # Wrap the encoded PNG bytes in an Example.
       my_example = tf.train.Example(features=tf.train.Features(feature={
           'png_bytes': tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes]))
       }))

       # Serialize the Example and write it to a TFRecord file.
       my_example_str = my_example.SerializeToString()
       with tf.io.TFRecordWriter('my_example.tfrecords') as writer:
           writer.write(my_example_str)

       # Read the records back and parse each one into an Example.
       reader = tf.data.TFRecordDataset('my_example.tfrecords')
       those_examples = [tf.train.Example.FromString(example_str.numpy())
                         for example_str in reader]
       same_example = those_examples[0]

       same_png_bytes = same_example.features.feature['png_bytes'].bytes_list.value[0]
          
When ``same_png_bytes`` is decoded with ``tf.image.decode_image``, or with ``tf.image.decode_png`` directly, you get back a tensor with the correct dimensions, because PNG (and JPEG) include that information in their encodings. TensorFlow also provides decoders for other image formats, such as ``bmp`` and ``gif``.
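
The decoding step can be sketched as follows, assuming TensorFlow 2.x with eager execution. The 2x3 all-zero image is a hypothetical stand-in that keeps the example self-contained; ``tf.io.parse_single_example`` is used here as one way to read the feature back inside TensorFlow.

  .. code-block:: python

       import tensorflow as tf

       # Hypothetical 2x3 RGB image, encoded to PNG so the sketch is self-contained.
       pixels = tf.zeros([2, 3, 3], dtype=tf.uint8)
       png_bytes = tf.io.encode_png(pixels).numpy()

       serialized = tf.train.Example(features=tf.train.Features(feature={
           "png_bytes": tf.train.Feature(
               bytes_list=tf.train.BytesList(value=[png_bytes])),
       })).SerializeToString()

       # Parse the serialized Example, then decode the PNG bytes into a tensor.
       parsed = tf.io.parse_single_example(
           serialized, {"png_bytes": tf.io.FixedLenFeature([], tf.string)})
       decoded = tf.io.decode_png(parsed["png_bytes"])

No shape has to be stored alongside the bytes, because the decoder recovers the height, width, and channel count from the PNG header.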

If you want to save dense matrix representations in TFRecords, store the raw bytes together with the array shape, as in the following example:

  .. code-block:: python

       # image is assumed to be a uint8 NumPy array.
       image_bytes = image.tobytes()
       image_shape = image.shape

       my_example = tf.train.Example(features=tf.train.Features(feature={
           'image_bytes': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
           'image_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=image_shape))
       }))

       my_example_str = my_example.SerializeToString()
       with tf.io.TFRecordWriter('my_example.tfrecords') as writer:
           writer.write(my_example_str)

       # Read the records back and parse each one into an Example.
       reader = tf.data.TFRecordDataset('my_example.tfrecords')
       those_examples = [tf.train.Example.FromString(example_str.numpy())
                         for example_str in reader]
       same_example = those_examples[0]

       same_image_bytes = same_example.features.feature['image_bytes'].bytes_list.value[0]
       same_image_shape = list(
           same_example.features.feature['image_shape'].int64_list.value)
        
With the information recovered from TFRecord form, it's easy to use NumPy to put the image back together.
 
  .. code-block:: python

       import numpy as np

       # np.frombuffer replaces the deprecated np.fromstring for binary data.
       same_image = np.frombuffer(same_image_bytes, dtype=np.uint8).reshape(same_image_shape)
       
You can do the same using TensorFlow.
 
  .. code-block:: python

       # Decode the raw bytes into a flat uint8 tensor, then restore the shape.
       flat_image = tf.io.decode_raw(same_image_bytes, tf.uint8)
       same_image = tf.reshape(flat_image, same_image_shape)
                                                        
In this example, however, the parsing of the ``Example`` was already done outside the TensorFlow graph, so there isn't a strong reason to stay inside the graph here.
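
For comparison, the sketch below keeps both the parsing and the reshaping inside a ``tf.data`` pipeline, which is where staying in the graph tends to pay off. It assumes TensorFlow 2.x; the 4x5 test image, the temporary file path, and the helper name ``parse_record`` are hypothetical.

  .. code-block:: python

       import os
       import tempfile

       import numpy as np
       import tensorflow as tf

       # Hypothetical 4x5 RGB image stored as raw bytes plus its shape.
       image = np.arange(60, dtype=np.uint8).reshape(4, 5, 3)
       path = os.path.join(tempfile.mkdtemp(), "dense.tfrecords")
       with tf.io.TFRecordWriter(path) as writer:
           example = tf.train.Example(features=tf.train.Features(feature={
               "image_bytes": tf.train.Feature(
                   bytes_list=tf.train.BytesList(value=[image.tobytes()])),
               "image_shape": tf.train.Feature(
                   int64_list=tf.train.Int64List(value=image.shape)),
           }))
           writer.write(example.SerializeToString())


       def parse_record(serialized):
           """Parse one serialized Example and rebuild the image tensor."""
           parsed = tf.io.parse_single_example(serialized, {
               "image_bytes": tf.io.FixedLenFeature([], tf.string),
               "image_shape": tf.io.FixedLenFeature([3], tf.int64),
           })
           flat = tf.io.decode_raw(parsed["image_bytes"], tf.uint8)
           return tf.reshape(flat, parsed["image_shape"])


       dataset = tf.data.TFRecordDataset(path).map(parse_record)
       restored = next(iter(dataset)).numpy()

The ``FixedLenFeature([3], tf.int64)`` entry assumes every image has exactly three dimensions; data with a varying number of dimensions would need a different feature specification.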

One last note: you may be able to decode ``tiff`` images through `this experimental function <https://www.tensorflow.org/io/api_docs/python/tfio/experimental/image/decode_tiff>`_ from the TensorFlow I/O (``tensorflow-io``) package.

An example would look like:

  .. code-block:: python

       import tensorflow as tf
       import tensorflow_io as tfio  # provided by the separate tensorflow-io package
       ...
       def parse_image(img_path: str) -> dict:
           ...
           image = tf.io.read_file(img_path)
           tfio.experimental.image.decode_tiff(image)
       
Many of the code snippets in this tutorial are taken from here:

  - `How to create tfrecord from image ndarray  dataset <https://kanoki.org/2022/06/02/tensorflow-create-tfrecord-from-image_ndarray_dataset/>`_

  - `TFRecords Basics <https://www.kaggle.com/code/ryanholbrook/tfrecords-basics/notebook>`_

  - `Images and TFRecords <https://planspace.org/20170403-images_and_tfrecords/>`_