Data Echoing
Discussing an approach claimed to speed up the training process for large data pipelines with upstream bottlenecks.
If you were away from Twitter, make sure to check the threads since the last issue at the bottom.
Thanks for coming here!
The Problem
Imagine creating a data pipeline for a machine learning solution. It requires streamlining a number of processes, from reading and augmenting the data to batching and training. The entire process can take a long time to run.
How can it be reduced? By using accelerators.
Use GPUs or TPUs and the pipeline speeds up considerably. But what if that's still not satisfactory? Can we keep upgrading the hardware and expect the speed to keep improving? No. The accelerators are only effective for part of the process, not the whole.
We create batches of data for the training process; each batch is used by the optimizer to take one step in the right direction. Then, for the next step, a fresh batch needs to be created, maybe from the hard drive or a cloud bucket.
But what if the data is not on the hard drive or in a convenient bucket? What if it's terabytes of compressed data?
The data loading and reading process would suck!
The accelerator would sit idle, so it wouldn't matter whether it's a $50 GPU or a million-dollar TPU if it's only speeding up the update step. Even prefetching the next batch before it's requested isn't very effective if the bottleneck is data loading and reading.
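(In tf.data terms, that prefetching remedy is just the one-liner below. It overlaps fetching the next batch with training on the current one, but it can't hide a fetch that is consistently slower than the train step. Here dataset stands for any batched tf.data.Dataset you already have.)

# Standard tf.data prefetching: prepare the next batch while the current one trains.
# AUTOTUNE lets the runtime pick the buffer size.
dataset = dataset.prefetch(tf.data.AUTOTUNE)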
The Alternative
People at Google Research came up with something called Data Echoing to combat this problem.
In a nutshell,
Data Echoing is repeating the batch of data that has already been fetched while the new batch is being fetched.
Now, before any of you revolt, I know the idea seems somewhat counterintuitive at first glance. You may ask: how would repeating the data help?
First of all, we can agree that, at the very least, it won't do any harm, yes?
Yes! Alright.
So the helpfulness of the repeated data depends on where in the pipeline it's being repeated.
Before we discuss that, let's agree on one more point: whether it helps or not, it's certainly better to try it than to let the accelerator sit idle, yes?
Yes! Great.
Consider the generic form of the pipeline described above (read the data, augment it, batch it, train on it), which applies to many machine learning tasks.
The data can be echoed at two points:
Before or after batching
Before or after augmentation
Echoing before batching means individual examples are repeated and shuffled, so it's example echoing, and a single batch can contain repeats of the same example. Echoing after batching means batch echoing: the exact batch is repeated, so no batch contains duplicate examples, but the same batch is trained on more than once.
Echoing before augmentation means the echoed examples are shuffled and randomly augmented again, so even repeated examples come out slightly varied, more like fresh examples. Echoing after augmentation can be less effective, and its value also depends on how much the data is shuffled.
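If the distinction is easier to see in code, here is a minimal plain-Python sketch of the two flavors (the generator names are mine, purely for illustration):

def example_echo(examples, e):
    # Example echoing: emit each individual example e times. After a shuffle,
    # a single batch can end up containing repeats of the same example.
    for x in examples:
        for _ in range(e):
            yield x

def batch_echo(batches, e):
    # Batch echoing: emit each already-formed batch e times. No batch contains
    # duplicates, but the exact same batch is trained on e times.
    for b in batches:
        for _ in range(e):
            yield b

print(list(example_echo([1, 2, 3], e=2)))        # [1, 1, 2, 2, 3, 3]
print(list(batch_echo([[1, 2], [3, 4]], e=2)))   # [[1, 2], [1, 2], [3, 4], [3, 4]]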
So, suppose that loading a batch of data from a cloud resource (the upstream part of our pipeline) takes twice as much time per epoch as the rest of the process (the downstream operations). The accelerator would then be idle for 50% of the total time. Echoing with a factor of 2 (repeating each example twice) can help the model converge a little faster than the baseline.
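As a quick sanity check on that arithmetic, here are the utilization numbers with made-up timings, assuming the fetch and the train step overlap perfectly:

t_read = 2.0    # time to fetch and decode one batch (the upstream work)
t_train = 1.0   # time to augment and take one optimizer step (the downstream work)

for e in (1, 2):                          # e = 1 is no echoing, e = 2 repeats each batch twice
    cycle = max(t_read, e * t_train)      # upstream and downstream run in parallel
    idle = 1 - (e * t_train) / cycle      # fraction of the cycle the accelerator waits
    print(f"echo factor {e}: accelerator idle {idle:.0%}")
# echo factor 1: accelerator idle 50%
# echo factor 2: accelerator idle 0%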
Now, take a guess: which of the approaches discussed above gives the most fruitful results?
Echoing before augmentation, of course. The random data augmentation is where you get real value for the computation spent: every repeat comes out looking slightly different.
The researchers did not measure data echoing by wall-clock time. Why? Because wall-clock time varies with individual hardware, frameworks, and a gazillion other factors. What they used instead was a proportional proxy metric.
They set a target accuracy to reach, say 91%, and measured how many fresh examples each approach needed to get to that target performance. Clever, eh?
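Here's a toy sketch of what that metric means (the numbers and the linear "learning" are completely made up; the only point is that a fresh example is counted once even if it's trained on e times):

def fresh_examples_to_target(gain_per_example, e, target=0.91):
    # Count fresh examples consumed until the target accuracy is reached,
    # assuming repeats are exactly as useful as fresh data (the best case).
    accuracy, fresh = 0.0, 0
    while accuracy < target:
        fresh += 1                        # one fresh example fetched from storage
        accuracy += e * gain_per_example  # trained on it e times
    return fresh

print(fresh_examples_to_target(gain_per_example=1e-5, e=1))  # baseline
print(fresh_examples_to_target(gain_per_example=1e-5, e=2))  # echoing needs about half as many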
This was tested on several tasks: NLP with Transformers, computer vision on datasets like ImageNet and CIFAR-10, SSD object detection on the COCO dataset, and so on. Let's look at the performance chart for one of them.
The dashed line indicates the expected values if repeated examples were just as useful as fresh ones. Notice how echoing before augmentation comes darn close to that! Echoing after augmentation is slightly less valuable, and batch echoing is the least valuable of the three.
But all of them performed better than the baseline. This goes to show that echoing is definitely not an arrow in the void.
There are more experiments in the paper, such as the effect of different echo factor values and the role different batch sizes play. The technique remained effective across a range of echo factors, and larger batch sizes gave better results than smaller ones.
There are a lot of nitty-gritty details in this technique that I am sparing you to keep this from getting boring.
If you're a TensorFlow person, you'll be glad to know that this echoing technique can be applied by adding a single statement wherever you want it in the pipeline (here e is the echo factor):
dataset = dataset.flat_map(lambda t: tf.data.Dataset.from_tensors(t).repeat(e))  # repeat each element e times
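For context, here is a rough sketch of where that kind of echo step can sit in a full input pipeline for echoing before augmentation (the toy reader, the augmentation, and the numbers are my own placeholders, not from the paper; with (image, label) elements the lambda takes two arguments):

import tensorflow as tf

e = 2  # echo factor: how many times each element is repeated

def toy_read(i):
    # Stand-in for the expensive read/decode step (the upstream bottleneck).
    return tf.random.uniform([32, 32, 3]), tf.cast(i % 10, tf.int64)

def augment(x, y):
    # Random augmentation, so each repeat of an example comes out slightly different.
    return tf.image.random_flip_left_right(x), y

dataset = (
    tf.data.Dataset.range(1000)
    .map(toy_read, num_parallel_calls=tf.data.AUTOTUNE)
    # Echoing before augmentation: repeat each decoded example e times.
    .flat_map(lambda x, y: tf.data.Dataset.from_tensors((x, y)).repeat(e))
    .shuffle(1024)  # intersperse repeats with other examples
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

Moving that flat_map to after .batch(32) would give batch echoing instead.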
Takeaways
Data echoing repeats an already-fetched data batch while the next one is being fetched.
It does work and is great for hardware utilization.
If reading data is the bottleneck, echoing before augmentation works best.
It works for vision tasks, language modeling, and object detection.
Relatively simple to implement.
---
Threads Of The Fortnight



