TensorFlow Queues and Validation

Thursday 22 March 2018

I am currently working with a dataset that is far too large to store in memory so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

batch, labels = read_and_decode_single_example([train_path])
X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000)
X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1])
y = tf.placeholder_with_default(y_def, shape=[None])

I have a function that reads that data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using shuffle batch. Finally I create X and y as placeholders with default, with the shuffled batches as the defaults.

Then when I am training I don't pass the feed dict, and it defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the data in via a feed_dict and it uses that.

This is not a great solution, it is kind of ugly, and it does require loading the validation data into memory, but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python, data_science, machine_learning, tensorflow
No comments

Google CoLaboratory File Persistence

Friday 23 February 2018

It took me a while to figure out exactly what was going on with the files I was uploading and creating using Google's CoLaboratory. Each user has a VM where their notebooks run and the VM only runs for 12 hours before it is spun down and recycled, taking with it any files you may have downloaded or created. The second day I used it I was surpised that the files I had spent time downloading, unzipping and importing were no longer there, and I had deleted the code to do that, so if you are using CoLab make sure you keep the code to get your data files!

I also tried to have two notebooks running at the same time thinking it would speed up some work I was doing, but it seems as if all of a user's notebooks run in the same VM, so there really is no advantage to having multiple notebooks running.

There is an instruction notebook that explains how to save files to Google Drive, which works very well and is easy to use. To do that run:

from google.colab import auth
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build

auth.authenticate_user()

Then you have to enter a code to authenticate yourself. Then I use this function to save files:

drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
  file_metadata = {
    'name': name,
    'mimeType': 'application/octet-stream'
  }
  
  media = MediaFileUpload(path, 
                        mimetype='application/octet-stream',
                        resumable=True)
  
  created = drive_service.files().create(body=file_metadata,
                                       media_body=media,
                                       fields='id').execute()

  print('File ID: {}'.format(created.get('id')))
  return created

The function takes two arguments, the name of the file and the path to it, and write the file to the root of your Google drive.

Note - This post was updated because my original guess as to how the VMs work was completely wrong. The VM instance exists for 12 hours, they are not tied to the runtime.

Labels: coding, machine_learning, tensorflow, google
No comments

Google CoLab

Monday 19 February 2018

On my laptop it takes forever to train my TensorFlow models. I was looking for cheap online services where I could run the code and not having any luck finding anything, Google Cloud Computing does give you $300 worth of free processing time, but that's not really free. I did find Google Colab which is a Python notebook based environment where you can run code for free, and it includes GPU support!

It took me a little while to get everything set up, but it was relatively easy and it runs incredibly fast. The tricky part was getting my data into the notebook. While Colab saves the notebooks to your Google Drive, they do not run on your Google Drive so you can't just put the data on the Drive and then access it.

I used wget to download the data from a URL to wherever the notebook is running, then unzipped it with Python and then I was able to read the data, so it wasn't all that complicated. When I tried to follow the instructions on importing data from Google Drive via an API I was unable to get it to work - I kept getting errors about directories and files not existing despite the fact that they showed up when I did !ls.

They have Tesla K80 GPUs available and the code runs incredibly fast. I'm still training my first model, but it seems like it's going to finish in about 20 minutes whereas it would have taken 3+ hours to train it locally. This difference in speed makes it possible to do things like tune the learning rate and hyperparameters, which are not practical to do locally if it takes hours to train the model.

This is an amazing service from Google and I am already using it heavily, just hours after having discovered it.

Labels: coding, python, machine_learning, google
No comments

Update on TensorFlow GPU Windows Errors

Friday 16 February 2018

After playing with TensorFlow GPU on Windows for a few days I have more information on the errors. I am running TensorFlow 1.6, currently the latest version, with Python 3.6 and Nvidia CUDA 9.0 on an Nvidia GE Force GT 750M.

When the Python Windows process crashes with an error that says CUDA_ERROR_LAUNCH_FAILED, the problem can be solved by reducing the fraction of the GPU memory available with:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7

If the Python script fails with an error about exhausted resources or being unable to allocate enough memory, then you need to use a smaller batch size. This problem does not crash the Python process, Python throws an Exception but does not crash.

Once I figured these out, I have had no problems running models on the GPU at all.

Labels: python, machine_learning, tensorflow
No comments

Batch Normalization with TensorFlow

Tuesday 13 February 2018

I was trying to use batch normalization in order to improve the accuracy of my CIFAR classifier with tf.layers.batch_normalization, and it seemed to have little to no effect. According to this StackOverflow post you need to do something extra, which is not mentioned in the documentation, in order to get the batch normalization to work.

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
sess.run([train_op, extra_update_ops], ...)

The batch norm update operations are added to UPDATE_OPS collection, so you need to create that operation and then feed it into the session along with the training op. Before I had added the extra_update_ops the batch normalization was definitely not running, now it is, whether it helps or not remains to be seen.

Also make sure to use a training=[BOOLEAN | TENSOR] in the call to batch_normalization() to prevent it from being applied during evaluation. I use a placeholder and pass whether it is training or not in via the feed_dict:

training = tf.placeholder(dtype=tf.bool)

And then use this in my batch norm and dropout layers:

training=training

There were a few other things I had to do to get batch normalization to work properly:

  1. I had been using local response normalization, which apparently doesn't help that much. I removed those layers and replaced them with batch normalization layers.
  2. Remove the activation from the conv2d layers. I run the output through the batch normalize layers and then apply the relu.

Before I made these changes the model with the batch normalization didn't seem to be training at all, the accuracy was just going up and down right around the baseline of .10. After these changes it seems to be training properly now.

Labels: data_science, machine_learning, tensor_flow
No comments

Archives