CoLab Pro

Thursday 12 November 2020

I have been using CoLab for quite a few years now and have always appreciated getting access to GPUs (and TPUs) for free. So when I recently found out about CoLab Pro, I was reluctant to pay $10 a month for something I had been getting for free. At the same time, however, I was paying hundreds of dollars a month for cloud GPU instances. Last week, after going well over my AWS budget for the previous month, I decided to try CoLab Pro, and I am very glad I did.

CoLab Pro gives you priority on high-end GPUs; so far I have always gotten a V100. This is the same GPU I was paying a $0.90/hour spot (preemptible) rate for on AWS. For me, the main disadvantage of free CoLab was that each instance usually lasted about 10 hours before shutting down, and it would time out if left unattended. CoLab Pro instances last up to 24 hours, and they do not time out when left unattended. I had one running at work the other day, and by the time I got home I figured it had timed out, but when I went back the next morning it was still running!
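To put the two prices side by side, here is a rough break-even sketch. It uses only the figures quoted above ($10 a month and $0.90/hour) and assumes that a CoLab Pro session really does get a V100, which has been my experience but is not guaranteed. The function and constant names are just for illustration.

```python
# Rough break-even between a flat monthly fee and hourly spot pricing.
# Prices are the ones quoted in this post; actual AWS spot prices fluctuate.
COLAB_PRO_USD_PER_MONTH = 10.00
AWS_V100_SPOT_USD_PER_HOUR = 0.90


def break_even_hours(monthly_fee, hourly_rate):
    """Hours of GPU time per month at which the flat fee equals the hourly bill."""
    return monthly_fee / hourly_rate


hours = break_even_hours(COLAB_PRO_USD_PER_MONTH, AWS_V100_SPOT_USD_PER_HOUR)
print(f"Break-even at about {hours:.1f} V100 hours per month")
```

In other words, under these assumptions the subscription pays for itself after roughly half a day of V100 time per month.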

Obviously, CoLab Pro is better suited to running experiments than to long training runs, and it doesn't support multiple GPUs. If you are using TensorFlow, you also have TPUs available (I prefer PyTorch). In the past I have repeatedly kicked myself after spending hundreds of dollars training a model and then finding a small mistake. In the future I will run my experiments on CoLab Pro and only use VMs when I am sure everything is correct and I need to train models quickly.


K80 vs V100

Monday 16 September 2019

Discovering how much cheaper EC2 spot instances were than regular on-demand instances gave me the courage to try out a faster GPU. I had been using K80s, which are painfully slow but very cheap. The spot price for the V100 is about the same as the on-demand price for the K80, so running V100s on spot instances won't be any cheaper than what I was paying, but it won't be more expensive either.

I didn't think the V100s were such great GPUs, so I wasn't expecting them to be worth the extra cost. How wrong I was. Training the network I am currently playing with on a K80 with a batch size of 48 took about 8-12 hours per epoch. Training it on a V100 with a batch size of 64 looks like it will take about 2 hours per epoch. With the V100s priced at about 4x the K80s, that works out to somewhere between the same cost per epoch and a bit cheaper, depending on exactly how long an epoch took on the K80.
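The arithmetic behind that comparison can be sketched in a few lines. The numbers are the rough ones from this post (2 hours per epoch on the V100, 8-12 on the K80, a 4x price ratio), not careful benchmarks, and costs are expressed in "K80 instance-hours" rather than dollars so no actual prices are assumed.

```python
# Relative cost per epoch, measured in K80 instance-hours.
# All figures are the approximate ones quoted above.
V100_PRICE_RATIO = 4.0            # a V100 hour costs about 4x a K80 hour
V100_HOURS_PER_EPOCH = 2.0
K80_HOURS_PER_EPOCH = (8.0, 12.0)  # observed range on the K80


def relative_cost(hours, price_ratio=1.0):
    """Cost of one epoch in units of K80 instance-hours."""
    return hours * price_ratio


v100_cost = relative_cost(V100_HOURS_PER_EPOCH, V100_PRICE_RATIO)
for k80_hours in K80_HOURS_PER_EPOCH:
    k80_cost = relative_cost(k80_hours)
    saving = 1 - v100_cost / k80_cost
    print(f"K80 at {k80_hours:.0f} h/epoch: V100 is {saving:.0%} cheaper")
```

So at the fast end of the K80 range the two come out even, and at the slow end the V100 is about a third cheaper per epoch.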

When you factor in the value of not having to wait an entire day to see the results of an epoch, this is a no-brainer as far as I'm concerned. Unfortunately, I'm sure my AWS bill is going to increase substantially. That's how they get you... Once you have a taste of HPC they know you'll be back for more...

