AWS Lambda

Thursday 07 March 2019

I've been working with AWS Lambda recently and I am very impressed. Usually if I need to set up a microservice or a recurring task or anything like that I'll just set up something on one of my virtual servers so I didn't think Lambda would be all that useful. But it makes it really, really easy to set up little tasks and it is much cheaper than having a whole virtual server.

You can create tasks in a number of different languages, and set up a variety of triggers ranging from HTTP requests to scheduled tasks, and when the Lambda is triggered AWS spins it up, executes it and then shuts it down. Since it is so ephemeral it is completely stateless, but you can load files from S3 buckets if you need data of any sort. I assume you can probably also connect to a variety of AWS databases as well, although I haven't done this yet. If you need additional libraries or packages that are not default you can create a layer containing them.

Lambda is not going to replace servers for most use cases, but I think serverless technology is going to make quite a dent in the near future.

Labels: coding, aws, lambda
No comments

CatBoost

Thursday 10 January 2019

Usually when you think of a gradient boosted decision tree you think of XGBoost or LightGBM. I'd heard of CatBoost but I'd never tried it and it didn't seem too popular. I was looking at a Kaggle competition which had a lot of categorical data and I had squeezed just about every drop of performance I could out of LGBM so I decided to give CatBoost a try. I was extremely impressed.

Out of the box, with all default parameters, CatBoost scored better than the LGBM I had spent about a week tuning. CatBoost trained significantly slower than LGBM, but it will run on a GPU and doing so makes it train just slightly slower than the LGBM. Unlike XGBoost it can handle categorical data, which is nice because in this case we have far too many categories to do one-hot encoding. I've read the documentation several times but I am still unclear as to how exactly it encodes the categorical data, but whatever it does works very well.

I am just beginning to try to tune the hyperparameters so it is unclear how much (if any) extra performance I'll be able to squeeze out of it, but I am very, very impressed with CatBoost and I highly recommend it for any datasets which contain categorical data. Thank you Yandex! 

Labels: coding, data_science, machine_learning, kaggle, catboost
2 comments

Exercise Log

Tuesday 27 November 2018

I exercise quite a lot and I have not been able to find an app to keep track of it which satisfies all of my criteria. Most fitness trackers are geared towards cardio and I also do a lot of strength training. After spending a year trying to make due with combinations of various fitness trackers and other apps I decided to just write my own, which could do everything I wanted and could show all of the reports I wanted.

I did that and after using it for a few weeks put it online at workout-log.com. It's not fancy and it is quite likely very buggy at this point, but it is open to anyone who wants to use it. 

It's written with Django and jQuery and uses ChartJS for the charts. 

Labels: python, django, data_science, machine_learning
1 comments

CoLab TPUs One Month Later

Wednesday 31 October 2018

After having used both CoLab GPUs and TPUs for almost a month I must significantly revise my previous opinion. Even for a Keras model not written or optimized for TPUs, with some minimal configuration changes TPUs perform much faster - minimum of twice the speed. In addition to making sure that all operations are TPU compatible, the only major configuration change required is increasing the batch size by 8. At first I was playing around with the batch size, but I realized that this was unnecessary. TPUs have 8 shards, so you simply multiple the GPU batch size by 8 and that should be a good baseline. 

The model I am currently training on a TPU and a GPU simultaneously is training 3-4x faster on the TPU than on the GPU and the code is exactly the same. I have this block of code:

use_tpu = True
# if we are using the tpu copy the keras model to a new var and assign the tpu model to model
if use_tpu:
    TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    
    # create network and compiler
    tpu_model = tf.contrib.tpu.keras_to_tpu_model(
        model, strategy = tf.contrib.tpu.TPUDistributionStrategy(
            tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
    
    BATCH_SIZE = BATCH_SIZE * 8

The model is created with Keras and the only change I make is setting use_tpu to True on the TPU instance. 

One other thing I thought I would mention is that CoLab creates separate instances for GPU, TPU and CPU, so you can run multiple notebooks without sharing RAM or processor if you give each one a different type.

Labels: machine_learning, tensorflow, google, google_cloud
2 comments

Archives