Deep Learning Frameworks: PyTorch and TensorFlow


Deep learning is an area of artificial intelligence loosely inspired by the animal brain. Deep learning researchers build large neural network models with dozens of layers and millions of interconnections. The models are trained on large datasets to make accurate predictions on previously unseen data.

To deal with the complexity of building large deep neural network models and training them on large datasets, deep learning frameworks have been developed. Deep learning frameworks are essentially libraries for manipulating large matrices. They differ from other numerical libraries in two ways: they provide transparent support for GPUs, and they provide automatic differentiation.

GPU support makes it possible to train deep learning models on very large datasets in parallel, providing a substantial speed boost. Fast training on GPUs in turn enables research into better neural network algorithms and architectures. Automatic differentiation makes it possible to define new deep learning models and architectures easily and in a less error-prone manner than deriving the gradients by hand. Deep learning models are usually trained with the gradient descent algorithm, and automatic differentiation plays a vital role in supplying the gradients.
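To see where those gradients are used, recall that gradient descent repeatedly updates the model parameters $\mathbf{w}$ by stepping in the direction of the negative gradient of the loss $L$: $$ \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} L(\mathbf{w}), $$ where $\eta$ is the learning rate. Automatic differentiation is what supplies $\nabla_{\mathbf{w}} L(\mathbf{w})$ at each step.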

Several deep learning frameworks have been open-sourced by large research groups, including TensorFlow, Torch, Caffe, CNTK, MXNet, and Theano. In this blog I will highlight some of the features of two Python libraries, TensorFlow and PyTorch, for readers who are completely new to deep learning frameworks but have some experience with numerical libraries such as Python's NumPy or Matlab. The focus is on getting the underlying principles across rather than on building useful neural network models.

Auto-Differentiation

Each operation defined in these libraries has a corresponding operation that computes its derivative with respect to its input; e.g., if the operation is $\sin(x)$, its derivative $\cos(x)$ is defined automatically. A whole model is built from such primitive operations, and the derivative of the whole model is obtained by combining the primitive derivatives via the chain rule.
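As a quick sanity check, here is a minimal sketch in PyTorch (introduced below) that compares the automatically computed derivative of $\sin(x)$ against the analytic $\cos(x)$; the two printed values agree:

import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([0.0, 0.5, 1.0]), requires_grad=True)
y = torch.sin(x)
# backward on a non-scalar output needs the upstream gradients; all ones here
y.backward(torch.ones(3))
print(x.grad)        # the automatically computed derivative
print(torch.cos(x))  # the analytic derivative cos(x) for comparison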

Let's look at a concrete and simple example of automatic differentiation. Suppose we have the following expression: $$ \mathbf{y} = 3f(\mathbf{x})^2 + 2f(\mathbf{x}) + 3, $$ where $f$ is some element-wise function of $\mathbf{x}$. If we need the derivative of this expression, as is routinely required when training neural networks, we can write it manually: $$ \frac{dy_i}{dx_i} = \big(6 f(x_i) + 2\big) \frac{df(x_i)}{dx_i}. $$ In this case the derivative is simple, but it can get very complicated when building large models. Deep learning frameworks are handy in that they perform auto-differentiation for us, i.e., we never need to write the second expression: the libraries automatically provide the exact derivative of each expression that we define.
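As a quick worked check: with $f(\mathbf{x}) = \mathbf{x}$ (so every $\frac{df(x_i)}{dx_i} = 1$) and $x_i = 1$, the derivative evaluates to $6 \cdot 1 + 2 = 8$, which is exactly the first entry of the gradients computed by the code below.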

PyTorch Implementation

Let's look at the implementation of the above expression in PyTorch. PyTorch builds the graph (program) dynamically, executing each operation sequentially as it is defined by the programmer. The result of each operation can be printed just like a NumPy array. Information on installation can be found at pytorch.org.

import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([[1, 2], [3, 4]]), requires_grad=True)
y = 3 * x**2 + 2 * x + 3
y.backward(torch.ones(2, 2))

# Print auto-gradients
print(x.grad)
# Print manual derivatives
manual_grad = (6 * x + 2) * torch.ones(2, 2)
print(manual_grad)

Lines $1$-$2$ import the torch libraries. Line $4$ defines the input $x$ as the tensor $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}.$ Line $5$ is our expression $\mathbf{y} = 3f(\mathbf{x})^2 + 2f(\mathbf{x}) + 3$. Line $6$ computes the gradients of the expression $\mathbf{y}$ with respect to the input $\mathbf{x}$. Since $\mathbf{y}$ is treated as a function of $f(\mathbf{x})$, we need to provide $\frac{df(x_i)}{dx_i}$ as the argument to backward; here we assume $f(\mathbf{x}) = \mathbf{x}$, so all $\frac{df(x_i)}{dx_i} = 1$ (hence torch.ones(2, 2)). Line $9$ prints the automatically computed gradients of the graph output $\mathbf{y}$ with respect to the graph input $\mathbf{x}$, and lines $11$-$12$ compute and print the manually derived gradients as defined above. The output of the code snippet is given below:

Variable containing:
  8  14
 20  26
[torch.FloatTensor of size 2x2]
Variable containing:
  8  14
 20  26
[torch.FloatTensor of size 2x2]
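As an aside, in PyTorch 0.4 and later Variable has been merged into Tensor, so the requires_grad flag can be set on the tensor directly. A minimal sketch of the same example in the newer style:

import torch

# requires_grad=True marks x for automatic differentiation (PyTorch 0.4+)
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = 3 * x**2 + 2 * x + 3
y.backward(torch.ones(2, 2))
print(x.grad)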

TensorFlow Implementation

To install TensorFlow, run pip install tensorflow in a terminal; see tensorflow.org for further information. As opposed to the dynamic graphs of PyTorch, TensorFlow defines the graph statically, provides a mechanism to feed data into the graph, and then executes the graph in what is called a session. Such static graphs can be faster than dynamic graphs, but they make debugging harder. Here is the TensorFlow implementation of the expression discussed above:

import tensorflow as tf

x = tf.placeholder(tf.float32, [2, 2])
y = 3 * x**2 + 2 * x + 3
g = tf.gradients(y, x)

with tf.Session() as sess:
  grad = sess.run(g, feed_dict={x: [[1.0, 2.0], [3.0, 4.0]]})
  print(grad)

Lines $3$-$5$ define the graph, consisting of a placeholder node for the input, a node for the expression $\mathbf{y}$, and a node that computes the gradient of $\mathbf{y}$ with respect to the input node $\mathbf{x}$. Line $7$ starts a TensorFlow session to execute the graph. Line $8$ feeds $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ to the graph and fetches the output of the node $g$. The output is given below:
[array([[  8.,  14.],
       [ 20.,  26.]], dtype=float32)]
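As an aside, in TensorFlow 2.x eager execution is the default and gradients are computed with tf.GradientTape instead of a static graph and session. A minimal sketch of the same example in the newer style:

import tensorflow as tf

# TensorFlow 2.x: no placeholders or sessions needed
x = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
with tf.GradientTape() as tape:
    y = 3 * x**2 + 2 * x + 3
print(tape.gradient(y, x))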

GPU Utilization

One of the main advantages of deep learning libraries is that one can easily specify which operations should be executed on which device, so the same code written for the CPU can be run on GPUs with minimal changes.

To give a concrete example, we will generate two random $n \times n$ matrices and multiply them together, and see how easily this computation can be moved to a GPU.

PyTorch Implementation

The PyTorch implementation is given below. Lines $5$-$8$ generate the random matrices: $a$ and $b$ stay on the CPU, while $x$ and $y$ are moved to GPU memory by calling .cuda(). Lines $11$ and $16$ are identical matrix multiplications, except that the first runs on the CPU while the second runs on the GPU, since both of its operands reside on the GPU. For this simple example, the output shows the GPU matrix multiplication to be about five times faster than the CPU computation.

import torch
import time

n = 30000
a = torch.randn(n, n)
b = torch.randn(n, n)
x = torch.randn(n, n).cuda()
y = torch.randn(n, n).cuda()

start = time.time()
c = torch.mm(a, b)
finish = time.time()
print('Time Elapsed using CPUs:', finish - start, 'seconds')

start = time.time()
z = torch.mm(x, y)
finish = time.time()
print('Time Elapsed using GTX 1080 GPU:', finish - start, 'seconds')

Time Elapsed using CPUs: 103.64060735702515 seconds
Time Elapsed using GTX 1080 GPU: 21.5598406791687 seconds
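One caveat worth noting: CUDA kernels are launched asynchronously, so wall-clock timings like the ones above may return before the GPU has actually finished. For a more careful measurement, one would synchronize before reading the clock; a minimal sketch, assuming a CUDA-capable GPU is available:

import torch
import time

n = 30000
x = torch.randn(n, n).cuda()
y = torch.randn(n, n).cuda()

torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
start = time.time()
z = torch.mm(x, y)
torch.cuda.synchronize()  # wait for the multiplication to actually complete
print('GPU time:', time.time() - start, 'seconds')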

TensorFlow Implementation

If GPU devices are available, TensorFlow will automatically distribute the nodes of the graph across devices. However, one can also manually specify which parts of the graph should be placed on which device. In the code snippet below, lines $6$-$9$ and lines $11$-$14$ define identical subgraphs, each generating two random matrices and multiplying them; the only difference is that the former is placed on the CPU while the latter is placed on the GPU. The output in this case shows around a $40\times$ speedup.

import tensorflow as tf
import time

n = 30000

with tf.device('/cpu:0'):
  a = tf.random_normal((n, n))
  b = tf.random_normal((n, n))
  c = tf.matmul(a, b)

with tf.device('/device:GPU:0'):
  x = tf.random_normal((n, n))
  y = tf.random_normal((n, n))
  z = tf.matmul(x, y)

with tf.Session() as sess:
  start = time.time()
  sess.run(c)
  finish = time.time()
  print('Time Elapsed using CPUs:', finish - start, 'seconds')

  start = time.time()
  sess.run(z)
  finish = time.time()
  print('Time Elapsed using GTX 1080 GPU:', finish - start, 'seconds')

Time Elapsed using CPUs: 396.8419020175934 seconds
Time Elapsed using GTX 1080 GPU: 9.227999687194824 seconds
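To verify which device each node actually ran on, TensorFlow can log device placements when the session is created. A minimal sketch:

import tensorflow as tf

# log_device_placement=True prints the device assigned to each operation
config = tf.ConfigProto(log_device_placement=True)
with tf.device('/cpu:0'):
  a = tf.random_normal((2, 2))
  b = tf.random_normal((2, 2))
  c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
  sess.run(c)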

Summary

In this blog I have provided a brief introduction to PyTorch and TensorFlow. Many details were skipped in favor of simplicity, and several statements are only first-order approximations of the truth. Examples illustrating auto-differentiation and GPU computation were presented. For more on auto-differentiation, please refer to Automatic differentiation in machine learning: a survey; the technical details of TensorFlow are described in TensorFlow: Large-Scale Machine Learning. For building neural network models using PyTorch and TensorFlow, many examples are available on their respective websites. Please send any feedback to najeeb dot khan at usask dot ca.