Monday, April 27, 2015

GPU Programming - Pyopencl

[Pyopencl]

A Simple program with Pyopencl

This post talks about a simple map program written using Pyopencl. It compares the running time of a plain Python map against an OpenCL kernel. Let us start by importing the pyopencl library and getting all the platforms on the current system.

In [3]:
import pyopencl as cl
plt = cl.get_platforms()
plt
Out[3]:
[<pyopencl.Platform 'Intel(R) OpenCL' at 0x8011360>]
As you can see, there is only one OpenCL platform on my host machine. Let us proceed to see the devices in this platform.
In [4]:
devices = plt[0].get_devices()
devices
Out[4]:
[<pyopencl.Device 'Intel(R) HD Graphics 4000' on 'Intel(R) OpenCL' at 0x6c50fd80>]
As you can see, we have an Intel graphics card installed. We can verify the same by looking at the Device Manager.
In [14]:
from IPython.display import Image
Image(filename=r'e:\devicemanager.png')
Out[14]:
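We can also query the same information directly from pyopencl instead of the Device Manager. A small sketch of how that might look (the exact numbers will of course depend on your hardware):
dev = devices[0]
print(dev.name)                  # e.g. 'Intel(R) HD Graphics 4000'
print(dev.max_compute_units)     # number of compute units on the device
print(dev.global_mem_size)       # global memory size in bytes
print(dev.max_work_group_size)   # largest work-group size the device supports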
In [6]:
ctx = cl.Context([devices[0]])
ctx.get_info(cl.context_info.DEVICES)
Out[6]:
[<pyopencl.Device 'Intel(R) HD Graphics 4000' on 'Intel(R) OpenCL' at 0x6c50fd80>]
We created a device context in the previous step. Let us proceed to create our input data.
In [7]:
import numpy as np
in_vector = np.arange(100000).astype(np.float32)
out_vector = np.empty_like(in_vector)
We create a numpy vector of size 100K called in_vector. We also create an output vector called out_vector to store our results.
In [8]:
mf = cl.mem_flags
in_buffer  = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_vector)
out_buffer = cl.Buffer(ctx,mf.WRITE_ONLY,out_vector.nbytes)
We create two buffers. in_buffer is the buffer in the GPU's global memory from which the kernel will read the input data; you can see that the hostbuf parameter points to in_vector, so its contents are copied into the buffer. The GPU will write its output to out_buffer.
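If you want to double-check that the device-side buffers were sized as expected, pyopencl lets you query them. A small sanity check, purely for illustration:
print(in_buffer.get_info(cl.mem_info.SIZE))   # should equal in_vector.nbytes (400000)
print(out_buffer.get_info(cl.mem_info.SIZE))  # should equal out_vector.nbytes (400000)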
In [9]:
prg_str = " \
    __kernel void sq_input(__global const float *in_vector, \
    __global float *out_vector) \
    { \
      int gid = get_global_id(0); \
      out_vector[gid] = in_vector[gid] * in_vector[gid]; \
    } \
    "
This is our kernel program. Many instances of this kernel will be run; we will specify how many later. As you can see, the kernel takes two pointers into GPU memory, and those locations should be initialized before running the kernel. Every processor on the GPU will run a copy of this kernel. Using get_global_id we get each work-item's unique global id, which allows the processor on the GPU to figure out which piece of memory it should process. The processor then loads the value from the in_vector array, squares it, and stores it at the corresponding position in out_vector. Let us proceed to build the kernel.
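To make the indexing concrete, here is roughly what all the work-items together compute, written as an ordinary Python loop (purely an illustration; on the GPU each iteration is handled by a separate work-item, and the variable name reference is just for this sketch):
# CPU-side picture of the kernel: gid plays the role of get_global_id(0)
reference = np.empty_like(in_vector)
for gid in range(len(in_vector)):
    reference[gid] = in_vector[gid] * in_vector[gid]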
In [10]:
kernel_prg = cl.Program(ctx,prg_str).build()
C:\Python27\Lib\site-packages\pyopencl\__init__.py:61: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  "to see more.", CompilerWarning)

In [11]:
queue = cl.CommandQueue(ctx)
kernel_prg.sq_input(queue, in_vector.shape, None, in_buffer, out_buffer)
cl.enqueue_copy(queue, out_vector, out_buffer)
Out[11]:
<pyopencl._cl.NannyEvent at 0x5ffad88>
We now create a command queue to which the kernel will be submitted. We invoke the sq_input kernel, and finally copy the output buffer back to out_vector. Let us now look at the values of our input and output.
In [12]:
in_vector
Out[12]:
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
         9.99970000e+04,   9.99980000e+04,   9.99990000e+04], dtype=float32)
In [13]:
out_vector
Out[13]:
array([  0.00000000e+00,   1.00000000e+00,   4.00000000e+00, ...,
         9.99939994e+09,   9.99959962e+09,   9.99980032e+09], dtype=float32)
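The outputs look right. Rather than eyeballing a few elements, we can also verify the whole result programmatically against numpy (a quick sanity check, not part of the original run):
# compare the GPU result with numpy-computed squares
print(np.allclose(out_vector, in_vector * in_vector))   # expect True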

Timeit to check performance

Let us run some tests comparing the time taken by the GPU computation against the CPU. For the CPU we are going to use a simple map that applies an anonymous squaring function to all the elements of our input.
In [16]:
%timeit kernel_prg.sq_input(queue, in_vector.shape, None, in_buffer, out_buffer)
10000 loops, best of 3: 40.5 µs per loop

In [17]:
%timeit map(lambda x: x*x,in_vector)
10 loops, best of 3: 38.5 ms per loop

You can see the difference. The GPU kernel runs in microseconds, while the CPU map takes milliseconds: roughly 950 times faster (38.5 ms / 40.5 µs). We ran it for different input sizes and have plotted the results below.
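The exact script used to collect the numbers below is not shown, but it could look roughly like the following sketch, which rebuilds the buffers for each size and times one kernel launch versus one CPU map (the sizes, variable names, and use of time.time here are illustrative assumptions, not the original measurement code):
import time

sizes = [100, 1000, 10000, 100000, 1000000, 10000000]
for n in sizes:
    vec = np.arange(n).astype(np.float32)
    out = np.empty_like(vec)
    buf_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=vec)
    buf_out = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

    # time one kernel launch, waiting for it to finish
    start = time.time()
    kernel_prg.sq_input(queue, vec.shape, None, buf_in, buf_out).wait()
    gpu_us = (time.time() - start) * 1e6

    # time one CPU map over the same data
    start = time.time()
    map(lambda x: x * x, vec)
    cpu_us = (time.time() - start) * 1e6

    print(n, gpu_us, cpu_us)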
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
In [7]:
N = np.asarray([100, 1000, 10000, 100000, 1000000, 10000000])
cpu_time = np.asarray([35.4, 337, 3500, 37100, 386000, 4080000])
gpu_time = np.asarray([39.2, 44.7, 42, 41.1, 670, 7590])
In [23]:
plt.title("CPU vs GPU Time in log scale")
plt.xlabel("Log(N)")
plt.ylabel("Log(Execution Time in microseconds)")
plt.plot(np.log(N),np.log(cpu_time),label='CPU Time')
plt.plot(np.log(N),np.log(gpu_time),label='GPU Time')
plt.legend(loc='best')
Out[23]:
<matplotlib.legend.Legend at 0x896d570>