Notes on NumPy Performance Improvement

Disclaimer: These are compiled notes from various sources. Apologies if some references are missing.


When conducting experiments with Gensim (an awesome NLP library), I noticed I wasn't getting results as fast as the authors claimed. This led me to wonder: can something be done to speed up NumPy operations? NumPy performs accelerated linear algebra using one of several backends, such as OpenBLAS or Intel MKL, that are optimized for the CPU at hand.

In this post, we will discuss the following:

  1. Brief Introduction to the libraries:
  • Default BLAS & LAPACK
  • Automatically Tuned Linear Algebra Software (ATLAS)
  • Intel Math Kernel Library (Intel MKL)
  • OpenBLAS

  2. Use parallel primitives

  3. Diagnostics - check linkage and installations

  4. Installation Guides

  • Installation and Experimentation on Local Machine
  • Installation Guide 1: Install OpenBLAS library:
  • https://roman-kh.github.io/numpy-multicore/
  • Installation Guide 2: compiled numpy inside a virtualenv with OpenBLAS integration
  • Installation Guide 3: Installing Atlas on Linux Server(official documentation):

  5. Benchmarking Gensim operations:

  • Basic Matrix Decomposition Experiment
  • My Experiment on Wikipedia Dataset
  • OpenBlas vs Intel MKL performance comparison (3rd Party)
  • Lessons Learned

1. Brief Introduction to the libraries:

BLAS & LAPACK

BLAS is the low–level part of your system that is responsible for efficiently performing numerical linear algebra, i.e. all the heavy number crunching. More precisely:

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits.

Together with something like LAPACK, it is the cornerstone of today's scientific software packages.

LAPACK (Linear Algebra Package) is a standard software library for numerical linear algebra. It provides routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition.
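
To make this concrete: the familiar high-level NumPy calls below are thin wrappers around LAPACK driver routines (a small illustration; the routine names in the comments are the standard LAPACK drivers NumPy dispatches to):

import numpy as np

A = np.random.random((500, 500))
b = np.random.random(500)

x = np.linalg.solve(A, b)    # LAPACK gesv: LU-based linear solve
u, s, vt = np.linalg.svd(A)  # LAPACK gesdd: singular value decomposition
S = np.dot(A, A.T) + 500 * np.eye(500)  # make a safely positive-definite matrix
L = np.linalg.cholesky(S)    # LAPACK potrf: Cholesky factorization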


Now, as mentioned above, BLAS & LAPACK are merely defining a standard interface, but what really matters is the implementation you choose. There are many different implementations under different licenses, which all have their advantages and disadvantages. It can sometimes be difficult to choose the most beneficial implementation for your machine.

Default BLAS & LAPACK
By default, you will most likely only have the standard BLAS & LAPACK implementation on your system (e.g. the Arch Linux packages blas/cblas and lapack). While this setup works out of the box and is certainly sufficient if you don't care about performance, it has considerable shortcomings compared to more recent implementations, e.g. it does not fully support multi-core computations.

Automatically Tuned Linear Algebra Software (ATLAS)
The main purpose of the ATLAS project is to provide a portable high–performance implementation for many different platforms. To achieve this, they use a complex mechanism to automatically determine the optimal parameters for the respective hardware during compilation. ATLAS has become quite popular in recent years, since it uses the BSD license and is platform independent. When compiled properly, it probably provides a reasonably good BLAS library for most performance–critical applications on most platforms.

On Arch Linux you can use the atlas-lapack package from the repository, which will download and compile ATLAS on your machine. However, before starting the compilation, make sure to disable CPU throttling in your OS, since this will mess up the timing when benchmarking ATLAS during compilation.
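
For example, on many Linux systems you can switch the CPU frequency governor to performance mode for the duration of the build (this assumes the cpupower utility is installed; see the ATLAS install notes linked below for other platforms):

$ sudo cpupower frequency-set -g performance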

Intel Math Kernel Library (Intel MKL)

The Intel Math Kernel Library (Intel MKL) is a proprietary library that is hand-optimized specifically for Intel processors. As of February 2017, it is free of charge for some use cases under Intel's Community Licensing program.
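
As an aside, if you use Anaconda, the easiest way to get an MKL-linked NumPy is usually through conda, since Anaconda's default NumPy build links against MKL:

$ conda install numpy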

OpenBLAS

OpenBLAS is another popular open-source implementation, based on a fork of GotoBLAS2. Its FAQ claims that OpenBLAS achieves performance comparable to Intel MKL on Intel's Sandy Bridge CPUs. Furthermore, OpenBLAS is well-known for its multi-threading support and apparently scales very nicely with the number of cores. Obtain a copy from openblas.net or GitHub and follow the installation instructions, which walk you through the compilation and installation process.
reference: http://markus-beuckelmann.de/blog/boosting-numpy-blas.html


2. Use parallel primitives:

One of the great strengths of NumPy is that you can express array operations very cleanly. For example, to compute the product of matrix A and matrix B, you just do:

C = numpy.dot(A,B)


Not only is this simple and clear to read and write; since NumPy knows you want a matrix dot product, it can use an optimized implementation obtained as part of BLAS (the Basic Linear Algebra Subprograms). This will normally be a library carefully tuned to run as fast as possible on your hardware, taking advantage of cache memory and assembler implementations. Many architectures now also have a BLAS that takes advantage of a multicore machine. If your NumPy/SciPy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything. The same holds for other matrix operations, like inversion, singular value decomposition, determinant, and so on. For example, the open-source library ATLAS allows compile-time selection of the level of parallelism (number of threads), while the proprietary MKL library from Intel offers the possibility to choose it at runtime. There is also the GotoBLAS library, which allows run-time selection of the level of parallelism; it is a commercial product, but the source code is distributed free for academic use.
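
A quick (and unscientific) way to observe this is to pin the BLAS thread count via an environment variable and time a large dot product. OPENBLAS_NUM_THREADS applies to OpenBLAS builds and MKL_NUM_THREADS to MKL builds; OMP_NUM_THREADS is often honored by both:

$ OPENBLAS_NUM_THREADS=1 python -c "import numpy as np, time; a = np.random.random((2000, 2000)); t = time.time(); np.dot(a, a); print(time.time() - t)"
$ OPENBLAS_NUM_THREADS=4 python -c "import numpy as np, time; a = np.random.random((2000, 2000)); t = time.time(); np.dot(a, a); print(time.time() - t)"

With a multi-threaded BLAS, the second invocation should report a noticeably smaller time.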

Finally, scipy/numpy does not parallelize operations like

A = B + C
A = numpy.sin(B)
A = scipy.stats.norm.isf(B)


These operations run sequentially, taking no advantage of multicore machines (but see below). In principle, this could be changed without too much work. OpenMP is an extension to the C language that allows compilers to produce parallel code for appropriately annotated loops (and other things). If someone sat down and annotated a few core loops in NumPy (and possibly in SciPy), and if NumPy/SciPy were then compiled with OpenMP turned on, all three of the above would automatically run in parallel. Of course, in reality one would want some runtime control; for example, one might want to turn off automatic parallelization when planning to run several jobs on the same multiprocessor machine.
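
In the meantime, one practical workaround for element-wise expressions is the numexpr package (pip install numexpr), which evaluates such expressions in parallel chunks with its own thread pool. A minimal sketch:

import numpy as np
import numexpr as ne

B = np.random.random((5000, 5000))

# np.sin(B) would run on a single core; numexpr evaluates the same
# expression in parallel across all available cores.
A = ne.evaluate("sin(B)")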

reference: from SciPy documentation: https://scipy.github.io/old-wiki/pages/ParallelProgramming


3. Diagnostics

Check Linkage and installations:

To see which BLAS and LAPACK SciPy is using, type into your shell:


python -c 'import scipy; scipy.show_config()'

To see which library your NumPy was compiled against, run:


python -c 'import numpy; numpy.show_config()'

Output on a Mac after installation:


lapack_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/asjad/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/asjad/anaconda/include']
lapack_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/asjad/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/asjad/anaconda/include']
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/asjad/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/asjad/anaconda/include']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/asjad/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/asjad/anaconda/include']

On an Amazon EC2 t2.medium instance (Intel Xeon server):


lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
openblas_lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
blas_mkl_info:
  NOT AVAILABLE

Check Linkage:

To find out what you have, check whether your NumPy is linked against BLAS.
The most definitive way to check on *nix is to use ldd to find out which shared libraries NumPy links against at runtime.


~$ ldd /usr/lib/python2.7/dist-packages/numpy/core/_dotblas.so 

    linux-vdso.so.1 =>  (0x00007fff12db8000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0x00007fce7b028000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce7ac60000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce7a958000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce7a738000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fce7ca40000)

/usr/lib/libblas.so.3 is actually the start of a chain of symlinks. If I follow them to their ultimate target using readlink -e, I see that they point to my OpenBLAS shared library:

~$ readlink -e /usr/lib/libblas.so.3
/usr/lib/openblas-base/libblas.so.3
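
Note that _dotblas.so only exists in older NumPy builds (it was removed around NumPy 1.10). On newer versions, a catch-all check is to run ldd over all of NumPy's core extension modules and grep for BLAS/LAPACK (adjust the path to match your installation):

$ find /usr/lib/python2.7/dist-packages/numpy/core -name '*.so' | xargs ldd | grep -i 'blas\|lapack'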

4. Installation Guides

Installation and Experimentation on Local Machine

Guide 1: Install the OpenBLAS library (see https://roman-kh.github.io/numpy-multicore/):

Guide 2: Compile NumPy inside a virtualenv with OpenBLAS integration


Follow these steps:

  1. Compile OpenBLAS:

$ git clone https://github.com/xianyi/OpenBLAS
$ cd OpenBLAS && make FC=gfortran
$ sudo make PREFIX=/opt/OpenBLAS install

2. If you don't have admin rights, you can set PREFIX= to a directory where you have write privileges (just modify the corresponding steps below accordingly).

3. Make sure that the directory containing libopenblas.so is in your shared library search path.

  • To do this locally, you could edit your ~/.bashrc file to contain the line

export LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

  • The LD_LIBRARY_PATH environment variable will be updated when you start a new terminal session (use $ source ~/.bashrc to force an update within the same session).
  • Another option that will work for multiple users is to create a .conf file in /etc/ld.so.conf.d/ containing the line /opt/OpenBLAS/lib, e.g.:

$ sudo sh -c "echo '/opt/OpenBLAS/lib' > /etc/ld.so.conf.d/openblas.conf"

4. Once you are done with either option, run


$ sudo ldconfig

5. Grab the numpy source code:


$ git clone https://github.com/numpy/numpy
$ cd numpy

6. Copy site.cfg.example to site.cfg and edit the copy:


$ cp site.cfg.example site.cfg
$ nano site.cfg

7. Uncomment these lines:


....
[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
....

8. Check the configuration, then build and install (optionally inside a virtualenv):

python setup.py config

9. The output should look something like this:


...
openblas_info:
  FOUND:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

  FOUND:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
...

10. Installing with pip is preferable to using python setup.py install, since pip will keep track of the package metadata and allow you to easily uninstall or upgrade numpy in the future.


$ pip install .

11. Optional: you can use this script to test performance for different thread counts.
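
The script itself is not reproduced here, but a minimal sketch producing output of the same shape (version, maxint, BLAS info, dot timing) could look like this, assuming an older NumPy where numpy.__config__.blas_opt_info is available as a dict:

#!/usr/bin/env python
from __future__ import print_function

import sys
import timeit
import numpy as np

print('version:', np.__version__)
print('maxint: ', sys.maxsize)
print()

# blas_opt_info describes the BLAS that NumPy was built against
print('BLAS info:')
for key, value in np.__config__.blas_opt_info.items():
    print(' *', key, value)

a = np.random.random((1000, 1000))
elapsed = timeit.timeit(lambda: np.dot(a, a), number=10) / 10
print()
print('dot:', elapsed, 'sec')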


$ OMP_NUM_THREADS=1 python build/test_numpy.py

version: 1.10.0.dev0+8e026a2
maxint:  9223372036854775807

BLAS info:
 * libraries ['openblas', 'openblas']
 * library_dirs ['/opt/OpenBLAS/lib']
 * define_macros [('HAVE_CBLAS', None)]
 * language c

dot: 0.099796795845 sec

$ OMP_NUM_THREADS=8 python build/test_numpy.py

version: 1.10.0.dev0+8e026a2
maxint:  9223372036854775807

BLAS info:
 * libraries ['openblas', 'openblas']
 * library_dirs ['/opt/OpenBLAS/lib']
 * define_macros [('HAVE_CBLAS', None)]
 * language c

dot: 0.0439578056335 sec

reference: https://stackoverflow.com/questions/11443302/compiling-numpy-with-openblas-integration/14391693#14391693

Guide 3: Installing ATLAS on a Linux server (official documentation):

a. Downloading:

  1. wget https://sourceforge.net/projects/math-atlas/files/latest/download?source=files
  2. Rename the downloaded file to atlas3.10.3.tar.bz2
  3. Unzip:

 bunzip2 -c ~/dload/atlas3.10.3.tar.bz2 | tar xfm -

4. Disable CPU throttling (see the official notes):


http://math-atlas.sourceforge.net/atlas_install/node5.html    



b. Installation (basic steps of an ATLAS install) - relevant link:

    http://math-atlas.sourceforge.net/atlas_install/node6.html
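
For reference, the basic build sequence from those docs looks roughly like this (a sketch; see the linked pages for the platform-specific configure flags):

$ cd ATLAS
$ mkdir build && cd build
$ ../configure --shared --prefix=/opt/atlas
$ make build      # compile and auto-tune (this can take hours)
$ make check      # sanity-check the serial routines
$ make install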


5. Benchmarking Gensim operations:

Basic Matrix Decomposition Experiment:


#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from __future__ import print_function

import numpy as np
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)

size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((int(size / 2), int(size / 4)))
F = np.random.random((int(size / 2), int(size / 2)))
F = np.dot(F, F.T)
G = np.random.random((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices = False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))


This performs matrix multiplication, vector-vector multiplication, singular value decomposition (SVD), Cholesky factorization and eigendecomposition, and averages the timing results over multiple runs (the absolute numbers are of course machine-dependent).

Conclusion: in these benchmarks, Intel's Math Kernel Library (Intel MKL) performs best.
reference: http://markus-beuckelmann.de/blog/boosting-numpy-blas.html

Wikipedia Experiment:


This experiment is divided into three phases.

Phase A:
Tasks:

  • Convert the articles to plain text (process Wiki markup)
  • Create and save the dictionary + corpus
  • Create and save the sparse TF-IDF vector representation

python -m gensim.scripts.make_wiki
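
The script expects the path to the compressed Wikipedia dump and an output prefix as arguments; a typical invocation (file names are illustrative; check gensim's documentation for your version) looks like:

python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 ./data/wiki_en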



Phase B (create the model):


>>> # extract 400 LSI topics; use the default one-pass algorithm
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.print_topics(10)


Phase C (create an index):

index = Similarity('./data/wiki_index', lsi_corpus, num_features=lsi_corpus.num_terms)

# store to disk
index.save('./data/wiki_index.index')

# load back = same index
index = Similarity.load('./data/wiki_index.index')
print(index)
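
Once built, the index can be queried with any document already transformed into the LSI space; here query_lsi is a hypothetical placeholder for such a vector:

# e.g. query_lsi = lsi[tfidf[dictionary.doc2bow(tokens)]]
sims = index[query_lsi]  # cosine similarity against every indexed document
print(sorted(enumerate(sims), key=lambda item: -item[1])[:10])  # top-10 matches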

My Experimental results:

| Operation (Wikipedia Exp. - Phase A) | Library | System Details | Total Time |
| --- | --- | --- | --- |
| Convert the articles to plain text (process Wiki markup); create and save dictionary + corpus; create and save sparse TF-IDF vector representation | mkl_intel_lp64 (Intel MKL) | MacBook Pro (Intel Core i7, 4 cores) | Start: 18:26:59,674; End: 00:55:16,216; ~6 hrs |
| Same tasks as above | OpenBLAS | Amazon EC2 t2.medium (Intel Xeon, 2 cores) | Start: 2017-09-18 07:38:50,521; still running at 2017-09-19 01:15:30,764 (>17 hrs) |




Thread Comparison: (comparison chart not reproduced here)


Experiments by others

Library Comparison - Benchmarks: 

details: https://software.intel.com/en-us/articles/performance-comparison-of-openblas-and-intel-math-kernel-library-in-r

The Intel Xeon E5-2670 has eight physical cores per chip, so there are 16 physical cores in one server node. The Intel MKL library supports a single-threaded (sequential) mode and an OpenMP threading mode; in the threading mode it uses all physical cores in a node (here, 16) by default. Fig. 1 shows the results for Intel MKL with 1 thread and with 16 threads (automatic parallel execution), shown in blue. Five subtasks show a significant benefit from either the optimized sequential math library or automatic parallelization with MKL: crossprod (matrix size 2800x2800), linear regression, matrix decomposition, computing the inverse, and computing the determinant of a matrix. The other, non-computationally-intensive tasks gained very little from parallel execution with MKL.

More details: http://www.parallelr.com/r-hpac-benchmark-analysis/

Benchmark: OpenBLAS vs. Intel MKL vs. ATLAS

reference : https://github.com/tmolteno/necpp/issues/18

Lessons Learned:


The NumPy parts of the LSI implementation that deal with matrices (matrix multiplication, SVD, ...) are delegated to low-level libraries (BLAS + LAPACK). MKL and similar libraries are highly optimized and tailored to a particular hardware architecture, and can yield significant speed-ups almost for free. Many of these libraries are also multi-threaded and can make use of all the available cores on a given system. The default implementation has many limitations, e.g. it does not use all the cores on a machine. Installing ATLAS, OpenBLAS or, preferably, Intel's Math Kernel Library (Intel MKL) makes use of all available cores, and we can expect speed-ups of up to a factor of 10x-20x. From the Wikipedia experiment I can confirm that linking NumPy and SciPy against the appropriate BLAS library (in my case Intel's MKL) does indeed accelerate the cleaning, dictionary and corpus creation, and TF-IDF conversion pipeline.