Scientific Python 1: Introduction

When I first started doing research, I would try to use C++ to write my programs because of “performance concerns”. My instructor and facts soon made me realize that it is extremely inefficient to do so. I am starting to realize that in scientific research (especially for machine learning), Python almost always yields less total time consumption. The development time with Python is obviously less than that of C++ by magnitudes. But believe it or not, carefully written Python code can also run as fast as C/C++ code. Together with simple multithreading, serialization and graphing makes Python the ideal language for research. I am slowly changing my views towards Python from “an extremely slow and hard-to-read” language to “the best language ever if not for its relatively slower speed”.

This series of articles are mainly aimed at two aspects:

  1. Discussing the use of scientific packages in Python, such as numpy, scipy, matplotlib, etc.

  2. Exploring the use of Python during paper composition, with packages like pythontex, pylatex and so on.

Just for drawing some attention, let us begin with a simple example. Assume that we are given a gradient image, and our goal is to find out pixels with values greater than 200.

lenna
Lenna, the old cliche.

The way I would do it is shown as below.

# using Pillow as Image reader
# there are a lot of alternatives, such as opencv, scikit-image, ...
from PIL import Image
import numpy as np

# change the filename as you would like to
img = Image.open('2019-05-14-sci-py-1.jpg')
# convert the image into numpy format
# only take the first channel
npImg = np.asarray(img)[:, :, 0]

# region is a boolean matrix, where the region of interest is
# labeled as True
region = npImg > 200
# now it is converted to a binary matrix
region = region.astype(np.uint8)
# scale it to 0-255
region *= 255
# the region of interest is labeled as black pixels
region = 255 - region

# save the image
pilImage = Image.fromarray(region)
pilImage.save('2019-05-14-sci-py-2.jpg')

The output is shown as below.

lenna-output
Pixels greater than 200 (black region)

This is just a simple example of how scientific computation can be with the help of numpy and scipy. Note that no loop is used throughout the entire code segment, and most computations are done in the backend of numpy, which is mainly written in C. Therefore, the speed of this program is almost as fast as pure C/C++ implementation.

Imagine how much work one has to do in order to achieve this in C++! He/she must do the following things:

All these stuff can easily take someone an hour, not even mentioning the painful debugging process of C++. But with Python, it only takes me less than 5 minutes to do so. Although using C++ may introduce some 10% performance gain (or maybe not, because compiling jpeg libraries to their full functionalities requires a ton of preparations, especially on Windows), in the field of scientific research, where efficiency is vital, it is unwise to use C++ as one’s major programming language. As professor Daniel Acuna always tells me: “perfection is the enemy of productivity.” If something barely works, it is already enough for us to generate some sort of conclusion and proceed to the next level.

Just to end this post with a useful reminder. If numpy is linked to LAPACK/BLAS, there can be a considerable performance gain (around 20% on my laptop). To check if the linkage is successful, one can type the following commands in the interactive session:

$ python3
Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]