Scientific Python 1: Introduction14 May 2019
When I first started doing research, I would try to use C++ to write my programs because of “performance concerns”. My instructor and facts soon made me realize that it is extremely inefficient to do so. I am starting to realize that in scientific research (especially for machine learning), Python almost always yields less total time consumption. The development time with Python is obviously less than that of C++ by magnitudes. But believe it or not, carefully written Python code can also run as fast as C/C++ code. Together with simple multithreading, serialization and graphing makes Python the ideal language for research. I am slowly changing my views towards Python from “an extremely slow and hard-to-read” language to “the best language ever if not for its relatively slower speed”.
This series of articles are mainly aimed at two aspects:
Just for drawing some attention, let us begin with a simple example. Assume that we are given a gradient image, and our goal is to find out pixels with values greater than 200.
The way I would do it is shown as below.
# using Pillow as Image reader # there are a lot of alternatives, such as opencv, scikit-image, ... from PIL import Image import numpy as np # change the filename as you would like to img = Image.open('2019-05-14-sci-py-1.jpg') # convert the image into numpy format # only take the first channel npImg = np.asarray(img)[:, :, 0] # region is a boolean matrix, where the region of interest is # labeled as True region = npImg > 200 # now it is converted to a binary matrix region = region.astype(np.uint8) # scale it to 0-255 region *= 255 # the region of interest is labeled as black pixels region = 255 - region # save the image pilImage = Image.fromarray(region) pilImage.save('2019-05-14-sci-py-2.jpg')
The output is shown as below.
This is just a simple example of how scientific computation can be with the help of numpy and scipy. Note that no loop is used throughout the entire code segment, and most computations are done in the backend of numpy, which is mainly written in C. Therefore, the speed of this program is almost as fast as pure C/C++ implementation.
Imagine how much work one has to do in order to achieve this in C++! He/she must do the following things:
All these stuff can easily take someone an hour, not even mentioning the painful debugging process of C++. But with Python, it only takes me less than 5 minutes to do so. Although using C++ may introduce some 10% performance gain (or maybe not, because compiling jpeg libraries to their full functionalities requires a ton of preparations, especially on Windows), in the field of scientific research, where efficiency is vital, it is unwise to use C++ as one’s major programming language. As professor Daniel Acuna always tells me: “perfection is the enemy of productivity.” If something barely works, it is already enough for us to generate some sort of conclusion and proceed to the next level.
Just to end this post with a useful reminder. If numpy is linked to LAPACK/BLAS, there can be a considerable performance gain (around 20% on my laptop). To check if the linkage is successful, one can type the following commands in the interactive session:
$ python3 Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import numpy >>> numpy.__config__.show() blas_mkl_info: NOT AVAILABLE blis_info: NOT AVAILABLE openblas_info: libraries = ['openblas', 'openblas'] library_dirs = ['/usr/local/lib'] language = c define_macros = [('HAVE_CBLAS', None)] blas_opt_info: libraries = ['openblas', 'openblas'] library_dirs = ['/usr/local/lib'] language = c define_macros = [('HAVE_CBLAS', None)] lapack_mkl_info: NOT AVAILABLE openblas_lapack_info: libraries = ['openblas', 'openblas'] library_dirs = ['/usr/local/lib'] language = c define_macros = [('HAVE_CBLAS', None)] lapack_opt_info: libraries = ['openblas', 'openblas'] library_dirs = ['/usr/local/lib'] language = c define_macros = [('HAVE_CBLAS', None)]