Julia: Flajolet-Martin Algorithm

I chose to pick up Julia because of its high performance. The problem with Python is that it works nicely 95 percent of the time, when all the computationally expensive operations can be replaced by function calls into C libraries. It fails completely when the remaining 5 percent of nasty situations show up, and such an annoying instance recently showed up for me. I am trying to migrate my code to Julia to see if there is a performance boost.

The count distinct problem

When counting the number of distinct elements in a stream, what we usually do is keep a set in memory and add each incoming element to it. This approach requires storing all unique elements in main memory, which can be infeasible when the stream is huge. The Flajolet–Martin algorithm proposes a way of estimating the number of unique elements in the stream without storing all of them. Its basic idea is as follows: we hash each record and count the number of trailing zeros in the hashed bits. If \(R\) is the maximum number of trailing zeros observed over the whole stream, then the number of unique elements is estimated as \(2^R\).
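The idea can be sketched in a few lines of Python. This is a minimal single-hash version: MD5 and the helper names are my own choices, not from the post, and a production implementation would combine many hash functions to reduce the variance of the estimate.

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """Count the trailing zero bits of x (0 maps to 0 to keep it simple)."""
    if x == 0:
        return 0
    # x & -x isolates the lowest set bit; its position is the zero count
    return (x & -x).bit_length() - 1

def flajolet_martin(stream) -> int:
    """Estimate the distinct count as 2**R, where R is the maximum
    number of trailing zeros seen in any hash value."""
    r = 0
    for item in stream:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "little")
        r = max(r, trailing_zeros(h))
    return 2 ** r

# duplicates do not change the estimate: only distinct hash values matter
print(flajolet_martin([1, 2, 3, 1, 2, 1]))
```

Note that the memory footprint is a single integer \(R\), regardless of how many distinct elements the stream contains.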

Using pointer to array

When I first started writing the Julia version of the algorithm, I found that there was no significant performance gain. After some investigation, I realized that it was because of array slicing. Whenever a new array slice is created, the memory is copied and a new object is allocated, which can be extremely time consuming. Therefore, we can use a raw pointer into the array to save some time. To get a pointer to the array elements, we can write:

array = zeros(UInt64, 100)
array[1] = 0x0807060504030201
# pointer(array) points at the data buffer itself; pointer_from_objref
# would return the object header instead, which is not what we want
cPtr = convert(Ptr{UInt8}, pointer(array))
firstByte = unsafe_load(cPtr, 1)  # reads 0x01 on a little-endian machine

To read the data through the pointer, use unsafe_load(cPtr, i) with a 1-based byte index. Since this bypasses the garbage collector's tracking, keep the array rooted (for example with GC.@preserve) while the raw pointer is in use.

Set the number of threads

The number of threads used by Julia is decided at startup. To change the value:
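A sketch, assuming a Unix shell: the thread count can be set either through the JULIA_NUM_THREADS environment variable or the --threads (-t) command-line flag, which is available since Julia 1.5.

```shell
# launch Julia with 6 threads via the command-line flag (Julia 1.5+)
julia --threads 6

# or via the environment variable, which is read at startup
JULIA_NUM_THREADS=6 julia
```

Inside a session, Threads.nthreads() reports the value actually in effect.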


Generating the testing dataset

The following Python script is used to generate the binary test dataset.

import numpy as np

numEntries = 10000000
entrySize = 32

# randint's upper bound is exclusive; 256 covers the full byte range 0-255
array = np.random.randint(0, 256, size=numEntries * entrySize, dtype=np.uint8)
with open('random_data.bin', 'wb') as outfile:
    outfile.write(array.tobytes())
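For a small sample, the exact distinct count can still be computed with the naive in-memory set approach, which is handy for sanity-checking the streaming estimate. A sketch, using the same record layout as the generator above (the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
entrySize = 32
# a small sample of records in the same layout as the generated file
records = rng.integers(0, 256, size=(1000, entrySize), dtype=np.uint8)

# exact distinct count via the in-memory set that Flajolet-Martin avoids
exact = len({rec.tobytes() for rec in records})
print(exact)
```

With 32-byte random records, collisions are essentially impossible at this scale, so the exact count should equal the sample size.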

Julia Notebook

The test shows that this version processes the dataset of 10 million samples in around 30 seconds with 6 threads. Compared to my C++ implementation, the Julia version is roughly 6-8 times slower. But Julia is much faster to write than C++, even for a Julia beginner like me. I am still digging into Julia's conventions to make sure I am using the idioms that make my code faster.