A Simplified O(n) In-Place Stable Partitioning Algorithm (with Full Implementation)

Heads up: I wrote this in Chinese first and used a translator to get it into English. I tried to make it as smooth as possible, but please forgive any weird phrasing.

This is a pretty complex academic algorithm, and figuring it out was quite a winding journey. I simply couldn't crack the original paper, so I ended up filling in some of the details myself — which, in a way, also helped simplify the original work.

Primary references: Stable Minimum Space Partitioning in Linear Time (opens new window)

# 1. Problem Definition

Given an array a and a predicate pred, arrange all elements for which the predicate is true before those for which it is false, with the following constraints: $O(n)$ time complexity, in-place ( $O(1)$ extra space complexity), and stable (the relative order among true elements is preserved, and likewise for false elements).

This problem is essentially equivalent to $O(n)$ in-place stable 0-1 sorting (where element keys are only 0 and 1). I will present it as 0-1 sorting going forward.

The algorithm adopts the commonly used Word RAM model. The storage unit in the Word RAM model is a Word, which can hold w bits, just enough to represent the address of each memory cell. A single Word can store a pointer, but little else. We will exploit this property in the algorithm.

A note on terminology: in the code, proj maps an element to 0/1 for 0-1 sorting, whereas pred appearing in stable_collect_first_n is a predicate that collects elements for which the predicate evaluates to true. Keep this distinction in mind.

# 2. Preliminaries

# 2.1. Range Rotation

Rotate two adjacent ranges [A B] in-place into [B A] while preserving the internal order of each range. The classic approach is Triple Reversal (or the Juggling Algorithm), which I won't elaborate on here.

In code, I simply call the standard library's std::rotate, with complexity $O(n)$ .

# 2.2. 0-1 Merging

0-1 merging refers to merging two already sorted 0-1 arrays. The method is very simple: use range rotation to swap the 1s on the left with the 0s on the right.

Complexity $O(n)$ .

template <typename T, typename Proj>
void inplace_01_merge(T* first, T* last, Proj proj) {
    T* split_left = std::find_if(first, last, proj);
    T* mid = std::find_if(split_left, last, [proj](T x) { return proj(x) == 0; });
    T* split_right = std::find_if(mid, last, proj);
    std::rotate(split_left, mid, split_right);
    assert_or_throw(std::is_sorted(first, last, [proj](T a, T b) { return proj(a) < proj(b); }));
}

# 3. O(n) In-Place Stable Partitioning

# 3.1. Extracting the Buffer

We need $2\sqrt n$ elements as a buffer, half of them 0 and half 1. The 0s and 1s are paired one-to-one, and they can provide two states (01 and 10 in order), i.e., 1 bit of information. In total there are $\sqrt n$ bits, which will be used in subsequent steps.

The extraction process resembles an inverted rotate-merge algorithm. First, collect 0s: find the 1st 0, rotate it in front of the 2nd 0, then rotate the 1st and 2nd 0s together in front of the 3rd 0. Repeat this until $\sqrt n$ 0s are collected, and then rotate these 0s to the beginning of the array.

The same for 1s.

For the $\sqrt n$ collected elements, they are rotated at most $O(\sqrt n)$ times, yielding an overall complexity $O(n)$ . The remaining $O(n)$ elements are each rotated at most once, also $O(n)$ .

template <typename T, typename Pred>
std::tuple<T*, T*, T*> stable_collect_first_n(T* first, T* last, int64_t n, Pred pred) {
    T* collect = first;
    int64_t count = 0;
    for (T* iter = first; iter < last; iter++) {
        if (count < n && pred(*iter)) {
            std::rotate(collect, collect + count, iter);
            collect = iter - count;
            count++;
        }
    }
    std::rotate(first, collect, collect + count);
    return {first, first + count, last};
}

If the buffer does not have enough 0s or 1s, then after collection a single rotation is enough to sort the entire array, and the algorithm can terminate early.

# 3.2. Counting Sort and Cycle Permutation

Both algorithms are classic.

Counting sort: suppose we want to sort n numbers with a value range of $[0, m - 1]$ . First allocate a count array of length m, iterate through the n numbers and count the frequency of each value. Then compute an exclusive scan (a prefix sum where the first element is 0) or inclusive scan on the count array.

Iterate through the n numbers again. For each number, take its current value in the count array as its destination index, and increment that count by 1.

At this point, without the in-place requirement, allocate a buffer to temporarily store the sorted n numbers, then move them back to the original array. Code:

void counting_sort(std::vector<int>& arr) {
    std::vector<int> count(*std::max_element(arr.begin(), arr.end()) + 1, 0);
    for (int64_t i = 0; i < int64_t(arr.size()); i++) {
        count[arr[i]]++;
    }
    int sum = 0;
    for (int64_t i = 0; i < int64_t(count.size()); i++) {
        int tmp = count[i];
        count[i] = sum;
        sum += tmp;
    }
    std::vector<int> buffer(arr.size());
    for (int64_t i = 0; i < int64_t(arr.size()); i++) {
        buffer[count[arr[i]]] = arr[i];
        count[arr[i]]++;
    }
    arr = buffer;
}

If counting sort must be in-place, we must first store the n destination indices (where to store them will be discussed later) and then perform cycle permutation.

What is cycle permutation? Anyone who has studied permutations or graph theory knows that if you view the n positions as nodes and draw a directed edge from each initial position to its destination index, you get several disjoint cycles.

The algorithm processes each cycle in turn, traversing around the cycle and moving each element to its correct position via swap operations. Whenever a position is correctly placed, its entry in the destination index array is updated to its current position, marking it as handled and preventing redundant swaps.

void cycle_permutation(std::vector<int>& arr, std::vector<int>& dests) {
    for (int64_t i = 0; i < int64_t(arr.size()); i++) {
        for (int64_t j = dests[i]; j != i;) {
            std::swap(arr[i], arr[j]);
            int64_t next = dests[j];
            dests[j] = j;
            j = next;
        }
        dests[i] = i;
    }
}

Since 0-1 sorting has only two values 0 and 1, the count array has only two entries, making it especially simple. Hence complexity $O(n)$ , without worrying about the value range.

# 3.3. Word-Based Storage

Counting sort needs a place to store the destination indices; we can store them in a Word. A Word can only hold a single index of the entire array, but for sub-arrays of length $\frac{\log n}{\log \log n}$ it's different. A Word has $\log n$ bits, and an index in the sub-array requires at most $\log \log n$ bits. Dividing gives $\frac{\log n}{\log \log n}$ indices, exactly enough to hold the destination indices.

Here is the storage implementation; get/set complexity is $O(1)$ .

struct WordStorage {
    int64_t pos_bits;
    uint64_t word = 0;

    WordStorage(int64_t pos_bits) : pos_bits(pos_bits) {}

    uint64_t get(int64_t index) const { return (word >> (index * pos_bits)) & ((uint64_t{1} << pos_bits) - 1); }

    void set(int64_t index, uint64_t value) {
        auto slot_mask = ((uint64_t{1} << pos_bits) - 1) << (index * pos_bits);
        word = (word & ~slot_mask) | (static_cast<uint64_t>(value) << (index * pos_bits));
    }

    void reset() { word = 0; }
};

# 3.4. Buffer-Based Storage

The buffer works similarly. For sub-arrays of size $\frac{\sqrt n}{\log n}$ , the buffer can store $\sqrt n$ bits, while an index in the sub-array requires at most $\log n$ bits. Dividing gives $\frac{\sqrt n}{\log n}$ indices, exactly enough to hold the destination indices.

Here is the storage implementation; get/set complexity is $O(\log B)$ , where B is the sub-array size.

template <typename T, typename Proj>
struct BufferStorage {
    T* buf0;
    T* buf1;
    int64_t buffer_len;
    int64_t pos_bits;
    Proj proj;

    BufferStorage(T* buf0, T* buf1, int64_t buffer_len, int64_t pos_bits, Proj proj)
        : buf0(buf0), buf1(buf1), buffer_len(buffer_len), pos_bits(pos_bits), proj(proj) {}

    uint64_t get(int64_t index) const {
        uint64_t res = 0;
        for (int64_t i = 0; i < pos_bits; i++) {
            res |= uint64_t(proj(buf0[index * pos_bits + i])) << i;
        }
        return res;
    }

    void set(int64_t index, uint64_t value) {
        for (int64_t i = 0; i < pos_bits; i++) {
            if (uint64_t(proj(buf0[index * pos_bits + i])) != ((value >> i) & 1)) {
                std::swap(buf0[index * pos_bits + i], buf1[index * pos_bits + i]);
            }
        }
    }

    void reset() { reset_buffer(buf0, buf1, buffer_len, proj); }
};

# 3.5. Sorting Among Blocks

Based on the counting sort, cycle permutation, and storage above, we can sort multiple blocks (requiring each block to be pure 0 or pure 1). The code below integrates these pieces.

Complexity $O(B(S+T))$ , where B is the number of blocks, S is the block size, and T is the get/set complexity of the storage. For word-based storage, T is $O(1)$ , so there's nothing to worry about. For buffer-based storage, T is $O(\log B)$ , approximately $O(\log n)$ , so buffer-based block sorting has a threshold $S \ge T = O(\log n)$ to ensure the total does not exceed $O(n)$ . (This conclusion is not easy to explain clearly; just keep it in mind for now.)

template <typename T, typename Proj, typename Storage>
void sort_blocks_impl(T* first, T* last, int64_t block_size, Proj proj, Storage& storage) {
    int64_t n_blocks = (last - first) / block_size;
    if (n_blocks <= 1) {
        return;
    }
    int64_t n_zeros = block_count_if(first, last, block_size, [&proj](T x) { return proj(x) == 0; });
    std::array<int64_t, 2> pointers = {0, n_zeros};
    for (int64_t i = 0; i < n_blocks; i++) {
        int64_t key = proj(first[i * block_size]);
        storage.set(i, pointers[key]);
        pointers[key]++;
    }
    for (int64_t i = 0; i < n_blocks; i++) {
        for (int64_t j = storage.get(i); j != i;) {
            std::swap_ranges(first + i * block_size, first + (i + 1) * block_size, first + j * block_size);
            int next = storage.get(j);
            storage.set(j, j);
            j = next;
        }
        storage.set(i, i);
    }
    storage.reset();
}

# 3.6. Block Homogenization

For k blocks, if the elements within each block are already sorted, you can quickly turn the first k - 1 blocks into homogeneous blocks (blocks that are purely 0 or purely 1).

Simply iterate through adjacent block pairs from left to right. If the total number of 0s across the two blocks is at least the block size, use rotation to fill the left block with 0s while keeping the right block sorted. Otherwise, fill the left block with 1s while keeping the right block sorted. Specific details are shown in the code.

Each element is rotated a constant number of times, so the complexity is $O(n)$ .

template <typename T, typename Proj>
void homogenize_blocks(T* first, T* last, int64_t block_size, Proj proj) {
    assert_or_throw((last - first) % block_size == 0);
    int64_t n_blocks = (last - first) / block_size;
    for (int64_t i = 0; i + 2 <= n_blocks; i++) {
        T* left = first + i * block_size;
        T* mid = left + block_size;
        T* right = mid + block_size;
        assert_or_throw(std::is_sorted(left, mid, [proj](T a, T b) { return proj(a) < proj(b); }));
        assert_or_throw(std::is_sorted(mid, right, [proj](T a, T b) { return proj(a) < proj(b); }));
        T* split_left = std::find_if(left, mid, proj);
        T* split_right = std::find_if(mid, right, proj);
        int64_t n_zeros = (split_left - left) + (split_right - mid);
        if (n_zeros >= block_size) {
            // 00000111 00000011 -> 00000[111 | 000000]11 -> 00000000 00011111
            std::rotate(split_left, mid, split_right);
        } else {
            // 00011111 00111111 -> [000 | 11111] 00111111 -> 11111000 00111111
            std::rotate(left, split_left, mid);
            split_left = mid - split_left + left;
            // 11111000 00111111 -> 11111[000 00 | 111]111 -> 11111111 00000111
            std::rotate(split_left, split_right, split_right + (mid - split_left));
        }
    }
}

# 3.7. Block Growth

Combining block homogenization and block sorting, k sorted blocks of size b can be grown into a single sorted block of size $kb$ .

First, homogenize the first k - 1 blocks into pure 0 or pure 1 blocks (homogeneous blocks). Then use counting sort to sort the homogeneous blocks. Finally, use 0-1 merge to handle the last block.

Since there are two kinds of counting sort, there are also two block growth strategies. The word-based block growth factor is $k \approx \frac{\log n}{\log \log n}$ , and the buffer-based block growth factor is $k \approx \frac{\sqrt n}{\log n}$ , but the initial block size must be at least $\log n$ .

In implementation, the growth factor can be computed with the following code:

int64_t len = last - first;
int64_t max_word_bits = ceil_log2(len);
int64_t buffer_len = std::floor(std::sqrt(len));
int64_t max_blocks_for_word = 1;
while ((max_blocks_for_word + 1) * ceil_log2(max_blocks_for_word + 1) <= max_word_bits) {
    max_blocks_for_word++;
}
int64_t max_blocks_for_buffer = 1;
while ((max_blocks_for_buffer + 1) * ceil_log2(max_blocks_for_buffer + 1) <= buffer_len) {
    max_blocks_for_buffer++;
}

Thus the strategy is: first use the threshold-free word-based block growth twice, bringing the block size to $\frac{\log^2 n}{(\log\log n)^2}$ , which meets the buffer-based block growth threshold of $\log n$ . Then use buffer-based block growth three times, reaching $\frac{n^{1.5}}{\log n\cdot (\log\log n)^2}$ . For sufficiently large data, this block size can be proved to be at least n, meaning the entire array has been sorted. Small-scale cases can be verified by brute-force testing.

I initially tried multi-level blocking, but it was too difficult to implement. Since $\log n$ and $\sqrt n$ are not integers, rounding up or down easily causes one layer to have too many blocks, exceeding the Word or buffer capacity.

The array length is not necessarily a multiple of the block size. Handling the unaligned portion is rather tricky work.

Thanks to the block growth design, it suffices to isolate the unaligned portion, perform block homogenization, block sorting, and final block merging, then use 0-1 merge to incorporate the unaligned portion back into the array. (It looks simple, but I actually tried many ideas before arriving at this.)

// block_size is the small block size, merge_size is the large block size
int64_t merge_size = block_size * k;
for (int64_t start = 0; start < len; start += merge_size) {
    int64_t end = std::min(start + merge_size, len);
    int64_t end_l2_aligned = end / block_size * block_size;
    int64_t mid = std::max(start, end_l2_aligned - block_size);
    homogenize_blocks(first + start, first + end_l2_aligned, block_size, proj);
    // call sort_blocks (word / buffer version)
    inplace_01_merge(first + start, first + end_l2_aligned, proj);
    inplace_01_merge(first + start, first + end, proj);
}
block_size = merge_size;

# 3.8. Restoring the Buffer

The buffer is also a sorted range, and can be returned using 0-1 merge.

inplace_01_merge(buf0, last, proj);

# 4. Complete Code

Complete code including tests:

https://gist.github.com/axiomofchoice-hjt/3000e1705d29f2ad92551d6c2e085a52 (opens new window)

# 5. A Point of Contention

Packing multiple numbers into a single Word (Word packing "cheating") is controversial in the in-place algorithms community. Opponents argue that bitwise operations should be banned to prevent this trick.

So can it be achieved without bitwise operations? Surprisingly, yes. There is a paper titled Radix Sorting With No Extra Space. This is a very significant paper that can achieve in-place radix sort in $O(n)$ . I look forward to exploring it in a future post.

# 6. What the Original Paper Says

The original paper is written very tersely. It divides the array into blocks of size $\log n$ , sorts within blocks using Word-stored counters and markers, then performs block homogenization. Block sorting relies on another much more complex paper, which can achieve $O(n\log n)$ complexity, $O(n)$ moves, in-place stable sorting, requiring that element keys be enumerable.

However, the in-block sorting is explained too vaguely to be understood, so I had to replace that step with 2 rounds of block growth. The block sorting is even more complex, so I had to replace that step with 3 rounds of block growth.

Although this approach is not very elegant, it can be fully implemented in just over 200 lines of code, and it is an idea worth recording.