I have N points in D dimensions, where let's say N is 1 million and D is 100. All my points have binary coordinates, i.e. they lie in {0, 1}^D, and I am only interested in speed.
I wrote a simple program to populate and contiguously access a data structure with binary data:
std::vector<int>
std::vector<char>
std::vector<bool>
std::bitset
I measured times with std::chrono::high_resolution_clock. I used the -O3 optimization flag, N = 1 million and D = 100.
This is the code for vectors:
#include <vector>
#include <iostream>
#include <random>
#include <cmath>
#include <numeric>
#include <functional> //plus, equal_to, not2
#include <ctime>
#include <ratio>
#include <chrono>
#define T int
// Hamming distance between s1 and the equal-length range starting at s2
unsigned int hd(const std::vector<T>& s1, std::vector<T>::const_iterator s2)
{
return std::inner_product(
s1.begin(), s1.end(), s2,
0, std::plus<unsigned int>(),
std::not2(std::equal_to<std::vector<T>::value_type>())
);
}
std::uniform_int_distribution<int> uni_bit_distribution(0, 1);
std::default_random_engine generator(std::chrono::system_clock::now().time_since_epoch().count());
// g++ -Wall -O3 bitint.cpp -o bitint
int main()
{
const int N = 1000000;
const int D = 100;
// N is large: keep the counters on the heap (two arrays of N unsigned ints
// would take ~8 MB of stack and likely overflow it)
std::vector<unsigned int> hamming_dist(N, 0);
std::vector<unsigned int> ham_d(N, 0);
std::vector<T> q;
for(int i = 0; i < D; ++i)
q.push_back(uni_bit_distribution(generator));
using namespace std::chrono;
high_resolution_clock::time_point t1 = high_resolution_clock::now();
std::vector<T> v;
v.resize(N * D);
for(int i = 0; i < N; ++i)
for(int j = 0; j < D; ++j)
v[j + i * D] = uni_bit_distribution(generator);
high_resolution_clock::time_point t2 = high_resolution_clock::now();
duration<double> time_span = duration_cast<duration<double> >(t2 - t1);
std::cout << "Build " << time_span.count() << " seconds.\n";
t1 = high_resolution_clock::now();
for(int i = 0; i < N; ++i)
for(int j = 0; j < D; ++j)
hamming_dist[i] += (v[j + i * D] != q[j]);
t2 = high_resolution_clock::now();
time_span = duration_cast<duration<double> >(t2 - t1);
std::cout << "No function hamming distance " << time_span.count() << " seconds.\n";
t1 = high_resolution_clock::now();
for(int i = 0; i < N; ++i)
ham_d[i] = hd(q, v.begin() + (i * D));
t2 = high_resolution_clock::now();
time_span = duration_cast<duration<double> >(t2 - t1);
std::cout << "Yes function hamming distance " << time_span.count() << " seconds.\n";
return 0;
}
The code for std::bitset
can be found in: XOR bitset when 2D bitset is stored as 1D
For std::vector<int>
I got:
Build 3.80404 seconds.
No function hamming distance 0.0322335 seconds.
Yes function hamming distance 0.0352869 seconds.
For std::vector<char>
I got:
Build 8.2e-07 seconds.
No function hamming distance 8.4e-08 seconds.
Yes function hamming distance 2.01e-07 seconds.
For std::vector<bool>
I got:
Build 4.34496 seconds.
No function hamming distance 0.162005 seconds.
Yes function hamming distance 0.258315 seconds.
For std::bitset
I got:
Build 4.28947 seconds.
Hamming distance 0.00385685 seconds.
std::vector<char>
seems to be the winner.
Locality of reference will likely be the driving force, so it's fairly obvious that you should represent the D coordinates of a single point as a contiguous bit vector. std::bitset<D>
would be a logical choice.
However, the next important thing to realize is that locality benefits come easily for working sets up to about 4 KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group points into blocks of roughly 4 KB each and compare those blocks against each other. Both approaches are O(N*N)
, but the second will be much faster.
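A minimal sketch of that tiling idea, assuming points are stored as std::bitset<D> rows; the tile size of 256 is an assumed value (256 rows of bitset<100>, 16 bytes each, is about 4 KB), and the function names are illustrative:

```cpp
#include <algorithm> // std::min
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;     // bits per point, from the question
constexpr std::size_t TILE = 256;  // assumed: ~4 KB worth of bitset<100> rows

// Compare every point of one tile against every point of another;
// both tiles stay cache-resident while the inner loops run.
void compare_tiles(const std::vector<std::bitset<D>>& pts,
                   std::size_t i0, std::size_t i1,
                   std::size_t j0, std::size_t j1,
                   std::vector<unsigned>& out)
{
    const std::size_t n = pts.size();
    for (std::size_t i = i0; i < i1; ++i)
        for (std::size_t j = j0; j < j1; ++j)
            out[i * n + j] = (pts[i] ^ pts[j]).count(); // Hamming distance
}

// All-pairs distances, tile by tile: still O(N*N) work, better locality.
void all_pairs(const std::vector<std::bitset<D>>& pts, std::vector<unsigned>& out)
{
    const std::size_t n = pts.size();
    for (std::size_t i = 0; i < n; i += TILE)
        for (std::size_t j = 0; j < n; j += TILE)
            compare_tiles(pts, i, std::min(i + TILE, n),
                          j, std::min(j + TILE, n), out);
}
```

The total work is unchanged; only the traversal order differs, so each tile of rows is reused many times while it is still in cache.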
You may be able to beat O(N*N)
by using the triangle inequality: Hamming(a,b) + Hamming(b,c) >= Hamming(a,c)
. I'm not sure exactly how; it probably depends on what output you want. The naive output, an N*N table of distances, is unavoidably O(N*N)
.
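One way the triangle inequality can help, if you only need nearest neighbours rather than the full table, is pivot-based pruning: |Hamming(q,b) - Hamming(b,x)| is a lower bound on Hamming(q,x), so points whose bound already exceeds the best distance found can be skipped. This is a hedged sketch, not the answer's method; the pivot choice and all names are illustrative:

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t D = 100;
using Point = std::bitset<D>;

// Nearest neighbour of q among pts, pruning with the triangle inequality.
// pivot_dist[i] must hold the precomputed value (pivot ^ pts[i]).count().
std::size_t nearest(const Point& q, const std::vector<Point>& pts,
                    const Point& pivot, const std::vector<unsigned>& pivot_dist)
{
    const unsigned dq = (q ^ pivot).count();
    unsigned best = D + 1;
    std::size_t best_i = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        // Lower bound on Hamming(q, pts[i]); if it cannot beat best, skip.
        const unsigned lower =
            dq > pivot_dist[i] ? dq - pivot_dist[i] : pivot_dist[i] - dq;
        if (lower >= best) continue;
        const unsigned d = (q ^ pts[i]).count();
        if (d < best) { best = d; best_i = i; }
    }
    return best_i;
}
```

How much this prunes depends on the distance distribution; with random uniform bits most pairwise distances cluster around D/2, so a single pivot may skip little.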
If the coordinates are independently and uniformly distributed, and you want to find the Hamming distance between two randomly chosen points, the most efficient layout is a packed array of bits.
This packed array would ideally be chunked into the largest block size over which your popcnt
instruction works: 64 bits. The Hamming distance is the sum of popcnt(x_blocks[i] ^ y_blocks[i])
. On processors with efficient unaligned accesses, byte alignment with unaligned reads is likely to be most efficient. On processors where unaligned reads incur a penalty, one should consider whether the memory overhead of aligned rows is worth faster logic.