I'm trying to implement in C a data structure that would allow me to efficiently manipulate a **binary** matrix (containing only 1s and 0s). I'll explain what operations I have to apply to this matrix, and would like to know what the best possible data structure to use is.
The operations are done in the field F_2 (which means 1+1 = 0; the other operations remain unchanged). I have one k*n matrix (k < n) called H. At most, k = 2325 and n = 3009.
The operations that I will have to do on this matrix are:
I will partially diagonalize it using only row swaps and row additions. Once that is done, I will no longer use row operations and will perform a lot (!) of column additions on this matrix (by "a lot" I mean about ((n-k)/2)³ column additions).
Data structure I was considering for the matrix:
For the matrix coefficients, I was thinking about storing sequences of multiple bits at once in a single unsigned integer. For instance, I could store the sequence (11001011) as the uint8_t 203 (converting from binary to decimal). Is this a good idea?
If I do so, I have two options:
I could use uint16_t or uint64_t coefficients to split my matrix H into many 4*4 or 8*8 submatrices. Is this a good option (in terms of time efficiency), and if it is, is it better to use uint16_t or uint64_t?
Otherwise, I was thinking about storing every row in multiple uint32_t or uint64_t, then performing my partial diagonalization. Next, I would switch to a structure that encodes the matrix as n column vectors to handle the remaining operations. Do you think this is more efficient?
Whatever method I use, I will have to efficiently access the n-th bit of an unsigned integer (uint16_t, uint32_t or uint64_t). How do I do that?
For best performance, use an array of row pointers for the row swaps and row additions. Use <stdint.h> and one of its fast minimum-width unsigned integer types -- I recommend uint_fast32_t, unless you intend to run this on 16- or 8-bit processors.
When all the row swaps and row additions are done, transpose the array. Although this operation is "slow", the following column operations will be so much faster that they offset the transpose cost.
Consider the following:
#include <stdint.h>
#include <limits.h>
typedef uint_fast32_t word;
#define WORD_BITS (CHAR_BIT * sizeof (word))
typedef struct {
    int   rows;  /* Number of row vectors */
    int   cols;  /* Number of defined bits in each row */
    int   words; /* Number of words per row vector */
    word **row;  /* Array of pointers */
} row_matrix;

typedef struct {
    int   rows;  /* Number of defined bits in each column */
    int   cols;  /* Number of col vectors */
    int   words; /* Number of words per col vector */
    word **col;
} col_matrix;
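For illustration, allocating the row form could look something like this -- a minimal sketch assuming plain malloc()/calloc(); the function name is mine and error handling is abbreviated:
#include <stdlib.h>

/* Allocate a rows-by-cols row_matrix with all bits cleared.
   Minimal sketch; error handling is abbreviated. */
static row_matrix *row_matrix_new(int rows, int cols)
{
    row_matrix *rm = malloc(sizeof *rm);
    rm->rows  = rows;
    rm->cols  = cols;
    rm->words = (cols + (int)WORD_BITS - 1) / (int)WORD_BITS;
    rm->row   = malloc(rows * sizeof *rm->row);
    for (int r = 0; r < rows; r++)
        rm->row[r] = calloc(rm->words, sizeof (word));
    return rm;
}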
Although you could use a single type to describe the two matrix forms, using separate types makes the code and functions easier to maintain. You'll end up having some duplicated code, but that's a minor issue compared to having clear, intuitive types.
On 32-bit systems, uint_fast32_t is typically a 32-bit type. On 64-bit systems, it is typically 64-bit. The WORD_BITS macro expands to the number of bits in a word -- it is NOT always 32!
The easiest way to number the bits is to designate the leftmost bit in the matrix as bit 0, and store the bits in the least significant bits of each word. If you have row_matrix *rm, then the bit at row row, column col is
!!(rm->row[row][col / WORD_BITS] & ((word)1U << (col % WORD_BITS)))
The !! is the not-not operator: if the argument is nonzero, it yields 1, otherwise it yields 0. Because we mask a single bit from the word, the "bit is set" value would otherwise be some power of two (1, 2, 4, 8, 16, 32, 64, and so on).
To set the bit, use
rm->row[row][col / WORD_BITS] |= (word)1U << (col % WORD_BITS);
To clear a bit, you need to binary-AND with a mask that has all bits set except the target bit. This is easy to achieve using the not operator ~:
rm->row[row][col / WORD_BITS] &= ~((word)1U << (col % WORD_BITS));
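If you prefer, these expressions can be wrapped in small static inline helpers -- a sketch, with names of my own choosing:
static inline int row_get(const row_matrix *rm, int row, int col)
{
    return !!(rm->row[row][col / WORD_BITS] & ((word)1U << (col % WORD_BITS)));
}

static inline void row_set(row_matrix *rm, int row, int col)
{
    rm->row[row][col / WORD_BITS] |= (word)1U << (col % WORD_BITS);
}

static inline void row_clear(row_matrix *rm, int row, int col)
{
    rm->row[row][col / WORD_BITS] &= ~((word)1U << (col % WORD_BITS));
}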
The corresponding operations for the col_matrix *cm are
!!(cm->col[col][row / WORD_BITS] & ((word)1U << (row % WORD_BITS)))
cm->col[col][row / WORD_BITS] |= (word)1U << (row % WORD_BITS);
cm->col[col][row / WORD_BITS] &= ~((word)1U << (row % WORD_BITS));
Although the division / and modulus (or remainder) % are generally slow-ish (compared to addition, subtraction, and even multiplication), here WORD_BITS will be a power-of-two compile-time constant on all widely used architectures. All compilers I know of will turn the above into fast bit shifts and binary-AND operations.
To add row srcrow to row dstrow, you simply do a binary exclusive-or on all the words:
{
    const word *const src = rm->row[srcrow];
    word *const dst = rm->row[dstrow];
    int w = rm->words;
    while (w-->0)
        dst[w] ^= src[w];
}
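A row swap with this layout is just a pointer exchange (row1 and row2 being the hypothetical rows to swap):
{
    word *const tmp = rm->row[row1];
    rm->row[row1] = rm->row[row2];
    rm->row[row2] = tmp;
}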
and analogously, for the column matrix,
{
    const word *const src = cm->col[srccol];
    word *const dst = cm->col[dstcol];
    int w = cm->words;
    while (w-->0)
        dst[w] ^= src[w];
}
Note that if you combine more than two rows, you can do so very efficiently; it'll be MUCH faster than doing the additions consecutively. Intel and AMD CPUs are very good at predicting the above pattern, so you can simply use more than one source row/column. Also, the destination does not have to participate in the result, although if I guess correctly which algorithm you're implementing, you probably want it to.
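For example, XORing two source rows into a destination in a single pass might look like this (a minimal sketch; srcrow1, srcrow2 and dstrow are hypothetical row indices):
{
    const word *const src1 = rm->row[srcrow1];
    const word *const src2 = rm->row[srcrow2];
    word *const dst = rm->row[dstrow];
    int w = rm->words;
    /* One pass over the words: dst ^= src1 ^ src2 */
    while (w-->0)
        dst[w] ^= src1[w] ^ src2[w];
}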
If you know the target architectures have SSE2 or better, or even AVX, you can use the emmintrin.h or immintrin.h header files, respectively, for compiler built-in types and operators that allow you to XOR 128 or 256 bits at once, sometimes giving you quite a bit of a boost.
Since the vector types require what the C standards call "excess alignment", you'll then also need to include mm_malloc.h, and use _mm_malloc() and _mm_free() to allocate the row/column vectors for the word data -- and obviously round words up so you can access each row/column as a suitable integer vector type (__m128i for SSE*, __m256i for AVX).
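As a rough, untested sketch (not a drop-in implementation), the SSE2 variant of the row addition might look like this, assuming the row data is 16-byte aligned (e.g. from _mm_malloc()) and words has been rounded up to a whole number of 128-bit vectors:
#include <emmintrin.h>

/* XOR row srcrow into row dstrow, 128 bits at a time (SSE2).
   Assumes 16-byte aligned row data and words rounded up to whole __m128i units. */
static void row_add_sse2(row_matrix *rm, int dstrow, int srcrow)
{
    const __m128i *src = (const __m128i *)rm->row[srcrow];
    __m128i *dst = (__m128i *)rm->row[dstrow];
    int v = (int)((rm->words * sizeof (word)) / sizeof (__m128i));
    while (v-->0)
        dst[v] = _mm_xor_si128(dst[v], src[v]);
}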
Personally, I always implement the unvectorized version first, then some "nasty" test cases for unit testing, and only then see about vectorizing it. This has the benefit that you can give the unvectorized version as a preliminary version for those who will be using it, and you can compare the test case results between the vectorized and unvectorized cases, to see if one or the other has a bug in it.
The transpose operation is quite straightforward, although I do recommend using a triple loop: innermost looping over the bits in a single word. Also, you'll probably want to check which order -- row or column major -- works best for the outer loop; depending on the matrix size, you may see a huge difference. (This is due to cache behaviour: you want the CPU to be able to predict the access pattern, and not have to reload the same cachelines. In the best case, on at-most-few-years-old AMD and Intel x86-64 processors, you can get near cache speeds if both matrices fit in the cache.)
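One possible shape of such a loop -- a rough sketch, not tuned for loop order or speed, and assuming cm has already been allocated and zeroed:
/* Copy the bits of rm (row form) into cm (column form), bit by bit.
   Assumes cm->rows == rm->rows, cm->cols == rm->cols, and cm's vectors
   are already allocated and zeroed. */
static void transpose(const row_matrix *rm, col_matrix *cm)
{
    int row, w, b;
    for (row = 0; row < rm->rows; row++)
        for (w = 0; w < rm->words; w++) {
            const word bits = rm->row[row][w];
            for (b = 0; b < (int)WORD_BITS; b++) {
                const int col = w * (int)WORD_BITS + b;
                if (col >= rm->cols)
                    break;
                if (bits & ((word)1U << b))
                    cm->col[col][row / WORD_BITS] |= (word)1U << (row % WORD_BITS);
            }
        }
}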
All of the above can be implemented in a single header file -- even including the vectorized versions if the target architecture supports SSE2/AVX --, so it should not be too difficult to implement.
Questions?
When you refer to the types uint16_t, uint64_t, ..., I guess this is for the matrix coefficients.
Therefore you should know better than us which values you are manipulating: if you can potentially generate huge numbers then you want a large type to avoid overflow. As for efficiency, I doubt you will feel a difference in terms of speed, but you can save some space by choosing a smaller type.
Anyway, this is all about optimization: you should not bother with strong types. To begin with, using char (or uint8_t) should be just fine because you are only dealing with 1 and 0.
I don't see any interest in switching from char matrix[][] to a typedef struct matrix_columns; I think you can perform your operations by wisely using the row and column indexes.
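For instance, with a plain 2-D array (the sizes K and N below are illustrative, taken from the question), a row addition and a column addition in F_2 are just index manipulation:
#include <stdint.h>

#define K 2325   /* illustrative sizes from the question */
#define N 3009

static uint8_t H[K][N];

/* Add (XOR) row src into row dst */
static void add_row(int dst, int src)
{
    for (int j = 0; j < N; j++)
        H[dst][j] ^= H[src][j];
}

/* Add (XOR) column src into column dst */
static void add_col(int dst, int src)
{
    for (int i = 0; i < K; i++)
        H[i][dst] ^= H[i][src];
}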
Finally, to get the bit at position i in an unsigned int coef:
unsigned int bit = (coef>>i) & 1;
[EDIT] Also, here is a related question about binary matrix operations (multiplication, addition, XOR).
I guess storing every row in multiple uintX_t's is a good idea, but I would choose X to match the word size of your processor. This way you can sum multiple matrix entries at once:
uint8_t a = 3; // 00000011
uint8_t b = 2; // 00000010
b ^= a;        // results in 00000001
               // XOR is just addition in F_2
I think the bottleneck of your problem is the column operations, right? So after you perform the partial diagonalization, why not apply the column additions to its transpose, in the same way you did for the partial diagonalization? You would spend some time making this transposition, since it has to be done bit by bit, but after that the column additions would be much simpler.
Suppose you have a 64x64 matrix:
uint64_t matrix[64]; // This means that matrix[i] is the i-th row
How would you sum (mod 2) the j-th column into the l-th, for j and l between 1 and 64? You would have to iterate over all the rows (assuming higher positions in a matrix row are at less significant bits):
int i;
for (i = 0; i < 64; i++) {
    /* extract the bit in column j and XOR it into column l */
    matrix[i] ^= ((matrix[i] >> (64 - j)) & 1ULL) << (64 - l);
}
But if you had matrix transposed to a matrixT, the above for loop would be equivalent to:
matrixT[l] ^= matrixT[j];
which would be done in a single step by the processor (I guess, but I'm not sure). So if you spend time doing the transposition, after that you would benefit from the processor's ability to do arithmetic operations over data of its word size in one step.
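For completeness, a bit-by-bit transpose of the 64x64 example might look like this (a naive sketch; faster blockwise methods exist):
#include <stdint.h>

/* Naive bit-by-bit transpose of a 64x64 bit matrix.
   Bit k of matrix[i] becomes bit i of matrixT[k]. */
static void transpose64(const uint64_t matrix[64], uint64_t matrixT[64])
{
    for (int k = 0; k < 64; k++) {
        uint64_t col = 0;
        for (int i = 0; i < 64; i++)
            col |= ((matrix[i] >> k) & UINT64_C(1)) << i;
        matrixT[k] = col;
    }
}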
来源:https://stackoverflow.com/questions/25053442/binary-vectors-and-matrix-manipulation-in-c