I have a custom environment where the state is a 2D matrix with 11 rows (equal to the number of users to satisfy) and 3 columns. Each column can take a value of either 0 or 1,