My problem is that when I implement my design using Xilinx ISE 14.7 + XPS, I often obtain a very different number of analyzed paths in the static timing analysis, also having ver
The coding style of the multiplexer at this line
data <= data_in((idx+1)*B-1 downto idx*B);
can heavily influence the logic synthesis. This results in a very different number of paths to analyze for timing.
I first checked the synthesis of the above line using this small example:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux1 is
  generic (
    B : positive := 32;
    M : positive := 7); -- M := ceil(log_2 N)
  port (
    d : in  STD_LOGIC_VECTOR ((2**M)*B-1 downto 0); -- input data
    s : in  STD_LOGIC_VECTOR (M-1 downto 0);        -- selector
    y : out STD_LOGIC_VECTOR (B-1 downto 0));       -- result
end mux1;

architecture Behavioral of mux1 is
  constant N : positive := 2**M;
  signal idx : integer range 0 to N-1;
begin
  idx <= to_integer(unsigned(s));
  y <= d((idx+1)*B-1 downto idx*B);
end Behavioral;
If one synthesizes this for a Spartan-6, XST reports this (excerpt):
Macro Statistics
# Adders/Subtractors : 2
13-bit subtractor : 1
8-bit adder : 1
...
Number of Slice LUTs: 1516 out of 5720 26%
...
Timing constraint: Default path analysis
Total number of paths / destination ports: 139264 / 32
Thus, no multiplexer was detected and the timing analyzer has to analyze a huge number of paths. The logic utilization is ok.
The same multiplexing can be achieved with: (EDIT: bugfix and simplification)
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux2 is
  generic (
    B : positive := 32;
    M : positive := 7); -- M := ceil(log_2 N)
  port (
    d : in  STD_LOGIC_VECTOR ((2**M)*B-1 downto 0);
    s : in  STD_LOGIC_VECTOR (M-1 downto 0);
    y : out STD_LOGIC_VECTOR (B-1 downto 0));
end mux2;

-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- !! The entire architecture has been FIXED and simplified. !!
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
architecture Behavioral of mux2 is
  constant N : positive := 2**M;
  type matrix is array (N-1 downto 0) of std_logic_vector(B-1 downto 0);
  signal dd : matrix;
begin
  -- reinterpret 1D vector 'd' as 2D matrix, i.e.
  -- row 0 holds d(B-1 downto 0) which is selected in case s = 0
  row_loop: for row in 0 to N-1 generate
    dd(row) <= d((row+1)*B-1 downto row*B);
  end generate;

  -- select the requested row
  y <= dd(to_integer(unsigned(s)));
end Behavioral;
Now, the XST report looks much better:
Macro Statistics
# Multiplexers : 1
32-bit 128-to-1 multiplexer : 1
...
Number of Slice LUTs: 1344 out of 5720 23%
...
Timing constraint: Default path analysis
Total number of paths / destination ports: 6816 / 32
XST now detects that each output bit requires a 128-to-1 multiplexer. Optimized synthesis of such wide multiplexers is built into the synthesis tool. The number of LUTs drops only slightly, but the number of paths the timing analyzer has to process drops dramatically, by a factor of about 20 (139264 vs. 6816)!
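Since the architecture of mux2 was rewritten, it may be worth checking in simulation that it still selects the intended slice. The following is only a minimal self-checking testbench sketch: the name mux2_tb, the small generic values (B = 4, M = 2) and the 10 ns step are assumptions chosen for illustration and are not part of the original design. Each row i of d is filled with the value i, so the expected output simply equals the selector value.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux2_tb is  -- hypothetical testbench name
end mux2_tb;

architecture sim of mux2_tb is
  -- small values (assumption) keep the simulation short
  constant B : positive := 4;
  constant M : positive := 2;
  constant N : positive := 2**M;

  signal d : std_logic_vector(N*B-1 downto 0);
  signal s : std_logic_vector(M-1 downto 0) := (others => '0');
  signal y : std_logic_vector(B-1 downto 0);
begin
  -- device under test: the mux2 entity shown above
  dut: entity work.mux2
    generic map (B => B, M => M)
    port map (d => d, s => s, y => y);

  -- row i carries the value i
  fill: for i in 0 to N-1 generate
    d((i+1)*B-1 downto i*B) <= std_logic_vector(to_unsigned(i, B));
  end generate;

  stimulus: process
  begin
    for i in 0 to N-1 loop
      s <= std_logic_vector(to_unsigned(i, M));
      wait for 10 ns;  -- let the combinational output settle
      assert y = std_logic_vector(to_unsigned(i, B))
        report "wrong row selected for s = " & integer'image(i)
        severity error;
    end loop;
    report "mux2_tb finished";
    wait;  -- stop the process
  end process;
end sim;
```

The same testbench should work for mux1 by changing the instantiated entity, because both entities share the same generics and ports.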
The above examples use a binary-encoded selector signal. I also checked a variant with a one-hot encoded selector:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity mux3 is
  generic (
    B : positive := 32;
    N : positive := 128);
  port (
    d : in  STD_LOGIC_VECTOR (N*B-1 downto 0);
    s : in  STD_LOGIC_VECTOR (N-1 downto 0);
    y : out STD_LOGIC_VECTOR (B-1 downto 0));
end mux3;

architecture Behavioral of mux3 is
begin
  process(d, s)
  begin
    y <= (others => '0'); -- avoid latch!
    for i in 0 to N-1 loop
      if s(i) = '1' then
        y <= d((i+1)*B-1 downto i*B);
      end if;
    end loop;
  end process;
end Behavioral;
Now, the XST report is different again:
Macro Statistics
# Multiplexers : 128
32-bit 2-to-1 multiplexer : 128
...
Number of Slice LUTs: 2070 out of 5720 36%
...
Timing constraint: Default path analysis
Total number of paths / destination ports: 13376 / 32
2-to-1 multiplexers are detected because the code describes a priority multiplexer analogous to this scheme:
if s(127) = '1' then
  y <= d(128*B-1 downto 127*B);
else
  if s(126) = '1' then
    y <= d(127*B-1 downto 126*B);
  else
    ...
      if s(0) = '1' then
        y <= d(B-1 downto 0);
      else
        y <= (others => '0');
      end if;
  end if; -- s(126)
end if; -- s(127)
I have not used elsif here for didactic reasons. Each if-else stage is a 32-bit wide 2-to-1 multiplexer. The problem here is that the synthesis tool does not know that s is a one-hot encoded signal. Thus, a little more logic is required than in my optimized implementation.
The number of paths to analyze for timing changes significantly again: it is about 10 times lower than in the original implementation (13376 vs. 139264), but about 2 times higher than in my optimized one (6816).
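If s is guaranteed to be one-hot, one possible way to avoid describing a priority chain is an AND-OR structure, where every row is gated by its select bit and all gated rows are ORed together. The following is only a sketch under that assumption: the entity name mux3_andor is hypothetical, and I have not synthesized this variant, so I cannot say how XST maps it or how many paths it reports.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity mux3_andor is  -- hypothetical name, same interface as mux3
  generic (
    B : positive := 32;
    N : positive := 128);
  port (
    d : in  STD_LOGIC_VECTOR (N*B-1 downto 0);
    s : in  STD_LOGIC_VECTOR (N-1 downto 0);  -- assumed to be one-hot
    y : out STD_LOGIC_VECTOR (B-1 downto 0));
end mux3_andor;

architecture Behavioral of mux3_andor is
begin
  process(d, s)
    variable acc : std_logic_vector(B-1 downto 0);
  begin
    acc := (others => '0');  -- result if no select bit is set
    for i in 0 to N-1 loop
      -- OR is commutative, so no row takes priority over another
      if s(i) = '1' then
        acc := acc or d((i+1)*B-1 downto i*B);
      end if;
    end loop;
    y <= acc;
  end process;
end Behavioral;
```

For a valid one-hot s this produces the same output as mux3, and all zeros if no bit is set. The behavior differs if more than one bit of s is set: mux3 returns the row with the highest index, while this sketch returns the bitwise OR of all selected rows.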