VHDL Place and route path analysis

别跟我提以往 2021-01-28 14:31

My problem is that when I implement my design using Xilinx ISE 14.7 + XPS, I often obtain a very different number of analyzed paths in the static timing analysis, also having ver

1 answer
  • 2021-01-28 15:07

    The coding style of the multiplexer at this line

    data <= data_in((idx+1)*B-1 downto idx*B);
    

    can heavily influence the logic synthesis. This results in a very different number of paths to analyze for timing.

    The original multiplexer

    I first checked the synthesis of the above line using this small example:

    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL;
    
    entity mux1 is
        generic (
            B : positive := 32;
            M : positive := 7); -- M := ceil(log_2 N)
        port (
            d : in  STD_LOGIC_VECTOR ((2**M)*B-1 downto 0); -- input data
            s : in  STD_LOGIC_VECTOR (M-1 downto 0);        -- selector
            y : out  STD_LOGIC_VECTOR(B-1 downto 0));       -- result
    end mux1;
    
    architecture Behavioral of mux1 is
        constant N : positive := 2**M;
        signal idx : integer range 0 to N-1;
    begin
        idx <= to_integer(unsigned(s));
        y <= d((idx+1)*B-1 downto idx*B);
    end Behavioral;
    

    If one synthesizes this for a Spartan-6, XST reports this (excerpt):

    Macro Statistics
    # Adders/Subtractors                                   : 2
     13-bit subtractor                                     : 1
     8-bit adder                                           : 1
    ...
     Number of Slice LUTs:                 1516  out of   5720    26%  
    ...
    Timing constraint: Default path analysis
      Total number of paths / destination ports: 139264 / 32
    

    Thus, no multiplexer was detected. XST reports only an adder and a subtractor, presumably for the index arithmetic, and the timing analyzer has to analyze a huge number of paths (139264). The logic utilization is acceptable.

    Optimized implementation

    The same multiplexing can be achieved with: (EDIT: bugfix and simplification)

    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL;
    
    entity mux2 is
        generic (
            B : positive := 32;
            M : positive := 7); -- M := ceil(log_2 N)
        port (
            d : in  STD_LOGIC_VECTOR ((2**M)*B-1 downto 0);
            s : in  STD_LOGIC_VECTOR (M-1 downto 0);
            y : out  STD_LOGIC_VECTOR(B-1 downto 0));
    end mux2;
    
    -- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    -- !! The entire architecture has been FIXED and simplified. !!
    -- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    architecture Behavioral of mux2 is
        constant N : positive := 2**M;
        type matrix is array (N-1 downto 0) of std_logic_vector(B-1 downto 0);
        signal dd : matrix;
    begin
        -- reinterpret 1D vector 'd' as 2D matrix, i.e.
        -- row 0 holds d(B-1 downto 0) which is selected in case s = 0
        row_loop: for row in 0 to N-1 generate
            dd(row) <= d((row+1)*B-1 downto row*B);
        end generate;
    
        -- select the requested row
        y <= dd(to_integer(unsigned(s)));
    end Behavioral;
    

    Now, the XST report looks much better:

    Macro Statistics
    # Multiplexers                                         : 1
     32-bit 128-to-1 multiplexer                           : 1
    ...
     Number of Slice LUTs:                 1344  out of   5720    23%  
    ...
    Timing constraint: Default path analysis
      Total number of paths / destination ports: 6816 / 32
    

    XST now detects that a 128-to-1 multiplexer is required for each output bit. Optimized synthesis of such a wide multiplexer is built into the synthesis tool. The number of LUTs is reduced only slightly (1516 to 1344), but the number of paths to be processed by the timing analyzer drops dramatically, by a factor of about 20 (139264 to 6816).
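
    To check that both descriptions really select the same row, a small self-checking testbench can be simulated. The following is only a sketch: the entity name mux_tb and the reduced sizes B = 4 and M = 3 are assumptions chosen to keep the simulation short, and mux1 and mux2 are assumed to be compiled into library work.

    ```vhdl
    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL;

    entity mux_tb is
    end mux_tb;

    architecture sim of mux_tb is
        -- reduced sizes (assumption of this sketch, not the values used above)
        constant B : positive := 4;
        constant M : positive := 3;
        constant N : positive := 2**M;

        signal d      : STD_LOGIC_VECTOR(N*B-1 downto 0);
        signal s      : STD_LOGIC_VECTOR(M-1 downto 0) := (others => '0');
        signal y1, y2 : STD_LOGIC_VECTOR(B-1 downto 0);
    begin
        dut1: entity work.mux1
            generic map (B => B, M => M)
            port map (d => d, s => s, y => y1);

        dut2: entity work.mux2
            generic map (B => B, M => M)
            port map (d => d, s => s, y => y2);

        stimulus: process
        begin
            -- row r of d holds the value r, so the expected output equals the selector
            for r in 0 to N-1 loop
                d((r+1)*B-1 downto r*B) <= std_logic_vector(to_unsigned(r, B));
            end loop;

            -- apply every selector value and compare both multiplexers
            for sel in 0 to N-1 loop
                s <= std_logic_vector(to_unsigned(sel, M));
                wait for 10 ns;
                assert y1 = std_logic_vector(to_unsigned(sel, B))
                    report "mux1 mismatch at s = " & integer'image(sel)
                    severity error;
                assert y2 = y1
                    report "mux2 differs from mux1 at s = " & integer'image(sel)
                    severity error;
            end loop;

            report "mux_tb finished" severity note;
            wait;
        end process;
    end sim;
    ```

    This only checks functional equivalence in simulation. It says nothing about the synthesis result, which still has to be read from the XST report.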

    Implementation using one-hot selector

    The above examples use a binary-encoded selector signal. I also checked a variant with a one-hot encoded selector:

    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL;
    
    entity mux3 is
        generic (
            B : positive := 32;
            N : positive := 128);
        port ( d : in  STD_LOGIC_VECTOR (N*B-1 downto 0);
               s : in  STD_LOGIC_VECTOR (N-1 downto 0);
               y : out  STD_LOGIC_VECTOR(B-1 downto 0));
    end mux3;
    
    architecture Behavioral of mux3 is
    
    begin
        process(d, s)
        begin
            y <= (others => '0'); -- avoid latch!
            for i in 0 to N-1 loop
                if s(i) = '1' then
                    y <= d((i+1)*B-1 downto i*B);
                end if;
            end loop;
        end process;
    
    end Behavioral;
    

    Now, the XST report is different again:

    Macro Statistics
    # Multiplexers                                         : 128
     32-bit 2-to-1 multiplexer                             : 128
    ...
     Number of Slice LUTs:                 2070  out of   5720    36%  
    ...
    Timing constraint: Default path analysis
      Total number of paths / destination ports: 13376 / 32
    

    2-to-1 multiplexers are detected because the loop describes a priority multiplexer analogous to this scheme:

    if s(127) = '1' then
      y <= d(128*B-1 downto 127*B);
    else
      if s(126) = '1' then
        y <= d(127*B-1 downto 126*B);
      else
        ...
                                 if s(0) = '1' then
                                   y <= d(B-1 downto 0);
                                 else
                                   y <= (others => '0');
                                 end if;
      end if; -- s(126)
    end if; -- s(127)
    

    I have not used elsif here for didactic reasons. Each if-else stage is a 32-bit wide 2-to-1 multiplexer. The problem here is that the synthesis tool does not know that s is a one-hot encoded signal. Thus, somewhat more logic is required than in my optimized implementation (2070 vs. 1344 LUTs).

    The number of paths to analyze for timing changes significantly again: it is about 10 times lower than in the original implementation (13376 vs. 139264), but about 2 times higher than in my optimized one (6816).
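
    If s is guaranteed to be one-hot, the priority can in principle be dropped by OR-ing together the rows that are gated by their select bits. The following is only a sketch of that idea: the entity name mux3_or is made up for illustration, and it was not synthesized for this answer, so no LUT or path counts are claimed for it.

    ```vhdl
    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;

    entity mux3_or is
        generic (
            B : positive := 32;
            N : positive := 128);
        port ( d : in  STD_LOGIC_VECTOR (N*B-1 downto 0);
               s : in  STD_LOGIC_VECTOR (N-1 downto 0);   -- assumed one-hot
               y : out STD_LOGIC_VECTOR (B-1 downto 0));
    end mux3_or;

    architecture Behavioral of mux3_or is
    begin
        process(d, s)
            variable acc : STD_LOGIC_VECTOR(B-1 downto 0);
        begin
            acc := (others => '0');
            for i in 0 to N-1 loop
                -- OR row i into the result when its select bit is set;
                -- the order of the iterations does not affect the result
                if s(i) = '1' then
                    acc := acc or d((i+1)*B-1 downto i*B);
                end if;
            end loop;
            y <= acc;
        end process;
    end Behavioral;
    ```

    Unlike mux3, this variant does not pick the highest set index when several bits of s are set. It outputs the bitwise OR of all selected rows, so it matches mux3 only under the one-hot assumption.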
