Image Processing Pipelining in VHDL

Question

I am trying to develop a Sobel filter in VHDL. I am using a 640x480 picture that is stored in a BRAM. The algorithm uses a 3x3 matrix of pixels from the image to compute each output pixel. I currently only know how to put an image into a BRAM with one pixel value per address, which means I can read only one pixel per clock. I want to pipeline the data, so ideally I would read three pixel values (one from each of three rows of the picture) per clock. Then, after the initial latency, I could load three new pixel values and produce one output pixel on every clock. I am looking for a way to do this but cannot figure it out.

The only fix I can think of is to store the image in 3 BRAMs, so that I can read values from 3 rows in each clock cycle. However, there is not enough memory to fit even one more RAM large enough for a 640x480 image, let alone three. I could reduce the picture size to make this work, but I really want to keep my current 640x480 image size.

Any help or guidance would be greatly appreciated.

Answer 1:

A simple solution would be to split the image across 4 separate memories, each holding 1/4 of it. The first memory contains every 4th line starting from the first line, the second every 4th line starting from the second line, and so on. I would use 4 memories even though you only need 3 lines, since 4 evenly divides 480 and every other standard resolution. Also, finding a binary number modulo 4 is trivial (it is just the two LSBs), which is needed to order the memory outputs.

You can use the MSBs of the line number to address your RAMs, and the two LSBs to figure out the relative order of the RAM outputs (the code only demonstrates the idea, it is not usable as is...):

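-- Sketch only: line and col are assumed to be unsigned, line is the top
-- line of the 3-line window, and RAM n holds lines n, n+4, n+8, ...
-- Block RAM inference would also need the reads to be registered.
row      <= line(line'left downto 2);  -- upper bits select the row inside each RAM
row_next <= row + 1;                   -- row used once the window crosses a multiple of 4

-- ram0 is read one row ahead when line mod 4 is 2 or 3, ram1 when it is 3.
addr0 <= row_next & col when line(1) = '1' else row & col; -- Or something more efficient on packing
addr1 <= row_next & col when line(1 downto 0) = "11" else row & col;
addr2 <= row & col;
addr3 <= row & col;

data0 <= ram0(to_integer(addr0));
data1 <= ram1(to_integer(addr1));
data2 <= ram2(to_integer(addr2));
data3 <= ram3(to_integer(addr3));

-- Reorder the RAM outputs so that line0 is always the top line of the window.
case line(1 downto 0) is
    when "00" =>
        line0 <= data0;
        line1 <= data1;
        line2 <= data2;
    when "01" =>
        line0 <= data1;
        line1 <= data2;
        line2 <= data3;
    when "10" =>
        line0 <= data2;
        line1 <= data3;
        line2 <= data0;
    when "11" =>
        line0 <= data3;
        line1 <= data0;
        line2 <= data1;
    when others => null;
end case;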
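
To make the idea above more concrete, here is a minimal sketch of the four-bank memory with a write port and registered reads. It is only an illustration under assumptions: the entity name banked_line_ram, all port names, the integer-typed ports and the requirement that IM_HEIGHT is a multiple of 4 are hypothetical and not part of the answer. Each bank computes its own row, so a bank that holds a line past the next multiple of 4 reads one row ahead. The total storage is still one image (4 x 76,800 pixels for 640x480).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical four-bank line memory: bank b stores lines b, b+4, b+8, ...
entity banked_line_ram is
    generic (
        IM_WIDTH  : positive := 640;
        IM_HEIGHT : positive := 480   -- assumed to be a multiple of 4
    );
    port (
        clk_i   : in  std_logic;
        -- Write side: one pixel per clock.
        we_i    : in  std_logic;
        wline_i : in  natural range 0 to IM_HEIGHT - 1;
        wcol_i  : in  natural range 0 to IM_WIDTH - 1;
        wdata_i : in  std_logic_vector(7 downto 0);
        -- Read side: rline_i is the top line of the 3-line window.
        rline_i : in  natural range 0 to IM_HEIGHT - 3;
        rcol_i  : in  natural range 0 to IM_WIDTH - 1;
        line0_o, line1_o, line2_o : out std_logic_vector(7 downto 0)
    );
end entity banked_line_ram;

architecture rtl of banked_line_ram is
    constant ROWS  : positive := IM_HEIGHT / 4;      -- lines stored per bank
    constant DEPTH : positive := ROWS * IM_WIDTH;    -- pixels stored per bank
    type ram_t   is array (0 to DEPTH - 1) of std_logic_vector(7 downto 0);
    type rdata_t is array (0 to 3) of std_logic_vector(7 downto 0);
    signal rdata : rdata_t;
    signal sel_q : natural range 0 to 3 := 0;
begin

    banks : for b in 0 to 3 generate
        signal ram : ram_t;
    begin
        process (clk_i)
            variable row : natural range 0 to ROWS - 1;
        begin
            if rising_edge(clk_i) then
                -- Only the bank that owns the written line stores the pixel.
                if we_i = '1' and wline_i mod 4 = b then
                    ram((wline_i / 4) * IM_WIDTH + wcol_i) <= wdata_i;
                end if;
                -- Row of the first line >= rline_i that lives in bank b.
                -- The mod keeps the address in range for the unused bank.
                row := ((rline_i + 3 - b) / 4) mod ROWS;
                rdata(b) <= ram(row * IM_WIDTH + rcol_i);  -- registered read
            end if;
        end process;
    end generate banks;

    -- Delay the bank select by one clock to match the registered reads.
    process (clk_i)
    begin
        if rising_edge(clk_i) then
            sel_q <= rline_i mod 4;
        end if;
    end process;

    -- Rotate the bank outputs so line0_o is always the top line.
    line0_o <= rdata(sel_q);
    line1_o <= rdata((sel_q + 1) mod 4);
    line2_o <= rdata((sel_q + 2) mod 4);

end architecture rtl;
```

The three outputs appear one clock after rline_i and rcol_i are applied. Whether a given tool maps each bank onto block RAM depends on the tool and device, so treat the inference style here as a starting point rather than a guarantee.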

Answer 2:

I made a Sobel filter a few years ago. To do that, I wrote a pipeline that delivers 9 pixels on each clock cycle:

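library ieee;
use ieee.std_logic_1164.all;

-- Entity reconstructed from the names used in the architecture;
-- IM_WIDTH as a generic defaulting to 640 is an assumption.
entity matrix_3x3_builder_8b is
    generic (IM_WIDTH : positive := 640);
    port (
        clk_i, rst_i, data_valid_i : in  std_logic;
        data_i                     : in  std_logic_vector(7 downto 0);
        data_o1, data_o2, data_o3  : out std_logic_vector(7 downto 0);
        data_o4, data_o5, data_o6  : out std_logic_vector(7 downto 0);
        data_o7, data_o8, data_o9  : out std_logic_vector(7 downto 0)
    );
end entity matrix_3x3_builder_8b;

architecture rtl of matrix_3x3_builder_8b is
    -- Two full lines plus three pixels, so all nine taps of the window exist.
    type fifo_t is array (0 to 2*IM_WIDTH + 2) of std_logic_vector(7 downto 0);
    signal fifo_int : fifo_t;
begin

    p0_build_3x3: process(rst_i, clk_i)
    begin
        if rst_i = '1' then
            fifo_int <= (others => (others => '0'));
        elsif rising_edge(clk_i) then
            if data_valid_i = '1' then
                -- Shift every pixel by one position and insert the new one at index 0.
                for i in 1 to 2*IM_WIDTH + 2 loop
                    fifo_int(i) <= fifo_int(i-1);
                end loop;
                fifo_int(0) <= data_i;
            end if;
        end if;
    end process p0_build_3x3;

    -- Index 0 is the newest pixel: data_o1..3 come from the newest line,
    -- data_o4..6 from the line before it, data_o7..9 from two lines back.
    data_o1 <= fifo_int(0*IM_WIDTH + 0);
    data_o2 <= fifo_int(0*IM_WIDTH + 1);
    data_o3 <= fifo_int(0*IM_WIDTH + 2);
    data_o4 <= fifo_int(1*IM_WIDTH + 0);
    data_o5 <= fifo_int(1*IM_WIDTH + 1);
    data_o6 <= fifo_int(1*IM_WIDTH + 2);
    data_o7 <= fifo_int(2*IM_WIDTH + 0);
    data_o8 <= fifo_int(2*IM_WIDTH + 1);
    data_o9 <= fifo_int(2*IM_WIDTH + 2);

end rtl;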

Here you read the image pixel by pixel to build your 3x3 matrix. The pipeline takes longer to fill (2*IM_WIDTH + 3 valid pixels), but once it is full you have a new matrix on each clock pulse.
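
As an illustration of how the nine taps could feed the filter itself, here is a minimal sketch of a Sobel stage. The entity name sobel_3x3, its port names and the choice of |Gx| + |Gy| saturated to 8 bits as the magnitude are assumptions of this sketch, not part of the answer. Inputs p1_i to p9_i are meant to connect to data_o1 to data_o9. Since index 0 of the shift register is the newest pixel, taps 1 to 3 lie on the newest line and taps 7 to 9 on the oldest one.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical Sobel stage; p1_i..p9_i connect to data_o1..data_o9.
entity sobel_3x3 is
    port (
        clk_i            : in  std_logic;
        p1_i, p2_i, p3_i : in  std_logic_vector(7 downto 0);  -- newest line
        p4_i, p5_i, p6_i : in  std_logic_vector(7 downto 0);  -- middle line
        p7_i, p8_i, p9_i : in  std_logic_vector(7 downto 0);  -- oldest line
        edge_o           : out std_logic_vector(7 downto 0)
    );
end entity sobel_3x3;

architecture rtl of sobel_3x3 is
    -- Zero-extend an 8-bit pixel to 12-bit signed so the sums cannot overflow.
    function px (v : std_logic_vector(7 downto 0)) return signed is
    begin
        return signed(resize(unsigned(v), 12));
    end function px;
begin
    process (clk_i)
        variable gx, gy, mag : signed(11 downto 0);
    begin
        if rising_edge(clk_i) then
            -- Horizontal gradient: column of taps 1/4/7 minus column of taps 3/6/9.
            gx := (px(p1_i) + shift_left(px(p4_i), 1) + px(p7_i))
                - (px(p3_i) + shift_left(px(p6_i), 1) + px(p9_i));
            -- Vertical gradient: newest line minus oldest line.
            gy := (px(p1_i) + shift_left(px(p2_i), 1) + px(p3_i))
                - (px(p7_i) + shift_left(px(p8_i), 1) + px(p9_i));
            -- |gx| + |gy| is at most 2040, which fits in 12-bit signed.
            mag := abs(gx) + abs(gy);
            if mag > 255 then
                edge_o <= (others => '1');  -- saturate to 255
            else
                edge_o <= std_logic_vector(mag(7 downto 0));
            end if;
        end if;
    end process;
end architecture rtl;
```

The centre tap p5_i is unused because both Sobel kernels have a zero in the middle. The stage adds one clock of latency. Because the shift register holds a flat raster stream, the window straddles two image lines at the start of each line, so the outputs for the border columns would need to be masked or ignored.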

Answer 3:

If you want to continue storing the whole image, then I would do as Jonathan Drolet recommended: cycle between four RAMs while writing, and read all 4 at once (muxing the three you care about into 3 registers). This works because your RAMs are deep enough that you still get full BRAM utilization at 1/4 the depth (still 76,800 entries each) and your reads can be predictably segmented.

For the specifics of this problem, Nicolas Roudel's method is much cheaper in BRAM. However, you no longer store the whole image at one time, so whatever consumes your results can't backpressure you unless you can in turn backpressure your data source. That may or may not matter for your application.

When you try to do something like this with extremely wide but fairly shallow (1k deep) RAMs, segmenting will use more block RAM (or even start inferring distributed RAM). When the reads do not follow a particular pattern (the pattern in your case is that they are all sequential and adjacent locations), the RAM cannot be segmented. The best strategy to maintain efficient BRAM use is often to build quad-port RAMs from the natively dual-port block RAMs by clocking them with a 2x clock that is phase-aligned with your normal clock, allowing you to do a write and 3 reads every 1x clock cycle.

Source: https://stackoverflow.com/questions/29505213/image-processing-pipelining-in-vhdl

Tags

memory

image-processing

vhdl

fpga

pipeline