Executing programs stored in external SPI flash memory on an ARM processor

Question

I have an ARM processor that is able to interface with an external flash memory chip. Written to the chip are programs compiled for the ARM architecture, ready to be executed. What I need to know is how to get this data from the external flash onto the ARM processor for execution.

Can I run some sort of copy routine ahead-of-time where the data is copied into executable memory space? I suppose I could, but the ARM processor is running an operating system and I don't have a ton of space left over in flash to work with. I'd also like to be able to schedule the execution of two or even three programs at once, and copying multiple programs into internal flash at one time isn't feasible. The operating system can be used to launch the programs once they're within accessible memory space, so anything that needs to be done beforehand can be.

Answer 1:

From reading the existing answers by @FiddlingBits and @ensc I think that I can offer a different approach.

You said that your Flash chip cannot be memory mapped. This is a pretty big limitation, but we can work with it.

Yes, you can run a copy routine ahead of time. As long as it places the code into RAM, you can execute it.
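A minimal sketch of such a copy routine follows. It is written so that it can be tried on a PC: fake_flash and spi_flash_read() are hypothetical stand-ins for the real SPI flash driver, and the chunk size is an arbitrary assumption. The final jump into the copied code is left out because it is target specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated external flash contents. On real hardware this array does not
 * exist and spi_flash_read() would talk to the SPI controller instead. */
uint8_t fake_flash[64] = {0};

/* Hypothetical driver call: read len bytes starting at flash offset addr. */
void spi_flash_read(uint32_t addr, uint8_t *dst, size_t len)
{
    assert(addr + len <= sizeof fake_flash);
    memcpy(dst, &fake_flash[addr], len);
}

/* Copy a program image from external flash into a RAM region in small
 * chunks. Returns 0 on success, -1 if the image does not fit. */
int load_image(uint32_t flash_addr, size_t image_len,
               uint8_t *ram_region, size_t region_len)
{
    enum { CHUNK = 16 };   /* assumed transfer size per SPI read */
    size_t done = 0;

    if (image_len > region_len)
        return -1;

    while (done < image_len) {
        size_t n = image_len - done;
        if (n > CHUNK)
            n = CHUNK;
        spi_flash_read(flash_addr + (uint32_t)done, ram_region + done, n);
        done += n;
    }
    /* On the target, the OS would now branch to ram_region. That step is
     * omitted here because it is not portable. */
    return 0;
}
```

The size check up front matters on a part with little RAM: an image that does not fit is rejected before anything is overwritten.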

DMA to make it faster:

If you have a Peripheral DMA Controller (like the one available on the Atmel SAM3N family) then you can use the DMA Controller to copy out chunks of memory while your processor does actually useful things.

MMU to make it simpler:

If you have an MMU available then you can do this easily by just picking out a region of RAM where you want your code to execute, copying the code into it and on every page fault, reloading the correct code into the very same region. However, this was already proposed by @ensc so I'm not adding anything new yet.

Note: In case it's not clear, an MMU is not the same as an MPU
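The reload-on-fault idea from the MMU approach can be sketched on a PC as follows. This only models the bookkeeping a fault handler would do; the page size, the array standing in for external flash and all names are assumptions made for illustration, and no real MMU is programmed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CODE_PAGE_SIZE 32u          /* assumed page size for the sketch */
#define NO_PAGE        0xFFFFFFFFu

uint8_t  backing_store[4 * CODE_PAGE_SIZE]; /* stands in for the external flash */
uint8_t  exec_region[CODE_PAGE_SIZE];       /* the single RAM region code runs from */
uint32_t resident_page = NO_PAGE;           /* which page is currently loaded */
unsigned reload_count = 0;                  /* how many times we had to reload */

/* What a page-fault handler would do: make sure the page that contains
 * code_addr is resident, reloading the region if it is not. Returns a
 * pointer to the corresponding byte inside the RAM region. */
uint8_t *touch(uint32_t code_addr)
{
    uint32_t page = code_addr / CODE_PAGE_SIZE;

    assert(code_addr < sizeof backing_store);
    if (page != resident_page) {
        memcpy(exec_region, &backing_store[page * CODE_PAGE_SIZE],
               CODE_PAGE_SIZE);
        resident_page = page;
        reload_count++;
    }
    return &exec_region[code_addr % CODE_PAGE_SIZE];
}
```

Two accesses inside the same page cost one reload; an access to another page evicts the first and costs a second reload.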

No MMU solution but an MPU is available:

Without an MMU the task is a little trickier but it is still possible to do. You will need to understand how your compiler generates code and read up about Position Independent Code (PIC). Then you will need to allocate a region in RAM that you will execute your external flash chip code from and copy parts of it in there (making sure that you start executing it from the correct location). The MPU will need to be configured to generate a fault any time that task tries to access memory outside of its assigned region and you will then need to fetch the correct memory (this could become a complicated process), reload and continue execution.

No MMU and no MPU available:

If you don't have an MMU or an MPU, this task now becomes very difficult to do. In both cases you have a severe restriction on how big the external code can be. Basically, the code that is stored on the external Flash chip must now fit entirely inside the allocated region in RAM where you will execute it from. If you can split that code up into separate tasks that don't interact with each other then you can do it, but otherwise you cannot.

If you are generating PIC then you can just compile the tasks and place them in memory sequentially. Otherwise, you will need to use the linker script to control the code generation such that each compiled task that will be stored in external flash will execute from the same predefined location in RAM (which will either require you to learn about ld overlays or compile them separately).

Summary:

To answer your question more completely I would need to know what chip and what operating system you are using. How much RAM is available would also help me better understand your constraints.

However, you asked if it was possible to load more than one task at a time to run. If you use PIC like I suggested, it should be possible to do so. If not, then you would need to decide ahead of time where each of the tasks will run, and that would enable you to load and run some of the combinations simultaneously.
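For the non-PIC case, deciding which combinations can run together comes down to checking that the fixed RAM windows of the tasks do not overlap. A small sketch, with a hypothetical struct task_window and made-up helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fixed RAM window that one externally stored task was linked to run in.
 * The struct and its field names are hypothetical. */
struct task_window {
    uint32_t start;   /* first RAM address used by the task */
    uint32_t size;    /* bytes of code and data the task occupies */
};

/* Two half-open ranges [start, start + size) overlap if each one starts
 * before the other one ends. */
bool windows_overlap(struct task_window a, struct task_window b)
{
    return a.start < b.start + b.size && b.start < a.start + a.size;
}

/* A set of tasks can be resident at once only if no pair overlaps. */
bool can_run_together(const struct task_window *tasks, int count)
{
    for (int i = 0; i < count; i++)
        for (int j = i + 1; j < count; j++)
            if (windows_overlap(tasks[i], tasks[j]))
                return false;
    return true;
}
```

The OS could run this check before loading a task and refuse, or evict another task, when it fails.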

And finally, depending on your system and chip this could be easy or hard.

EDIT 1:

Additional information given:

The chip is SAM7S (Atmel)
It does have a Peripheral DMA Controller.
It doesn't have a MMU or MPU.
8K of internal RAM, which is a limitation for us.
It has roughly 28K of flash left over after the operating system, which is custom-written, has been installed.

Additional questions posed:

Ideally, I'd like to copy the programs over into flash memory space and execute them from there. Theoretically this is possible. Would it be impossible to execute the programs instruction by instruction?

Yes, it is possible to execute a program instruction by instruction (but there is a limitation with that approach too that I will get to in a sec). You would start by allocating a (4 byte aligned) address in memory where your single instruction would go. It is 32 bits (4 bytes) wide, and immediately following it you would place a second instruction that you would never change. This second instruction would be a supervisor call (SVC, named SWI on older cores such as the ARM7TDMI) that would raise an exception, allowing you to fetch the next instruction, place it in memory and start again.

Though possible, it isn't recommended: you will spend more time context switching than executing code, you can't actually use variables (you need RAM for that), you can't use function calls (unless you manually process branch instructions, ouch!), and if the instruction slot lives in flash it will be written to so often that it will be made useless very fast. Because of that last point, I will assume that you wanted to execute instruction by instruction from RAM. On top of all of these restrictions you will still have to use some RAM for your stack, heap and globals (see my Appendix for details). This area can be shared by all the tasks running from external flash, but you will need to write a custom linker script for this, otherwise you will waste your RAM.
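The two-word slot described above can be sketched like this. The struct and function names are hypothetical. The constant 0xEF000000 is the ARM-state encoding of "SWI 0" (the same instruction is called SVC in newer documentation), and 0xE1A00000 is "MOV r0, r0", which is used as a harmless placeholder. The sketch only fills the slot; the exception handler that fetches the next instruction is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* ARM-state encoding of "SWI 0" with the "always" condition code. */
#define ARM_SWI_0 0xEF000000u

/* Two consecutive words in RAM: the instruction to run and the trap that
 * hands control back to the OS. */
struct step_slot {
    uint32_t insn;   /* rewritten before every step */
    uint32_t trap;   /* written once, never changed */
};

void step_slot_init(struct step_slot *slot)
{
    slot->insn = 0xE1A00000u;   /* MOV r0, r0 as a placeholder */
    slot->trap = ARM_SWI_0;
}

/* Place the next instruction fetched from external flash into the slot.
 * The trap word is deliberately left untouched. */
void step_slot_stage(struct step_slot *slot, uint32_t next_insn)
{
    slot->insn = next_insn;
}
```

Note that this only works for instructions that do not depend on their own address; branches and PC-relative loads would have to be emulated by the handler, which is the "ouch" mentioned above.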

What will make this clearer for you is understanding how C code is compiled. Even if you're using C++, start by asking yourself this: where are the variables and instructions on my device compiled to?

Basically what you MUST know before attempting this is:

where the code will execute (Flash/RAM)
how this code is linked to its stack, heap and globals (you would allocate a separate stack for this task, and separate space for globals but you can share the heap).
where this external code's stack, heap and globals reside (with this I'm trying to hint at how much control you will need to have over your C code)

Edit 2:

How to utilize the Peripheral DMA Controller:

For the microcontroller I'm working with, the DMA controller is actually not connected to the Embedded Flash for either reading or writing. If this is the case for you too you cannot use it. However, your datasheet is unclear in this regard and I suspect that you will need to run a test using the Serial Port to see if it can actually work.

In addition to this, I am concerned that the write operation when using the DMA controller may be more complicated than you doing it manually because of cached page writes. You will need to ensure that you only do the DMA transfers within pages and that a DMA transfer never crosses the page boundary. Also, I'm not sure what happens when you tell the DMA controller to write from flash back into the same location (which you might need to do to ensure you only overwrite the correct parts).
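Keeping every transfer inside one flash page is mostly a matter of clamping each chunk at the next page boundary. A sketch of that arithmetic, assuming the page size is passed in as a parameter (take the real value from your datasheet) and using made-up function names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest transfer that starts at addr and stays inside the current page. */
size_t page_bounded_len(uint32_t addr, size_t remaining, size_t page_size)
{
    size_t room;

    assert(page_size != 0);
    room = page_size - (addr % page_size);
    return remaining < room ? remaining : room;
}

/* Walk a whole buffer in page-bounded chunks and report how many separate
 * transfers that takes. */
unsigned count_transfers(uint32_t addr, size_t len, size_t page_size)
{
    unsigned n = 0;

    while (len > 0) {
        size_t chunk = page_bounded_len(addr, len, page_size);
        /* A real driver would program the DMA pointer and counter here
         * and wait for the page write to finish. */
        addr += (uint32_t)chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```

For example, with 256-byte pages a 600-byte write starting 16 bytes before a page boundary is split into chunks of 16, 256, 256 and 72 bytes.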

Concerns about the available flash and RAM:

I am concerned with your earlier question about executing it one instruction at a time. If that is the case, then you might as well write an interpreter. If you don't have enough memory to contain the entire code of the task you need to execute, then you will need to compile the task as PIC with the Global Offset Table (GOT) placed in RAM along with all the required memory for that task's globals. That's the only way to get around not having enough space for the whole task. You will also have to allocate enough space for its stack.
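As a rough illustration of what "GOT in RAM" implies for the loader: after copying the table into RAM, each entry has to be adjusted by the difference between the address the task was linked for and the address its data actually ends up at. The sketch below assumes a single displacement for all entries and a table of plain 32-bit addresses; a real toolchain may need separate handling for code and data entries, so treat relocate_got() as a hypothetical helper, not as what your compiler requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adjust every GOT entry by the displacement between the address the task
 * was linked for (link_base) and the address it was loaded at (load_base).
 * Assumes all entries move by the same amount. */
void relocate_got(uint32_t *got, size_t entries,
                  uint32_t link_base, uint32_t load_base)
{
    /* Unsigned subtraction wraps, so this also works when the task is
     * loaded below its link address. */
    uint32_t delta = load_base - link_base;

    for (size_t i = 0; i < entries; i++)
        got[i] += delta;
}
```

Because only the table is patched, the code itself can stay unmodified.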

If you don't have enough RAM (which I suspect you won't) you can swap your RAM memory out and dump it into Flash every time you need to change between tasks on the external Flash chip but again I would strongly advise against writing to your flash memory many times. That way you can make the tasks on the external flash share a piece of RAM for their globals.

For all other cases you will be writing an interpreter. I have even done the unthinkable: I tried to think of a way to use the Abort Status of your microcontroller's memory controller (section 18.3.4 Abort Status in the datasheet) as an MPU, but failed to find even a remotely clever way to use it.

Edit 3:

I would suggest reading the section 40.8.2 Non-volatile Memory (NVM) Bits in the datasheet which suggests that your flash has a maximum of 10,000 write/erase cycles (it took me a while to find it). That means by the time you've written and erased the flash region where you will be context switching the tasks 10,000 times that part of Flash will be rendered useless.
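To put that number in perspective, here is a small back-of-the-envelope helper. The function name and the rewrite rates are made up for illustration; only the 10,000 cycle figure comes from the datasheet section mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a flash region reaches its rated write/erase limit if it
 * is rewritten at a steady rate. */
uint32_t seconds_until_worn(uint32_t rated_cycles, uint32_t rewrites_per_minute)
{
    assert(rewrites_per_minute != 0);
    return (uint32_t)(((uint64_t)rated_cycles * 60u) / rewrites_per_minute);
}
```

At one context switch per second (60 rewrites per minute) the 10,000 cycles are used up after 10,000 seconds, which is less than three hours. Even at one switch per minute the region lasts only about a week.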

APPENDIX

Please have a short read of this blog entry before continuing to read my comments below.

Where C variables live on an embedded ARM chip:

I learn best not from abstract concepts but concrete examples so I will try and give you code to work with. Basically all the magic happens in your linker script. If you read and understand it you will see what happens to your code. Let's dissect one now:

OUTPUT_FORMAT("elf32-littlearm", "elf32-littlearm", "elf32-littlearm")
OUTPUT_ARCH(arm)
SEARCH_DIR(.)

/* Memory Spaces Definitions */

MEMORY
{
  /* Here we are defining the memory regions that we will be placing
   * different sections into. Different regions have different properties,
   * for example, Flash is read only (because you need special instructions
   * to write to it and writing is slow), while RAM is read write.
   * In the brackets after the region name:
   *   r - denotes that reads are allowed from this memory region.
   *   w - denotes that writes are allowed to this memory region.
   *   x - means that you can execute code in this region.
   */

  /* We will call Flash rom and RAM ram */
  rom (rx)  : ORIGIN = 0x00400000, LENGTH = 0x00040000 /* flash, 256K */
  ram (rwx) : ORIGIN = 0x20000000, LENGTH = 0x00006000 /* sram, 24K */
}

/* The stack size used by the application. NOTE: you need to adjust  */
STACK_SIZE = DEFINED(STACK_SIZE) ? STACK_SIZE : 0x800 ;

/* Section Definitions */
SECTIONS
{
    .text :
    {
        . = ALIGN(4);
        _sfixed = .;
        KEEP(*(.vectors .vectors.*))
        *(.text .text.* .gnu.linkonce.t.*)
        *(.glue_7t) *(.glue_7)
        *(.rodata .rodata* .gnu.linkonce.r.*)  /* This is important, .rodata is in Flash */
        *(.ARM.extab* .gnu.linkonce.armextab.*)

        /* Support C constructors, and C destructors in both user code
           and the C library. This also provides support for C++ code. */
        . = ALIGN(4);
        KEEP(*(.init))
        . = ALIGN(4);
        __preinit_array_start = .;
        KEEP (*(.preinit_array))
        __preinit_array_end = .;

        . = ALIGN(4);
        __init_array_start = .;
        KEEP (*(SORT(.init_array.*)))
        KEEP (*(.init_array))
        __init_array_end = .;

        . = ALIGN(0x4);
        KEEP (*crtbegin.o(.ctors))
        KEEP (*(EXCLUDE_FILE (*crtend.o) .ctors))
        KEEP (*(SORT(.ctors.*)))
        KEEP (*crtend.o(.ctors))

        . = ALIGN(4);
        KEEP(*(.fini))

        . = ALIGN(4);
        __fini_array_start = .;
        KEEP (*(.fini_array))
        KEEP (*(SORT(.fini_array.*)))
        __fini_array_end = .;

        KEEP (*crtbegin.o(.dtors))
        KEEP (*(EXCLUDE_FILE (*crtend.o) .dtors))
        KEEP (*(SORT(.dtors.*)))
        KEEP (*crtend.o(.dtors))

        . = ALIGN(4);
        _efixed = .;            /* End of text section */
    } > rom /* All the sections in the preceding curly braces are going to Flash in the order that they were specified */

    /* .ARM.exidx is sorted, so has to go in its own output section.  */
    PROVIDE_HIDDEN (__exidx_start = .);
    .ARM.exidx :
    {
      *(.ARM.exidx* .gnu.linkonce.armexidx.*)
    } > rom
    PROVIDE_HIDDEN (__exidx_end = .);

    . = ALIGN(4);
    _etext = .;

    /* Here is the .relocate section please pay special attention to it */
    .relocate : AT (_etext)
    {
        . = ALIGN(4);
        _srelocate = .;
        *(.ramfunc .ramfunc.*);
        *(.data .data.*);
        . = ALIGN(4);
        _erelocate = .;
    } > ram  /* All the sections in the preceding curly braces are going to RAM in the order that they were specified */

    /* .bss section which is used for uninitialized but zeroed data */
    /* Please note the NOLOAD flag, this means that when you compile the code this section won't be in your .hex, .bin or .o files but will be just assumed to have been allocated */
    .bss (NOLOAD) :
    {
        . = ALIGN(4);
        _sbss = . ;
        _szero = .;
        *(.bss .bss.*)
        *(COMMON)
        . = ALIGN(4);
        _ebss = . ;
        _ezero = .;
    } > ram

    /* stack section */
    .stack (NOLOAD):
    {
        . = ALIGN(8);
        _sstack = .;
        . = . + STACK_SIZE;
        . = ALIGN(8);
        _estack = .;
    } > ram

    . = ALIGN(4);
    _end = . ;

    /* heap extends from here to end of memory */
}

This is an automatically generated linker script for the SAM3N (your linker script should only differ in the memory region definitions). Now, let's go through what happens when your device boots after being powered off.

The first thing that happens is that the ARM core reads the address stored in the FLASH memory's vector table that points to your reset vector. The reset vector is just a function and for me it is also autogenerated by Atmel Studio. Here it is:

void Reset_Handler(void)
{
    uint32_t *pSrc, *pDest;

    /* Initialize the relocate segment */
    pSrc = &_etext;
    pDest = &_srelocate;

    /* This code copies all of the memory for "initialised globals" from Flash to RAM */
    if (pSrc != pDest) {
        for (; pDest < &_erelocate;) {
            *pDest++ = *pSrc++;
        }
    }

    /* Clear the zero segment (.bss). Since it is in RAM it could contain anything after a reset, so zero it. */
    for (pDest = &_szero; pDest < &_ezero;) {
        *pDest++ = 0;
    }

    /* Set the vector table base address */
    pSrc = (uint32_t *) & _sfixed;
    SCB->VTOR = ((uint32_t) pSrc & SCB_VTOR_TBLOFF_Msk);

    if (((uint32_t) pSrc >= IRAM_ADDR) && ((uint32_t) pSrc < IRAM_ADDR + IRAM_SIZE)) {
        SCB->VTOR |= 1 << SCB_VTOR_TBLBASE_Pos;
    }

    /* Initialize the C library */
    __libc_init_array();

    /* Branch to main function */
    main();

    /* Infinite loop */
    while (1);
}

Now, bear with me for a little longer while I explain how C code that you write fits into all of this.

Consider the following code example:

int UninitializedGlobal; // Goes to the .bss segment (RAM)
int ZeroedGlobal[10] = { 0 }; // Goes to the .bss segment (RAM)
int InitializedGlobal[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 11 }; // Goes to the .relocate segment (RAM and FLASH)
const int ConstInitializedGlobal[10] = { 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 }; // Goes to the .rodata segment (FLASH)

void function(int parameter)
{
    static int UninitializedStatic; // Same as UninitializedGlobal above.
    static int ZeroedStatic = 0; // Same as ZeroedGlobal above.
    static int InitializedStatic = 7; // Same as InitializedGlobal above.
    static const int ConstStatic = 18; // Same as ConstInitializedGlobal above. Might get optimized away though, lets assume it doesn't.

    int UninitializedLocal; // Stacked. (RAM)
    int ZeroedLocal = 0; // Stacked and then initialized (RAM)
    int InitializedLocal = 7; // Stacked and then initialized (RAM)
    const int ConstLocal = 91; // Not actually sure where this one goes. I assume optimized away.

    // Do something with all those lovely variables...
}

Answer 2:

It depends on the kind of flash and/or the CPU. NOR flash is usually mapped into memory so you can jump directly into it. NAND flash must be read (how depends on the SoC) into local memory first: SRAM, or DRAM, which needs extra initialization.

EDIT:

SPI flash cannot be mapped to RAM either. You have to program the SPI controller of the SoC and the SPI flash. The protocol used by the SPI flash is usually described in its manual; it is very likely a common protocol, so you can probably reuse an existing driver.
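As an illustration of what such a protocol looks like, many serial NOR flash chips accept a read command consisting of opcode 0x03 followed by a 24-bit address, most significant byte first. This is an assumption about your part, so check its datasheet. The helper below only builds the four command bytes; sending them and clocking the data back is up to the SPI driver.

```c
#include <assert.h>
#include <stdint.h>

#define SPI_FLASH_CMD_READ 0x03u   /* assumed opcode; check the datasheet */

/* Fill frame with the command bytes for a read starting at addr. */
void spi_flash_build_read(uint32_t addr, uint8_t frame[4])
{
    assert(addr <= 0x00FFFFFFu);       /* only 24 address bits are sent */
    frame[0] = SPI_FLASH_CMD_READ;
    frame[1] = (uint8_t)(addr >> 16);  /* address bits 23..16 */
    frame[2] = (uint8_t)(addr >> 8);   /* address bits 15..8  */
    frame[3] = (uint8_t)addr;          /* address bits 7..0   */
}
```

After these four bytes the chip keeps shifting out data for as long as chip select stays low, so one command can fetch an arbitrarily long block.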

Source: https://stackoverflow.com/questions/20205944/executing-programs-stored-in-external-spi-flash-memory-on-an-arm-processor

Tags: arm, execution, spi, flash-memory