Question

In a previous Stack Overflow answer, Margaret Bloom says:

Waking the APs

This is achieved by issuing an INIT-SIPI-SIPI (ISS) sequence to all the APs.

The BSP will send the ISS sequence using the destination shorthand All excluding self, thereby targeting all the APs.

A SIPI (Startup Inter Processor Interrupt) is ignored by all the CPUs that are already awake by the time they receive it, thus the second SIPI is ignored if the first one suffices to wake up the target processors. It is advised by Intel for compatibility reasons.

I've been writing multiprocessing code for years, and my observation of hardware has been that some processors seem to behave differently than stated. I'm pretty sure I've observed Application Processors (APs) have their Instruction Pointer modified upon receipt of a Startup IPI even when they were active (not in a Wait-for-Startup-IPI state).

Is there any Intel documentation that states what an AP will do upon a receipt of a Startup IPI when not in a Wait-for-Startup-IPI state, or documents the behaviour as undefined? I can't seem to find a definitive answer in the Intel Software Documentation Manuals or the supplementary Intel document Minimal Boot Loader for Intel® Architecture.

Generally I write the initialization code to initialize and start an AP by assuming that the AP may get a SIPI and have its Instruction Pointer reset while in an active state (not in a Wait-for-Startup-IPI state).

I'm trying to determine the accuracy of Margaret Bloom's statement that a second Startup IPI will be ignored by an AP that has been previously awoken.

Answer 1:

I consider my statement correct, up to bugs.

I don't claim that buggy hardware should be ignored, but that its impact must first be evaluated.
I'd like to remind the reader that while I have an opinionated position on the matter, I wanted this answer to be as neutral as possible.
To fulfil this purpose I tried to provide sources for my statements.

While I do trust other users' experiences, I cannot base my belief on memories alone (for they cannot be verified)¹ and I'm looking forward to someone correcting my quoted statement with proof.

I understand this is an unpopular view; I hope it just won't pass as totally wrong.

First of all, as usual with computers, it all boils down to standards. While Intel documents the MP behaviour of its CPUs in the manuals, it went a step further and published a proper MultiProcessor Specification.
The importance of this specification lies in its role in the industry: it does not merely describe how Intel's CPUs work; it is, as far as I know, the only x86 SMP industry reference.
AMD and Cyrix pushed the OpenPIC specification but, quoting Wikipedia:

No x86 motherboard was released with OpenPIC however.[3] After the OpenPIC's failure in the x86 market, AMD licensed the Intel APIC Architecture for its AMD Athlon and later processors.

Appendix B.4 of the MP specification contains the line:

If the target processor is in the halted state immediately after RESET or INIT, a STARTUP IPI causes it to leave that state and start executing. The effect is to set CS:IP to VV00:0000h.

As noted in the comments, I've parsed the "if" as the stronger "iff" (if and only if).

Unfortunately, the quoted sentence, as stated, is only a sufficient condition, so it cannot be used to deduce the behaviour of a SIPI on a running CPU.

However, I personally believe this is a mistake in the wording: the intent of the authors of the specification is to use the SIPI to wake up a CPU in the wait-for-SIPI state.

The SIPI was specifically introduced with the advent of integrated APICs, along with a revision of the INIT IPI, to manage the booting of the APs.
The SIPI has no effect on the BSP (which never enters the wait-for-SIPI state according to Intel's manuals) and it's clear that it should have no effect on a running CPU.
The usefulness of the SIPI, besides being non-maskable and not requiring the LAPIC to be enabled, is that it avoids running from the reset vector and the need for the warm boot flag for APs.

It makes no sense, from a design perspective, to let SIPI act on running CPUs. CPUs are always restarted with an INIT IPI as the first IPI.

So, I'm confident in parsing the quoted statement as colloquial English with the tacit agreement that it is also a necessary condition.

I believe this sets the official behaviour of SIPIs on an already-awake CPU, namely that they are ignored.

Fact 1: There is an industry-standard MP specification followed by all major x86 manufacturers; though ambiguous, its intent is to set the behaviour of SIPIs.

Page 98 of the Pentium Spec Update seems to confirm that, at least for the Pentium (and presumably for later Intel generations, and possibly for AMD CPUs, since AMD licensed the LAPIC from Intel):

If an INIT IPI is then sent to the halted upgrade component, it will be latched and kept pending until a STARTUP IPI is received. From the time the STARTUP IPI is received the CPU will respond to further INIT IPIs but will ignore any STARTUP IPIs. It will not respond to future STARTUP IPIs until a RESET assertion or an INIT assertion (INIT Pin or INIT IPI) happens again.

The 75-, 90, and 100-MHz Pentium processors, when used as a primary processor, will never respond to a STARTUP IPI at any time. It will ignore the STARTUP IPI with no effects.

To shutdown the processors the operating system should only use the INIT IPI, STARTUP IPIs should never be used once the processors are running.

This doesn't settle the question of whether there are CPUs where subsequent SIPIs are not ignored.
While this question is still to be addressed, we have, by now, turned it into the question "Are there buggy CPUs that ... ?".
This is a huge leap forward because we can now see how existing OSes deal with it.

I won't discuss Windows. While I recognise this is a big absence, I'm not in the mood for digging into Windows binaries right now.
I may do it later.

Linux

Linux sends two SIPIs and I don't see any feedback in this loop. The code is in smpboot.c, where we clearly see that num_starts is set to 2.
I won't discuss the difference between the LAPIC and the 82489DX APIC, particularly that the latter didn't have SIPIs².

We can however see how Linux follows Intel's algorithm and is not worried by the second SIPI.
In the loop, executed num_starts times, a SIPI is sent to the target AP.

In the comments it has been pointed out that the trampoline is idempotent and that Linux has a synchronisation mechanism.
That doesn't match my experience: of course Linux synchronises code between CPUs, but that's done later in the boot, after the AP is running.
In fact, after the trampoline, the first C code the AP executes is start_secondary, and it doesn't seem idempotent (set_cpu_online is called later in the body, if that counts).

Finally, if the programmers wanted to guard against a double SIPI, they'd put the synchronisation logic as early as possible to avoid dealing with complex situations later.
The trampoline goes as far as dealing with SME and vulnerability fixes; why would one want to do that before dealing with the SIPI-SIPI issue?

It makes no sense to me to have such a critical check so late.

FreeBSD
I wanted to include a BSD OS because BSD code is known to be very clean and robust.
I was able to find an (unofficial) GitHub repository with the FreeBSD source and, while I'm less confident with that code, I found the routine that starts an AP in mp_x86.c.

FreeBSD also uses Intel's algorithm. To my amusement, the source also explains why there is a need for a second SIPI: the P5 processor (the P54C Pentium family?) ignored the first SIPI due to a bug:

/*
* next we do a STARTUP IPI: the previous INIT IPI might still be
* latched, (P5 bug) this 1st STARTUP would then terminate
* immediately, and the previously started INIT IPI would continue. OR
* the previous INIT IPI has already run. and this STARTUP IPI will
* run. OR the previous INIT IPI was ignored. and this STARTUP IPI
* will run.
*/

I was unable to find the source for this statement; the only clue I have is erratum AP11 of the Pentium Specification Update, found in an old Android (i.e. Linux) kernel.
Today Linux seems to have dropped support for those old buggy LAPICs.

Considering the detailed comments, I don't see the need to check the code for idempotency up to a hypothetical check.
The BSD code is clearly written with the commented assumptions in mind.

Fact 2: Two mainstream OSes don't consider SIPI bugs occurring often enough to be worth handling.

While searching the Internet I've found a commit in the gem5 simulator with the title X86: Only recognize the first startup IPI after INIT or reset.
Apparently, they got it wrong at first and then fixed it.

Next step is trying to find some online documentation.
I first searched in Google Patents and while a lot of interesting results pop up (including how the APIC IDs are assigned), regarding SIPIs I only found this text in the patent Method and apparatus for initiating execution of an application processor in a clustered multiprocessor system:

STARTUP IPIs do not cause any change of State in the target processor (except for the change to the instruction pointer), and can be issued only one time after RESET or after an INIT IPI reception or pin assertion.

Wikipedia lists VIA as the only other x86 manufacturer still present.
I tried looking for VIA manuals, but it seems they are not public.

About the past manufacturers, I was unable to find out whether any ever produced MP CPUs at all. E.g. the Cyrix 6x86MX didn't have an APIC at all, so it could have been put in an MP system only with an external APIC (which couldn't support SIPIs).

Next step would be to look at all of the AMD and Intel errata and see if there's something about the SIPIs.
However, errata are bugs and so the question turns into a search for a proof of non-existence (i.e. do bugged LAPICs exist?) which is hard to find (simply because bugs are hard to find and there are many micro-architectures).

My understanding is that the first integrated APIC (an LAPIC as known today) shipped with the P54C. I've consulted the errata but found nothing regarding the handling of SIPIs.
However understanding the errata in their full consequences is not trivial.

I then moved to the Pentium Pro errata (the next uarch, the P6) and found an incorrect handling of SIPIs, though not exactly what we are looking for:

3AP. INIT_IPI After STARTUP_IPI-STARTUP_IPI Sequence May Cause AP to Execute at 0h
PROBLEM: The MP Specification states that to wake up an application processor (AP), the interprocessor interrupt sequence INIT_IPI, STARTUP_IPI, STARTUP_IPI should be sent to that processor. On the Pentium Pro processor, an INIT_IPI, STARTUP_IPI sequence will also work. However, if the INIT_IPI, STARTUP_IPI, STARTUP_IPI sequence is sent to an AP, an internal race condition may occur in the APIC logic which leaves the processor in an incorrect state. Operation will be correct in this state, but if another INIT_IPI is sent to the processor, the processor will not stop execution as expected, and will instead begin execution at linear address 0h. In order for the race condition to cause this incorrect state, the system’s core to bus clock ratio must be 5:2 or greater.

IMPLICATION: If a system is using a core to bus clock ratio of 5:2 or greater, and the sequence INIT_IPI, STARTUP_IPI, STARTUP_IPI is generated on the APIC bus to wake up an AP, and then at some later time another INIT_IPI is sent to the processor, that processor may attempt to execute at linear address 0h, and will execute random opcodes. Some operating systems do generate this sequence when attempting to shut the system down, and in a multiprocessor system, may hang after taking the processors offline. The effect seen will be that the OS may not restart the system if ‘shutdown and restart’ or the equivalent is selected upon exiting the operating system. If an operating system gives the user the capability to take an AP offline using an INIT_IPI (Intel has not identified any operating systems which currently have this capability), this option should not be used.

WORKAROUND: BIOS code should execute a single STARTUP_IPI to wake up an application processor. Operating systems, however, will issue an INIT_IPI, STARTUP_IPI, STARTUP_IPI sequence, as recommended in the MP specification. It is possible that BIOS code may contain a workaround for this erratum in systems with C0 or subsequent steppings of Pentium Pro processor silicon. No workaround is available for the B0 stepping of the Pentium Pro processor.

STATUS: For the steppings affected see the Summary Table of Changes at the beginning of this section.

This 3AP erratum is interesting because:

It confirms that an INIT-SIPI sequence is enough to start up an AP. This was evident from the MP specification and from the FreeBSD code.
It may lead to a behaviour similar to a restart. The bug will make an INIT IPI (after the INIT-SIPI-SIPI sequence) restart the AP at 0h (linear, presumably after the initialisation).
If the BIOS uses the INIT-SIPI-SIPI sequence to use the APs and later the OS attempts to use that sequence again, the first INIT will start the AP.
However, this won't lead to a predictable behaviour unless the LAPIC is left in a corrupted state where any SIPI will be accepted.

Funnily enough, the same errata even include a bug causing "the opposite behaviour": 8AP. APs Do Not Respond to a STARTUP_IPI After an INIT# or INIT_IPI in Low Power Mode

I've also checked the Pentium II, Pentium II Xeon, Pentium III, Pentium 4 errata and found nothing new about SIPIs.

To my understanding, the first AMD processor capable of SMP was the Athlon MP based on the Palomino uarch.
I've checked the revision guide for the Athlon MP and found nothing, checked the revisions in this list and found nothing.

Unfortunately I have little experience with non-AMD, non-Intel x86 CPUs. I was unable to find which secondary manufacturers included an LAPIC.

Fact 3: Official documentation from non-AMD/Intel manufacturers is hard to find and errata are not easily searchable. No erratum describes a bug related to the acceptance of a SIPI on a running processor, but numerous LAPIC bugs are present, making the existence of such bugs plausible.

Final step would be a hardware test.
While this test cannot rule out the presence of other behaviour, at least it is documented (crappy) code.
Documented code is good because other researchers can use it to repeat an experiment, and it can be scrutinised for bugs and constitute a proof.
In short, it is scientific.

I have never seen a CPU that was restarted by a subsequent SIPI, but this doesn't matter because a single buggy CPU suffices to confirm the presence of the bug.
I'm too young, too poor and too human to have conducted an extensive, bug-free analysis of all the MP CPUs.
So, instead, I wrote a test and ran it.

Fact 4: Whiskey Lake, Haswell, Kaby Lake and Ivy Bridge all ignore subsequent SIPIs.
Other people are welcome to test on AMD and older CPUs.
Again, this doesn't constitute a proof, but it's important to frame the state of the matter correctly.
The more data we have the more accurate knowledge of the bug we get.

The test consists of bootstrapping the APs and making them increment a counter and enter an infinite loop (either with jmp $ or with hlt, the result is the same).
Meanwhile the BSP sends a SIPI every n seconds, where n is at least 2 (but it may be more due to the very imprecise timing mechanism), and prints the counter.

If the counter stays at k-1, where k is the number of CPUs (i.e. the counter equals the number of APs), then the subsequent SIPIs are ignored.

There are some technical details to address.

First, the bootloader is legacy (not UEFI) and I didn't want to read another sector, so I wanted it to fit in 512 bytes; hence the BSP and the APs share the boot sequence.

Second, some code must be executed only by the BSP but before entering protected mode (e.g. video mode setting), so I used a flag (init) instead of checking the BSP flag in the IA32_APIC_BASE_MSR register (which is done later to diverge the APs from the BSP).

Third, I took some shortcuts. The SIPI boots the CPU at 8000h, so I put a far jump there to 0000h:7c00h. The timing is done with the port 80h trick; it is very imprecise but should suffice. The GDT pointer is stored in the GDT's own null entry. The counter is printed a few lines below the top to avoid being cropped by some monitors.

If the loop is modified to include the INIT IPI, the counter is incremented regularly.

Please note that this code comes without support.

BITS 16
ORG 7c00h

%define IA32_APIC_BASE_MSR 1bh
%define SVR_REG 0f0h
%define ICRL_REG 0300h
%define ICRH_REG 0310h

xor ax, ax
mov ds, ax
mov ss, ax
xor sp, sp      ;This stack ought be enough

cmp BYTE [init], 0
je _get_pm

;Make the trampoline at 8000h
mov BYTE [8000h], 0eah
mov WORD [8001h], 7c00h
mov WORD [8003h], 0000h

mov ax, 0b800h
mov es, ax
mov ax, 0003h
int 10h
mov WORD [es:0000], 0941h

mov BYTE [init], 0

_get_pm:
;Mask interrupts
mov al, 0ffh
out 21h, al
out 0a1h, al

;THIS PART TO BE TESTED
;
;CAN BE REPLACED WITH A cli, SIPIs ARE NOT MASKABLE
;THE cli REMOVES THE NEED FOR MASKING THE INTERRUPTS AND
;CAN BE PLACED ANYWHERE BEFORE ENTERING PM (BUT LEAVE xor ax, ax
;AS THE FIRST INSTRUCTION)

;Flush pending ones (See Michael Petch's comments)
sti
mov cx, 15
loop $   

lgdt [GDT]
mov eax, cr0
or al, 1
mov cr0, eax
sti

mov ax, 10h
mov es, ax
mov ds, ax
mov ss, ax
jmp 08h:DWORD __START32__

__START32__: 
 BITS 32

 mov ecx, IA32_APIC_BASE_MSR
 rdmsr
 or ax, (1<<11)          ;ENABLE LAPIC
 mov ecx, IA32_APIC_BASE_MSR
 wrmsr

 mov ebx, eax
 and ebx, 0ffff_f000h    ;APIC BASE

 or DWORD [ebx+SVR_REG], 100h

 test ax, 100h
 jnz __BSP__

__AP__: 
 lock inc BYTE [counter]

 jmp $            ;Don't use HLT just in case

__BSP__:
 xor edx, edx 
 mov DWORD [ebx+ICRH_REG], edx 
 mov DWORD [ebx+ICRL_REG], 000c4500h        ;INIT

 mov ecx, 10_000
.wait1:
 in al, 80h
 dec ecx
jnz .wait1 

.SIPI_loop:
 movzx eax, BYTE [counter]
 mov ecx, 100
 div ecx 
 add ax, 0930h
 mov WORD [0b8000h + 80*2*5], ax

 mov eax, edx 
 xor edx, edx
 mov ecx, 10
 div ecx
 add ax, 0930h
 mov WORD [0b8000h + 80*2*5 + 2], ax

 mov eax, edx
 xor edx, edx
 add ax, 0930h
 mov WORD [0b8000h + 80*2*5 + 4], ax

 xor edx, edx 
 mov DWORD [ebx+ICRH_REG], edx 
 mov DWORD [ebx+ICRL_REG], 000c4608h        ;SIPI at 8000h

 mov ecx, 2_000_000
.wait2:
 in al, 80h
 dec ecx
jnz .wait2

jmp .SIPI_loop


GDT dw 17h
    dd GDT
    dw 0

    dd 0000ffffh, 00cf9a00h
    dd 0000ffffh, 00cf9200h

counter db 0
init db 1

TIMES 510-($-$$) db 0
dw 0aa55h

Conclusions

No definitive conclusion can be drawn; the matter is still open.
The reader has been presented with a list of facts.

The intended behaviour is to ignore subsequent SIPIs; the need for two SIPIs is due to a "P5 bug".
Linux and FreeBSD don't seem to mind buggy SIPI handling.
Other manufacturers seem to provide no documentation on their LAPICs, if they produce any of their own.
Recent Intel hardware ignores subsequent SIPIs.

¹With due respect to all people involved and without attacking anyone's credibility. I do believe there are buggy CPUs out there, but there is also buggy software and there are buggy memories. As I don't trust my own old memories, I think I'm still within the bounds of a respectful conversation in asking others not to trust their vague ones.

² Possibly because MP in those days was done with regular CPUs packed together and asserting their INIT# with an external chip (the APIC) was the only way to start them up (along with setting a warm reset vector). However in those years I was too young to have a computer.

According to my testing, SIPIs are ignored when not in a wait-for-SIPI state. I've tested a Whiskey Lake 8565U; of course, a real-hardware test doesn't constitute a proof.
I'm confident that all Intel processors since the Pentium 4 have the same behaviour, but this is just my view.
In this answer I solely want to present the result of a test. Everyone will draw their own conclusions.

Answer 2:

Short Answer

Some CPUs do restart on the second SIPI
I don't know which CPUs restart on the second SIPI because I've been guarding against it for too long
I haven't checked, but I don't think Intel's documentation specifies the behavior for the "SIPI received by running CPU" case
If Intel's documentation does specify the behavior for Intel CPUs, then that doesn't mean CPUs from other vendors (AMD, VIA, SiS, Cyrix, ...) behave the same as Intel CPUs. Intel's manual is only "guaranteed" (excluding errata/specification updates) to apply to Intel's CPUs.

Longer Answer

When I first started implementing multi-CPU support (over 10 years ago) I followed Intel's startup procedure (from Intel's MultiProcessor Specification, with the time delays between INIT, SIPI and SIPI), and after the AP started it incremented a number_of_CPU_running counter (e.g. with a lock inc).

What I found is that some CPUs do restart when they receive the second SIPI; and on some computers that number_of_CPU_running counter would be incremented twice (e.g. with BSP and 3 AP CPUs, the number_of_CPU_running counter could end up being 7 and not 4).

Ever since I've been using memory synchronization to avoid the problem. Specifically, the sending CPU sets a variable (state = 0) before trying to start the receiving CPU, if/when the receiving CPU starts it changes the variable (state = 1) and waits for the variable to be changed again, and when the sending CPU sees that the variable was changed (by receiving CPU) it changes the variable (state = 2) to allow the receiving CPU to continue.

In addition, to improve performance, during the delay after sending the first SIPI the sending CPU monitors that variable, and if the receiving CPU changes the variable it will cancel the delay and won't send a second SIPI at all. I also significantly increase the last delay, because it only expires if there's a failure (and you do not want to assume the CPU failed to start when it started too late, and end up with a CPU doing who-knows-what as the OS changes the contents of memory, etc. later).

In other words, I mostly ignore Intel's "Application Processor Startup" procedure (e.g. from section B.4 of Intel's MultiProcessor Specification) and my code for the sending CPU does:

    set synchronization variable (state = 0)
    send INIT IPI
    wait 10 milliseconds
    send SIPI IPI
    calculate time-out value ("now + 200 microseconds")
    while time-out hasn't expired {
        if the synchronization variable was changed jump to the "CPU_started" code
    }
    send a second SIPI IPI
    calculate time-out value ("now + 500 milliseconds")
    while time-out hasn't expired {
        if the synchronization variable was changed jump to the "CPU_started" code
    }
    do "CPU failed to start" error handling and return

CPU_started:
    set synchronization variable (state = 2) to let the started CPU know it can continue

My code for the receiving CPU does this:

    get info from trampoline (address of stack this CPU needs to use, etc), because sending CPU may change the info after it knows this CPU started
    set synchronization variable (state = 1)
    while synchronization variable remains unchanged (state == 1) {
        pause (can't continue until sending CPU knows this CPU started)
    }
    initialize the CPU (setup protected mode or long mode, etc) and enter the kernel

Note 1: Depending on the surrounding code (e.g. if the synchronization variable is in the trampoline and the OS recycles the trampoline to start other CPUs soon after); the sending CPU might need to wait for the receiving CPU to change the synchronization variable one last time (so that the sending CPU knows that it's safe to recycle/reset the synchronization variable).

Note 2: a CPU "almost always" starts on the first SIPI, and it's reasonable to assume that the second SIPI only exists in case the first SIPI got lost/corrupted and reasonable to assume that the 200 microsecond delay is a conservative worst case. For these reasons, my "cancel the time-out and skip the second SIPI" approach is likely to reduce the pair of 200 microsecond delays by a factor of 4 (e.g. 100 uS instead of 400 uS). The 10 millisecond delay (between INIT IPI and first SIPI) can be amortized (e.g. send INIT to N CPUs, then delay for 10 milliseconds, then do the remaining stuff for each of the N CPUs one at a time); and you can "snowball" the AP CPU startup (e.g. use BSP to start a group of N CPUs, then use 1+N CPUs in parallel to start (1+N)*M CPUs, then use 1+N*M CPUs to start (1+N*M)*L CPUs, etc.). In other words, starting 255 CPUs with Intel's method adds up to 2.64 seconds of delays; but with sufficiently advanced code this can be reduced to less than 0.05 seconds.

Note 3: The "broadcast INIT-SIPI-SIPI" approach is broken and should never be used by an OS (because it makes detecting "CPU failed to start" hard, because it can start CPUs that are faulty, and because it can start CPUs that were disabled for other reasons - e.g. hyper-threading disabled by the user in the firmware's settings). Sadly, Intel's manual has some example code that describes the "broadcast INIT-SIPI-SIPI" approach that is intended for firmware developers (where the "broadcast INIT-SIPI-SIPI" approach makes sense and is safe), and beginners see this example and (incorrectly) assume that OS can use this approach.

Source: https://stackoverflow.com/questions/56384291/what-happens-to-a-startup-ipi-sent-to-an-active-ap-that-is-not-in-a-wait-for-sip

Tags

x86