From a software point of view, what is the latency between an instruction that dirties a memory page and when the core actually marks the page dirty in the Page Table Entry (PTE)?
From AMD's manual (circa 2005), Volume 2: System Programming:
5.4 Page-Translation-Table Entry Fields ... Dirty (D) Bit. Bit 6. This bit is only present in the lowest level of the page-translation hierarchy. It indicates whether the page-translation table or physical page to which this entry points has been written. The D bit is set to 1 by the processor the first time there is a write to the physical page.
Ditto from Intel (circa 2006), Volume 3-A: System Programming Guide, Part 1:
3.7.6 Page-Directory and Page-Table Entries ... Dirty (D) flag, bit 6 Indicates whether a page has been written to when set. (This flag is not used in page-directory entries that point to page tables.) Memory management software typically clears this flag when a page is initially loaded into physical memory. The processor then sets this flag the first time a page is accessed for a write operation.
UPDATE:
From the latest Intel manual (vol 3A, System Programming Guide):
8.1.2.1 Automatic Locking The operations on which the processor automatically follows the LOCK semantics are as follows: ... When updating page-directory and page-table entries — When updating page-directory and page-table entries, the processor uses locked cycles to set the accessed and dirty flag in the page-directory and page-table entries.
From the rest of the text in sections 8.1 and 8.2 it follows that once the CPU sets the dirty bit using the locked operation, the other CPUs should start seeing the updated value.
Of course, you may have a race condition: you first read the dirty bit as 0 on one CPU (or in one of its threads), and later another CPU (or another thread on the same CPU) causes the bit to be set to 1. That isn't anything unusual; software that inspects the bit just has to tolerate it.
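To make the race concrete, here is a minimal user-space sketch in C11. It only models a PTE as an atomic 64-bit word; it does not touch real page tables. The names `pte_test_and_clear_dirty`, `simulate_cpu_write` and `demo_race` are hypothetical, and the `atomic_fetch_or` call merely stands in for the locked update that the manual says the processor performs. The point it illustrates is that software should read and clear the bit in one atomic step, so that a concurrent hardware update is never lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* D flag is bit 6 of the lowest-level entry, per the manuals quoted above. */
#define PTE_DIRTY (UINT64_C(1) << 6)

/* Hypothetical helper: atomically clear D and report whether it was set.
 * A separate "read, then write back" could lose a D bit that the CPU
 * sets between the two steps. */
static bool pte_test_and_clear_dirty(_Atomic uint64_t *pte)
{
    uint64_t old = atomic_fetch_and(pte, ~PTE_DIRTY);
    return (old & PTE_DIRTY) != 0;
}

/* Stand-in for the processor's locked update of the D flag (assumption:
 * modeled here as an atomic OR on the simulated entry). */
static void simulate_cpu_write(_Atomic uint64_t *pte)
{
    atomic_fetch_or(pte, PTE_DIRTY);
}

/* Walks through the sequence described in the text; returns 0 on success. */
static int demo_race(void)
{
    _Atomic uint64_t pte = UINT64_C(0x1000) | UINT64_C(0x1); /* made-up entry */

    /* First observation: page is still clean. */
    assert(!pte_test_and_clear_dirty(&pte));

    /* Later, another CPU writes to the page and the D flag gets set. */
    simulate_cpu_write(&pte);

    /* The next scan sees the bit and clears it in a single atomic step. */
    assert(pte_test_and_clear_dirty(&pte));
    assert(atomic_load(&pte) == (UINT64_C(0x1000) | UINT64_C(0x1)));
    return 0;
}
```

In a real kernel, clearing the dirty bit would normally also be followed by invalidating the corresponding TLB entry, since the processor may cache the flag's state there; that step is outside what this sketch models.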