Question
I'd like to have a library that allows 'self profiling' of critical sections of Linux executables. In the same way that one can time a section using gettimeofday() or RDTSC, I'd like to be able to count events such as branch misses and cache hits.
There are a number of tools that do similar things (perf, PAPI, likwid) but I haven't found anything that matches what I'm looking for. Likwid comes closest, so I'm mostly looking at ways to modify its existing Marker API.
The per-core counter values are stored in MSRs (Model-Specific Registers), but on current Intel processors (Sandy Bridge onward) the "uncore" measurements (memory accesses and other things that pertain to the CPU as a whole) are accessed over PCI.
The usual approach is that the MSRs are read using the msr kernel module, and that the PCI counters (if supported) are read from the sysfs-pci hierarchy. The problem is that both of these require the reader to be running as root and to have 'setcap cap_sys_rawio'. This is difficult (or impossible) for many users.
It's also not particularly fast. Since the goal is to profile small pieces of code, the 'skew' from reading each counter with a syscall is significant. It turns out that the MSR counters can be read by a normal user using RDPMC. I don't yet have a great solution for reading the PCI registers.
One way would be to proxy everything through an 'access server' running as root. This would work, but would be even slower (and hence less accurate) than using /proc/bus/pci. I'm trying to figure out how best to make the PCI 'configuration' space of the counters visible to a non-privileged program.
The best I've come up with is to have a server running as root, to which the client can connect at startup via a local Unix domain socket. As root, the server will open the appropriate device files and pass the open file descriptors to the client. The client should then be able to make multiple reads during execution on its own. Is there any reason this wouldn't work?
But even if I do that, I'll still be using a pread() system call (or something comparable) for every access, of which there might be billions. If trying to time small sub-1000 cycle sections, this might be too much overhead. Instead, I'd like to figure out how to access these counters as Memory Mapped I/O.
That is, I'd like to have read-only access to each counter represented by an address in memory, with the I/O mapping happening at the level of the processor and IOMMU rather than involving the OS. This is described in the Intel Architectures Software Developer's Manual, Vol. 1, section 16.3.1 (Memory-Mapped I/O).
This seems almost possible. In proc_bus_pci_mmap(), the device handler for /proc/bus/pci seems to allow the configuration area to be mapped, but only by root, and only if I have CAP_SYS_RAWIO.
static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct pci_dev *dev = PDE_DATA(file_inode(file));
	struct pci_filp_private *fpriv = file->private_data;
	int i, ret;

	if (!capable(CAP_SYS_RAWIO))
		return -EPERM;

	/* Make sure the caller is mapping a real resource for this device */
	for (i = 0; i < PCI_ROM_RESOURCE; i++) {
		if (pci_mmap_fits(dev, i, vma, PCI_MMAP_PROCFS))
			break;
	}
	if (i >= PCI_ROM_RESOURCE)
		return -ENODEV;

	ret = pci_mmap_page_range(dev, vma,
				  fpriv->mmap_state,
				  fpriv->write_combine);
	if (ret < 0)
		return ret;

	return 0;
}
So while I could pass the file handle to the client, it can't mmap() it, and I can't think of any way to share an mmap'd region with a non-descendant process.
(Finally, we get to the questions!)
So presuming I really want to have a pointer in a non-privileged process that can read from PCI configuration space without help from the kernel each time, what are my options?
1) Maybe I could have a root process open /dev/mem, and then pass that open file descriptor to the child, which can then mmap the part that it wants. But I can't think of any way to make that even remotely secure.
2) I could write my own kernel module, which looks a lot like linux/drivers/pci/proc.c but omits the check for the usual permissions. Since I can lock this down so that it is read-only and just for the PCI space that I want, it should be reasonably safe.
3) ??? (This is where you come in)
Answer 1:
Maybe this answer is a little late. The answer is to use likwid.
As you said, reading MSRs/sysfs-pci has to be done by root. Building likwid's accessDaemon and giving it the rights to access the MSRs bypasses this issue. Of course, because of the inter-process communication involved, the performance values can have some delay, but that delay is not very high.
(For small code sections, the performance counters are somewhat imprecise in any case.)
Likwid can also handle uncore events. Best
Source: https://stackoverflow.com/questions/20120812/how-should-i-read-intel-pci-uncore-performance-counters-on-linux-as-non-root