I was skimming through K&R C and I noticed that to read the entries in a directory, they used:
while (read(dp->fd, (char *) &dirbuf, sizeof(dirbuf))
getdents will return struct linux_dirent. It will do this for any underlying type of filesystem. The "on disk" format could be completely different, known only to the given filesystem driver, so a simple userspace read call could not work. That is, getdents may convert from the native format to fill the linux_dirent.
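As a rough illustration, here is a minimal sketch of calling the syscall directly. It assumes Linux and uses the 64-bit variant, getdents64 with struct linux_dirent64, because the older SYS_getdents number is not defined on every architecture. The struct is declared locally since glibc does not export it, and list_dir_raw is just a name picked for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout filled in by getdents64(2), as documented in the man page.
   Declared locally because glibc does not export it. */
struct linux_dirent64 {
    uint64_t       d_ino;    /* inode number */
    int64_t        d_off;    /* offset to the next record */
    unsigned short d_reclen; /* length of this record */
    unsigned char  d_type;   /* file type (DT_REG, DT_DIR, ...) */
    char           d_name[]; /* NUL-terminated filename */
};

/* Hypothetical helper: print every entry name in `path`.
   Returns the number of entries seen, or -1 on error. */
long list_dir_raw(const char *path)
{
    _Alignas(8) char buf[4096];
    long count = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (nread < 0) {
            close(fd);
            return -1;
        }
        if (nread == 0)
            break;                       /* end of directory */
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            printf("%s\n", d->d_name);
            pos += d->d_reclen;          /* records are variable length */
            count++;
        }
    }
    close(fd);
    return count;
}
```

Whatever the filesystem underneath, the caller only ever walks these uniform records; e.g. list_dir_raw("/") prints the names in the root directory.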
couldn't the same thing be said about reading bytes from a file with read()? The on disk format of the data within a file isn't necessarily uniform across filesystems or even contiguous on disk - thus, reading a series of bytes from disk would again be something I expect to be delegated to the file system driver.
The discontiguous file data is handled by the VFS ["virtual filesystem"] layer, regardless of how an FS chooses to organize the block list for a file. For example, ext4 uses "inodes" ("index" or "information" nodes), which use an "ISAM" ("index sequential access method") organization, but an MS-DOS FS can have a completely different organization.
Each FS driver registers a table of VFS function callbacks when it's started. For a given operation (e.g. open/close/read/write/seek), there is a corresponding entry in the table.
The VFS layer (i.e. from the userspace syscall) will "call down" into the FS driver and the FS driver will perform the operation, doing whatever it deems necessary to fulfill the request.
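The callback-table idea can be sketched in userspace. This is a toy model, not the kernel's actual structures: toy_fs_ops, toy_file, memfs and revfs are all invented for the example. Two "drivers" store the same data differently, yet the caller goes through one entry point and sees the same bytes.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-filesystem callback table (toy stand-in for the kernel's). */
struct toy_fs_ops {
    const char *name;
    ssize_t (*read)(const void *file, char *buf, size_t len);
};

/* An open file: which driver owns it, plus that driver's private data. */
struct toy_file {
    const struct toy_fs_ops *ops;
    const void *priv;
};

/* "memfs": file contents are stored as-is. */
static ssize_t memfs_read(const void *file, char *buf, size_t len)
{
    const char *stored = file;
    size_t n = strlen(stored);
    size_t out = n < len ? n : len;
    memcpy(buf, stored, out);
    return (ssize_t)out;
}

/* "revfs": file contents are stored reversed, so the driver must undo that. */
static ssize_t revfs_read(const void *file, char *buf, size_t len)
{
    const char *stored = file;
    size_t n = strlen(stored);
    size_t out = n < len ? n : len;
    for (size_t i = 0; i < out; i++)
        buf[i] = stored[n - 1 - i];
    return (ssize_t)out;
}

const struct toy_fs_ops memfs_ops = { "memfs", memfs_read };
const struct toy_fs_ops revfs_ops = { "revfs", revfs_read };

/* The "VFS" entry point: it only "calls down" through the table. */
ssize_t toy_vfs_read(const struct toy_file *f, char *buf, size_t len)
{
    return f->ops->read(f->priv, buf, len);
}
```

A file backed by memfs holding "hello" and one backed by revfs holding "olleh" both read back as "hello" through toy_vfs_read; the caller never learns which layout was underneath.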
I assume that the FS driver would know about the location of the data inside a regular file on disk - even if the data was fragmented.
Yes. For example, if the read request is to read the first three blocks from the file (e.g. 0,1,2), the FS will look up the indexing information for the file and get a list of physical blocks to read (e.g. 1000000,200,37) from the disk surface. This is all handled transparently in the FS driver.
The userspace program will simply see its buffer filled up with the correct data, without regard to how complex the FS indexing and block fetch had to be.
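To make the logical-to-physical lookup concrete, here is a toy sketch. The "disk" is faked by a function that fills a block with a letter derived from the physical block number, and toy_inode is an invented structure holding the block list; a real FS driver is far more involved.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8   /* tiny blocks keep the example readable */

/* Hypothetical stand-in for the disk: physical block N is filled with
   the letter 'A' + (N % 26). */
static void toy_disk_read(unsigned long phys, char *out)
{
    memset(out, 'A' + (int)(phys % 26), BLOCK_SIZE);
}

/* Per-file index: logical block i lives at physical block map[i]. */
struct toy_inode {
    size_t nblocks;
    const unsigned long *map;
};

/* Read up to `len` bytes starting at byte offset `off`.
   Returns the number of bytes copied into `buf`. */
size_t toy_file_read(const struct toy_inode *ino, size_t off,
                     char *buf, size_t len)
{
    size_t size = ino->nblocks * BLOCK_SIZE;
    size_t done = 0;
    char block[BLOCK_SIZE];

    if (off >= size)
        return 0;
    if (len > size - off)
        len = size - off;

    while (done < len) {
        size_t lblk   = (off + done) / BLOCK_SIZE;  /* logical block number */
        size_t in_blk = (off + done) % BLOCK_SIZE;  /* offset inside it */
        size_t chunk  = BLOCK_SIZE - in_blk;
        if (chunk > len - done)
            chunk = len - done;
        toy_disk_read(ino->map[lblk], block);       /* physical lookup */
        memcpy(buf + done, block + in_blk, chunk);
        done += chunk;
    }
    return done;
}
```

With a map of {1000000, 200, 37}, a read of the first three logical blocks fetches those three scattered physical blocks, and the caller just gets one contiguous buffer.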
Perhaps it is [loosely] more proper to refer to this as transferring inode data as there are inodes for files (i.e. an inode has the indexing information to "scatter/gather" the FS blocks for the file). But, the FS driver also uses this internally to read from a directory. That is, each directory has an inode to keep track of the indexing information for that directory.
So, to an FS driver, a directory is much like a flat file that has specially formatted information. These are the directory "entries". This is what getdents returns. This "sits on top of" the inode indexing layer.
Directory entries can be variable length [based on the length of the filename]. So, the on disk format would be (call it "Type A"):
static part|variable length name
static part|variable length name
...
But ... some FSes organize themselves differently (call it "Type B"):
static part 1,static part 2,...
variable length name 1,variable length name 2,...
So, while the type A organization might be read atomically by a userspace read(2) call, the type B would have difficulty. So, the getdents VFS call handles this.
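Here is a toy sketch of the two organizations being converted into one uniform entry format, which is roughly the job described for getdents. The byte layouts, toy_dirent and both decode functions are invented for illustration and do not correspond to any real filesystem.

```c
#include <stddef.h>
#include <string.h>

/* Uniform entry handed back to the caller, whatever the on-disk layout. */
struct toy_dirent {
    unsigned ino;
    char name[32];
};

/* Type A: records are [ino byte][name length byte][name bytes], back to back. */
size_t decode_type_a(const unsigned char *disk, size_t disk_len,
                     struct toy_dirent *out, size_t max)
{
    size_t pos = 0, n = 0;
    while (n < max && pos + 2 <= disk_len) {
        size_t len = disk[pos + 1];
        if (len >= sizeof(out[n].name) || pos + 2 + len > disk_len)
            break;                      /* malformed or truncated record */
        out[n].ino = disk[pos];
        memcpy(out[n].name, disk + pos + 2, len);
        out[n].name[len] = '\0';
        pos += 2 + len;
        n++;
    }
    return n;
}

/* Type B: the static parts sit in one table, the names in a separate pool. */
struct type_b_static {
    unsigned char ino, name_off, name_len;
};

size_t decode_type_b(const struct type_b_static *tab, size_t count,
                     const char *pool, size_t pool_len,
                     struct toy_dirent *out, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < count && n < max; i++) {
        size_t off = tab[i].name_off, len = tab[i].name_len;
        if (len >= sizeof(out[n].name) || off + len > pool_len)
            break;                      /* name falls outside the pool */
        out[n].ino = tab[i].ino;
        memcpy(out[n].name, pool + off, len);
        out[n].name[len] = '\0';
        n++;
    }
    return n;
}
```

Note that the type B decoder needs two separate regions to build a single entry, which is why one flat read of consecutive bytes would not be enough there.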
couldn't the VFS also present a "linux_dirent" view of a directory like the VFS presents a "flat view" of a file?
That is what getdents is for.
Then again, I'm assuming that a FS driver knows the type of each file and thus could return a linux_dirent when read() is called on a directory rather than a series of bytes.
getdents did not always exist. When dirents were fixed size and there was only one FS format, the readdir(3) call probably did read(2) underneath and got a series of bytes [which is all that read(2) provides]. Actually, IIRC, in the beginning there was only readdir(2), and getdents and readdir(3) did not exist.
But, what do you do if the read(2) is "short" (e.g. two bytes too small)? How do you communicate that to the app?
My question is more like since the FS driver can determine whether a file is a directory or a regular file (and I'm assuming it can), and since it has to intercept all read() calls eventually, why isn't read() on a directory implemented as reading the linux_dirent?
read on a dir isn't intercepted and converted to getdents because the OS is minimalist. It expects you to know the difference and make the appropriate syscall.
You do open(2) for files or dirs [opendir(3) is a wrapper and does open(2) underneath]. You can read/write/seek for files and seek/getdents for dirs.
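To show the relationship between the two levels, here is a small sketch that does the open(2) itself and then hands the descriptor to the library with fdopendir(3), after which readdir(3) works as usual. list_dir_libc is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open the directory with open(2), wrap the fd in a
   DIR stream, and print each name. Returns the entry count or -1. */
long list_dir_libc(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    DIR *dir = fdopendir(fd);    /* on success the stream owns fd */
    if (dir == NULL) {
        close(fd);
        return -1;
    }

    long count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        printf("%s\n", ent->d_name);
        count++;
    }
    closedir(dir);               /* also closes the underlying fd */
    return count;
}
```

In practice you would normally just call opendir(3); the explicit open(2) is only there to make the layering visible.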
But ... doing read for a dir returns EISDIR. [Side note: I had forgotten this in my original comments]. In the simple "flat data" model that read provides, there isn't a way to convey/control all that getdents can/does.
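This is easy to check for yourself. A minimal sketch, assuming Linux (read_errno_on_dir is a name invented here):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try read(2) on a directory.
   Returns 0 if the read unexpectedly succeeds, otherwise the errno value
   from whichever call failed. */
int read_errno_on_dir(const char *path)
{
    char buf[64];
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return errno;

    ssize_t n = read(fd, buf, sizeof(buf));
    int err = (n < 0) ? errno : 0;   /* capture errno before close() */
    close(fd);
    return err;
}
```

On a Linux system I would expect read_errno_on_dir("/") to come back as EISDIR: the open succeeds, and only the read is refused.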
So, rather than allow an inferior way to get partial/wrong info, it's simpler for the kernel and an app developer to go through the getdents interface.
Further, getdents does things atomically. If you're reading directory entries in a given program, there may be other programs that are creating and deleting files in that directory or renaming them, right in the middle of your getdents sequence.
getdents will present an atomic view. Either a file exists or it doesn't. It's been renamed or it hasn't. So, you don't get a "half modified" view, regardless of how much "turmoil" is happening around you. When you ask getdents for 20 entries, you'll get them [or 10 if there are only that many].
Side note: A useful trick is to "overspecify" the count. That is, tell getdents you want 50,000 entries [you must provide the space]. You'll usually get back something like 100 or so. But, now, what you've got is an atomic snapshot in time for the full directory. I sometimes do this instead of looping with a count of 1 (YMMV). You still have to protect against immediate disappearance, but at least you can see it (i.e. a subsequent file open fails).
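A sketch of the oversized-buffer trick, assuming Linux and the getdents64 variant of the syscall. Whether one call really covers the whole directory depends on the filesystem and the directory size, so this version makes a second call and reports whether it returned 0 (end of directory). snapshot_dir and raw_dirent64 are names invented for the example; the struct has the same layout as linux_dirent64 in the man page.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same layout as struct linux_dirent64 from the getdents64(2) man page. */
struct raw_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: fetch entries with ONE oversized getdents64 call.
   Returns the number of entries in that call, or -1 on error.
   *exhausted is set to 1 if a follow-up call reports end of directory,
   i.e. the first call really did cover everything. */
long snapshot_dir(const char *path, size_t buf_size, int *exhausted)
{
    long count = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    char *buf = malloc(buf_size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long nread = syscall(SYS_getdents64, fd, buf, (unsigned int)buf_size);
    if (nread < 0) {
        free(buf);
        close(fd);
        return -1;
    }
    for (long pos = 0; pos < nread; ) {
        struct raw_dirent64 *d = (struct raw_dirent64 *)(buf + pos);
        pos += d->d_reclen;
        count++;
    }

    /* Second call: 0 means nothing was left after the first one. */
    long more = syscall(SYS_getdents64, fd, buf, (unsigned int)buf_size);
    *exhausted = (more == 0);

    free(buf);
    close(fd);
    return count;
}
```

For example, snapshot_dir(".", 1 << 20, &done) asks for up to 1 MiB of entries at once; if done comes back 0, the directory was bigger than one call returned and you are back to looping.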
So, you always get "whole" entries and no entry for a just deleted file. That is not to say that the file is still there, merely that it was there at the time of the getdents. Another process may instantly erase it, but not in the middle of the getdents.
If read(2) were allowed, you'd have to guess at how much data to read and wouldn't know which entries were fully formed or in a partial state. If the FS had the type B organization above, a single read could not atomically get the static portion and variable portion in a single step.
It would be philosophically incorrect to slow down read(2) to do what getdents does.
getdents, unlink, creat, rmdir, and rename (etc.) operations are interlocked and serialized to prevent any inconsistencies [not to mention FS corruption or leaked/lost FS blocks]. In other words, these syscalls all "know about each other".
If pgmA renames "x" to "z" and pgmB renames "y" to "z", they don't collide. One goes first and another second but no FS blocks are ever lost/leaked. getdents gets the whole view (be it "x y", "y z", "x z" or "z"), but it will never see "x y z" simultaneously.