Two key abstractions for virtualizing storage are the file and the directory. Each file has a low-level name, called an inode number.
A directory is an abstraction that contains a list of files and other directories.
The directory hierarchy starts at a root directory, and uses a separator to name subsequent sub-directories until the desired file or directory is named. Pathnames are either absolute (starting at the root) or relative (starting at the current working directory).
Directories and files can have the same name as long as they are in different locations. File names generally have two parts – a name and a file extension, to indicate the file type.
To create a file, we use the open() system call and pass O_CREAT, which creates the file if it doesn't exist.
This code creates a file using the open system call.
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);
This can also be done with creat().
int fd = creat("foo", S_IRUSR|S_IWUSR);
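open() returns a file descriptor: an integer, private per process, that acts as a handle to the file.
On Unix systems, the OS tracks each process's open files in a per-process structure; in xv6, for example, the proc struct holds an array of open-file pointers indexed by file descriptor.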
To see how files are read and written, let's trace cat on a file.
prompt> strace cat foo
open("foo", O_RDONLY|O_LARGEFILE) = 3
read(3, "hello\n", 4096) = 6
write(1, "hello\n", 6) = 6
hello
read(3, "", 4096) = 0
close(3) = 0
Going through the trace, the call to open() on foo returns file descriptor 3. This is because each process already has stdin (0), stdout (1), and stderr (2) open. Thus, the first new file to be opened gets the integer 3.
Then the program reads from file descriptor 3 and gets the contents "hello\n"; read() returns 6, the number of bytes read (here, the size of the file). Those bytes are then written to stdout (fd 1), which is why hello appears in the output. The next read() returns 0, indicating that nothing is left to read, so the program stops reading. Finally, the file is closed.
To read from or write to an arbitrary position in a file, rather than sequentially, the lseek() system call is used:
off_t lseek(int fildes, off_t offset, int whence);
The first argument is the file descriptor to seek in. The second argument is the offset, which positions the file offset at a particular location within the file. The third argument determines how the seek is performed.
SEEK_SET sets the offset to offset bytes. SEEK_CUR sets the offset to its current location plus offset bytes. SEEK_END sets the offset to the size of the file plus offset bytes.
The OS thus tracks a current offset for each open file, as part of that file's entry in the open file table.
A few examples make this clearer.
The read() system call reads up to the number of bytes requested from the file and advances the current offset by the number of bytes actually read, so the next read() continues where the last one stopped.
System Calls | Return Code | Current Offset |
---|---|---|
fd = open("file", O_RDONLY); | 3 | 0 |
read(fd, buffer, 100); | 100 | 100 |
read(fd, buffer, 100); | 100 | 200 |
read(fd, buffer, 100); | 100 | 300 |
read(fd, buffer, 100); | 0 | 300 |
close(fd); | 0 | - |
In another example, a process opens the same file twice. Two file descriptors are allocated, each referring to a different entry in the open file table (OFT).
Each entry has its own independent offset, so both descriptors read the same data from the start of the file.
System Calls | Return Code | OFT[10] Current Offset | OFT[11] Current Offset |
---|---|---|---|
fd1 = open("file", O_RDONLY); | 3 | 0 | - |
fd2 = open("file", O_RDONLY); | 4 | 0 | 0 |
read(fd1, buffer1, 100); | 100 | 100 | 0 |
read(fd2, buffer2, 100); | 100 | 100 | 100 |
close(fd1); | 0 | - | 100 |
close(fd2); | 0 | - | - |
Another example uses lseek() to move to a different location in the file before reading:
System Calls | Return Code | Current Offset |
---|---|---|
fd = open("file", O_RDONLY); | 3 | 0 |
lseek(fd, 200, SEEK_SET); | 200 | 200 |
read(fd, buffer, 50); | 50 | 250 |
close(fd); | 0 | - |
The write() system call usually just buffers the data in OS memory; the file system writes it to persistent storage at some point in the future. For some applications this is unacceptable, so the OS provides fsync(), which takes a file descriptor and returns only after the buffered writes for that file have been committed to disk.
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);
assert(fd > -1);
int rc = write(fd, buffer, size);
assert(rc == size);
rc = fsync(fd);
assert(rc == 0);
If the file was newly created, also remember to fsync() the directory that contains it, so that the directory entry for the file reaches disk as well.
To rename a file, rename() is provided.
rename() is also atomic with respect to system crashes: it either succeeds or fails, with no in-between state.
int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size); // write out new version of file
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");
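This pattern allows a file to be updated atomically: the new version is written to a temporary file, forced to disk, and then swapped into place, so foo.txt always holds either the complete old contents or the complete new contents.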
To get information about a file, stat() is frequently used; it fills in a stat structure with the file's metadata.
On Linux, the struct might look like this:
struct stat {
    dev_t     st_dev;     // ID of device containing file
    ino_t     st_ino;     // inode number
    mode_t    st_mode;    // protection
    nlink_t   st_nlink;   // number of hard links
    uid_t     st_uid;     // user ID of owner
    gid_t     st_gid;     // group ID of owner
    dev_t     st_rdev;    // device ID (if special file)
    off_t     st_size;    // total size, in bytes
    blksize_t st_blksize; // blocksize for filesystem I/O
    blkcnt_t  st_blocks;  // number of blocks allocated
    time_t    st_atime;   // time of last access
    time_t    st_mtime;   // time of last modification
    time_t    st_ctime;   // time of last status change
};
To see how a file is removed, trace rm with strace (or dtruss on macOS) to find the system call it uses, which is unlink():
prompt> strace rm foo
unlink("foo") = 0
To make a directory, the mkdir() system call is used. This creates an empty directory with the given permissions:
prompt> strace mkdir foo
mkdir("foo", 0777) = 0
Reading a directory takes a set of three calls: opendir() to open it, readdir() to read the next entry, and closedir() to close it.
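#include <assert.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    DIR *dp = opendir("."); // open the current directory
    assert(dp != NULL);
    struct dirent *d;
    // readdir() returns NULL once every entry has been read
    while ((d = readdir(dp)) != NULL) {
        printf("%lu %s\n", (unsigned long) d->d_ino, d->d_name);
    }
    closedir(dp);
    return 0;
}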
The dirent struct looks like this:
struct dirent {
    char           d_name[256]; // filename
    ino_t          d_ino;       // inode number
    off_t          d_off;       // offset to the next dirent
    unsigned short d_reclen;    // length of this record
    unsigned char  d_type;      // type of file
};
To delete a directory, rmdir() is provided, but it comes with the requirement that the directory is empty before it works.
To create a hard link to a file, we use the link() system call, which takes two arguments: an old pathname and a new one. link() creates another name in the directory that refers to the same inode as the old file, so both names lead to the same data.
When you rm a file, rm calls unlink(), which removes the name and decrements the reference count (link count) of the inode that the name refers to. Only when the reference count drops to 0 does the file system free the inode and its data blocks.
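A symbolic link (or soft link), created with ln -s, does not have the restrictions of a hard link: it can point to a directory, or to a file in a different file system. A symbolic link is itself a small file that holds the pathname of the file it links to.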
However, they also allow for dangling references:
prompt> echo hello > file
prompt> ln -s file file2
prompt> cat file2
hello
prompt> rm file
prompt> cat file2
cat: file2: No such file or directory
Unix permission bits are the most basic mechanism for controlling who can access files and directories.
prompt> ls -l foo.txt
-rw-r--r-- 1 remzi wheel 0 Aug 24 16:29 foo.txt
The string -rw-r--r-- starts with a file-type character (- for a regular file), followed by three groups of permission bits: rw- (owner), r-- (group), r-- (anyone).
This file is readable and writable by the owner, readable by the group, and readable by anyone.
The last bit in each group, x, indicates on a file whether it is executable. On a directory, however, it indicates whether a user can cd into the directory and, in combination with the write bit, create files in it.
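To make a file system, mkfs writes an empty file system of a given type onto a device. To make it accessible, the mount program (which uses the mount system call) attaches that file system to the existing directory tree.
mount normally takes the file system type, the device, and the mount point, and makes the file system's contents visible under the mount point: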
prompt> mount -t ext3 /dev/sda1 /home/users