Linux and Unix Terminology

There are quite a few terms which are fairly specific to Linux, though really they apply to Unix in general. I've chosen to group them here rather than on the getting started page, as that page is more about computers in general and is not as specific to a Unix machine.

Files & File Handles (File Descriptors)

When one opens a file (for reading, writing, or both), one gets what is called a file handle or a file descriptor. It is, in fact, just a number (an integer, specifically). This number corresponds to information stored in the kernel about which file is open, the current position within it, the mode it was opened in, and so on.

Files aren't the only things that one can get a file handle for, though. When you get down to it, all input and output in user space is done through file handles. Network connections, inter-process communication, opening an X display, printing to a terminal and reading user input, hardware devices, all of it is done through file handles.

This has the advantage of simplifying the interface for a Unix program. Once you set up an input source (network, file, keyboard, sound card, etc.), it is handled just as every other source is handled. This cuts down on the number of calls that a programmer must learn (everything is just read and write). It also just makes one feel like one is dwelling in a saner environment, since everything is done in a consistent way.
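
As a minimal sketch of what this looks like from C (the path is just an example), reading a file and writing to the terminal go through the same two calls, just on different file descriptors:

    #include <fcntl.h>    /* open() */
    #include <unistd.h>   /* read(), write(), close() */

    int main(void)
    {
        char buf[4096];
        ssize_t n;

        /* open() returns a small integer -- the file descriptor.
           "/etc/hostname" is just an example path. */
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0)
            return 1;

        /* Reading from a file and writing to the terminal use the same calls;
           descriptor 1 is standard output, which is usually the terminal. */
        while ((n = read(fd, buf, sizeof buf)) > 0)
            write(1, buf, n);

        close(fd);
        return 0;
    }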

Permissions & Other Attributes

A unix system, being inherently a multi-user system, has as an integral concept the idea of permissions. Users are not supposed to be able to affect other users, while they are supposed to have free rein over their own resources. Consequently, every file has what are known as file permissions.

There are three normal permissions that a file can have, and three groups that these permissions may apply to. Thus, on a normal file, there are nine different permission flags which may be set.

The three normal permissions are read, write, and execute. The three groups that these can apply to are the owner (user), the group (group), and everyone else (other). Thus a file which allows anyone to do anything would have a permissions string which looked like rwxrwxrwx. One that allowed the user to read and write it and both the group and everyone else to simply read it would have a permissions string of rw-r--r--. A file which could be read and executed by anyone but written only by the owner would look like rwxr-xr-x.

If you examine the output of ls -l file for a given file, you will notice that its permissions string has a leading - in it. This is because there are a number of special attributes which may be set on a file. One of these attributes is the directory attribute. This indicates that a file is in fact a directory (remember the idea that in unix everything's a file? well, that applies to directories, too). If you were to execute the command ls -ld /usr you would get the output drwxr-xr-x 16 root root 4096 Feb 8 05:26 /usr/ (or something fairly similar).
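
To see how these flags are exposed to programs, here is a rough sketch in C that rebuilds the permissions string (and the leading d) for a path using stat(); /usr is just the example path from above:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        if (stat("/usr", &st) != 0)   /* "/usr" is just an example path */
            return 1;

        /* S_ISDIR checks the directory attribute; the S_I* macros are
           the nine permission flags described above. */
        printf("%c%c%c%c%c%c%c%c%c%c\n",
               S_ISDIR(st.st_mode) ? 'd' : '-',
               (st.st_mode & S_IRUSR) ? 'r' : '-',
               (st.st_mode & S_IWUSR) ? 'w' : '-',
               (st.st_mode & S_IXUSR) ? 'x' : '-',
               (st.st_mode & S_IRGRP) ? 'r' : '-',
               (st.st_mode & S_IWGRP) ? 'w' : '-',
               (st.st_mode & S_IXGRP) ? 'x' : '-',
               (st.st_mode & S_IROTH) ? 'r' : '-',
               (st.st_mode & S_IWOTH) ? 'w' : '-',
               (st.st_mode & S_IXOTH) ? 'x' : '-');
        return 0;
    }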

The other interesting special file attribute is the setUID attribute. This attribute causes an executable to be run as the owner of the file rather than as the user who is running it. This is primarily done with applications owned by root so that users can perform actions that normally only root could. For example, the program chfn (used to change one's finger information) is setUID root. For security reasons it requires you to have the password of the user whose finger information you are trying to change, but once you satisfy that requirement, it lets you change the finger information, which normally only root could do. As with this program, all setUID programs are responsible for handling their own security. As a result, this should not be done lightly and should only be used by people who are very knowledgeable about security issues.

Oh, and the setUID attribute is normally displayed as an s in place of the x in the user's access permissions, though it is in fact a separate flag.
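
A sketch of how this looks from inside a setUID program: the real user ID stays that of the person who ran it, while the effective user ID (the one permission checks use) becomes that of the file's owner, assuming the installed binary really has the setUID bit set:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* The real UID is the user who ran the program; the effective UID
           is what permission checks use.  Under setUID the two differ. */
        printf("real uid: %ld, effective uid: %ld\n",
               (long)getuid(), (long)geteuid());
        return 0;
    }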

Other file attributes include the setGID bit (like setUID, but for the file's group), the sticky bit (on a directory such as /tmp it means only a file's owner may delete that file), and the type flags which mark symbolic links, sockets, named pipes, and character and block devices.

The File Structure

From a userland perspective, nearly all local resources of a computer are available through a coherent file structure. This file structure has a symbolic highest point called the root or root directory. The root of the filesystem is represented simply as a "/" with nothing preceding it.

In the root directory, there are files and directories. Normally, the only files kept in the root directory are kernels that the system can be booted with and their support files (System.map, etc.).

The subdirectories of the root directory are reasonably standard. Here are some of the standard ones:

/bin

This is where executables that are needed very early in system usage are kept. In general there aren't really too many programs here, and most of them are the very basic utilities like cat, grep, ls, rm, and the like. Command shells also go here.

/boot

This is where some bootup information is kept. Many distributions put their kernel images here, though it is just as standard (in practice, at least, if not by the word of a standards body) to put your kernel in the root directory.

/dev

This is where device files are kept. Device files are files which provide access to device drivers. For example, the serial (COM) ports are /dev/ttyS0 through /dev/ttyS3 in most normal distributions. IDE hard drives are /dev/hda through /dev/hdh, typically, with the first partition on /dev/hda being /dev/hda1, and so on.

It is common practice to create symlinks from generic device names to the actual devices they refer to. For example, on my system, I have a symlink called /dev/cdrom which points to /dev/hdc, a symlink /dev/modem which points to /dev/ttyS0, and a symlink /dev/mouse which points to /dev/psaux.

/etc

This is the normal resting place of system-wide configuration files. The initialization scripts that init runs to get the system up are also normally stored here (in /etc/init.d or /etc/rc.d/init.d). In general, if you want to change the system-wide configuration of a particular program, this is where its config files will be.

/home

One very important thing to realize about Linux is that, like all unices, it is an inherently multi-user system. This means that there is normally a division between the superuser (root) and normal users. Normal users are supposed to be able to do whatever they want with their own files, but not be able to do anything with other people's files (unless given permission). For this to happen, each user needs a place to put his files where no one else can get at them unless he grants them permission. This place is normally a directory under /home whose name is the same as the user's username. For example, on my machine, my username is raistlin, so my home directory is /home/raistlin. On the computer science server my username is lansdoct, so my home directory is /home/lansdoct.

In addition to being the place where users store their files, the home directory is also the place where programs store per-user settings. This is typically done in files that begin with a "." so that ls won't show them by default (the other option is regular files inside a directory which starts with a "." in the user's home directory). While this idea of every program storing its configuration in separate files in the user's home directory may sound cumbersome, it is actually about the best way to do it.

First, it bypasses the security concerns that a central repository would have: the program being run needs no more permissions than the user already has. Second, it makes preferences much more portable. When I want to replicate my preferences from one computer to another, I simply copy the files beginning with a . in my home directory over to my home directory on the new computer. While this could also be achieved with a database system, in normal circumstances life just tends to work out better the fewer complicated tools are necessary. This method also keeps programs isolated from each other: there's basically no chance that a bug in one program could screw up my settings in another program.

It also allows one to maintain a revision history and backups of settings. For example, I keep a copy of my emacs configuration file, .emacs, called .emacs.old, which has some old settings that I wanted to test changing. I've also made copies of config files called .whatever.working. Because the settings are stored in regular files, I have all the functionality of the command line at my disposal rather than simply what the programmer of a central database method thought that I would need.

/lib

System libraries which are necessary to get a system up to at least minimal functionality are stored here. Examples are libc (which basically every Linux program links against), pam (an authentication library), DNS resolver libraries, and other extremely important libraries. The general rule is that /usr/lib may not exist when a system is booting up, but /lib must. Any libraries which must exist from the start are thus placed in /lib.

/proc

This is a filesystem which is dynamically generated by the kernel. It contains all sorts of interesting information. Just go look around in it using cat or less to see what's there. The names are generally fairly self-explanatory and when they're not the content is. Well, most of the time. Some of them contain either continuous streams of binary data or cryptic numbers, but in general they're pretty intelligible.

As a note, you will notice a lot of directories in /proc which are simply numbers. Those are the PIDs of running processes and those directories contain all sorts of interesting though cryptic information about those programs. This is where top gets its information to display.
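
Since the entries in /proc are just files, ordinary reads work on them. Here is a tiny sketch that dumps the current process's own status entry, using the /proc/self shortcut to the /proc/<pid> directory:

    #include <stdio.h>

    int main(void)
    {
        char line[256];

        /* /proc/self points at /proc/<pid> for the current process. */
        FILE *f = fopen("/proc/self/status", "r");
        if (!f)
            return 1;

        /* The first few lines include the process name, state, and PID. */
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);

        fclose(f);
        return 0;
    }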

/root

This is the home directory of root, the superuser. I am not sure why it's not part of /home, but I think the reason is that /home may well be on a different physical medium from the root directory, while /root is normally on whatever physical medium holds /. This way data or programs important at boot time that for some reason don't belong in their normal places can be stored in /root.

/sbin

This is where basic system executables are kept. Programs like lilo, reboot, halt, shutdown, swapon, fdisk, getty, hdparm, ifconfig, route, etc. are kept here. These are basically the programs that are needed to configure the system and its hardware early in the boot process.

/tmp

This is the location where temporary data is supposed to be stored. /tmp has full write permissions assigned to it, so anyone who wants to can write whatever they want to /tmp. Normally, there is a cleanup procedure which periodically cleans out everything from /tmp, so programs can be a little more sloppy about leaving files around in /tmp than they can be elsewhere, though they are supposed to delete unused files here as everywhere else.

/usr

This is the resting place of a large portion of the system, especially of programs, their data, and system libraries. It is quite possible that /usr won't be available to the system until partway through the boot process, so only programs, libraries, and data which can be done without for a while are stored in /usr. Thankfully, this normally comprises the overwhelming majority of those things.

/usr/bin

This is where most system executables are stored. Typically the programs here are placed here by a package management system and thus come as part of the system. In general, unless there's a reason for an executable that came with the system or was installed by a package manager to go somewhere else, this is where it goes.

/usr/sbin

The distinction between /usr/bin and /usr/sbin isn't the clearest in the world. In general, daemons, user management software, less important hardware configuration tools, and other system-ish tools that aren't going to be needed very early on go in this directory.

/usr/share

Data that is shared between users of applications generally goes here. Each program has its own directory. For example, the shared pixmaps for my checkbook program would go in /usr/share/checkbook.

There is also normally a directory /usr/share/doc/ which stores documentation in a similar fashion on a directory-per-program basis (if you can't find it in /usr/share/doc, try just /usr/doc).

/usr/lib

This is the directory where most system libraries are stored. While many of these libraries are extremely common, none of them (if things are done properly) would be necessary to get a minimally working system functional.

/usr/local

This is the place where a sysadmin is supposed to put programs and their associated data which he manually installed in the system. Package managers put regular binaries in the /usr directories rather than the /usr/local directories to preserve this distinction. The reason for this is slightly historical but mostly functional. It is best to preserve the distinction to make switching from the one to the other cleaner. It also allows one to back up custom programs separately from standard ones that one can easily get in an installation.

/var

This is, as the name implies, where variable data is kept. There is a convention that only files which are generated by programs while they are running go here. If you look in this directory you should be able to figure out what most of its subdirectories are for. For example, /var/run is where programs store files with their process ID, /var/spool is where programs that need to spool data to the disk do it (usually the printer daemon and the mail daemon), /var/log is where system logs are kept, and /var/lock is where lock files are kept.

System Call

A system call is a method of requesting some sort of service from the kernel. It is done for all sorts of things - opening files, opening network connections, getting more memory, executing a command, and many other things. It is done through a direct communication between the program and the kernel.
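
Most programs make system calls through thin library wrappers. As a small, Linux-specific sketch: getpid() below is just a wrapper around a system call, and syscall() invokes the same call by number directly.

    #define _GNU_SOURCE        /* for syscall() */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>   /* SYS_getpid (Linux-specific) */

    int main(void)
    {
        /* The libc wrapper and the raw system call return the same thing. */
        printf("getpid():            %ld\n", (long)getpid());
        printf("syscall(SYS_getpid): %ld\n", (long)syscall(SYS_getpid));
        return 0;
    }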

Processes

A process is another name for a running program. More technically, however, it is the name for what might be called an execution context. An execution context consists of the program's code, its data, memory allocated to it, resources allocated to it (such as open file descriptors, etc.), its environment (see the page on the command shell), and other important things which aren't universal to all programs.

All processes have what is known as a Process ID or PID (usually written as pid, though, because typing lower case characters is easier than typing upper case ones). This PID uniquely identifies a process within a system at a particular point in time. PIDs are reused, however, and a new program may eventually get the PID of a program which was terminated earlier. This is unlikely to happen in the short term, though: the kernel hands out PIDs in sequence and works through the whole PID space before wrapping around, so one isn't likely to accidentally kill one program while trying to kill another which just terminated.

When a program is launched, a new process is formed for it. It can start other processes. These are usually either children or threads.

Child Processes

A child process is a process that was created by another process and starts out as an exact copy of the creating ("parent") process. It gets its own copy of the parent's data rather than sharing it, so changes made by one are not visible to the other. It does inherit all open file descriptors, though.

Creating children is done through a process called forking (it is called this because it is done through a system call called fork). One normally creates child processes to do autonomous activities. For example, Apache creates child processes to handle extra http requests. Another use of forking is to dissociate a process from its controlling terminal; daemons (see below) will sometimes do this to free themselves from the terminal they were started from.
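
A minimal sketch of forking in C: both processes continue from the point of the fork and are told apart by fork()'s return value.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();   /* after this, two processes are running */

        if (pid == 0) {
            /* fork() returns 0 in the child. */
            printf("child:  pid %ld, parent %ld\n",
                   (long)getpid(), (long)getppid());
        } else if (pid > 0) {
            /* In the parent, fork() returns the child's PID. */
            printf("parent: pid %ld, child %ld\n",
                   (long)getpid(), (long)pid);
            wait(NULL);       /* collect the child so it doesn't linger */
        } else {
            return 1;         /* fork failed */
        }
        return 0;
    }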

Threads

A thread is similar to a child except that it has full access to the data of its parent (and a few other less important things). Threads are normally used to parallelize related tasks. For example, a web browser may use one thread to handle the user interface, another thread to handle the fetching of desired web pages, images, etc., and a third thread to process HTML pages for display. This usually results in greater responsiveness on the part of the program at the least, and faster execution time on computers with multiple CPUs.
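
A minimal pthread sketch (compile with -lpthread): the new thread shares the process's data, so it can modify the same variable directly. A real program would protect shared data with a mutex.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter = 0;   /* visible to every thread in the process */

    static void *worker(void *arg)
    {
        (void)arg;
        shared_counter++;            /* threads share the process's data */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        if (pthread_create(&tid, NULL, worker, NULL) != 0)
            return 1;
        pthread_join(tid, NULL);     /* wait for the thread to finish */

        printf("counter is now %d\n", shared_counter);
        return 0;
    }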

Return Value

Every program that is run returns a value; specifically, it returns a number. A return value of 0 means success and anything else indicates an error condition. The rationale behind this is that there is only one way to be successful but there are lots of ways to fail. Different non-zero values are used to indicate the reason for failure.
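
A sketch of both sides of this: a program reports success or failure through its own return value, and a parent can inspect the number a command returned (true and false are standard utilities that always exit 0 and 1 respectively).

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    int main(void)
    {
        /* Run a command and look at the number it returned. */
        int status = system("false");

        if (WIFEXITED(status))
            printf("command exited with %d\n", WEXITSTATUS(status));

        /* This program's own return value: 0 means success. */
        return 0;
    }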

Links & Symlinks

There are two types of links possible in a unix system: hard links and symbolic links. They are both methods of creating what are in essence aliases to a file.

A hard link actually creates another entry in the file system which points to the exact same data on the disk. Hard links are very rarely used, as they cannot point from a file on one filesystem to a file on another. On the other hand, the target of a hard link is always guaranteed to exist. There is no real reason for you to use them, as they have no great benefits over symlinks and can be more trouble.

A symbolic link is kind of like a map. It is not itself directly attached to the target file; rather, it contains the path to the target file. When a normal program tries to open it, it gets the contents of the target file. When a normal program writes to the symlink, what it writes is written to the target file. A program can choose to care whether a file it is dealing with is a symlink, though.

Symlinks do not suffer from the restrictions of hard links: they can span hard drives and even point to network drives. They do suffer from the problem that it is possible to have a symlink to a non-existent file, though. Also, a symlink always has full permissions turned on; the permissions actually enforced are those of the target file.

Directories may also be symlinked; they behave pretty much as you would expect them to.
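
A sketch of creating and inspecting a symlink from C; the file names here are made up for the example, and notice that the target doesn't even have to exist:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char target[256];
        ssize_t len;

        /* Create "alias" pointing at "original.txt" (example names;
           original.txt need not exist -- the symlink may dangle). */
        if (symlink("original.txt", "alias") != 0)
            return 1;

        /* readlink() reads the path stored in the link itself,
           rather than following it the way open() would. */
        len = readlink("alias", target, sizeof target - 1);
        if (len < 0)
            return 1;
        target[len] = '\0';

        printf("alias -> %s\n", target);
        unlink("alias");   /* clean up the example link */
        return 0;
    }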

Socket

There are several forms of sockets. They are all means to communicate between programs.

A network socket is something that one sets up when one desires a network connection. One makes a socket system call, then gets a file descriptor to the socket. One can then either set the socket up to wait for an incoming connection or set it up to make an outgoing connection. In either case, once the connection is made, one sends and receives data through the socket file handle that was created.

The other major sort of socket is called a Unix Domain Socket. It is a special sort of socket which is accessed through a special file created for that purpose in the file system. It is designed specifically for programs which wish to communicate with other programs in the system but not take the chance of letting programs from other computers talk to it. As well, there are usually performance benefits in the implementation of Unix Domain Sockets over network sockets.
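
A sketch of the client side of a Unix domain socket connection; the socket path here is made up, and a real server would have to have created that special file with bind() and be listening on it:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr;

        /* A Unix domain socket: AF_UNIX instead of AF_INET. */
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return 1;

        memset(&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        /* "/tmp/example.sock" is a made-up path for this sketch. */
        strncpy(addr.sun_path, "/tmp/example.sock", sizeof addr.sun_path - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0) {
            /* From here on it is just a file descriptor: read() and write(). */
            write(fd, "hello\n", 6);
        }

        close(fd);
        return 0;
    }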

Signals

A signal is a brief message from the kernel that something has happened. Signals are a little complicated, but not very. Basically, a process installs signal handlers for those signals that it wants to handle. A signal handler is a section of code that will get called when the process receives that particular signal. There are three common uses of signals. The first and most common is to tell a program to terminate in a generic way (for example, when shutting down, the shutdown scripts need a generic way to tell all programs to die; they can't pick the quit options off of menus). The second is to indicate error conditions. The third is as a timer mechanism: programs can request to be sent a signal (SIGALRM) after a given interval and use that as a timer.

In general, when a program doesn't handle a signal it dies.
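
A sketch of installing a handler, here for SIGTERM: instead of dying immediately, the program notices the signal and shuts down on its own terms.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_term = 0;

    /* The handler just records that the signal arrived; the main loop
       notices and shuts down cleanly. */
    static void on_term(int signo)
    {
        (void)signo;
        got_term = 1;
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_term;
        sigaction(SIGTERM, &sa, NULL);   /* install the handler for SIGTERM */

        while (!got_term)
            pause();                     /* sleep until some signal arrives */

        printf("got SIGTERM, exiting cleanly\n");
        return 0;
    }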

Here is a list of common signals by number:

1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 17) SIGCHLD
18) SIGCONT 19) SIGSTOP 20) SIGTSTP 21) SIGTTIN
22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO
30) SIGPWR

Interesting Signals

SIGHUP

SIGHUP is used to tell a program to "Hang Up". While this idea is basically useless in a modern context, it has been re-interpreted to mean "Reload Data Files" by at least one application (inetd).

SIGINT

SIGINT is sent to a program when you hit Ctrl-c. Most programs are kind enough to die when you send them this signal, though some handle it a bit differently and a few prompt you to ask whether you really mean it (mostly editors). See the command shell section for more info on this.

SIGKILL

Supposedly there was a manual to some Unix operating system which described SIGKILL as "Kill with extreme prejudice". It can neither be handled nor ignored. When a process gets SIGKILL, it simply dies.

SIGSEGV

SIGSEGV is an all too frequent sight that means that a program has suffered a segmentation violation. That means that it tried to access memory not allotted to it. While this description makes it seem malicious, it is normally caused either by a program forgetting that it has already returned memory to the operating system or by a program accidentally mangling the address of memory that it owns. It generally results in the death of the offending program; there isn't much that can be done once things have gotten to the point of a segmentation violation.

Zombies

When a child process quits, it returns a value to its parent. The parent is supposed to ask for this value and deal with it appropriately. If the return value isn't going to mean much, the programmer is supposed to at least provide a bit of code which acknowledges the child's death (a signal handler for the SIGCHLD signal which calls the system call wait()). Unfortunately, this doesn't always happen, and the result is what are known as zombie processes. Most of their resources have been deallocated, but they are still present in the process table with their return value. The only way to put them to their final rest is for their parent to finally wait() on them or for the parent to be killed. In a practical sense, you have to kill the parent.
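
A sketch of the polite-parent behavior described above: install a SIGCHLD handler that calls waitpid(), so finished children are collected instead of lingering as zombies.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Reap every child that has exited; WNOHANG means never block. */
    static void on_child(int signo)
    {
        (void)signo;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_child;
        sa.sa_flags = SA_RESTART;
        sigaction(SIGCHLD, &sa, NULL);

        if (fork() == 0) {
            /* The child does its work and exits... */
            _exit(0);
        }

        /* ...and the parent carries on; the handler collects the child's
           return value so no zombie is left in the process table. */
        sleep(1);
        printf("child has been reaped\n");
        return 0;
    }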

Note that, while annoying, zombies usually aren't detrimental to the system. You usually only have to worry about them in programs that you yourself are writing, as most programs that come as part of your system either don't spawn children or handle them properly.

Daemons

A daemon is a program that behaves differently from normal programs in a couple of ways. First, it does not have a controlling terminal as normal programs do. Second, it exists to perform some sort of service in the background. There are generally two ways to do this. One is as a server, as ftpd (the ftp daemon), inetd (the internet daemon), telnetd (the telnet daemon), httpd (the http daemon - usually Apache), etc. do. The other is as a program which performs a periodic task. For example, crond checks every minute for jobs to execute and, if there are any, executes them (cron is the periodic scheduler for running jobs at specific times; read more about it in the command shell section).
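
A rough sketch of the classic recipe a program uses to turn itself into a daemon (fork, abandon the controlling terminal with setsid, and point the standard descriptors away from it). Real daemons typically do a bit more, such as changing directory and setting up logging.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Detach from the controlling terminal and keep running in the background. */
    static void daemonize(void)
    {
        if (fork() > 0)
            exit(0);          /* parent exits; the shell gets its prompt back */

        setsid();             /* new session, no controlling terminal */

        /* Point stdin, stdout, and stderr at /dev/null. */
        int fd = open("/dev/null", O_RDWR);
        if (fd >= 0) {
            dup2(fd, 0);
            dup2(fd, 1);
            dup2(fd, 2);
            if (fd > 2)
                close(fd);
        }
    }

    int main(void)
    {
        daemonize();
        /* ...the daemon's real work would go here... */
        sleep(60);
        return 0;
    }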

Relative vs. Absolute Paths

An absolute path is a path which specifies every directory, starting from the root directory, necessary to get to a file. For example, if there is a file called startup.jpg in the directory /usr/share/checkbook/, then the absolute path to that file is /usr/share/checkbook/startup.jpg. If we were in the directory /usr/share, then the relative path would be checkbook/startup.jpg - notice that there is no leading / on the path. That is how an absolute path is distinguished from a relative path: any path which starts with a / is treated as an absolute path.

This is implemented at the system-call level in the open() system call. For this reason, this behavior is common to essentially all applications (except those which muck around with the filenames that you give them before calling open() on them), so you can assume this behavior of all Linux applications unless there is good reason to assume otherwise.
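
A small sketch of the difference as open() sees it, reusing the example file from above: the absolute path works from anywhere, while the relative path is resolved against the current directory.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* An absolute path is resolved from the root directory,
           no matter what the current directory is. */
        int a = open("/usr/share/checkbook/startup.jpg", O_RDONLY);

        /* A relative path is resolved from the current directory;
           after this chdir(), both open() calls name the same file. */
        chdir("/usr/share");
        int b = open("checkbook/startup.jpg", O_RDONLY);

        if (a >= 0) close(a);
        if (b >= 0) close(b);
        return 0;
    }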

Device files

As you are probably already aware, every resource is a file in unix. It's not 100% true, but it's close enough. Hardware devices are no different. The drivers which provide an interface to them do so through special device files. As was mentioned above, these device files are normally kept in the /dev directory.

A device file provides a direct and usually generic way for a program to deal with the driver for a device. This is done by reading from and writing to the device, as well as by using what are known as ioctls (I/O controls) on the device. Ioctls are valid on any file, but they only do something on special files. Ioctls are normally used for special operations like opening the CD drive tray or setting the sample rate on the sound card.
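
A sketch of an ioctl in practice, ejecting a CD tray through its device file (assuming the drive really is reachable as /dev/cdrom and the kernel provides the CDROMEJECT ioctl):

    #include <fcntl.h>
    #include <linux/cdrom.h>   /* CDROMEJECT */
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        /* /dev/cdrom is typically a symlink to the real drive, e.g. /dev/hdc. */
        int fd = open("/dev/cdrom", O_RDONLY | O_NONBLOCK);
        if (fd < 0)
            return 1;

        /* Reads and writes move data; ioctls handle everything else,
           like asking the drive to open its tray. */
        ioctl(fd, CDROMEJECT);

        close(fd);
        return 0;
    }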

Because all devices are accessed through their device files, it is typical to refer to a given device by its device file if there is no more significant name for the device. This is most common on hard drives. IDE hard drives are of the form /dev/hd while scsi hard drives are of the form /dev/sd. After that comes a letter which indicates which drive it is on the chain. /dev/hda is the first ide hard drive while /dev/sdd is the fourth SCSI device on the chain. Finally a partition number is given. /dev/hda1 is the first partition on the first hard drive. /dev/sdb4 is the fourth partition on the second SCSI device.