How CQUEST installs generic machines

CQUEST workstations (and a few simple sorts of servers) are generic, interchangeable machines, and they come in significant numbers. Naturally we don't want to install each by hand, so we have automated the process.

Our first attempt at automating machine setup was to configure a single master machine and then copy its disk image to each workstation, changing a few machine-specific things afterward. This approach turned out to have a number of problems in practice.

Rather than try to work around these issues, CQUEST was able to jump straight to what we really wanted: automated installation of machines, both of the operating system and of the necessary local customizations and additions.

The Overview

There are two important jobs to be done in automated system installs: installing the base operating system and then adding the necessary local customizations. Both are necessary; no matter how many options the vendor's install process has, sooner or later there will be something one needs to change that is not controlled by it.

CQUEST uses Fedora Core Linux from Red Hat, which has an automated operating system installation method called Kickstart. The relevant mechanics of Kickstart are described in more detail later on.

After Kickstart has finished installing the operating system, it passes control to any post-install scripts that one has specified. CQUEST uses such a script to start a flexible and powerful system that drives all further machine setup and localization.

Flexible Localization

CQUEST has found that we need to change and customize the just installed system in many different ways, from adding Red Hat operating system updates to changing what daemons are run to adding local software and more.

CQUEST uses vendor-supplied mechanisms to make our changes whenever feasible, because it simplifies our lives (and not only in that we don't have to write tools to make the changes). Most times one steps outside of vendor-supplied tools to make changes, one winds up in a situation where one's own view of how the system is or should be conflicts with that of the vendor tools. If nothing else, such differences of opinion often make otherwise valuable vendor-supplied consistency-checking tools relatively useless; they also tend to significantly complicate the process of applying vendor-supplied system updates.

With a diversity of changes to make to just-installed systems, and a consequent diversity of mechanisms (vendor-supplied and otherwise) to make them with, our localization process must be highly flexible. Such flexibility is not easily achieved through monolithic, all-encompassing programs, and so CQUEST has made no attempt to use such systems; instead we have built our environment out of a collection of small programs and shell scripts.

Kickstart

As previously mentioned, Kickstart is the Fedora Core method of automatic installation. It's not a separate program, but instead a special mode of the normal (interactive) installer, where the answers to questions are taken from a configuration file (the Kickstart file). If the Kickstart file doesn't contain all of the necessary information, the process stops to prompt normally for it (although some questions are just given default answers).

The Kickstart process (and the normal install process) needs a source for the data files that will be unpacked and installed to create the actual operating system. For normal installs, typical sources are CD-ROM or DVD drives; for automated installs or for people with good network connections, the installer also supports getting at the necessary data via ftp, http, or NFS.

NFS is the method CQUEST uses, because it has worked out to be the simplest. Our main fileserver, the logical place to put all of the data files, is already doing NFS. We don't have to start a web or a ftp server on it, or try to overload our normal web server for users with a pile of data only of interest to install programs.

CQUEST's Kickstart control files contain nothing particularly novel or interesting, and are essentially generic except for a few configuration issues. It helps that we run all of our machines with the same monitors and at the same resolution, and that Kickstart supports reasonably flexible disk partitioning (including the ability to expand a partition to use up all the otherwise unused space).

The one complication is that as a lab environment, CQUEST uses statically assigned IP addresses based on where a machine is physically located in the lab. This means that Kickstart has no hope of automatically discovering the machine's IP address and associated information, and in turn that we have to tell it when we start it up.

Demystifying the installer

The Fedora Core installer is actually a somewhat specialized Linux environment: a kernel and a ramdisk (an initrd, short for initial ramdisk, a ramdisk that is loaded by the bootloader and passed to the kernel as the kernel boots). When the kernel boots, it starts up the programs in the ramdisk; such programs are ordinary Linux programs, running in a relatively ordinary Linux environment.

One important consequence is that one need do nothing special to start the installer environment: you just boot the installer Linux kernel with some command line options and the installer ramdisk. Another is that one can add to the contents of the ramdisk (or even change the contents, if necessary).

The installer kernel can be given various parameters on its command line that will be passed on to the installer program itself. The installer sets various parameters based on those arguments, such as deciding whether to use a graphical interface or a text-based one. One important use of this is to specify that you want an automated Kickstart install and where to find the Kickstart configuration file to use for it.

How we start Kickstart

CQUEST has two distinct cases of machines which we want to Kickstart, and thus two different ways of doing it: existing, running machines that are being reinstalled, and new machines that have just been shipped to us.

Existing machines being reinstalled are the simpler case. Such machines already have IP address information assigned and have an environment that is set up to boot Linux kernels. What we do is:

When the machine reboots it starts up the installer kernel and ramdisk, fishes the Kickstart configuration file out of the ramdisk, and reinstalls. Because all of the setup can be done in Linux and nothing requires visiting the machines in person, we can reinstall or upgrade an entire lab of workstations from our desks.

New machines present the problem that they neither know their IP address nor have anything in particular on their hard drives. They need everything supplied from outside. CQUEST's current solution is a bootable USB pen drive with the Fedora Core installation environment on it. The install environment has been augmented in two ways.

First, the installer ramdisk has our generic Kickstart configuration file added. This specifies everything except the machine's IP address information. It is a slightly altered version of the normal reinstall Kickstart, designed to be safer in the possible presence of a USB drive on the system during install, as the last thing we want is for the system to decide that the USB pen drive is a usable system hard drive and to erase it, partition it, and put stuff there.

Second, the bootloader's configuration file has been replaced with one where the labels are the names of CQUEST machines, such as argon.esc. Each label boots the installer environment with the network information (IP address, gateway, nameserver, etc) appropriate for that machine specified on the kernel's command line. A final entry called askme causes the installer to ask the user for network information, so we can handle machines not in the list for some reason.
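
A configuration file of this shape could be generated from a host table rather than written by hand. The following is only a sketch in Python, not CQUEST's actual tooling: the stanza layout follows syslinux conventions, the host neon.esc, all addresses, and the ks.cfg file name are placeholders, and the ks=, ip=, netmask=, gateway=, and dns= arguments are the installer boot options as we understand them for installers of this era, so check them against your installer's documentation.

```python
# Hypothetical host table; the addresses are documentation-range placeholders.
HOSTS = {
    "argon.esc": "192.0.2.11",
    "neon.esc": "192.0.2.12",
}
NETMASK = "255.255.255.0"
GATEWAY = "192.0.2.1"
DNS = "192.0.2.2"


def boot_entry(label, extra_args):
    """Return one syslinux-style stanza that boots the installer."""
    # ks=file:/ks.cfg points at a Kickstart file inside the ramdisk;
    # the file name is an assumption made for this sketch.
    append = "initrd=initrd.img ks=file:/ks.cfg"
    if extra_args:
        append += " " + extra_args
    return f"label {label}\n  kernel vmlinuz\n  append {append}\n"


def bootloader_config(hosts):
    """Build one stanza per known machine, plus a final 'askme' stanza."""
    entries = []
    for name in sorted(hosts):
        net = f"ip={hosts[name]} netmask={NETMASK} gateway={GATEWAY} dns={DNS}"
        entries.append(boot_entry(name, net))
    # Assumption: with no network arguments, the installer prompts for them.
    entries.append(boot_entry("askme", ""))
    return "\n".join(entries)


if __name__ == "__main__":
    print(bootloader_config(HOSTS))
```

Keeping the network details in one table means a new machine is added in exactly one place.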

The net effect is that one can boot a new PC from the USB pen drive and type a name in to have the PC be installed as that machine. This significantly simplifies the setup of new PClones. While we could get away with just the askme entry, using machine names drastically lowers the possibility of errors, both because it bundles all of the information in one spot and because a mistake in a machine name is much less likely to still be valid than a mistake in an IP address.

It sometimes happens that our install process has bugs. Usually what results is a machine that installed the basic operating system but failed to apply our localizations; such a machine is on the network, but has no NFS mounts and can't be accessed remotely. Such machines could be installed again via the USB pen drive, but we prefer to run a shell script fetched from our web server (with lynx -dump ... | sh) that NFS-mounts the necessary data areas and runs the setup scripts for a normal reinstall. Not only does this save us from having to shuffle USB pen drives around, it serves to reduce the chance of errors by eliminating redundant specification of information: we cannot accidentally re-do a machine with the wrong name.

The CQUEST post-install system

After Kickstart has finished with the basic operating system installation, it runs any post-install scripts specified in the Kickstart configuration file. CQUEST's post install script is where all of our customization and setup work is done.

The post-install script runs essentially in the just-installed system itself (it is chroot()ed to the installed system's root). As a result, it sees an environment essentially identical to the normal system in single-user mode. It needs do nothing special or out of the way to manipulate the system, although some things need a little bit of setup work. This vastly simplifies the work of writing and testing the customization steps, since one needs no special environment.

Because the Kickstart configuration file itself has only a limited amount of space for scripts, all that the actual post-install script there does is to use NFS to mount our setup area from our main fileserver and start the real post-install shell script that lives there, unimaginatively called ks-post-install.

ks-post-install's ultimate job is to run a series of small programs, each of which will perform one step in the machine's setup. In order to do this, it goes through four steps: identifying characteristics of the machine, generating a basic list of customization stages to expand, expanding the stages into commands to execute, and executing the commands.

Identifying machine characteristics

The first job is to determine five characteristics of the machine being customized: the lab it is in, its hostname, the operating system release it's running, the type of hardware it has (if its hardware is of a known type), and its role (which comes in two flavours: explicitly specified, or determined by default by the hardware characterization). Technically there is a sixth characteristic, the overall domain, but for us this is a constant.

This is done by running all of the programs found in the identify subdirectory of the setup area. If a program successfully identifies some aspect of the machine, it echoes out information on what it has identified and what the value is; unsuccessful programs are expected to be silent.
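
The identification loop can be sketched roughly as follows. This is an illustration in Python rather than the real ks-post-install (which is a shell script), and the output format it assumes, one "name value" pair per line, is a guess; the text above only says that successful programs echo what they identified and its value.

```python
import os
import subprocess


def run_identify(identify_dir):
    """Run each executable in identify_dir and collect 'name value' lines."""
    facts = {}
    for name in sorted(os.listdir(identify_dir)):
        path = os.path.join(identify_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        result = subprocess.run([path], capture_output=True, text=True)
        # Unsuccessful programs print nothing, so they add no facts.
        for line in result.stdout.splitlines():
            parts = line.split(None, 1)
            if len(parts) == 2:
                facts[parts[0]] = parts[1].strip()
    return facts
```

Because silence means "no opinion", adding a new identification program never requires touching the loop itself.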

Because we can have as many programs as we need, each program is free to incompletely cover its domain, and thus can be small and simplified. For example, a single program that tried to identify all of our known hardware types would soon become large and convoluted. Having a separate program to identify each known type of hardware means that each program can be small and simple.

To make hardware identification programs simpler and easier to write, we have an underlying Python support module that parses various sources of hardware information. The actual process of identifying each machine type then consists of simple code that asserts that the machine must have or not have various things: various PCI devices, certain sorts of CPUs, etc. The library code supports running the framework in a special mode which produces detailed reports of what does and doesn't match, for ease of seeing just why a machine is not properly identified.
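
The flavour of such an assertion-based check might look like the sketch below. It is not the actual CQUEST support module: the function name, the idea of representing hardware as a set of strings, and the identifiers in the usage example are all invented for illustration.

```python
def check_machine(info, require=(), forbid=(), explain=False):
    """Return True if info has every required item and no forbidden one.

    info is a set of hardware identifiers (a hypothetical representation).
    With explain=True, print a line per assertion to show what matched.
    """
    ok = True
    report = []
    for item in require:
        present = item in info
        ok = ok and present
        report.append(f"require {item}: {'ok' if present else 'MISSING'}")
    for item in forbid:
        present = item in info
        ok = ok and not present
        report.append(f"forbid {item}: {'PRESENT' if present else 'ok'}")
    if explain:
        print("\n".join(report))
    return ok


if __name__ == "__main__":
    # Made-up identifiers, purely for demonstration.
    machine = {"pci:8086:2562", "cpu:pentium4"}
    check_machine(machine, require=["pci:8086:2562"],
                  forbid=["pci:10de:0110"], explain=True)
```

Evaluating every assertion instead of stopping at the first failure is what makes the report mode useful: it shows all of the mismatches at once.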

At the end of the identification stage, most unknown values are filled in with a placeholder of the form TYPEunknown (for example, hwunknown for unknown hardware), and if there is a default role but no specific role, the default role becomes the role.

Generating the stages list

CQUEST organizes the customization process into stages, jobs that need to be done. In this step, ks-post-install accumulates the stages list by trying to find a series of configuration files, and if a file exists allowing it to append entries.

Entries can be positive (jobs to do) or negative (jobs not to do; negative entries win over positive entries). Positive entries are simple names, like dnsserver; negative entries are simple names that start with a '-', like -workstation. The result of configuration file reading is a list of positive and negative entries mixed together; negative entries take effect only during stage list expansion, later.

Configuration files have names of the form config-cquest-OS-LAB-HWTYPE-ROLE-HOSTNAME, where each capitalized machine characteristic (and its preceding '-') may be omitted; however, those that are present must appear in that order. Any configuration file whose name properly matches some or all of the machine characteristics of the current machine will be read (in lexical order) to add stages.
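
One way to picture the matching rule is to enumerate every name the current machine could match and then keep the ones that exist. The sketch below does this in Python; the dictionary keys and the example values in the usage note are assumptions, and the real script may well implement the rule differently.

```python
import os
from itertools import combinations

# The fixed order in which characteristics may appear in a file name.
ORDER = ("os", "lab", "hwtype", "role", "hostname")


def candidate_names(machine):
    """Return every config file name this machine matches, in lexical order."""
    values = [machine[key] for key in ORDER]
    names = set()
    for count in range(len(values) + 1):
        # combinations() keeps the original order, so omitted characteristics
        # simply drop out without reordering the rest.
        for combo in combinations(values, count):
            names.add("-".join(("config-cquest",) + combo))
    return sorted(names)


def matching_files(setup_dir, machine):
    """Return the candidate names that actually exist in setup_dir."""
    return [name for name in candidate_names(machine)
            if os.path.isfile(os.path.join(setup_dir, name))]
```

For a machine described as fc2, esc, d530s, workstation, argon, this yields 32 candidates, including config-cquest itself and config-cquest-fc2-d530s, but never config-cquest-d530s-fc2.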

Expanding the stages list

Once ks-post-install has read the configuration files to generate the stages list, it must turn it into a list of commands to run, handling negative stages as appropriate. There are two parts to this.

The names in the stages list can stand for two things: for actual commands to be run, or for 'macros' to be recursively expanded to put additional stages on the stages list. 'Macro' stage names are created by putting a file in a magic directory (stages/exp/OS) that contains a (possibly commented) list of additional stages, one per line; the stages can be normal positive entries, or negative entries.

Negative entries are never macro expanded; they block the positive version of themselves, but not what the positive version would have expanded to. This allows us to safely reuse the same lower-level step in multiple higher-level contexts. For example, it would be safe for both workstation and labmaster to enable USB without having to worry that someday the hwunknown configuration file will specify -workstation and wipe out USB on a labmaster with unknown hardware.
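
The interaction between macros and negative entries can be sketched as below. This Python version encodes one reading of the rules, namely that a negated name is dropped without being expanded and that negative entries in the initial list apply regardless of their position; the real expansion program may differ in such details. The stage names usb and xwindows are invented for the example.

```python
def expand_stages(stages, macros):
    """Expand macro names recursively; return the surviving plain stage names.

    stages: list of entries such as "workstation" or "-workstation".
    macros: dict mapping a macro name to the entries listed in its file.
    """
    # Negative entries already in the list win wherever they appear.
    negative = {entry[1:] for entry in stages if entry.startswith("-")}
    plain = []
    seen = set()
    pending = list(stages)
    while pending:
        entry = pending.pop(0)
        if entry.startswith("-"):
            negative.add(entry[1:])  # recorded, never macro expanded
            continue
        if entry in seen:
            continue  # also protects against macros that refer to each other
        seen.add(entry)
        if entry in negative:
            continue  # blocked: neither run nor expanded
        if entry in macros:
            pending.extend(macros[entry])
        else:
            plain.append(entry)
    # Negatives found late (inside macro files) still block plain names.
    return [name for name in plain if name not in negative]


if __name__ == "__main__":
    macros = {
        "workstation": ["usb", "xwindows"],
        "labmaster": ["usb", "dnsserver"],
    }
    print(expand_stages(["workstation", "labmaster", "-workstation"], macros))
```

In the usage example the labmaster keeps usb even though workstation was negated, which is the property the paragraph above describes.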

Once no further macro expansion is possible, the expansion program attempts to turn what's left into the names of programs to run. It does so by reading a specific directory (stages/cmd) and looking for matches against the remaining names. Matches may be as-is, or they may consist of the command name prefixed by something- (where the something cannot itself contain a dash). Once every name has been successfully matched to a program, the programs are listed in lexical order.

The prefix matching is used to create the ordering of programs to be run, while not forcing us to put that order into all of the places that refer to them. High-level stages can refer to simple accessible names like daily-update, localmnt, upgrade-rh, or cquest-post without having to care what the ordering is, or to contort the names to get the right ordering. Changing the ordering is done in one place by use of mv, and ls on the directory suffices to show the order.

By convention, all of the prefixes are two numeric digits, resulting in full command names like 05-localmnt, 65-cquest-post, and 98-daily-update. So far we have not needed more than two digits, and indeed the space of priorities is still somewhat sparse.
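
The name-to-command matching can be sketched in a few lines of Python. This is an illustration only: treating a missing or ambiguous match as an error is our assumption, since the text does not say what the real program does in those cases.

```python
def resolve_commands(names, available):
    """Map stage names to command file names, returned in lexical order.

    names: stage names left over after macro expansion.
    available: file names found in the command directory.
    """
    commands = set()
    for name in names:
        matches = []
        for cmd in available:
            # Splitting at the first dash guarantees the prefix has no dash.
            prefix, dash, rest = cmd.partition("-")
            if cmd == name or (dash and prefix and rest == name):
                matches.append(cmd)
        if len(matches) != 1:
            raise LookupError(f"{name}: expected one command, found {matches}")
        commands.add(matches[0])
    return sorted(commands)


if __name__ == "__main__":
    available = ["05-localmnt", "65-cquest-post", "98-daily-update"]
    print(resolve_commands(["daily-update", "cquest-post", "localmnt"], available))
```

The rule that the prefix cannot contain a dash is what stops a stage named update from accidentally matching 98-daily-update.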

Executing the commands

The result of expanding the stages list is a list of commands to run in order. The final stage is to do just that, exporting to each of them the various machine characteristics. If any command fails, the entire ks-post-install run aborts, hopefully leaving the system in a state where the exact failure is readily apparent.

It has proven useful to give ks-post-install an option to skip to a specific command and start running things from there. The Kickstart environment gives one a shell on a virtual terminal during the installation, and the combination means that it is possible to iteratively retry failing commands as we correct problems.
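
The execution step, including the skip-ahead option, might be sketched like this in Python. The function and parameter names are invented, and passing the machine characteristics as environment variables is our interpretation of "exporting"; the variable names used for them are left to the caller.

```python
import os
import subprocess


def run_commands(cmd_dir, commands, facts, start_at=None):
    """Run commands in order, stopping at the first failure.

    facts: machine characteristics, exported as environment variables.
    start_at: if given, skip everything before this command name.
    Returns the list of commands that were run.
    """
    env = dict(os.environ)
    env.update(facts)
    if start_at is not None:
        commands = commands[commands.index(start_at):]
    for cmd in commands:
        status = subprocess.run([os.path.join(cmd_dir, cmd)], env=env).returncode
        if status != 0:
            # Abort the whole run so the failure stays easy to find.
            raise SystemExit(f"{cmd} failed with status {status}")
    return commands
```

After fixing a broken step, calling the function again with start_at set to that step's name resumes from there instead of repeating the steps that already succeeded.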

Many of the commands are strongly data-driven: they do generic operations to data they find in known places in the setup directory tree. For example, all of the vendor supplied updates for a particular operating system version are dropped in rpms/OS/upgrade and the list of files to adjust permissions on is files/OS/fixup-files. This means that the command that applies vendor updates and the command that fixes permissions after the install are generic; they look for work in the magic place, and if they find work they do it. This is in contrast to an approach that would see us writing one script to fix permissions on Red Hat 9 and another to fix permissions on Fedora Core 2; clearly that approach would be more work.
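
A data-driven command of this kind can be very small. The Python sketch below only builds the command line for applying vendor updates and does not run it; the function name is invented, and the choice of rpm -Fvh (freshen packages that are already installed) is an assumption, since the text does not say how the updates are actually applied.

```python
import glob
import os


def upgrade_command(setup_dir, os_name):
    """Return the rpm command for this OS's updates, or None if there are none."""
    pattern = os.path.join(setup_dir, "rpms", os_name, "upgrade", "*.rpm")
    packages = sorted(glob.glob(pattern))
    if not packages:
        return None  # no work found in the magic place, so nothing to do
    # -F freshens installed packages; -v and -h give verbose progress output.
    return ["rpm", "-Fvh"] + packages
```

Nothing in the function mentions a particular release: supporting a new operating system version means creating a new directory of packages, not editing code.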

This data-driven nature has two important consequences. First, it means we only have to make a change in one place to change the behavior of the system, and it is a place we would have had to update anyways. If we need to install a hardware specific RPM package on HP D530s machines running Fedora Core 2, we just put it into rpms/fc2/cquest/hw/d530s/; we don't have to put it there, add a command in stages/cmd, and update or create config-cquest-fc2-d530s.

The second consequence is that it is quite easy and fast to upgrade to a new operating system version. There is no wholesale need to rewrite old commands or write new ones and no duplication of past work; all we have to do is to populate a few trees, much of which can be copied straight from the last version.

Code Availability

We haven't tried to generalize the code, but much of it is not particularly specific to CQUEST. We'd be happy to give copies to anyone who's interested. Contact the CQUEST system staff, but you should know that this is about all of the documentation that's available.

Scribbled by Chris Siebenmann