Working with package/depot/emt et al

Why?

We're all busy people. Why do any of us want to spend time learning our esoteric software management system? The goal is to minimize the amount of time we spend working on individual systems. Rewashing software for the same platform again and again (or manually installing the software over and over) takes a lot of effort, allows for more mistakes to creep in, and all around is just a bummer. We simply don't have enough people to do individual management of all of our unix machines.

The Andrew environment is designed to allow all of us to share the grunge work of making systems work, without preventing per-machine flexibility. The downside is that it has a learning curve all its own--but one which you will hopefully find well worth it.

What is this junk?

depot is primarily responsible for managing collection versioning. depot takes over the management of a directory hierarchy (in our environment, depot manages /usr/local, /usr/contributed, and /usr/host). No changes happen inside of this hierarchy without depot making them, ensuring that changes are reversible and reproducible. depot works by linking or copying various collections to the target directory and ensuring that these collections don't conflict. Individual collections can then be upgraded or installed and each file belongs to one and only one collection. depot by itself understands very little of versioning or per-machine customization; we use dpp, the depot pre-processor, for per-machine customization.

package is responsible for management of the operating system and other boot time configuration. It is the primary method by which we customize individual or classes of machines. package itself is very stupid; it merely knows how to make a filesystem resemble its configuration file. package is also usually the most irritating program on our system, since it will delete files that don't match its configuration file--all of us have seen package delete something we wanted to keep. Again, package by itself is fairly stupid and doesn't allow any simple inheritance. We use yet another pre-processor, mpp, to provide these features for package. Along with our use of mpp, we use a large set of conventions to make our package environment comprehensible.

wsadmin or /afs/andrew.cmu.edu/wsadmin is the directory hierarchy on AFS which holds large numbers of fragments of package (and occasionally depot) configuration files. The mpp processor knits these fragments together to form a complete package configuration. These convention allows us to configure Apache on a machine with a single %define doesapache in a /etc/package.proto instead of manually inserting the tens to hundreds of lines of package configuration Apache would normally need.

emt works with adm to perform delegated software management. emt manages a set of environments (a "beta" and a "gamma" environment for each systype) and allows collection maintainers to release software to those environments. Since emt uses fairly long and annoying commands, the Perl script carpe generates the appropriate command to run after a simple interactive dialog and automatically e-mails it to a bboard (these bboards start with org.acs.asg.request). Individual maintainers can generally affect the beta environment directly. Gatekeepers are generally responsible for releases to the gamma environment.

How can we use the process to our advantage?

most of these things are ideas on what we should be doing, not what we're necessarily doing now.

Beta releases are at the discretion of the collection commanders.
Gamma releases every Wednesday afternoon. All queued release requests go out every wednesday unless there are known problems with the version in beta.
Any software that sits in beta for more than 4 weeks without a gamma request or a new beta release causes the collection commanders to be bugged. (how to automate this or make sure it happens?)
Gamma software does not change besides on wednesday unless there is an emergency fix (see below).
All software should go to gamma before it is deployed on any production machine.
Backup /usr/{local,host,contributed}/depot/depot.pref for all production servers (for last resort restore when a root disk crashes). depot.pref is the result of dpp running on the depot.pref.proto and is sufficient for depot to rebuild the trees exactly as they last were. It reflects the current state of released software with any overrides in the depot.pref.proto.
Use beta machines for development. Compile all software besides "emergency" fixes on beta machines. If your software relies on another collection that is different between beta and gamma, make every effort to push that collection out instead of just compiling against gamma.
Verify some sort of functionality before releasing to beta (program starts, doesn't reformat hard disk, etc.)
Testing machines may be beta (early testing) or gamma (late testing and verification before pushing to production servers). Some internal services (such as asg.web.cmu.edu or bugzilla.andrew.cmu.edu) may run as beta machines to try to ensure greater testing of the beta environment.
Use gamma machines for production use. Production machines should copy as much software as possible. create and use some sort of "%define copymost"?
Avoid referencing specific versions in depot.pref.proto. It's tempting to require a production machine to use specific versions of software, but experience shows that unless the machine maintainer is paying close attention, version skews between collections can start causing subtle problems.

Specific examples

Major upgrades

Minor upgrades

Root disk crash

Emergency infrastructure fixes

bboard posts, release, upgrades, etc.

Emergency application fixes

Where should I configure these things?

Workstation configuration

Clusters are gamma machines.

Computing service desktop machines are generally beta machines. You might want to have /usr/local/depot/depot.pref.proto:

%define beta
%define tree local
%include /afs/andrew.cmu.edu/wsadmin/depot/src/depot.include

searchpath * ${local}
collection.installmethod copy netscape,lemacs,kerberos,maplev,com_err,gnucc,gdb
collection.installmethod copy gnome,xfree86
collection.installmethod copy openssl,mozilla

Add to the list depending on what applications you use frequently. (This is only for performance.)

Your workstation will depot nightly. You can cause depot to copy a specific version of a collection with a line like

path cyrus ${dest}/cyrus/064

which will cause the Cyrus version 064 to by installed on your computer. This is useful for testing new versions before beta release or examining how old versions worked.

You want to reboot whenever new OS versions are put into beta (see bboards); probably around once a month is a good choice, or after you run package. (Always reboot after running package!)

Production servers

The primary question for production machines is "how often should they update"? The more frequent they update, the more times something may break--and frequent updates means that people are probably not paying close attention to each update. On the other hand, less frequent updates cause each update to be much bigger, which means tracking down what change caused a bustage can be much more complicated. Infrequent updates can also complicate security fixes--ideally, security fixes would require a very small software change but if a machine is too far behind the times, it will require a special version or a large update to stabilize.

If possible, production machines should reboot weekly, causing depot and package to run at each reboot. Generally, redundant services such as SMTP servers, Unix servers, or DNS servers should have no problems meeting this requirement, since they can reboot on a staggered schedule and cause little or no user visible outages. (Our users are remarkably tolerant of daily outages: the Unix servers are unavailable for 10-30 minutes every day with few complaints.) A single redundant server can be down for an extended period of time, so if an environment change has broken the server it is not a catastrophe.

Non-redundant servers need to balance the need for uptime versus the resources we want to spend as system administrators. While we've made some changes to package and depot to have them run faster, our server hardware tends to reboot slowly. Non-replicated file servers (such as Cyrus backends or AFS user servers) can cause interesting questions. Lately, we've rebooted AFS servers weekly (with little complaint) but have attempted to minimize the downtime for Cyrus backends. Non-replicated services can also suffer from the "unintended upgrade" effect--a seemingly unrelated change causes downtime, and causes downtime when no system administrator is immediately available to fix it. Possible remedies to this include:

careful monitoring of release logs by system administrators
automated testing by a regression suite on a beta machine

Most of our modern servers use Unix as a substrate but provide user access through a well-defined protocol served by specific application software. Since this application software is usually not run by ordinary users, it is generally put in the /usr/host tree (to provide versioning using depot and emt) and then copied from host onto the local disk by package at boot time. While this provides generally good flexibility, it suffers from the lack of versioning in the wsadmin area. wsadmin can tell if there's a beta or gamma environment but is unaware of the exact version of the software being run. (one possibility is package fragments in /usr/host areas?)