January 2002
Unixish operating systems (BSD, GNU/Linux, Solaris etc.) are quite complex. Although a large amount of documentation exists for these systems, the coverage is incomplete in several fundamental ways. This article describes the problem, a proposed remedy, and a supporting business model!
The FreeBSD distribution (to pick an example) contains hundreds of programs, thousands of library functions, tens of thousands of files, and thousands of manual pages. And that's just the base operating system; the FreeBSD Ports Collection, at 5000+ packages, is even larger!
FreeBSD is, by and large, a well-documented body of software. It has a comprehensive and actively maintained set of manual pages, a reasonable smattering of books, and a fair number of articles and papers. Most of the programs in the Ports Collection also have man pages and/or info files; many have supporting articles, books, FAQs, HOWTOs, manuals, papers, READMEs, and/or tutorials.
Moving away from FreeBSD, an enormous body of documentation is available for Free, Open Source, and (generic) Unix software. Most of this, being written by the software's developers, is succinct and authoritative. In summary, there is no overall shortage of Unix documentation. There are, however, some peculiar (and occasionally annoying) omissions.
The "FreeBSD File Formats Manual" (Section 5) contains 100 or so file descriptions. These cover the files in /etc pretty thoroughly, but largely give up at that point. Consequently, most of the 50K system files on my FreeBSD system have no documentation at all.
Even when a man page exists, the format description may be unusably vague. The descriptions contained in many Section 5 man pages are insufficient for creating the file in question, let alone writing a parser for it.
Man pages relegate relationships between items (e.g. files and programs) to a few FILES and SEE ALSO references, along with occasional mentions in the text. If the reader wants to know how and why a program accesses a file, s/he will probably have to go to the program's source code. If the reader wants to know which program(s) use a particular file, s/he may be in for quite a search...
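The few relationships that man pages do record can at least be harvested mechanically. The following is a minimal sketch, assuming the pages are written with the mdoc macros (where `.Xr` marks a cross-reference and `.Pa` marks a path name). The function name and the sample text are made up for illustration; real pages will need more careful handling.

```python
import re

# Matches mdoc cross-reference macros such as ".Xr passwd 5 ,"
XR_RE = re.compile(r"^\.Xr\s+(\S+)\s+(\w+)")
# Matches path macros, either ".Pa /etc/passwd" or ".It Pa /etc/passwd"
PA_RE = re.compile(r"^\.(?:It\s+)?Pa\s+(\S+)")


def extract_references(mdoc_source):
    """Return (man_page_refs, file_refs) found in mdoc-formatted text."""
    pages, files = [], []
    for line in mdoc_source.splitlines():
        xr = XR_RE.match(line)
        if xr:
            pages.append(f"{xr.group(1)}({xr.group(2)})")
            continue
        pa = PA_RE.match(line)
        if pa:
            files.append(pa.group(1))
    return pages, files


# Hypothetical excerpt in the style of a Section 5 page.
SAMPLE = """\
.Sh FILES
.Bl -tag -width /etc/master.passwd -compact
.It Pa /etc/passwd
.It Pa /etc/master.passwd
.El
.Sh SEE ALSO
.Xr chpass 1 ,
.Xr passwd 1 ,
.Xr pwd_mkdb 8
"""

pages, files = extract_references(SAMPLE)
print(pages)
print(files)
```

This only recovers what the page author chose to list. It says nothing about how or why a program touches a file, which is exactly the gap described above.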
Finding examples of library function or system call use can be even more challenging. nm(1) could be used to find out which programs use a given function or system call, except that binaries are typically stripped of symbols (to save space). Even if this were not the case, running nm(1) over hundreds of files is far from instantaneous.
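To make the nm(1) approach concrete, here is a rough sketch. It assumes an nm that accepts the `-D` option, which lists the dynamic symbol table; that table normally remains in dynamically linked binaries after stripping, though statically linked programs would still be opaque. The helper names are hypothetical, and the scan is as slow as the paragraph above suggests.

```python
import os
import subprocess


def undefined_symbols(nm_output):
    """Parse nm output and return the set of undefined (imported) symbols."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols carry no address, so the line is just "U name".
        if len(fields) == 2 and fields[0] == "U":
            # Drop any symbol-version suffix such as "@GLIBC_2.2.5".
            symbols.add(fields[1].split("@")[0])
    return symbols


def programs_using(symbol, directory):
    """Return sorted paths of files under directory that import symbol."""
    users = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            # Assumption: the local nm supports -D (dynamic symbols).
            result = subprocess.run(
                ["nm", "-D", path],
                capture_output=True, text=True, errors="replace",
            )
            # nm fails on scripts and other non-object files; skip them.
            if result.returncode == 0 and symbol in undefined_symbols(result.stdout):
                users.append(path)
    return sorted(users)
```

A call such as `programs_using("syslog", "/usr/bin")` would then list candidate programs. It runs one nm process per file, so a precomputed index would be far more practical than repeating the scan for every query.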
Most man pages cover specific items like commands, files, library functions, and system calls. As a result, they aren't very useful as a source of general information. Nor do most system administration books help much. Even the best books, such as "Unix System Administration Handbook" by Nemeth et al., concentrate mostly on "how to", rather than trying to explain "what is" and "why".
Programmers aren't much better off. "The Design of the Unix Operating System" (Bach) and "The Design and Implementation of the 4.4BSD Operating System" (McKusick et al.) are excellent "what is" books, but they only cover kernel issues. What about user-mode subsystems such as error logging, printing, and mail handling?
As noted above, we have access to many thousands of documents. They may have to be located, downloaded, unpacked, and/or formatted, but they do exist. Moreover, the source code is also available and extremely authoritative. Consequently, most researchers can find all the information they need. Eventually. On the other hand, the weaker souls among us might find it comforting and useful to have a bit of assistance from time to time.
One possibility might be a web-based service that could retain, index, and format this information, locating and presenting it for us at the click of a button. Some structural information (e.g. subsystem overviews), tied to an on-line repository of indexed and pre-formatted reference materials -- documentation, example files, and source code -- could provide an invaluable resource. In cases where the material cannot be found on-line, bibliographic references may still be useful.
In summary, we have a great deal of well-written documentation, but certain types of information can be very difficult to find. This causes all of us (users, administrators, and programmers) to waste time and effort. We can do better than this.
I have been working on these sorts of issues for a decade or so, but little of this work has ever seen the light of day. Back in 2000, however, I started actively promoting the notion of Integrating System Metadata and Documentation, forming the Meta Project as a vehicle for this activity.
The basic premise of the Meta Project is as follows: It is quite feasible to collect and integrate a great deal of information on the structure and relationships embodied in Unix systems, leveraging system metadata, existing documentation, and additional annotations. The resulting "knowledge base" could be a great help to administrators, programmers, and users.
As a "proof of concept", I prototyped a web-based demonstration program. The Meta Demo (aka the "FreeBSD Browser") accepts keywords, man page names, and/or full path names as input. It then produces a list of related files, which can be followed further (as links) by the user.
I was pleased to find that my rather naive implementation gives quite acceptable results. The Demo finds most of the items that a seasoned Unix user might find by an extended manual search, plus a few items that would have been unlikely to emerge. It produces very few erroneous results.
For those who are interested, the Demo is built in two parts: a batch-mode "system analyzer" and a CGI-based "front end". The system analyzer walks over the file tree and man pages, building a graph of relationships. The front end performs a constrained breadth-first search, starting at the specified item. Several hundred annotation files and a handful of rules make up for some of the deficiencies in the mechanically generated graph.
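The front end's search can be pictured with a short sketch. This is not the Demo's actual code; it is a generic breadth-first search in which the "constraints" are assumed to be a depth limit and a cap on the number of results. The graph contents, `related_items`, `max_depth`, and `max_results` are all hypothetical.

```python
from collections import deque


def related_items(graph, start, max_depth=2, max_results=20):
    """Breadth-first search from start, limited in depth and result count.

    graph maps each item to the set of items it is directly related to.
    Returns a list of (item, distance) pairs, nearest first.
    """
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue and len(results) < max_results:
        item, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in sorted(graph.get(item, ())):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            results.append((neighbor, depth + 1))
            queue.append((neighbor, depth + 1))
    return results[:max_results]


# Toy relationship graph; the edges are invented for illustration.
GRAPH = {
    "syslogd(8)": {"/etc/syslog.conf", "syslog(3)", "/var/run/log"},
    "/etc/syslog.conf": {"syslogd(8)", "syslog.conf(5)"},
    "syslog(3)": {"syslogd(8)", "logger(1)"},
    "syslog.conf(5)": {"/etc/syslog.conf"},
    "logger(1)": {"syslog(3)"},
}

for item, distance in related_items(GRAPH, "syslogd(8)"):
    print(distance, item)
```

Limiting the depth keeps the result list focused on items near the starting point, since almost everything on a Unix system is connected to almost everything else once enough links are followed.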
I was able to create the Meta Demo with only a few months of part-time effort, but building a production version is an entirely different matter. Years of full-time effort will be needed to build real infrastructure, research and write annotations and overviews, etc. Clearly, if the Meta Project is to proceed further, it will need a reliable source of income.
Looking around for a business model, I considered the idea of turning the Demo into a subscription-based service. Unfortunately, very few of us (myself included!) are willing to pay for on-line access to information. This doesn't make much sense, but it's reality. Nor are banner ads particularly profitable. So much, then, for a straight "online" play.
Eventually, however, I started to get an idea. The "knowledge base" created by the Meta Project could support both on-line and printed access. Why not produce topical "documentation collections", based on freely redistributable materials, for sale to the Free and Open Source software community?
Despite the many benefits of on-line access, printed volumes retain a favored status with many of us. They are economical, portable, and have a user interface which has been refined over millennia. And, in any case, there is no reason that on-line and printed documentation cannot work together.
So, laying aside my work on the Meta Demo, I created a specialized publishing system for Free and Open Source documentation collections. These would be a useful service to the community, and a real part of the Meta Project, but they would also bring in revenue.
Books, as published by traditional publishers, generally contain a great deal of proprietary content. This must be written, edited, formatted, and proofed (a process which can take months or even years). In addition, offset printing requires large and expensive production runs.
The resulting volumes then spend months bouncing around warehouses and sitting on resellers' shelves. Some are purchased, but the rest get returned to the publisher. This model works, more or less, for large publishers. It would not meet my needs, however.
So I developed a new model. My "documentation collections" would contain very little proprietary content. Using demand publishing and a highly mechanized editing process, they could be created and published in a matter of weeks, for relatively small amounts of capital. Because customer orders would be fulfilled from the printing firm's inventory, volumes would arrive within days.
As a result, I can produce large numbers of volumes covering specialized and/or rapidly-changing topics. SourceForge currently lists 31,086 projects. If only 1% of these projects have produced enough material to fill a volume, we're looking at 300+ volumes!
"Prime Time Freeware's Open Source Documentation Collections" is an awkward title for a book series, so I chose DOSSIER, "a collection or file of documents on the same subject", instead. Once I invented a suitable "retronym" (Documenting Open Source Software for Industry, Education, and Research), the decision was easy...
DOSSIER's business model is somewhat untraditional. It uses highly mechanized editing, demand printing, and Internet-based ordering. These work together to allow publication of volumes on topics that a traditional publisher could not even consider.
DOSSIER volumes are edited by a set of Perl scripts. These scripts convert incoming documents into PostScript, adjust margins, add descriptive footers, and so forth. Corrections and suggestions, in true Open Source tradition, are reported back to the documents' author(s).
The resulting files are then concatenated, along with some prefatory and index material (e.g. Table of Contents, Permuted Index), into a single PostScript file. Adobe Acrobat "distills" this into PDF (Portable Document Format), which is sent, along with the cover art, to the demand-printing firm.
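For readers unfamiliar with the term, a permuted index lists every significant word of each title in alphabetical order, with the rest of the title shown around it. The sketch below illustrates the idea only; it is not DOSSIER's generator (which is a set of Perl scripts), and the stop-word list, function names, and sample titles are assumptions made for the example.

```python
# Words too common to be useful as index keys (an assumed, minimal list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def permuted_index(titles):
    """Return (keyword, left context, right context) rows sorted by keyword."""
    rows = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])    # text before the keyword
            right = " ".join(words[i:])   # keyword and what follows it
            rows.append((word.lower(), left, right))
    rows.sort()
    return rows


def format_index(rows, width=30):
    """Right-align the left context so every keyword starts in one column."""
    return "\n".join(f"{left:>{width}}  {right}" for _key, left, right in rows)


# Made-up document titles, purely for demonstration.
TITLES = ["Introduction to Perl", "Perl Reference"]
print(format_index(permuted_index(TITLES)))
```

Because each keyword lines up in the same column, a reader can scan down the page for a term and see every title that mentions it, whatever its position in the title.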
After a proof copy has been created and approved, the printing firm produces and inventories a few dozen copies of the volume. When an order is received, the demand-printing firm pulls the desired volume(s) out of inventory, shipping the order directly to the customer. When a volume's inventory level falls low enough, the printing firm simply produces another batch.
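The restocking rule described above amounts to a simple reorder-point policy. Here is a small sketch of it; the threshold of 6 copies and the batch of 24 are invented numbers standing in for "low enough" and "a few dozen", and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    title: str
    on_hand: int             # copies currently in the printer's inventory
    reorder_level: int = 6   # assumed threshold for "low enough"
    batch_size: int = 24     # assumed size of "a few dozen" copies


def fulfil(volume, quantity):
    """Ship copies from inventory; print a new batch if stock runs low.

    Returns the number of copies printed as a result of this order.
    """
    if quantity > volume.on_hand:
        raise ValueError("order exceeds current inventory")
    volume.on_hand -= quantity
    printed = 0
    if volume.on_hand <= volume.reorder_level:
        volume.on_hand += volume.batch_size
        printed = volume.batch_size
    return printed


book = Volume("Example Volume", on_hand=8)
for _ in range(3):
    printed = fulfil(book, 1)
    print(f"shipped 1, printed {printed}, on hand {book.on_hand}")
```

Since every new batch is printed from the current PDF, any corrections made since the previous batch are picked up automatically at the next restock.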
Because each batch is small, and the editing is highly mechanized, incremental changes are easy and economical. This allows us to correct errors and omissions, and to keep up with rapidly-changing documents, even when a volume's sales are relatively modest.
DOSSIER is a very new series; the initial production run finished around Christmas! Consequently, our list of titles is still quite limited. Nonetheless, it suggests the kinds of topics we intend to cover:
A number of other volumes are in development and some prospective editors are starting to show up. With a bit of luck and a lot of work, DOSSIER should soon cover a substantial range of topics. If you would like to hear about upcoming volumes, drop a line to our announcement list, putting "subscribe" in the body of the e-mail.
If you would like to play a more active role in DOSSIER's development, please join our discussion list. (Don't forget to put "subscribe" in the body of the e-mail.) In particular, please contact me if there are any documents that you would like to see in print. What good is demand printing, after all, if we can't print the volumes that our customers demand?
Assuming that these volumes sell in sufficient quantity, the Meta Project will receive the funds it needs to continue development of the on-line system. Next month, I'll sketch out how that might function, hinting a bit at the underlying technology.
Ed. note: DOSSIER volumes are available for purchase at the BSDmall.