New BRL-CAD Database Format, Version 5: Appendix

VIII. Appendix: Earlier Proposals for Database Organization

VIII(a). First Proposal for Database Organization

The database stores objects regardless of whether they are combination nodes or leaf nodes. The object is the smallest granularity of an item in the database; objects must be read from and written to the database in single atomic operation. All objects share certain common properties, as part of a standardized object wrapper.

Every object has an object name; each object must have a unique name drawn from a single global name space.
Every object has an object length which indicates the total number of bytes required for this object including the magic number.
Every object has an object type, which is always 16 bits wide. There are reserved values for database headers and free storage objects. When the database objects are used to hold BRL-CAD geometry, there is a special type for combinations, all other objects are leaves (solids).
Every object has an object body which contains a contiguous block of data. From the point of view of the object/database interface specification the format of this data is opaque. When the object contains a BRL-CAD solid or combination the precise structure of this block of data will be as specified further along in this document.
Every object has an object comment which contains a string of an arbitrary number of newline terminated lines of text. None of the BRL-CAD software will interpret these strings in any way, they are provided for the modeler to place notes in. These are strings for use by humans only.
Every object has an object attribute field which contains an ASCII string of unlimited length, which has the format of "attribute_name=attribute_value ...". These attributes are intended for direct use by programs. There will be a WWW registry of attribute names presently in use, to prevent two application developers from using the same attribute_name for different purposes. One such attribute would be the national stock number (NSN), another would be the ISO (or ASME?) standard material code.
??? Does this belong at the database object level, or the Combination/Region level. The semantics of the two choices are very different....
Lee has suggested that it might be beneficial to have a binary version of object attributes, for more compact storage of bulky attribute_values.

The on-disk version of each object on the disk has the following general form, with two distinct parts, the Object Wrapper and the Object Body:

	+--------------------+
	| Magic Number1      |
	| HFlags             |
	| IFlags             |
	| Object Type        |   (Non-compressible basic header)
	| Object Length      |
	| Name Length        |   conditional on N bit
	| Object Name        |   conditional on N bit
	| +------------------+
	| | Object Comment   |   "Object Interior"
	| | Object Attribs   |   (Compressible portion)
	| | Replacement List |		<-- deferred implementation
	| |  +---------------+
	| |  | Object Body   |
	| |  +---------------+
	| | (Future Ext)     |
	| +------------------+
	| Magic Number2      |
	+--------------------+

The routines db_get_external() and db_put_external() are used to move objects in external format between memory and the database disk file.

The routines db_wrap_external() and db_unwrap_external() are used to wrap and unwrap the Object Body (already in external form) in a standardized database object's wrapper.

The external format has several important properties, especially with regards to the Object Body:

Numbers are stored in binary, for storage efficiency, for speed of reading and writing, and to prevent errors from creeping in due to repetitive conversion between binary and an ASCII decimal printed representation. This eliminates the need to use the old g2asc and asc2g to move databases between different machine architectures.
All data in the object wrapper are stored in a machine-independent format:
- All floating point numbers are stored as IEEE double precision, in Big Endian order, where byte 0 is on the left end of the word.
- All integers are stored as twos-compliment binary numbers in either 8, 16, 32, or 64 bits, in Big Endian order.
- All strings are stored in ASCII, with a length field which includes an explicitly represented NULL byte at the end.

The HFlags byte contains flag bits that pertain to the non-compressible basic header, and the IFlags byte contains flag bits that pertain to the "object interior" which is potentially compressed.

		                       1 1 1 1 1 1
		 0 1 2 3 4 5 6 7   8 9 0 1 2 3 4 5 
		+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
		|Wid|N|     |ZZ | |Wid|C|A|R|B|   |
		+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
		      HFlags            IFlags

There are two width fields, Header_Wid and Interior_Wid. Each of the width (Wid) fields are interpreted in this manner:

Wid=00 indicates 8-bit wide unsigned length fields.
Wid=01 indicates 16-bit wide unsigned length fields.
Wid=10 indicates 32-bit wide unsigned length fields.
Wid=11 indicates 64-bit wide unsigned length fields.

The Header_Wid code specifies the width of the Object_Length field. The rationale for making this separate is to save space on objects where there might be many fields close to the maximum size at that width of length field, where the sum thus requires the next larger width. For example, 4 interior fields of 255 bytes could all have Interior_Wid=00, but the overall length of 1020 would require Header_Wid=01.

The Interior_Wid code specifies the width of the the Name_Len field (when the Name field is present. See "N" bit, below.), object comment, object attribute, object replacement list(deferred implementation), and object body fields. Because the width of all these fields is specified by the one code, they must all be the same width. Thus, when encoding an object into it's database record, the widest of all these length values determines the Interior_Wid value required; the Interior_Wid value affects the widths of all length fields simultaneously.

Both Header_Wid and Interior_Wid may vary from object to object. It is expected that the routines which write an object to the disk will use the narrowest width possible for each object.

The remaining bits when set indicate the presence of the corresponding field. This eliminates the need to specify a length of zero for optional fields which are not present in a given object.

Header flags (HFlags):

When set, N bit indicates the presence of the Name_Length and object name fields in the non-compressible basic header, immediately following the Object_Length field. The width of the Name_Length field is specified by the Header_Wid field.
The 2-bit ZZ field indicates whether the object interior has been compressed, and if so, how:
1. ZZ=00, object interior is uncompressed.
2. ZZ=01, object interior is compressed using GNU GZIP algorithm.
3. ZZ=10, RESERVED to mark 2nd algorithm (future expansion).
4. ZZ=11, RESERVED to mark 3rd algorithm (future expansion).

Interior flags (IFlags):

When set, C bit indicates the presence of the object comment field.
When set, A bit indicates the presence of the object attribute field.
When set, R bit indicates the presence of the replacement variable list field. [Deferred implementation: see end of this document]
When set, B bit indicates the presence of the object body field.
The remaining bits are reserved for the design committee to waste on additional optional fields in the object.

There are no alignment constraints on which byte the comment length and object body start on.

A sample object in external form might have the following byte layout, with the Header_flags byte set to {Wid=01, N=1, ZZ=00}, and Interior_flags set to {Wid=01, C=1, B=1}, denoting name, comment, and body all present.

		 0            15 16           31
		+-------+-------+-------+-------+
		| Magic1| HFlags| IFlags| Obj...|
		|-------+-------+-------+-------|
		|..Type | Object Length | Name..|
		|-------+-------+-------+-------|
		| ..Len | Object Name...        |
		|-------+-------+-------+-------|
		|       ....Object Name         |
		|-------+-------+-------+-------|
		| Comment Len   |  Comment...   |
		|-------+-------+-------+-------|
		| ...Comment    |  Body Length  |
		|-------+-------+-------+-------|
		|Object Body... | [Pad] |Magic2 |
		+-------+-------+-------+-------+

Padding and Length Rounding

The minimum size Free object is 7 bytes long:
Magic1,
HFlags (Wid=00, 1byte),
IFlags,
ObjType=Free (2 bytes),
ObjLen=7 (1 byte),
Magic2.

Rounding the minimum length of 7 bytes up to the next power of two makes the minimum object length 8 bytes. Given that free storage is represented as a object, and that the total length of an object must be a multiple of 8 bytes in length, objects must be padded to a length which is a multiple of 8 bytes. The pad bytes should be inserted between the end of the object body and the second magic number so that the final byte of the object is the Magic2 byte. The pad bytes are not counted as part of the body_length, but are counted as part of the object_length.

The minimum size database object is 12 bytes long:
Magic1,
HFlags (Wid=00 for 1 byte lengths, 1byte),
IFlags, (Wid=00 for 1 byte lengths, 1byte),
Object Type = solid (2 bytes),
Object Length = (1 byte),
Name Length = (1byte),
Object Name (2 bytes: 1 character name + null byte)
Body Length (1byte)
Object Body (1byte)
Magic2.

To comply with the rounding requirement, this will be padded out to occupy 16 bytes:
Magic1,
HFlags (Wid=00 for 1 byte lengths, 1byte),
IFlags, (Wid=00 for 1 byte lengths, 1byte),
Object Type = solid (2 bytes),
Object Length = (1 byte),
Name Length = (1byte),
Object Name (2 bytes: 1 character name + null byte)
Body Length (1byte)
Pad (4 bytes)
Object Body (1byte)
Magic2.

                 0            15 16           31
                +-------+-------+-------+-------+
                | Magic1| HFlags| IFlags| Obj...|
                |-------+-------+-------+-------|
                |..Type |ObjLen |NameLen| Name  |
                |-------+-------+-------+-------|
                |NameNUL|BodyLen|  Body |Magic2 |
                |-------+-------+-------+-------+

One byte of magic number at each end of an object suffices, because the byte before the Magic1 byte of an object should be the Magic2 byte of the previous object. That fact plus the 8-byte alignment constraint will make it possible to reliably locate the start of each undamaged object in the database even if portions of the database have been corrupted.

Extension mechanism: If additional database fields need to be added in subsequent revisions, they would be placed after the object body -- to older versions of the database reader the extensions would appear to be an excessively large pad area, and would be silently skipped.

Compressed Objects

The nice layout of the internals of each object makes it possible to apply data compression techniques to the internals of each object. Each object's internals will be compressed independently, as specified by ZZ; one database may have a mix of compressed and uncompressed objects. Only the Magic 1&2, Object Flags, Object Name, Object Type, and Object Length fields would need to remain in uncompressed format, so that the object could be located and/or skipped without needing to decompress it.

When the ZZ bits are 00, the object has not been compressed, and needs no special processing. When the ZZ bits are non-zero, the on-disk object has been compressed. When the interior of a compressed object is needed, an extra processing step is added: after reading the object into memory in a db_external structure, the decompression algorithm is run on the interior of the object, resulting in a new db_external structure. The table of values for ZZ was given earlier.

Before decompression:

		+---------------+
		| Magic Number1 |
		| HFlags        |   N=1, ZZ=01
		| IFlags        |
		| Object Type   |   Non-compressible header
		| Object Length |
		| Name Length   |
		| Object Name   |
		+---------------+
		| Obj Interior  |   Compressed portion
		+---------------+
		| Pad           |
		| Magic Number2 |
		+---------------+

After decompression the object would look as described earlier, but the value of ObjectLength in the decompressed object would be different than the value of ObjectLength in the compressed object.

Decompression of compressed objects will be automatic, and will be performed only when needed. For example, objects will not be decompressed when scanning the database (e.g. to build the in-memory table of contents). Compression will only be performed by user command; it is envisioned that mged will be given compress and decompress commands, to allow the user to have complete control over the speed/space tradeoffs.

Compression support is mandatory for any implementation of a database reader library, it is optional for an implementation of a database writer library.

VIII(b). Second Proposed Database Organization

After a rather intense late-night "beautification" session, Lee and I arrived at a slightly different organization which is probably less efficient in both speed and space, but is slightly more elegant. (The API is more elegant than the on-disk implementation, and that's probably the most inefficient portion).

		+---------------+
		| Magic Number1 |
		| Flags         |
		| DBObject Type |   Non-compressible header
		| DBObject Len  |
		| Name Length   |
		| Name          |
		| Interior Len  |
		+---------------+
		| Obj Interior  |   Possibly compressed portion
		+---------------+
		| Pad           |
		| Magic Number2 |
		+---------------+

The DBObject_Type field would have only 3 values: DB_Header, DB_Free, or DB_Allocated. (These could be stored in 2 bits of the flags byte). This would completely decouple this object type from any object typing desired by the application. All DB_Allocated objects would have to be named, and only DB_Allocated objects could be retrieved from the database by applications.

Within the "Obj Interior", all data would be stored as a collection of named attributes with values. For any given object, 0 or more attributes might be present. Each attribute would have a name-length, a name, a value-type, a value-length, and a block of storage to hold the actual value.

This would be a very generic database system, which could be used for other applications (who would want to?), and which could be easily replaced with a different database system (e.g. a commercial database package) with little effort (although that would doubless be even slower).

For BRL-CAD applications, a set of attribute names would be adopted by convention, with several being integral to the proper functioning of LIBRT:

The BRL-CAD database object type (combination, sphere, RPP, etc.) would be stored on an integer attribute ("t=").
Similarly, the BRL-CAD "object body" would be stored on a binary-block attribute ("b=").
The proposed BRL-CAD "comment" field would be stored on a text-string attribute ("comment=").
The proposed BRL-CAD "replacement list" field would be stored on a binary-block attribute ("rlist=").
All other attributes, from whatever source, would be stored similarly, including application-specific and end-user-created attributes.
(The database object name wouldn't be stored on disk as an attribute, but would be available for retrieval on attribute name "name=").

The main issue for the database API becomes one of how a retrieved and "unwrapped" in-memory database object looks to an application. The most natural way would be to have a variable-length array of structures something like this:

	struct attributes  {
		char	*aname;
		char	type;
		long	len;
		genptr_t value;
	} attrib[];

where type would be either ASCII-string, binary-block, or binary-integer (perhaps in several widths).

Then, to retrieve a BRL-CAD object (in external form) from the database, one would merely call:

	db_get_external(ext, dp);
	db_unwrap_v5_external(atab, ext);
	id = db_get_integer_attribute( atab, "t" );
	buf = db_get_block_attribute( atab, "b" );

and then proceed with the BRL-CAD level external-to-internal conversion from there.

The main questions are:
(1) What is the runtime penalty for having to build up a "symbol table" of attribute names for each object as it's being pulled in off the disk, and then searching it with strcmp() calls for every retrieval,
(2) What is the disk storage penalty for storing frequently recurring attribute names on the disk using the same straightforward and general encoding used for all the other attributes?
(3) How much inefficiency (in terms of number of data copies) is engendered by having to marshal up lots of small, variable length attribute name and value strings, to build the object interior, which is then (optionally) compressed, which is then marshalled up with the variable-length name string and finally provided in external form? Would a kernel mbuf-like data structure help? Enough?

$Header: /vld/mike/public_html/papers/brlcad5.0/RCS/appendix.html,v 1.1 1996/11/06 04:14:49 mike Exp mike $
Collected by Mike Muuss
E-mail comments to <acst@arl.army.mil>