Space Bank Design

This note describes the new design of the Space Bank, which is in response to the new conglomerated design for Object Ranges. Readers should already be familiar with the structure of EROS volumes, which is described in the Disk Formatting and Layout design note.

1 Object Identifiers and Range Keys

Every object has a unique object identifier, which consists of an object frame number and an object number within that frame. The number of valid objects per frame depends on the type of the objects currently residing in that frame. The object number field is currently 5 bits, but this will probably be revised to 8 soon.

|             Object Frame #              | Obj # w/i Frame |
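The layout above can be sketched in C. This is an illustrative sketch only: the 8-bit object field anticipates the planned revision (the current width is 5 bits), and the names are not the EROS API.

```c
#include <stdint.h>

/* Illustrative OID encoding: an object frame number plus an object
   number within the frame. OBJ_BITS = 8 assumes the planned revision
   of the field width (currently 5 bits). */
#define OBJ_BITS 8u

typedef uint64_t OID;

static inline OID oid_make(uint64_t frame, uint64_t obj) {
    return (frame << OBJ_BITS) | obj;
}
static inline uint64_t oid_frame(OID oid) {
    return oid >> OBJ_BITS;
}
static inline uint64_t oid_obj(OID oid) {
    return oid & ((1u << OBJ_BITS) - 1u);
}
```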

To create a new object, the space bank hands such an object identifier to a range capability and asks for an object capability. The created object's true OID will be relative to the base of the range. Since the space bank uses the superrange capability, it happens to deal in true object identifiers, but this is not essential to the design of the space bank; a correct design could be implemented using multiple range capabilities. The KeyKOS space bank was in fact implemented that way.

A range capability conveys the authority to create and rescind keys to objects of any type within its range by passing the offset OID. If asked to create a key to an object in a frame of a different type than the current type of that frame, all objects currently in the frame are rescinded, and the frame is reinitialized to the new type. Caveat emptor. While it is common practice to have a one-to-one correspondence between range capabilities and disk ranges, adjacent ranges can be coalesced into a single range capability.
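The retyping rule can be sketched as follows. This is a hedged illustration of the semantics described above; the type names and the get_object_key function are hypothetical, not the EROS interface.

```c
#include <stdint.h>

/* Hypothetical frame types; a frame holds objects of a single type. */
typedef enum { FT_PAGE, FT_NODE } FrameType;

typedef struct Frame {
    FrameType type;
    uint32_t  live_objects;  /* bitmap of currently fabricated objects */
} Frame;

/* Sketch of the range-key rule: asking for a key of a different type
   than the frame's current type rescinds everything in the frame and
   retypes it. Caveat emptor. */
static void get_object_key(Frame *f, FrameType want) {
    if (f->type != want) {
        f->live_objects = 0;  /* all existing objects rescinded */
        f->type = want;       /* frame reinitialized to the new type */
    }
    /* ... fabricate and return the requested object key ... */
}
```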

The space bank holds a range capability for each range that it is to manage. More precisely, the space bank is handed a range capability for each range it is to manage when the range is fabricated. To simplify the internal design (which is plenty complicated enough, as you will see), the space bank uses a single superrange capability conveying the authority to fabricate any object, and keeps track internally of which areas are ``live.''

2 Functional Overview

Before describing the space bank design, it may be useful to have a sense of what function the space bank provides and what requirements the design must meet. The current design (which is the one described here) is a compromise on these requirements.

2.1 Space Bank Services

The Space Bank handles allocation and deallocation of all EROS objects, which includes pages, nodes, cappages, and processes. Collectively, all of the space banks in the system conspire to keep a globally up-to-date list of the available object frames, and dole these out to individual banks as needed. Every space bank has a parent, and all banks are ultimately descended from the prime space bank.

    From a functional perspective, all banks are rooted at the prime bank. For convenience of internal implementation, there is a co-equal internal bank known as bank0 that is used by the space bank itself to allocate internal storage.

The most common operations requested of a space bank are the allocate and deallocate operations. The allocate operation is on the fault handling path for heap and stack growth. Since heap growth is a performance critical operation for many programs, the allocate operation is especially time critical.

Every bank retains a list of the objects that it has allocated. This list is used for two purposes:

  • To verify that returned objects were actually allocated by this bank.

  • To allow all of the objects allocated by the space bank to be rescinded when the bank is destroyed.

The second item brings us to the second facility that space banks provide: the destroy operation. When a space bank is destroyed, it (and all of its children) becomes unusable, and all capabilities to the bank are invalidated. The objects allocated by these banks can be handled in two ways, depending on how the bank was destroyed:

  • They can be rescinded.

  • They can be inherited upwards to the parent of the destroyed bank.

The first case (everything rescinded) is a useful way to perform recovery cleanup when an orderly cleanup has become impossible. The second provides a way to allocate objects speculatively: the result, once constructed, is retained, while the interim construct can easily be discarded if the available space proves insufficient.

2.2 Bank Restrictions

Space banks impose two restrictions: limits and nodestroy.

Every bank imposes a limit that describes the total number of object frames that can be allocated via a given bank. When an object frame is allocated by a bank, it is logically allocated (recursively) by all of its parent banks as well. The applicable limit value is therefore the most constraining of the limits of all of the parent banks up to the prime bank. The limit field provides a way to restrict the resource allocation of subsystems. A common use is for there to exist a limited bank whose children impose no limit.

The limit is simply a counter, and can be reset on demand. Every bank therefore provides the ability to create a restricted capability to the same bank that precludes changes to the limit field. The limit of a limit-restricted bank cannot be altered.

For similar reasons, it is sometimes useful to ensure that a space bank cannot be destroyed. For example, the space bank containing all of the space available to a given user should probably be nondestructible, lest the user accidentally destroy all of their storage (which would be nonrecoverable). As with limit-restricted banks, every bank can create a restricted capability to the same bank that precludes bank destruction.

The two restrictions are independent; a given bank capability can enforce one, the other, both, or neither.
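Since the restrictions are independent, they are naturally modeled as separate bits, giving four possible variants per bank. The encoding below is a hypothetical sketch; the names are illustrative, not the space bank's actual representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the two independent restrictions. */
enum {
    BANK_RESTRICT_LIMIT     = 0x1,  /* limit field may not be changed */
    BANK_RESTRICT_NODESTROY = 0x2,  /* bank may not be destroyed */
};

static bool may_set_limit(uint32_t restrictions) {
    return (restrictions & BANK_RESTRICT_LIMIT) == 0;
}
static bool may_destroy(uint32_t restrictions) {
    return (restrictions & BANK_RESTRICT_NODESTROY) == 0;
}
```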

2.3 Other Requirements

While the description given so far makes the bank functionality seem straightforward, there are a number of complications that make the actual implementation complex:

  • The number of available frames in a large system cannot be described within a single address space; the space bank must carefully husband its virtual address space to maximize the amount of space it is able to handle. The current design does not go as far in this direction as we would like.

  • The global free list must be coordinated across banks. This requires a sleazy implementation trick.

  • Fabrication of a child bank should be relatively lightweight and require minimal overhead, since the bank mechanism is used to simplify storage reclamation in the face of an allocation error.

  • Allocating space needs to be *FAST* -- allocation is a frequent operation on the critical path of many programs. Once into the space bank proper, the time to allocate an object really wants to be in the 1 to 2 microsecond range.

  • All allocated objects must be accounted for -- no losing objects for efficiency's sake.

  • Objects allocated by a given space bank should have good locality of reference. Disk locality is a significant contributor to overall system performance.

  • (Not implemented) There should be a way for a program to reserve a large amount of space for efficiency purposes (database programs, etc.).

Some other items should be noted as not being requirements:

  • The space bank is *not* a garbage collector. If you allocate an object and lose the key, that is your problem.

  • The space bank is *not* a real bank. You cannot hold up the space bank.

3 The Implementation

The bank itself is fairly straightforward but for our sensitivity to address space size limits. The major complexity of the space bank comes from the requirement for high-performance, clustered allocation. We will describe the overall design and the individual bank mechanisms, and then discuss how fast, well-localized allocation is managed.

3.1 Structure

While there are many space banks, the actual implementation uses only one process. Each individual space bank is actually a red segment node containing a start key to the bank process and some auxiliary information identifying the bank instance and the applicable restrictions. In effect, the format capability of the red segment node is used to gain the effect of a large number of objects (the banks) implemented by a single server (the bank process). In actual practice, the distinguishing "bank identifier" stored in the red segment node is simply a pointer to a bank structure within the virtual address space of the bank process.

A side effect of this design is mutual exclusion -- because there is only one bank process, it is guaranteed that access to the global free frame list is single-threaded. Further, there is no need to manage the master list across multiple processes.

A single prime space bank is not a requirement of this design. It would be quite possible to implement a design in which there were multiple banks representing the roots of distinct bank hierarchies.

3.2 Per-Bank Information

Internally, the space bank process maintains a number of Bank structures allocated from the heap. Every bank has a pointer to its parent bank, the next sibling in its generation, and its first child bank (if any).

Each bank has several restricted variants, and a separate red segment node is fabricated for each. The bank keeps track of which restricted variants have been fabricated and what the node OID was for that variant, and reuses these nodes rather than fabricating new ones for each request for restriction.

The most essential information carried by the bank is the list of objects that it has allocated. This is kept in a balanced binary tree, but here overhead becomes a problem. If every OID is stored in a separate tree node, the node overhead is 200%. Obviously this will not do.

The first saving grace lies in the fact that nodes and processes are smaller than a page frame, and that a given page frame can only hold objects of a single type. This allows us to implement tree nodes containing a page frame OID and a bitmap describing which of the contained objects are allocated. This somewhat reduces the tree overhead.

A larger savings comes from ``clumping'' the tree nodes into multiple page frames. We know that the bank must obtain well-localized allocation for performance reasons. We bet that it will, and design the tree nodes to describe 8 adjacent frames using a single OID. This yields a substantial reduction in overhead, but its effectiveness will need to be examined.
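A clumped tree node along these lines might look as follows. This is a hedged sketch: the 32 objects-per-frame figure and the field names are illustrative assumptions, not the actual layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a "clumped" allocation-tree node: one base OID describes
   8 adjacent frames, with a per-object allocation bitmap for each
   frame. OBJS_PER_FRAME = 32 is illustrative; the real count depends
   on the object type held by the frame. */
#define FRAMES_PER_CLUMP 8
#define OBJS_PER_FRAME   32

typedef struct AllocNode {
    uint64_t base_oid;               /* OID of first frame in the clump */
    uint32_t map[FRAMES_PER_CLUMP];  /* bit set => object is allocated */
    struct AllocNode *left, *right;  /* balanced-binary-tree links */
} AllocNode;

static void mark_allocated(AllocNode *n, unsigned frame, unsigned obj) {
    n->map[frame] |= 1u << obj;
}
static bool is_allocated(const AllocNode *n, unsigned frame, unsigned obj) {
    return (n->map[frame] >> obj) & 1u;
}
```

One such node accounts for up to 8 * 32 objects, so the per-object bookkeeping overhead drops well below the one-tree-node-per-OID scheme dismissed above.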

Finally, the bank must manage the fact that a given disk frame can only hold objects of a single type. When a node is allocated into a new frame, the bank keeps track of the current frame for node allocation, and continues to use that frame until it is exhausted.

3.3 Range Management

The space bank maintains a list of active ranges. In the current implementation, this is a statically allocated list, but future versions will provide means for the list to grow. Each element in the range list describes a starting and ending OID. The current implementation can manage up to 4096 ranges.

In order to simplify the management of the master list of available frames, ranges are coalesced into a single numbering scheme; each new range is added at a start number that is congruent mod 32K to 0. This ensures that the bitmap of available frames in that range can come from within the range itself. The space bank preallocates a master bitmap that holds all ranges, and plugs new ranges into this master bitmap as they become available.
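The start-number alignment amounts to rounding up to the next multiple of 32K. A sketch of that computation (the function name is illustrative):

```c
#include <stdint.h>

/* Each new range is placed at a frame number congruent to 0 mod 32K,
   so that the range's own availability bitmap can live inside the
   range. Rounding up to the next 32K boundary: */
#define RANGE_ALIGN 0x8000ull  /* 32K frames */

static uint64_t range_start(uint64_t next_unused_frame) {
    return (next_unused_frame + RANGE_ALIGN - 1) & ~(RANGE_ALIGN - 1);
}
```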

3.4 Free Frame Management

3.5 Minor Complications

The underpinnings of the space bank are two structures, the Global Range Directories and the per-bank allocation maps. These function together to keep track of all storage (allocated and unallocated) managed by the Space Bank.

Global Range Directories

Each range is seen by the space bank as a number of object frames. In order to keep track of what has been allocated, the bank maintains a bitmap, called a "range map", with one bit for each frame in the range recording its allocation status. Since searching through a large bitmap is very slow, the range map is divided into page-size chunks, called "sub-range maps". In order to quickly find a sub-range map that is not completely allocated, there is a "range directory", which has one bit for every sub-range map recording whether that sub-range map is completely allocated.
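The two-level search can be sketched as follows. This is an illustrative in-memory model, not the on-disk format: the directory size is shrunk to a toy value, and the function name is hypothetical.

```c
#include <stdint.h>

/* Illustrative two-level free-frame search. One directory bit per
   page-size sub-range map; a set directory bit means "this sub-range
   map is completely allocated, skip it". Assumes 4K pages, so each
   sub-range map covers 4096 * 8 = 32768 frames. NSUBMAPS is a toy
   value for illustration. */
#define SUBMAP_BITS (4096 * 8)
#define NSUBMAPS    16

static uint8_t directory[NSUBMAPS / 8];            /* 1 = submap full  */
static uint8_t submap[NSUBMAPS][SUBMAP_BITS / 8];  /* 1 = frame in use */

/* Returns a free frame number, or -1 if every sub-range map is full. */
static long find_free_frame(void) {
    for (int s = 0; s < NSUBMAPS; s++) {
        if (directory[s / 8] & (1u << (s % 8)))
            continue;                    /* directory says full: skip */
        for (long f = 0; f < SUBMAP_BITS; f++)
            if (!(submap[s][f / 8] & (1u << (f % 8))))
                return (long)s * SUBMAP_BITS + f;
    }
    return -1;
}
```

The directory bit lets the search skip 32768 frames at a time, which is the whole point of the second level.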
figure "RangeMap.fig" here

In the range, the range directory occupies the first frame or two, the sub-range maps take up the following few frames, and the allocatable frames take up the rest of the disk. (It would take a 4-terabyte disk for the range directory to grow larger than a single frame.)


Per-bank allocation map

Each bank has to keep track of all storage allocated by it, for two reasons:
  • A bank can be asked to destroy itself and all of its contents.
  • If someone asks a bank to deallocate an object that wasn't allocated from that bank, the request must be refused.

Complication: Bank destruction

Complication: Parent/Child interactions and Limits

Other issues

On-disk clustering

In-core address space size limits

Re-allocation of "Hot" Objects

The Global Storage Map


Per-Bank Structures

Bank Destruction

Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License