a better allocator api for rust

Published: 2023-05-15

recently, i've been working on a custom "standard library" for rust. my primary motivation was the need for custom allocators on stable rust. initially, i decided to simply re-implement rust's nightly allocator api in my own standard library. but soon, i discovered that i quite disliked rust's proposed api. its design led to lots of code duplication, and the api had various features that weren't exactly useful in practice. in this post, i'll explain rust's current allocator api proposal and share what changes i made to it and why.

alloc & free

when i say "allocator api" in this article, i'm referring to an interface that provides at least two methods: a way to allocate memory and a way to free memory.

here's a rust trait that implements such an interface:

trait Allocator {
    fn allocate(&self, size: usize) -> *mut u8;
    fn deallocate(&self, ptr: *mut u8);
}

allocate takes a size in bytes and returns a raw pointer, which is null if the allocation failed. deallocate takes a raw pointer and frees the allocation internally. this is analogous to C's malloc and free.

these two methods also form the core of rust's nightly allocator api, though with some slight modifications:

trait Allocator {
    // Required methods
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError>;
    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);

    // ...
}

instead of taking a size in bytes, allocate takes a Layout. a Layout is both a size and an alignment. additionally, instead of plain raw pointers, NonNull pointers are used, which enables some size optimizations for Result and Option.
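
for reference, a Layout can be obtained from a type or constructed manually from a size and an alignment (both are stable std apis):

use core::alloc::Layout;

// the layout of a concrete type (8 bytes, 8-byte aligned for u64 on 64-bit targets).
let from_type = Layout::new::<u64>();

// or constructed by hand; this returns an error if the alignment
// isn't a power of two, or if the size is too large.
let manual = Layout::from_size_align(24, 8).unwrap();
assert_eq!((manual.size(), manual.align()), (24, 8));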

however, the most interesting change is that deallocate also takes a Layout. this layout must be the same Layout as was passed to allocate. at first, this may seem strange and error prone - we're now relying on the user to pass in the correct layout to deallocate. but if you think about it, the user already has to know how large their allocation is, in order to use it correctly. (and of course, we're writing unsafe code here, so it's fine to require extra diligence from our users.)

but why would we want the user to pass the layout to deallocate in the first place? the benefit of doing so is that the allocator now isn't technically required to keep track of the sizes of allocations internally. this can be especially useful for large allocations, which are typically implemented by forwarding them directly to the operating system's virtual memory functions - and those functions also require the caller to pass in the size. so if the allocator wasn't given the size by the caller, it would have to do additional internal bookkeeping to track the sizes of large allocations, just to free them correctly.
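
as an illustration, the large-allocation path of such an allocator's deallocate could forward the caller-provided size straight to the operating system. here's a unix-flavoured sketch using the libc crate (the function name is made up, and the matching mmap-based allocation path and error handling are elided):

use core::{alloc::Layout, ptr::NonNull};

/// # safety: `ptr` and `layout` must describe a live mmap'd allocation.
unsafe fn deallocate_large(ptr: NonNull<u8>, layout: Layout) {
    // munmap needs the mapping's length. because the caller hands us the
    // layout, the allocator never had to store this size itself.
    libc::munmap(ptr.as_ptr().cast(), layout.size());
}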

handling ZSTs

rust has this concept of zero sized types or ZSTs. they are types that don't require any runtime storage, because they only have a single value. examples include the unit type (), zero length arrays [u8; 0], and empty structs struct Foo {}.
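
size_of confirms that all of these take up zero bytes:

use core::mem::size_of;

struct Foo {}

assert_eq!(size_of::<()>(), 0);
assert_eq!(size_of::<[u8; 0]>(), 0);
assert_eq!(size_of::<Foo>(), 0);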

because ZSTs don't take up any space, any properly aligned pointer can be returned to satisfy an allocation request for a ZST. the logic may look something like this:

unsafe impl Allocator for MyAllocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        if layout.size() == 0 {
            // return a dangling pointer aligned to `layout.align()`.
            let ptr = unsafe { NonNull::new_unchecked(layout.align() as *mut u8) };
            Ok(NonNull::slice_from_raw_parts(ptr, 0))
        }
        else {
            // regular allocation logic.
            // eg: search for some free memory block, etc.
            // ...
            todo!()
        }
    }
}

the issue is, practically every implementation of Allocator::allocate would look like this, because allocators should not waste any memory on ZSTs.

this is one example of how rust's allocator api leads to code duplication. in my allocator trait, i split up the allocate and free functions into two parts each:

pub trait Alloc {
    // this is the method implemented by the allocator.
    /// # safety:
    /// - `layout.size() > 0`
    unsafe fn alloc_nonzero(&self, layout: Layout) -> Option<NonNull<u8>>;

    // user facing method that handles ZSTs.
    fn alloc(&self, layout: Layout) -> Option<NonNull<u8>> {
        if layout.size() != 0 {
            let result = unsafe { self.alloc_nonzero(layout) };

            if let Some(result) = result {
                // ensure the implementation returned a properly aligned pointer.
                debug_assert!(result.as_ptr() as usize % layout.align() == 0);
            }

            return result;
        }
        else {
            return Some(dangling(layout));
        }
    }

    // analogously for `free`
}

splitting allocate and free into these two parts has several benefits: 1) the implementers of Alloc don't have to handle zero sized allocations - those are already handled by the default implementations of alloc and free. and 2) we can put extra sanity checks into the default implementations. for example, the alloc function validates that alloc_nonzero actually returned a properly aligned pointer.
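
for completeness, here's what the dangling helper from the snippet above might look like - a minimal sketch, relying on the fact that an alignment is always non-zero and a power of two:

use core::{alloc::Layout, ptr::NonNull};

// returns a well-aligned, non-null pointer for zero sized allocations.
// it doesn't point at any real memory and must never be dereferenced.
fn dangling(layout: Layout) -> NonNull<u8> {
    // an alignment is always non-zero, so the pointer is non-null,
    // and an address equal to the alignment is trivially aligned.
    unsafe { NonNull::new_unchecked(layout.align() as *mut u8) }
}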

realloc

the ZST changes are nice minor improvements. but on their own, they're not enough to write an article about, or perhaps to suggest to the allocators working group for adoption into rust's allocator api. so now let's talk about resizing :^)

some use cases can benefit from resizing allocations in place. a Vec, for example, occasionally needs to grow its underlying buffer, in order for more elements to be pushed into it. ideally, if there's enough free memory after the current allocation, we'd like to be able to resize the allocation in place, so we can avoid copying the elements to a new location. to do that, we need to add another method to the allocator api: realloc.

realloc tries to resize an allocation in place. if that fails, it instead makes a new allocation, copies the bytes over, and frees the old allocation. because realloc may change the memory location of the allocation, it has to return a pointer indicating the new address. that's how it works in C.

for some reason, in rust's allocator api, realloc was split into two functions, grow and shrink:

pub unsafe trait Allocator {
    // ...

    unsafe fn grow(&self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_layout: Layout
    ) -> Result<NonNull<[u8]>, AllocError> { ... }

    unsafe fn shrink(&self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_layout: Layout
    ) -> Result<NonNull<[u8]>, AllocError> { ... }
}

to be honest, i don't really understand why it's done this way. often the logic for growing and shrinking is exactly the same, and even if it's not, it just takes a single branch to figure out which case you're in. (performance arguments are tricky, because reallocation is a fairly rare occurrence, and getting rid of the branch comes at the cost of increased code size.)

for my api, i've gone with the more conventional realloc approach, which is also used by the allocator apis of the zig and odin languages:

pub trait Alloc {
    /// resize an allocation.
    unsafe fn realloc(&self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_layout: Layout,
    ) -> Option<NonNull<u8>> { ... }
}

but like with ZSTs, this api has a code duplication issue: determining whether an allocation can be resized in place is allocator-specific logic, but if resizing is not possible, virtually all allocators resort to the same standard routine of making a new allocation, copying the old bytes over, and freeing the old allocation.

therefore, initially, i split realloc into try_realloc and realloc:

pub trait Alloc {
    /// try to resize an allocation.
    ///  - the allocator dependent logic.
    ///  - `ptr` remains unchanged if the call fails.
    ///  - this is similar to zig's `resize` allocator method.
    unsafe fn try_realloc(&self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_layout: Layout,
    ) -> Result<(), ()>;

    /// resize an allocation.
    unsafe fn realloc(&self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_layout: Layout,
    ) -> Option<NonNull<u8>> {
        if unsafe { self.try_realloc(ptr, old_layout, new_layout).is_ok() } {
            return Some(ptr);
        }
        else {
            // alloc, copy, free.
        }
    }
}
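
spelled out, the body of that else branch would be something like the following sketch (i'm assuming free mirrors the rest of the trait and takes the pointer together with its layout):

// alloc, copy, free - the part that looks the same in every allocator.
let new_ptr = self.alloc(new_layout)?;
unsafe {
    // copy over whatever part of the old contents still fits.
    core::ptr::copy_nonoverlapping(
        ptr.as_ptr(),
        new_ptr.as_ptr(),
        old_layout.size().min(new_layout.size()));
    // `free` is assumed to take the pointer plus its layout.
    self.free(ptr, old_layout);
}
return Some(new_ptr);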

later, i added try_realloc_nonzero to match the {alloc, free}_nonzero pattern, so that allocator implementations always deal exclusively with non-zero sizes.

resizing joined allocations

aside from reducing code duplication, this split between try_realloc and realloc actually comes with another upside: it enables resizing for "joined allocations".

when some container uses multiple arrays, it can be beneficial to use a single allocation for those arrays - a "joined allocation". picture something like this:

struct Joined<const N: usize> {
    xs: [X; N],
    ys: [Y; N],
}

struct Container {
    len: usize,
    ptr: *mut Joined<len>, // this doesn't work, but that's the idea.
}

unlike with Vec, an in-place resize of a Joined isn't actually a no-op: the ys need to be moved towards the right to make space for the now larger xs array:

[XXXYYY]
[XXX...YYY...]

this makes the regular C-style realloc api sub-optimal: if the joined allocation can't be resized in place, all bytes are first copied to the larger allocation, and then the ys are copied a second time to make space for the xs. with the try_realloc split, the container can handle the fallback itself and copy each array directly into its final position, so every byte is copied only once.
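
in code, that could look roughly like the following sketch. the function name and the byte-offset parameters are made up for illustration, and free is again assumed to take the pointer plus its layout:

use core::{alloc::Layout, ptr::NonNull};

/// grow a joined `[xs | ys]` allocation. offsets and lengths are in bytes.
unsafe fn grow_joined(
    alloc: &impl Alloc,
    ptr: NonNull<u8>,
    old_layout: Layout, old_ys_offset: usize,
    new_layout: Layout, new_ys_offset: usize,
    ys_len: usize,
) -> Option<NonNull<u8>> {
    if alloc.try_realloc(ptr, old_layout, new_layout).is_ok() {
        // resized in place: only the ys have to move (the regions may overlap).
        core::ptr::copy(
            ptr.as_ptr().add(old_ys_offset),
            ptr.as_ptr().add(new_ys_offset),
            ys_len);
        Some(ptr)
    }
    else {
        // fallback: copy xs and ys straight into their final positions,
        // so each byte is copied exactly once.
        let new_ptr = alloc.alloc(new_layout)?;
        core::ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_ptr(), old_ys_offset);
        core::ptr::copy_nonoverlapping(
            ptr.as_ptr().add(old_ys_offset),
            new_ptr.as_ptr().add(new_ys_offset),
            ys_len);
        alloc.free(ptr, old_layout);
        Some(new_ptr)
    }
}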

in practice, this could be useful for hash maps, because those often use two joined arrays for the "hash" part and the "values" part. (though because the "hash" part changes after a resize, it's unclear whether the extra implementation complexity is worth it.)

embracing fragmentation

rust's nightly allocator api has two more "features", which i got rid of completely: memory fitting and zero initialization.

when i said "deallocate also takes a Layout. this layout must be the same Layout as was passed to allocate", that wasn't the whole truth. as you may have noticed, rust's allocate method actually returns a NonNull<[u8]>: a fat pointer consisting of a raw pointer and a length.

allocators often group request sizes into buckets. for example, all sizes below 1024 bytes could be rounded up to the next multiple of 8. the idea behind allocate returning a pointer plus an "actual size" is that the user may be able to make use of these extra bytes, potentially reducing internal fragmentation.

for example, in a Vec, you could set the capacity to actual_size/size_of::<T>() after the call to allocate. but because the capacity is then used to determine the size passed to deallocate, deallocate has to be a bit more flexible: in rust's allocator api, it accepts any size between the requested size and the actual size. in the standard library, this is called memory fitting.
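
a sketch of how that could look against the nightly api - the function is hypothetical, the surrounding Vec internals are elided, and T is assumed to not be zero sized:

// nightly: #![feature(allocator_api)]
use core::alloc::{AllocError, Allocator, Layout};
use core::ptr::NonNull;

// allocate a buffer for at least `requested_cap` elements of `T`, and
// compute the capacity from the actual size the allocator returned.
fn alloc_buffer<T, A: Allocator>(alloc: &A, requested_cap: usize)
    -> Result<(NonNull<T>, usize), AllocError>
{
    let layout = Layout::array::<T>(requested_cap).map_err(|_| AllocError)?;
    let slice = alloc.allocate(layout)?;
    // `slice.len()` is the "actual size" in bytes.
    let capacity = slice.len() / core::mem::size_of::<T>();
    Ok((slice.cast::<T>(), capacity))
}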

memory fitting makes the whole allocator api quite a bit more complex. it's also unclear whether there even is any significant benefit to it. with an allocation granularity of 8 bytes, memory fitting would cease to be beneficial for (most) types larger than just 4 bytes. these two concerns were enough for me to ditch memory fitting.

rust does not endorse ZII

the other feature i removed was zero initialization.

when requesting memory directly from the operating system, the returned pages are always zero initialized. so if the user needs zero initialized memory, it would be nice if they could ask the allocator for zeroed memory, so that memory which came directly from the operating system doesn't get zeroed a second time. to support this, rust's nightly allocator api has *_zeroed variants of allocate and grow.

in other languages with more emphasis on Zero Is Initialization, i think having such versions of allocate and grow could make sense. but in rust, zeroed memory is practically useless: variables always have to be explicitly initialized, Default is an explicit trait rather than an inherent property of types, and many common types like Vec or Box use NonNull internally, which means they can't be created directly from zeroed memory.

but even if Zero played a larger role in rust, i'm still not sure that zeroed allocation would be a good idea. the point of using a general purpose allocator is that we don't have to constantly ask the operating system for memory - reuse of memory is the common case. in that case, we've effectively just moved the memset call into the allocator, which now also has to keep track of which parts of its memory are still zeroed. again, the added complexity of zero initialization didn't seem worth the speculative gains, so i removed the feature.

the end

here's a gist with the latest version of my allocator api.

that's about it 👋