by Andreas Fredriksson (dep/TBL) on August 20, 2019

The Tech Behind Eon: Tooling

Eon represents four years of work, and during that time we iterated several times on various tools and pipelines. I’ll save the 3D scene packer and track loader for separate posts, but this post will cover the other aspects of our pipeline.

Code Builds

First things first: we use cross-compilation. Our build environment works on Mac, Linux and Windows and produces both ADF floppy images and stand-alone test programs, as well as a multitude of host tools. This section will cover some of the details of our setup, as well as some history.

We started with a stripped down, assembly-only version of our A1200 demo system (called NewAge for some reason). We got rid of all the C code and simplified things down to the very basics. This is how early prototyping of things like the rubber cube started, in plain 68k assembly.

Build System origins

We were already using Tundra as our build system, which is a great choice for us as it supports complicated cross-platform builds, it’s fast, robust and extensible, and, well, I wrote it so we have a pretty good understanding of how to use it.

One of the things we wanted early on was a way to generate some boilerplate code automatically, like generating a lookup table during build time. The first attempt at this was a Lua-based frontend to the assembler which ran as a prepass, so you could “escape” into Lua to generate a bunch of assembly. While this got the job done, it was problematic for a number of reasons, and the biggest problem was that you had no idea what actual source line a piece of assembly came from. For example, if you had generated a lookup table of 1,024 dc.w statements, all the subsequent assembler error messages might be offset by hundreds of lines and you’d have to go rummage through the generated assembly (in the output directory) to figure out what you did wrong.

Unsatisfied with this state of affairs, we cloned vasm and made a few modifications so we could use something similar to C preprocessor #line directives. The assembler would then point back to the original source code, and things were OK for a while.
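To make the idea concrete, here is a minimal Python sketch of a prepass that expands a generated block and then emits a line directive so that assembler diagnostics map back to the original file. The `@table` escape and the function name are hypothetical and exist only for this illustration; the `tbl_line <line> <file>` shape is borrowed from the deluxe68 output shown later in this post, and the sketch assumes the directive names the source line that follows it, the way C's #line does.

```python
def expand_with_line_directives(source: str, filename: str) -> str:
    """Expand hypothetical '@table N' escapes and resync line numbers afterwards."""
    out = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("@table "):
            count = int(stripped.split()[1])
            # Generated code: one dc.w per entry (the values are placeholders).
            out.extend(f"\tdc.w\t{i * 2}" for i in range(count))
            # Tell the assembler that the next output line is source line lineno + 1.
            out.append(f"\ttbl_line {lineno + 1} {filename}")
        else:
            out.append(line)
    return "\n".join(out)


demo = "start:\n@table 3\n\tmove.w\td0,d1\n"
print(expand_with_line_directives(demo, "demo.s"))
```

However many lines the escape expands to, an error in the move.w line would still be reported against line 3 of demo.s.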

Deluxe68

The Lua preprocessor thing we had didn’t actually solve the hardest problem with maintaining a large assembly codebase, which is the cognitive load of making changes to older code. We kept trashing registers and screwing with calling conventions, and tracking all this down in the emulator is a pain, to say the least. Often these bugs were introduced because the programmer missed the fact that some register was precious or already in use, and they would only manifest occasionally.

We also weren’t really using the “macro” part of the Lua preprocessor as much as we had anticipated, because we were already building data in a separate pipeline and most of the demo was art driven and not code driven.

I took a stab at solving the maintainability problem, and the result was deluxe68, which is a different kind of assembly preprocessor (again leveraging our #line directive vasm feature). Deluxe68 is a simple kind of register allocator that complains if you make mistakes with your registers. The tool evolved a little bit over the years, but it has essentially stayed true to this single purpose.

The way it works is that you use symbolic names for registers, and the tool will allocate and free those registers from a register pool automatically. It also has limited support for spilling and restoring registers, as well as reserving some registers from the pool (like a6) so you can use them for hardwired ABI purposes or other things.

With this tool in hand we ported almost all our plain assembly code to this style of programming and we were very pleased with the results. It’s much easier to make changes and not break things, because deluxe68 will automatically save and restore modified registers (generating movem.l sequences) for you.

Here is an example routine from the demo so you can get a sense of what the code looks like:

        XDEF	ScenePlayerDecompressBitplaneMasks
        @proc	ScenePlayerDecompressBitplaneMasks (a6:statep)

        IFGT	SP_PROFILE_LEVEL-0
          TBL_BEGIN_SCOPE "ScenePlayerDecompressBitplaneMasks"
        ENDC

        ; Fetch count of bitplane masks

        @areg	countp
        move.l	SP_COUNT_PTR(@statep),@countp

        @dreg	bitplane_masks_count
        move.b	(@countp)+,@bitplane_masks_count
        lsl.w	#8,@bitplane_masks_count
        move.b	(@countp)+,@bitplane_masks_count

        move.l	@countp,SP_COUNT_PTR(@statep)
        @kill	countp

        IFNE	SP_PRINTF_DEBUG
        Printf	"Num bitplane masks: %lx",@bitplane_masks_count
        ENDC

        ; Decompress bitplane masks

        tst.w	@bitplane_masks_count
        beq.s	.noBitplaneMasks

        @reserve d0,a0-a1
        lea	SP_3BIT_DECODERSTATE(@statep),a0
        lea	SP_BITPLANEMASKS(@statep),a1
        move.w	@bitplane_masks_count,d0
        bsr	DecodeBitStream_3Bits_Words_Decode
        @unreserve d0,a0-a1
    .noBitplaneMasks

        @kill	bitplane_masks_count

        IFGT	SP_PROFILE_LEVEL-0
          TBL_END_SCOPE
        ENDC

        @endproc

And here is what comes out from deluxe68. Note the use of tbl_line directives to point back to the original source file, as well as the automatic register preserving that has been inserted:

        XDEF	ScenePlayerDecompressBitplaneMasks
        ; 		@proc	ScenePlayerDecompressBitplaneMasks (a6:statep)
        ; live reg a6 => statep
    ScenePlayerDecompressBitplaneMasks:
        movem.l d0/d7/a0/a5,-(sp)
        tbl_line 396 /Users/dep/repo/newage_a500/Engine/A500/ScenePlayer/ScenePlayer.d68

        IFGT	SP_PROFILE_LEVEL-0
          TBL_BEGIN_SCOPE "ScenePlayerDecompressBitplaneMasks"
        ENDC

        ; Fetch count of bitplane masks

        ; 		@areg	countp
        ; live reg a5 => countp
        tbl_line 404 /Users/dep/repo/newage_a500/Engine/A500/ScenePlayer/ScenePlayer.d68
        move.l	SP_COUNT_PTR(a6),a5

        ; 		@dreg	bitplane_masks_count
        ; live reg d7 => bitplane_masks_count
        tbl_line 407 /Users/dep/repo/newage_a500/Engine/A500/ScenePlayer/ScenePlayer.d68
        move.b	(a5)+,d7
        lsl.w	#8,d7
        move.b	(a5)+,d7

        move.l	a5,SP_COUNT_PTR(a6)
        ; 		@kill	countp

        IFNE	SP_PRINTF_DEBUG
        Printf	"Num bitplane masks: %lx",d7
        ENDC

        ; Decompress bitplane masks

        tst.w	d7
        beq.s	.noBitplaneMasks

        ; 		@reserve d0,a0-a1
        lea	SP_3BIT_DECODERSTATE(a6),a0
        lea	SP_BITPLANEMASKS(a6),a1
        move.w	d7,d0
        bsr	DecodeBitStream_3Bits_Words_Decode
        ; 		@unreserve d0,a0-a1
    .noBitplaneMasks

        ; 		@kill	bitplane_masks_count

        IFGT	SP_PROFILE_LEVEL-0
          TBL_END_SCOPE
        ENDC

        ; 		@endproc
        movem.l (sp)+,d0/d7/a0/a5
        rts
        tbl_line 438 /Users/dep/repo/newage_a500/Engine/A500/ScenePlayer/ScenePlayer.d68
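
The bookkeeping behind @areg, @dreg and @kill can be pictured with a small Python model. This is a sketch, not deluxe68's actual implementation: the RegisterPool class and its methods are hypothetical, and handing out the highest-numbered free register first is an assumption chosen only because it matches the a5 and d7 assignments in the listing above.

```python
class RegisterPool:
    """Toy allocator for symbolic register names (not deluxe68's real code)."""

    def __init__(self, data_regs, addr_regs):
        self.free = {"d": sorted(data_regs), "a": sorted(addr_regs)}
        self.live = {}        # symbolic name -> register
        self.touched = set()  # registers a prologue would have to save

    def alloc(self, kind, name):
        if name in self.live:
            raise ValueError(f"{name} is already live")
        if not self.free[kind]:
            raise RuntimeError(f"out of {kind} registers for {name}")
        reg = self.free[kind].pop()  # highest-numbered free register
        self.live[name] = reg
        self.touched.add(reg)
        return reg

    def kill(self, name):
        reg = self.live.pop(name)    # KeyError if the name is not live
        self.free[reg[0]].append(reg)
        self.free[reg[0]].sort()

    def resolve(self, name):
        return self.live[name]       # KeyError flags use of a dead name


# a6 is left out of the pool because the example binds it to statep.
pool = RegisterPool([f"d{i}" for i in range(8)], [f"a{i}" for i in range(6)])
print(pool.alloc("a", "countp"))                 # a5
print(pool.alloc("d", "bitplane_masks_count"))   # d7
pool.kill("countp")
```

Using a name after it has been killed raises an error here, which is the kind of mistake the real tool is described as complaining about.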

tbllink

At the end of the project we were facing some problems that were hard to tackle. The individual parts are HUNK executables that our layout tool strips and lays out on the floppy for the final demo. We had been using vlink for this, and things were mostly working fine.

However, we needed one “killer feature” that we couldn’t really get with vlink. We wanted “uninitialized bss”, which is not zero-initialized by the loader. This means that it’s garbage at runtime, but that’s fine for things you’re going to clear and render to anyway! On a background processing loader like our streamer, just clearing all these large BSS buffers can waste a significant amount of time.

We were also unhappy with the debug information we could get out of vlink: we wanted all symbols (including local ones) for tools that I’ll describe later.

So I sat down and wrote a custom linker to replace vlink so we could have these features. It turned out to be a very quick job (just a few nights) and I was surprised by how little code it took.

A return to C

About 18 months before we shipped the demo we had amassed a lot of individual programs that made up the parts. While they shared code where possible, there was still a lot of boilerplate to write. It’s also unnecessary to write “artsy” parts in assembly. Most of the high-level code in each part just kicks off a few blits or calls out to highly optimized routines for 3D.

So we explored various options for C compilation. We discarded VBCC as we weren’t happy with the code generation in many cases, and we had to fix plenty of bugs in it when we shipped A1200 demos.

We ended up making a slightly custom build of bebbo’s GCC port. We simply changed the ABI to use more registers to make function calls quicker.

For example, here is the high-level init code for one of the parts:

static void Part_init() {
    CopperList_set_bitplanes(s_copperlist_template.bitplanes, &g_i2f_image, I2F_IMAGE_DEPTH, I2F_IMAGE_BPLSIZE);
    CopperList_set_palette(s_copperlist_template.colors, (u16*)&g_i2f_image[I2F_IMAGE_SIZE], I2F_IMAGE_COLORS);

    UAEMemProtect(g_morph_data, 9852, MEMPF_READ);

    // set upper colors to white for bitplane 5
    for (u16 i = 16; i < 32; ++i) {
        s_copperlist_template.colors[i].data = 0xfff;
    }

    _Static_assert((I2F_IMAGE_BPLSIZE * 3) < SHARED_SCREEN_SIZE, "ImageToFlatNew: Out of shared screen memory");

    u8* shared_buffer = DemoSys_get_shared_screen();
    s_render_buffers[0] = shared_buffer;
    s_render_buffers[1] = shared_buffer + I2F_IMAGE_BPLSIZE;
    s_fill_scratch = shared_buffer + (I2F_IMAGE_BPLSIZE * 3);

    DoubleCopper_create(
        (CopperListInstruction*)s_copper_list,
        (CopperListInstruction*)&s_copperlist_template,
        COPPER_SIZE,
        sizeof(s_copperlist_template));

    Sprites_disable();

    ImageToFlat_init(
        &s_morph_state,
        g_morph_data,
        SCREEN_WIDTH / 8,
        I2F_IMAGE_BPLSIZE,
        15, 7, 0);

    ImageToFlat_update(&s_morph_state);
    ImageToFlat_step_active(&s_morph_state);

    MulTable_generate(s_lines_mul_table, SCREEN_WIDTH / 8, I2F_IMAGE_HEIGHT);
}

This ended up working great. We had to write some glue inline assembly to expose certain m68k opcodes that GCC refuses to generate, but in the end it was a solid win. For example, here is how we generate a swap instruction where needed:

u32 inline m68k_swap(u32 value) {
    asm volatile ("swap %0" : "=d" (value) : "0" (value));
    return value;
}

We also wrote a preprocessor that enables us to include binary files in C, with full dependency tracking in the build system. This makes it super convenient to just drop in statements like these:

    INCBIN_CHIPMEM(u8, g_i2f_image, "Output/ImageToFlat/ImageToFlatPart1_Background_Indexed_16Colors.bin");
    INCBIN(u8, g_morph_data, "Output/ImageToFlat/image_to_flat.bin");

We generate an assembly file from these statements using a simple Python script, and that assembly file is then compiled separately.
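
A script of that kind might look roughly like the following Python sketch. It is an illustration under assumptions rather than our actual tool: the regular expression, the section names, the underscore-prefixed labels and the function name are all made up for this example.

```python
import re

INCBIN_RE = re.compile(
    r'^\s*(INCBIN(?:_CHIPMEM)?)\(\s*(\w+)\s*,\s*(\w+)\s*,\s*"([^"]+)"\s*\)\s*;')


def generate_incbin_asm(c_source: str) -> str:
    """Emit one labelled incbin per INCBIN/INCBIN_CHIPMEM statement found."""
    lines = []
    for text in c_source.splitlines():
        match = INCBIN_RE.match(text)
        if not match:
            continue
        macro, _ctype, symbol, path = match.groups()
        # Placeholder section names; chip memory data goes in its own section.
        section = "data_c" if macro == "INCBIN_CHIPMEM" else "data"
        lines += [
            f"\tsection\t{section},{section}",
            f"\txdef\t_{symbol}",
            f"_{symbol}:",
            f'\tincbin\t"{path}"',
            "\teven",
        ]
    return "\n".join(lines)


print(generate_incbin_asm('INCBIN(u8, g_morph_data, "Output/ImageToFlat/image_to_flat.bin");'))
```

The same regular expression also yields the list of referenced files, which is the information a build system needs for dependency tracking.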

fs-uae

One of the best technical choices we made during the project was to build our own emulator from source, and make it a first-class citizen of the code build system.

This means we can make changes to the emulator as we see fit and it will automatically rebuild for us when we kick our demo off.

First we had to find a version of UAE that would actually work on all platforms. It turns out fs-uae is basically the only one that does, but it comes with significant build baggage on Windows, and doesn’t build with MSVC out of the box. So the first task was to port the emulator to build with MSVC, including dependencies like glib and other stuff that was needed.

Once that was done we could move on to the features we were eyeing.

Warp

Whenever you boot something in the emulator to test it, it’s slow. You can manually hit F12+W to enable warp mode, but you have to remember to turn it off again, and that makes it hard to see what happens on the first frame of the effect you’re testing.

So the very first thing we did was add a calltrap to control the UAE warp mode from the Amiga side. We start the emulator in Warp mode, and have the startup code turn it off. This easily saved 10 seconds of iteration time whenever we kicked something off.

But that was just the start.

Emulator printf

The next thing I added was a calltrap to have the emulator format out a bunch of data from the Amiga side in a printf format and dump that to the console.

On the assembly side we have a Printf macro that pushes parameters to the stack (in our debug configuration) and then traps into the emulator with a simple jsr $f0ffxx. This is nice because it’s much faster than doing it on the Amiga side by calling something like exec’s RawDoFmt.

With this feature, it is super cheap to add diagnostics at bootup that only show up in the debugging config in the emulator and don’t slow things down at all. We can print stuff hundreds of times per frame without slowing things down noticeably. Starting up the demo on our emulator, the log shows something like this:

res_initcode context = 0x0
[   0.00000] Running under UAE.
TBL setwarp: 0
[   0.02857] Disabling caches
[   0.02869] Standalone loader initializing
[   0.02881] Calling entrypoint with sp=<0021bee4>
[   0.02886] Loader: chip_base    = <00010ef8>
[   0.02888] Loader: fast_base    = <003d4e00>
[   0.02889] Loader: loader_size  = 0
[   0.02889] Loader: tab_dsk_len  = 0
[   0.02890] Loader: loader table = <00226058>
[   0.02891] Loader: floppy0 adj  = 4828
[   0.02893] Loader: stacks_end=<003d4e00>
[   0.02895] Loader: loader fast base = 003d4e00 - 00452e00, 516096 bytes
[   0.02896] Loader: loader chip base = 00010ef8 - 00097e58, 552800 bytes
[   0.02898] Loader: ldr_alloc_permanent: grab 100800 bytes for sharedscreen (chip); remain=452000
[   0.02899] Loader: SharedScreen Chip at 00010ef8 - 000298b8
[   0.02900] Loader: ldr_alloc_permanent: grab 4 bytes for sharedscreen_protect (chip); remain=451996
Memwatch breakpoints enabled
    0: 00DFF000 - 00DFF1FF (512) RWI NONE
    1: 00000000 - 001FFFFF (2097152) RWI NONE
TBL MemProtect: address 000298b8 size 00000004, range (000298b8 - 000298bc), bits 00000000
[   0.03786] Loader: ldr_alloc_permanent: grab 128 bytes for sharedstate (fast); remain=515968
[   0.03790] Loader: SharedState Fast at 003d4e00 - 003d4e80
[   0.03792] Loader: ldr_alloc_permanent: grab 206568 bytes for music (chip); remain=245428
[   0.03793] Loader: ldr_alloc_permanent: grab 78984 bytes for music (fast); remain=436984
[   0.03794] Loader: Music Chip at 000298bc - 0005bfa4
[   0.03795] Loader: Music Fast at 003d4e80 - 003e8308
...

I also repurposed an old printf implementation I had lying around and added custom format codes to do things like print all the registers in the machine, which came in handy in a number of cases.
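
The host side of such a trap boils down to reading arguments out of emulated memory and formatting them. The Python sketch below shows the idea under stated assumptions: arguments are big-endian longwords with the first one at the lowest stack address, only %lx, %ld and %% are handled, and the function name is hypothetical. It is not the printf implementation we actually use.

```python
import re
import struct


def format_from_stack(fmt, memory, sp):
    """Format fmt using big-endian 32-bit arguments read upwards from sp."""

    def next_arg():
        nonlocal sp
        (value,) = struct.unpack_from(">I", memory, sp)
        sp += 4
        return value

    def replace(match):
        spec = match.group(1)
        if spec == "%":
            return "%"
        value = next_arg()
        if spec == "lx":
            return f"{value:x}"
        # "ld": reinterpret the longword as a signed value
        return str(value - (1 << 32) if value & 0x80000000 else value)

    return re.sub(r"%(%|lx|ld)", replace, fmt)


# Fake emulated memory: 8 bytes of padding, then two pushed longwords.
memory = bytes(8) + struct.pack(">II", 0x2A, 0xFFFFFFFF)
print(format_from_stack("Num bitplane masks: %lx", memory, 8))
```

All the string work happens on the host, so the emulated 68k only pays for the pushes and the jsr.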

Memory protection

Once we saw the possibilities of calltraps we started really brewing. What if we could leverage the memory breakpoint functionality of the UAE debugger and control that programmatically from the Amiga side? That way we could protect sensitive data in the track loader from stray writes and tackle issues that would normally take hours to debug much more quickly.

Said and done, we did that too. This turned out to be a major time saver for memory smash bugs and stray reads and writes.

The assembly interface is trivial:

MEMPF_READ	= $0001
MEMPF_WRITE	= $0002
MEMPF_DMAREAD	= $0004
MEMPF_DMAWRITE	= $0008
MEMPF_ALL	= MEMPF_READ | MEMPF_WRITE | MEMPF_DMAREAD | MEMPF_DMAWRITE

    ; UAEMemProtect address,size,bits

    IFD	NEWAGE_DEBUG
UAEMemProtect	macro
    move.l	#\3,-(sp)
    move.l	#\2,-(sp)
    move.l	\1,-(sp)
    jsr	$f0ff54
    lea.l	12(sp),sp
    endm

    ELSE

UAEMemProtect	macro
    endm

    ENDC

As you can see from the bit definitions above, we separate memory accesses into four groups:

CPU reads
CPU writes
DMA reads
DMA writes

This is essential for things like our track loader, which uses floppy DMA. While the DMA is writing to the memory, we don’t want anyone reading it, and vice versa. We made short work of many bugs using this feature, and I don’t think we would have shipped a quality demo without it.

When a memory access violates the protection mask, we break into the UAE debugger on the faulting instruction. This makes it relatively straightforward to solve memory corruption issues:

TBL MemProtect Access Trap
    PC:          0x002213bc
    Address:     0x00000004
    Access:      CPU_D_R
    Size:        4 bytes
    R/W:         read
    Access Prot: 0x01, Memory Prot: 0x04
    -- stub -- activate_console
    D0 00000002   D1 00000005   D2 00013488   D3 20FF00C0
    D4 003D4E00   D5 000326E8   D6 003D4E80   D7 0021EBB4
    A0 00221313   A1 0021BE64   A2 002213DA   A3 0021BD24
    A4 00F0FF54   A5 00DFF088   A6 00200804   A7 0021BD24
USP  0021BEE0 ISP  0021BD24 SFC  00000000 DFC  00000000
CACR 00000000 VBR  00000000 CAAR 00000000 MSP  00000000
T=00 S=1 M=0 X=0 N=0 Z=0 V=0 C=0 IMASK=0 STP=0
002213C0 4eae fdf6                JSR (A6, -$020a) == $002005fa
Next PC: 002213c4
>
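
The check that leads to a trap like the one above can be modelled in a few lines of Python. This is a sketch and not the emulator code: the MemProtect class is hypothetical, and treating the bits as the set of allowed access kinds is an assumption based on the log (a CPU read, 0x01, hitting memory whose protection is 0x04).

```python
MEMPF_READ = 0x0001
MEMPF_WRITE = 0x0002
MEMPF_DMAREAD = 0x0004
MEMPF_DMAWRITE = 0x0008


class MemProtect:
    """Toy model of range-based access checks (assumed semantics)."""

    def __init__(self):
        self.ranges = []  # (start, end_exclusive, allowed_bits)

    def protect(self, address, size, bits):
        self.ranges.append((address, address + size, bits))

    def check(self, address, size, access):
        """Return the first violated range, or None if the access is allowed."""
        for start, end, allowed in self.ranges:
            overlaps = address < end and address + size > start
            if overlaps and not (access & allowed):
                return (start, end, allowed)
        return None


# While floppy DMA fills a buffer, only DMA writes are allowed to touch it.
protect = MemProtect()
protect.protect(0x1000, 0x200, MEMPF_DMAWRITE)
print(protect.check(0x1010, 4, MEMPF_READ) is not None)  # a CPU read would trap
```

In this model a range with bits set to 0 acts as a guard that traps on any access.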

High resolution timers

Another easy calltrap interface we added was to get the number of CPU cycles since the machine powered on. This enables us to do simple scope-based profiling.
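
As a rough illustration of scope-based profiling on top of a cycle counter, here is a Python sketch. The ScopeProfiler class and the read_cycles callable are hypothetical stand-ins; on the Amiga side the counter would come from the calltrap, and how our own scope macros are implemented is not shown here.

```python
from contextlib import contextmanager


class ScopeProfiler:
    def __init__(self, read_cycles):
        self.read_cycles = read_cycles  # callable: cycles since power-on
        self.totals = {}                # scope name -> accumulated cycles

    @contextmanager
    def scope(self, name):
        start = self.read_cycles()
        try:
            yield
        finally:
            elapsed = self.read_cycles() - start
            self.totals[name] = self.totals.get(name, 0) + elapsed


# A fake counter stands in for the emulator in this example.
ticks = iter([100, 350, 400, 420])
profiler = ScopeProfiler(lambda: next(ticks))
with profiler.scope("decode"):
    pass
with profiler.scope("decode"):
    pass
print(profiler.totals)  # {'decode': 270}
```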

High-level timeline profiling

We also added a way to profile the loading, decompression and part timings on a high level using Chrome’s tracing view. Every run in our debug emulator will spit out a trace file compatible with this view.

Some notes to make sense of this:

The top row shows what part is active
The next row is frame timings (makes more sense zoomed in)
The third row is what part the loader is trying to pull in
The fourth row indicates what kind of activity the loader thread is doing (such as sample decompression, drum mixing, LZ4 decompression, checksumming, floppy DMA etc)
The next row indicates what pattern we’re at in the song
Finally the rest of the rows show what samples are in use, to debug issues with samples not coming in on time
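
For reference, a file that Chrome's tracing view can load is just JSON with a list of events. The Python sketch below builds "complete" events (ph set to X, with ts and dur in microseconds) and puts each kind of activity on its own row via tid. The helper names, row numbers and event names are made up for the example and are not taken from our emulator.

```python
import json


def complete_event(name, row, start_us, duration_us):
    """Build one complete event; each tid becomes a row in the viewer."""
    return {"name": name, "ph": "X", "pid": 1, "tid": row,
            "ts": start_us, "dur": duration_us}


def trace_json(events):
    """Serialize events in the JSON object form of the trace format."""
    return json.dumps({"traceEvents": events})


events = [
    complete_event("part: example", 1, 0, 2_000_000),
    complete_event("LZ4 decompression", 4, 150_000, 40_000),
    complete_event("floppy DMA", 4, 200_000, 300_000),
]
print(trace_json(events))
```

Writing that string to a file and loading it in the tracing view gives one bar per event, grouped into rows by tid.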

Instruction tracing profiling

We also decided to capture each instruction as it was flowing through the emulator, as well as long symbol data when we switch parts in the demo. This allows us to capture cycle correct profiling data as well as details about blits and other things that are in flight on the machine.

[Image: itrace profiler]
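
One thing an instruction trace plus symbol data makes possible is attributing cycles to routines. The Python sketch below shows that step in isolation, assuming the trace is a stream of (pc, cycles) pairs and the symbols are a sorted list of (start address, name); the function name, the addresses and that data layout are assumptions for illustration, not the format our tools use.

```python
from bisect import bisect_right
from collections import Counter


def cycles_per_symbol(symbols, trace):
    """symbols: sorted (start_address, name) pairs; trace: (pc, cycles) pairs."""
    starts = [address for address, _ in symbols]
    totals = Counter()
    for pc, cycles in trace:
        # The owning symbol is the last one starting at or before pc.
        index = bisect_right(starts, pc) - 1
        name = symbols[index][1] if index >= 0 else "<unknown>"
        totals[name] += cycles
    return totals


# Made-up addresses and cycle counts.
symbols = [(0x1000, "Part_init"), (0x1400, "DecodeBitStream_3Bits_Words_Decode")]
trace = [(0x1000, 4), (0x13FE, 8), (0x1400, 12)]
print(cycles_per_symbol(symbols, trace).most_common())
```

Having local symbols in the link output matters here, since every extra label narrows down where the cycles are attributed.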

Automatic disk swapping

In debug mode, the emulator swaps disks for you :)

readline interface

This is a small point, but if you spend any significant amount of time with the UAE command line debugger, it’s a really good idea to give it a readline interface. This means you keep history between sessions, and you can use Ctrl+R to reverse-search through the history as well, just like in bash.

uaekick

We use a little script called uaekick to test our work in the emulator. It has grown over the years and now supports a plethora of options. A couple of example invocations follow.

Kick off a stock 512+512 A500 configuration, with two floppies in the swap list:

uaekick t2-output/amiga-mac-debug-default/disk?.adf

Kick off a single executable on the same stock A500 config:

uaekick t2-output/amiga-mac-debug-default/TexturedRubberCubeTest

Configure memory on the fly:

uaekick --fastmem=2048 --slowmem=0 path/to/program_or_floppy

Configure kickstart on the fly:

uaekick --kickstart=path/to/kickstart31.rom path/to/program_or_floppy

Switch to an A1200:

uaekick --machine=a1200 --kickstart=path/to/kickstart31.rom path/to/program_or_floppy

This is a huge time saver, because you never have to go back and double check that you configured everything correctly.
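
A launcher with this kind of command line is easy to sketch with argparse. The option names below come from the invocations above, but everything else is an assumption for illustration: the defaults, the unit of the memory sizes, and telling floppies from executables by the .adf suffix. This is not the actual uaekick script.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="uaekick")
    parser.add_argument("--machine", default="a500")        # assumed default
    parser.add_argument("--fastmem", type=int, default=0)    # assumed default
    parser.add_argument("--slowmem", type=int, default=512)  # assumed default
    parser.add_argument("--kickstart", default=None)
    parser.add_argument("targets", nargs="+")  # floppies or one executable
    return parser.parse_args(argv)


def is_floppy_image(target):
    """Assumed rule: anything ending in .adf goes into the floppy swap list."""
    return target.lower().endswith(".adf")


args = parse_args(["--fastmem=2048", "--slowmem=0", "disk1.adf", "disk2.adf"])
floppies = [t for t in args.targets if is_floppy_image(t)]
print(args.machine, args.fastmem, args.slowmem, floppies)
```

Keeping the whole machine configuration on the command line means a test run can be reproduced by copying one line.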

Asset build pipeline

We use a simple Python tool to make sure assets are transformed into binary data we can include in parts. For example, here is the pipeline code for the alley part:

class AlleyScenePart3(PartData):
    def __init__(self):
        super(AlleyScenePart3, self).__init__('AlleyScenePart3', globals.source_data_root + '/AlleyScene', globals.destination_data_root + '/AlleyScenePart3')
        self.image_converter = ImageConverter()
        self.alembic_lines_convert = AlembicLinesConverter()

    def build_part(self):
        self.image_converter.convert_planar_bin_ocs(
            self.input_directory + '/AlleyBackgroundNoCar_8Colors_part2.png', self.output_directory, 8)
        self.image_converter.convert_amiga_sprite(
            globals.source_data_root + '/SwitchFloppy/diskswap2.png', self.output_directory, 16)
        self.alembic_lines_convert.convert(
            self.input_directory + '/NewAlleyScene_LinesAndCamera_Part2.abc',
            globals.source_data_root + '/SpaceAscend/SpaceAscend_Lightning_10.json',
            self.output_directory + '/lines.bin')

The data is then crunched as needed into a native format that the Amiga side can simply incbin into code.
