The directhex! In an Adventure with MIPS

We’ve been trying to build Mono on MIPS in Debian for a long time. Just under a decade, in fact. Mono 0.29.99.20040114-3 was the first attempt, back when the Mono source was 10 meg, not today’s ~80.

It never worked though. Not once. At the end of 2004, 10 upstream versions later, we gave up, and turned off mipsel as a target architecture in the package.

When I became a Debian Developer in 2009, one of the first things I did was try again with Mono on various previously unsupported architectures, by running a build on the Debian porterboxes. Supposedly MIPS support was there, as evidenced by all the MIPS-related files in the source tree. But I’d try it, and it’d fail, and upstream would say “works for me”. I’d repeat the process every major release or so, to see what had changed.

The breakthrough

Spurred by the removal of two bitrotted architectures (IA64 and SPARC), I decided to try again, this time digging deeper into the reason for the failures, with the help of upstream developer Alex Rønne Petersen. A couple of hours of IRC-driven gdb and test program disassembly later, a seemingly innocent comment flagged something in my brain:

09-07-2013 15:10:11 < alexrp!~zor@baldr.rfw.name: TIL that mul and mult are not the same thing on mips

Why is this notable? Well, MIPS processors lack a whole bunch of instructions which are commonly used in assembler. MUL is one of them – it’s valid MIPS assembler, which the assembler expands to MULT/MFLO. Call it a macro, or a mnemonic, or shorthand – the preferred term is “pseudoinstruction”. So what’s the issue?

09-07-2013 15:18:27 > directhex: mono isn't trying to use a mul instruction, right? i mean, that instruction doesn't exist as far as the cpu is concerned, it's a macro the compiler does things with

See where this is headed?

09-07-2013 15:23:59 > directhex: mini/mini-mips.c:#define USE_MUL 1 /* use mul instead of mult/mflo for multiply */

Argh.

See, the thing about MIPS pseudoinstructions is they may be real instructions on a given CPU implementation. MUL isn’t part of the older MIPS instruction sets (it only became standard with MIPS32), but a given CPU might have it anyway, to make multiplication a little faster (by using only one instruction, not two). In this case, the Debian MIPS infrastructure is based around ICT Loongson-2E processors which don’t have that instruction – but the upstream Mono developers were building and testing on a CPU that does, never seeing the issue themselves.

Flipping that define to 0 (and amending the instruction length setting in another file) fixed the build. Mono was running on MIPS for me for the first time ever.
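To make the difference concrete, here is a rough sketch in C of what a JIT ends up writing for each choice. It is not Mono's code: the function names are made up for illustration, the USE_MUL toggle only mirrors the define quoted above, and the opcode and function numbers are the ones I know from the MIPS32 instruction reference (SPECIAL2/2 for mul, SPECIAL/0x18 for mult, SPECIAL/0x12 for mflo).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toggle, modelled on the USE_MUL define in mini-mips.c. */
#ifndef USE_MUL
#define USE_MUL 0
#endif

/* Pack the six fields of a MIPS R-type instruction into one 32-bit word. */
static inline uint32_t r_type(uint32_t op, uint32_t rs, uint32_t rt,
                              uint32_t rd, uint32_t sa, uint32_t funct)
{
    assert(op < 64 && rs < 32 && rt < 32 && rd < 32 && sa < 32 && funct < 64);
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct;
}

/* mul rd, rs, rt: a single instruction, but only on CPUs that implement it. */
static inline uint32_t enc_mul(uint32_t rd, uint32_t rs, uint32_t rt)
{
    return r_type(0x1c, rs, rt, rd, 0, 0x02);
}

/* mult rs, rt: the product goes to the HI/LO register pair. */
static inline uint32_t enc_mult(uint32_t rs, uint32_t rt)
{
    return r_type(0, rs, rt, 0, 0, 0x18);
}

/* mflo rd: copy LO into a general-purpose register. */
static inline uint32_t enc_mflo(uint32_t rd)
{
    return r_type(0, 0, 0, rd, 0, 0x12);
}

/* Emit rd = rs * rt into out and return the number of words written. */
static inline int emit_multiply(uint32_t *out, uint32_t rd,
                                uint32_t rs, uint32_t rt)
{
#if USE_MUL
    out[0] = enc_mul(rd, rs, rt);
    return 1;
#else
    out[0] = enc_mult(rs, rt);
    out[1] = enc_mflo(rd);
    return 2;
#endif
}
```

With USE_MUL left at 0, emit_multiply(buf, 2, 4, 5) writes two words, 0x00850018 (mult $a0,$a1) and 0x00001012 (mflo $v0), where the single-instruction form would be 0x70851002. The fallback is twice as long, which is presumably why the instruction length setting had to change along with the define.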

Digging through the history in git showed just how annoying this implementation quirk was. USE_MUL was added in late 2008 – replacing a previously used “#if 1”. The mult/mflo version of the code existed in the Mono source since the first time the full MIPS port was committed in 2006, but we never saw it.

The breakage

So, with that patched to work, I added mipsel to the Experimental build of Mono… which still failed. The runtime would build fine, but the class library build would fail at random times, with random meaningless stack traces. Unrepeatable. Some kind of race condition. The build would eventually succeed if I hammered “make” a few times, but that’s no good for the Debian build daemons. Back to square one…

… except I had an epiphany yesterday. I have heard more than once that Loongson processors are missing a few instructions. What if one of those was being hit, intermittently? I started searching in places that might need to work around that kind of issue, and found this: a patch to binutils from 2009, which replaces one no-op instruction with another when /usr/bin/as is fed the -mfix-loongson2f-nop flag.

Turns out NOP is another pseudoinstruction on MIPS. Well, more of an alias. The opcode 0x00000000 is “Shift Left Logical” with register zero as both source and destination and a shift amount of 0, which is a no-op. But on all but the latest generation of Loongson-2F chips, that opcode can, under heavy load, fail – causing inconsistent state in the CPU registers. The flag to “as” replaces “sll 0,0,0” with “or $at,$at,$zero”, which is also a no-op instruction, but doesn’t trigger the failure on Loongson-2F chips (and 2E chips, although that’s not stated in the documentation).
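Both no-op encodings are easy to work out by hand. A minimal sketch in C, assuming the standard R-type field layout; the helper names are invented for illustration and are not Mono's or binutils' own:

```c
#include <assert.h>
#include <stdint.h>

enum { REG_ZERO = 0, REG_AT = 1 };   /* $zero and $at register numbers */

/* Pack the six fields of a MIPS R-type instruction into one 32-bit word. */
static inline uint32_t r_type(uint32_t op, uint32_t rs, uint32_t rt,
                              uint32_t rd, uint32_t sa, uint32_t funct)
{
    assert(op < 64 && rs < 32 && rt < 32 && rd < 32 && sa < 32 && funct < 64);
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct;
}

/* The classic no-op: sll $zero,$zero,0 (function code 0), i.e. all bits clear. */
static inline uint32_t enc_nop_sll(void)
{
    return r_type(0, REG_ZERO, REG_ZERO, REG_ZERO, 0, 0x00);
}

/* The replacement no-op: or $at,$at,$zero (function code 0x25).
 * OR-ing $at with $zero and writing it back leaves $at unchanged. */
static inline uint32_t enc_nop_or_at(void)
{
    return r_type(0, REG_AT, REG_ZERO, REG_AT, 0, 0x25);
}

/* Hypothetical selector: pick the no-op word a code generator should emit. */
static inline uint32_t pick_nop(int loongson2_workaround)
{
    return loongson2_workaround ? enc_nop_or_at() : enc_nop_sll();
}
```

The classic form comes out as 0x00000000 and the replacement as 0x00200825. Anything that writes the first word into memory itself, rather than going through “as”, never gets the substitution.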

As long as ALL your programs get fed through “as”, you don’t have a problem, since it uses the replacement opcode… but what if you use a JITter to generate your own opcodes? Oh fuck, it couldn’t be…

diff --git a/mono/arch/mips/mips-codegen.h b/mono/arch/mips/mips-codegen.h
index dc4df7d..1dbd1c6 100644
--- a/mono/arch/mips/mips-codegen.h
+++ b/mono/arch/mips/mips-codegen.h
@@ -334,7 +334,7 @@ enum {
 /* misc and coprocessor ops */
 #define mips_move(c,dest,src) mips_addu(c,dest,src,mips_zero)
 #define mips_dmove(c,dest,src) mips_daddu(c,dest,src,mips_zero)
-#define mips_nop(c) mips_sll(c,0,0,0)
+#define mips_nop(c) mips_or(c,mips_at,mips_at,0)
 #define mips_break(c,code) mips_emit32(c, ((code)<<6)|13)
 #define mips_mfhi(c,dest) mips_format_r(c,0,0,0,dest,0,16)
 #define mips_mflo(c,dest) mips_format_r(c,0,0,0,dest,0,18)

Oh yes it could! Mono was using sll 0,0,0 (the recommended no-op instruction from the MIPS instruction reference manual), causing failures in my tests, because Debian’s build and test infrastructure just happens to use defective silicon. And, again, upstream were unable to reproduce a problem because they use better silicon than we do.

So what now?

Well, last night I uploaded mono_3.2.3+dfsg-3, which includes the above patch to force the replacement no-op instruction. It test-built fine on the porterbox, and it should (when the damn experimental buildd gets around to it) just work.

Finally.

After just under a decade, Mono packages will be available on MIPS in Debian.

And after all this time, all we had to change was 4 lines to work around 7-year-old Chinese knock-off processors.

The edit

So, things are finally built.

It turns out that despite everything, the replacement NOP opcode is not enough.

If you re-read the post to the binutils list, pay close attention to:

In theory this is still not enough to fully eliminate possible hangs, but the possiblity is extremely low now and hard to be hit in real code.

It’s a filthy lie. It’s easy to hit the issue in real code: just do a from-source build of the whole Mono class library. With the replacement instruction it builds .NET 2.0, 3.0, 3.5, 4.0, and most of 4.5, before dying in the same way as before – an improvement on failing early in the 2.0 build, but not enough.

Thankfully, 2 out of the 5 Debian mipsel build servers are not Loongson 2 – they’re 11-year-old Broadcom SWARM developer boards. Not fast – but also not broken. Luck smiled on me, and sent my build to one of these Broadcom machines. As a result…

(experimental_mipsel-dchroot)directhex@eder:~$ mono --version | head -1
Mono JIT compiler version 3.2.3 (Debian 3.2.3+dfsg-3)

It’s been a long time coming.

7 Responses to “The directhex! In an Adventure with MIPS”

  1. Now, that’s the kind of bug hunting that I love and I congratulate you for the tenacity in keeping with the task.

    It must have been super satisfying fixing these puzzling bugs.

    I have done some things in the past with PowerPC and I really felt proud, even if the whole world would not even care.

    Again, congrats. You deserve it!

  2. Wow man, that’s an amazing bug hunting story! Congratulations!

  3. I tried to cross-compile for MIPS (Fritzbox Router 7390) with the latest Mono 3.4.0. “make” was successful, but a simple DateTime.Now.ToString() results in an error:

    Unhandled Exception:
    System.TypeInitializationException: An exception was thrown by the type
    initializer for System.DateTime ---> System.ArgumentOutOfRangeException: Value
    734668917 is outside the valid range [0,734668917].
    Parameter name: ticks
    at System.DateTime.InvalidTickValue (Int64 ticks) [0x00000] in <filename unknown>:0
    at System.DateTime..ctor (Int64 ticks) [0x00000] in <filename unknown>:0
    at System.DateTime..cctor () [0x00000] in <filename unknown>:0
    --- End of inner exception stack trace ---
    at ConsoleApplication2.Program.Main (System.String[] args) [0x00000] in <filename unknown>:0
    [ERROR] FATAL UNHANDLED EXCEPTION: System.TypeInitializationException: An
    exception was thrown by the type initializer for System.DateTime --->
    System.ArgumentOutOfRangeException: Value 734668917 is outside the valid range
    [0,734668917].
    Parameter name: ticks
    at System.DateTime.InvalidTickValue (Int64 ticks) [0x00000] in <filename unknown>:0
    at System.DateTime..ctor (Int64 ticks) [0x00000] in <filename unknown>:0
    at System.DateTime..cctor () [0x00000] in <filename unknown>:0
    --- End of inner exception stack trace ---
    at ConsoleApplication2.Program.Main (System.String[] args) [0x00000] in <filename unknown>:0

    Any ideas what goes wrong here?

    directhex Reply:

    @Alex, Big or little endian MIPS? I haven’t had success with Mono on big-endian MIPS.

  4. It was big-endian. We could run a Hello World, but there are problems with basic datatypes like long etc…

    See the German discussion: http://www.ip-phone-forum.de/showthread.php?t=268152&p=2002653
    or
    https://bugzilla.xamarin.com/show_bug.cgi?id=7981

    Do you think there is a chance to get it to work?

    directhex Reply:

    @Alex,

    A major problem is availability of test hardware for the Mono team. For ARM, there are a hundred dev boards for under $100, but what options are there for MIPS? A Loongson for $300, which is little-endian only?

  5. I understand, but maybe QEMU is an option which comes for free…
    e.g.:
    http://www.aurel32.net/info/debian_mips_qemu.php
