Thursday, December 17, 2009

512K! Time for Diet Code!

When I read Cearn's article on the size of C++ binaries for the NDS, I must admit I had a slight smirk on my face. It's well known: C++ needs a "run-time", that nearly unavoidable body of code that lets the language work by providing, e.g., exception handling. However, when I saw yesterday, during the latest update of runme, that the executable had swollen to 512K, I stopped smiling.

On the DS code scale, to me, 512K reads "Go for a run!"

I pulled up Cearn's page again (he is the author of TONC, my first reference for GBA programming) and tried to improve on his technique to find out where this sudden increase came from. Well, to begin with, I was still compiling without any optimisation (-O0, to ease the debugging of these last weeks). Fixing that brings us back down to around 420K. Phew. I carried on, going from the output of nm to the disassembler to find out who calls the functions I didn't write myself, such as d_print_comp or _dtoa_r... After a lot of digging, I eventually identified two functions at the root of a good share of the overhead. Taking advantage of the way programs are linked against libraries, I replaced these functions with "hollow code", thereby short-circuiting a good 40K of run-time support:

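/** these are heavy guys from the lib i want to strip out. **/
extern "C" char* _dtoa_r(_reent*, double, int, int, int*, int*, char**) {
  // double-to-ASCII helper pulled in by the printf family; report a runtime error instead.
  die(__FILE__, __LINE__);
  return 0; // only reached if die() returns
}

extern "C" char* __cxa_demangle(const char* mangled_name,
                                char* output_buffer, size_t* length,
                                int* status) {
  if (status) *status = -2; // -2: mangled_name is not a valid mangled name
  return 0;                 // no demangled string available
}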


And very kindly, Cearn himself showed quite some interest in my results.

Nice! How did you find these? That is to say, how did you find out these are the ones at the top of it and can safely be removed?

There are also two others that I found out about recently: __cxa_guard_acquire() and __cxa_guard_release(). They were introduced when I tried to create a local const array using data from a global constant instance that used templates. I'm assuming these call __cxa_demangle() at some point, but I can't be sure.

So here is my reply, "how I went about it":

$sylvain> nm -S --demangle runme/arm9/runme.arm9.elf | sort -k 2 | grep '^[0-9a-f]\+ [0-9a-f]\+ [^B] ' --color=always | less -R

is how I find the "heavy hitters". I know that __cxa_* is run-time support. For _dtoa_r, I was suspicious because the disassembled code featured many calls to builtin_(insert arithmetic function name here)* things, although I'm not doing any floating-point work internally. I tried replacing all the *printf with *iprintf and *scanf with *iscanf, and then I realised that at some point both families actually call the same internal function, which contains the full logic for floating point as well (maybe it's due to vsniprintf, and iprintf alone would be just fine).
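For readers who prefer to post-process the listing in code, here is a minimal sketch of the same filtering idea. It assumes the usual `nm -S` line layout of "address size type name"; the `Symbol` struct and the `heavy_hitters` helper are hypothetical names made up for this illustration, not part of any toolchain. Unlike the pipeline above, it sorts largest first.

```cpp
#include <algorithm>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// One symbol line from `nm -S`, assumed to read "address size type name".
struct Symbol {
    unsigned long long size;  // symbol size in bytes
    char type;                // nm type letter (T, t, D, B, ...)
    std::string name;         // may contain spaces once demangled
};

// Hypothetical helper: keep symbols that have a size and are not of
// type 'B', then order them largest first.
std::vector<Symbol> heavy_hitters(const std::string& nm_output) {
    std::vector<Symbol> symbols;
    std::istringstream lines(nm_output);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string address, size, type, name;
        if (!(fields >> address >> size >> type)) continue;    // too few columns
        if (type.size() != 1 || type[0] == 'B') continue;      // same filter as [^B]

        // The size column must be a complete hexadecimal number.
        char* end = nullptr;
        unsigned long long bytes = std::strtoull(size.c_str(), &end, 16);
        if (end == size.c_str() || *end != '\0') continue;

        std::getline(fields >> std::ws, name);                 // rest of the line
        symbols.push_back(Symbol{bytes, type[0], name});
    }
    std::stable_sort(symbols.begin(), symbols.end(),
                     [](const Symbol& a, const Symbol& b) { return a.size > b.size; });
    return symbols;
}
```

Feeding it the text produced by `nm -S --demangle` should put the biggest functions and data objects at the front of the returned vector, which is where the suspects tend to be.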

I then gathered from the library contents that "dtoa" most likely means "double-to-ASCII", and decided to replace it with a "runtime error report" function. So far, this hasn't affected the functionality of the code.

The story for __cxa_demangle was more complicated. Initially, the function I was suspicious about was d_print_comp. Again, I tried disassembling and tracing back "who calls that", but it turned out that no one actually called it directly (that is, it is a virtual function of some sort, called only through a pointer, and the content is statically defined). Then I scanned lib*.a for a hit on d_print_comp (DS/dka-r21/arm-eabi/lib/libsupc++.a, if you ask), which revealed that the symbol comes from cp-demangle.o, where d_print_comp is a "static" (internal) symbol and __cxa_demangle is the only "external" symbol. I further googled for information on __cxa_demangle, and found http://idlebox.net/2008/0901-stacktrace-demangled/cxa_demangle.htt, where I found the error codes, the full function prototype, etc. I gave "status=-2" a try with a dummy exception, and all of a sudden it reports 16iScriptException as caught, while the code shrank by ~100K. Bingo.
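To illustrate why a stub that reports -2 is harmless to a well-behaved caller, here is a small sketch of the "fall back to the raw name" pattern. The names `demangle_fn`, `stub_demangle` and `readable_name` are hypothetical and exist only for this example; the demangler is passed in as a function pointer so the behaviour can be tried on a desktop machine without touching the real library symbol.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Function pointer type with the same parameter list as __cxa_demangle.
using demangle_fn = char* (*)(const char*, char*, std::size_t*, int*);

// Hypothetical stand-in for the hollow replacement: it never demangles
// anything and always reports status -2.
char* stub_demangle(const char*, char*, std::size_t*, int* status) {
    if (status) *status = -2;
    return nullptr;
}

// Returns the demangled name when the demangler succeeds, and the raw
// mangled name otherwise.
std::string readable_name(const char* mangled, demangle_fn demangle) {
    int status = 0;
    char* out = demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;          // show it raw to the user
    }
    std::string result(out);
    std::free(out);              // a successful result is malloc()-allocated
    return result;
}
```

With the stub plugged in, `readable_name("16iScriptException", stub_demangle)` simply hands back `16iScriptException`, which matches the raw name seen in the report above.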

Similarly, I identified __cxa_terminate(), which I successfully replaced using std::set_terminate(my_terminator); the terminate handler is responsible for handling uncaught exceptions. I don't feel like just abort()ing a DS program, so now when that happens, I fall back to a "press A to return to moonshell, B to download software upgrade" menu.
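As a rough sketch of that mechanism, a custom terminate handler can be installed as follows. Only std::set_terminate and std::get_terminate are standard here; the body of `my_terminator` and the `install_terminator` helper are assumptions for illustration. On a desktop the handler just prints a message and aborts, whereas on the DS it would be the place to show the recovery menu.

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>

// Hypothetical handler body: the real one would display the
// "press A / press B" menu instead of printing to stderr.
void my_terminator() {
    std::fputs("uncaught exception: entering recovery menu\n", stderr);
    std::abort();  // a terminate handler must never return to its caller
}

// Installs the handler and reports whether it is now the active one.
bool install_terminator() {
    std::set_terminate(my_terminator);
    return std::get_terminate() == my_terminator;
}
```

Calling `install_terminator()` early in main() is enough for any later uncaught exception to end up in `my_terminator`.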

Yet, I'm not keen on killing __cxa_guard_acquire() and __cxa_guard_release(). They are required to ensure you hold a lock (guard) during static initialisation. They're defined in guard.o, and from what I see of the nm output on libsupc++.a, they rely on throw, unwind, class-type-info, etc., but not on __cxa_demangle directly. Even if they did, I did not _remove_ __cxa_demangle here; I just replaced it with an "oh, sorry. I cannot demangle that for you. How about showing it raw to the user?"
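For context, here is a minimal sketch of the kind of code that makes the compiler emit those guards: a function-local static whose initialiser has to run at run time. The names `compute_seed`, `seed` and `init_count` are made up for this illustration. GCC also offers the -fno-threadsafe-statics option, which is documented as dropping the thread-safe initialisation code; whether that is acceptable depends on the program, so treat it as something to evaluate rather than a recommendation.

```cpp
// Counts how many times the initialiser actually ran.
int init_count = 0;

// Hypothetical run-time initialiser: because it is an ordinary function
// call, the compiler cannot simply store a constant in the binary.
int compute_seed() {
    ++init_count;
    return 42;
}

int seed() {
    // A function-local static with a dynamic initialiser: it is set up
    // on the first call only, and the guard protects that first call.
    static const int value = compute_seed();
    return value;
}
```

However many times `seed()` is called, `compute_seed()` runs once; keeping that guarantee is exactly the job of the guard functions.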

