[Dailydave] Code analysis and scale

Halvar Flake HalVar at gmx.de
Sun Sep 6 12:02:21 EDT 2015


Hey all,
 
While I really should not be posting here during my kinda-sabbatical, the ocean
is entirely flat today and I don't feel like doing real work - so posting to DD is a 
nice middle ground.
 
There was a period in my life where at each and every conference I attended, some
bright and very motivated youngster would come up to me and excitedly tell me about
this new reverse engineering framework he was building - usually in Python or Ruby - where
everything was an object, and it would all be so great once development got a bit further.
 
Over the years, I must have heard about 10+ such frameworks, and each time the
authors eventually ran into the same problem: Binaries are larger than people think,
and your RAM is more limited than you think.
 
A larger real-world application will, once all dependencies are loaded and mapped
into its address space, easily exceed 100 megs of executable code. With x86_64
instructions averaging a bit above 4 bytes, we are quickly talking about 25m+ instructions.
 
If, for some bizarre reason, you are confined to a 32-bit process, you have 3GB of
address space to distribute among 25m+ instructions, which means that in the best
case you can afford to spend 128 bytes per instruction - not counting heap overhead.
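 
For concreteness, a quick sketch of that arithmetic - the 4.2-byte average instruction
length is just my stand-in for "a bit above 4 bytes":
 
    # Back-of-envelope: instruction count and the per-instruction budget
    # of a 32-bit process (assumed figures, not measurements).
    code_bytes   = 100 * 2**20        # ~100 megs of mapped executable code
    avg_insn_len = 4.2                # x86_64 average, "a bit above 4 bytes"
    insn_count   = code_bytes / avg_insn_len
    print(insn_count)                 # ~25 million instructions

    addr_space = 3 * 2**30            # 3GB of usable 32-bit address space
    print(addr_space / insn_count)    # ~129 bytes per instruction, before heap overhead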
 
On my machine, an empty Python dictionary takes 280 bytes, an empty string 37.
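 
(Measured with sys.getsizeof; the exact values depend on interpreter version and build,
and getsizeof does not count memory reachable through the object, so real usage is
higher still.)
 
    import sys
    print(sys.getsizeof({}))   # empty dict   - 280 bytes on the 64-bit build used here
    print(sys.getsizeof(""))   # empty string -  37 bytes on the same build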
 
In a more realistic scenario, you have 32 GB of RAM in your machine, which gives you
a bit more than 1k of memory per instruction. That should be plenty, no?
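 
Same sketch for this case, reusing the ~25 million instruction estimate from above:
 
    ram = 32 * 2**30        # 32 GB of RAM
    print(ram / 25e6)       # ~1374 bytes per instruction - "a bit more than 1k"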
 
Not so much - if you want to perform any sophisticated analysis on code, you will want
to associate some approximation of the program state with each program point, and
the number of program points for which a reasonable approximation fits in 1k or less
is not going to be large.

Where am I going with all this rambling?
 
While machine code is not "big data" in the modern, search-enginey-social-networky sense,
real-world programs are "not small data" - as soon as you wish to associate extra
information with parts of the program, you will quickly exceed the ability to keep it all in
memory on a single machine - provided you analyse something "real" instead of 
notepad.
 
It is interesting that there are no distributed static analysis frameworks yet - and how easy
it is to conveniently forget about scale issues when "architecting" (i.e. dreaming about)
the reverse engineering framework one would like to have.
 
Cheers,
Halvar
PS: It is possible that the successes of fuzzing are mainly due to the
fact that it happens to be embarrassingly parallel.
