x86 has no fixed instruction boundaries; execution may be directed to any byte offset. This has a couple of cute implications for code golf (e.g. conditionally skipping past a prefix), but no practical use. Meanwhile, it opens up myriad possibilities for malicious actors to hide code in a manner that will not be picked up by traditional disassemblers or reverse-engineering tools, _without_ the need for runtime code generation (which may be disallowed by the platform). (A correct representation of x86 machine code is a directed graph of offsets, with a node at every offset, in which the nodes that may be executed are coloured. Due to undecidability, most programs containing indirect branches must colour all nodes. No disassembler that I know of uses such a representation, though I have tentative plans to build one.)
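To make the overlap concrete, here is a minimal sketch of a toy decoder. It is an illustration only: it assumes 32-bit mode, knows just the four opcodes used in the example, and the names `decode_one` and `linear_sweep` are made up for this post. The same five bytes decode as one instruction from offset 0 and as three different instructions from offset 1.

```python
import struct


def decode_one(code: bytes, off: int):
    """Return (text, length) for the instruction starting at `off`.

    Toy decoder: handles only the four opcodes used in this example.
    """
    op = code[off]
    if op == 0xB8:  # mov eax, imm32 (little-endian immediate follows)
        (imm,) = struct.unpack_from("<I", code, off + 1)
        return f"mov eax, {imm:#x}", 5
    if op == 0x31 and code[off + 1] == 0xC0:  # xor eax, eax
        return "xor eax, eax", 2
    if op == 0x90:
        return "nop", 1
    if op == 0xC3:
        return "ret", 1
    raise ValueError(f"opcode {op:#x} not handled by this toy decoder")


def linear_sweep(code: bytes, start: int):
    """Decode sequentially from `start` until the end of the buffer."""
    out = []
    off = start
    while off < len(code):
        text, length = decode_one(code, off)
        out.append((off, text))
        off += length
    return out


CODE = bytes([0xB8, 0x31, 0xC0, 0x90, 0xC3])

for start in (0, 1):
    print(f"entry at offset {start}:")
    for off, text in linear_sweep(CODE, start):
        print(f"  {off}: {text}")
```

A disassembler that only sweeps from offset 0 sees a single `mov`; the `xor`/`nop`/`ret` sequence hidden in its immediate is visible only to a tool that also considers offset 1 as an entry point.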
Multi-byte encodings do not have to be this way. UTF-8, for instance, is self-synchronizing, so it has code point boundaries; given an arbitrary offset into a UTF-8 stream, it is possible to tell without additional context whether that offset corresponds to the beginning of a code point.
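The UTF-8 check is a one-byte test, since continuation bytes always have the bit pattern `10xxxxxx`. A minimal sketch, assuming well-formed UTF-8 input (the function names are made up for illustration):

```python
def is_codepoint_start(data: bytes, offset: int) -> bool:
    """True if the byte at `offset` begins a code point.

    Continuation bytes look like 0b10xxxxxx; any other byte is a lead byte
    (or a single-byte ASCII character).
    """
    return (data[offset] & 0xC0) != 0x80


def resync(data: bytes, offset: int) -> int:
    """Advance from an arbitrary offset to the next code point boundary."""
    while offset < len(data) and not is_codepoint_start(data, offset):
        offset += 1
    return offset


text = "a€b".encode("utf-8")  # b'a\xe2\x82\xacb': '€' takes three bytes

print([is_codepoint_start(text, i) for i in range(len(text))])
print(resync(text, 2))  # offset 2 is inside '€'; the next boundary is 'b'
```

No state from earlier in the stream is needed, which is exactly the property x86 machine code lacks.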
Might I suggest ForwardCom move to a self-synchronizing encoding? E.g. code words have their high bit set and constant words do not. It is slightly less space-efficient, but only slightly, and the benefits to the transparency of compiled code are IMHO worth it. There may be benefits for the hardware as well, in not having to worry about potentially overlapping instructions.
(To allow for full 32/64-bit immediates, the code word can contain the high bits of the continuation words; but most instructions should probably default to 31/62-bit immediates, which suffice for most purposes.)
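A minimal sketch of what I have in mind, for the 32-bit immediate case. The field layout here is entirely hypothetical (it is not ForwardCom's actual format): bit 31 of every 32-bit word marks it as a code word, the opcode field is 30 bits wide, and bit 0 of the code word carries the high bit of the immediate so that the constant word can keep its own high bit clear.

```python
HIGH = 1 << 31       # set on code words, clear on constant words
MASK31 = HIGH - 1    # low 31 bits of a word


def encode(opcode: int, imm32: int) -> list:
    """Encode one instruction as [code word, constant word] (hypothetical layout)."""
    assert 0 <= opcode < (1 << 30)
    assert 0 <= imm32 < (1 << 32)
    code_word = HIGH | (opcode << 1) | (imm32 >> 31)  # imm bit 31 lives here
    const_word = imm32 & MASK31                       # imm bits 0..30
    return [code_word, const_word]


def is_instruction_start(words: list, i: int) -> bool:
    """Self-synchronization test: needs only the word at index i."""
    return bool(words[i] & HIGH)


def decode(words: list, i: int):
    """Return (opcode, imm32); refuse to decode from the middle of an instruction."""
    if not is_instruction_start(words, i):
        raise ValueError(f"word {i} is a constant word, not an instruction start")
    opcode = (words[i] & MASK31) >> 1
    imm32 = ((words[i] & 1) << 31) | words[i + 1]
    return opcode, imm32


stream = encode(5, 0xDEADBEEF) + encode(7, 0x12345678)

print([is_instruction_start(stream, i) for i in range(len(stream))])
print([hex(v) for v in decode(stream, 0)])
```

A jump into word 1 or word 3 can be rejected by looking at that word alone, so no sequence of words can be read as two different, overlapping instruction streams. The cost is visible too: the 32-bit immediate is split across two fields.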
Instruction boundaries
Moderator: agner
Re: Instruction boundaries
Thanks for the proposal.
This has been proposed before. It would be difficult to find space for the extra bits which might be better used for other purposes, and it will make linkers, loaders, and other tools more complicated if they have to split 32-bit and 64-bit constants into non-contiguous fields.
The ForwardCom disassembler has no problem identifying instruction boundaries, unlike x86 disassemblers.
I don't know any examples of malicious x86 code trying to obfuscate instruction boundaries to disassemblers, though I agree that it is possible. Do you have any examples? Virus scanners are not necessarily relying on disassembly.