Instruction boundaries
Posted: 2021-11-12, 10:00:07
x86 has no instruction boundaries; execution may be directed to any byte offset. This has a couple of cute implications for code golf (e.g. conditionally skipping past a prefix), but no practical use. Meanwhile, it opens up myriad possibilities for malicious actors to hide code in a manner which will not be picked up by traditional disassemblers or reverse-engineering tools, _without_ the need for runtime code generation (which may be disallowed by the platform). (A correct representation of x86 machine code is a directed graph of offsets, where there is a node at every offset, and those nodes which may be executed are coloured. Because the targets of indirect branches are undecidable in general, most programs containing indirect branches must colour all nodes. No disassembler that I know of uses such a representation, though I have tentative plans to build one.)
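To make the graph-of-offsets idea concrete, here is a minimal sketch in Python. It does not decode real x86: the four-opcode instruction set (HALT, NOP, LOAD, JMP) and all function names are made up for illustration. It builds a node for every byte offset, colours the offsets reachable from the entry point, and compares that with what a naive linear sweep reports.

```python
from typing import Dict, List, Set

# Hypothetical toy ISA (NOT x86), used only to illustrate the idea:
#   0x00        HALT  (1 byte, no successor)
#   0x01        NOP   (1 byte, falls through)
#   0x02 imm8   LOAD  (2 bytes, falls through)
#   0x03 rel8   JMP   (2 bytes, target = offset + 2 + signed rel8)
LENGTHS = {0x02: 2, 0x03: 2}  # every other byte is treated as 1 byte long


def successors(code: bytes, off: int) -> List[int]:
    """Offsets that execution may reach after decoding at `off`."""
    op = code[off]
    if op == 0x01:
        nxt = [off + 1]
    elif op == 0x02:
        nxt = [off + 2]
    elif op == 0x03 and off + 1 < len(code):
        rel = code[off + 1]
        if rel >= 0x80:  # interpret rel8 as a signed byte
            rel -= 0x100
        nxt = [off + 2 + rel]
    else:  # HALT, invalid opcode, or truncated JMP
        nxt = []
    return [t for t in nxt if 0 <= t < len(code)]


def build_graph(code: bytes) -> Dict[int, List[int]]:
    """One node per byte offset, whether or not it is ever executed."""
    return {off: successors(code, off) for off in range(len(code))}


def colour(graph: Dict[int, List[int]], entry: int) -> Set[int]:
    """Colour every offset reachable from `entry`."""
    seen: Set[int] = set()
    stack = [entry]
    while stack:
        off = stack.pop()
        if off not in seen:
            seen.add(off)
            stack.extend(graph[off])
    return seen


def linear_sweep(code: bytes) -> List[int]:
    """Instruction starts as a naive start-to-end disassembler sees them."""
    starts, off = [], 0
    while off < len(code):
        starts.append(off)
        off += LENGTHS.get(code[off], 1)
    return starts


# JMP +1 lands on offset 3, which a linear sweep treats as LOAD's immediate.
code = bytes([0x03, 0x01, 0x02, 0x00, 0x00])
executed = colour(build_graph(code), entry=0)
swept = linear_sweep(code)
```

Here the linear sweep reports instruction starts at offsets 0, 2 and 4, while the offsets actually executed are 0 and 3. Offset 3 sits inside what the sweep believes is a LOAD instruction. An indirect jump would have no statically known target, so a conservative version of `colour` would have to mark every node.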
Multi-byte encodings do not have to be this way. UTF-8, for instance, is self-synchronizing, so it has code point boundaries; and, given an arbitrary offset into a UTF-8 stream, it is possible to tell without additional context whether that offset corresponds to the beginning of a code point.
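The UTF-8 property is easy to check in code. In the sketch below (the function names are mine, and it assumes well-formed UTF-8 input), a byte starts a code point exactly when it is not a continuation byte of the form `10xxxxxx`.

```python
def is_code_point_start(data: bytes, offset: int) -> bool:
    """True if data[offset] begins a code point (well-formed UTF-8 assumed)."""
    # Continuation bytes look like 0b10xxxxxx; any other byte is a lead byte.
    return (data[offset] & 0xC0) != 0x80


def resync(data: bytes, offset: int) -> int:
    """Step backward from `offset` to the start of the enclosing code point."""
    while offset > 0 and not is_code_point_start(data, offset):
        offset -= 1
    return offset


text = "aé€".encode("utf-8")  # 1-, 2- and 3-byte code points, 6 bytes total
starts = [i for i in range(len(text)) if is_code_point_start(text, i)]
```

No state from earlier in the stream is needed: each byte alone says whether it is a boundary, and `resync` never has to step back more than three bytes in well-formed input.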
Might I suggest ForwardCom move to a self-synchronizing encoding? E.g. code words have their high bit set and constant words do not. It is slightly less space-efficient, but only slightly. And the benefits for the transparency of compiled code are imho worth it. There may be benefits for the hardware as well, in not having to worry about potentially overlapping instructions.
(To allow for full 32/64-bit immediates, the code word can contain the high bits of the continuation words; but most instructions should probably default to 31/62-bit immediates, which suffice for most purposes.)
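A rough sketch of what such an encoding could look like, in Python. The field layout is entirely hypothetical and is not ForwardCom's actual format: 32-bit words, bit 31 set on code words and clear on constant words, a 30-bit opcode field in bits 1..30, and bit 0 of the code word carrying bit 31 of a full 32-bit immediate. For simplicity, every instruction here has exactly one constant word.

```python
from typing import List, Tuple

CODE_FLAG = 1 << 31           # high bit set: this word starts an instruction
LOW31_MASK = CODE_FLAG - 1    # payload bits available in a constant word
OPCODE_MASK = (1 << 30) - 1   # hypothetical 30-bit opcode field, bits 1..30


def encode(opcode: int, imm32: int) -> List[int]:
    """Encode one instruction as [code word, constant word]."""
    if not (0 <= opcode <= OPCODE_MASK and 0 <= imm32 < (1 << 32)):
        raise ValueError("field out of range")
    # Bit 0 of the code word holds the immediate's bit 31, so the
    # constant word can keep its own high bit clear.
    code_word = CODE_FLAG | (opcode << 1) | (imm32 >> 31)
    constant_word = imm32 & LOW31_MASK
    return [code_word, constant_word]


def is_instruction_start(words: List[int], index: int) -> bool:
    """Decidable from the word alone, with no surrounding context."""
    return bool(words[index] & CODE_FLAG)


def decode(words: List[int], index: int) -> Tuple[int, int]:
    """Decode (opcode, imm32) at `index`; refuse non-boundaries."""
    if not is_instruction_start(words, index):
        raise ValueError("offset is not an instruction boundary")
    code_word, constant_word = words[index], words[index + 1]
    opcode = (code_word >> 1) & OPCODE_MASK
    imm32 = ((code_word & 1) << 31) | constant_word
    return opcode, imm32


stream = encode(5, 0xFFFFFFFF) + encode(7, 0x12345678)
boundaries = [i for i in range(len(stream)) if is_instruction_start(stream, i)]
```

With this layout, a jump into a constant word can be rejected by looking at one bit, and a disassembler can find every instruction start in `stream` without knowing where decoding began.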