
Author Topic: [SOLVED] Executing x86_64 code from dynamic array of byte  (Read 7823 times)

Joanna

  • Hero Member
  • *****
  • Posts: 1419
Re: Executing x86_64 code from dynamic array of byte
« Reply #15 on: August 26, 2024, 11:25:02 pm »
I don’t know much about this type of programming, but I noticed that the example code is for Windows. Can this also be done on other platforms?

440bx

  • Hero Member
  • *****
  • Posts: 6118
Re: Executing x86_64 code from dynamic array of byte
« Reply #16 on: August 27, 2024, 01:19:19 am »
Quote
<snip> ... that effect can be achieved in C using a union, similar to a Pascal variant record.  Or, you could always just declare a pointer of one type and point it to the memory of another type to get the same effect as absolute.
True, but it's not nearly as convenient.



Quote
I don’t know much about this type of programming, but I noticed that the example code is for Windows. Can this also be done on other platforms?
It must be possible; otherwise there would be no way to create an executable file.  All executable files begin their "life" as data.

ETA:

corrected grammatical error.
« Last Edit: August 27, 2024, 03:26:14 am by 440bx »
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Joanna

  • Hero Member
  • *****
  • Posts: 1419
Re: Executing x86_64 code from dynamic array of byte
« Reply #17 on: August 27, 2024, 02:36:31 am »
True...
I wonder if it’s easier with Linux...

cdbc

  • Hero Member
  • *****
  • Posts: 2664
    • http://www.cdbc.dk
Re: Executing x86_64 code from dynamic array of byte
« Reply #18 on: August 27, 2024, 03:16:46 am »
Hi Joanna
Quote
I wonder if it’s easier with Linux..
No need to wonder, it _is_ quite an advanced topic, this business of setting up code / thunks on the fly, to be executed... I remember, back in my Delphi-days, that Hallvard Vassbotn used to fiddle with that stuff and write about it in "The Delphi Magazine" ...Good Times  8)
At that level, there ain't much 'easy' floating around...
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE6/QT6 -> FPC Release -> Lazarus Release &  FPC Main -> Lazarus Main

Joanna

  • Hero Member
  • *****
  • Posts: 1419
Re: Executing x86_64 code from dynamic array of byte
« Reply #19 on: August 27, 2024, 02:22:07 pm »
I’ve never heard of thunks before  %)

cdbc

  • Hero Member
  • *****
  • Posts: 2664
    • http://www.cdbc.dk
Re: Executing x86_64 code from dynamic array of byte
« Reply #20 on: August 27, 2024, 02:28:12 pm »
Hi Joanna
No wonder, I think the only one around here, who can explain 'Thunks' in detail would be @PascalDragon a.k.a. Sven/Sara  ;)
Regards Benny

wizzwizz4

  • New Member
  • *
  • Posts: 20
Re: Executing x86_64 code from dynamic array of byte
« Reply #21 on: August 27, 2024, 09:17:27 pm »
Quote
Look at Rtti.AllocateMemory and Rtti.ProtectMemory. These two functions are used inside the Rtti unit to allocate a block of memory where first code is copied to (in case of the Rtti unit a thunk for interface methods), then adjusted (to match the dynamically generated interface instance) and then the protection is adjusted so that it's executable instead of writable.

Am I right in thinking that this dedicates an entire page to each thunk? If so: isn't that really wasteful of memory? I guess it's okay for a reflection API (which won't really be called in real code, only in developer tools), and I can't think of a thread-safe alternative that handles W^X correctly.

Quote
I wonder if it’s easier with Linux..

Yes and no: the POSIX mmap and mprotect are easier than the Windows equivalents (mostly because they're documented more clearly), but Linux doesn't have an equivalent of FlushInstructionCache. (See "If FlushInstructionCache doesn’t do anything, why do you have to call it, revisited" by Raymond Chen for more information.) On x86-64, I believe we don't need to flush the instruction cache, but I'm still trying to work that out.

If you're using Free Pascal, it's no easier or harder: I've found PascalDragon's reference (lines 1315-1356 of /packages/rtl-objpas/src/inc/rtti.pp – no docs, sadly) and it looks cross-platform, so you could just use that.



Thunks aren't really that hard, conceptually. Suppose you need to register an event handler. We can do that by providing two pointers: one to a procedure, and another to some data. When we call the event handler, we pass the data as an argument to the procedure. The data pointer can point to an arbitrary object, so this interface is powerful enough to implement arbitrary behaviour in our event handler.

Why do we need two pointers, though? A procedure's code can contain data, too! A thunk is where we include the data in the procedure, so we only need to pass around one pointer. (Specialisation is another angle on the same general idea: describing that general idea is left as an exercise for the reader.)

In some cases, we will only know the value at runtime; but we can still make a thunk! Just use a placeholder value at compile-time, and keep track of where in the machine code that value is stored. Then, when we know the value (at runtime), we can make a copy of the code, and replace the placeholder with the correct value. When this new (dynamically-created) thunk is run, it will use that value. (Some operating systems forbid this, of course, but most only require us to jump through a few hoops. This is only impossible on pure "Harvard-architecture" machines, which basically don't exist these days and arguably never did.)

The details of any particular implementation are a bit tricky, but that's only because the idea of a "thunk" is vague. There are lots of ways a thunk could be implemented, and a few reasons it might be implemented. (The Wikipedia article isn't tremendously useful.)
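
To make the placeholder idea above a bit more concrete, here is a minimal sketch of just the copy-and-patch step. It does not execute anything; it only builds the bytes. The template (`mov eax, imm32; ret`), the `$AA` placeholder bytes, and the names `TEMPLATE`, `PLACEHOLDER_OFFSET` and `MakeThunkBytes` are my own assumptions for illustration, not anything taken from the Rtti unit.

```pascal
program PatchTemplate;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TThunkBytes = array[0..5] of Byte;

const
  // x86 machine code for "mov eax, imm32; ret". The four $AA bytes are a
  // placeholder for the 32-bit immediate, which is only known at runtime.
  TEMPLATE: TThunkBytes = ($B8, $AA, $AA, $AA, $AA, $C3);
  // Hypothetical bookkeeping: where the immediate starts inside TEMPLATE.
  PLACEHOLDER_OFFSET = 1;

// Copies the template and overwrites the placeholder with Value,
// stored little-endian as x86 expects.
function MakeThunkBytes(Value: LongWord): TThunkBytes;
var
  I: Integer;
begin
  Result := TEMPLATE;
  for I := 0 to 3 do
    Result[PLACEHOLDER_OFFSET + I] := (Value shr (8 * I)) and $FF;
end;

// Formats the bytes as space-separated hex for display.
function ToHex(const Bytes: array of Byte): String;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Bytes) to High(Bytes) do
    Result := Result + IntToHex(Bytes[I], 2) + ' ';
  Result := Trim(Result);
end;

begin
  WriteLn('template: ', ToHex(TEMPLATE));
  WriteLn('patched:  ', ToHex(MakeThunkBytes($12345678)));
end.
```

This should print `B8 AA AA AA AA C3` for the template and `B8 78 56 34 12 C3` for the patched copy. To actually run the result, the bytes would still have to be copied into memory that is then made executable.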

Joanna

  • Hero Member
  • *****
  • Posts: 1419
Re: Executing x86_64 code from dynamic array of byte
« Reply #22 on: August 28, 2024, 08:18:04 am »
Quote
Suppose you need to register an event handler. We can do that by providing two pointers: one to a procedure, and another to some data. When we call the event handler, we pass the data as an argument to the procedure. 
I have no idea about the inner workings of event handlers and I’m surprised that two pointers are involved. From my perspective, I just set the OnChange property of an object to the address of the change procedure and it magically goes there every time the object has a change event. I don’t know what is involved in creating what is needed to do that...

BrunoK

  • Hero Member
  • *****
  • Posts: 766
  • Retired programmer
Re: Executing x86_64 code from dynamic array of byte
« Reply #23 on: August 28, 2024, 10:41:09 am »
One interesting thing about un-protecting code, writing a patch at the start of a procedure, and then re-protecting the code segment is that it allows patching a limited number of instructions.

Probably, the debugger patches the code with INT 3 wherever there are breakpoints.

I have used it to patch TObject.NewInstance with a jmp in a modified version of heaptrc, which replaced TObject.NewInstance with code useful for reporting the ClassName of leaked instance(s).
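
As a rough illustration of the kind of patch described above, this sketch only computes the five bytes of an x86 `jmp rel32`. It does not touch any code segment or change any memory protection. The addresses are made up, and `EncodeJmpRel32` is a hypothetical helper name, not part of heaptrc or the RTL.

```pascal
program JmpPatchBytes;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TJmpPatch = array[0..4] of Byte;

// Encodes "jmp rel32" (opcode $E9) as it would be written at address Source
// so that execution continues at Target. The displacement is relative to the
// address of the instruction that follows the 5-byte jmp.
function EncodeJmpRel32(Source, Target: Int64): TJmpPatch;
var
  Rel: Int64;
  U: LongWord;
  I: Integer;
begin
  Rel := Target - (Source + 5);
  if (Rel < Low(LongInt)) or (Rel > High(LongInt)) then
    raise ERangeError.Create('target is out of reach for a rel32 jump');
  U := LongWord(Rel and $FFFFFFFF);  // two's complement bit pattern
  Result[0] := $E9;
  for I := 0 to 3 do
    Result[I + 1] := (U shr (8 * I)) and $FF;  // little-endian
end;

// Formats the bytes as space-separated hex for display.
function ToHex(const Bytes: array of Byte): String;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Bytes) to High(Bytes) do
    Result := Result + IntToHex(Bytes[I], 2) + ' ';
  Result := Trim(Result);
end;

begin
  // Hypothetical addresses, chosen only to show both directions.
  WriteLn('forward:  ', ToHex(EncodeJmpRel32($1000, $2000)));
  WriteLn('backward: ', ToHex(EncodeJmpRel32($2000, $1000)));
end.
```

With these addresses the forward jump comes out as `E9 FB 0F 00 00` and the backward one as `E9 FB EF FF FF`. Five bytes is also why only a "limited number of instructions" at the start of the procedure gets overwritten.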

wizzwizz4

  • New Member
  • *
  • Posts: 20
Re: Executing x86_64 code from dynamic array of byte
« Reply #24 on: August 28, 2024, 04:29:41 pm »
Quote
I have no idea about the inner workings of event handlers and I’m surprised that two pointers are involved.

You're talking about TFieldNotifyEvent, I assume? The role of the "second pointer" is fulfilled by the Sender argument. That's a bit different to what I described, because of the API design: it's expected that you can subclass the TField, or look things up externally, if you need extra (field-specific) data in your event handler. But still, a pointer (well, class instance: similar thing) is passed to the event handler at runtime, so you can re-use one event-handling procedure for multiple fields. You rarely see an event handler without that, except in a language with callable objects / closures instead of function pointers – but that's usually implemented with a dispatch table (e.g. CPython, some implementations of C++), two pointers but secretly (e.g. Rust, ECL), or something exotic (e.g. Tcl – which has no standard way of doing this, but usually it's something analogous to a thunk).
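
In Free Pascal the two pointers are also visible directly: a `procedure ... of object` value can be viewed as a System.TMethod record holding a Code pointer and a Data pointer (the instance). The sketch below is only an illustration under that assumption; `THandler` and `TNotify` are made-up names, and swapping the Data pointer by hand is done here purely to show the mechanism, not as something to do in real code.

```pascal
program MethodPointerDemo;

{$mode objfpc}{$H+}

type
  // An event type like the ones used for OnChange-style properties.
  TNotify = procedure(Sender: TObject) of object;

  // Hypothetical handler class; Name is the per-instance data.
  THandler = class
    Name: String;
    Calls: Integer;
    procedure Changed(Sender: TObject);
  end;

procedure THandler.Changed(Sender: TObject);
begin
  Inc(Calls);
  WriteLn(Name, ' was notified');
end;

var
  A, B: THandler;
  Event: TNotify;
begin
  A := THandler.Create;
  A.Name := 'A';
  B := THandler.Create;
  B.Name := 'B';
  try
    // Assigning a method stores two pointers: the code and the instance.
    Event := @A.Changed;
    WriteLn('Data points at A: ', TMethod(Event).Data = Pointer(A));
    Event(nil);

    // Same code pointer, different data pointer: the handler now runs on B.
    TMethod(Event).Data := Pointer(B);
    Event(nil);
  finally
    A.Free;
    B.Free;
  end;
end.
```

So setting an OnChange property already passes two pointers around; the compiler just hides the second one as Self.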
« Last Edit: August 28, 2024, 04:32:16 pm by wizzwizz4 »

wizzwizz4

  • New Member
  • *
  • Posts: 20
Re: Executing x86_64 code from dynamic array of byte
« Reply #25 on: August 28, 2024, 08:13:05 pm »
Consider the following snippet. Without going too much into Tcl implementation details (which I assume are off-topic), [namespace current]::reggie is like a pointer, and namespace code is like constructing a thunk.

Code: TCL
set reggie ^.*$
entry .regbox -textvariable [namespace current]::reggie
entry .valbox
pack .regbox .valbox -fill x
.valbox configure -validate key -validatecommand [namespace code {regexp -- $reggie %P}]

You might want a thunk in Free Pascal if you wanted to port something like this and, for some reason, you had to do everything in a handler. In most cases, you should make a special-purpose control like TMaskEdit: the "only use event handlers" restriction should never come up in real life. (I don't see a direct analogue in Lazarus TEdit for Tcl/Tk entry's all validation mode, which correctly handles pasting etc, so you'd probably want a custom control anyway.) If that restriction does come up, you should at least be able to use a control that provides a two-pointer event handler interface. I wouldn't want to work on a codebase that tried to brute-force such things by generating machine code at runtime.
« Last Edit: August 28, 2024, 08:31:36 pm by wizzwizz4 »

Khrys

  • Sr. Member
  • ****
  • Posts: 400
Re: Executing x86_64 code from dynamic array of byte
« Reply #26 on: August 29, 2024, 09:27:20 am »
EDIT: This is absolutely not production-ready code. Do not use it as-is. Read the rest of this thread first!

Ok, so since I just now needed to implement a new reference counted type, I tried applying the knowledge I gained to this problem; be aware that this still requires testing, though.
Features:
  • Atomically & automatically reference counted by the compiler - exact same semantics as dynamic arrays
  • Management structure and code are separate, so memory protections only apply to the code and no surrounding data
  • Fulfillment of W^X - the code buffer is never writable and executable at the same time
  • Runs on both Windows and Linux (x86_64)
Here's a quick sample program for demonstration.
Basically, the Size property is the equivalent of SetLength for dynamic arrays. When writing to the buffer, use the Data property. When running, use the Code property. These two internally change the protection of the backing memory, which is stored alongside the reference count.

Code: Pascal
program ThunkTest;

{$macro on}
{$mode objfpc}{$H+}

uses
  SysUtils, Thunk;

var
  GLOBAL_THUNK: TThunk;

function CreateThunk(const Source: String): TThunk;
begin
  Result := Default(TThunk);
  Result.Size := Source.Length;
  Move(Source[1], Result.Data^, Result.Size);
  GLOBAL_THUNK := Result;
end;

function Main(): Integer;
type
  TFunc_Int_Int = function(X: Integer): Integer; {$if defined(CPU64)} MS_ABI_Default; {$else} register; {$endif}
const
  CODE_N_TIMES_N_PLUS_1 = {$if defined(CPU64)}
                            #$89#$C8 +     // mov eax, ecx
                          {$else}
                            #$66#$90 +     // xchg ax, ax (2-byte nop)
                          {$endif}
                          #$8D#$50#$01 +   // lea edx, [eax + 1]
                          #$0F#$AF#$C2 +   // imul eax, edx
                          #$C3;            // ret
var
  Thunk: TThunk;
begin
  // Create thunk function f(n) = n * (n + 1)
  Thunk := CreateThunk(CODE_N_TIMES_N_PLUS_1);
  Result := TFunc_Int_Int(Thunk.Code)(4);
  WriteLn('-> f(4) = ', Result);
  // Patch formula to g(n) = n * (n - 1)
  PByte(Thunk.Data)[4] := $FF;
  Result := TFunc_Int_Int(Thunk.Code)(4);
  WriteLn('-> g(4) = ', Result);
  // Like with dynamic arrays, changing the size/length ensures refcount = 1
  Thunk.Size := 1;
  PByte(Thunk.Data)[0] := $C3; // "ret"; empty function
  TProcedure(Thunk.Code)();
  // The patched thunk still exists because CreateThunk added another reference
  Result := TFunc_Int_Int(GLOBAL_THUNK.Code)(5);
  WriteLn('-> g(5) = ', Result);
end;

begin
  Main();
end.

With debug logging enabled (TTHUNK_DEBUG), this produces the following output:

Code: C
AllocateInstance() = 0x0000739AFFE6F100
IncreaseReferenceCount(0x0000739AFFE6F100) = 1
CodeRealloc(Buffer = 0x0000000000000000, Size = 9, PrevSize = 0) = 0x0000739AFFE56000
IncreaseReferenceCount(0x0000739AFFE6F100) = 2
CodeProtect(Buffer = 0x0000739AFFE56000, Size = 9, Executable = -1)
-> f(4) = 20
CodeProtect(Buffer = 0x0000739AFFE56000, Size = 9, Executable = 0)
CodeProtect(Buffer = 0x0000739AFFE56000, Size = 9, Executable = -1)
-> g(4) = 12
AllocateInstance() = 0x0000739AFFE6F200
IncreaseReferenceCount(0x0000739AFFE6F200) = 1
CodeRealloc(Buffer = 0x0000000000000000, Size = 1, PrevSize = 0) = 0x0000739AFFE55000
CodeProtect(Buffer = 0x0000739AFFE56000, Size = 9, Executable = 0)
DecreaseReferenceCount(0x0000739AFFE6F100) = 1
CodeProtect(Buffer = 0x0000739AFFE55000, Size = 1, Executable = -1)
CodeProtect(Buffer = 0x0000739AFFE56000, Size = 9, Executable = -1)
-> g(5) = 20
DecreaseReferenceCount(0x0000739AFFE6F200) = 0
CodeRealloc(Buffer = 0x0000739AFFE55000, Size = 0, PrevSize = 1) = 0x0000000000000000
FreeInstance(0x0000739AFFE6F200)
DecreaseReferenceCount(0x0000739AFFE6F100) = 0
CodeRealloc(Buffer = 0x0000739AFFE56000, Size = 0, PrevSize = 9) = 0x0000000000000000
FreeInstance(0x0000739AFFE6F100)
« Last Edit: September 02, 2024, 08:58:18 am by Khrys »

wizzwizz4

  • New Member
  • *
  • Posts: 20
Re: Executing x86_64 code from dynamic array of byte
« Reply #27 on: August 30, 2024, 11:22:47 am »
Thanks, this looks really good! (I haven't tested it.) I didn't know about the AddRef operator. Some nitpicks:

Quote
Runs on both Windows and Linux (x86_64)
It doesn't, because InterlockedIncrement (etc) is part of the Windows API. Since this code is only running on x86-64, which has a fairly strong memory model (Intel® 64 and IA-32 Software Developer's Manual, volume 3 chapter 9) we should be able to use the assembly instructions MOV for reading, LOCK INC for incrementing, and LOCK DEC for decrementing. Don't quote me on this, though: I'm not an x86-64 assembly programmer, and certainly should not be trusted as an authority for mission-critical code.

If you do this, you must check the flags in TThunk.DecreaseReferenceCount (perhaps indirectly; see "Why did InterlockedIncrement/Decrement only return the sign of the result?" for the NT 3.51 / 95 design). Do not read the memory a second time to check it. Why? Assume there are two references, and consider the following sequence:
  • Thread A decreases reference count (2 to 1)
  • Thread B decreases reference count (1 to 0)
  • Thread B detects ReferenceCount = 0, frees the resources.
  • Thread A detects ReferenceCount = 0: double free!
This is not a bug in your code, only one that might be introduced by a careless programmer. If you do a straightforward InterlockedIncrement shim (see "How does InterlockedIncrement work internally?"), I believe your code should work fine.
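
To spell out the safe pattern from the sequence above, here is a small sketch that decides whether to free from the value returned by the atomic decrement itself, never from a second read. `TShared`, `Acquire` and `Release` are hypothetical names and not part of TThunk. The demo is single-threaded, so it only shows the shape of the code and does not prove anything about the race.

```pascal
program RefCountRelease;

{$mode objfpc}{$H+}

type
  PShared = ^TShared;
  TShared = record
    RefCount: LongInt;
    Payload: Integer;
  end;

var
  FreedCount: Integer = 0; // demo bookkeeping only

// Allocates a block that starts with one owner.
function NewShared(Payload: Integer): PShared;
begin
  New(Result);
  Result^.RefCount := 1;
  Result^.Payload := Payload;
end;

procedure Acquire(P: PShared);
begin
  InterlockedIncrement(P^.RefCount);
end;

// Returns True if this call released the last reference and freed P.
function Release(P: PShared): Boolean;
begin
  // Decide from the value returned by the atomic operation itself.
  // Re-reading P^.RefCount afterwards would race with other threads,
  // and two threads could both see 0 and both free the block.
  Result := InterlockedDecrement(P^.RefCount) = 0;
  if Result then
  begin
    Dispose(P);
    Inc(FreedCount);
  end;
end;

var
  P: PShared;
begin
  P := NewShared(42);
  Acquire(P); // two owners now
  WriteLn('first release frees:  ', Release(P));
  WriteLn('second release frees: ', Release(P));
  WriteLn('freed ', FreedCount, ' time(s)');
end.
```

Only the second Release reports TRUE, and the block is freed exactly once. InterlockedIncrement and InterlockedDecrement here are the ones from FPC's System unit.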

Your debug code performs non-atomic reads, which is UB in some high-level languages and might be a bug on processors with a weaker memory model (but I think is fine on x86-64). For efficiency on some other architectures, you should use InterlockedIncrementAcquire and InterlockedDecrementRelease (does not affect x86-64).

Your code assumes that mmap and mprotect are implemented as numbered syscalls. This is a valid assumption for Linux, but most UNIX-like OSs don't have a stable syscall ABI, if they even implement these as discrete syscalls. Seeing as you're only using POSIX functions, there's no need to restrict yourself to Linux! Adding a libc fallback mode for {$ELSEIF UNIX} might be nice. Alternatively, you should make it explicitly fail to compile on unsupported systems.

Quote
Fulfillment of W^X - the code buffer is never writable and executable at the same time
Your implementation is unsafe: we can read .Code, access .Data, then try to use .Code and get fireworks. Rather than storing Executable: Boolean, could you store a separate "executable reference count", and then forbid .Data accesses while there's an outstanding .Code access? (You could implement that by having .Code return an advanced record wrapper around a PThunkInstance (or, perhaps better, a ^SizeInt Pointer pair), though I'm not sure whether this would end up expensive. It's free in Rust, but rustc does heavier optimisations than fpc.)
« Last Edit: August 30, 2024, 02:17:45 pm by wizzwizz4 »

cdbc

  • Hero Member
  • *****
  • Posts: 2664
    • http://www.cdbc.dk
Re: Executing x86_64 code from dynamic array of byte
« Reply #28 on: August 30, 2024, 01:48:19 pm »
Hi
This:
Code: Pascal
function InterLockedIncrement (var Target: longint) : longint; assembler; nostackframe;
asm
{$ifdef win64}
        movq    %rcx,%rax
{$else win64}
        movq    %rdi,%rax
{$endif win64}
        movl    $1,%edx
        xchgq   %rdx,%rax
        lock
        xaddl   %eax, (%rdx)
        incl    %eax
end;

is NOT just WinAPI; it's defined in the System unit and implemented in per-CPU include files.
I use it all the time on Linux.
Regards Benny

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12695
  • FPC developer.
Re: Executing x86_64 code from dynamic array of byte
« Reply #29 on: August 30, 2024, 02:00:59 pm »
Quote
It doesn't, because InterlockedIncrement (etc) is part of the Windows API. Since this code is only running on x86-64,

FPC implements this in system https://www.freepascal.org/daily/doc/rtl/system/interlockedincrement.html 

Quote
should be able to use the assembly instructions MOV for reading, LOCK INC for incrementing, and LOCK DEC for decrementing. Don't quote me on this, though: I'm not an x86-64 assembly programmer, and certainly should not be trusted as an authority for mission-critical code.

That is what the fallback of said RTL function does if there are no OS locking functions that are already NUMA compliant. In some cases additional barriers might be necessary (e.g. when executing, since that hits the instruction cache rather than the data cache).

mmap and mprotect have fp* equivalents in BaseUnix, and they use syscalls or libc depending on the platform. I suggest using those as much as possible.
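
For reference, a minimal sketch of that route using FpMmap, FpMprotect and FpMunmap from BaseUnix. It assumes Linux on x86-64 (the machine code follows the System V calling convention, first argument in edi) and a 4096-byte page size; `THUNK_CODE`, `PAGE_SIZE` and `TIntFunc` are names made up for this example. The page is writable first and then switched to executable, so it is never both at once.

```pascal
program MmapThunk;

{$mode objfpc}{$H+}

uses
  BaseUnix;

type
  // cdecl on x86-64 Linux follows the System V ABI.
  TIntFunc = function(N: LongInt): LongInt; cdecl;

const
  // f(n) = n * (n + 1)
  THUNK_CODE: array[0..8] of Byte = (
    $89, $F8,        // mov  eax, edi
    $8D, $50, $01,   // lea  edx, [rax + 1]
    $0F, $AF, $C2,   // imul eax, edx
    $C3);            // ret
  PAGE_SIZE = 4096;  // assumption; query the real page size in serious code

var
  Mem: Pointer;
begin
  // 1. Get a private anonymous page that is readable and writable.
  Mem := FpMmap(nil, PAGE_SIZE, PROT_READ or PROT_WRITE,
                MAP_PRIVATE or MAP_ANONYMOUS, -1, 0);
  if Mem = Pointer(-1) then
  begin
    WriteLn('mmap failed, errno = ', fpgeterrno);
    Halt(1);
  end;

  // 2. Copy the machine code in while the page is still writable.
  Move(THUNK_CODE, Mem^, SizeOf(THUNK_CODE));

  // 3. Drop write access and add execute access (W^X).
  if FpMprotect(Mem, PAGE_SIZE, PROT_READ or PROT_EXEC) <> 0 then
  begin
    WriteLn('mprotect failed, errno = ', fpgeterrno);
    FpMunmap(Mem, PAGE_SIZE);
    Halt(1);
  end;

  // 4. Call it through a typed function pointer.
  WriteLn('f(4) = ', TIntFunc(Mem)(4));

  FpMunmap(Mem, PAGE_SIZE);
end.
```

On a system that allows this, it should print f(4) = 20. Hardened kernels or sandboxes may refuse the mprotect call, which is why both return values are checked.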

 
