Written by Yehuda Smirnov, Hoshea Yarden, Hai Vaknin and Noam Pomerantz

TL;DR

Most process injection techniques follow a familiar pattern:
allocate → write → execute.

In this research, we ask: what if we skip allocation and writing entirely?

By focusing on execution-only primitives, we found distinct approaches to inject code without allocating / writing memory:

Inject a DLL using only CreateRemoteThread.
Call arbitrary WinAPI functions with parameters using SetThreadContext.
Utilize NtCreateThread to remotely allocate, write and execute shellcode.
Expand the technique to APC functions such as QueueUserAPC.

Find the RedirectThread Github repo here

Introduction

Modern Endpoint Detection & Response (EDR) stacks typically watch for three signs of classic process‑injection:

Allocation of fresh memory ( VirtualAlloc[Ex] )
Modification of that memory ( WriteProcessMemory , VirtualProtect)
Execution to it ( CreateRemoteThread, APCs, etc.).

Our goal in this research was to test the lower bound: can we trigger only the execution primitive, skipping both allocation and write primitives, yet still land malicious code in the target?

The idea started with a rumor that huge memory pages are always allocated and mapped as read–write–execute (RWX), even not specifically requesting these.

Post Research Note: when we tested this, it wasn’t the case.

This raised a natural question:
Would security tools treat this differently from a typical RWX allocation?

If so, skipping the allocation part of the chain could bypass part of the usual detection logic. And if that’s possible, what about skipping the data injection step too? If the bytes we need already live inside the target, we might not actually have to write anything new at all.

This sparked an idea:

If we already have valid, addressable data inside the target process, could we take the classic DLL injection – the LoadLibrary method, and simply point it at existing data within the target process, then let Windows do the rest?

Why LoadLibrary?

LoadLibraryA/W automatically appends “.dll” to whatever string pointer it receives and then resolves the usual DLL search order (stackoverflow.com). With that behavior we can find an existing in‑process string (e.g. “0“) and drop a file named (for example) 0.dll somewhere earlier in the search path.

This would launch a remote thread whose start routine is LoadLibraryA, with its argument set to a character pointer (e.g., “0“), finally causing the DLL to be loaded into the target process.

But how are we going to find this existing in-process string?

Shared Memory 101

When Windows maps file‑backed sections (e.g., ntdll.dll) into multiple user‑mode processes, the sections are backed by the same physical memory; each process merely receives its own virtual‑address view.

Copy-On-Write protects shared memory regions until a process attempts to modify them. At that point, the kernel creates a private copy of the page to ensure the process does not alter memory that’s shared across all processes in the system.

Since Windows Vista, ASLR randomizes base addresses every startup. System DLLs are loaded at a consistent base address across all processes to optimize relocation performance.
As a result, an offset like ntdll + 0x4 should point to the same bytes in all processes.

Then we can locate a pointer to a character locally and reuse the address when calling LoadLibraryA in the target process.

Many convenient string literals live in ntdll.dll which is loaded by all processes. We chose to use “0” for our case.

Pointer‑Only LoadLibrary Injection

The Proof‑of‑Concept below demonstrates the minimal steps required to turn that idea into a working process‑injection primitive.

Locate an in‑process string
- Search ntdll.dll for a static ASCII string such as “0” (0x30 0x00). Because the DLL is mapped at the same offset in both the target process and our own process, the virtual address should be valid across both processes.

ntdll.dll being mapped into the same addresses between processes

Prepare the payload DLL
- Build 0.dll with implant of choice.
- Drop the DLL into a directory that resolves via search order.

We are not necessarily limited to classic search‑order hijacking. Perhaps it is possible to abuse DefineDosDeviceW or NT symbolic links to load the payload from arbitrary locations such as SMB shares or WebDAV mounts.

Create the remote thread

Note that the address of LoadLibraryA is obtained from our own process, similar to the method in step 1.

HANDLE hThread = CreateRemoteThread(
    hProcess,
    NULL,               // default security
    0,                  // default stack size
    (LPTHREAD_START_ROUTINE)GetProcAddress(GetModuleHandleA("kernel32"), "LoadLibraryA"),
    (LPVOID)0x7FFE0300, // pointer to shared NUL byte inside ntdll
    0,                  // run immediately
    NULL);

Execution
- LoadLibraryA appends “.dll”, resolves the path, and loads 0.dll into the target process:

We got an unexpected but exciting result:

By skipping memory allocation and writing, a well-known injection technique has bypassed detection in two industry leading EDRs.
- We also tested the regular DLL injection technique (which involves writing memory), and it was detected by both EDRs.
This small tweak revealed that many techniques rely on the same early-stage behavior and how easily that behavior can be avoided.

This led us to rethink the injection chain:

Most techniques follow the same pattern: inject data → trigger execution.
Across many of them, the execution trigger is the common piece.

So we asked:

Do we really need to inject data at all?
What if we just focused on the execution trigger?

Instead of swapping out each part of the chain, we doubled down on the trigger and pushed to see how far we could go with just that.

Investigating Remote-Thread Creation Noise with ETW

Before diving deeper, we took a quick detour prompted by a simple question:

Is this just a detection oversight?
Why don’t security products flag any remote thread creation as malicious? Isn’t it rare enough to catch with a simple whitelist?

Turns out — not at all. Remote thread creation is surprisingly common and is used by:

Application resource monitors
Performance profilers
Instrumentation and tracing tools
Debuggers, compatibility shims, accessibility helpers
Enterprise agents and endpoint frameworks

To measure the baseline “noise”, we wrote a small ETW-based tracer to capture and correlate thread creation events where the creator PID ≠ target PID. ETW gave us all the data we needed.

Looking through the data captured of less than a minute on a clean Windows install, we already find some thread creation across different processes:

At this point, we already felt it was worth digging deeper into the execution trigger, while avoiding allocation and modification.

If EDRs don’t flag malicious activity based on the execution step alone, even when using the infamous CreateRemoteThread, then the question becomes: how powerful is this primitive, and where can we take it next?

CreateRemoteThread + SetThreadContext Injection

Our first thought on leveraging CreateRemoteThread further was to avoid dropping a DLL to disk and attempt to implement a memory only injection with it. We aimed to achieve a similar power to the classic remote Allocate → Modify → Execute chain.

The obvious question is: why not just call those functions directly inside the remote process?

If the new thread we create inside the target process is designed to allocate and modify memory within its own process, then we can simply take the existing techniques for remote memory allocation/modification and adapt them into ‘self’-allocation/self-modification methods.

Even better, we can use any local functionality already available in the target process for our primitives — instead of being restricted to just WinAPI or syscalls that support remote operations.

After all, there’s nothing suspicious about a process reading from or writing to its own memory.

We would basically change this:

With this:

CreateRemoteThread Limitations

The first problem we encountered with CreateRemoteThread was its parameter limitation. The API accepts only a pointer to the start routine (the code to execute) and a single pointer to its first argument.

While one parameter is enough for simple APIs like LoadLibrary, calling more useful functions becomes difficult. For example, VirtualAlloc and WriteProcessMemory both require four parameters.

Another issue is that any additional parameters (beyond the first) are not guaranteed to be NULL—they’re often just undefined garbage data. For instance, if you call MessageBox, you will get a message box in the remote process, but the title (2nd param) and text (3rd param) will likely be junk.

For our targets, we needed full control over all four parameters. Luckily, there’s a straightforward method to achieve this, which also acts as an execution trigger: Hijacking Thread Context.

By using APIs such as SetThreadContext, we can configure a target thread in the remote process to run VirtualAlloc and WriteProcessMemory with up to four controlled parameters.

With this limitation bypassed, we’re now free to call runtime functions like malloc, memset, memcpy, and also system-native functions such as RtlFillMemory, HeapAlloc and more.

Before we continue to exactly how, let’s do a quick refresher on the x64 calling convention and how it relates to the CONTEXT struct which we use extensively.

Windows x64 Calling Convention 101

The x64 calling convention passes the first four arguments to a function via registers (RCX, RDX, R8, R9), and the remaining arguments are passed on the stack.

Practically speaking: If we can control the CONTEXT of a thread, we can set the RIP (instruction pointer) to any function we want to execute, and set up the registers to pass the appropriate arguments to that function.

For our second injection proof-of-concept, we aimed to adapt the classic Thread Hijacking technique: SuspendThread → SetThreadContext → ResumeThread — but apply it to a newly created thread.

Our intended flow looked like this:
CreateRemoteThread (Suspended) → SetThreadContext → ResumeThread

This approach turned out to be challenging. However, we successfully worked around the issues while sticking to our rule: never use standard remote memory allocation or modification.

Troubleshooting & Research

This section is about the problems we’ve encountered and how we overcame those, it gets a bit technical and perhaps heavy, but we wanted to share our methodology.

Show Sections

Issue – Empty Initial Stack

Setup

First we’ve tried creating a new thread in a suspended state. Then we overwrite its CONTEXT struct so that when we resume the thread, it jumps right into VirtualAlloc.

RIP  = &pVirtualAlloc
RCX = param1
RDX = param2
R8  = param3
R9  = param4

Result

We get a memory access violation crash on the return from VirtualAlloc.

Root cause

The thread starts with an empty stack.
VirtualAlloc executes fine and returns to its caller, but the caller’s return address is zero. That is because the return address is normally pushed to the stack by the thread‑startup stub, which did not happen.
Dereferencing the null return address triggers the memory access violation.

Why CreateRemoteThread doesn’t crash in 1‑arg cases

When you let Windows launch a thread normally, execution begins in the
native startup stub (RtlUserThreadStart → BaseThreadInitThunk).
The stub sets up a proper stack frame, calls your target routine, then
finishes with ExitThread.
However, when you forcibly set RIP to a function such as VirtualAlloc, the thread starts inside the function and seems to skip the native startup stub. This means the thread can’t go to ExitThread once it’s done.

Trial 2 – Stealing Valid Stack from Another Thread

Idea

Reuse the stack of a “sacrificial” thread that is sleeping forever using SleepEx (NtDelayExecution). The hope is that the call stack of the sacrificial sleep thread would be clean and safe to return through.
If the thread had been paused mid-call, say during stack setup (after the prologue but before the epilogue), we might need to realign the stack manually, e.g., adjusting RSP += 16 to compensate.

Plan

Spawn sacrificial thread
- CreateRemoteThread( hProc, NULL, 0, Sleep, (LPVOID)INFINITE, 0, NULL);
Wait until the thread is inside NtDelayExecution.
Grab its stack pointer
- CONTEXT ctx;
- GetThreadContext(hSleep, &ctx);
- uintptr_t sleepRsp = ctx.Rsp;
Create a second thread (suspended) and overwrite its CONTEXT:

RIP = &VirtualAlloc
RSP = sleepRsp          // borrow stack from the sleeping thread
RCX, RDX, R8, R9 = args // first four params

Resume the second thread.

Outcome

Crash on return to the native startup stub.
Reason: the borrowed stack is valid, but the new thread’s TEB is empty.
- BaseThreadInitThunk expects initialized fields (SEH list, TLS, etc.).
- Dereferencing TEB → NtTib.ExceptionList (still NULL) triggers a memory access violation.

Lesson

A good stack alone isn’t enough. Each thread gets a fresh Thread Environment Block (TEB), and the kernel points the GS register to it. Returning into code that assumes those TEB fields are initialized will fail.

Our next idea was to hijack the sleeping thread itself instead of repurposing its stack.

Trial 3 – Hijacking the Sacrificial Sleep Thread

Goal
Hijack execution of a thread with a fully‑initialized TEB and stack after it enters SleepEx.

Intended call‑stack

RtlUserThreadStart
 ↳ BaseThreadInitThunk
     ↳ Sleep → SleepEx → NtDelayExecution
                           ↳ (context switch) → VirtualAlloc

Steps

Launch a sleeper thread for 1 second:
- HANDLE hT = CreateRemoteThread( hProc, NULL, 0, Sleep, (LPVOID)1000, 0, NULL);
Wait until RIP sits inside NtDelayExecution.
Hijack context while the thread is still sleeping:
- RIP = &VirtualAlloc
- RCX,RDX,R8,R9 = desired parameters
- RSP left unchanged
Let the timer expire; the thread should resume directly in VirtualAlloc.

Outcome

RIP changed as expected — execution reached VirtualAlloc.
All other registers were junk; VirtualAlloc failed and the process crashed on return.

Take‑away: during a timed sleep, only RIP is reliably written. The kernel’s wait‑resume path seems to overwrite or ignore the rest of the supplied context.

Where we pivot next

A cleaner target is a thread that is running yet idle – not blocked in a kernel wait but is post initialization.
That insight leads to Trial 4, where we start the thread in a minimal endless‑loop gadget that is trivial to hijack.

Trial 4 – Sleep alternative, the Loop Gadget and CFG

Idea
Start the thread in a busy‑wait loop (JMP -2) that touches no registers, then hijack its context.

Why the EB FE gadget?

It’s a two‑byte “jmp ‑2” instruction — an infinite loop.
Easy to locate in any executable module; we simply scan our own ntdll for it and reuse the same address in the target process.
Leaves every register except RIP untouched, so no collateral damage.

Expected call‑stack

RtlUserThreadStart
 ↳ BaseThreadInitThunk
     ↳ [loop gadget: EB FE]
         ↳ (context hijack) → VirtualAlloc

Procedure

Create a remote thread whose start address is the loop gadget.
Wait until it spins inside the loop.
SetThreadContext
- RIP = &VirtualAlloc
- RCX,RDX,R8,R9 = params
- RSP unchanged
Thread executes our function (no suspend/resume needed; the thread is already running).

Result

Immediate Control‑Flow Guard (CFG) violation → process crash.
Root cause: BaseThreadInitThunk dispatches to the start address via an indirect jump that is CFG‑instrumented.
Our loop gadget wasn’t in the module’s valid call‑target bitmap, so CFG killed the thread before the loop even began.

Trial 5 – Double Hijack: Loop Gadget Pivot

Idea

Start with a normal thread startup into Sleep (to satisfy CFG), just like before. But this time, we hijack twice:

First into a loop gadget, to gain stable execution control.
Then into the target function (e.g., VirtualAlloc) with full parameter control.

Previously, we tried replacing the sleep function with a gadget directly — but that skipped the thread’s natural startup logic. We still wanted the native thread initialization to happen first, then slide quietly into a dormant state until hijack time.

The breakthrough was combining the two ideas:

Let the thread start normally with Sleep.
When it hits NtDelayExecution, hijack it to a loop gadget.
From the loop, hijack again, this time fully setting the context (RIP + RCX, RDX, R8, R9).

Call Flow

RtlUserThreadStart
    ↳ BaseThreadInitThunk
        ↳ ...
            ↳ NtDelayExecution (from Sleep)
                ↳ Context Hijack → Loop Gadget
                    ↳ Context Hijack → VirtualAlloc (with params)

Execution Flow

CreateRemoteThread with start address set to Sleep.
Wait briefly for the thread to initialize and enter sleep.
Hijack #1: Set RIP to the loop gadget. Leave everything else alone. No suspend/resume needed.
Hijack #2: Set RIP to VirtualAlloc, with proper values in RCX, RDX, R8, and R9. Again, no suspend/resume required.

Outcome

Success. We now have a clean PoC that calls any local function in a remote thread with up to 4 parameters, without using any standard remote allocation or modification techniques.

This approach even plays nicely with APC delivery, since the thread returns normally, without being sacrificed. The only downside? Two waits are needed per invocation to let the thread reach its sleep, and then the loop gadget.

Maybe we can optimize it a little if we could find an alternative to the initialization + sleep stage.

Trial 6 – Fixing the Stack using ROP

Idea

Finally, we thought of a simpler solution: use a ROP gadget in the target process to sets up the stack with two return addresses – one for thread exit (RtlExitThread) and one for our target function. This single gadget, would suffice for our stack building purposes.

Back in Trial 1, we ran into a problem: after the target function finished executing, the thread tried to return to a null address — a result of starting with an empty stack. We wanted more elegant ways to deal with this.

Waiting for thread initialization, sleeping, and doing two context hijacks (as in Trial 5) worked — but it felt a bit heavy-handed.

ROP Gadget

By using a ROP gadget already present in the target process, we could build a valid stack with just a single step.

The ROP gadget looks like this:

push reg1
push reg2
ret

Where the two above registers are unique from one another, and are any of the following: RAX/RBX/RBP/RDI/RSI/R10-15.

With this simple gadget we can provide any API with following Context struct:

RIP → Gadget address
RCX, RDX, R8, R9 → 4 arguments
Gadget register 1 → RtlExitThread
Gadget register 2 → Pointer to Function (ex, VirtualAlloc)

Discovering ROP gadgets within Notepad.exe

This way, we have managed to skip the initialization + sleep and create a clean stack which exits via RtlExitThread once the target function finishes its work.

Call Flow

ROP Gadget  
  ↳ VirtualAlloc  
      ↳ RtlExitThread

We scan the remote process‘ memory for the desired ROP gadget.
We create a new thread via CreateRemoteThread, in a suspended state. The thread’s start address is irrelevant.
We prepare a new context:
- Set RIP to the ROP gadget.
- Set the gadget’s input registers to:
  - VirtualAlloc (function call target)
  - RtlExitThread (post-call return target).
- Fill RCX, RDX, R8, and R9 with the four VirtualAlloc parameters.
We resume the thread and let it run.

Outcome

The thread successfully executes VirtualAlloc and then cleanly exits by returning into RtlExitThread—no crashes, no cleanup needed.

Proof of Concept

The proof of concept performs the following:

Searches the target process memory for a ROP gadget (push reg1; push reg2; ret)
Uses this gadget to call VirtualAlloc, RtlFillMemory, and execute shellcode
- Creates a suspended thread using CreateRemoteThread
- Sets the thread’s CONTEXT with SetThreadContext, assigning:
  - RIP to the ROP gadget
  - RCX – R9 to function arguments
  - Stack values for ExitThread and the target function to reg1 and reg2
  - Calls ResumeThread to run the thread.
  - Each thread pushes the address of ExitThread, then the target function, and performs a ret to jump into the target function.

NtCreateThread Context Injection

While CreateRemoteThread is widely used for thread injection, it only accepts:

The start address of the thread
A pointer to the first argument.

This led us to wonder why it’s so limited, and what would happen if we investigated the underlying API (NtCreateThread) which provides more control over thread creation.

Post Research Note: This is a good moment to say – we thought that CreateRemoteThread calls NtCreateThread while CreateRemoteThreadEx calls NtCreateThreadEx.

That is not true. In modern Windows, both APIs call NtCreateThreadEx which does not take a context struct as an argument. With NtCreateThreadEx, the kernel performs stack allocation in the remote process.

With NtCreateThread, we can pass a CONTEXT structure, giving us control over the thread’s registers, stack, and the return address.

This sparked an idea: what if we avoid using SetContext APIs by supplying the CONTEXT structure to NtCreateThread?

NtCreateThread prototype from ntinternals

Troubleshooting & Research

At first, what seemed like it would be a walk in the park, turned into a few nights worth of debugging. This section gets a bit debugging heavy as well, but again, we wanted to share the methodology and our conclusions. Feel free to skip ahead.

Show Sections

NTSTATUS 0xC0000022 – Access Denied

Our first PoC was simple (or so we thought):

Obtain PROCESS_ALL_ACCESS handle to target process.
Get the Sleep() function address from our process.
Allocate a clean stack using VirtualAllocEx to the remote process (we can likely can avoid this step by being a bit creative and combining previous ideas)
Prepare CONTEXT Struct & TEB struct in our process.
Call NtCreateThread

However, we received an ‘Access Denied‘ NTSTATUS from the Syscall, even though we:

Used a high integrity injecting process with an admin token and SeDebugPrivilege enabled.
Injected into medium integrity processes – notepad.exe, calc.exe, msedge.exe and a dummy process we’ve created.

Looking in the internet, we didn’t find much info on the reason this happens, but ChatGPT gladly told us that the NtCreateThread syscall is a legacy syscall that will not work when used cross-process after Windows Vista:

**ChatGPT o3 with search enabled** (openrce reference) (securityxploded reference)

That obviously seemed like a hallucination, especially when the sources GPT pointed to didn’t contain that wording explicitly, so we went ahead and did some kernel debugging.

Tracing NtCreateThread in the Kernel

The general flow of the NtCreateThread syscall is:

Checks nt!NtCreateThread Performs

Before the kernel gets to nt!PspCreateThread, the nt!NtCreateThread wrapper does some basic housekeeping which include:

Verifies all pointers passed in (handles, CLIENT_ID, CONTEXT, INITIAL_TEB) are aligned and readable/writable.
Scrubs the CONTEXT record, stripping out privileged flags and illegal register values.
Sanity-checks the stack info in INITIAL_TEB, making sure a real stack was allocated and no “previous stack” fields are set.
Zeros the output handle & keeps the stack 16-byte aligned so any early error unwinds cleanly.

These are mostly safety checks, they’re not relevant to our troubleshooting but we thought to mention them.

Checks nt!PspCreateThread Performs

PspCreateThread is where the actual thread object (ETHREAD) is created and linked into the target process. It performs its own set of checks, some of which are more security-focused.

The PspCreateThead is lacking on documentation, so we’ve tried our best reversing it. The prototype is as follows (when called from NtCreateThread):

NTSTATUS __fastcall PspCreateThread(
        PHANDLE RemoteThreadHandle,
        ACCESS_MASK DesiredAccess,
        POBJECT_ATTRIBUTES ObjectAttributes,
        HANDLE RemoteProcessHandle,
        _EPROCESS *TargetProcObject,
        __int64 __zero,
        PCLIENT_ID ClientID,
        PCONTEXT InitialContext,
        PINITIAL_TEB InitialTeb,
        ULONG CreationFlagsRaw,
        ULONG_PTR IsSecureProcess,
        ULONG_PTR Spare,
        PVOID InternalFlags)

First we have some initializations and caching of parameters:

  callerThread = KeGetCurrentThread();
  RemoteProcessHandle2 = RemoteProcessHandle;
  RemoteThreadHandle2 = RemoteThreadHandle;
  __zero_3 = __zero;
  ClientID2 = ClientID;
  InitialTeb2 = InitialTeb;
  DesiredAccess2 = DesiredAccess;
  callerProcess = (_EPROCESS *)callerThread->ApcState.Process;
  InternalFlags2 = InternalFlags;
  ObjectAttributes2 = ObjectAttributes;
  CallerProcess2 = callerProcess;

Skip a bit forward, first, the kernel resolves the target process:

  if ( RemoteProcessHandle )
  {
    LOBYTE(RemoteProcessHandle) = PreviousMode;
    result = ObpReferenceObjectByHandleWithTag(
               RemoteProcessHandle2,
               2LL,
               PsProcessType,
               RemoteProcessHandle,
               0x72437350,
               &TargetProcessObject,
               0LL,
               0LL);
    TargetProcessObject2 = (_EPROCESS *)TargetProcessObject;
    goto LABEL_5;
  }

Afterwards, the kernel checks that resolving the target process was successful and jumps to perform a check using PspIsProcessReadyForRemoteThread:

 CallerProcess3 = CallerProcess2;
  if ( TargetProcessObject2 != CallerProcess2 )
  {
    if ( !PspIsProcessReadyForRemoteThread(TargetProcessObject2) )
      return 0xC0000001; // STATUS_UNSUCCESSFUL
    CallerProcess3 = CallerProcess2;
  }

Now things start to get a bit weird. There is a check to see if the target process is protected by virtualization and lives in VTL-1 (Secure Kernel), but since NtCreateThread always calls PspCreateThread with IsSecureProcess = 0, the check is skipped and is evaluated to True as part of the other conditions we can see just below.

We reach the key check that caused the Access Denied error:

  if ( !__zero_3 // Always true from NtCreateThread
    && !IsSecureProcess // Always true 
    && ((TargetProcessObject2->MitigationFlags & 1) != 0
     || (CallerProcess3->MitigationFlags & 1) != 0
     || (TargetProcessObject2->MitigationFlags2 & 0x4000) != 0
     || (CallerProcess3->MitigationFlags2 & 0x4000) != 0) )
  {
    return 0xC0000022;
  }

The first two conditions evaluate to true when called from NtCreateThread. The next two checks (MitigationFlags & 1) determine whether Control Flow Guard (CFG) is enabled in either our process or the target. The final two checks (MitigationFlags2 & 0x4000) verify whether the Import Address Filter (IAF) mitigation is enabled on either processes.

Just a bit later there’s a few other interesting checks which could return 0x22:

  if ( (TargetProcessObject2->Flags3 & 1) != 0 // Minimal Process bit
        && !TargetProcessObject2->PicoContext // Not a Pico Process
        && InitialContext )
  {
    return 0xC0000022;
  }

The first condition checks whether the target process is a Minimal Process. and that is not a Pico Process and an InitialContext was provided (which it is from NtCreateThread.

To sum up, if any of the following applies to the target process, NtCreateThread injection will fail:

Control Flow Guard is enabled
Import Address Filtering (EAF) is enabled
The process is a Minimal Process and not a Pico process.

Therefore, the NtCreateThread syscall will likely work on most 3rd party programs, which are typically not compiled with CFG or IAF, and are typcially not Minimal processes. In contrast, most Microsoft binaries are typcially compiled with Control Flow Guard (CFG).

Proof of Concept

The proof of concept performs the following:

Searches the target process memory for a ROP gadget (push reg1; push reg2; ret)
Uses this gadget to call VirtualAlloc, RtlFillMemory, and execute shellcode
- Allocate an empty stack in the target process for the new thread using VirtualAllocEx (we can likely avoid this step)
- Initialize CONTEXT struct, assigning:
  - RIP to the ROP gadget
  - RCX – R9 to function arguments
  - Stack values for ExitThread and the target function to reg1 and reg2
  - Each thread pushes the address of ExitThread, then the target function, and performs a ret to jump into the target function.
- Creates a thread using NtCreateThread with the given CONTEXT which executes immediately.

The new thread does not need a context hijack while belonging to the target process as its context is already pre-supplied.

RedirectThread Tool

Find the RedirectThread Github repo here

To demonstrate the techniques from this blog in a practical and repeatable way, we built a command-line tool that implements the injection methods discussed. RedirectThread supports context-only process injections which include:

Pointer-only DLL injection
Various APC injections
CreateRemoteThread + SetThreadContext injection
NtCreateThread

Usage:

Usage: C:\RedirectThread.exe [options]

Required Options:
  --pid <pid>                 Target process ID to inject into
  --inject-dll                Perform DLL injection (hardcoded to "0.dll")
  --inject-shellcode <file>   Perform shellcode injection from file
  --inject-shellcode-bytes <hex>  Perform shellcode injection from hex string (e.g. 9090c3)

Delivery Method Options:
  --method <method>           Specify code execution method
     CreateRemoteThread       Default, creates a remote thread
     NtCreateThread           Uses NtCreateThread (less traceable)
     QueueUserAPC             Uses QueueUserAPC (requires --tid)
     QueueUserAPC2            Uses QueueUserAPC2 (requires --tid)
     NtQueueApcThread         Uses NtQueueApcThread (requires --tid)
     NtQueueApcThreadEx       Uses NtQueueApcThreadEx (requires --tid)
     NtQueueApcThreadEx2      Uses NtQueueApcThreadEx2 (requires --tid)

Context Method Options:
  --context-method <method>   Specify context manipulation method
     rop-gadget               Default, uses ROP gadget technique
     two-step                 Uses a two-step thread hijacking approach

Additional Options:
  --tid <tid>                 Target thread ID (required for APC methods)
  --alloc-size <size>         Memory allocation size in bytes (default: 4096)
  --alloc-perm <hex>          Memory protection flags in hex (default: 0x40)
  --alloc-address <hex>       Specify base address for allocation (hex, optional)
  --use-suspend               Use thread suspension for increased reliability
  --verbose                   Enable verbose output
  --enter-debug               Pause execution at key points for debugger attachment

Example:
  C:\RedirectThread.exe --pid 1234 --inject-dll mydll.dll
  C:\RedirectThread.exe --pid 1234 --inject-shellcode payload.bin --verbose
  C:\RedirectThread.exe --pid 1234 --inject-shellcode payload.bin --method NtCreateThread
  C:\RedirectThread.exe --pid 1234 --inject-shellcode-bytes 9090c3 --method QueueUserAPC --tid 5678
  C:\RedirectThread.exe --pid 1234 --inject-shellcode-bytes $bytes --context-method two-step --method NtQueueUserApcThreadEx2 --tid 5678

Injection Detection Logic Theory

This section is about detection logic models commonly used in EDRs, and how execution-only techniques challenge their core assumptions.

1.  “Thread‑Context Hijacking” by itself isn’t evil

A lone execution trigger. For example: suspending a thread, tweaking its CONTEXT, and resuming it doesn’t touch memory and therefore looks benign.
Modern EDRs typically label process injection when two or more of the following activities are tied together in the same victim process:

Activity	Typical API evidence	What it really means
Remote allocation	`VirtualAllocEx`, `MapViewOfFile3`, etc.	Create fresh pages inside another process
Remote modification	`WriteProcessMemory`, `VirtualProtectEx`	Change bytes or permissions in those pages
Remote execution trigger	`CreateRemoteThread`, APC queueing, context hijack, UI‑callback registration	Force the target to jump to attacker‑controlled code

EDRs often model these as an ordered chain (1 → 2 → 3) or as any {2 of 3} combination.

This check is often done as a correlation check with the start address of a new Thread or APC, or on the event of an any API call involved in the activity.

2.  Why Execution‑Only Attacks Are Hard to See

A defended endpoint would have to:

Spot Trigger #1 (remote thread creation) – easy.
Spot Trigger #2 (context hijack) – also easy.
Elevate every subsequent local memory touch inside that thread to “remote”.
- Requires tracing all syscalls and user‑mode helpers (memset, etc.).
- Needs data‑flow (“taint”) analysis to prove the write originated from the foreign context.
- Impractical at scale; borderline impossible in real time.

Result: attackers who skip Activities 1 & 2 and rely only on creative triggers could slip past detection.

3.  Why Attackers Swap Triggers More Than Allocators

The pool of “allocate/write remotely” APIs is tiny and heavily monitored.
The execution surface is huge: threads, APCs, timers, UI callbacks, UMS, DCOM marshaling, and so on. It’s a fertile ground for variants.
As EDRs hardened around memory ops, merely choosing an exotic trigger no longer guarantees stealth, yet it remains the easier and cheaper option.

4.  Weakness in the 2 of 3 Philosophy

The model silently assumes “remote ≠ local”. Once an attacker coerces the victim into performing its own writes, that assumption fails:

Local write occurs (memset inside victim).
Write is actually driven by an external context hijack = logically remote.
Current telemetry can’t prove that causal link, which results in no alert.

Unless the defender can join trigger telemetry with deep intra‑thread taint tracking, the injection hides in plain sight.

5. Takeaways for Defenders

For what we’ve discussed in the blog:

Monitoring for rapid set of thread creation events in a short amount of time might prove useful.
Monitoring thread creation followed by a large number of SetThreadContext (and similar APIs).

But the above detections are API specific, to get to the bottom of things we’d need to correlate who requested the trigger, not just what API was called.

Research Notes

This is a section for things we did not pursue, unanswered questions and other details.

Are we limited to 4 arguments?

No. The Windows x64 calling convention gives us four registers (RCX, RDX, R8, R9) for arguments, but if needed, additional arguments could be placed on the stack manually if we find better ROP chains.

Can we avoid creating hundreds of threads?

Probably yeah. We explored reusing the same thread by re-hijacking its context multiple times.
Instead of pushing RtlExitThread, we could push a loop gadget (as discussed in the troubleshooting section here and recycle the thread.
This could reduce thread creation in cases where you want to perform multiple operations over time, especially for shellcode delivery.

We could potentially hijack existing waiting-to-run threads in the right state such as DeferredReady/Ready waiting to run and restore their state later. This could also be done using the gadget by pushing the previous address instead of an exit. This would achieve a similar result to the earlier two-step approach with APCs, but with the efficiency of the later approach.

Can we avoid `ReadProcessMemory` to find ROP gadgets?

Yes! While our approach was to search directly in memory, this certainly isn’t the only viable method. There are several potential alternatives which can work:

Option 1: Load the same DLLs used by the target process and scan them locally to find ROP gadgets within specific modules.
Option 2: Parse the PE file from disk, locate the gadget, then calculate its memory address using base address + offset.

Each method has trade-offs in precision, stealth, and complexity. We chose in-memory scanning mainly for its simplicity.

Can we use other WinAPI / NT functions to control the Context?

The answer is yes! While not all of APIs are suitable for cross process injection, some just might offer a bit of ‘stealthier’ way to achieve the same results.

The following is a non-definitive list of APIs we’ve found that accept a CONTEXT struct. However, not all of them were tested, as the list quickly grew too large to completely explore.

WinAPI:
- SetThreadContext
- SetThreadInformation
- Wow64SetThreadContext
- RtlWow64SetThreadContext
- RtlRestoreContext
- RtlRegisterFeatureConfigurationChangeNotification
- RtlRegisterWait
- SetUmsThreadInformation
- EtwRegister
- UpdateProcThreadAttribute
- UmsThreadUserContext

NT Functions:
- NtSetContextThread
- NtCreateThread
- NtSetInformationThread
  - ThreadWow64Context
  - ThreadCreateStateChange
  - ThreadApplyStateChange
- NtContinue
- NtContinueEx
- ALLOCATE_VIRTUAL_MEMORY_EX_CALLBACK
- EtwEventRegister

Optimizations?

Absolutely! plenty of optimizations could be made:

We could improve gadget discovery logic by:

Tolerate non-disruptive intermediate instructions (e.g., a gadget like push rax; push rbx; mov r10, r11; ret, where mov r10, r11 has no side effects).
Look for other instructions. Eg. push rax; call rbx;.
Use chains instead of a monolith. Use one gadget to prepare rsp, a second to save registers, a third to dispatch. ROP.

We could improve two-step approach’ efficiency by:

Using more than a single thread at a time, automating thread’s choice too. The writing/copying operation can be done for each byte in parallel.
Shortening the sleep timing, looking for better ways to detect when thread is ready for the next step.
Using the Nt* versions for the writing operation. As they support 3 parameters (perfect fit RtlFillMemory) we can skip the two-step hijack for the majority of the operation.

Unexplored Ideas

Idea	Conclusions
Catch the thread before it returns, suspend it, and reuse it	Not explored. Seemed impractical due to tight timing. We didn’t investigate delaying the thread’s exit in a reliable way.
Find a native function that accepts a callback (e.g., to redirect to `ExitThread`)	We brainstormed a trampoline approach to redirect the call chain to an exit, but couldn’t find a good native candidate in NTDLL/Kernel32/KernelBase.
Push an existing exit function already present on another thread’s stack	Not explored. In theory, wait until a thread calls `ExitThread`, then reuse the stack pointer. But it’s a narrow and volatile time window, and likely dangerous.
Find an exit function address already present in memory	We scanned for common function pointers (like `ExitThread`, `RtlExitUserThread`) but didn’t find usable results. This sub-idea remains unresolved.
Catch exceptions during bad returns via debug APIs	Not explored. Likely violates the “no extra process hooks” principle of this research. Could be interesting in separate work.
Inject a custom SEH (structured exception handler)	Theoretically interesting. We wondered whether an execution-only setup could install a handler without modifying memory. Left untested.

When we kicked off this project we could find no public write-ups that fully skipped both VirtualAlloc[Ex]/WriteProcessMemory and still delivered x64 code. Only after nearing the end of our research did we stumble on the GhostWriting research [1] [2] [3], which is incredible.

GhostWriting proposed the idea to steal an existing thread and manipulate its CONTEXT to run shellcode with no remote allocations. Our research started from the same intuition but diverged in three ways:

Pointer-only LoadLibrary injection. We show that feeding LoadLibraryA a pointer to an in-process ASCII literal (e.g., "0") plus a disk file named 0.dll achieves DLL loading with zero remote writes. GhostWriting focuses solely on raw shellcode.
CreateRemoteThread ➜ SetThreadContext ROP pivot. Instead of hijacking a live GUI thread, we spin up new remote threads, repair their empty stacks with a push reg; push reg; ret gadget, and chain any WinAPI with up to four params. This workflow doesn’t appear in prior papers (as far as we know and happy to be corrected).
Full x64 NtCreateThread PoC. Our demo supplies a crafted CONTEXT and INITIAL_TEB directly to NtCreateThread, along with a kernel-debug exploration of the CFG/IAF gates.

So while “execution-only” injection has history, we believe the variants, mitigations, and insights presented here push the idea into a new territory.

Conclusion

If you made it this far, you’ve earned a debugger coffee breakpoint, hope it was worth the stack space.

Thanks for reading, sticking through the details, and joining us down this rabbit hole. If you’ve got follow-up ideas or ways to push this further, we’d love to hear them, someone out there always finds the next clever step.

New Process Injection Class: The CONTEXT-Only Attack Surface

TL;DR

Introduction

Why LoadLibrary?

Shared Memory 101

Pointer‑Only LoadLibrary Injection

Investigating Remote-Thread Creation Noise with ETW

CreateRemoteThread + SetThreadContext Injection

CreateRemoteThread Limitations

Windows x64 Calling Convention 101

Troubleshooting & Research

Issue – Empty Initial Stack

Trial 2 – Stealing Valid Stack from Another Thread

Trial 3 – Hijacking the Sacrificial Sleep Thread

Trial 4 – Sleep alternative, the Loop Gadget and CFG

Trial 5 – Double Hijack: Loop Gadget Pivot

Trial 6 – Fixing the Stack using ROP

Proof of Concept

NtCreateThread Context Injection

Troubleshooting & Research

NTSTATUS 0xC0000022 – Access Denied

Tracing NtCreateThread in the Kernel

Checks nt!NtCreateThread Performs

Checks nt!PspCreateThread Performs

Proof of Concept

RedirectThread Tool

Injection Detection Logic Theory

1. “Thread‑Context Hijacking” by itself isn’t evil

2. Why Execution‑Only Attacks Are Hard to See

3. Why Attackers Swap Triggers More Than Allocators

4. Weakness in the 2 of 3 Philosophy

5. Takeaways for Defenders

Research Notes

Are we limited to 4 arguments?

Can we avoid creating hundreds of threads?

Can we avoid ReadProcessMemory to find ROP gadgets?

Can we use other WinAPI / NT functions to control the Context?

Optimizations?

Unexplored Ideas

A Note on “GhostWriting” and Related Work

Conclusion

Read more

Evading EntraID Conditional Access Policies via Cross-Tenant ROPC

Breaking Down SharePoint Walls: Hunting for Sensitive Files

ShadowHound: A SharpHound Alternative Using Native PowerShell

Peeking Behind the Curtain: Finding Defender’s Exclusions

WMI Research and Lateral Movement

Hook, Line and Sinker: Phishing Windows Hello for Business

1.  “Thread‑Context Hijacking” by itself isn’t evil

2.  Why Execution‑Only Attacks Are Hard to See

3.  Why Attackers Swap Triggers More Than Allocators

4.  Weakness in the 2 of 3 Philosophy

Can we avoid `ReadProcessMemory` to find ROP gadgets?