Posted by nancygold
@ 2024-05-30 02:11:00


Mood: amused
Music: Clan Of Xymox - Weak in My Knees
Entry tags: computing

Diversion

Want to teach myself some Ghidra.

As a test specimen, I decided on https://www.mobygames.com/game/482/stronghold/

There are several reasons.
1. The game is 16 bit and I never did any segmented 16 bit x86,
   while Ghidra has limited support for decompiling real 16bit x86.
2. The game likely has some interesting coding, since it has a 3d view
   and processes multiple entities at once.
3. Stronghold is an early example of Dungeon Keeper / Majesty style games.
   The only game predating it is Populous.
   So there is some historical value in it.
4. The lead programmer is a trans girl, Cathryn Mataga!


The disassembling of any code base is a methodical process, which begins
with the following steps:
0. Researching what versions are available and picking either the one
   with more debug information or the one which will be easier to emulate
   and decompile.
   It also makes sense to research the other products the programmers behind
   the disassembled code made at around the same time, since chances are high
   that they reused some custom code.
1. Informing oneself about the CPU and OS, which run the code being decompiled.
2. Deciding on the disassembler/decompiler to use.
3. Determining the compiler used.
4. Procuring a function signature database to quickly get over
   all the standard runtime code.
5. Enumerating the OS syscalls.
6. Locating the actual main() function, as opposed to the start().



Stronghold comes in 2 flavors
________________________________________________

In addition to the English version, there are German, Spanish, French and
Japanese versions.

The German, Spanish and French versions don't appear to be of any importance
for reversing the game. The Japanese one was released for two somewhat different
x86 based computers running MS-DOS: FM Towns and PC-98.

In Japanese it is called "ストロングホールド ~皇帝の要塞~".
But it can be found on abandonware sites by googling:
* "Stronghold: Koutei no Yousai"
* "Stronghold: Emperor no Yousai"

The Japanese executable differs drastically from the English version.
In fact it even has a different name (ST.EXE).
The Japanese version comes with a 3d animated intro STRONG.FLI,
has General MIDI music and unique error strings like:
  "Error %d %d.  (Give Sato & Miwa Both numbers.)"

These are missing in the STRONG.EXE.

Still the Japanese version has no additional symbol information,
so I decided to go with the English version, since it is the most common
and easier to emulate. The Japanese version can be used for reference
after enough progress has been made on the English version.

There were also two different DOS CD releases,
* German version by Softgold
* 1995 English version
  Part of "Unlimited Fantasy", a 5-CD set from Slash Corp.

The English CD version STRONG.EXE is identical to the floppy one.
I couldn't obtain the German CD one.

All versions, including the Japanese, appear to be using the same
version of either Borland C++ or Turbo C++ compiler.

There are several other games by Stormfront Studios:
* Tony La Russa Baseball II could be reusing some code, like animation timing
  yet the engine is not 3d and the custom file extensions are different.
  The compiler used is the same Borland C++ and the MAIN.EXE has similar size.
  It has no debug data present.
* Eagle Eye Mysteries isn't promising since it had a different programmer
  and the engine EAKIDS.EXE made by Electronic Arts is completely different,
  and it was compiled with MSC.
  The EAKIDS.EXE has debug info though, that means the original source code
  can be fully recovered.
  The game also uses LBM graphics, just like STRONG.EXE, but that was just
  a popular format back in the day.
  The engine's driver, EEM.EXE, is compiled using Borland C++,
  but it doesn't appear to share any code with STRONG.EXE 
* Neverwinter Nights, Treasures of the Savage Frontier, etc...
  These have the same programmer but use a Gold Box engine.
* Rebel Space had only a screenshot preserved:
  https://www.vintagecomputing.com/wp-content/images/prodigy/screenshots/prodigy_rebel_space_large.png
  Apparently that was an early EVE Online like game.

None of these specimens are of any help to our endeavor.
We can use them to produce a function signature database by detecting
the C runtime library part, but instead we will try to obtain
the original compiler used to build the STRONG.EXE and use its library.


16bit x86 DOS computing environment
________________________________________________

Proper decompilation requires near expert knowledge of the system's quirks.
And 16bit x86 DOS is probably the most baroque and the hardest to grasp
system around. Still, studying this sour subject will help in understanding
modern x86.

To better grasp what we are getting into, let's start by looking at
the official Stronghold system requirements:
https://www.mobygames.com/game/482/stronghold/specs/

Minimum OS Class Required: PC/MS-DOS 5.0
Minimum CPU Class Required: Intel i386
Minimum RAM Required: 2 MB
Video Modes Supported: VGA
Notes: 	
  505,000 Bytes of free base RAM
  1,000,000 Bytes of free EMS/XMS

The i386 CPU requirement could mean one of two things:
1. The speed of a 16MHz i286 is not enough to run the program,
   which is still 16bit.
2. The program is actually 32bit, in which case it will be using
   a 32-bit runtime (a DOS extender) built on VCPI or DPMI, which is basically
   a mini 32bit OS.

Since there are no references to VCPI or DPMI in the executable or manual,
we can conclude the program is 16bit and needs i386 for more speed.
Reviews at the time of release indeed said it needs at least 33MHz.
In 1993, 32bit had just begun gaining momentum and most programs were 16bit.
But wait a second!!! What are these "base RAM" and "EMS/XMS"?
To answer this question we have to understand how the original i8086 CPU worked.

The main thing about 16bit x86 is its use of two registers to access memory.
One register is called a segment register, and the other is an offset inside
that segment. The x86 had 4 dedicated registers holding the segment addresses.
These are CS, DS, SS, ES. By convention CS points to code, DS to data,
SS to stack and ES to whatever array is being worked on at the moment.

We can imagine a segment being a pointer to a C-struct, which is 16byte aligned
and has arrays as members (i.e. functions in the CS segment or global vars in DS).
To get the linear address of the referenced memory, we do
    linear = ((seg<<4) + off) & 0xFFFFF

That `& 0xFFFFF` (mod 0x100000) is here for a not so good reason.

The very first x86 CPU (i8086) had an architectural memory limit of 2**20 bytes
(1MB), due to the address bus connecting the CPU to the memory controller having
20 pins. When anything above 1MB got accessed, the 8086, instead of segfaulting
or returning zeroes, just wrapped the address around `mod 0x100000`, since that
required fewer transistors.
As usual the quirk was declared to be a feature, instead of undefined behavior.
So people began depending on it.
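
To make the formula and the wraparound concrete, here is a minimal C sketch.
The function name `linear_8086` is my own invention for illustration and does
not come from any toolchain.

```c
#include <stdint.h>

/* Real-mode 8086 address translation: segment * 16 + offset,
   truncated to the 20 address lines the CPU actually has. */
uint32_t linear_8086(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFu;
}
```

Note that many seg:off pairs alias the same byte (0x1000:0x0010 and
0x1001:0x0000 both give 0x10010), and that 0xFFFF:0x0010 wraps around to
linear address 0.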

In fact, Intel never planned to support more than 1MB, since x86 was made
for the microcontroller market. So x86 becoming a standard for workstations due
to Microsoft's marketing came as a surprise to the engineers, because
workstation use demanded larger and larger amounts of memory. An additional
complication came from the fact that memory above 640KB (0xA0000–0xFFFFF) was
reserved for hardware use, like BIOSes and IO buffer areas for various devices.
That memory was divided into upper memory blocks (UMBs), each 64KB in size.
The 640KB below the UMBs were called "conventional memory" or "base memory".
The entire 384KB area of these UMBs was called the upper memory area (UMA).

In particular the 64KB area at 0xA0000 was reserved for graphics output,
and that gave the classic DOS 320x200 limit to graphics resolution, since
320x200 pixels at one byte per pixel take 64,000 bytes.

Since a large portion of the UMBs was unused, people invaded it for their
own needs, like moving daemon-like services there. These were called
terminate-and-stay-resident programs (TSRs), and they operated by listening
to interrupts, which back in the day were a primitive version of
interprocess communication.

So the need came to extend the address space, and with the i286 machines came
an additional switch, called the A20 gate (on the IBM PC/AT it sat on the
motherboard and was driven through the keyboard controller), which specified
if address line 20 is active. The quirk is that it only controlled line 20, not
any lines above it. So when A20 was disabled, the memory had a "striped" look,
where every odd megabyte mapped onto the megabyte preceding it. I.e.
[0x000000:0x0FFFFF] accessible
[0x100000:0x1FFFFF] maps to [0x000000:0x0FFFFF]
[0x200000:0x2FFFFF] accessible
[0x300000:0x3FFFFF] maps to [0x200000:0x2FFFFF]
etc...
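
The striping falls out of a single masked bit. A minimal sketch in C, where
`apply_a20_gate` is a hypothetical helper name and the gate is modeled as
a plain flag:

```c
#include <stdint.h>

/* Models the A20 gate: when it is disabled, address bit 20 is forced to 0
   while all the other address bits pass through untouched. */
uint32_t apply_a20_gate(uint32_t addr, int a20_enabled)
{
    const uint32_t bit20 = (uint32_t)1 << 20;
    return a20_enabled ? addr : (addr & ~bit20);
}
```

With the gate disabled 0x100000 lands on 0x000000 and 0x300004 lands on
0x200004, while the even megabytes stay where they are.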

Sweet? Unfortunately that is the core of the i286 system running our STRONG.EXE,
which uses 2MB RAM so we have to understand it to get our project anywhere.

Now accessing any memory above 1MB with a 16bit segment model required
Intel to introduce an additional quirk: the Segment Descriptor Tables.
One had to initialize translation tables and feed them to the CPU.
These tables specified the 24-bit addresses for 16bit segment registers,
extending the addressable memory to a whopping 16 MB.
That memory above 1MB was called "Extended Memory" or "High Memory".
Additionally a chunk of 0x10000-16 bytes, called the high memory area (HMA),
was located at 0x100000. It was accessible even without descriptor tables,
by setting the segment register to 0xFFFF and enabling A20.
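
The odd 0x10000-16 size of the HMA comes straight from the segment arithmetic.
A small sketch, assuming A20 is enabled so nothing gets truncated to 20 bits
(the names `linear_a20_on` and `hma_size` are mine):

```c
#include <stdint.h>

#define HMA_SEGMENT      0xFFFFu
#define HMA_FIRST_OFFSET 0x0010u  /* lower offsets still hit the last 16 bytes under 1MB */

/* Linear address with no 20-bit truncation, i.e. with A20 enabled. */
uint32_t linear_a20_on(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* Number of bytes reachable above 1MB through segment 0xFFFF. */
uint32_t hma_size(void)
{
    return linear_a20_on(HMA_SEGMENT, 0xFFFFu)
         - linear_a20_on(HMA_SEGMENT, HMA_FIRST_OFFSET) + 1u;
}
```

So 0xFFFF:0x0010 is the first HMA byte at 0x100000, and 0xFFFF:0xFFFF is
the last one at 0x10FFEF.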

Now, instead of addresses, the segment registers hold the 13bit index of
an 8-byte entry inside the descriptor table, plus a bit indicating
if the userspace or OS-space tables are used (LDT/GDT) and also 2bit privilege.
The i286 descriptor table was composed of the following 8-byte entries:
  typedef struct {
    uint16_t  limit;      // size_in_bytes-1,
                          // i.e. limit of 0 means only the first byte is accessible
    uint8_t   address[3]; // linear 24bit base address, enough to cover 16MB
    uint8_t   type;       // access rights: present, privilege, code/data/system
    uint16_t  reserved;   // must be zero on i286
  } i286_segment_descriptor_t;
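
Here is a hedged sketch of how a protected mode selector breaks into the three
fields described above, and how the 24-bit base is assembled from its three
little-endian bytes. All the names below are hypothetical, chosen just for
this illustration.

```c
#include <stdint.h>

typedef struct {
    uint16_t index;   /* 13-bit index into the descriptor table */
    uint8_t  use_ldt; /* table indicator: 0 = GDT, 1 = LDT */
    uint8_t  rpl;     /* 2-bit requested privilege level, 0..3 */
} selector_fields_t;

/* Splits a 16-bit selector: bits 15..3 index, bit 2 table, bits 1..0 RPL. */
selector_fields_t decode_selector(uint16_t sel)
{
    selector_fields_t f;
    f.index   = (uint16_t)(sel >> 3);
    f.use_ldt = (uint8_t)((sel >> 2) & 1u);
    f.rpl     = (uint8_t)(sel & 3u);
    return f;
}

/* Byte offset of the selected 8-byte descriptor inside its table. */
uint32_t descriptor_offset(uint16_t sel)
{
    return (uint32_t)(sel >> 3) * 8u;
}

/* Assembles the 24-bit linear base from three little-endian bytes. */
uint32_t base24(const uint8_t b[3])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16);
}
```

For example the selector 0x000F means entry 1 of the LDT with privilege 3,
and the base bytes 00 00 10 point at the first byte above 1MB.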

To access the 16MB with these descriptors, developers used a special API, called
eXtended Memory Specification (XMS), which was implemented by a driver or
by a BIOS int 15h service 87h, called "Move Block":
http://vitaly_filatov.tripod.com/ng/asm/asm_026.14.html
Which basically copied a chunk of memory between the area above 1MB and
the area under it.

It did some black magic, including entering the so-called "protected mode",
where the segment registers hold LDT/GDT indices, instead of raw addresses.
Inside the protected mode, BIOS did the copying, and used "SOFT RESET" to
return into the direct memory access mode (called real mode on x86).

That BIOS service call was available only on the pre i386 systems,
while the i386 systems emulated it by means of the HIMEM.SYS driver,
which did the usual 32bit switching:
http://info.wsisiz.edu.pl/~bse26236/batutil/help/HIMEM_S.HTM
Note the
   /INT15=xxxx
      Allocates the amount of extended memory (in kilobytes) to be reserved
      for the Interrupt 15h interface. Some older applications use the
      Interrupt 15h interface to allocate extended memory rather than using
      the XMS (eXtended-Memory Specification) method provided by HIMEM.

The entire process is described here:
https://medium.com/@wolfcod/a-journey-into-himem-sys-de2ece29c0c8
  To switch back to real mode from protected mode on 80286 first, it's necessary
  to store at address 0040:0067 the return address of your code in
  protected mode and from protected mode you can reset the CPU, but before doing
  this it's necessary to write on CMOS memory a code (typically 05 or 0A) to
  signal to the POST bios code to switch back. Without this magic value,
  the POST procedure will continue the execution of standard BIOS code with
  a complete bootstrap of the operating system.

Before the segment descriptor table, the i8086 memory expansion boards supported
the so-called bank switching, following the Expanded Memory Specification (EMS).
EMS implied that a UMB at 0xD0000 or 0xE0000 was broken into four 16KB banks,
each of which could be switched to point into the memory above 1MB.
To support bank switching, a device driver, called expanded memory manager,
or EMMXXXX0, was installed, as part of say HT12MM.SYS for HT12 chipset.
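
As a small illustration of the window layout, here is where each of the four
16KB banks appears in real mode, given the segment of the 64KB page frame.
This is only the address arithmetic, not the driver API, and the helper name
`ems_page_segment` is made up for this sketch.

```c
#include <stdint.h>

#define EMS_PAGE_SIZE      0x4000u  /* 16KB per bank */
#define EMS_PAGES_IN_FRAME 4u       /* four banks fill the 64KB frame */

/* Real-mode segment at which physical page `page` (0..3) of the frame
   becomes visible. Returns 0 when the page number is out of range. */
uint16_t ems_page_segment(uint16_t frame_seg, unsigned page)
{
    if (page >= EMS_PAGES_IN_FRAME)
        return 0;
    return (uint16_t)(frame_seg + page * (EMS_PAGE_SIZE >> 4));
}
```

With the frame at segment 0xD000 the banks show up at 0xD000, 0xD400, 0xD800
and 0xDC00.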



The programs using EMS had to check for this EMMXXXX0 driver. 
So the presence of string "EMMXXXX0" anywhere in the specimen data indicates
it uses bank switching to access anything above 1MB.
In later systems expanded memory was emulated in software through
the XMS driver, which did the bank switching through segment descriptor tables.
The emulators were QRAM or USE!UMBS.SYS, which used the "shadow ram" feature of
i286 motherboards to perform the mapping:
https://retrocmp.de/hardware/rampage-286/use!umbs.txt
https://retrocmp.de/hardware/rampage-286/quarterdeck-qram.pdf

Some purely software drivers were called "LIMulators". On i286 these LIMulators,
like EMM286.EXE, had limitations, like no mirroring/aliasing:
https://www.pcorner.com/list/UTILITY/EMM286.ZIP/EMM286.TXT/
On the 32bit systems the EMM386 emulated EMS through the HIMEM.SYS API,
which had no such issues due to the use of a proper MMU.
And the use of XMS/EMS rules out the use of 32bit code, which could access
the entire 4GB of memory at once.

The machines without 2MB could still run EMS software. The MEMSIM32 LIMulator
swapped the banks to/from HDD, instead of expanded memory.


The memory availability can be checked by the DOS MEM command.

All these technologies require maintaining a list of segments or bank tables.
To properly decompile STRONG.EXE we will have to recover these tables buried
somewhere in the startup sequence of the exe file code, since the MZ exe format
supports only the pre i286 relocations. The C compiler vendors had to introduce
their custom formats to support the memory sizes above 1MB.
Given the officially stated EMS/XMS requirement, we have to deal with all that.

To summarize: typical 1993 i286 DOS systems had about 2MB of memory,
and accessing memory above 1MB is done through the EMS API.

Being so quirky and old, 16bit x86 has rather limited support from the tools
used to statically analyze the code. For example, many contemporary utilities
will fail to open OMF object files and libraries, which were used by x86
tools before Windows NT popularized COFF.


Picking a decompiler
________________________________________________

Decompiling a program requires several stages:
* Recovering the control flow graph of the machine code.
* Recovering functions and data references.
* Recovering data structures.
* Running the decompiler to produce C/C++ code of functions of interest.
* Using the feedback from the decompiler to recover more of the above.
* Refactoring the decompiled code, compiling it and plugging it in
  function by function into a running program, either emulated or recompiled,
  which makes inserting C/C++ code easier.

All of these stages require using a builtin address space manager, coupled with
disassembler and hex editor. So while there are separate decompilers,
like e2c, DCC, Boomerang, it is necessary to have everything united into
a holistic development environment.

In general, decompilation is as old as computing itself:
* https://www.program-transformation.org/Transform/HistoryOfDecompilation1.html
* https://www.program-transformation.org/Transform/HistoryOfDecompilation2.html



But for 16 bit DOS, we have four IDE options, each with its own drawbacks:
* IDA Pro:
  * While very popular, it is also very expensive and requires obtaining
    a pirated copy, which is usually outdated, incomplete and has viruses.
  * Closed source and has limited ways to modify the tool to suit your needs.
    Outside of decompiling the decompiler itself.
    So if it fails to decompile your code, you have no way to fix it.
  * Doesn't support decompiling 16 bit x86, outside of connecting to
    an external decompiler, like DCC or Ghidra.
* Ghidra ( https://ghidra-sre.org/ ):
  * Very popular and well documented.
  * Open source (Apache 2.0 license), so can be modified to do anything you want.
    Being a US government project made to analyze malware and vulnerabilities,
    its ability to decompile DOS games is a fortunate byproduct.
    The code consists of boilerplate rich Java, with all the crazy OOP patterns
    you can imagine used for the simplest of tasks.
  * It has become standard.
    If you plan to work with others on a paid project, you had better know it.
  * Has advanced features, like a machine learning plugin.
    Should be possible to connect it with LLM to get quick insight on the code.
  * Comes with enterprise integration and automation, like headless mode,
    to process large loads of code without a human being to drive it.
  * It supports decompiling 16bit x86!!!
* Reko (  https://github.com/uxmal/reko ):
  * A C# challenger to Ghidra, with a bit cleaner code, without most of the
    Java bullshit.
  * Cleaner IDA-inspired UI and much better Windows support than Ghidra,
    because of C#.
  * Super lightweight (20MB distribution completely with IDE).
  * Astonishingly good code analysis for 16bit DOS executables.
    It was able to locate main() in STRONG.EXE without any help
    But it struggles with further stages of decompilation, such as
    dataflow analysis, so I couldn't see the decompiled code for most functions.
    Yet for 32bit code, like say binkw32.dll, it does a good job.
  * Doesn't allow stack variables and arguments layout editing.
  * Not a full featured IDE, and offers no way to rename data areas,
    supply types, introduce imaginary segments, external data or code.
    But Reko does offer some Python scripting support, which I haven't
    researched in depth. Maybe you can implement some of these features?
    I.e. recovering strings.
  * Supports decompiling 16bit x86 and loading OMF files.
    It even supports some packers.
    Moreover, it claims that:
    "16-bit real mode" is "The first architecture targeted by Reko"
    https://github.com/uxmal/reko/wiki/Supported-binaries
    I.e. the authors made it to analyze DOS executables.
* Radare2+RetDec+Iaito ( https://github.com/radareorg/iaito/ )
  * Another full featured disassembly IDE, which consists of several
    subprojects, each of which consists of separate utilities.
    People love and fork it, so there will always be some active fork like Rizin
  * Written in plain C, with a few Python scripts to glue everything together.
    Think command line utilities, servers and different GUIs to drive them.
  * Apparently the most flexible and professional toolkit, and it expects
    the user to know exactly what they are doing.
    Will take the most time to set up, but can be made to do anything you want.
  * Support for 16bit DOS executables is left as an exercise to the reader,
    but given the flexibility of the framework, you can easily insert
    any other decompiler.

These are listed in order of historical appearance.

I decided on Ghidra, since it is mature, has all the features needed to reverse
the 16bit x86 code, and if some features are missing, I can implement them
with a plugin or an extension. Ghidra also produces the best C code out of them
all. Ghidra is also an opportunity to study the stupid modern Java practices,
since I never did any JVM programming. Ghidra also has the permissive Apache 2.0
license, so whatever you contribute to it will automatically have more value to
society than under Reko's GPL or Radare2's LGPL.


Besides IDA, Ghidra, Reko and Radare2, there are standalone and CLI decompilers.
These include e2c, DCC, Boomerang, RetDec, REC Studio, mips2c, m2c, ExeToC...
Some of them can be used with IDA. Others are useful in their context.
For example, mips2c/m2c is specially geared towards recovering
the exact original C source code, when it is required for historical accuracy. 

Apparently none of the tools support decompiling EMS/XMS code properly,
which will be a big issue.



Deducing the compiler
________________________________________________

Ghidra failed to determine the compiler used to compile strong.exe.

So I loaded the executable into IDA, and then checked Options -> Compiler.

IDA determined it as Borland C++ with a small model (near code, near data).

Both, as we find later, were wrong.
Yet FLIRT signatures for "TCC/TCC++/BCC++ 16 bit DOS" worked and
about 1/8 of the functions got recognized.

Feeding STRONG.EXE to `strings` exposes:
   "Borland C++ - Copyright 1991 Borland Intl"

Apparently IDA also used this string to determine the compiler.
The other ways to detect the compiler are
* Trying function signatures for the runtimes of different compilers against
  the studied code.
* Code and data generation and layout idioms specific to that compiler.
* Decompiling a small routine and then compiling it with a suspected compiler.


To pinpoint the exact compiler used, we obtain every 1993 BCC/TCC version, and 
check their c0*.obj against the entry point of the strong.exe.

Borland compilers are available at:
  https://winworldpc.com/product/borland-c/20
  https://winworldpc.com/product/turbo-c/1x
  https://winworldpc.com/product/borland-turbo-c/1x

First thing we check is what copyright strings these compilers leave.

Turbo C++ 1.01 spits into exe files
  "Turbo C++ - Copyright 1990 Borland Intl."

While Turbo C++ 3.0, Borland C++ 3.0 and 3.1 leave
  "Borland C++ - Copyright 1991 Borland Intl"

That excludes Turbo C++ 1.01.
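
The banner check is easy to automate. Below is a minimal sketch that scans
a loaded executable image for the two copyright strings quoted above. The
function and enum names are hypothetical, and only these two banners are
covered.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the byte string `needle` occurs anywhere in buf[0..len). */
int contains_bytes(const unsigned char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || n > len)
        return 0;
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(buf + i, needle, n) == 0)
            return 1;
    return 0;
}

typedef enum { BANNER_NONE, BANNER_TURBO_1990, BANNER_BORLAND_1991 } banner_t;

/* Classifies an executable image by the compiler banner embedded in it. */
banner_t classify_banner(const unsigned char *image, size_t len)
{
    if (contains_bytes(image, len, "Borland C++ - Copyright 1991 Borland Intl"))
        return BANNER_BORLAND_1991;
    if (contains_bytes(image, len, "Turbo C++ - Copyright 1990 Borland Intl."))
        return BANNER_TURBO_1990;
    return BANNER_NONE;
}
```

The same `contains_bytes` helper also works for the "EMMXXXX0" check from
the memory section.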

Luckily Borland C++ 3.0 and 3.1 come with source code for the runtime.
Although the Turbo versions are missing it.

The startup code is located in c0.asm under LIB/STARTUP.

Reading c0.asm we see that every Borland C++ executable contains
the exact memory model used at around seg000:0262.

The strong.exe has model=0xC004

To interpret it we will need constants from STARTUP/RULES.ASI:
#define FCODE 0x8000  /*far code*/
#define FDATA 0x4000  /*far data*/

Lower bits determine the model type
#define TINY    0
#define SMALL   1
#define MEDIUM  2
#define COMPACT 3
#define LARGE   4
#define HUGE    5

Given that, we can conclude that the model is FCODE|FDATA|LARGE.
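
The decoding of the model word can be sketched as follows. The FCODE and FDATA
bits are the ones listed above. The `MODEL_MASK` value and the function names
are my assumptions for illustration, not something taken from RULES.ASI.

```c
#include <stdint.h>

#define FCODE      0x8000u  /* far code */
#define FDATA      0x4000u  /* far data */
#define MODEL_MASK 0x00FFu  /* assumption: the model type sits in the low byte */

int has_far_code(uint16_t model_word) { return (model_word & FCODE) != 0; }
int has_far_data(uint16_t model_word) { return (model_word & FDATA) != 0; }

/* Numeric model type: 0 = TINY ... 5 = HUGE. */
unsigned model_type(uint16_t model_word) { return model_word & MODEL_MASK; }

/* Human readable name of the model type. */
const char *model_name(uint16_t model_word)
{
    static const char *const names[] = {
        "TINY", "SMALL", "MEDIUM", "COMPACT", "LARGE", "HUGE"
    };
    unsigned m = model_type(model_word);
    return m < 6 ? names[m] : "UNKNOWN";
}
```

For the strong.exe value 0xC004 both far bits are set and the type is 4,
i.e. LARGE.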

Yet the asm code and the resulting obj files in Borland C++ 3.1 differ from
the strong.exe.

After SaveVectors call, strong.exe has LES, Borland C++ 3.1 always has
  mov     ax, _envseg@
  mov     es, ax

The only compilers doing "LES DI,_envseg@" are Borland Turbo C++ 1.01 and 3.0,
as well as Borland C++ 3.0.

Now we have deduced with 100% certainty that either Borland C++ 3.0
or Turbo C++ 3.0 was used to compile strong.exe.
And for Borland the default calling convention is __cdecl, so we set the
  Edit -> Options -> Decompiler -> Prototype Evaluation
to be __cdecl16far.

Arguments to __cdecl16far functions start at SS:[BP+6], since the saved BP at
SS:[BP+0] is followed by a 4-byte far return address (IP at SS:[BP+2], CS at
SS:[BP+4]), which gets popped by RETF (return from a far function).
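
A tiny sketch of the frame arithmetic, assuming the usual `push bp; mov bp, sp`
prologue and word-sized arguments (the helper name `arg_bp_offset` is made up
for this example):

```c
/* Offset from BP of the n-th (0-based) word-sized stack argument.
   Above the saved BP sits the return address: 2 bytes (IP) for a near call,
   4 bytes (IP, then CS) for a far call. */
unsigned arg_bp_offset(int far_call, unsigned n)
{
    unsigned saved_bp = 2;
    unsigned ret_addr = far_call ? 4 : 2;
    return saved_bp + ret_addr + 2 * n;
}
```

So the first argument of a far function sits at BP+6, while for a near
function it would be at BP+4.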

While the Turbo C++ and Borland C++ 3.0 compiler executables are different,
the startup code and the functions in CL.LIB appear to have exactly
the same machine code with both compilers. The executable size difference could
be due to Turbo C++ having two features cut, since its compiler is missing
the following options, which were in the earlier released Borland C++ 3.0:
  -Hxxx   Use pre-compiled headers
  -Wxxx   Create Windows application

Both were intended to support Windows code, which had huge headers.

Therefore determining the exact compiler is impossible.
The Borland C++ 3.0 comes with the source code for the C runtime
library, which we will be using for reference.
The C0.ASM there has the following comment
  "Turbo C++ Run Time Library"
So Turbo C++ 3.0 is probably the same compiler as Borland C++ 3.0.
Just with the Windows support cut out.


Further work would be comparing both TCC and BCC against each other,
and checking if any of the differences are inside of the code generator.
And if there are any, then deducing which variant was most likely used
from the generator choices. But that would be going too far off our way,
just to establish some minor historical certainty.

Strong.exe was also compiled with __NOFLOAT__, since MINSTACK is 128.
Apparently 16 bit x86 already supported floating point numbers.

Borland's C/C++ had one unusual feature: it allowed mixing C and x86 asm
statements together, where the asm statements could reference C variables.
It did some magic behind the scenes to allow that to work seamlessly,
compared to say GCC. The best example of such code is VRAM.CAS.
So back in the day C was truly a portable assembler.
This feature can be useful for partially decompiling the code.

Note that the Borland C++ runtime also includes a few functions with the `pascal`
calling convention, which pushes the arguments in the reverse order compared to
cdecl (left to right) and makes the callee pop them.
Why? No idea, I was never a fan of Pascal anyway.



Signatures Database
________________________________________________

Since the compiler is known, we need a way to determine all the statically
linked functions belonging to its standard library, since that significantly
reduces the amount of work and allows us to go directly to the main().

While IDA includes sig/pc/bc31rtd.sig, Ghidra doesn't support the *.sig files.
Instead Ghidra has its own *.fidb signature libraries.
The only included *.fidb are the signatures for vs20XX.

For Borland C++, there are
https://github.com/moralrecordings/ghidra-fidb-dos-win16

but these don't appear to work with my Ghidra version.
The *.java code there used to generate *.fidb doesn't work with
the latest Ghidra (in addition to the Python script requiring Linux),
since the headless script for some reason fails to import the OMF *.lib files.
Modifying it I was able to produce the *.fidb, but they failed to work either,
just like the premade ones.

What actually works is just using the UI
* File -> Open File System -> pick one of the *.lib files.
* Select all the imported *.obj files and drag them into the code browser,
  which makes "Analysis -> Analyze All Open" possible.
* Finally, Tools -> Function ID -> Create new empty FidDb
  and Tools -> Function ID -> Populate FidDb from programs

Among the recognized functions, we see a few related to the overlay feature.
The STRONG.EXE also includes strings "EMMXXXX0" and "Runtime overlay error".
That means the code was compiled with the -Yo flag, in addition to -ml.
This spells additional difficulties for us, since the same area of memory can
hold different code and data at different times (i.e. bank switching).
And the compiler generated these overlay accesses all over the code.
Basically it is a poor man's virtual memory paging.

The code handling the runtime part of overlays is called overlay manager.
Its compiled code resides in OVERLAY.LIB, which comes without any source code.
Beside EMS, OVERLAY.LIB includes functions to work with XMS.
But that is of no value to us, since the STRONG.EXE speaks with it through API.
More on overlays is on page 357 of Borland C++ 3.1 Programmer's Guide
Borland Open Architecture Handbook 1.0 includes info on debugging them:
http://annex.retroarchive.org/cdrom/psl-v3n8//PRGMMING/DOS/GEN_TUTR/BC4BOA.ZIP

The overlay function prototypes are located in DOS.H
extern  unsigned    _Cdecl  _ovrbuffer;
int cdecl far _OvrInitEms( unsigned __emsHandle, unsigned __emsFirst,
                           unsigned __emsPages );
int cdecl far _OvrInitExt( unsigned long __extStart,
                           unsigned long __extLength );

The _ovrbuffer is initialized to the size_of_the_largest_overlay*2.

The idea apparently was to use two overlays, where code running in one overlay
could load another overlay and jump into it. The compiler generated such loads
behind the scenes. These overlays can reside either in extended memory or on
disk.


Some people already made a feature request on github for it:
https://github.com/NationalSecurityAgency/ghidra/issues/5543

While googling a decompilation of OVERLAY.LIB, I stumbled upon:
https://borland.public.cpp.borlandcpp.narkive.com/9hI7JTC4/platform-dos-overlay
>All the software I own is legitimate. I don't know any real programmer that
would knowingly purchase pirated/illegal software. We all know how much work
goes into getting something that works, is good looking and marketable.

Toxic American morality: deny everything, pretend to be overly pious,
all while daily shoplifting and watching gigabytes of CP.



The C library part of STRONG.EXE also has a string COMPAQ, but that comes
from CRTINIT.OBJ, which checks for COMPAQ to implement some hacks.
CRT stands for cathode-ray-tube, not C Run-Time. And its source is CRTINIT.CAS.


DOS syscalls
________________________________________________

DOS uses int 21h to invoke the OS code. The AH is set to the function index.

Ghidra doesn't support the DOS int 21h or syscalls recognition at all,
but it provides a general mechanism for documenting and decompiling syscalls.

That is done through the creation of separate "imaginary" address space
for mapping function indexes to actual functions.

To create one for DOS we can do
  Window -> Memory Map -> Add new block
Name it say DOS21, and make it an overlay block at 0x0 of size 0x100.

Then we just methodically add DOS syscall labels, turning each one into
a function by selecting the appropriate byte with mouse drag rect.
Then turn all int 21h into a CALL_OVERRIDE_UNCONDITIONAL onto the operand.

The swi(0x21) will remain cluttering the decompiled code,
but after it a proper function call will be present.

It is also possible to populate the DOS21:: segment from a txt file.
That is done with Data/ImportSymbolsScript.py
Which takes a text file with entries like
  DOS_GetDTAAddress   DOS21::2f   f
  DOS_GetDOSVersion   DOS21::30   f

Where DOS21::2f is the function id and `f` means we are making a function map.
That is what I actually did.
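
Typing such a file by hand gets boring fast, so the lines can be generated.
Here is a minimal sketch that formats one entry from a function name and its
AH value. The helper name is hypothetical, and I assume the import script
splits fields on any whitespace, so a single space is used as the separator.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes "<name> DOS21::<ah in hex> f" into out.
   Returns the line length, or -1 if the buffer is too small. */
int format_dos21_symbol(char *out, size_t cap, const char *name, unsigned ah)
{
    int n = snprintf(out, cap, "%s DOS21::%02x f", name, ah & 0xFFu);
    return (n >= 0 && (size_t)n < cap) ? n : -1;
}
```

Calling it with "DOS_GetDOSVersion" and 0x30 produces the second line of
the example above, just with single spaces.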




Locating the main()
________________________________________________

With standard library functions and syscalls being recognized and decompiled,
we are ready for going into the program's code. That involves locating
the main function. Of course we can cheat by using Reko for that, but doing
it manually for once has some educational value.

During the compilation the main() routine is referenced from the startup file.
Luckily we have access to the compiler's startup files and we know which exact
one to use (C0L.OBJ for the large model). So we can just open the C0L.OBJ,
which invokes the analyze dialogue. After that we go to the main symbol
and see where it is referenced from... but it is not referenced from anywhere!!!

Absence of known calls to main() is due to Ghidra's analyzer being very
conservative and ignoring the TEXT section (since no symbols export it
from the OBJ file). We have to initiate the disassembly manually.
The start of the TEXT section is basically our entry routine for the strong.exe.
Borland C++ entry is called start() (or startx() if you consider it also
including the cleanup code).

After the disassembly, we clearly see that the main() routine is referenced
directly at the end of the entry() code. Therefore we just open our strong.exe
disassembly, and go to the entry. That unknown FUN_ there will be the 
strong.exe's main().
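
If you want to find the same entry in a hex editor instead, the MZ header has
everything needed. A minimal sketch, assuming the standard MZ layout (header
size in paragraphs at offset 0x08, initial IP at 0x14, initial CS at 0x16);
the function names are mine:

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian 16-bit value. */
uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* File offset of the initial CS:IP of an MZ executable,
   or -1 if the buffer is too short or lacks the MZ signature. */
long mz_entry_file_offset(const uint8_t *hdr, size_t len)
{
    if (len < 0x1C || hdr[0] != 'M' || hdr[1] != 'Z')
        return -1;
    uint32_t header_bytes = (uint32_t)rd16(hdr + 0x08) * 16u;
    uint16_t ip = rd16(hdr + 0x14);
    uint16_t cs = rd16(hdr + 0x16);
    /* CS is relative to the start of the load image; wrap like real mode. */
    uint32_t in_image = ((((uint32_t)cs) << 4) + ip) & 0xFFFFFu;
    return (long)(header_bytes + in_image);
}
```

For a header of 0x20 paragraphs with CS:IP = 0002:0010 the entry code starts
at file offset 0x230.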

Note though that Borland's main also includes an environment argument:
  int main(int argc, char **argv, char **envp)

Strong.exe's main doesn't use envp, and the start() pops it for us,
because in the cdecl calling convention the caller is responsible for popping
the passed args from the stack, which simplifies the implementation of
variable argument functions, like `printf`.
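
A small variadic function shows why the caller has to do the cleanup: the
callee below has no idea how many arguments were really pushed, it just trusts
`count`. This is plain standard C, and `sum_ints` is only an illustrative name.

```c
#include <stdarg.h>

/* Sums `count` int arguments. Only the caller knows how many values it
   actually pushed, which is why cdecl leaves the stack cleanup to the caller. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

The same reasoning covers main(): start() pushes argc, argv and envp, and it
stays correct no matter how many of them main() declares.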



Conclusion
________________________________________________

Moving from IDA to Ghidra feels like moving from 3ds max to Blender.
Everything is obscure and clunky. For example, in IDA creating a new *.sig
file was just a matter of running plb and sigmake, which unfortunately are
part of FLAIR, which most pirated IDA distributions lack.

But IDA costs several thousand USD and by buying it you support Russians.
So I doubt any person ever purchased IDA legally, outside of pirating
the versions their employers purchased.

Radare2 appears to be a viable alternative to Ghidra, but it is more difficult
to set up, and would be overkill for a small simple project. After all,
nothing stops you from using Reko to locate your main().



Future work
________________________________________________

The next step would be preparing the game for complete decompilation.
That requires recovering the overlay table, which will require a Ghidra script.
Static analysis alone isn't enough to decompile larger programs, where
we slowly replace every function, one by one, with its decompiled version.
That requires comparative dynamic analysis, where we run the original and
decompiled versions side by side, comparing control and data flow traces.
In addition, we need a flexible way to hook recompiled code over the original.
Both goals can be achieved by modifying an existing emulator with breakpoints
which redirect execution towards our decompiled code, which can just as well
be compiled by the host C compiler (or whatever language you like).
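
The hooking and trace comparison idea can be sketched independently of any
particular emulator. The following Python is a toy model, not DOSBox code:
`ToyCpu`, `run`, `first_divergence` and the three-step "program" are all
hypothetical, and each original routine is just a Python callable keyed by
a (segment, offset) pair.

```python
class ToyCpu:
    """Minimal stand-in for an emulator core: a program counter plus one register."""

    def __init__(self, program, entry):
        self.program = program      # {(seg, off): callable(cpu) -> next address or None}
        self.pc = entry
        self.regs = {"ax": 0}
        self.trace = []             # (address, ax) recorded before each step


def run(cpu, hooks=None, max_steps=1000):
    """Execute until a step returns None; hooked addresses override the original."""
    hooks = hooks or {}
    for _ in range(max_steps):
        if cpu.pc is None:
            break
        cpu.trace.append((cpu.pc, cpu.regs["ax"]))
        # Breakpoint on execute: if the address is hooked, run the
        # recompiled routine instead of the original one.
        step = hooks.get(cpu.pc, cpu.program[cpu.pc])
        cpu.pc = step(cpu)
    return cpu.trace


def first_divergence(trace_a, trace_b):
    """Return the index of the first differing trace entry, or None if identical."""
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return i
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None


ENTRY, DOUBLE, DONE = (0x1000, 0x0000), (0x1000, 0x0010), (0x1000, 0x0020)


def orig_entry(cpu):
    cpu.regs["ax"] = 21
    return DOUBLE


def orig_double(cpu):
    cpu.regs["ax"] *= 2
    return DONE


def orig_done(cpu):
    return None                     # end of the toy program


PROGRAM = {ENTRY: orig_entry, DOUBLE: orig_double, DONE: orig_done}


def recompiled_double(cpu):
    cpu.regs["ax"] += cpu.regs["ax"]    # same effect as the original routine
    return DONE


def buggy_double(cpu):
    cpu.regs["ax"] += 2                 # deliberately wrong "decompilation"
    return DONE
```

Running the program once unhooked and once with `hooks={DOUBLE: ...}` and
passing both traces to `first_divergence` tells us at which step a replaced
function starts behaving differently from the original.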

Given my previous experience modifying DOSBox for analysis, when I dumped
the sprites from the game Whizz, hooking the decompiled code shouldn't be
too hard - just set a breakpoint on execute and make DOSBox call the decompiled
code. The hardest task is actually compiling DOSBox on Windows, since it
requires a Unix userspace (to run the ./configure scripts), while I only have
the simplified w64devkit. MinGW apparently provides it together with a mount
command:
  https://www.dosbox.com/wiki/Building_DOSBox_with_MinGW

And if one builds against Cygwin, SDL2 could then expect an actual X Window
server for output. Still, DOSBox is an easy route compared to actually
emulating DOS.

To be continued...




[info]nancygold
2024-05-30 02:22
Ghidra-ready DOS syscall table
DOS_ProgramTerminate          DOS21::00  f
DOS_CharacterInput            DOS21::01  f
DOS_CharacterOutput           DOS21::02  f
DOS_AuxiliaryInput            DOS21::03  f
DOS_AuxiliaryOutput           DOS21::04  f
DOS_PrinterOutput             DOS21::05  f
DOS_DirectConsoleIO           DOS21::06  f
DOS_DirectConsoleInputNoEcho  DOS21::07  f
DOS_ConsoleInputNoEcho        DOS21::08  f
DOS_DisplayString             DOS21::09  f
DOS_BufferedInput             DOS21::0a  f
DOS_GetInputStatus            DOS21::0b  f
DOS_FlushAndInput             DOS21::0c  f
DOS_DiskReset                 DOS21::0d  f
DOS_SetDefaultDrive           DOS21::0e  f
DOS_OpenFileFCB               DOS21::0f  f
DOS_CloseFileFCB              DOS21::10  f
DOS_FindFirstFileFCB          DOS21::11  f
DOS_FindNextFileFCB           DOS21::12  f
DOS_DeleteFileFCB             DOS21::13  f
DOS_ReadFileFCB               DOS21::14  f
DOS_WriteFileFCB              DOS21::15  f
DOS_CreateFileFCB             DOS21::16  f
DOS_RenameFileFCB             DOS21::17  f
DOS_Reserved18                DOS21::18  f
DOS_GetDefaultDrive           DOS21::19  f
DOS_SetDiskTransferArea       DOS21::1a  f
DOS_GetCurDriveAllocInfo      DOS21::1b  f
DOS_GetDriveAllocInfo         DOS21::1c  f
DOS_Reserved1D                DOS21::1d  f
DOS_Reserved1E                DOS21::1e  f
DOS_GetCurDriveParamBlock     DOS21::1f  f
DOS_Reserved20                DOS21::20  f
DOS_RandomRead                DOS21::21  f
DOS_RandomWrite               DOS21::22  f
DOS_GetFilesizeInRecords      DOS21::23  f
DOS_SetRandomRecordNumber     DOS21::24  f
DOS_SetInterruptVector        DOS21::25  f
DOS_CreatePSP                 DOS21::26  f
DOS_RandomBlockRead           DOS21::27  f
DOS_RandomBlockWrite          DOS21::28  f
DOS_ParseFilename             DOS21::29  f
DOS_GetSystemDate             DOS21::2a  f
DOS_SetSystemDate             DOS21::2b  f
DOS_GetSystemTime             DOS21::2c  f
DOS_SetSystemTime             DOS21::2d  f
DOS_SetVerifyFlag             DOS21::2e  f
DOS_GetDTAAddress             DOS21::2f  f
DOS_GetDOSVersion             DOS21::30  f
DOS_TerminateAndStayResident  DOS21::31  f
DOS_GetDiskInfo               DOS21::32  f
DOS_GetSetCtrlBreakFlag       DOS21::33  f
DOS_ReentrancyStatusPtr       DOS21::34  f
DOS_GetInterruptVector        DOS21::35  f
DOS_GetFreeDiskSpace          DOS21::36  f
DOS_GetSetSwitchChar          DOS21::37  f
DOS_GetSetCountryInfo         DOS21::38  f
DOS_CreateDirectory           DOS21::39  f
DOS_RemoveDirectory           DOS21::3a  f
DOS_SetCurrentDirectory       DOS21::3b  f
DOS_CreateFile                DOS21::3c  f
DOS_OpenFile                  DOS21::3d  f
DOS_CloseFile                 DOS21::3e  f
DOS_ReadFile                  DOS21::3f  f
DOS_WriteFile                 DOS21::40  f
DOS_DeleteFile                DOS21::41  f
DOS_MoveFilePointer           DOS21::42  f
DOS_GetSetFileAttributes      DOS21::43  f
DOS_DriverIOCTL               DOS21::44  f
DOS_DuplicateHandle           DOS21::45  f
DOS_RedirectHandle            DOS21::46  f
DOS_GetCurrentDirectory       DOS21::47  f
DOS_AllocateMemory            DOS21::48  f
DOS_ReleaseMemory             DOS21::49  f
DOS_ReallocateMemory          DOS21::4a  f
DOS_ExecuteProgram            DOS21::4b  f
DOS_TerminateWithReturnCode   DOS21::4c  f
DOS_GetReturnCode             DOS21::4d  f
DOS_FindFirstWildcardMatch    DOS21::4e  f
DOS_FindNextWildcardMatch     DOS21::4f  f
DOS_SetCurrentPSP             DOS21::50  f
DOS_GetCurrentPSP             DOS21::51  f
DOS_GetSYSVARS                DOS21::52  f
DOS_CreateDiskParameterBlock  DOS21::53  f
DOS_GetVerifyFlag             DOS21::54  f
DOS_CreateProgramPSP          DOS21::55  f
DOS_RenameFile                DOS21::56  f
DOS_GetSetFileDateTime        DOS21::57  f
DOS_GetSetAllocationStrategy  DOS21::58  f
DOS_GetExtendedErrorInfo      DOS21::59  f
DOS_CreateUniqueFile          DOS21::5a  f
DOS_CreateNewFile             DOS21::5b  f
DOS_LockFile                  DOS21::5c  f
DOS_FileSharingFunctions      DOS21::5d  f
DOS_NetworkFunctions          DOS21::5e  f
DOS_NetworkRedirection        DOS21::5f  f
DOS_QualifyFilename           DOS21::60  f
DOS_Reserved61                DOS21::61  f
DOS_62GetCurrentPSP           DOS21::62  f
DOS_GetDBCSPtr                DOS21::63  f
DOS_SetWaitExternalEventFlag  DOS21::64  f
DOS_GetExtendedCountryInfo    DOS21::65  f
DOS_GetSetCodepage            DOS21::66  f
DOS_SetHandleCount            DOS21::67  f
DOS_CommitFile                DOS21::68  f
DOS_GetSetMediaId             DOS21::69  f
DOS_6aCommitFile              DOS21::6a  f
DOS_Reserved6b                DOS21::6b  f
DOS_ExtendedOpenCreateFile    DOS21::6c  f

[info]aryk38
2024-05-30 03:03
somebody urgently take the keyboard away from this guy!!

(Anonymous)
2024-05-30 05:46
Watched the game on YouTube. Some kind of proto-Dungeon Keeper set above ground, with a DnD ruleset bolted onto the side. Untextured polygonal graphics plus sprites. What is interesting there? Any random shader on shadertoy has more wisdom buried in it than this game's binaries.

Reverse engineering in general is like digging through poop, if you ask me. Not a creative activity.

>actually compiling the DOSBox on Windows

Windows subsystem for linux.

[info]nancygold
2024-05-30 11:15
>Windows subsystem for linux.

That requires installing an entire Linux distro.
I don't have that much HDD space.

(Anonymous)
2024-05-30 12:05
You don't have 10gb? Great success in life!

(Anonymous)
2024-05-30 05:49
so what were you actually trying to say?

(Anonymous)
2024-05-30 05:49
ПОДКАТ СУКА

(Anonymous)
2024-05-30 06:30
Подкатил тебе за щеку

(Anonymous)
2024-05-30 06:51
>паразитирует на опенсорсе при этом хая его

Покупай за свои кровные, жопные, или делай сам, гнойный анти-опенсорсный пидор.

IDA + decompiler = 365 USD + 2765 USD Сколько раз тебе нужно продать свою жопу за такую сумму?

>by buying it you support Russians. No more Russians than you are. By continuing living you support a Russian.

Отмазы нищеброда. HexRays is a Belgian company.

>Doesn't spport decompiling 16 bit x86

Не проблема для пропретарной бляди вроде тебя. IDA has open plugin architecture.

[info]nancygold
2024-05-30 11:10
https://en.wikipedia.org/wiki/Ilfak_Guilfanov

a Russian software developer,
graduated from Moscow State University

(Anonymous)
2024-05-30 12:00
kiwifarms.net/threads/nikita-vadimovich-sadkov-nashgold-zolotse-sadk0v-snv1985.68524/

A Russian software developer, didn't graduate from shit
, because retarded, but still a Russian software developer, in the same sense as these Belgians from HexRay, тупорылая ты скотина.

[info]nancygold
2024-05-30 13:51
stay mad

(Anonymous)
2024-05-30 14:01
But it's you who is mad. You would refulse to "support" yourself, if there was another Sadkov.

Also, Russians did nothing wrong to you. Bullying and ostracism of violent genetic trash is natural and righteous.

[info]nancygold
2024-05-30 11:13
>IDA has open plugin architecture.

I know, but you will have to plug in Ghidra or Reko.

(Anonymous)
2024-05-30 12:02
No, you don't get to plug in any opensource. You have to write your own, to not be hypocritical piece of shit that you are.

[info]nancygold
2024-05-30 13:50
Too much effort.

(Anonymous)
2024-05-30 14:50
Yes, not using opensource is "Too much effort", yet you still argue against opensource regularly.

(Anonymous)
2024-05-30 12:26
the diploma won't earn itself

(Anonymous)
2024-05-30 12:43
>to study the stupid modern Java practices
https://libgen.pm/index.php?req=java&columns%5B%5D=t&columns%5B%5D=a&columns%5B%5D=s&columns%5B%5D=y&columns%5B%5D=p&objects%5B%5D=f&topics%5B%5D=l&res=100&covers=on&filesuns=all&curtab=f&order=year&ordermode=desc

[info]nancygold
2024-05-30 15:17
These are books, not practices.

(Anonymous)
2024-05-30 17:57
Please write posts like this more often. What you're doing is extremely interesting and inspiring. Seems like you're getting close to something really big
