Post mortem: Fiesta 2 crashing on modern linux

Date of writing: 17th November 2019

This is a post mortem about an issue that puzzled me for months. In the following sections, I document the whole story in as much detail as possible, hoping it will be useful for anyone else (or even myself) in the future.

What happened?

I have been running a multidisk setup with all pump games on it for a few years now and have been maintaining and improving pumptools in the process. I always kept things fairly up to date, because even on a dedicated machine you eventually need some additional tooling, and updating a system that has not been updated for months or years becomes very difficult. Furthermore, things that worked before might stop working because of one or more issues introduced with the update.

Anyway, my cabinet with the multidisk setup yields the following on uname -a: Linux piumd 4.15.11-1-ARCH #1 SMP PREEMPT Mon Mar 19 18:21:03 UTC 2018 x86_64 GNU/Linux

My workstation is a bit more up to date (though I am a few months behind...): Linux ambient 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux

In the past, more than a year ago, my workstation was also running a newer kernel and newer versions of the runtime libraries than the cabinet, but running Fiesta 2 and newer with pumptools wasn't an issue. However, when I continued my development efforts, Fiesta 2 and newer had stopped working and just crashed with a segmentation fault. All versions prior to Fiesta 2 were still working perfectly fine.

Error analysis: first information and debugging steps

First, I enabled f2hook's built-in logging to yield more output (util.log.level=4) and enabled halting on crashes to be able to attach a debugger (patch.sigsegv.halt_on_segv=1). This is the output I got:

[I][f2hook][main.c:155]: Hooking finished
[M][f2hook][main.c:168]: Calling orig_init
[E][patch-sigsegv][sigsegv.c:18]: ===================
[E][patch-sigsegv][sigsegv.c:19]: !!!!! SIGSEGV !!!!!
[E][patch-sigsegv][sigsegv.c:28]: Backtrace (3 frames):
[E][patch-sigsegv][sigsegv.c:31]: ./f2hook.so(+0x6a02) [0xf7e20a02]
[E][patch-sigsegv][sigsegv.c:31]: ./f2hook.so(+0xf827) [0xf7e29827]
[E][patch-sigsegv][sigsegv.c:31]: linux-gate.so.1(__kernel_sigreturn+0) [0xf7f07940]
[E][patch-sigsegv][sigsegv.c:37]: Halting for debugger...

Ok, so it seems like something crashed in f2hook.so. However, hooking up the debugger did not yield anything useful, as it was just pointing to the sigsegv handler (though this had worked when debugging other issues and actually pointed to the location of the crash... strange).
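The module-relative offsets in the backtrace (for example +0x6a02) can be translated into function names and source lines with addr2line from binutils. The following is only a minimal sketch of that step: the helper name `addr2line_commands` is made up for illustration, and the result is only useful if f2hook.so was built with debug information.

```python
import re

# Matches backtrace frames of the form "module(+0xOFFSET) [0xADDRESS]" as
# printed in the log above. Frames that carry a symbol name instead of a bare
# offset (e.g. "__kernel_sigreturn+0") are skipped on purpose.
FRAME_RE = re.compile(
    r"(?P<module>\S+)\(\+(?P<offset>0x[0-9a-fA-F]+)\)\s+\[0x[0-9a-fA-F]+\]"
)


def addr2line_commands(log_text):
    """Return one addr2line argv list per frame with a module-relative offset.

    Hypothetical helper for illustration, not part of pumptools.
    """
    commands = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            commands.append(
                ["addr2line", "-f", "-e", match.group("module"), match.group("offset")]
            )
    return commands
```

Each returned list can be passed to `subprocess.run` from the directory the game was started in, since the module paths in the log are relative.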

Next, I started removing stuff from f2hook to get a minimal version which still boots up to that point, but nothing changed: it still crashed.

Then, I hooked up gdb, set a breakpoint before the crash happened and stepped my way through the whole libc runtime setup. I figured out that the crash happens when the libc runtime calls handler functions of the target application which set up various parts of the runtime environment, like static initialization.

And this is the place where the crash happened. There is one function in the piu executable which sets up all static variables, for example constructing static strings or other objects. Once it tried to construct the first static string of that static setup function, it crashed. To be more precise, it crashed in the string constructor when allocating memory for the string.

Great, we found the root cause for that. Let's figure out why that happens and what's different now that it crashes there.

Tracing the bug down to libstdc++ and being totally misled by various things

Because there have been issues with incompatible libc/libstdc++ versions previously, I quickly came up with the idea of comparing the versions that worked on my cabinet with the ones that did not work on my workstation. Indeed, these were different versions (considering the age gap of the two OSes, that wasn't very surprising).

I started digging into the libstdc++ source code, especially looking at what is going on when strings are allocated. Navigating the libstdc++ source is not a nice task. The naming is very confusing and, imo, non-intuitive. Templating all over the place makes it even worse.

Finally, I found something that sounded very related to what I have discovered so far: In the file include/bits/basic_string.h, search for

_GLIBCXX_RESOLVE_LIB_DEFECTS
3076. basic_string CTAD ambiguity

and you will end up at a string constructor call. From that, I concluded that something related to constructor calls had been fixed. I compared the string allocation call of both versions, the non-working and the working one. The allocation for strings had changed.

Going back to the location of the crash in the piu executable, I saw that the string allocation was inlined and did not go through a library call. The inlined code belonged to... boost filesystem. The first static object created is the string "../", which is used by a boost filesystem path object. Because much of boost filesystem's path handling is implemented in its headers, I assume that's why this got inlined.

However, I was using the libstdc++ version of my local machine where the string allocation scheme was different. This still did not truly explain why it crashed there with a jump to address 0, but things aligned pretty well so far.

So, the first idea for a solution was to use the old libstdc++ version from the machine where everything was fine and working. However, using it did not change anything. I went on and grabbed more libraries from the machine until I had all of them on my workstation and hooked them up with LD_LIBRARY_PATH. But now, the application crashed very early and did not even get to setting up the runtime at all. I did some poking but could not figure out what was happening there. It just crashed with another jump to 0 very early in launching the ELF.

Debugging continued for a bunch of weeks, on and off. I got more focused on digging deeper into the string allocation thing as this was giving me a direction and was aligning with my findings so far. I came up with a "solution" to hook and overwrite the string constructor calls of the newer libstdc++ version.

I started replacing them one by one to re-implement the old allocation scheme. However, this was not enough as many function calls that are working with string objects rely on the used allocation scheme, too. Therefore, I also started patching these until I realized that I have to re-implement, at least, everything basic_string related.

At that point, I was quite sure that this is the wrong way to go and stopped with this approach.

A few more months passed. I came back to this multiple times but couldn't come up with another solution or even a direction.

AppImage

In a talk with a buddy, I had the idea of using some sort of sandboxing or environment boxing to pack things up nicely into a single blob to make it easier to handle, even distributable and maybe even more platform independent.

After some internet research, I found a solution that looked like it fits the pump game use-case very well: AppImage. AppImage packs all your stuff into a single binary blob which is basically just a SquashFS image with a special runtime that bootstraps the whole image and runs a bash script once the image is mounted. The provided features are very minimalistic but cover all our needs for packing pump games nicely.

The reference documentation also pointed out a few things about packing runtime libraries and how to set them up using LD_LIBRARY_PATH, which I was already doing with just a naked bash script. However, reading this started to point me in the right direction.

Providing a platform/distro independent environment for your application down to the linker

I started playing around with the libraries I used for Fiesta 2. First, I grabbed all libraries from my cabinet which are required to run the game (using ldd). Then I started trying different combinations of including them using LD_LIBRARY_PATH and some used by my native system. Nope, either I was running into the static init crashing issue or things crashed even earlier on.
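Collecting the libraries reported by ldd can be automated. The following is a minimal sketch that parses the text printed by `ldd <binary>` into a mapping from library name to resolved path. The function name `parse_ldd_output` is made up for illustration, and the sample lines in the usage note below are illustrative, not taken from the cabinet.

```python
def parse_ldd_output(text):
    """Map library names to resolved paths from the output of `ldd <binary>`.

    Entries without a path on disk (such as the virtual linux-gate.so.1 or
    libraries reported as "not found") map to None.
    Hypothetical helper for illustration, not part of pumptools.
    """
    libraries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Drop the trailing load address, e.g. "(0xf7c00000)".
        if line.endswith(")") and "(" in line:
            line = line[: line.rindex("(")].strip()
        if "=>" in line:
            name, _, path = line.partition("=>")
            path = path.strip()
            libraries[name.strip()] = path if path and path != "not found" else None
        elif line.startswith("/"):
            # The dynamic loader itself is listed by its absolute path only.
            libraries[line.rsplit("/", 1)[-1]] = line
        else:
            libraries[line] = None
    return libraries
```

For a line such as `libstdc++.so.6 => /usr/lib32/libstdc++.so.6 (0xf7c00000)` the mapping contains the library name as key and the path as value; every non-None path is a file that would have to be copied from the cabinet.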

Wait, have I ever looked into why the application crashed so early, even before spinning up the runtime environment? After working on this on and off for months and totally losing track by getting too deep into this libstdc++ thing, I realized I hadn't.

I set up the application to use all libraries from my cabinet and started debugging. First, I tried my luck with gdb but didn't get very far as it just showed me a jump to 0, again. Next, someone pointed me to valgrind. This was a great hint, as valgrind hooks into the bootstrapping process of the application and displays a lot of information about library loading and setting up the environment, even way before the runtime is set up. Recommendation: Use valgrind -v -v -v to get a lot of output.

However, in the end, valgrind also just pointed out the jump to 0 being the reason for the crash but not why this happened. The additional information was good to figure out the steps happening in the application setup process but did not yield any additional insights.

After giving this some more thought and making sure I used all runtime libraries from the cabinet, I did some internet research again and tried to get more information on how ELFs are bootstrapped and what happens in the process. It turned out that information on that topic is very sparse and difficult to find. However, this brought me to the following thread from a GNU mailing list:

Hi all;

I have a need to use a different runtime linker (ld-linux.so) for some
of my applications.  I _don't_ want to change the executables themselves
(changing the path to the runtime linker in the ELF image for example).

I know that I can just run my application as an argument to ld-linux.so,
and that actually does work very well.  But, I am left with two
problems:

  * Some of my applications fork/exec other programs, and those other
    programs also need to use the other runtime linker.  Just as I don't
    want to change the ELF image, I certainly don't want to have to
    recompile the programs to exec the runtime linker!

  * Debugging: I can't debug because I can't find a way to convince GDB
    to invoke the program-to-be-debugged using an alternative runtime
    linker.

I was wondering if anyone has any ideas about how to do this?  I checked
the man pages and didn't see anything (I was looking for something like
an environment variable that ld-linux.so might obey to re-exec a
different version, or some such thing--not likely, I know).

If there's nothing like that available another idea is to use LD_PRELOAD
to install a private version of the execve() system call, which would
set the alternative ld-linux.so as the first argument.  Any thoughts on
this approach?  Do you think it would solve the GDB problem as well?
I'm going to proceed to work on this idea but I wanted to see if anyone
had other thoughts about it.



PS. In case anyone's curious, I'm trying to run some programs natively
    on my desktop system which were actually compiled for a different
    system, and the native and target systems have different versions of
    glibc installed.  I can't change either of the versions of glibc,
    but I do have the complete root filesystem of the target box in a
    subdirectory of my native box and I can use LD_LIBRARY_PATH, etc. to
    use them... I just need a way to get the runtime environment to
    initialize properly.

--
-------------------------------------------------------------------------------
 Paul D. Smith <address@hidden>   HASMAT--HA Software Mthds & Tools
 "Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
   These are my opinions---Nortel Networks takes no responsibility for them.

Source: https://lists.gnu.org/archive/html/help-gnu-utils/2004-06/msg00020.html

Oh, wow. I was amazed that someone else was having a very similar use-case of running applications from legacy systems on a newer machine with different library versions and conflicts.

Reading on, this answer was finally revealing the last piece of the puzzle to understand the issue I was facing with Fiesta 2:

If you have an old a.elf main executable program that was linked
to run with libc.so.6, and you want to run that a.elf on a machine
whose libc.so.6 differs from the one which a.elf was linked against,
and you want a reasonable guarantee that it will work, then it may
be necessary to use something like rtldi to choose the matching
ld-linux.so.2 and libc.so.6 and other libraries as of the time of
static binding of a.elf.  In theory, libc.so.6 is backwards compatibile
(a given a.elf should run the same under any subsequent libc.so.6),
but in practice there have been too many bugs in {ld-linux.so.2,
libc.so.6, ...}.  For example, many programs linked for libc.so.6
under RedHat Linux 7.2 and earlier won't run under glibc-2.3.2.
The most effective way to run such a program may be to install the
old glibc-x.y.z[.rpm] in a different location, binary edit the PT_INTERP
string of the a.elf, and use rtldi.  This makes the old [edited] a.elf
simultaneously interoperable with other programs from different
libc.so.6 generations, and only the "oddball" program has to know.

...

ld-linux.so.2, libc.so.6, libdl.so, libnss*.so, and _many_ other shared
libraries of a given released version, are a matched set.  They must be
used together, all from the same release, or not at all.  Mixing and
matching pieces from different glibc-x.y.z need not work.  In order to
use a consistent set that differs from the default set, you must invoke
the ld-linux.so.2 using "--library-path PATH" to specify the location
of the rest of the set.  Using --library-path supersedes LD_LIBRARY_PATH
for one execve() only, and does not interfere with chilren, execve()
with no fork(), or other processes.

Oh, I was not aware of ld-linux.so being THAT important to this set of libc.so, libdl.so etc. Therefore, I grabbed the ld-linux.so file from my cabinet and ran the application with all libraries (except the GPU driver specific ones) on my workstation like this:

LD_PRELOAD="./f2hook.so" ./ld-linux.so.2 --library-path ./lib ./piu ./game/ --options ./hook.conf

And the application starts up, passes runtime initialization, hits main and the game works.
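The same invocation can be assembled programmatically, for example from a small launcher. The following is only a sketch under the assumption that the loader, the library directory and the hook library are laid out as in the command above. The function names `build_loader_command` and `launch` are made up for illustration.

```python
import os


def build_loader_command(loader, library_dir, executable, args,
                         preload=None, base_env=None):
    """Return (argv, env) for starting `executable` through a specific loader.

    Mirrors: LD_PRELOAD=<preload> <loader> --library-path <library_dir> <executable> <args...>
    Hypothetical helper for illustration, not part of pumptools.
    """
    argv = [loader, "--library-path", library_dir, executable, *args]
    # Copy the environment so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    if preload:
        env["LD_PRELOAD"] = preload
    return argv, env


def launch(argv, env):
    """Replace the current process with the game, like `exec` in a shell script."""
    os.execve(argv[0], argv, env)
```

Calling `build_loader_command("./ld-linux.so.2", "./lib", "./piu", ["./game/", "--options", "./hook.conf"], preload="./f2hook.so")` and passing the result to `launch` corresponds to the command line shown above.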

I don't think there is anything else to add to the copy-pasted explanations above. Make sure to read them carefully to understand the whole issue. Once you know what is going on, the whole thing is not even super complex.
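To see which loader an executable requests by default, the PT_INTERP string mentioned in the quoted answer can be read straight from the ELF program headers. The following is a minimal sketch that follows the standard ELF32/ELF64 header layout. The function name `read_interpreter` is made up for illustration, and error handling for truncated files is left out.

```python
import struct

PT_INTERP = 3  # program header type holding the path of the dynamic loader


def read_interpreter(path):
    """Return the PT_INTERP string of an ELF file, or None if it has none.

    Hypothetical helper for illustration, not part of pumptools.
    """
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little endian

    # Location and shape of the program header table (from the ELF header).
    if is_64bit:
        e_phoff, = struct.unpack_from(endian + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        e_phoff, = struct.unpack_from(endian + "I", data, 0x1C)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x2A)

    for i in range(e_phnum):
        entry = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from(endian + "I", data, entry)
        if p_type != PT_INTERP:
            continue
        # p_offset and p_filesz sit at different offsets in 32- and 64-bit headers.
        if is_64bit:
            p_offset, = struct.unpack_from(endian + "Q", data, entry + 8)
            p_filesz, = struct.unpack_from(endian + "Q", data, entry + 32)
        else:
            p_offset, = struct.unpack_from(endian + "I", data, entry + 4)
            p_filesz, = struct.unpack_from(endian + "I", data, entry + 16)
        return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None
```

Running an executable directly uses the loader named in this string, whereas passing the executable as an argument to a specific ld-linux.so.2, as in the command above, bypasses it without editing the ELF.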

Conclusions

I consider this one of the major issues I have resolved so far for pumptools, as it was so misleading, difficult to debug and hard to find any general information on. The reason behind it and the solution are super simple, once you know them.

As we have already discussed various times in the past, this issue further points out how important proper solutions for sandboxing, containerizing, or whatever you want to call it, are for legacy applications like pump games. The naked bash script setting up everything is already quite good and does the job. AppImage further enhances this by making the whole application more portable.

However, understanding the whole process behind ELF loading, library linking, setting up the process environment and setting up the runtime environment is very important. Otherwise, rather simple incompatibilities can have a major impact on running the application.