A Global Variable in U-Boot that caused a Hang

The best type of software bug is one where you get to learn something along the way. Like any good disaster, the bug we’re going to explore is one that arose from a chain of unexpected events and bad assumptions. This bug relates to uninitialised data and gives a good insight into the inner workings of U-Boot. Our bug resulted in a hang, though we’ve simplified it in this blog post to a single patch of buggy code to illustrate the problem – can you spot the bug in the following patch?

diff --git a/common/board_f.c b/common/board_f.c
index 9f441c44f176..ded4b2a87960 100644
--- a/common/board_f.c
+++ b/common/board_f.c
@@ -669,8 +669,12 @@ static int reloc_bloblist(void)
        return 0;
 }
 
+static int global_variable = 0;
+
 static int setup_reloc(void)
 {
+       printf("Value at %p is 0x%x\n", &global_variable, global_variable);
+
        if (gd->flags & GD_FLG_SKIP_RELOC) {
                debug("Skipping relocation due to flag\n");
                return 0;

What is the value of ‘global_variable’ at the point where we call printf? We expect it to be zero as we’ve initialised it to zero in the variable’s declaration. We can also assume that nothing else touches this variable. Let’s run U-Boot to find out:

U-Boot 2021.01-rc2-00121-g5b8991c667f7-dirty (Nov 26 2020 - 23:04:06 +0000)                                                                                                                                                           
                                                                                                                                                                                                                                      
DRAM:  948 MiB                                                                                                                                                                                                                        
Value at 0000000000103da4 is 0x18089750                                                                                                                                                                                               
RPI 3 Model B+ (0xa020d3)               

As you can see, the value isn’t zero: it’s 0x18089750. Of course, if you were hunting down some undesirable behaviour in U-Boot, it’s very unlikely that the above code would raise any suspicion, yet the side effects can be significant. Imagine if ‘global_variable’ was a pointer that was initialised to NULL. It’s very common to write code that checks a pointer isn’t NULL before accessing it, so a random value here would pass that check and could cause an exception (as was the case in the original manifestation of this bug) and a hang.

In this particular case the value that is printed is the same each time U-Boot runs. However, if we add a printf elsewhere in the code, the value may change. For example, let’s add a printf to another function in this file:

diff --git a/common/board_f.c b/common/board_f.c
index 9f441c44f176..de9329e32e2d 100644
--- a/common/board_f.c
+++ b/common/board_f.c
@@ -666,11 +666,16 @@ static int reloc_bloblist(void)
        }
 #endif

+       printf("Hello\n");
        return 0;
 }

+static int global_variable = 0;
+
 static int setup_reloc(void)
 {
+       printf("Value at %p is 0x%x\n", &global_variable, global_variable);
+
        if (gd->flags & GD_FLG_SKIP_RELOC) {
                debug("Skipping relocation due to flag\n");
                return 0;

Let’s see what the value is now:

U-Boot 2021.01-rc2-00121-g5b8991c667f7-dirty (Nov 26 2020 - 23:38:38 +0000)                                                                                                                                                           
                                                                                                                                                                                                                                      
DRAM:  948 MiB                                                                                                                                                                                                                        
Hello                                                                                                                                                                                                                                 
Value at 0000000000103de4 is 0xb850406c                                                                                                                                                                                               
RPI 3 Model B+ (0xa020d3)                        

The value has changed. In fact, most changes to the code will have the same bizarre effect. As a result this type of bug is tricky to track down: if the bug causes a hang, you may appear to fix it by changing something unrelated, but the bug is still present; it’s just not causing undesirable behaviour anymore.

So what went wrong? The clue is in the name of the function where we make use of the global_variable: the function ‘setup_reloc’ is run before U-Boot relocates itself in RAM. At this point in time the C environment is not fully set up, and as a result global variables that are uninitialised or initialised to zero shouldn’t be used, as their value will be undefined. But this is noted in the U-Boot documentation, which I assume you’ve fully read, right?

To understand this let’s unpack it some more. When you declare a global variable and initialise it to a non-zero value, the toolchain will place the variable in a program section called ‘.data’. Different sections are used for different purposes, because when the program is loaded the program loader may need to place these sections in different locations. For example, if a program was run from read-only memory, the loader would need to copy the .data section of the program into read/write memory so that variables can be written to. We can find out which section a variable or symbol has been placed in via the nm command, e.g.:

$ cat test.c 
int variable_1;
int variable_2 = 0;
int variable_3 = 1;

int main()
{
        return 0;
}
$ aarch64-linux-gnu-gcc -O0 test.c 
$ aarch64-linux-gnu-nm a.out | grep variable
0000000000411034 B variable_1
0000000000411030 B variable_2
0000000000411028 D variable_3

The output above shows that ‘variable_3’, which was initialised to 1, has the symbol type ‘D’, i.e. it is in the .data section. This is as we expected. You’ll also notice that ‘variable_1’, which we didn’t initialise, has the symbol type ‘B’. This is the BSS (historically ‘Block Started by Symbol’) and is where uninitialised variables are stored. However, to save space in the executable, the associated data for the BSS is discarded; after all, it’s uninitialised. The only information the executable keeps about the BSS is its size and location, which allows the program loader to allocate space for it.

You’ll also notice that ‘variable_2’, which was initialised to 0, is also in the BSS section. This may be unexpected as we’ve initialised it. However, on many platforms the BSS is zeroed out before the program runs, so compilers often place data that is initialised to 0 in this section too. In fact there is a GCC switch for controlling this behaviour:

$ aarch64-linux-gnu-gcc -O0 -fno-zero-initialized-in-bss test.c
$ aarch64-linux-gnu-nm a.out | grep variable
0000000000411034 B variable_1
0000000000411028 D variable_2
000000000041102c D variable_3

You can see that by passing the flag ‘-fno-zero-initialized-in-bss’ we’ve moved variable_2 from the BSS into the .data section.

Let’s go back to our original bug. We’ve just learnt that because our ‘global_variable’ was initialised to zero, it’s likely in the BSS section. As a result, something has to initialise this area of memory to zero before we start using it, a job normally done by the program loader. However, as we’re a bootloader we don’t have the luxury of a program loader, and we don’t want to rely on the previous firmware doing this for us. Therefore U-Boot clears the BSS itself. At the point where we accessed our variable, though, the BSS hadn’t yet been cleared and so the value was undefined. The value is whatever happens to be in that memory location at the time, which of course depends on what earlier boot firmware or U-Boot has done.

The value of ‘global_variable’ changes whenever we add printfs or change the code because those changes move the memory location of global_variable within the program image. We can see this with nm if we compare the two versions of U-Boot that we’ve built. We can also see it in the addresses printed by the printf statements earlier.

$ aarch64-linux-gnu-nm u-boot.printf | grep global_variable
0000000000103de4 b global_variable

$ aarch64-linux-gnu-nm u-boot.noprintf | grep global_variable
0000000000103da4 b global_variable

There is one final question to answer: why does U-Boot relocate itself, and why does it do it so late during its initialisation? Why not initialise the BSS right at the start of day, before it even jumps to C code?

U-Boot relocates itself for a variety of reasons – initially the SDRAM may not have been set up and U-Boot is running from SRAM or a ROM. Even if SDRAM was available when U-Boot starts – ideally we’d keep U-Boot at the top of RAM so that there is a large amount of contiguous RAM for users of U-Boot to make use of (e.g. loading large images). However as the code isn’t position independent, at link-time you’d need to know the load-address of the code – and this may change every time the size of U-Boot changes. This would also require updates to the firmware such that it copies and jumps to U-Boot at the right address.

Now this is the interesting bit. When U-Boot is built, the machine code that is generated is designed to run from a given address in memory (CONFIG_SYS_TEXT_BASE), i.e. it would be OK for the compiler to generate instructions that refer to absolute addresses. Therefore when U-Boot relocates itself, with the help of some relocation information added by the toolchain (.rela.dyn), it has to rewrite itself to fix up addresses so that they point to the relocated area in memory. This of course includes symbols in our BSS. We can see this by printing the address of a global variable before and after relocation.

U-Boot 2021.01-rc2-00121-g5b8991c667f7-dirty (Nov 27 2020 - 11:08:51 +0000)                                                                                                                                                           
                                                                                                                                                                                                                                      
DRAM:  948 MiB                                                                                                                                                                                                                        
global_variable is located at 0000000000103de4 in board_f.c
global_variable is located at 000000003b3e4de4 in board_r.c
RPI 3 Model B+ (0xa020d3)                  

You can see that the address of our variable changes depending on when we refer to it. Everything before the call to board_init_r is prior to relocation and thus prior to this the BSS shouldn’t be used.

U-Boot could be modified to clear the BSS at the start of day. However, at this point RAM may not be initialised, and in any case the BSS would then have to be relocated when U-Boot relocates itself, which it doesn’t currently have to do. Also, by doing it this way it’s consistent with the rest of the code base.

And that’s how an innocent code change can wreak havoc!
