It started as a request from our tech lead: please help triage these patches. So I looked at the set of patches and started with what looked like the simplest one:
Fix topology for Core scheduling.
It *just* reorders the code to call
store_cpu_topology(cpu);
before
notify_cpu_starting().
Yeah…there is no such thing as a simple patch. These are my notes as I study and learn ACPI. My assumptions are here for all to see, and may well prove to be wrong.
Let's dig in.
To start, I figured I would reproduce the bug. If you follow the conversation on down, there is some discussion about using a recent version of stress-ng and ensuring that CORE_SCHED and related symbols are defined. I got the latest out of GitHub, built it, and ran it on an Altra Max using the Fedora 36 (unpatched) kernel…and it ran just fine.
I’ve since reread the message that states: “I assume this is for a machine which relies on MPIDR-based setup (package_id == -1)? I.e. it doesn’t have proper ACPI/(DT) data for topology setup.”
And I realize that the machine I am working on does not match that criterion. What is it talking about? Let's start with ACPI. The C in ACPI stands for Configuration and the P stands for Power. Many of us associate ACPI with dealing with standby and hibernate issues on laptops. However, ACPI really is a generic way to manage hardware, since a system can have many different physical configurations. ACPI can be thought of as a way for the hardware to tell the operating system things such as "what devices are installed in the system," "what buses are they attached to," and "what power state are they in." It then allows the operating system to effect change in those devices.
Because ACPI is an optional subsystem, it resides in the driver subtree of the Linux kernel. At this point, my brain noted two things:
- The code I am chasing down is in the acpica subdirectory.
- There is an architecture specific subdirectory for arm64 as well as x86_64.
Like many technologies in the open-source world, ACPI is an open standard with a dominant reference implementation. ACPICA is an architecture-agnostic implementation of most of ACPI. Thus, the ARM64-specific subdirectory only has a small bit of code: glue for ARM64 systems like the one I'm testing.
But then I noticed that the top level of drivers/acpi has a bunch of files as well, leaving my brain to ponder why some common code is in drivers/acpi and some in drivers/acpi/acpica. If I find an answer as to why things would go in one versus the other, I will update.
Note that "driver" is not the same as "module." Running lsmod on my system does not show any ACPI modules. But looking at the config for the kernel:
grep "CONFIG_ACPI=" /usr/lib/modules/5.17.13-300.fc36.aarch64/config
CONFIG_ACPI=y
It is built into the stock Fedora kernel for ARM64.
What is the earliest reference to ACPI in the kernel? As of this writing, the earliest I can find is in init/main.c: acpi_early_init(); This means that ACPI is explicitly one of the subsystems initialized when the kernel first starts running.
The code that implements the early init is in drivers/acpi/bus.c. After we pass all the precondition checks, the heart of the function is a pair of calls: acpi_reallocate_root_table(); and acpi_initialize_subsystem();
What is a root table? It is defined in drivers/acpi/acpica/acglobal.h, but what populates it? In drivers/acpi/acpica/tbdata.c we see: acpi_gbl_root_table_list.tables = tables; Let's dig a little deeper there.
The ACPI table list is defined in drivers/acpi/acpica/aclocal.h. As a structure, it is designed to carry the header information for a list of tables. Aside from the pointer to the table entries, it has two counts, current and max, and a byte-sized binary flag field. If we assume that this whole structure gets initialized to 0 at allocation, then the table counts will both be 0. Working with this assumption (which may well be wrong) means that the acpi_early_init → acpi_reallocate_root_table code may well be the first place to ever modify this table, doing that initial population. Let's assume that for the moment and mentally trace the code.
Passing the preconditions (including a branch that returns early after only setting values in the structure), we see that the first real call is acpi_ut_acquire_mutex(ACPI_MTX_TABLES);
The function acpi_ut_acquire_mutex is fairly complex code for acquiring a mutex. It does a good bit of deadlock prevention and makes some OS-wrapped calls. It eventually resolves to a call to down_timeout, which is a Linux-specific call.
With the mutex acquired, we get into the table loading code. We skip the first loop, as it is bounded by the check i < acpi_gbl_root_table_list.current_table_count, and right now we are working with the assumption that this count is still 0. We skip the following loop for the same reason.
After the loops, we see that flags are initialized. Between the two flag assignments is a call to resize the root table list. Maybe this is where the first initialization happens?
acpi_gbl_root_table_list.flags |= ACPI_ROOT_ALLOW_RESIZE;
status = acpi_tb_resize_root_table_list();
acpi_gbl_root_table_list.flags |= ACPI_ROOT_ORIGIN_ALLOCATED;
The resize code is also in tbdata.c. Skip the precondition checks. The variable table_count gets initialized to 0, and max_table_count gets initialized to 0 + ACPI_ROOT_TABLE_SIZE_INCREMENT, which makes it 4. A table of that many entries gets allocated and zeroed out.
This if statement will evaluate to false and thus skip the block that does the actual copying.
if (acpi_gbl_root_table_list.tables)
There are a handful of assignments at the end of the function. The ones that catch my eye are:
acpi_gbl_root_table_list.tables = tables;
...
acpi_gbl_root_table_list.flags |= ACPI_ROOT_ORIGIN_ALLOCATED;
Which makes me look back up into that skipped block and see the check:
if (acpi_gbl_root_table_list.flags & ACPI_ROOT_ORIGIN_ALLOCATED) {
    ACPI_FREE(acpi_gbl_root_table_list.tables);
}
So after this call, the tables pointer will be set, and this check will make sure that we free the old list each time. Thus, the next time this code gets called, it will pass the checks…but current_table_count will still be 0. Something else must fill in the entries to make the tables have actual values.
It is interesting to note that each time this function gets called, it increments the max table count by 4, and thus the allocated number of entries in the table. This makes me think that we will encounter code that adds an entry and calls the resize if the table is full.
None of the rest of the code loads in the entries. Jumping back to acpi_early_init, we see that the follow-on call after acpi_reallocate_root_table is acpi_initialize_subsystem(), and now it makes sense that the code thus far only provides some space to hold the data; this should be where we first load in the ACPI table entries. Let's skip ahead to the next real call, the OS-specific initialization code.
We see four calls to acpi_os_map_generic_address. Two of them are left intact, and two of them are converted to unsigned longs and assigned to global variables. What is this mapping? It is a wrapper for the function acpi_os_map_iomem, which has this comment:
acpi_os_map_iomem - Get a virtual address for a given physical address range.
@phys: Start of the physical address range to map.
@size: Size of the physical address range to map.

Look up the given physical address range in the list of existing ACPI memory mappings. If found, get a reference to it and return a pointer to it (its virtual address). If not found, map it, add it to that list and return a pointer to it.

During early init (when acpi_permanent_mmap has not been set yet) this routine simply calls __acpi_map_table() to get the job done.
This is not the same virtual memory map as used by the Linux kernel. This is an ACPI-specific mapping. ACPI is a hardware (coprocessor?) subsystem that needs to keep internal pointers and sometimes expose these to the operating system. This mapping allows the two subsystems to point to the same set of data values. Thus, what I think this code is doing is storing, in a local variable, an uninitialized chunk of ACPI memory.
The global variable acpi_gbl_FADT is defined in include/acpi/acpixf.h (external interfaces) using a macro (so the source browser can't jump to it…) and is an instance of a pretty large structure of type acpi_table_fadt. FADT stands for Fixed ACPI Description Table. How do we find this? Let's start a couple of steps back with the Root System Description Pointer. This points to a structure that has, among other things, an entry for the Root System Description Table, which then points to (among other things) the Fixed ACPI Description Table.
But where do we get that first pointer, the thing that points to all the other things…the root of it all? We scan through memory, apparently. We are given a chunk of physical memory to search, defined by the constants
ACPI_EBDA_PTR_LOCATION and ACPI_EBDA_PTR_LENGTH,
which are defined as
#define ACPI_EBDA_PTR_LOCATION 0x0000040E /* Physical Address */
and
#define ACPI_EBDA_PTR_LENGTH 2
But again, we perform an acpi_os_map_memory on these, so they are somehow transformed, perhaps into actual memory locations the Linux kernel can access. Once the memory transformations are complete, we scan the memory: this function checks for the magic string in the right location. There seem to be a couple of ways that hardware can expose this, so the calling function checks them in order. The comments for these sections explain.
- 1a) Get the location of the Extended BIOS Data Area (EBDA)
- 1b) Search EBDA paragraphs (EBDA is required to be a minimum of 1K length)
- 2) Search upper memory: 16-byte boundaries in E0000h-FFFFFh
I’d be interested in seeing what our hardware actually exposes here, and why there are the two options. Maybe git has an explanation? It looks like most of the logic for this function was written in two patches, one in 2005, the next in 2007. There are some later changes too. Let's see what it looked like in 2005…it looks the same. OK, whatever logic is here, it has been around for a while, as the commit with the initial code is based on Linus's reworking of the whole repository to load it into git.
Now that we have the root pointer, what happens with it? We’ll explore this in the next article.
UPDATE: An interesting comment I found says it is done differently on a UEFI system…and I know that we are supporting UEFI.