Android vs. Maemo power management: static vs. dynamic

Some of you might have heard about the proposal by Google’s Android team to introduce wakelocks (aka suspend blockers) to the Linux kernel. While there was a real issue being solved on the kernel side, the benefits on the user-space side were dubious at best, and after a huge discussion, they ultimately didn’t get in.

During these discussions, dynamic and static power management were described and discussed at length, and there was a bit of talk about the Maemo (MeeGo) vs. Android approaches to power management, but there was so much more than that.

Some people have problems with the battery on Android devices, and for some it’s just fine; some people have problems with Maemo, and others don’t. In general, your mileage may vary. But given the extremely different approaches, it’s easy to see in which cases you would have better luck with Maemo, and in which with Android. I do think it’s obvious which approach is superior, but I am biased.

An interesting achievement was shared by Thiago Macieira, who managed to get ‘5 days and a couple of minutes’ out of the Nokia N9 while traveling and actually using it (and let’s remember this is a 1450 mAh battery). Some Android users might have trouble believing this, but hopefully this blog post will explain what’s going on.

So let’s go ahead and explore the two approaches.

Dynamic Power Management

Perhaps the simplest way to imagine dynamic power management is as the gears of a manual transmission car. You go up and down depending on how much power the system actually needs.

In some hardware, such as OMAP chips, it’s possible to have fine-grained control over the power states of quite a lot of devices, plus different operating power points on the different cores. For example, it might be possible to have some devices on and active, such as the display, some other devices partially off, like the speaker, and others completely off, like USB. And based on the status of the whole system, whole blocks can be powered off, others can run at low voltage levels, etc.

Linux has a framework to deal properly with this kind of hardware: runtime power management, which originally came from the embedded world, and a lot of it from OMAP development, but is now available to everyone.

The idea is very simple: devices should sleep as much as possible. This means that if you have a sound device that needs chunks of 100ms, and the system is not doing anything else but playing sound, then most of the devices go to sleep, even the CPU, and the CPU is only woken up when it needs to write data for the audio device. Even better is to configure the sound device for chunks of 1 second, so the system can sleep even more.

Obviously, some co-operation between kernel and user-space is needed. Say, if you have an instant messenger program that needs to do work every minute, and a mail program that is configured to check mail every 10 minutes, you would want them to do work at the same time when they align at every 10 minutes. This is sometimes called IP heartbeat; the system wakes up for a beat, and then immediately goes back to sleep. And there are kernel facilities as well, such as range timers.
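To make the alignment argument concrete, here is a small Python model (not kernel code) that counts how many distinct wakeups the system needs over an hour for the two periodic tasks mentioned above. The function name count_wakeups and the 30-second offset are made up for illustration, and time is measured in whole seconds.

```python
def count_wakeups(periods, offsets, horizon):
    """Count distinct instants in [0, horizon) where at least one task fires.

    periods and offsets are in seconds; tasks that fire at the same
    instant share a single wakeup.
    """
    instants = set()
    for period, offset in zip(periods, offsets):
        instants.update(range(offset, horizon, period))
    return len(instants)


HOUR = 3600
# Messenger every minute, mail check every 10 minutes, both in phase.
aligned = count_wakeups([60, 600], [0, 0], HOUR)
# Same tasks, but the mail check has drifted 30 seconds out of phase.
unaligned = count_wakeups([60, 600], [0, 30], HOUR)
```

In this model the aligned case needs 60 wakeups per hour and the unaligned one 66: every mail check that does not coincide with a messenger beat costs the system an extra wakeup.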

All this is possible thanks to the very small latencies required for devices to go to sleep and wake up, and to intermediary modes (e.g. on, inactive, retention, off); for example, a device might be able to go to inactive mode in 1ms, retention in 2ms, and off in 5ms (totally invented numbers). Again, the more sleep, the better. Obviously, this is impossible on x86 chips, which have huge latencies (at least right now; it’s something Intel is probably trying hard to improve). All these latencies are known by the runtime pm framework in the kernel, and based on that and the usage, it figures out the lowest power state possible without breaking things.
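As a rough illustration of picking the lowest power state from known latencies, the sketch below chooses the deepest state whose transition latency fits a given budget. It uses the invented latencies from the paragraph above; the names STATES and deepest_state are hypothetical, and this is not how the kernel’s runtime pm framework is actually implemented.

```python
# (state name, transition latency in ms), ordered from shallowest to deepest.
# These are the invented numbers from the text, not real hardware data.
STATES = [("on", 0.0), ("inactive", 1.0), ("retention", 2.0), ("off", 5.0)]


def deepest_state(latency_budget_ms, states=STATES):
    """Return the deepest state whose latency fits within the budget."""
    chosen = states[0][0]
    for name, latency_ms in states:
        if latency_ms <= latency_budget_ms:
            chosen = name
    return chosen
```

With a budget of 3ms this picks ‘retention’; only when the device can tolerate 5ms or more does it go fully ‘off’.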

Note I’m not a power management expert, but you can watch a colleague of mine explain OMAP 3 power management in this video:

Advanced Power Management for OMAP3

And there are plenty more resources.

Update: That was the wrong link, here are the resources.

Static Power Management

Static power management has two modes: on and off. That’s it.

OK, that’s not exactly the case in general, but it is in the Android context: the system is either suspended or active, and it’s easy to know which mode you are in; if the screen is on, it’s active, and if it’s off, it’s suspended (ideally).

There’s really not much more than that. The complexity comes from the problem of when to suspend; you might have turned off the display, but there might be a system service that still needs to do work, so this service grabs a suspend blocker which, as the name suggests, prevents the system from suspending (until the lock is released). This introduces a problem: a rogue program might grab a ‘suspend blocker’ and never release it, which means your phone will never suspend, and thus the battery will drain rather quickly. So, some security mechanisms are introduced to grant permissions selectively to use suspend blockers.
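The mechanism can be modelled in a few lines of Python. This is only a toy sketch of the idea described above: the class SuspendBlockers, its methods, and the holder names are hypothetical and do not correspond to the actual Android API.

```python
class SuspendBlockers:
    """Toy model: the system may suspend only when nobody holds a blocker."""

    def __init__(self):
        self._holders = set()

    def acquire(self, name):
        self._holders.add(name)

    def release(self, name):
        self._holders.discard(name)

    def can_suspend(self, screen_on):
        # Suspend requires the screen to be off and no blockers held.
        return not screen_on and not self._holders


blockers = SuspendBlockers()
blockers.acquire("sync-service")
blockers.acquire("rogue-app")
blockers.release("sync-service")   # the well-behaved service finishes
stuck = not blockers.can_suspend(screen_on=False)  # rogue-app never releases
```

As long as the rogue holder keeps its blocker, the phone stays awake even with the screen off, which is exactly the failure mode the permission mechanism is meant to limit.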

And moreover, Android developers found race conditions in the suspend sequences in certain situations that were rather nasty (e.g. the system tries to suspend at the same time the user clicks a button, and the system never wakes up again), and these were acknowledged as real issues that would happen on all systems (including PCs and servers, albeit rarely, because they don’t suspend so often), and got fixed (or at least they tried).

Versus

First of all, it’s important to know that if you have dynamic pm working perfectly, you reach exactly the same power usage as static pm, so in ideal cases they both behave exactly the same.

The problem is that it’s really hard for dynamic pm to reach that ideal case. In reality, systems are able to sleep for certain periods of time, after which they are woken up, oftentimes unnecessarily, and as I already explained, that’s not good. So the goal of a dynamic pm system is to increase those periods of time as much as possible, thus maximizing battery life. But there’s a point of diminishing returns (read this or this for expert explanations), so if the system manages to sleep 1s on average, there’s really not much more to gain if it sleeps 2s, or even 10s. These goals were quite difficult to achieve in the past (not these exact numbers, which I invented), but not so much any more thanks to several mechanisms that have been introduced and implemented through the years. So it’s fair to say that the sweet spot of dynamic pm has been achieved.
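The diminishing returns can be seen with a back-of-the-envelope model: assume the system draws a constant power while asleep and pays a fixed energy cost per wakeup. Both default values below are invented purely for illustration, as is the function name average_power_mw; real hardware numbers will differ.

```python
def average_power_mw(sleep_interval_s, sleep_power_mw=5.0, wakeup_energy_mj=0.5):
    """Average power when the system wakes up once per sleep_interval_s.

    Each wakeup costs a fixed amount of energy; mJ per second equals mW.
    """
    return sleep_power_mw + wakeup_energy_mj / sleep_interval_s


for interval in (0.1, 1.0, 2.0, 10.0):
    print(f"{interval:5.1f} s -> {average_power_mw(interval):.2f} mW")
```

With these made-up numbers, going from 100ms to 1s of sleep nearly halves the average power, while going from 1s to 10s saves less than a tenth: the per-wakeup cost is already dwarfed by the sleep floor.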

This means that today a system that has been fine-tuned for dynamic pm can reach a decent battery life compared to one that uses static pm in most circumstances. But for some use-cases, say, you leave your phone on your desk and you don’t use it at all, static pm would allow it to stay alive for weeks, or even months, while dynamic pm can’t possibly achieve that any time soon. Hopefully you would agree that nobody cares about those use-cases where you are not actually using the device.

And of course, you only need one application behaving badly and waking up the system constantly, and the battery life is screwed. So in essence, dynamic pm is hard to achieve.

Android developers argued that was one of the main reasons to go for static pm; it’s easier to achieve, especially if you want to support a lot of third party applications (Android Market) without compromising battery life. While this makes sense, I wasn’t convinced by this argument; you still can have one application that is behaving badly (grabbing suspend blockers the whole time), and while permissions should help, the application might still request the permission, and the user grant it (who reads incomprehensible warnings before clicking ‘Yes’ anyway?).

So, both can get screwed by bad apps (although it’s likely that it’s harder in the static pm case, albeit not that much).

But actually, you can have both static and dynamic power management, and in fact, Android does. But that doesn’t mean Android automatically wins; as I explained, the system needs to be fine-tuned for dynamic pm, and that has never been a focus of Android (there are no APIs or frameworks for that, etc.). So, for example, a Nokia N9 phone might be able to sleep 1s on average, while an Android phone sleeps 100ms (when not suspended). This means that when you are actually using the device (the screen is on), chances are a system fine-tuned for dynamic pm (Nokia N9) would consume less battery than an Android device.

That is the main difference between the two. tl;dr: dynamic pm is better for active usage.

So, if Android developers want to improve the battery usage during active usage (which I assume is what the users want), they need to fine-tune the system for dynamic pm, and thus sleep as much as possible, hopefully reaching the sweet spot. But wait a second… If Android is using dynamic pm anyway, and they tune the system to the point of diminishing returns, there is no need for static pm. Right? Well, that’s my thinking, but I didn’t manage to make Android developers realize that in the discussion.

Plus, there are a bunch of other reasons why static pm is not palatable for non-Android systems (aka typical Linux systems), but I won’t go into those details.

Nokia’s bet was on dynamic, Google’s bet was on static, and in the end we agreed to disagree, but I think it’s clear from the outcome in real-world situations who was right–N9 users experiencing more than one day of normal usage, and even more than two. Sadly, only Nokia N9 users would manage to experience dynamic pm in full glory (at the moment).

Upstream

But not all is lost, and this in my opinion is the most important aspect. Dynamic pm lives on in the Linux kernel mainline through the runtime power management API. This is not a Nokia invention that will die with the Nokia N9; it’s a collaborative effort where Nokia, TI, and other companies worked together, and it now benefits not only OMAP, but other embedded chips, and even non-embedded ones. Being upstream means it’s good, it has been blessed by many parties, it has gone through many iterations, and by now it probably doesn’t look like the original OMAP SRF code at all. Sooner or later your phone (if you don’t have an N9) will benefit from this effort, and it might even be an Android phone, your netbook (if not already benefiting in some way), and even your PC.

Android’s suspend blockers are not upstream, and it’s quite unlikely that they will ever be, or that you would see them used in any system other than Android, and there are good reasons for that. Matthew Garrett did an excellent job of summarizing what went wrong with the story of suspend blockers in his presentation ‘Android/Linux kernel: Lessons learned’, but unfortunately the Linux Foundation seems to be doing a poor job of providing those videos (I cannot watch them any more, even though I registered, and they haven’t been helpful through email conversations).

Update: I managed to get the video from the Linux Foundation and pushed it to YouTube:
http://www.youtube.com/watch?v=A4psPP67YMU

Here is part of the discussion on LKML, if you want to read it for some strange reason. WARNING; it’s huge.

A tale of just another Linux kernel bug

As part of a bigger effort to get my Nokia N900 in good shape for development, I decided to track down an issue with the keyboard; I could type ‘a’, but not ‘A’ or any special characters, so no ‘shift’ or ‘ctrl’ or anything special. Trying to figure out what was going on took me through an unexpected journey, which is not remarkable, but I think it’s a good example of what many kernel developers (and low level developers) constantly go through, and as such, might be interesting for some people to read.

Keycodes

So, first things first. I recently had an issue with a PS/2 keyboard on my PC, so I had an idea how to debug this, and the first thing I did was to run showkey, which shows messages like these:

keycode  30 press
keycode  30 release
keycode  30 press
keycode  30 release
keycode  29 press

Then these keys are supposed to be converted to real characters somehow through a key map, and apparently X has a map of its own.

However, I saw keycodes being pressed when I clicked ‘shift’, so I concluded that neither hardware, nor input driver was the problem, could it be the mapping? The MeeGo project provided a mapping file that I have been using for some time, and it clearly shows “keycode 42 = Shift”, and I was getting keycode 42, so the mapping seemed correct, but was it being applied properly? I found out this can be checked by using dumpkeys, and indeed, the mapping was correct.

Everything was fine up to this level. Next.

Note: all these tools (showkey, loadkeys, and dumpkeys) are provided by the ‘kbd’ package.

Virtual terminal

Now I stumbled into a problem; I’ve no idea what it is that I am interacting with on the framebuffer console (I’m not using X). So, I try to fill that gap in my knowledge by going to the ##linux IRC channel on freenode.net and asking the question: what is it that converts the keycodes to characters in a framebuffer console? As is typical when I ask these sorts of tricky questions, I get useless responses, like, ‘what distribution are you using?’ I knew that didn’t matter, so I set out to investigate myself.

I thought, well, what is this thing that I have to add in my inittab to get the console working? 1:2345:respawn:/sbin/getty 9600 tty1. I tried to read documentation about getty and try different options, like try to specify vt100 as an argument, but to no avail.

Maybe it’s done in the kernel? I thought. So I quickly went through my kernel config, and I find this gem:

CONFIG_VT

If you say Y here, you will get support for terminal devices with
display and keyboard devices. These are called "virtual" because you
can run several virtual terminals (also called virtual consoles) on
one physical terminal. This is rather useful, for example one
virtual terminal can collect system messages and warnings, another
one can be used for a text-mode user session, and a third could run
an X session, all in parallel. Switching between virtual terminals
is done with certain key combinations, usually Alt-<function key>.

The setterm command ("man setterm") can be used to change the
properties (such as colors or beeping) of a virtual terminal. The
man page console_codes(4) ("man console_codes") contains the special
character sequences that can be used to change those properties
directly. The fonts used on virtual terminals can be changed with
the setfont ("man setfont") command and the key bindings are defined
with the loadkeys ("man loadkeys") command.

You need at least one virtual terminal device in order to make use
of your keyboard and monitor. Therefore, only people configuring an
embedded system would want to say N here in order to save some
memory; the only way to log into such a system is then via a serial
or network connection.

If unsure, say Y, or else you won't be able to do much with your new
shiny Linux system :-)

That makes sense; now we are getting somewhere. So, in the past, there was this notion of ‘terminals’, which are specialized pieces of hardware that send and receive ASCII (or some codes), so they are the ones that have all the needed stuff to control the keyboard, screen, and so on. I never had the need to use one of these terminals, therefore I never really knew what a “virtual” terminal was. So a virtual terminal does the job of a real terminal; it needs an input driver and a display driver, and really puts them to work doing something useful.

Interesting theory, but is it true? I quickly looked at the code inside drivers/tty/vt, and I found code to control the keyboard, screen, and also that contains the keyboard mappings. Excellent! So we found the thing that uses these keyboard mappings. Now what?

Before moving on, the fact that this code resides in tty is also helpful; basically, serial consoles, virtual terminals (framebuffer console), and real terminals all operate through ttys (teletypewriters), which are means of communication between hosts and these “devices”. So, getty really gets a tty to be used by any of these.

All right, so now the knowledge gap seems to be filled, what next? Well, clearly there’s something wrong with this virtual terminal, but what? I immediately started looking at the code in drivers/tty/vt/keyboard.c, and I noticed something interesting: kbdmode can have different values, like RAW, and in this mode, certain things are not handled, like ‘shift’. That looked promising, but how to change that mode?

I looked for tools to control the virtual terminal. I found an interesting one, setterm, which was not available in my minimal system; I couldn’t find how to get it, and anyway it didn’t have any option that I wanted. terminfo and such looked interesting, but didn’t seem relevant to the issue at hand. Then I found kbd_mode, which obviously did what I wanted, and I already had it :). I couldn’t even type kbd_mode on my keyboard, so I had to write shortcuts on my PC (kbda, kbdb, etc.). Unfortunately, I found out that initially the mode was set to ‘unicode’, which seemed OK, and changing to ‘ascii’ didn’t change anything.

So, it didn’t seem to be a user-space configuration of any sort. Next.

Getting our hands dirty

Time to actually type some code. I modified the code in ‘drivers/tty/vt/keyboard.c’, especially in kbd_keycode(), to find out the true kbdmode, the keycodes coming down, and how they were being interpreted.

I quickly found out that the keycodes and mode were indeed correct, but each and every key press was immediately followed by a key release, so shift+a was interpreted as shift, a. Now we are getting somewhere; the problem has to be in the input driver.
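Why an immediate release breaks modifiers is easy to see with a tiny model of a keymap that tracks the Shift state. This is an illustrative Python sketch, not the logic in keyboard.c; the function translate and the event lists are made up, and only keycodes 30 (the ‘a’ key) and 42 (Shift) are taken from the output and mapping mentioned earlier.

```python
KEY_A = 30          # keycode reported when pressing the 'a' key
KEY_LEFTSHIFT = 42  # keycode mapped to Shift


def translate(events):
    """Turn (keycode, pressed) events into text, tracking the Shift state."""
    shift_held = False
    out = []
    for keycode, pressed in events:
        if keycode == KEY_LEFTSHIFT:
            shift_held = pressed
        elif keycode == KEY_A and pressed:
            out.append("A" if shift_held else "a")
    return "".join(out)


# What the keyboard should report for Shift+a.
expected = [(42, True), (30, True), (30, False), (42, False)]
# What the broken driver reported: every press is followed by a release.
observed = [(42, True), (42, False), (30, True), (30, False)]
```

Here translate(expected) gives ‘A’, while translate(observed) gives ‘a’, because Shift has already been released by the time the ‘a’ press arrives.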

Maybe I chose the wrong one, or maybe I’m missing some configurations. I see some CONFIG_KEYBOARD_GPIO, and CONFIG_KEYBOARD_TWL4030, and it looks like I should be using TWL4030, as that’s the chip the Nokia N900 has, but I’m not sure, so time to look at the N900 schematics. Well, it seemed like TWL4030 is indeed the right one, and there’s nothing to it; either you have it or not.

Maybe some recent change broke it… But there’s nothing recent that I can see that could do that. So it’s time to take a look at the actual code: drivers/input/keyboard/twl4030_keypad.c. After adding a few prints here and there I realize the problem starts with this code: ret = twl4030_kpread(kp, &reg, KEYP_ISR1, 1), which reads 0 right after a key press (which reads 1). So, time to read the TRM.

It took me some time to find the right document, and then understand what all the configuration options were actually doing (more or less), and then play around with them. After making a lot of more or less random changes I notice no difference in this particular problem of getting an extra ‘0’. So I think to myself; maybe the problem is the interrupt.

So I abandon the configuration of the keyboard, and look at the code to request the interrupt:
request_threaded_irq(kp->irq, NULL, do_kp_irq, 0, pdev->name, kp);

I had some interrupt issues before (in fact on this very keyboard), so I knew a few hacks I could try, like specifying the IRQF_ONESHOT flag. That didn’t help, so I tried to do that in the parent interrupt on ‘drivers/mfd/twl4030-irq.c’, because I saw a patch from Neil Brown on the linux-omap mailing list that fixed another issue, but that didn’t help either. Then I realized I saw one patch that affected this interrupt request (see here). So I revert the patch, and voilà, no more 0’s afterwards, and the keyboard works properly.

However, there’s a nasty warning about interrupts being enabled, which is probably the reason why the original patch happened, so I try a few random things to get rid of it, but nothing helps. So then I wonder, maybe the reason the keyboard driver worked before that patch is just pure luck, and these extra interrupts were not being detected properly.

Enough fooling around

Since I really want to fix this issue properly, I push myself to really understand what’s going on in the driver. So I slowly read all the documentation, and all the registers, and try to set different values to see what’s going on. While doing that, I noticed one function was calculating the times wrongly, and was telling the driver to use values twice as big as originally intended. It took me some time to figure out why the author chose 31 << (x + 1) instead of 2 << (x + 1) * 31, which is what a direct function conversion would return, until I realized that a shift basically means multiplying by two, and x << 1 is basically the same as (2 << 0) * x, but the author missed that x = ptv + 1, and so it should be 31 << (ptv + 2). Anyway, after being confident of these timeout values, I could set big timeouts on the range of seconds without overflowing the registers and see exactly what they were doing in timescales I could notice.
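The shift arithmetic is easy to check mechanically. The snippet below only verifies the identities used in the paragraph above; the helper period_divisor is a made-up name and the range of ptv values is arbitrary, so this says nothing about the actual register width or the driver’s macro.

```python
def period_divisor(ptv, extra_shift):
    """31 << (ptv + extra_shift): the two candidate expressions from the text."""
    return 31 << (ptv + extra_shift)


for ptv in range(8):  # range chosen arbitrarily for the check
    x = ptv + 1
    # A shift by one is a multiplication by two.
    assert x << 1 == (2 << 0) * x
    # Substituting x = ptv + 1 turns 31 << (x + 1) into 31 << (ptv + 2).
    assert 31 << (x + 1) == 31 << (ptv + 2)
    # The two candidates differ by exactly a factor of two.
    assert period_divisor(ptv, 2) == 2 * period_divisor(ptv, 1)
```

An off-by-one in the shift count is therefore exactly a factor of two in the resulting value, which matches the ‘twice as big’ symptom.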

So yeah, these registers were doing what I thought they should be doing. Nice try.

Time to go back to the interrupt handling. After reading the code of the twl4030-irq, which is supposed to fire the irqs that the keyboard driver eventually gets, and then reading some kernel core code as well, it was not really clear to me how these were all woven together, so I added some prints.

Before going forward I’ll briefly explain a bit what TWL4030 is. To my understanding it’s basically an integrated circuit that has many functions, one of them is having a keyboard controller. So there is no dedicated IRQ for the keyboard interrupt, but TWL4030 has a level interrupt, and then the right module IRQ is demultiplexed. OK.

I was hoping there was a chain of actions like pih -> sih (from twl-core) -> keyboard, but no, there was only one action in the chain. Fortunately when I reverted the patch I got a warning with a backtrace where I could see something like twl4030_irq_thread->generic_handle_irq->handle_simple_irq. It was really straightforward, but I couldn’t see why. There was a lot of code in twl4030-irq for a “secondary interrupt handler” but it didn’t seem to be called at all, and I didn’t see how generic_handle_irq was calling handle_simple_irq.

Time to step back for a second. I remembered a patch from Neil Brown trying to fix something regarding how this sih stuff was called (here) because it was not called at all, but these patches are supposed to be fixes for other patches that are not applied at this point, so they really wouldn’t help.

At this point I’m pretty much stuck, as I’ve no idea how all this code is connected, you might think that all this is way over my head, and it might be, even if just a little, but that has never deterred me. I know that as long as I have the code, and I have a way to run it, I can figure out how it works, so I just keep trying.

After reading the code more carefully, I noticed that only some twl “modules” had a sih setup, and the keyboard was not one of them, so that explained the sih part. And then I noticed a part of the irq handling that dealt with “chips”, and one interesting function that sets up the chip’s “handler”. Wait a second… So there’s a chip handler, and a client handler; a few keystrokes later I find this gem: irq_set_chip_and_handler(i, &twl4030_irq_chip, handle_simple_irq). Finally! So I can now see the whole chain of events, but alas, I still have no idea why things are failing.

I thought maybe some states were miscalculated, so I found a function that is used to print everything related to an irq, but it was hidden in some internal code, so I had to do a few tricks to use it in the keyboard driver. All the states seemed to be OK, except one… the irq count, which showed something like 0, 0, 0, 1, 2, 2, 3 where it should have shown 0, 0, 1, 1, 2, 2. I looked at all the places where this counter is modified, and I found an interesting function: handle_nested_irq, so I replaced handle_simple_irq just for fun, and…

BAM!

Everything worked perfectly.

Why, Why, Why

So, now I found a fix, but why does it work? If I want to fix this in a way that other people will not get bitten, I have to find the proper fix, so “This seems to make things work here” usually doesn’t convince many kernel hackers.

Using handle_nested_irq makes the compiler throw a warning right away, because the number of arguments is different, so I looked at how other people use this function, and it’s indeed very different; it looks like people are calling handle_nested_irq directly, instead of generic_handle_irq, which would eventually call handle_simple_irq, because that’s the chip’s configured handler. It all boils down to the difference between handle_nested_irq and handle_simple_irq, and looking at them it’s clear that handle_nested_irq does much less; specifically, it directly calls the action->thread_fn() callback, which is the keyboard interrupt handler, rather than waking up the thread, which runs the same function (but in a specialized thread).

And this finally sheds some light: the reason why the interrupt counter is wrong is that the thread is woken after the irq has been processed, so the code that is supposed to clear the interrupt hasn’t been called yet, and a second, spurious interrupt is generated. The relevant code is supposed to read a register, and the hardware clears the interrupt on a read (or write) operation. The reason handle_nested_irq works is that it doesn’t bother with the interrupt thread at all; it calls the function right away on the same thread, and thus the interrupt is marked as handled at the right time (after it has been really handled).
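The timing problem can be reproduced with a toy model. The Python sketch below is a loose simulation of the explanation above, not of the kernel’s actual irq code: the Keypad class, the dispatch function, and the assumption that the parent looks at the interrupt line exactly twice (max_checks=2) are all invented for illustration.

```python
class Keypad:
    """Toy device: the interrupt stays asserted until its status is read."""

    def __init__(self):
        self.pending = False

    def press(self):
        self.pending = True

    def read_status(self):
        # Reading the status register clears the interrupt.
        was_pending, self.pending = self.pending, False
        return int(was_pending)


def dispatch(keypad, inline, max_checks=2):
    """Parent handler: deliver the child interrupt while the line is asserted.

    inline=True runs the child handler immediately (the nested style);
    inline=False only queues it, as if waking a separate thread.
    Returns the status values the child handler ends up reading.
    """
    reads, queued = [], 0
    for _ in range(max_checks):
        if not keypad.pending:
            break
        if inline:
            reads.append(keypad.read_status())
        else:
            queued += 1
    # Deferred handlers run only after the parent has finished.
    reads.extend(keypad.read_status() for _ in range(queued))
    return reads


def one_key_press(inline):
    keypad = Keypad()
    keypad.press()
    return dispatch(keypad, inline)
```

In this model one_key_press(inline=True) reads [1], while one_key_press(inline=False) reads [1, 0]: the deferred handler has not cleared the interrupt by the time the parent looks again, so it gets delivered twice and the second read returns the extra 0.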

And this also explains why other people call handle_nested_irq instead of generic_handle_irq; you are supposed to call it inside an interrupt thread (which is configured with request_threaded_irq), but there’s a thread lying around that nobody uses. Indeed, Neil Brown noticed the same (here), so irq_set_nested_thread() will make sure that no extra thread is created.

So both irq handlers must be threaded, the one for the twl-core, and the one for the keyboard, not just the one for the keyboard like right now. Fortunately Felipe Balbi already fixed that (see this patch series), but that broke things badly, and then Neil Brown fixed them (see this). Interestingly enough, I had already tested those patches for other reasons, but at this point I wasn’t sure if they would fix the keyboard problems; I just assumed they would anyway.

Good good, it looks like we finally have the whole picture, but the job’s not done yet.

Getting it “done”

Now the question is: how to fix this? One limitation of the ‘stable’ kernel trees (at least AFAIK), is that the patches should be in the latest Linus’s tree, so whatever solution is found, should be in the main tree.

Ideally, all the stack of irq’s should be reorganized to use the nested/threaded API, or none at all. So this patch, should have been applied only after this patch series, and not two years beforehand. I pondered for quite some time whether a middle-ground could be found, either by workarounds in the core irq handler, or in the keyboard one, eventually I came to the conclusion that since the core is supposed to set irq_set_nested_thread() anyway (just trust me), the keyboard one is the one that should check if it’s nested or not, and either set the thread handler, or a normal one, and this code should work before and after the patches applied for 3.2. Unfortunately, the ‘nested’ flag is supposed to be internal to the kernel’s thread handling code.

So there’s really no way around it, either the original patch is reverted (and other possible similar ones for other twl modules), or all the twl core irq handler gets the new code for 3.2 backported, and that code isn’t even fixed yet (as of v3.2-rc5) (and maybe it will not be the way things are progressing).

Then the question arises; is all that new code on v3.2 working properly? I went ahead and tried it, and lo and behold; the keyboard works as expected. I then backported all those patches to v3.1, and they all worked too.

But does it really make sense to apply all those patches to all the kernels after v2.6.33? Or does it make sense to revert a few of them? That’s not for me to decide, so after doing all that work, I asked the community in this mail.

Sometimes the result of a week’s work ends up being a one line patch, sometimes it’s just one character, and in this case, there was no patch (at least from me), but at least the issue is clearly identified, as well as the fixes. And that means nobody has to do this work again.

Well, the fun is not over… We still need to synchronize the maintainers, so the right patches land on Linus’ tree for v3.2, and they are back-ported to the relevant linux-stable trees.

Kernel development is hard, let’s go shopping

What are you talking about? This isn’t even kernel development, it’s just some legwork 😉 I’m going to be Topper for a second and say “That’s nothing!”, there’s way more complicated and challenging issues kernel developers confront all the time. I thought this was interesting because of all the steps I had to do, and because it involved things I had no idea about.

Was this worth my time? Well, I learned what is actually a virtual terminal for real, and not some vague notion, like when common people say “My browser? Yeah, I use Google… No?”. I also learned a bit more about the IRQ handling API in the Linux kernel, and BTW, all this threaded interrupt API started because of real issues, and the removal of IRQF_DISABLED (nice LWN article about it here), and that thread was fun to read years ago, even when I didn’t understand most of it, maybe it’s time to read it again 🙂

For me, the best is the satisfaction of knowing that I really “got it”. I mean, I really understand what causes the problem, many of the possible solutions, including hacky and proper ones, and exactly what to do if I get bitten by a similar problem regarding threaded IRQ’s.

And now that the keyboard is fixed, on with the next problem on the N900 (which I stopped working on, because I thought a functional keyboard would help to debug it, as opposed to shutting down, removing the MMC, plugging it into a PC, writing a script with the commands I want to run, unmounting, plugging it back, and booting [I can’t use USB networking for this issue to appear]).