Android vs. Maemo power management: static vs. dynamic

Some of you might have heard about the Google Android team's proposal to introduce wakelocks (a.k.a. suspend blockers) into the Linux kernel. While there was a real issue being solved on the kernel side, the benefits on the user-space side were dubious at best, and after a huge discussion, they ultimately didn't get in.

During these discussions, dynamic and static power management were described and debated at length, and there was a bit of talk about the Maemo (MeeGo) vs. Android approaches to power management, but there was much more to it than that.

Some people have problems with the battery on Android devices, for others it's just fine; some people have problems on Maemo, others don't, so in general your mileage may vary. But given the extremely different approaches, it's easy to see in which cases you would have better luck with Maemo, and in which with Android. Although I do think it's obvious which approach is superior, I am biased.

An interesting achievement was shared by Thiago Macieira, who managed to get ‘5 days and a couple of minutes’ out of the Nokia N9 while traveling, and actually using it (and let's remember this is a 1450 mAh battery). Some Android users might have trouble believing this, but hopefully this blog post will explain what's going on.

So let's go ahead and explore the two approaches.

Dynamic Power Management

Perhaps the simplest way to picture dynamic power management is the gears of a manual-transmission car: you shift up and down depending on how much power the system actually needs.

On some hardware, such as OMAP chips, it's possible to have fine-grained control over the power states of a great many devices, plus different operating power points for the different cores. For example, some devices might be on and active (such as the display), others partially off (like the speaker), and others completely off (like USB). Based on the status of the whole system, entire blocks can be powered off, others run at lower voltage levels, and so on.

Linux has a framework to deal properly with this kind of hardware: runtime power management. It originally came from the embedded world, and to a large extent from OMAP development, but it is now available to everyone.

The idea is very simple: devices should sleep as much as possible. This means that if you have a sound device that needs chunks of 100 ms, and the system is not doing anything else but playing sound, then most of the devices go to sleep, even the CPU, and the CPU is only woken up when it needs to write data for the audio device. Even better is to configure the sound device for chunks of 1 second, so the system can sleep even more.
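To make that concrete, here is a bare-bones sketch of what the driver side of runtime PM looks like. It's a hypothetical driver, not code from the kernel tree: the 100 ms delay is made up, and the runtime_suspend/runtime_resume callbacks that actually power the hardware down are left out.

/* sketch of the runtime PM usage pattern in a hypothetical driver */
#include <linux/pm_runtime.h>
#include <linux/platform_device.h>

static int foo_probe(struct platform_device *pdev)
{
        /* let the device drop back to sleep after 100 ms of idleness */
        pm_runtime_set_autosuspend_delay(&pdev->dev, 100);
        pm_runtime_use_autosuspend(&pdev->dev);
        pm_runtime_enable(&pdev->dev);
        return 0;
}

static void foo_do_work(struct platform_device *pdev)
{
        pm_runtime_get_sync(&pdev->dev);        /* wake the device up */
        /* ... touch the hardware ... */
        pm_runtime_mark_last_busy(&pdev->dev);
        pm_runtime_put_autosuspend(&pdev->dev); /* let it sleep again soon */
}

The actual powering down happens in the driver's (or bus's) runtime_suspend callback, which the framework calls once the device has been idle long enough.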

Obviously, some co-operation between the kernel and user space is needed. Say you have an instant-messenger program that needs to do work every minute, and a mail program configured to check mail every 10 minutes: you would want them to do their work at the same time, when they align every 10 minutes. This is sometimes called IP heartbeat: the system wakes up for a beat and then immediately goes back to sleep. There are kernel facilities for this as well, such as range timers.
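One of those facilities can be used directly from user space. This is a minimal sketch assuming the PR_SET_TIMERSLACK prctl (which lets the kernel defer a thread's timers so its wakeups can be batched with others), not any Maemo-specific heartbeat API.

/* sketch: opt a periodic task into wakeup coalescing */
#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
        /* allow this thread's timers to fire up to 100 ms late,
         * so the kernel can group its wakeups with other ones */
        prctl(PR_SET_TIMERSLACK, 100UL * 1000 * 1000, 0, 0, 0);

        for (;;) {
                /* ... check mail, ping the IM server, etc. ... */
                sleep(600);
        }
        return 0;
}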

All this is possible thanks to the very small latencies required for devices to go to sleep and wake up, and to intermediate modes (e.g. on, inactive, retention, off); so, for example, a device might be able to go into inactive mode in 1 ms, retention in 2 ms, and off in 5 ms (totally invented numbers). Again, the more sleep, the better. Obviously, this is impossible on x86 chips, which have huge latencies (at least right now; it's something Intel is surely trying hard to improve). All these latencies are known to the runtime PM framework in the kernel, and based on them and the current usage, it figures out the lowest power state possible without breaking anything.
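User space can also feed constraints into that decision. Here is a sketch assuming the generic PM QoS /dev/cpu_dma_latency interface (not anything Maemo-specific): writing a latency bound in microseconds and keeping the file descriptor open tells the kernel not to pick sleep states that take longer than that to exit.

/* sketch: hold a CPU wakeup-latency constraint via PM QoS */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
        int32_t max_exit_latency_us = 100; /* don't use states slower than this */
        int fd = open("/dev/cpu_dma_latency", O_WRONLY);

        if (fd < 0)
                return 1;
        write(fd, &max_exit_latency_us, sizeof(max_exit_latency_us));
        /* the constraint holds for as long as the fd stays open */
        pause();
        close(fd);
        return 0;
}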

Note that I'm not a power-management expert, but you can watch a colleague of mine explain OMAP 3 power management in this video:

Advanced Power Management for OMAP3

And there are plenty more resources.

Update: That was the wrong link, here are the resources.

Static Power Management

Static power management has two modes: on and off. That’s it.

OK, that's not exactly the case in general, but it is in the Android context: the system is either suspended or active, and it's easy to know which mode you are in; if the screen is on, it's active, and if it's off, it's suspended (ideally).

There's really not much more to it than that. The complexity comes from the problem of when to suspend; you might have turned off the display, but there might be a system service that still needs to do work, so this service grabs a suspend blocker, which, as the name suggests, prevents the system from suspending (until the lock is released). This introduces a problem: a rogue program might grab a suspend blocker and never release it, which means your phone will never suspend, and thus the battery will drain rather quickly. So security mechanisms are introduced to selectively grant permission to use suspend blockers.
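For reference, this is roughly what grabbing and releasing a suspend blocker looks like from user space on kernels that carry the out-of-tree wakelock patches; it's a sketch of the sysfs interface, not the Android framework's Java API.

/* sketch: hold/release a named wakelock via the Android sysfs interface */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void wake_lock(const char *name, int acquire)
{
        const char *path = acquire ? "/sys/power/wake_lock"
                                   : "/sys/power/wake_unlock";
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return;
        write(fd, name, strlen(name)); /* system can't suspend while held */
        close(fd);
}

/* usage: wake_lock("sync-mail", 1); ... do work ...; wake_lock("sync-mail", 0); */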

Moreover, Android developers found rather nasty race conditions in the suspend sequence in certain situations (e.g. the system tries to suspend at the same time the user clicks a button, and the system never wakes up again). These were acknowledged as real issues that could happen on all systems (including PCs and servers, albeit rarely, because they don't suspend so often), and they got fixed (or at least people tried).

Versus

First of all, it's important to know that if you have dynamic pm working perfectly, you reach exactly the same power usage as with static pm, so in the ideal case they both behave exactly the same.

The problem is that it's really hard for dynamic pm to reach that ideal case. In reality, systems are able to sleep for certain periods of time, after which they are woken up, oftentimes unnecessarily, and as I already explained, that's not good. So the goal of a dynamic pm system is to increase those periods as much as possible, thus maximizing battery life. But there's a point of diminishing returns (read this or this for expert explanations): if the system manages to sleep 1 s on average, there's really not much more to gain by sleeping 2 s, or even 10 s. Such goals were quite difficult to achieve in the past (not these exact ones, I invented those numbers), but not so much any more, thanks to several mechanisms that have been introduced and implemented over the years. So it's fair to say that the sweet spot of dynamic pm has been reached.

This means that today a system that has been fine-tuned for dynamic pm can reach decent battery life compared to one that uses static pm in most circumstances. But for some use cases, say, you leave your phone on your desk and don't use it at all, static pm would allow it to stay alive for weeks, or even months, while dynamic pm can't possibly achieve that any time soon. Hopefully you will agree that nobody cares much about the use cases where you are not actually using the device.

And of course, you only need one application behaving badly and waking up the system constantly, and the battery life is screwed. So in essence, dynamic pm is hard to achieve.

Android developers argued that this was one of the main reasons to go for static pm: it's easier to achieve, especially if you want to support a lot of third-party applications (Android Market) without compromising battery life. While this makes sense, I wasn't convinced by the argument; you can still have one application that behaves badly (grabbing suspend blockers the whole time), and while permissions should help, the application might still request the permission and the user grant it (who reads incomprehensible warnings before clicking 'Yes' anyway?).

So, both can get screwed by bad apps (although it's probably harder in the static pm case, albeit not by much).

But actually, you can have both static and dynamic power management, and in fact Android does. That doesn't mean Android automatically wins: as I explained, the system needs to be fine-tuned for dynamic pm, and that has never been a focus of Android (there are no APIs or frameworks for it, etc.). So, for example, a Nokia N9 might be able to sleep 1 s on average, while an Android phone sleeps 100 ms (when not suspended). This means that when you are actually using the device (the screen is on), chances are a system fine-tuned for dynamic pm (the Nokia N9) will consume less battery than an Android device.

That is the main difference between the two. tl;dr: dynamic pm is better for active usage.

So, if Android developers want to improve battery usage during active use (which I assume is what users want), they need to fine-tune the system for dynamic pm, and thus sleep as much as possible, hopefully reaching the sweet spot. But wait a second… If Android is using dynamic pm anyway, and they tune the system to the point of diminishing returns, there is no need for static pm. Right? Well, that's my thinking, but I didn't manage to make Android developers see that in the discussion.

Plus, there's a bunch of other reasons why static pm is not palatable for non-Android systems (a.k.a. typical Linux systems), but I won't go into those details.

Nokia's bet was on dynamic, Google's bet was on static, and in the end we agreed to disagree, but I think it's clear from the outcome in real-world situations who was right: N9 users experience more than one day of normal usage, and even more than two. Sadly, only Nokia N9 users get to experience dynamic pm in full glory (at the moment).

Upstream

But not all is lost, and this, in my opinion, is the most important aspect: dynamic pm lives on in the Linux kernel mainline through the runtime power management API. This is not a Nokia invention that will die with the Nokia N9; it's a collaborative effort where Nokia, TI, and other companies worked together, and it now benefits not only OMAP but other embedded chips, and even non-embedded ones. Being upstream means it's good: it has been blessed by many parties and has gone through many iterations, to the point that it probably doesn't look like the original OMAP SRF code at all. Sooner or later your phone (if you don't have an N9) will benefit from this effort, and it might even be an Android phone, your netbook (if it's not already benefiting in some way), and even your PC.

Android's suspend blockers are not upstream, and it's quite unlikely they ever will be, or that you will see them used in any system other than Android, and there are good reasons for that. Matthew Garrett did an excellent job of summarizing what went wrong in the suspend-blocker story in his presentation ‘Android/Linux kernel: Lessons learned’, but unfortunately the Linux Foundation seems to be doing a poor job of providing those videos (I cannot watch them any more, even though I registered, and they haven't been helpful over email).

Update: I managed to get the video from the Linux Foundation and pushed it to YouTube:
[youtube: http://www.youtube.com/watch?v=A4psPP67YMU]

Here is part of the discussion on LKML, if you want to read it for some strange reason. WARNING: it's huge.


A tale of just another Linux kernel bug

As part of a bigger effort to get my Nokia N900 into good shape for development, I decided to track down an issue with the keyboard: I could type 'a', but not 'A' or any special characters, so no 'shift' or 'ctrl' or anything special. Trying to figure out what was going on took me on an unexpected journey, which is not remarkable in itself, but I think it's a good example of what many kernel developers (and low-level developers) constantly go through, and as such it might be interesting for some people to read.

Keycodes

So, first things first. I recently had an issue with a PS/2 keyboard on my PC, so I had an idea of how to debug this, and the first thing I did was run showkey, which shows messages like these:

keycode  30 press
keycode  30 release
keycode  30 press
keycode  30 release
keycode  29 press

Then these keycodes are supposed to be converted to real characters somehow through a keymap, and apparently X has a map of its own.

However, I saw keycodes being generated when I pressed 'shift', so I concluded that neither the hardware nor the input driver was the problem; could it be the mapping? The MeeGo project provided a mapping file that I have been using for some time, and it clearly shows "keycode 42 = Shift", and I was getting keycode 42, so the mapping seemed correct, but was it being applied properly? I found out this can be checked with dumpkeys, and indeed, the mapping was applied correctly.

Everything was fine up to this level. Next.

Note: all these tools (showkey, loadkeys, and dumpkeys) are provided by the ‘kbd’ package.

Virtual terminal

Now I stumbled onto a problem: I had no idea what it is that I am interacting with on the framebuffer console (I'm not using X). So I try to fill that gap in my knowledge by going to the ##linux IRC channel on freenode.net and asking the question: what is it that converts the keycodes to characters on a framebuffer console? As is typical when I ask these sorts of tricky questions, I get useless responses, like 'what distribution are you using?' I knew that didn't matter, so I set out to investigate myself.

I thought, well, what is this thing that I have to add to my inittab to get the console working? 1:2345:respawn:/sbin/getty 9600 tty1. I read documentation about getty and tried different options, like specifying vt100 as an argument, but to no avail.

Maybe it's done in the kernel? I thought. So I quickly went through my kernel config and found this gem:

CONFIG_VT

If you say Y here, you will get support for terminal devices with
display and keyboard devices. These are called "virtual" because you
can run several virtual terminals (also called virtual consoles) on
one physical terminal. This is rather useful, for example one
virtual terminal can collect system messages and warnings, another
one can be used for a text-mode user session, and a third could run
an X session, all in parallel. Switching between virtual terminals
is done with certain key combinations, usually Alt-<function key>.

The setterm command ("man setterm") can be used to change the
properties (such as colors or beeping) of a virtual terminal. The
man page console_codes(4) ("man console_codes") contains the special
character sequences that can be used to change those properties
directly. The fonts used on virtual terminals can be changed with
the setfont ("man setfont") command and the key bindings are defined
with the loadkeys ("man loadkeys") command.

You need at least one virtual terminal device in order to make use
of your keyboard and monitor. Therefore, only people configuring an
embedded system would want to say N here in order to save some
memory; the only way to log into such a system is then via a serial
or network connection.

If unsure, say Y, or else you won't be able to do much with your new
shiny Linux system :-)

That makes sense; now we are getting somewhere. So, in the past, there was this notion of 'terminals', which were specialized pieces of hardware that send and receive ASCII (or some other codes), so they are the ones that have all the machinery needed to control the keyboard, screen, and so on. I never had the need to use one of these terminals, therefore I never really knew what a "virtual" terminal was. So a virtual terminal does the job of a real terminal: it needs an input driver and a display driver, and it puts them to some useful work.

Interesting theory, but is it true? I quickly looked at the code inside drivers/tty/vt, and I found code that controls the keyboard and the screen, and that also contains the keyboard mappings. Excellent! So we found the thing that uses these keyboard mappings. Now what?

Before moving on: the fact that this code resides under tty is also helpful. Basically, the serial console, virtual terminals (framebuffer console), and real terminals all operate through ttys (teleprinters), which are the means of communication between hosts and these "devices". So getty really gets a tty to be used by any of these.

All right, so now the knowledge gap seems to be filled; what next? Well, clearly there's something wrong with this virtual terminal, but what? I immediately started looking at the code in drivers/tty/vt/keyboard.c, and I noticed something interesting: kbdmode can have different values, like RAW, and in that mode certain things are not handled, like 'shift'. That looked promising, but how do I change that mode?

I looked for tools to control the virtual terminal and found an interesting one, setterm, which was not available on my minimal system; I couldn't find how to get it, and anyway it didn't have any option I wanted. terminfo and such looked interesting, but didn't seem relevant to the issue at hand. Then I found kbd_mode, which obviously did what I wanted, and I already had it :). I couldn't even type kbd_mode on my keyboard, so I had to write shortcut scripts on my PC (kbda, kbdb, etc.). Unfortunately, I found out that the mode was initially set to 'unicode', which seemed OK, and changing it to 'ascii' didn't change anything.
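For the curious, this is roughly what kbd_mode does under the hood; a small sketch using the console KDGKBMODE/KDSKBMODE ioctls (the tty device path may differ on your system).

/* sketch: query the console keyboard mode, as kbd_mode does */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kd.h>

int main(void)
{
        long mode;
        int fd = open("/dev/tty0", O_RDONLY);

        if (fd < 0)
                return 1;
        ioctl(fd, KDGKBMODE, &mode);
        printf("mode: %s\n", mode == K_RAW ? "raw" :
                             mode == K_XLATE ? "xlate (ascii)" :
                             mode == K_UNICODE ? "unicode" : "other");
        /* ioctl(fd, KDSKBMODE, K_XLATE); would switch it */
        close(fd);
        return 0;
}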

So, it didn’t seem to be a user-space configuration of any sort. Next.

Getting our hands dirty

Time to actually type some code. I modified the code in ‘drivers/tty/vt/keyboard.c’, specifically in kbd_keycode(), to find out the actual kbdmode, the keycodes coming down, and how they were being interpreted.

I quickly found out that the keycodes and the mode were indeed correct, but each and every key press was immediately followed by a key release, so shift+a was interpreted as shift, a. Now we are getting somewhere; the problem has to be in the input driver.
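For illustration, this is roughly the kind of throwaway print I mean, dropped into kbd_keycode(); it's hypothetical, not the exact lines I added, and the names follow kbd_keycode()'s parameters and the kbd->kbdmode field mentioned earlier.

/* inside kbd_keycode(): dump what the virtual terminal actually sees */
printk(KERN_DEBUG "kbd: keycode=%u down=%d mode=%d\n",
       keycode, down, kbd->kbdmode);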

Maybe I chose the wrong one, or maybe I'm missing some configuration. I see CONFIG_KEYBOARD_GPIO and CONFIG_KEYBOARD_TWL4030, and it looks like I should be using TWL4030, as that's the chip the Nokia N900 has, but I'm not sure, so time to look at the N900 schematics. Well, it seems TWL4030 is indeed the right one, and there's nothing more to it; either you have it enabled or not.

Maybe some recent change broke it… But there's nothing recent that I can see that could do that. So it's time to take a look at the actual code: drivers/input/keyboard/twl4030_keypad.c. After adding a few prints here and there, I realized the problem starts with this code: ret = twl4030_kpread(kp, &reg, KEYP_ISR1, 1), which returns 0 right after a key press (which returns 1). So, time to read the TRM.

It took me some time to find the right document, then to understand what all the configuration options were actually doing (more or less), and then to play around with them. After making a lot of more or less random changes, I noticed no difference in this particular problem of getting an extra '0'. So I think to myself: maybe the problem is the interrupt.

So I abandon the configuration of the keyboard, and look at the code to request the interrupt:
request_threaded_irq(kp->irq, NULL, do_kp_irq, 0, pdev->name, kp);

I had had interrupt issues before (in fact with this very keyboard), so I knew a few hacks I could try, like specifying the IRQF_ONESHOT flag. That didn't help, so I tried the same in the parent interrupt in ‘drivers/mfd/twl4030-irq.c’, because I had seen a patch from Neil Brown on the linux-omap mailing list that fixed another issue, but that didn't help either. Then I remembered I had seen one patch that affected this interrupt request (see here). So I reverted the patch, and voilà: no more 0's afterwards, and the keyboard works properly.

However, there’s a nasty warning about interrupts being enabled, which is probably the reason why the original patch happened, so I try a few random things to get rid of it, but nothing helps. So then I wonder, maybe the reason the keyboard driver worked before that patch is just pure luck, and these extra interrupts were not being detected properly.

Enough fooling around

Since I really wanted to fix this issue properly, I pushed myself to really understand what's going on in the driver. So I slowly read all the documentation and all the register descriptions, and tried setting different values to see what happens. While doing that, I noticed one function was calculating the timeouts wrongly, telling the driver to use values twice as big as originally intended. It took me some time to figure out why the author chose 31 << (x + 1) instead of 2 << (x + 1) * 31, which is what a direct conversion of the formula would give, until I realized that a shift basically means multiplying by two, and x << 1 is basically the same as (2 << 0) * x, but the author missed that x = ptv + 1, and so it should be 31 << (ptv + 2). Anyway, once I was confident about these timeout values, I could set big timeouts in the range of seconds without overflowing the registers, and see exactly what they were doing on timescales I could observe.

So yeah, these registers were doing what I thought they should be doing. Nice try.

Time to go back to the interrupt handling. After reading the code of twl4030-irq, which is supposed to fire the irqs that the keyboard driver eventually gets, and then reading some kernel core code as well, it was not really clear to me how these were all woven together, so I added some prints.

Before going forward, I'll briefly explain what the TWL4030 is. To my understanding, it's basically an integrated circuit that has many functions, one of which is a keyboard controller. So there is no dedicated IRQ line for the keyboard interrupt; instead the TWL4030 has one level interrupt, and from it the right module IRQ is demultiplexed. OK.

I was hoping there was a chain of actions like pih -> sih (from twl-core) -> keyboard, but no, there was only one action in the chain. Fortunately, when I reverted the patch I got a warning with a backtrace where I could see something like twl4030_irq_thread->generic_handle_irq->handle_simple_irq. That chain was really straightforward, but I couldn't see how it came to be. There was a lot of code in twl4030-irq for a "secondary interrupt handler", but it didn't seem to be called at all, and I couldn't see how generic_handle_irq ended up calling handle_simple_irq.

Time to step back for a second. I remembered a patch from Neil Brown trying to fix something regarding how this sih stuff was called (here) because it was not called at all, but these patches are supposed to be fixes for other patches that are not applied at this point, so they really wouldn’t help.

At this point I'm pretty much stuck, as I have no idea how all this code is connected. You might think that all this is way over my head, and it might be, even if just a little, but that has never deterred me. I know that as long as I have the code and a way to run it, I can figure out how it works, so I just keep trying.

After reading the code more carefully, I noticed that only some twl "modules" had a sih setup, and the keyboard was not one of them, so that explained the sih part. And then I noticed a part of the irq handling that dealt with "chips", and one interesting function that sets up the chip's "handler". Wait a second… So there's a chip handler and a client handler; a few keystrokes later I find this gem: irq_set_chip_and_handler(i, &twl4030_irq_chip, handle_simple_irq). Finally! So I can now see the whole chain of events, but alas, I still have no idea why things are failing.

I thought maybe some states were miscalculated, so I found a function that is used to print everything related to an irq, but it was hidden in some internal code, so I had to do a few tricks to use it from the keyboard driver. All the states seemed to be OK, except one… the irq count, which showed something like 0, 0, 0, 1, 2, 2, 3 where it should have shown 0, 0, 1, 1, 2, 2. I looked at all the places where this counter is modified, and I found an interesting function: handle_nested_irq, so I replaced handle_simple_irq with it just for fun, and…

BAM!

Everything worked perfectly.

Why, Why, Why

So, now I have found a fix, but why does it work? If I want to fix this in a way that other people won't get bitten, I have to find the proper fix; "this seems to make things work here" usually doesn't convince many kernel hackers.

Using handle_nested_irq makes the compiler throw a warning right away, because the number of arguments is different, so I looked at how other people use this function, and it's indeed very different: people call handle_nested_irq directly, instead of generic_handle_irq, which would eventually call handle_simple_irq, because that's the chip's configured handler. It all boils down to the difference between handle_nested_irq and handle_simple_irq, and looking at them it's clear that handle_nested_irq does much less; specifically, it calls the action->thread_fn() callback directly (which is the keyboard interrupt handler), rather than waking up the thread that runs the same function.

And this finally sheds some light: the reason the interrupt counter is wrong is that the thread is only woken up after the irq has been processed, so the code that is supposed to clear the interrupt hasn't been called yet, and a second, spurious interrupt is generated. The relevant code is supposed to read a register, and the hardware clears the interrupt on a read (or write) operation. The reason handle_nested_irq works is that it doesn't bother with the interrupt thread at all; it calls the function right away on the same thread, and thus the interrupt is marked as handled at the right time (after it has really been handled).

And this also explains why other people call handle_nested_irq instead of generic_handle_irq: you are supposed to call it from inside an interrupt thread (one configured with request_threaded_irq), but then there's a thread lying around that nobody uses. Indeed, Neil Brown noticed the same thing (here), and irq_set_nested_thread() makes sure that no such extra thread is created.
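To make the pattern concrete, here is a condensed, hypothetical sketch of how such a demultiplexing driver is supposed to wire things up; this is not the actual twl4030 code, just the generic nested/threaded IRQ pattern.

/* sketch: a demux driver using the nested/threaded IRQ API */
#include <linux/interrupt.h>
#include <linux/irq.h>

static unsigned int child_irq; /* the demultiplexed (e.g. keypad) interrupt */

static irqreturn_t demux_irq_thread(int irq, void *data)
{
        /* normally: read the module status register to see which child
         * fired; here we assume it's always the keypad */
        handle_nested_irq(child_irq); /* runs the child's thread_fn directly */
        return IRQ_HANDLED;
}

static int demux_setup(unsigned int parent_irq, unsigned int child)
{
        child_irq = child;
        irq_set_chip_and_handler(child_irq, &dummy_irq_chip, handle_simple_irq);
        irq_set_nested_thread(child_irq, true); /* no extra thread for the child */

        return request_threaded_irq(parent_irq, NULL, demux_irq_thread,
                                    IRQF_ONESHOT, "demux", NULL);
}

The child driver still requests its own interrupt with request_threaded_irq() as usual, but its thread_fn ends up being called directly from the parent's thread via handle_nested_irq(), so the hardware is acked before the parent interrupt is considered done.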

So both irq handlers must be threaded, the one for twl-core and the one for the keyboard, not just the one for the keyboard as is the case right now. Fortunately, Felipe Balbi already fixed that (see this patch series), but that broke things badly, and then Neil Brown fixed those breakages (see this). Interestingly enough, I had already tested those patches for other reasons, but at that point I wasn't sure whether they would fix the keyboard problems; I just assumed they would.

Good good, it looks like we finally have the whole picture, but the job’s not done yet.

Getting it “done”

Now the question is: how do we fix this? One limitation of the 'stable' kernel trees (at least AFAIK) is that the patches must already be in Linus's latest tree, so whatever solution is found should land in the mainline tree.

Ideally, the whole stack of irqs should be reorganized to use the nested/threaded API, or not use it at all. So this patch should have been applied only after this patch series, not two years beforehand. I pondered for quite some time whether a middle ground could be found, either with workarounds in the core irq handler or in the keyboard one. Eventually I came to the conclusion that since the core is supposed to call irq_set_nested_thread() anyway (just trust me), the keyboard driver is the one that should check whether it's nested or not and set either a threaded handler or a normal one, and that code would work both before and after the patches applied for 3.2. Unfortunately, the 'nested' flag is supposed to be internal to the kernel's irq thread handling code.

So there's really no way around it: either the original patch is reverted (along with possibly similar ones for other twl modules), or the twl-core irq handler gets the new code from 3.2 backported, and that code isn't even fixed yet (as of v3.2-rc5), and maybe it never will be, the way things are progressing.

Then the question arises: is all that new code in v3.2 working properly? I went ahead and tried it, and lo and behold, the keyboard works as expected. I then backported all those patches to v3.1, and they worked there too.

But does it really make sense to apply all those patches to all the kernels after v2.6.33? Or does it make sense to revert a few of them? That's not for me to decide, so after doing all that work I asked the community in this mail.

Sometimes the result of a week's work ends up being a one-line patch, sometimes it's just one character, and in this case there was no patch (at least not from me), but at least the issue is clearly identified, as well as the fixes. And that means nobody has to do this work again.

Well, the fun is not over… We still need to synchronize with the maintainers so the right patches land in Linus's tree for v3.2 and are backported to the relevant linux-stable trees.

Kernel development is hard, let’s go shopping

What are you talking about? This isn't even kernel development, it's just some legwork 😉 I'm going to be Topper for a second and say "That's nothing!"; there are way more complicated and challenging issues that kernel developers confront all the time. I thought this one was interesting because of all the steps I had to go through, and because it involved things I had no idea about.

Was this worth my time? Well, I learned what a virtual terminal actually is, for real, not just as some vague notion, like when ordinary people say "My browser? Yeah, I use Google… No?". I also learned a bit more about the IRQ handling API in the Linux kernel. BTW, all this threaded-interrupt API started because of real issues and the removal of IRQF_DISABLED (nice LWN article about it here); that thread was fun to read years ago, even when I didn't understand most of it, and maybe it's time to read it again 🙂

For me, the best part is the satisfaction of knowing that I really "got it". I mean, I really understand what causes the problem, many of the possible solutions (both hacky and proper ones), and exactly what to do if I get bitten by a similar problem regarding threaded IRQs.

And now that the keyboard is fixed, on to the next problem on the N900 (which I had stopped working on because I thought a functional keyboard would help to debug it, as opposed to shutting down, removing the MMC, plugging it into a PC, writing a script with the commands I want to run, unmounting, plugging it back in, and booting [I can't use USB networking if I want this issue to appear]).

MeeGo scales, because Linux scales

To me, and to a lot of people, it's obvious why MeeGo scales to a wide variety of devices, but apparently that's not clear to others, so I'll try to explain why that's the case.

First, let's divide the operating system into layers:

  1. Kernel
  2. Drivers
  3. Adaptation
  4. System Frameworks
  5. Application Framework
  6. Applications

“Linux” can mean many things. In the case of Android, Linux mostly means the Kernel (which is heavily modified), and in some cases the Drivers (although sometimes they have to be written from scratch); all the layers above are specific to Android.

On Maemo, MeeGo, Moblin, and LiMo, “Linux” means an upstream Kernel (no drastic changes) and upstream Drivers (which can therefore be shared as-is with other upstream players), but it also means the “Linux ecosystem”: D-Bus, X.org, GStreamer, GTK+/Qt/EFL, etc. This means they take advantage of already existing System and Application Frameworks, and all they have to do is build the Applications, which is not an easy task, but certainly easier than having to build all the previous layers too.

Now, the problem when creating MeeGo is that, for reasons I won't (can't?) explain here, Maemo and Moblin were forced to switch from GTK+ to Qt. This might have been the right move in the long term, but it meant rewriting two very big layers of the operating system; in fact, the two layers that for the most part differentiate the various mobile platforms. And it of course meant letting go of a lot of the talent that helped build both Maemo and Moblin.

For better or worse, the decision was made, and all we could do was ride along with it. And maturing MeeGo essentially means maturing these two new layers, which are being written not entirely from scratch (Qt was already there), but pretty much (you have to add new features to it and build on top).

Now, did MeeGo fail? Well, I don't know when this UI can be considered mature enough, but sooner or later it will be (I do think it will be soon). The timeframe also depends on your definition of "mature", but regardless of that, it will happen. After that, MeeGo will be ready to ship on all kinds of devices. All the hardware platform vendors have to do is write the drivers and the adaptation, which they already do anyway for other software platforms.

Needless to say, the UI is irrelevant to the hardware platform.

So, here’s the proof that the lower layers are more than ready:

Just a few months after MeeGo IVI was announced, these guys were able to write a very impressive application thanks to QML, ignoring the official UI.

The OMAP4 guys went for the full MeeGo UI. No problems.

Even though Freescale is probably not that committed to MeeGo, it's easier to create a demo using it (Qt; Nomovok) than with other platforms. It's even hardware accelerated.

Renesas also chose the Nomovok demo to show their hardware capabilities.

MeeGo 1.1 running on HTC’s HD2

One guy (yes, one guy) decides to run MeeGo on his HTC, and succeeds. Of course, he uses the work already done by Ubuntu for the HD2, but since MeeGo is close to upstream, the same kernel can be used. Sure, it's slow (no hardware acceleration) and there are many things missing, but for the short amount of time spent by hobbyists, that's pretty great already.

This one is not so impressive, but it also shows the work of one guy porting MeeGo to the Nexus S.

And running on the Archos 9. The UI is not very impressive, but the point is that it runs on this hardware.

Conclusion

So, as you can see, MeeGo is already supported on many hardware platforms, and not because the relevant companies made a deal with Nokia or Intel; they don't have to. The only thing they have to do is support Linux: Linux is what allows them to run MeeGo, and Linux is what allows MeeGo to run on any hardware platform.

This is impossible with WP7 for numerous reasons: it's closed source, it's proprietary, it's Microsoft, etc. It's not so impossible to do the same with Android, but it's more difficult than with MeeGo because they don't share anything with a typical Linux ecosystem; they are on a faraway island of their own.

Nokia: from a burning platform to a sinking platform

I've been thinking a lot about Nokia's decision to use WP7, as I'm sure many people have, but I wanted to wait for the dust to settle before blogging, so here's what I think: it doesn't make any sense from any point of view.

Technically, there is nothing that can compare to the Linux kernel, which works on everything: supercomputers, mobile phones, TVs, routers, web servers, desktops, refrigerators, etc. Not only does it work, it works well, much better than everything else. As an example, the work that has been done to scale Linux's VFS to many processors (64) also benefits embedded, because some operations become more granular. Or the work on power management led by embedded helps web servers, where decreasing power consumption is also very much wanted. This creates an environment of synergy never seen before, where even competitors work together. Linux won the kernel race, and its use will only increase; those who try to fight against it will only fail miserably.

My ARM development notes

These are my notes for getting a useful cross-compilation setup, even with autotools, plus GStreamer stuff.

toolchain

The convention is to have ‘arm-linux-gcc‘ and so on, so that you can compile with ‘make CROSS_COMPILE=arm-linux-‘; the kernel and many other projects assume this is the default.

First, you need ‘~/bin‘ to be in your PATH, so make sure you have it in ‘~/.bash_profile‘ (export PATH="$HOME/bin:$PATH") or whatever your favorite shell uses.

I use CodeSourcery (GNU/Linux 2009q3), you can fetch it from here.

cd ~/bin
toolchain=/opt/arm-2009q3
for x in $toolchain/bin/arm-none-linux-gnueabi-*
do
ln -s $x arm-linux-${x#$toolchain/bin/arm-none-linux-gnueabi-}
done

QEMU

This is needed for sb2 in order to kind of emulate an ARM system.

git clone git://git.savannah.nongnu.org/qemu.git
cd qemu
git checkout -b stable v0.12.5
./configure --prefix=/opt/qemu --target-list=arm-linux-user
make install

sbox2

This is needed to avoid most of the pain caused by autotools (thank you GNU… not!).

git clone git://gitorious.org/scratchbox2/scratchbox2.git
cd scratchbox2
git checkout -b stable 2.1
./autogen.sh --prefix=/opt/sb2
make install

Add sb2 to the PATH:
export PATH=/opt/sb2/bin:$PATH

sb2 target

Now it’s time to configure a target.

cd /opt/arm-2009q3/arm-none-linux-gnueabi/libc/
sb2-init -c /opt/qemu/bin/qemu-arm armv7 /opt/arm-2009q3/bin/arm-none-linux-gnueabi-gcc

You can check that it works with:
sb2 gcc --version

GStreamer

We are going to install everything into ‘/opt/arm/gst‘, so:

export PKG_CONFIG_PATH=/opt/arm/gst/lib/pkgconfig

You can skip the steps here and go directly to deployment if you download and extract this tarball on your target.

zlib

This is needed by GLib’s gio (which cannot be configured out).

wget -c http://zlib.net/zlib-1.2.5.tar.gz
tar -xf zlib-1.2.5.tar.gz
cd zlib-1.2.5
sb2 ./configure --prefix=/opt/arm/gst
sb2 make install

glib

GLib has bugs (623473, 630910) detecting zlib (thank you Mattias… not!). So either apply my patches, or do the C_INCLUDE_PATH/LDFLAGS hacks below:

export C_INCLUDE_PATH='/opt/arm/gst/include' LDFLAGS='-L/opt/arm/gst/lib'

git clone git://git.gnome.org/glib
cd glib
git checkout -b stable 2.24.1
./autogen.sh --noconfigure
sb2 ./configure --prefix=/opt/arm/gst --disable-static --with-html-dir=/tmp/dump
sb2 make install

gstreamer

git clone git://anongit.freedesktop.org/gstreamer/gstreamer
cd gstreamer
git checkout -b stable RELEASE-0.10.29
./autogen.sh --noconfigure
sb2 ./configure --prefix=/opt/arm/gst --disable-nls --disable-static --disable-loadsave --with-html-dir=/tmp/dump
sb2 make install

liboil

Needed by many GStreamer components.

git clone git://anongit.freedesktop.org/liboil
cd liboil
git checkout -b stable liboil-0.3.17
./autogen.sh --noconfigure
sb2 ./configure --prefix=/opt/arm/gst --disable-static --with-html-dir=/tmp/dump
sb2 make install

gst-plugins-base

git clone git://anongit.freedesktop.org/gstreamer/gst-plugins-base
cd gst-plugins-base
git checkout -b stable RELEASE-0.10.29
./autogen.sh --noconfigure
sb2 ./configure --prefix=/opt/arm/gst --disable-nls --disable-static --with-html-dir=/tmp/dump
sb2 make install

gst-plugins-good

git clone git://anongit.freedesktop.org/gstreamer/gst-plugins-good
cd gst-plugins-good
git checkout -b stable RELEASE-0.10.23
./autogen.sh --noconfigure
sb2 ./configure --prefix=/opt/arm/gst --disable-nls --disable-static --with-html-dir=/tmp/dump
sb2 make install

Deployment

So now we have everything installed in ‘/opt/arm/gst‘, but how do we run it on the target? Just copy the exact same files onto the target at the exact same location, and then:

export PATH=/opt/arm/gst/bin:$PATH

That’s it, you can run gst-launch, gst-inspect, and so on.

Development

OK, it should be clear from the previous steps how to do development, but in case it isn't, here's how:

gst-dsp

Each time you want to cross-compile, you need to tell pkg-config where to find the packages:

export PKG_CONFIG_PATH=/opt/arm/gst/lib/pkgconfig

git clone git://github.com/felipec/gst-dsp.git
cd gst-dsp
git checkout -b stable v0.8.0
make

Note that gst-dsp doesn’t use autotools, so sb2 is not needed.

Now, once you have the plugin (libgstdsp.so), copy it to ‘/opt/arm/gst/lib/gstreamer-0.10‘ on the target.

And finally, you can run real gst-launch pipelines:
gst-launch playbin2 uri=file://$PWD/file.avi

Note: if you are missing some elements, play around with the playbin2 flags (flags=65, i.e. video + native-video, for native video only).

Do some more development, type make, copy, repeat 🙂

Enjoy 😉

GStreamer, embedded, and low latency are a bad combination

This has been a known fact inside Nokia (MeeGo) for quite a long time, due to various performance issues we've had to work around, but for some reason it wasn't acknowledged as an issue when it was brought up on the mailing list.

So, in order to prove beyond reasonable doubt that there is indeed an issue, I wrote this test. It is very minimal; there's essentially nothing of a typical GStreamer pipeline, just an element and an app that pushes buffers to it, that's it. But then, optionally, a queue (a typical element in a GStreamer pipeline) is added in the middle, which is a thread boundary, and then the fun begins:

Graph for x86
Graph for arm

The buffer-size legend corresponds to the exponent (5 => 2^5 = 32 bytes), and the CPU time is what the system reports (getrusage), in ms. You can see that on ARM systems not only is more CPU time wasted, but adding a queue makes things worse at a faster rate.

Note that this test does nothing but push buffers around; all the CPU is spent on GStreamer operations. In a real scenario the situation is much worse, because there isn't only one thread but multiple threads, and many elements involved, so the wasted CPU time I measured has to be multiplied many times.
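To give an idea of what such a test looks like, here is a rough stand-in (not my actual test): it runs a trivial pipeline for a fixed number of small buffers and reports the CPU time via getrusage(). I'm assuming GStreamer 0.10 era API, and fakesrc's sizetype=2 is the "fixed size" mode, with sizemax giving the buffer size in bytes.

/* rough stand-in for the push-buffers benchmark described above */
#include <gst/gst.h>
#include <sys/resource.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        GstElement *pipeline;
        GstBus *bus;
        GstMessage *msg;
        struct rusage ru;
        gboolean use_queue = argc > 1; /* pass any argument to add a queue */
        const char *desc = use_queue ?
                "fakesrc num-buffers=100000 sizetype=2 sizemax=32 ! queue ! fakesink" :
                "fakesrc num-buffers=100000 sizetype=2 sizemax=32 ! fakesink";

        gst_init(&argc, &argv);

        pipeline = gst_parse_launch(desc, NULL);
        bus = gst_element_get_bus(pipeline);

        gst_element_set_state(pipeline, GST_STATE_PLAYING);
        /* run until all the buffers have been pushed (EOS) or an error occurs */
        msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                        GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
        gst_element_set_state(pipeline, GST_STATE_NULL);

        getrusage(RUSAGE_SELF, &ru);
        printf("user cpu: %ld.%06ld s\n",
                        (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);

        gst_message_unref(msg);
        gst_object_unref(bus);
        gst_object_unref(pipeline);
        return 0;
}

Build it against gstreamer-0.10 with pkg-config and run it once with and once without an argument to compare the direct and queued cases.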

Now, this has been profiled before, and everything points to pthread_mutex_lock, which is only a problem when there's contention, which happens more often in GStreamer when buffers are small; then the futex syscall is issued, which is very costly on ARM, although it probably depends on which specific system you are using.

Fortunately for me, I don't need good latency, so I can just push one-second buffers and forget about GStreamer performance issues. If you are experiencing the same and can afford high latency, just increase the buffer sizes; if not, then you are screwed :p

Hopefully this answers Wim's question of what a "small buffer" means, why it's not good, and when it's a problem.

Update

OK, so the discussion about this continued on the mailing list, and it was pointed out that the scale is logarithmic, so the exponential result was expected. While that is true, the logarithmic scale matches what people experience; how else would you plot the range from 10 ms to 1 s? Certainly not linearly.

But there's a valid point: the results should not be surprising. We can take the logarithmic scale out of the equation by dividing the total CPU time by the number of buffers actually pushed, as Robert Swain did in the comments; that should give a constant number, the CPU time it takes to do one push. The results indeed converge to a constant:

queue: 0.078, direct: 0.011

This means that in a realistic use case of pushing one buffer every 10 ms through a queue, the CPU usage on this particular processor (800 MHz) is 0.078 ms / 10 ms ≈ 0.78%.

Also, there’s this related old bug that recently got some attention and a new patch from Wim, so I gave it a try (I had to compile GStreamer myself so the results are not comparable with the previous runs).

Before:
queue: 0.074, direct: 0.011

After:
queue: 0.065, direct: 0.007

So the improvement for the queue case is around 12%, while the direct case is 31%.

Not bad at all, but the conclusion is still the same. If you use GStreamer, try to avoid as many elements as possible, especially queues, and try to have the biggest buffer size you can afford, which means that having good performance and low latency is tricky.

Update 2

Stefan Kost suggested using ‘queue’ instead of ‘queue2’, and I got a Pandaboard, so here are the results on OMAP4.

pandaboard (2 core, 1GHz):
queue: 0.017, direct: 0.004

N900:
queue: 0.087, direct: 0.021

i386 (2 core, 1.83GHz):
queue: 0.0059, direct: 0.0015

So, either futex got better on Cortex A9, or OMAP4 is so powerful it can’t be considered embedded :p

gst-av 0.3; better performance for vorbis and mp3

So, I've been working on gst-av, a GStreamer plug-in for using FFmpeg codecs (only audio for now), in order to get it into good shape for Ogg support. First, I had to fix oggdemux and flacparse to be compatible with tagreadbin; it seems I managed to do it (with the help of a patch from Sreerenj Balachandran), so now the custom Tracker extractors are not needed any more.

Then, with a bit of work, I managed to get not only vorbis but also flac and mp3 working.

That was good, but was it really worth it? Tuomas Kulve did a nice comparison of gst-av vs. the default vorbisdec, and I wanted to do something similar; however, running a series of tests, each taking 20 hours to complete, wasn't so appealing.

So I asked in the #meego and #maemo IRC channels for a simple way to measure battery drain reliably and automatically. It seems powertop can do that on some platforms, but Maemo's powertop is a very different beast. Fortunately, the folks in #maemo seem to have been busy trying to get all possible information out of the battery, and they pointed me to a very nice power script. However, I got some tips for even better results (from ShadowJK, DocScrutinizer, and SpeedEvil), and the result is this maemo-battery script (needs i2c-tools and root permissions), which essentially prints the current charge of the battery every 10 minutes.

With this I was ready. Just to be clear on how to properly measure battery draw: make sure you are in offline mode, plug in your headphones (otherwise PulseAudio runs extra algorithms), and immediately blank the screen.

Here are the results (units in hours of battery life):

These results show that vorbis with FFmpeg is massively better than libvorbis, so my work wasn't in vain :). It's also interesting that FFmpeg's mp3 decoder is slightly better than Nokia's proprietary one. Also, FFmpeg still needs some work to compete with libflac. My guess is that these decoders can't be optimized much further; the bottlenecks now would have to be pulseaudio and gstreamer.

This is the raw data (in mA); I ran my script for one hour for each test, and some I ran multiple times just to verify; the results seem to vary ±1 mA.

current -- mp3: 63, vorbis: 110, flac: 62
gst-av -- mp3: 61, vorbis: 62, flac: 69

gst-ffmpeg

Why not use gst-ffmpeg, you might ask? Initially that's what I tried, but it supports neither vorbis nor flac, which seems to fit GStreamer's tradition of staying away from FFmpeg as much as possible. Then, when I read the code, it was clear to me that it was overly complicated. I'm familiar with FFmpeg's API (it's unbelievably simple), so I decided to play around and see if I could get something working; I did, and the result was incredibly simple, and oh so sweet 🙂 As a comparison, gst-ffmpeg is 16357 lines of code, gst-av is 563 (sure, gst-av does much less; just what is needed). Another reason that goes hand in hand with this is the ability to tweak it; my goal is to get the absolute best performance, and for that I want to be able to understand what the code is doing. And finally, gst-ffmpeg uses deprecated API.
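As an illustration of that simplicity, here is a sketch of the core of such a decoder (not gst-av's actual code), using the avcodec API roughly as it looked at the time (avcodec_decode_audio3, CODEC_ID_MP3, avcodec_alloc_context); the GStreamer glue and error handling are left out.

/* sketch: decoding audio with the avcodec API of that era */
#include <libavcodec/avcodec.h>
#include <libavutil/mem.h>
#include <stdint.h>

static AVCodecContext *ctx;
static int16_t *samples;

static int setup(void)
{
        AVCodec *codec;

        avcodec_register_all();
        codec = avcodec_find_decoder(CODEC_ID_MP3);
        ctx = avcodec_alloc_context();
        samples = av_malloc(AVCODEC_MAX_AUDIO_FRAME_SIZE);
        return avcodec_open(ctx, codec);
}

/* feed one encoded buffer; returns the number of bytes of PCM produced */
static int decode(uint8_t *data, int size)
{
        AVPacket pkt;
        int out_size = AVCODEC_MAX_AUDIO_FRAME_SIZE;

        av_init_packet(&pkt);
        pkt.data = data;
        pkt.size = size;

        if (avcodec_decode_audio3(ctx, samples, &out_size, &pkt) < 0)
                return -1;
        return out_size;
}

The plug-in essentially just wraps these two steps: buffers coming in from GStreamer are handed to decode(), and the resulting PCM is pushed downstream.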

What about performance?

The difference is not that big: ~1.6h of battery life, but it’s something.

current: 63, gst-av: 61, gst-ffmpeg: 66

What now?

Now we need to package FFmpeg; probably just include the codecs we need, and then ogg support might include these instead. Any volunteers?

Update

It turns out the issue was flacparse, which is total crap: it was using 4 times more CPU time than FFmpeg's decoder just for parsing. After fixing it, it now takes only 20%. I'm trying to get new measurements in a more automated and precise way. I've already pushed the code to my repo.