No, mercurial branches are still not better than git ones; response to jhw’s More On Mercurial vs. Git (with Graphs!)

I’ve had plenty of discussions with mercurial fans, and one argument that keeps popping up is how mercurial branches are superior. I’ve blogged in the past about why I think the branching models are the only real difference between git and mercurial, and why git branches are superior.

However, I’ve noticed that J. H. Woodyatt’s blog posts Why I Like Mercurial More Than Git and More On Mercurial vs. Git (with Graphs!) have become quite popular. I tried to engage in a discussion on that blog, but commenting there is a painful ordeal (and my comments have been deleted!).

So, in this blog post I will explain why mercurial branches are not superior, and how everything can be achieved with git branches just fine.

The big difference

The fundamental difference between mercurial and git branches can be visualized in this example:

Merge example

In which branches is the commit ‘Quick fix’ contained? Is it only in ‘quick-fix’, or is it in both ‘quick-fix’ and ‘master’? In mercurial it would be the former, and in git the latter. (If you ask me, it doesn’t make any sense that the ‘Quick fix’ commit is only on the ‘quick-fix’ branch.)

In mercurial a commit can be only on one branch, while in git, a commit can be in many branches (you can find out which with ‘git branch --contains‘). Mercurial “branches” are more like labels, or tags, which is why you can’t delete or rename them; they are stored forever for posterity, just like the commit message.

That is why git branches are so useful; you can do absolutely anything you want with them. When you are done with the ‘quick-fix’ branch, you can just remove it, and nobody has to know it existed (except for the fact that the merge commit message says “Merge branch ‘quick-fix’”, but you could have easily rebased instead). Then, the commit would only be on the ‘master’ branch.
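For instance, a minimal sketch with the branches from the graph above (output abbreviated):

$ git branch --contains quick-fix
* master
  quick-fix
$ git branch -d quick-fix    # safe; the commits are already in 'master'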

Bookmarks are not good enough

Mercurial has another concept that is more similar to git branches: bookmarks. In old versions of mercurial these were an extension, but now they are part of the core, and also new is the support for repository namespacing (so you could have upstream/master, backup/master, and so on).

However, these are still not as useful as git branches because of the fundamental design of mercurial; you can’t just delete stuff. So, for example, if your ‘quick-fix’ bookmark didn’t go anywhere, you can delete the bookmark easily, but the commits won’t be gone; they’ll remain as an anonymous head (explained below). You would need to run ‘hg strip‘ to get rid of them. And then, if you have pushed this bookmark to a remote repository, you would need to do the same there.

In git you can remove a remote branch quite easily: ‘git push remote :branch‘.

And then, bookmark names are global, so you can’t push a branch with a different name like in git: ‘git push remote branch:branch-for-john‘.
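Putting both together with the running example (the remote and the alternate name are invented for illustration):

$ git push origin :quick-fix                 # delete 'quick-fix' on the remote
$ git push origin quick-fix:fix-for-john     # or publish it under a different name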

Anonymous heads

Anonymous heads are probably the most stupid idea ever; in mercurial, a branch can have multiple heads, so you can’t just merge, or checkout a branch, or really do any operation that needs a single commit.

Git forces you to either merge or rebase before you push; this ensures that nobody else will need to do it. If you have a big project with hundreds of committers this is certainly useful (imagine 10 people trying to merge the same two heads at the same time). In addition, you know that a branch will always be usable for all intents and purposes.

Even mercurial would try to dissuade you from pushing an anonymous head; you need to do ‘hg push -f‘ to override those checks.

The rest of the uses of anonymous heads were solved in git in much simpler ways; ‘git pull’ automatically merges the remote head, and remote namespaces of branches allow you to see their status after doing ‘git fetch’.

Anonymous heads only create problems and solve none.

Nothing is lost

So, let’s go ahead with jhw’s blog post by looking at his example repository:

Repository

According to him, it’s impossible to figure out what happened in this repository, but it’s not. In fact, git can automatically find the corresponding branch for a commit with the ‘git name-rev‘ command (e.g. ‘release~1‘).
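For instance, a sketch using the commit id from jhw’s own use-case below (the output is what I would expect from this graph, not copied from his repository):

$ git name-rev --name-only ab3e2afd
release~1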

Now let’s assign colors based on the output of ‘git name-rev‘:

Repository with names

The colors are exactly the ones that jhw used for his mercurial example.

Now, the only difference is that there is no ‘temp’ branch, but that is actually good; it was removed. Why would we want to see a branch that was removed? We wouldn’t. Either way, the information remains; “Merge branch ‘temp’ into release” says it all: all those commits come from the ‘temp’ branch.

Of course, one would need to manually look through the commit messages to find those removed branches, but that is fine, because you would rarely (never?) need that. And if he really needs that information readily available, he can write a prepare-commit-msg hook to store the name of the branch each commit was originally created on.
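Such a hook is a few lines of shell; a minimal sketch (the ‘X-Branch:’ trailer is just an invented convention):

#!/bin/sh
# .git/hooks/prepare-commit-msg: record the branch the commit was created on
ref=$(git symbolic-ref -q HEAD) || exit 0    # detached HEAD; nothing to record
echo "X-Branch: ${ref#refs/heads/}" >> "$1"  # $1 is the commit message file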

Real use-cases

jhw tried to defend the need for this information by presenting some use cases:

A more clever rebuttal to my question is to ask in return, “Why do you need to know?” Let me answer that preemptively:

A) I need to know which branch ab3e2afd was committed to know whether to include it in the change control review for the upcoming release

It’s easy to find out what commits are relevant for the next release with ‘git log master^..release‘:

Release commits

But then he said:

I didn’t ask for a list of all the commits that are currently included in the head of the branch currently named ‘release’ that are not included in the head of the branch currently named ‘master’. I wanted to know what was the name of the branch on which the commit was made, at the time, and in the repository, where it was first introduced.

How convenient; now he doesn’t explain why he needs that information, he just says he needs it. ‘git log master..release‘ does what he said he was looking for.

B) I need to know which change is the first change in the release branch because I’d like to start a new topic branch with that as my starting point so that I’ll be as current as possible and still know that I can do a clean merge into master and release later

Easy; ‘git merge-base master^ release‘, which would return ‘master~1‘ (76ae30ef).

But then he said:

I didn’t want to know the most recent commit included in both the currently named ‘master’ and ‘release’ heads, because that may have actually occurred either prior to, or after, the creation of either the branch currently named ‘release’ or the branch currently named ‘master’.

And again, he doesn’t explain why on earth he would need that.

To find the most current commit from the ‘release’ branch that can also be merged into ‘master’ cleanly, you can use ‘git merge-base‘; the first commit of the ‘release’ branch doesn’t actually help, as it has already diverged from ‘master’, and it’s not even “as current as possible”, since there will probably be newer commits on the release branch.

Either way, if he really wants that, he can pick any commit that he wants from ‘git log master..release‘.

C) I need to know where topic branch started so that I can gather all the patches up together and send them to a colleague for review.

Easy: ‘git send-email --to john release..topic‘.

But then he said:

I didn’t want to know all the commits present in the head of the branch currently named ‘topic’ that aren’t present in head of the branch currently named ‘release. I wanted to know the first commit that went into a branch that was called ‘topic’ at the time when the change was committed. Your command may potentially include commits that were in a different branch that wasn’t called ‘topic’ at the time.

Why would you send patches for review that depend on commits your colleague has no visibility of? No, you want to send all the patches that comprise the ‘topic’ branch; doing anything else would be confusing.

If for some reason you don’t want to send the patches that were part of another branch, you can filter them out with ‘^temp’.
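Since revision arguments compose, you can preview exactly which patches that selects before sending anything; a sketch:

$ git log --oneline release..topic ^temp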

Conclusion

All the use-cases jhw presented are supported just fine in git; he is just looking for corner-cases and then complaining because we would need to do extra work.

I have never seen a sensible use-case in which mercurial “branches” (branch labels) would be more useful than git branches. And bookmarks are still not as good.

So git’s branching model wins.

gst-av 0.6 released; more reliable

gst-av is a GStreamer plug-in to provide support for libav (a fork of FFmpeg). It is similar to gst-ffmpeg, but without GStreamer politics, which means all libav codecs are supported, even if there are native GStreamer alternatives: VP8, MP3, Ogg, Vorbis, AAC, etc.

This release takes care of a few corner-cases, and has support for more versions of FFmpeg.

Here are the goods:
http://code.google.com/p/gst-av/downloads/list

And here’s the short-log:

Felipe Contreras (19):
      adec: flush buffer on EOS
      adec: improve timestamp reset
      adec: avoid deprecated av_get_bits_per_sample_fmt()
      adec: avoid FF_API_GET_BITS_PER_SAMPLE_FMT
      vdec: properly initialize input buffer
      parse: add more H.264 parsing checks
      parse: fix H.264 parsing for bitstream format
      get_bits: add show_bits function
      build: set runpath for libav
      vdec: fix potential leaks
      vdec: use libav pts stuff
      vdec: get delayed pictures on eos
      build: trivial improvements
      parse: trivial fix
      h264enc: fix static function
      vdec: add support for old reordered_opaque
      adec: add support for old sample_fmt
      adec: add support for really old bps()
      adec: add support for all MPEG-1 audio

Mark Nauwelaerts (1):
      parse: be less picky regarding some reserved value

Android vs. Maemo power management: static vs. dynamic

Some of you might have heard about the Google Android team’s proposal to introduce wakelocks (aka suspend blockers) to the Linux kernel. While there was a real issue being solved on the kernel side, the benefits on the user-space side were dubious at best, and after a huge discussion, they finally didn’t get in.

During these discussions, dynamic and static power management were described and discussed at length, and there was a bit of talk about Maemo (MeeGo) vs Android approaches to power management, but there was so much more than that.

Some people have problems with the battery on Android devices, and for some people it’s just fine; some people have problems with Maemo, others don’t. So, in general, your mileage might vary. But given the extremely different approaches, it’s easy to see in which cases you would have better luck with Maemo, and in which with Android (although I do think it’s obvious which approach is superior, but then, I am biased).

An interesting achievement was shared by Thiago Macieira, who managed to get ‘5 days and a couple of minutes‘ out of the Nokia N9 while traveling, and actually using it; and let’s remember this is a 1450 mAh battery. Some Android users might have trouble believing this, but hopefully this blog post will explain what’s going on.

So let’s go ahead and explore the two approaches.

Dynamic Power Management

Perhaps the simplest way to imagine dynamic power management is the gears of a manual transmission car: you shift up and down depending on how much power the system actually needs.

In some hardware, such as OMAP chips, it’s possible to have quite fine-grained control over the power states of a lot of devices, plus different operating performance points for the different cores. For example, it might be possible to have some devices, such as the display, on and active; some other devices partially off, like the speaker; and others completely off, like USB. And based on the status of the whole system, whole blocks can be powered off, others run at low voltage levels, etc.

Linux has a framework to deal properly with this kind of hardware: runtime power management. It originally came from the embedded world, and a lot of it from OMAP development, but it is now available to everyone.

The idea is very simple: devices should sleep as much as possible. This means that if you have a sound device that needs chunks of 100ms, and the system is not doing anything else but playing sound, then most of the devices go to sleep, even the CPU, and the CPU is only woken up when it needs to write data for the audio device. Even better is to configure the sound device for chunks of 1 second, so the system can sleep even more.

Obviously, some co-operation between kernel and user-space is needed. Say you have an instant messenger program that needs to do work every minute, and a mail program that is configured to check mail every 10 minutes; you would want them to do their work at the same time, when they align every 10 minutes. This is sometimes called IP heartbeat: the system wakes up for a beat, and then immediately goes back to sleep. And there are kernel facilities as well, such as range timers.
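User-space can help the kernel batch those wakeups; here is a minimal C sketch using the Linux timer-slack knob (the 100ms value is arbitrary):

#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
    /* let the kernel defer our timers by up to 100ms, so it can
       coalesce our wakeups with everybody else's (range timers) */
    prctl(PR_SET_TIMERSLACK, 100 * 1000 * 1000UL, 0, 0, 0);

    for (;;) {
        /* ... do the periodic work (heartbeat, check mail, etc.) ... */
        sleep(600); /* and sleep as long as possible */
    }
}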

All this is possible thanks to the very small latencies required for devices to go to sleep and wake up, and to intermediary modes (e.g. on, inactive, retention, off); so, for example, a device might be able to go to inactive mode in 1ms, retention in 2ms, and off in 5ms (totally invented numbers). Again, the more sleep, the better. Obviously, this is impossible on x86 chips, which have huge latencies (at least right now; it’s something Intel is probably trying very hard to improve). All these latencies are known by the runtime pm framework in the kernel, and based on them and the usage, it figures out the lowest power state possible without breaking things.

Note that I’m not a power management expert, but you can watch a colleague of mine explain OMAP 3 power management in this video:

Advanced Power Management for OMAP3

And there’s plenty of more resources.

Update: That was the wrong link, here are the resources.

Static Power Management

Static power management has two modes: on and off. That’s it.

OK, that’s not exactly the case in general, but it is in the Android context; the system is either suspended or active, and it’s easy to know in which mode you are: if the screen is on, it’s active, and if it’s off, it is suspended (ideally).

There’s really not much more to it than that. The complexity comes from the problem of deciding when to suspend; you might have turned off the display, but there might be a system service that still needs to do work, so this service grabs a suspend blocker which, as the name suggests, prevents the system from suspending (until the lock is released). This introduces a problem: a rogue program might grab a ‘suspend blocker’ and never release it, which means your phone will never suspend, and thus the battery will drain rather quickly. So, some security mechanisms are introduced to grant permissions selectively to use suspend blockers.
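For the curious, on Android kernels the user-space side of this is just a sysfs file; a sketch:

# grab a suspend blocker named 'myservice'
echo myservice > /sys/power/wake_lock
# ... do the work that must not be cut short by suspend ...
echo myservice > /sys/power/wake_unlock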

Moreover, Android developers found rather nasty race conditions in the suspend sequences in certain situations (e.g. the system tries to suspend at the same time the user clicks a button, and the system never wakes up again); these were acknowledged as real issues that would happen on all systems (including PCs and servers, albeit rarely, because they don’t suspend so often), and they got fixed (or at least they tried).

Versus

First of all, it’s important to know that if you have dynamic pm working perfectly, you reach exactly the same power usage as static pm, so in the ideal case they both behave exactly the same.

The problem is that it’s really hard for dynamic pm to reach that ideal case; in reality, systems are able to sleep only for certain periods of time, after which they are woken up, oftentimes unnecessarily, and as I already explained, that’s not good. So the goal of a dynamic pm system is to increase those periods of time as much as possible, thus maximizing battery life. But there’s a point of diminishing returns (read this or this for expert explanations); if the system manages to sleep 1s on average, there’s really not much more to gain if it sleeps 2s, or even 10s. These goals were quite difficult to achieve in the past (not these specific numbers, which I invented, but this kind of target), but not so much any more, thanks to several mechanisms that have been introduced and implemented through the years. So it’s fair to say that the sweet spot of dynamic pm has been achieved.

This means that today a system that has been fine-tuned for dynamic pm can reach a decent battery life compared to one that uses static pm in most circumstances. But for some use-cases, say, leaving your phone on your desk without using it at all, static pm would allow it to stay alive for weeks, or even months, while dynamic pm can’t possibly achieve that any time soon. Hopefully you would agree that nobody cares about those use-cases where you are not actually using the device.

And of course, you only need one application behaving badly and waking up the system constantly, and the battery life is screwed. So in essence, dynamic pm is hard to achieve.

Android developers argued that this was one of the main reasons to go for static pm: it’s easier to achieve, especially if you want to support a lot of third-party applications (Android Market) without compromising battery life. While this makes sense, I wasn’t convinced by this argument; you can still have one application that behaves badly (grabbing suspend blockers the whole time), and while permissions should help, the application might still request the permission, and the user grant it (who reads incomprehensible warnings before clicking ‘Yes’ anyway?).

So, both can get screwed by bad apps (although it’s likely harder in the static pm case, albeit not by much).

But actually, you can have both static and dynamic power management, and in fact, Android does. But that doesn’t mean Android automatically wins; as I explained, the system needs to be fine-tuned for dynamic pm, and that has never been a focus of Android (there are no APIs or frameworks for it, etc.). So, for example, a Nokia N9 phone might be able to sleep 1s on average, while an Android phone sleeps 100ms (when not suspended). This means that when you are actually using the device (the screen is on), chances are a system fine-tuned for dynamic pm (Nokia N9) will consume less battery than an Android device.

That is the main difference between the two. tl;dr: dynamic pm is better for active usage.

So, if Android developers want to improve the battery usage during active usage (which I assume is what the users want), they need to fine-tune the system for dynamic pm, and thus sleep as much as possible, hopefully reaching the sweet spot. But wait a second… if Android is using dynamic pm anyway, and they tune the system to the point of diminishing returns, there is no need for static pm. Right? Well, that’s my thinking, but I didn’t manage to make Android developers realize that in the discussion.

Plus, there’s a bunch of other reasons why static pm is not palatable for non-Android systems (aka typical Linux systems), but I won’t go into those details.

Nokia’s bet was on dynamic, Google’s bet was on static, and in the end we agreed to disagree, but I think it’s clear from the outcome in real-world situations who was right: N9 users experience more than one day of normal usage, and even more than two. Sadly, only Nokia N9 users get to experience dynamic pm in its full glory (at the moment).

Upstream

But not all is lost, and this, in my opinion, is the most important aspect: dynamic pm lives on in the Linux kernel mainline through the runtime power management API. This is not a Nokia invention that will die with the Nokia N9; it’s a collaborative effort where Nokia, TI, and other companies worked together, and it now benefits not only OMAP, but other embedded chips, and even non-embedded ones. Being upstream means it’s good, it has been blessed by many parties, and it has gone through many iterations; by now it probably doesn’t look like the original OMAP SRF code at all. Sooner or later your phone (if you don’t have an N9) will benefit from this effort, and it might even be an Android phone, your netbook (if it’s not already benefiting in some way), or even your PC.

Android’s suspend blockers are not upstream, and it’s quite unlikely that they will ever be, or that you would see them used in any system other than Android, and there are good reasons for that. Matthew Garrett did an excellent job of summarizing what went wrong with the story of suspend blockers in his presentation ‘Android/Linux kernel: Lessons learned’, but unfortunately the Linux Foundation seems to be doing a poor job of providing those videos (I cannot watch them any more, even though I registered, and they haven’t been helpful through email conversations).

Update: I managed to get the video from the Linux Foundation and pushed it to YouTube:

Here is part of the discussion on LKML, if you want to read it for some strange reason. WARNING: it’s huge.

A tale of just another Linux kernel bug

As part of a bigger effort to get my Nokia N900 in good shape for development, I decided to track down an issue with the keyboard; I could type ‘a’, but not ‘A’ or any special characters, so no ‘shift’ or ‘ctrl’ or anything special. Trying to figure out what was going on took me through an unexpected journey, which is not remarkable in itself, but I think it’s a good example of what many kernel developers (and low-level developers) constantly go through, and as such, it might be interesting for some people to read.

Keycodes

So, first things first. I recently had an issue with a PS/2 keyboard on my PC, so I had an idea of how to debug this, and the first thing I did was to run showkey, which shows messages like these:

keycode  30 press
keycode  30 release
keycode  30 press
keycode  30 release
keycode  29 press

These keycodes are then supposed to be converted to real characters through a key map, and apparently X has a map of its own.

However, I saw keycodes being generated when I pressed ‘shift’, so I concluded that neither the hardware nor the input driver was the problem; could it be the mapping? The MeeGo project provided a mapping file that I have been using for some time, and it clearly shows “keycode 42 = Shift”, and I was getting keycode 42, so the mapping seemed correct. But was it being applied properly? I found out this can be checked with dumpkeys, and indeed, the mapping was correct.
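For instance, a sketch of the check (dumpkeys prints the whole map; output trimmed):

$ dumpkeys | grep 'keycode  42'
keycode  42 = Shift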

Everything was fine up to this level. Next.

Note: all these tools (showkey, loadkeys, and dumpkeys) are provided by the ‘kbd’ package.

Virtual terminal

Now I stumbled into a problem; I had no idea what it is that I am interacting with on the framebuffer console (I’m not using X). So, I tried to fill that gap in my knowledge by going to the ##linux IRC channel on freenode.net, and asking the question: what is it that converts the keycodes to characters on a framebuffer console? As is typical when I ask these sorts of tricky questions, I got useless responses, like ‘what distribution are you using?’. I knew that didn’t matter, so I set out to investigate myself.

I thought, well, what is this thing that I have to add to my inittab to get the console working? 1:2345:respawn:/sbin/getty 9600 tty1. I tried to read the documentation about getty and tried different options, like specifying vt100 as an argument, but to no avail.

Maybe it’s done in the kernel, I thought. So I quickly went through my kernel config, and found this gem:

CONFIG_VT

If you say Y here, you will get support for terminal devices with
display and keyboard devices. These are called "virtual" because you
can run several virtual terminals (also called virtual consoles) on
one physical terminal. This is rather useful, for example one
virtual terminal can collect system messages and warnings, another
one can be used for a text-mode user session, and a third could run
an X session, all in parallel. Switching between virtual terminals
is done with certain key combinations, usually Alt-<function key>.

The setterm command ("man setterm") can be used to change the
properties (such as colors or beeping) of a virtual terminal. The
man page console_codes(4) ("man console_codes") contains the special
character sequences that can be used to change those properties
directly. The fonts used on virtual terminals can be changed with
the setfont ("man setfont") command and the key bindings are defined
with the loadkeys ("man loadkeys") command.

You need at least one virtual terminal device in order to make use
of your keyboard and monitor. Therefore, only people configuring an
embedded system would want to say N here in order to save some
memory; the only way to log into such a system is then via a serial
or network connection.

If unsure, say Y, or else you won't be able to do much with your new
shiny Linux system :-)

That makes sense; now we are getting somewhere. So, in the past, there was this notion of ‘terminals’, which were specialized pieces of hardware that send and receive ASCII (or some other codes), so they are the ones that have all the needed stuff to control the keyboard, screen, and so on. I never had the need to use one of these terminals, therefore I never really knew what a “virtual” terminal was. So a virtual terminal does the job of a real terminal: it needs an input driver and a display driver, and it really puts them to do something useful.

Interesting theory, but is it true? I quickly looked at the code inside drivers/tty/vt, and I found code to control the keyboard and the screen, and also the keyboard mappings. Excellent! So we found the thing that uses these keyboard mappings. Now what?

Before moving on: the fact that this code resides in tty is also helpful; basically, serial consoles, virtual terminals (framebuffer console), and real terminals all operate through ttys (teleprinters), which are means of communication between hosts and these “devices”. So, getty really gets a tty to be used by any of these.

All right, so now the knowledge gap seems to be filled; what next? Well, clearly there’s something wrong with this virtual terminal, but what? I immediately started looking at the code in drivers/tty/vt/keyboard.c, and I noticed something interesting: kbdmode can have different values, like RAW, and in this mode certain things are not handled, like ‘shift’. That looked promising, but how to change that mode?

I looked for tools to control the virtual terminal and found an interesting one, setterm, which was not available on my minimal system; I couldn’t find how to get it, and anyway it didn’t have any option that I wanted. terminfo and such looked interesting, but didn’t seem relevant to the issue at hand. Then I found kbd_mode, which obviously did what I wanted, and I already had it :). I couldn’t even type kbd_mode on my keyboard (the underscore needs ‘shift’), so I had to write shortcuts on my PC (kbda, kbdb, etc.). Unfortunately, I found out that the mode was initially set to ‘unicode’, which seemed OK, and changing it to ‘ascii’ didn’t change anything.
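For reference, a sketch of what I mean (kbd_mode is also part of the ‘kbd’ package):

$ kbd_mode     # print the current translation mode
The keyboard is in Unicode (UTF-8) mode
$ kbd_mode -a  # switch to ASCII (XLATE) mode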

So, it didn’t seem to be a user-space configuration of any sort. Next.

Getting our hands dirty

Time to actually write some code. I modified the code in drivers/tty/vt/keyboard.c, specifically in kbd_keycode(), to find out the true kbdmode, the keycodes coming down, and how they were being interpreted.

I quickly found out that the keycodes and the mode were indeed correct, but each and every key press was immediately followed by a key release, so shift+a was interpreted as shift, a. Now we are getting somewhere; the problem has to be in the input driver.

Maybe I chose the wrong one, or maybe I’m missing some configuration. I see CONFIG_KEYBOARD_GPIO and CONFIG_KEYBOARD_TWL4030, and it looks like I should be using TWL4030, as that’s the chip the Nokia N900 has, but I’m not sure, so time to look at the N900 schematics. Well, it seems TWL4030 is indeed the right one, and there’s nothing to configure; either you have it or you don’t.

Maybe some recent change broke it… but there’s nothing recent that I can see that could do that. So it’s time to take a look at the actual code: drivers/input/keyboard/twl4030_keypad.c. After adding a few prints here and there, I realized the problem starts with this code: ret = twl4030_kpread(kp, &reg, KEYP_ISR1, 1), which returns 0 after a key press (which returns 1). So, time to read the TRM.

It took me some time to find the right document, and then to understand what all the configuration options were actually doing (more or less), and then to play around with them. After making a lot of more or less random changes, I noticed no difference in this particular problem of getting an extra ‘0’. So I thought to myself: maybe the problem is the interrupt.

So I abandon the configuration of the keyboard, and look at the code to request the interrupt:
request_threaded_irq(kp->irq, NULL, do_kp_irq, 0, pdev->name, kp);

I had run into some interrupt issues before (in fact on this very keyboard), so I knew a few hacks I could try, like specifying the IRQF_ONESHOT flag. That didn’t help, so I tried to do the same in the parent interrupt in drivers/mfd/twl4030-irq.c, because I saw a patch from Neil Brown on the linux-omap mailing list that fixed another issue, but that didn’t help either. Then I remembered I had seen one patch that affected this interrupt request (see here). So I reverted the patch, and voilà: no more 0’s afterwards, and the keyboard worked properly.

However, there was a nasty warning about interrupts being enabled, which is probably the reason the original patch happened in the first place, so I tried a few random things to get rid of it, but nothing helped. So then I wondered: maybe the keyboard driver worked before that patch out of pure luck, and these extra interrupts were simply not being detected.

Enough fooling around

Since I really want to fix this issue properly, I push myself to really understand what’s going on in the driver. So I slowly read all the documentation and all the registers, and try setting different values to see what’s going on. While doing that, I noticed one function was calculating the times wrongly, and was telling the driver to use values twice as big as originally intended. It took me some time to figure out why the author chose 31 << (x + 1) instead of 2 << (x + 1) * 31, which is what a direct function conversion would return, until I realized that a shift basically means multiplying by two, and x << 1 is basically the same as (2 << 0) * x, but the author missed that x = ptv + 1, and so it should be 31 << (ptv + 2). Anyway, after gaining confidence in these timeout values, I could set big timeouts in the range of seconds without overflowing the registers, and see exactly what they were doing on timescales I could notice.

So yeah, these registers were doing what I thought they should be doing. Nice try.

Time to go back to the interrupt handling. After reading the code of twl4030-irq, which is supposed to fire the irqs that the keyboard driver eventually gets, and then reading some kernel core code as well, it was not really clear to me how these were all woven together, so I added some printks.

Before going forward, I’ll briefly explain what TWL4030 is. To my understanding, it’s basically an integrated circuit that has many functions, one of them being a keyboard controller. So there is no dedicated IRQ line for the keyboard interrupt; TWL4030 has one level interrupt, from which the right module IRQ is demultiplexed. OK.

I was hoping for a chain of actions like pih -> sih (from twl-core) -> keyboard, but no, there was only one action in the chain. Fortunately, when I reverted the patch I got a warning with a backtrace where I could see something like twl4030_irq_thread->generic_handle_irq->handle_simple_irq. It was really straightforward, but I couldn’t see why. There was a lot of code in twl4030-irq for a “secondary interrupt handler”, but it didn’t seem to be called at all, and I didn’t see how generic_handle_irq was calling handle_simple_irq.

Time to step back for a second. I remembered a patch from Neil Brown trying to fix something regarding how this sih stuff was called (here), because it was not called at all; but those patches are supposed to be fixes for other patches that are not applied at this point, so they really wouldn’t help.

At this point I’m pretty much stuck, as I have no idea how all this code is connected. You might think that all this is way over my head, and it might be, even if just a little, but that has never deterred me. I know that as long as I have the code, and a way to run it, I can figure out how it works, so I just keep trying.

After reading the code more carefully, I noticed that only some twl “modules” had a sih set up, and the keyboard was not one of them, so that explained the sih part. And then I noticed a part of the irq handling that dealt with “chips”, and one interesting function that set up the chip’s “handler”. Wait a second… so there’s a chip handler, and a client handler. A few keystrokes later and I found this gem: irq_set_chip_and_handler(i, &twl4030_irq_chip, handle_simple_irq). Finally! So I could now see the whole chain of events, but alas, I still had no idea why things were failing.

I thought maybe some states were miscalculated, so I found a function that is used to print everything related to an irq, but it was hidden in some internal code, so I had to do a few tricks to use it in the keyboard driver. All the states seemed to be OK, except one… the irq count, which showed something like 0, 0, 0, 1, 2, 2, 3 where it should have shown 0, 0, 1, 1, 2, 2. I looked at all the places where this counter is modified, and I found an interesting function: handle_nested_irq. So I replaced handle_simple_irq with it just for fun, and…

BAM!

Everything worked perfectly.

Why, Why, Why

So, now I had found a fix, but why did it work? If I want to fix this in a way that other people will not get bitten, I have to find the proper fix; “this seems to make things work here” usually doesn’t convince many kernel hackers.

Using handle_nested_irq makes the compiler throw a warning right away, because the number of arguments is different, so I looked at how other people use this function, and it’s indeed very different: people call handle_nested_irq directly, instead of generic_handle_irq, which would eventually call handle_simple_irq, because that’s the chip’s configured handler. It all boils down to the difference between handle_nested_irq and handle_simple_irq, and looking at them, it’s clear that handle_nested_irq does much less; specifically, it directly calls the action->thread_fn() callback, which is the keyboard interrupt handler, rather than waking up the thread that runs the same function.

And this finally shines some light: the reason the interrupt counter is wrong is that the thread is woken after the irq has been processed, so the code that is supposed to clear the interrupt hasn’t been called yet, and a second, spurious interrupt is generated. The relevant code is supposed to read a register, and the hardware clears the interrupt on a read (or write) operation. The reason handle_nested_irq works is that it doesn’t bother with the interrupt thread at all; it calls the function right away on the same thread, and thus the interrupt is marked as handled at the right time (after it has really been handled).

And this also explains why other people call handle_nested_irq instead of generic_handle_irq: you are supposed to call it inside an interrupt thread (which is configured with request_threaded_irq), but then there’s a thread lying around that nobody uses. Indeed, Neil Brown noticed the same (here), so irq_set_nested_thread() will make sure that no extra thread is created.
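To make the difference concrete, here is a minimal sketch of the pattern (the decode helper is invented for illustration; this is not the actual twl4030-irq code):

#include <linux/interrupt.h>
#include <linux/irq.h>

/* hypothetical helper: read the PIH status register and return the
   virtual irq number of the module (e.g. the keypad) that fired */
static unsigned int decode_pih_status(void *data);

/* the parent (demultiplexing) handler, running in its own thread */
static irqreturn_t twl_parent_irq_thread(int irq, void *data)
{
    unsigned int module_irq = decode_pih_status(data);

    /* run the module's thread_fn right here, in this same thread,
       so the interrupt gets cleared before we return */
    handle_nested_irq(module_irq);

    return IRQ_HANDLED;
}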

So both irq handlers must be threaded, the one for the twl-core and the one for the keyboard, not just the one for the keyboard as it is right now. Fortunately, Felipe Balbi already fixed that (see this patch series), but that broke things badly, and then Neil Brown fixed them (see this). Interestingly enough, I had already tested those patches for other reasons, but at that point I wasn’t sure if they would fix the keyboard problems; I just assumed they would.

Good, good; it looks like we finally have the whole picture, but the job’s not done yet.

Getting it “done”

Now the question is: how to fix this? One limitation of the ‘stable’ kernel trees (at least AFAIK) is that the patches should be in the latest Linus tree, so whatever solution is found should be in the main tree.

Ideally, the whole stack of irqs should be reorganized to use the nested/threaded API, or none at all. So this patch should have been applied only after this patch series, and not two years beforehand. I pondered for quite some time whether a middle ground could be found, either with workarounds in the core irq handler, or in the keyboard one. Eventually I came to the conclusion that since the core is supposed to set irq_set_nested_thread() anyway (just trust me), the keyboard handler is the one that should check whether it’s nested or not, and set either the thread handler or a normal one; this code would work both before and after the patches applied for 3.2. Unfortunately, the ‘nested’ flag is supposed to be internal to the kernel’s thread handling code.

So there’s really no way around it: either the original patch is reverted (along with other possibly similar ones for other twl modules), or the whole twl core irq handler gets the new code for 3.2 backported, and that code isn’t even fixed yet (as of v3.2-rc5) (and maybe it never will be, the way things are progressing).

Then the question arises: is all that new code in v3.2 working properly? I went ahead and tried it, and lo and behold, the keyboard works as expected. I then backported all those patches to v3.1, and they all worked too.

But does it really make sense to apply all those patches to all the kernels after v2.6.33? Or does it make sense to revert a few of them? That’s not for me to decide, so I asked the community, after I did all that work, in this mail.

Sometimes the result of a week’s work ends up being a one-line patch, sometimes it’s just one character, and in this case there was no patch (at least from me), but at least the issue is clearly identified, as well as the fixes. And that means nobody has to do this work again.

Well, the fun is not over… we still need to synchronize the maintainers, so the right patches land in Linus’ tree for v3.2, and are back-ported to the relevant linux-stable trees.

Kernel development is hard, let’s go shopping

What are you talking about? This isn’t even kernel development, it’s just some legwork ;) I’m going to be Topper for a second and say “That’s nothing!”; there are way more complicated and challenging issues kernel developers confront all the time. I thought this was interesting because of all the steps I had to go through, and because it involved things I had no idea about.

Was this worth my time? Well, I learned what a virtual terminal actually is, for real, and not just some vague notion, like when common people say “My browser? Yeah, I use Google… No?”. I also learned a bit more about the IRQ handling API in the Linux kernel. And BTW, all this threaded interrupt API started because of real issues, and the removal of IRQF_DISABLED (nice LWN article about it here); that thread was fun to read years ago, even when I didn’t understand most of it. Maybe it’s time to read it again :)

For me, the best part is the satisfaction of knowing that I really “got it”. I mean, I really understand what causes the problem, many of the possible solutions, including hacky and proper ones, and exactly what to do if I get bitten by a similar problem regarding threaded IRQs.

And now that the keyboard is fixed, on with the next problem on the N900 (which I had stopped working on, because I thought a functional keyboard would help to debug it, as opposed to shutting down, removing the MMC, plugging it into a PC, writing a script with the commands I want to run, unmounting, plugging it back, and booting [I can’t use USB networking for this issue to appear]).

No project is more important than its users

Let’s begin by looking at some quotes from ELCE 2011 kernel panel:

Linus Torvalds:

The biggest thing any program can do is not the technical details of the program itself; it’s how useful the program is to users.

So any time any program (like the kernel or any other project), breaks the user experience, to me, that’s the absolute worst failure that a software project can make.

No project is more important than the users of the project.

Alan Cox:

If you want to understand the importance of not suddenly changing your users’ experience,
I would go and take a look at GNOME 3.0.
[Laughter]
Irrespective of whether it is a good user interface design or not.
It’s a demonstration of why you don’t suddenly change everything on people who rely on what you were doing.

If you look at it from the point of view of the people producing the hardware and technology. Then there’s always a pressure from them to get every new feature and every new thing enabled.

But if you talk to the users of these systems, most of them are primarily concerned that what used to work continues to work and in the same way.

You can watch the whole video to get a sense of how important it is for Linux developers not to break user experience; also, how to avoid breaking the API while still moving forward and not increasing code complexity, or how to break the API without users noticing.

This should be common sense, but some people don’t agree and there are various arguments I have heard over the years.

Developers vs. Users

According to some people, it doesn’t matter what the users want, only what the developers want, because at the end of the day, it’s the developers who will implement it. This mentality even led to the term Free Software User Entitlement Syndrome; the idea that users somehow have a say in what happens in the project is dismissed with passion by some people.

This kind of thinking ignores a very basic fact: there’s no such thing as a user/developer divide. We were not branded at the moment of birth: “you shall be a developer”. No, developers were mere users at some point in time. Maybe they were not so good at programming when they started, but they eventually learned all the tricks. Or maybe they were already good programmers, but didn’t want to contribute right away, until they saw the software was actually useful, saw an area of opportunity, and had enough time.

Moreover, there are all kinds of developers: the ones that contribute only one patch, the ones that contribute a series of patches they did in one weekend, the ones that contribute sporadically, the ones that contribute often (but are not part of the core team), and the core team. There is no natural division either; it’s a continuum, and people often move back and forth in it as time and motivation permit.

In fact, this continuum of developers follows a power-law distribution, also referred to as the long tail. It is hard to see in the following image, but the total area on the right (yellow) is the same as on the left (green), which is why the term long tail is used to point out that one should not dismiss the people who might be labeled as outsiders, or not part of the “core”:

Here is a graph of all the contributors to the git project ordered by the number of commits, which is plotted on the y axis:

It’s not easy to see, but there’s a long tail there. Here it’s plotted on a logarithmic scale, along with the graph of a Pareto function:

This means that 70% of the people contributing to git have provided fewer than 6 patches, and 37% only one patch. Sure, one patch might not look like a lot, but as Clay Shirky points out: what if it’s a patch for a buffer overflow? What if it’s the difference between a malevolent hacker being able to exploit your system or not? Traditional companies can’t make use of people like Zack Brown, who contributed only one patch to the project, but open source organizations can, so let’s make use of them.
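If you want to reproduce these numbers, they can be computed from a clone of git.git; a sketch:

$ git shortlog -n -s HEAD | awk '{print $1}' | sort -n | uniq -c
# first column: number of authors; second: their commit count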

And let’s not forget that this continuum is not static; people often move up (and down) the ladder. Good projects exploit this long tail by not dividing developers into groups, and by making things easy for newcomers. Junio C Hamano, the maintainer of git, points out:

During the entire history leading to 1.6.6, 710 authors contributed code
and documentation. The changes since 1.6.5 were made by 99 authors, among
which 17 are the new contributors to the project.

Source.

And of course, this long-tail graph doesn’t even take into account the thousands of users, many of which are potential contributors. Plus, you have people like Ingo Molnar, who is a very active Linux kernel developer, but has only committed 6 patches to the git project; does that mean he is not a “developer”?

Let’s get rid of this idea that there is a division between users and developers; in open source projects, this division exists only in people’s imagination.

Lack of resources

As Linus and other developers pointed out in the video above, you can find ways to provide both the old and the new behavior. Often this is not difficult at all, but it requires some imagination.

Unfortunately, some people disagree and just drop features without leaving any way to use the previous behavior, and they rationalize the decision using the “lack of resources” card. Sometimes they don’t even try, but sometimes it’s true that it’s difficult to implement something while keeping the old behavior. However, something being difficult is not an excuse not to do it; it might take more time, but it’s still doable.

This is the point that is often ignored; they don’t want to spend more time.

But there’s another important point: perhaps the lack of developers is due to the number of features dropped. Think about it. As I explained, all developers were users at some point, and dropping features means dropping users (current and potential), which inevitably means dropping potential developers. The harder you try to keep your users (by keeping features), the more confidence these users will have that they will still be using your project in the future, and the more they will see contributing as an investment.

This is one of the reasons Linux is so popular (as I explained in another post, it’s the most important software project in history): everybody that has been using it for years knows (s)he will still be using it in the years to come, as they are not going to suddenly break everything. It’s a contract between users and developers that builds trust. With this trust it’s easy to gather more and more users, and thus, more and more developers.

This also applies the other way around; it’s not only about not dropping features, but also about adding unpopular ones. It is possible to add features not everybody will use, and still keep them isolated from the main experience, although this requires time and resources.

Imagine a developer that uses project X, which works pretty well for him, but he wants to add a feature. He might be reluctant to do so because a) there’s no guarantee that they will not drop it in the future, and b) they might reject it right away because they don’t want it in their main experience. Until there is a project like Linux that incorporates all the needs of all the people, this developer would have to either fork project X, or do nothing.

So, don’t drop any features, and incorporate all the features; it’s difficult, but this is the only sure way of being as successful as Linux, and of avoiding forks.

Users don’t know what they want

While there is some truth to this claim, it’s a very crude oversimplification of reality. For example, users didn’t know they wanted a touch screen before the iPhone, that is true, but after that, they knew they wanted it, and if you were, say, Nokia, you couldn’t say “oh, users don’t know what they want” and stay with non-touch interfaces; sometimes they do know. Plus, what they definitely know is what they don’t want: like an application that crashes constantly.

This argument is often used as a rationalization when features are removed and users complain. Some developers bring up this argument in a patronizing way, implying that they know better. Sometimes they do know better, but this should not be used as an excuse to stop listening to the complaints.

Conclusion

So, there you have it; it’s really simple: try really hard not to break your users’ experience, because unlike what some people like to think, it’s the users that make a project successful. The user base is where all the contributors come from, and the publicity too. If you do break user experience, try to do it in a way that can be reverted if your users complain, or even better, provide both an old and a new way of doing things; eventually the new way will be as good as the old, and everybody will move over.

Appreciate your users, listen to them, and never treat them as expendable, or even worse, as a nuisance. Basic stuff that many big projects forget.

Scrobbler for Maemo, now both on N900, and N9

Version 2.0 finally moved to Fremantle stable, so everybody can start using it :)

If you are not familiar with it, this package will see what music you are listening to on Maemo devices, and scrobble it to your favorite service: either last.fm, libre.fm, or both.

I already went over the features in an earlier blog entry, along with an explanation of how to make use of the “love” feature.

But now I have also managed to port this to Harmattan, and it works perfectly on my Nokia N9. Interestingly enough, the new UI has a “favorite” feature directly integrated; it took me some time, as it’s not publicly documented, but I finally managed to hook into it, so everything works seamlessly :)

I was rather impressed by how easy it was to port; I was able to leave all the GLib bits intact, and even libsoup and libconic are still supported, so I only had to make changes for the new qmafw. Thanks to the Qt guys for using GLib’s mainloop by default; it certainly made things easier for me :)

Update

You can find a debian package here: maemo-scrobbler 2.0-2.

Then, you would need to create a file ~/.config/scrobbler like this:

[lastfm]
username=foo
password=bar
            
[librefm]
username=foo
password=bar

That’s it :)

N9 Swipe undocumented feature; activate sane behavior

Ed Page recently blogged about his idea to improve the Swipe UI. Fortunately for him, a bunch of people and I had the same idea inside Nokia :)

Update: apparently this is not present in the images distributed with Nokia N950s; it was introduced after 25-3.

If you open ~/.config/mcompositor/mcompositor.conf, you’ll see a bunch of swipe-action-foo configs, all of them set to “away” (by default).

You can, however, change them to something like:
swipe-action-up: switcher
swipe-action-down: close
swipe-action-left: events
swipe-action-right: launcher

And voilà! Now the action that happens depends on the direction in which you swipe. You need to kill mcompositor for the changes to become active (or send SIGTERM/SIGINT, I don’t remember which one).
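Something like this should do it (assuming the system respawns the compositor):

$ kill $(pidof mcompositor)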

This was possible because I got fed up arguing about the benefits of swiping down to close applications, and decided to implement the thing by myself. While doing that, I also decided to implement an idea that was flying around, which was to do something different depending on the direction of the swipe. It took me about two days, including making it configurable; most of that was spent familiarizing myself with the code, which was split into multiple packages, and wrapping my head around C++, which I have avoided as much as possible. That was much less time than the time spent discussing, before and after I implemented the patches. Fortunately, after seeing the thing in action, many people jumped on the bandwagon, and at least swipe-down-to-close is available to the masses, which people seem to like. Guerrilla design ;)

However, after a lot of thought, I realized the configuration I mentioned above is ideal. This is my rationale:

Rationale

The current design is inconsistent; when you swipe away, you never know in which desktop view you are going to end up, especially if you just unlocked the device. You might end up in ‘switcher’, ‘launcher’, or ‘events’. Same action, different behaviour equals inconsistency.

There’s a simple solution: utilize the simple mental model required to navigate the desktop views.

Swipe mental model

This way, it’s easy to know exactly what will happen when swiping in each direction, decreasing the number of actions needed from the user. The same action always produces the same behavior.

It is unclear if this mode is ever going to be officially supported on the device, despite the fact that I think it’s obviously superior (and Ed Page seems to agree, as he came up with the exact same idea by himself); maybe some people are worried about updating manuals and whatnot, but at least it’s trivial to activate, and anybody can create a 3rd party app for that ;)

BTW, this is yet another reason why I distrust people telling me that I’m a “geek” and that “designers” know better when it comes to design. In the words of Elaine Morgan: yes, they can all be wrong; history is strewn with occasions when they all got it wrong.

Enjoy :)