GStreamer, embedded, and low latency are a bad combination

This has been a known fact inside Nokia (MeeGo) for quite a long time, due to the various performance issues we’ve had to work around, but for some reason it wasn’t acknowledged as an issue when it was brought up on the mailing list.

So, in order to prove beyond reasonable doubt that there is indeed an issue, I wrote this test. It is very minimal; there’s essentially nothing of a typical GStreamer pipeline, just an element and an app that pushes buffers to it, that’s it. But then, optionally, a queue (a typical element in a GStreamer pipeline) is added in the middle, which introduces a thread boundary, and then the fun begins:

[Graph: CPU time vs. buffer size, x86]
[Graph: CPU time vs. buffer size, ARM]

The buffer-size legend gives the exponent (5 means 2^5 = 32 bytes), and the CPU time, as reported by the system (getrusage), is in ms. You can see that on ARM systems not only is more CPU time wasted, but adding a queue makes things worse at a faster rate.

Note that this test is doing nothing but pushing buffers around; all the CPU time is spent on GStreamer operations. In a real scenario the situation is much worse, because there isn’t only one thread but multiple, and many elements involved, so the wasted CPU time I measured has to be multiplied many times.
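
For concreteness, here is a sketch of the kind of harness the test uses. This is not the original code (which used 0.10-era APIs); it’s a minimal reconstruction assuming GStreamer 1.x, with the element names, sizes, and the crude queue-drain wait chosen for illustration:

```c
/* push-test.c: push 16 MB as small buffers into a fakesink, optionally
 * through a queue, and report CPU time via getrusage().
 * Build: gcc push-test.c $(pkg-config --cflags --libs gstreamer-1.0) */
#include <gst/gst.h>
#include <sys/resource.h>
#include <stdio.h>

/* user + system CPU time of the whole process, in ms */
static double
cpu_ms (void)
{
  struct rusage ru;
  getrusage (RUSAGE_SELF, &ru);
  return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000.0 +
      (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1000.0;
}

int
main (int argc, char *argv[])
{
  gst_init (&argc, &argv);

  gboolean use_queue = argc > 1;        /* any argument inserts a queue */
  gsize size = 1 << 5;                  /* 2^5 = 32-byte buffers */
  gsize total = 1 << 24;                /* 16 MB in total */

  GstElement *sink = gst_element_factory_make ("fakesink", NULL);
  GstElement *queue = use_queue ?
      gst_element_factory_make ("queue", NULL) : NULL;
  GstPad *srcpad = gst_pad_new ("src", GST_PAD_SRC);
  GstPad *sinkpad = gst_element_get_static_pad (sink, "sink");

  gst_pad_set_active (srcpad, TRUE);
  gst_element_set_state (sink, GST_STATE_PLAYING);
  if (queue) {
    gst_element_set_state (queue, GST_STATE_PLAYING);
    gst_pad_link (gst_element_get_static_pad (queue, "src"), sinkpad);
    sinkpad = gst_element_get_static_pad (queue, "sink");
  }
  gst_pad_link (srcpad, sinkpad);

  /* sinks require stream-start and segment events before any data */
  gst_pad_push_event (srcpad, gst_event_new_stream_start ("test"));
  GstSegment segment;
  gst_segment_init (&segment, GST_FORMAT_BYTES);
  gst_pad_push_event (srcpad, gst_event_new_segment (&segment));

  double start = cpu_ms ();
  for (gsize pushed = 0; pushed < total; pushed += size)
    gst_pad_push (srcpad, gst_buffer_new_and_alloc (size));
  gst_pad_push_event (srcpad, gst_event_new_eos ());

  if (queue) {                          /* crude wait for the queue to drain */
    guint level;
    do {
      g_usleep (1000);
      g_object_get (queue, "current-level-buffers", &level, NULL);
    } while (level > 0);
  }

  printf ("%s: %.3f ms\n", use_queue ? "queue" : "direct",
      cpu_ms () - start);
  return 0;
}
```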

Now, this has been profiled before, and everything points to pthread_mutex_lock, which is only a problem when there is contention, and contention happens more often in GStreamer when buffers are small. When a mutex is contended, the futex syscall is issued, which is very expensive on ARM, although it probably depends on which specific system you are using.
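
To see the contention effect in isolation, here is a quick standalone sketch, separate from the GStreamer test and illustrative only (numbers will vary by system): several threads hammering one shared mutex take the futex slow path, while per-thread mutexes stay on the fast userspace path.

```c
/* mutex-test.c: THREADS threads each lock/unlock a mutex ITERS times,
 * either all sharing one lock (contended, futex syscalls) or each using
 * its own (uncontended). Build: gcc mutex-test.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define THREADS 4
#define ITERS 1000000

static void *
worker (void *data)
{
  pthread_mutex_t *lock = data;
  for (int i = 0; i < ITERS; i++) {
    pthread_mutex_lock (lock);
    pthread_mutex_unlock (lock);
  }
  return NULL;
}

/* run the workers against one shared lock, or one lock per thread */
static double
run (int shared)
{
  pthread_t t[THREADS];
  pthread_mutex_t locks[THREADS];
  struct timespec a, b;

  for (int i = 0; i < THREADS; i++)
    pthread_mutex_init (&locks[i], NULL);

  clock_gettime (CLOCK_MONOTONIC, &a);
  for (int i = 0; i < THREADS; i++)
    pthread_create (&t[i], NULL, worker, shared ? &locks[0] : &locks[i]);
  for (int i = 0; i < THREADS; i++)
    pthread_join (t[i], NULL);
  clock_gettime (CLOCK_MONOTONIC, &b);

  return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main (void)
{
  printf ("contended:   %.3f s\n", run (1));
  printf ("uncontended: %.3f s\n", run (0));
  return 0;
}
```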

Fortunately for me, I don’t need good latency, so I can just push one-second buffers and forget about GStreamer performance issues. If you are experiencing the same and can afford high latency, just increase the buffer sizes; if not, then you are screwed :p
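
As a sketch of what “increase the buffer sizes” can look like in practice (assuming an audio capture pipeline and GStreamer 1.x; alsasrc is just an example): sources based on GstAudioBaseSrc expose a latency-time property, the microseconds of audio per buffer, so raising it from the 10 ms default to one second means roughly 100x fewer pushes.

```c
/* big-buffers.c: ask an audio source for one-second buffers instead of
 * the 10 ms default, trading latency for fewer (and cheaper) pushes.
 * Build: gcc big-buffers.c $(pkg-config --cflags --libs gstreamer-1.0) */
#include <gst/gst.h>

int
main (int argc, char *argv[])
{
  gst_init (&argc, &argv);

  GstElement *pipeline = gst_parse_launch (
      "alsasrc name=src ! queue ! fakesink", NULL);
  GstElement *src = gst_bin_get_by_name (GST_BIN (pipeline), "src");

  /* one-second buffers (1000000 us each); buffer-time (the total ring
   * buffer) must be larger than latency-time, so raise it too */
  g_object_set (src, "latency-time", (gint64) 1000000,
      "buffer-time", (gint64) 2000000, NULL);

  gst_element_set_state (pipeline, GST_STATE_PLAYING);
  g_usleep (5 * G_USEC_PER_SEC);        /* capture for a few seconds */
  gst_element_set_state (pipeline, GST_STATE_NULL);

  gst_object_unref (src);
  gst_object_unref (pipeline);
  return 0;
}
```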

Hopefully this answers Wim’s question of what a “small buffer” means, why it’s not good, and when it’s a problem.

Update

Ok, so the discussion about this continued in the mailing list, and it was pointed out that the scale is logarithmic, so the exponential result was expected. While that is true, the logarithmic scale matches what people experience; how else would you plot the range from 10 ms to 1 s? Certainly not linearly.

But there’s a valid point: the results should not be surprising. We can take the logarithmic scale out of the equation by dividing the total CPU time by the number of buffers actually pushed, as Robert Swain did in the comments; that should give a constant number, the CPU time it takes to do one push. The results indeed converge to a constant:

queue: 0.078, direct: 0.011 (ms per push)

This means that in a realistic use case of pushing one buffer every 10 ms through a queue, the CPU usage on this particular processor (800 MHz) is 0.078 / 10 = 0.78%.

Also, there’s this related old bug that recently got some attention and a new patch from Wim, so I gave it a try (I had to compile GStreamer myself, so the results are not comparable with the previous runs).

Before:
queue: 0.074, direct: 0.011

After:
queue: 0.065, direct: 0.007

So the improvement for the queue case is around 12%, while for the direct case it’s around 31%.

Not bad at all, but the conclusion is still the same: if you use GStreamer, try to use as few elements as possible, especially queues, and try to have the biggest buffer size you can afford, which means that having good performance and low latency at the same time is tricky.

Update 2

Stefan Kost suggested using ‘queue’ instead of ‘queue2’, and I got a Pandaboard, so here are the results with OMAP4.

Pandaboard (2 cores, 1 GHz):
queue: 0.017, direct: 0.004

N900:
queue: 0.087, direct: 0.021

i386 (2 cores, 1.83 GHz):
queue: 0.0059, direct: 0.0015

So, either futex got better on Cortex A9, or OMAP4 is so powerful it can’t be considered embedded :p


11 thoughts on “GStreamer, embedded, and low latency are a bad combination”

  1. Felipe, the git repository is not accessible.

    What is the buffer “size” exactly? Does that change the rate at which the buffers are pushed indirectly in your test? The smaller the buffer, the more pushes?

  2. @Marc-Andre cgit on fd.o takes a while to show new repos; it’s visible now.

    The test always pushes 0x1000000 bytes (16 MB) in total; the only thing that varies is the buffer size, and thus the number of buffers as well. So yes, the smaller the buffers, the more pushes, which is what happens in a real low-latency use case.

  3. The worst case will be the most buffers per second. Video will have at most 30 buffers per second, or perhaps 60 on some really capable device. For low-latency compressed audio conversations, frames are usually at their shortest around 10 ms (G.729), which is 100 buffers per second. Your graph plots the CPU time used to push through 16 MB in buffers of 2^size bytes:


    buffer size (log2) | buffers per 16 MB | CPU time spent (ms)
    -------------------|-------------------|--------------------
    17                 | 128               | ?
    16                 | 256               | ?
    15                 | 512               | ?
    14                 | 1024              | ?
    13                 | 2048              | ?
    12                 | 4096              | ?
    11                 | 8192              | 500
    10                 | 16384             | 1000
    9                  | 32768             | 2500
    8                  | 65536             | 5000
    7                  | 131072            | 10000
    6                  | 262144            | 20000
    5                  | 524288            | 40000
    4                  | 1048576           | 82500
    3                  | 2097152           | 165000

    The CPU time spent (in ms) is taken from your ARM graph with a queue element. I apologise for any inaccuracies; you’re welcome to conduct the calculations yourself with the original data.

    Now, I just quickly looked at the CPU time spent per buffer for 2^11 and 2^3 sized buffers and it’s in the 0.06-0.08 ms range. Perhaps it varies a bit through the course of the data, but we’re in the ballpark.

    So with our worst case of 160 buffers per second of audio and video, that’s 160 * 0.08 milliseconds per second, which works out as 100 * 160 * 0.08 / 1000 = 1.28% CPU time spent pushing the data through your simple graph with a queue on an ARM.

  4. Yes, 1.28% of CPU on this particular processor, and with this minimal pipeline. Realistic pipelines would have more queues and more elements, and some people might have much slower processors, or the frequency might be dynamically scaled down. We’ve seen extra CPU usage of 5-10% just from decreasing the buffer size.

    Anyway, my purpose was to plot exactly how this waste happens, and how to avoid it: fewer elements, fewer queues, bigger buffer sizes (which goes against low latency). Anything after 2^7 is just to make clear where things are going :)

  5. Felipe, did you use “--enable-gobject-cast-checks” on configure for core?
    For others, the above tests use queue2; in most cases one can use “queue silent=true”, which is cheaper too.

  6. @Rajeev You would have to be more specific than that. Which kind of clip? Which codec elements are you using? Which video sink are you using? Which X.org driver?
