Nürnberger Knoblauchsland from my bicycle.
Here’s something I wrote a couple years ago. Better here than lost in my hard drive. Cheers. (Note: this was before Bela)
Overview and state of the art
Until recently, the only way to implement audio processors efficiently on low-power platforms was a modular approach: separate embedded devices, each handling a specific function (e.g. user interface, signal processing or memory management) on an architecture optimized for that function. The price of this very efficient configuration is poor portability, difficult updates, low-level development and a lack of flexibility.
Recent embedded devices in the System on Chip (SoC) category can run operating systems akin to those on desktop PCs or laptops, with a highly developed interface to their peripherals. System integration for a possible audio processor/synthesizer is therefore more straightforward. On the one hand, digital signal processing algorithms can be developed on a higher layer and, since their interaction with the processing core and peripherals is already handled by the operating system, they become easier to port.
Real time performance on general-purpose systems
On the other hand, these devices are not optimized for real-time audio processing, at least not by default. Task schedulers on general-purpose operating systems are designed to maximize CPU usage within the available processing power. The audio thread, like any other task handled by the OS, is therefore scheduled to maximize throughput and to minimize idle processor time among tasks. This is not necessarily compatible with real-time processing, where each audio frame must be fully processed before the next one arrives. If this deadline is not met, audio drops occur. Under a default configuration, the task scheduler accepts missing the audio deadline as long as CPU usage is deemed optimal in a statistical sense. This kind of scheduling may be sufficient and cause no drops on higher-power systems (e.g. desktop/laptop computers) given enough buffering, but low-power systems quickly expose its limitations.
To overcome this problem, the task scheduling of the operating system needs to be modified so that a designated thread gets the highest priority. The scheduling mechanism has to use all available resources to meet the real-time deadline of that thread (e.g. audio), even if resource use is then not optimally distributed among all tasks.
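To put a number on the deadline discussed above, here is a minimal sketch in Python. The function names are mine and purely illustrative; the only assumption is that one block of N frames must be computed within the time those N frames last.

```python
def block_deadline_ms(block_frames: int, sample_rate_hz: int) -> float:
    """Time available to process one audio block, in milliseconds."""
    if block_frames <= 0 or sample_rate_hz <= 0:
        raise ValueError("block size and sample rate must be positive")
    return 1000.0 * block_frames / sample_rate_hz


def misses_deadline(processing_ms: float, block_frames: int, sample_rate_hz: int) -> bool:
    """True if computing one block takes longer than the block itself lasts."""
    return processing_ms > block_deadline_ms(block_frames, sample_rate_hz)


if __name__ == "__main__":
    # Print the per-block deadline for a few common block sizes at 48 kHz.
    for frames in (64, 256, 1024):
        print(frames, "frames ->", round(block_deadline_ms(frames, 48000), 3), "ms")
```

At 48 kHz a 64-frame block leaves roughly 1.3 ms, which is why a scheduler that only cares about average throughput can easily miss it.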
Real-Time Linux kernel
In recent years, a modification of the Linux kernel that supports full task preemption (PREEMPT_RT) has emerged, which makes the behavior described in the previous section possible. Although initially meant for industrial control applications, it has given audio applications a level of stability not reached before on this kind of general-purpose OS.
Modular Audio Signal Processing using Pure Data
Pure Data (PD) is a modular signal processing system created by Miller Puckette. Its main focus is signal processing and synthesis of audio and multimedia streams for artistic purposes. Although the two have diverged in their development direction in later years, Pure Data is very similar to its counterpart, the visual programming environment Max/MSP (also created by Puckette himself). To a certain extent, Pure Data can be considered a free, open-source version of Max/MSP. Some of the reasons for using PD are:
- It is modular, and its visual style is very similar to that of other low-level audio processing software such as Reaktor. It can implement block-wise and/or sample-based processing. Instruments/processors already developed in Reaktor should be easy to port.
- Due to its open-source nature, its high configurability is especially well suited for low-power devices, where very specific tuning of resource handling might take place in order to reach a higher efficiency or stability.
- It runs under Linux and it is free. No software license has to be paid.
- PD can also be embedded as a library via libpd and integrated into external apps, games, web pages and art projects.
Choosing the right device
Choosing the right device for implementation requires a fairly extensive feasibility study of the choices available on the market. Many aspects need to be considered, for example:
- Computational power
- Power consumption
- Integration on end product and development tools
- Reliability and availability of the parts in the long term
An extended explanation of the trade-offs involved in each of the previous items would fill a book by itself. Nevertheless, some rough general guidelines for estimating computing power requirements can be mentioned to motivate the next sections:
- A first version of the software must already be written, possibly in a relatively low-level language like C or in a modular environment. This way, profiling information about the most resource-demanding modules can be gathered. It is also important to know the nature of the modules involved and whether they can be optimized algorithmically. For example, frequency analysis can be implemented very efficiently through Fast Fourier Transform algorithms. Many processor architectures include instruction set features suited to the optimizations this task involves. Some even include dedicated hardware. The same can be said about large matrix multiplications with the help of block processing and caching.
- Floating point performance can also be an issue if the software is to be implemented in this numerical representation. In short, floating point has qualities that make it desirable for high-quality audio processing, such as higher dynamic range and smaller error propagation in intermediate calculations. Development time in floating point is usually shorter because it is easier to implement than fixed point. Nevertheless, floating point units are expensive, and most of the available low-power devices perform poorly with this representation or do not support it at all. This has been changing in recent years, but floating point performance still varies a lot among solutions within the same application range.
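The fixed point versus floating point trade-off can be made concrete with a small Python sketch. It assumes the common Q15 format (16-bit signed, 15 fractional bits), which is my choice for illustration and not something tied to any particular device; all names are hypothetical.

```python
Q15_ONE = 1 << 15  # 32768, the scale factor of the Q15 format


def to_q15(x: float) -> int:
    """Quantize a value in [-1, 1) to a signed 16-bit Q15 integer, saturating."""
    return max(-Q15_ONE, min(Q15_ONE - 1, round(x * Q15_ONE)))


def from_q15(n: int) -> float:
    """Convert a Q15 integer back to a float."""
    return n / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers; the shift discards the low 15 bits."""
    return (a * b) >> 15


def repeated_gain_error(x: float, gain: float, stages: int) -> float:
    """Absolute error of applying `gain` `stages` times in Q15 versus float."""
    acc_fixed, gain_fixed, acc_float = to_q15(x), to_q15(gain), x
    for _ in range(stages):
        acc_fixed = q15_mul(acc_fixed, gain_fixed)  # error is added at every stage
        acc_float *= gain
    return abs(from_q15(acc_fixed) - acc_float)


if __name__ == "__main__":
    print("error after 20 gain stages:", repeated_gain_error(0.3, 0.9, 20))
```

Each fixed point multiplication throws away low-order bits, so the error grows with the number of intermediate stages, while the float version keeps its relative precision. This is the "error propagation on intermediate calculations" argument in miniature.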
Example application on BeagleBone Black
The next section outlines an example application of a PD patch on the BeagleBone Black. The main aspects of this example should carry over easily to newer-generation SoCs running Linux that might show better performance. The general outline is as follows:
- Installation and configuration of a real-time kernel
- Configuring sound and additional tuning for audio performance
- Installation/launching of Pure Data and adapting of patches for command line operation (no graphical user interface)
- General stress/latency measurements and comparison with a non-real-time kernel.
The setup consists of a BeagleBone Black (BBB) with the following:
- AM335x 1GHz ARM® Cortex-A8, 512MB DDR RAM
- Revision C: 4GB 8-bit eMMC on-board flash storage
- 3D graphics accelerator, NEON floating-point accelerator, 2x PRU 32-bit microcontrollers
For audio I/O a Saffire USB 2i2 (First Generation) is used. No additional drivers are needed as Linux handles everything. All other configurations are considered default unless noted otherwise.
Installation and configuration of a real-time kernel
The basic, tried-and-true steps for installing a real-time kernel can be found at https://eewiki.net/display/linuxonarm/BeagleBone+Black. The kernel must be cross-compiled on a desktop PC with the GNU ARM cross compiler. The procedure basically consists of pointing the build at the path of the cross compiler and then running a script that checks out the source code, applies the real-time patches and builds the kernel. This can take quite long depending on the computer. The command line sequence is:
For am33x-rt-v4.4 (Longterm 4.4.x + Real-Time Linux):
git checkout origin/am33x-rt-v4.4 -b tmp
eewiki.net: [user@localhost:~$ export kernel_version=4.4.11-bone-rt-r10]
Good symptoms: LED D2 of board flashes heartbeat after successful initialization.
To be able to log in via USB through PuTTY or similar on Windows, update /etc/network/interfaces to add a virtual Ethernet port:
cat >> /etc/network/interfaces <<EOF
add the following lines:
iface usb0 inet static
host name: 192.168.7.2
- Root/root login not possible:
By default, the SSH server denies password-based login for root.
In /etc/ssh/sshd_config, change:
There are a couple of similar posts suggesting that this could be a problem with spawning a shell because of incorrect settings for the shell path in /etc/passwd
To check this, determine that your user shell path exists and is executable, for example:
# cat /etc/passwd | grep tomh
tomh:x:1000:1000:Tom H:/home/tomh:/bin/bash <-- check this exists
Check shell exists:
# file /bin/bash
/bin/bash: ELF 64-bit
Configuring the sound (ALSA) driver
By default, the Linux distribution on BBB is configured to handle audio via the HDMI connection. So the configuration must be changed in order to redirect audio to (in this case) the USB card:
to see which number is assigned to USB and then change accordingly in:
Some other useful commands: alsamixer, aplay -L
mplayer -ao alsa:device=hw=1.0 voice.wav -format s32le
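One way to pick out the card number of the USB interface is to scan the output of aplay -l. The Python sketch below assumes the usual "card N: ID [name], device M: ..." line layout; the sample text is made up for illustration and is not a capture from the board described here.

```python
import re

# Matches lines such as: "card 1: USB [Some USB Audio], device 0: USB Audio [USB Audio]"
_CARD_LINE = re.compile(r"^card (\d+): (\S+) \[(.*?)\], device (\d+):")

# Made-up sample in the style of `aplay -l` (hypothetical device names).
SAMPLE_APLAY = """\
**** List of PLAYBACK Hardware Devices ****
card 0: Black [TI BeagleBone Black], device 0: HDMI hdmi-hifi-0 []
  Subdevices: 1/1
card 1: USB [Generic USB Audio], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
"""


def find_alsa_devices(aplay_output: str, keyword: str) -> list:
    """Return 'hw:card,device' names whose card id or name contains keyword."""
    key = keyword.lower()
    found = []
    for line in aplay_output.splitlines():
        m = _CARD_LINE.match(line.strip())
        if m and (key in m.group(2).lower() or key in m.group(3).lower()):
            found.append(f"hw:{m.group(1)},{m.group(4)}")
    return found


if __name__ == "__main__":
    print(find_alsa_devices(SAMPLE_APLAY, "usb"))
```

The resulting hw:card,device pair is what ALSA-aware tools expect when a specific device has to be named.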
How to find the hardware address of the sound card:
How to measure latency:
$ wget http://alsa.cybermirror.org/lib/alsa-lib-1.0.26.tar.bz2
$ tar xjvf alsa-lib-1.0.26.tar.bz2
$ cd alsa-lib-1.0.26/test
$ gcc latency.c -lasound -o latency
If it cannot find a header, install libasound2-dev and compile again
$ sudo apt-get install libasound2-dev
$ ./latency -m 256 -r 16000
Other useful reference: http://elinux.org/images/8/82/Elc2011_lorriaux.pdf
Additional tuning for audio performance
The following links will give a deeper insight on the whole process, in case needed:
Newer SoCs support a Linux feature (the CPU frequency scaling governor) that dynamically scales voltage and frequency according to the resources in use. If the processor is idle most of the time, the operating frequency decreases to save power.
While this is important for efficiency, it is suboptimal for real-time operation, where computational power must be fully exploited within the deadlines of audio block processing.
Useful result: When the governor is turned to performance (maximal power), the analog to digital converter (for analog inputs) can be used without problems, otherwise there might be some clicking in the audio signal.
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to email@example.com, please.
analyzing CPU 0:
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 300 us.
hardware limits: 300 MHz - 1000 MHz
available frequency steps: 300 MHz, 600 MHz, 800 MHz, 1000 MHz
available cpufreq governors: conservative, ondemand, userspace, powersave, performance
current policy: frequency should be within 300 MHz and 1000 MHz.
The governor "ondemand" may decide which speed to use
within this range.
current CPU frequency is 300 MHz (asserted by call to hardware).
root@beaglebone:~/# cpufreq-set -g performance
This should set the governor to performance and the CPU to 1 GHz.
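The same check can be scripted, for instance at boot before the audio application starts. This is a minimal Python sketch that reads and writes the sysfs file /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor; the function names are hypothetical, and writing to sysfs requires root.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def governor_path(cpu: int = 0, root: Path = CPU_SYSFS) -> Path:
    """Location of the scaling governor file for one CPU."""
    return Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"


def read_governor(cpu: int = 0, root: Path = CPU_SYSFS):
    """Return the current governor name, or None if cpufreq is not exposed."""
    try:
        return governor_path(cpu, root).read_text().strip()
    except OSError:
        return None


def ensure_performance(cpu: int = 0, root: Path = CPU_SYSFS) -> bool:
    """Switch to 'performance' if needed; True if it is active afterwards."""
    if read_governor(cpu, root) != "performance":
        try:
            governor_path(cpu, root).write_text("performance")
        except OSError:
            return False  # not root, or no cpufreq support on this system
    return read_governor(cpu, root) == "performance"


if __name__ == "__main__":
    print("governor before:", read_governor())
    print("performance active:", ensure_performance())
```

The `root` parameter exists only so the functions can be tried against a scratch directory instead of the real sysfs tree.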
Disabling unneeded services
By default, the BeagleBone comes with a lot of services activated for general-purpose development that are unnecessary here. Most of these services are present on the factory Linux and should already be stripped in a fresh install of the real-time kernel, but it is nevertheless good to double-check.
Useful result: so far, deactivating Node.js (if active) can also improve ADC performance. We definitely do not need JavaScript for now.
Here is a bash script that takes care of deactivating common services by default in Linux:
## Stop the ntp service
sudo service ntp stop
## Stop the triggerhappy service
sudo service triggerhappy stop
## Stop the dbus service. Warning: this can cause unpredictable behaviour when running a desktop environment on the RPi
sudo service dbus stop
## Stop the console-kit-daemon service. Warning: this can cause unpredictable behaviour when running a desktop environment on the RPi
sudo killall console-kit-daemon
## Stop the polkitd service. Warning: this can cause unpredictable behaviour when running a desktop environment on the RPi
sudo killall polkitd
## Only needed when Jack2 is compiled with D-Bus support (Jack2 in the AutoStatic RPi audio repo is compiled without D-Bus support)
## Remount /dev/shm to prevent memory allocation errors
sudo mount -o remount,size=128M /dev/shm
## Kill the userspace gnome virtual filesystem daemon. Warning: this can cause unpredictable behaviour when running a desktop environment on the RPi
## Kill the userspace D-Bus daemon. Warning: this can cause unpredictable behaviour when running a desktop environment on the RPi
## Kill the userspace dbus-launch daemon. Warning: this can cause unpredictable behaviour when running a desktop environment on the RPi
## Uncomment if you'd like to disable the network adapter completely
#echo -n "1-1.1:1.0" | sudo tee /sys/bus/usb/drivers/smsc95xx/unbind
## In case the above line doesn't work try the following
#echo -n "1-1.1" | sudo tee /sys/bus/usb/drivers/usb/unbind
## Set the CPU scaling governor to performance
echo -n performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Disabling kernel modules
Apparently not needed, but in any case it might be useful to have a reference: https://kernel-handbook.alioth.debian.org/ch-modules.html
Disabling virtual capes
To display the overlays currently enabled by the cape manager, type:
4: ff:P-O-L Bone-LT-eMMC-2G,00A0,Texas Instrument,BB-BONE-EMMC-2G
5: ff:P-O-L Bone-Black-HDMI,00A0,Texas Instrument,BB-BONELT-HDMI
uEnv.txt is actually on the root filesystem, at /boot/uEnv.txt
Installation of Pure Data
Pure Data comes by default with a graphical user interface for patching. Nevertheless, if a patch is going to run on a resource-constrained platform, a “headless” version (no GUI) is preferred. The patch must then be prepared beforehand and be fully functional. Pure Data can either be compiled from source, following the information on its webpage, or installed as a ready-compiled package in most Linux distributions. For Debian, the packages can be installed via apt-get.
Preparing a PD patch for “headless” runtime
PD features a DSP on/off switch that must be activated each time Pure Data is started. This can, and must, be done from within the patch if it is to start automatically (e.g. when the embedded system should work as an autonomous effect processor without a command line prompt).
Figure 1: example of a PD patch prepared for runtime without GUI.
From Fig. 1 it can be seen that the short 3-block section to the lower left end takes care of sending an execution order at start “loadbang”, followed by a 1000 ms. delay and a message box that tells PD to activate DSP processing. The delay is placed empirically in order to give some time to PD for initialization.
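For readers without the figure at hand, the three-object section can be reproduced as text. The Python sketch below writes a minimal patch file containing only the loadbang, the delay and the DSP message box; the canvas size and object coordinates are arbitrary values of mine, and the file layout follows my understanding of the plain-text .pd format rather than the original patch.

```python
def headless_dsp_patch(delay_ms: int = 1000) -> str:
    """Text of a minimal .pd patch that switches DSP on shortly after loading."""
    lines = [
        "#N canvas 0 50 450 300 10;",      # top-level window (arbitrary geometry)
        "#X obj 30 30 loadbang;",          # object 0: fires once when the patch loads
        f"#X obj 30 60 delay {delay_ms};", # object 1: waits for PD to finish starting
        "#X msg 30 90 \\; pd dsp 1;",      # object 2: message box sending "dsp 1" to pd
        "#X connect 0 0 1 0;",             # loadbang outlet -> delay inlet
        "#X connect 1 0 2 0;",             # delay outlet -> message box inlet
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file name; merge these objects into the real patch as needed.
    with open("dsp_on.pd", "w") as fh:
        fh.write(headless_dsp_patch())
```

In practice these three objects would simply be added to the existing patch in the PD editor; generating the file is only a way of showing the structure in text form.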
General audio latency measurements can be made via a set of tools available with varying degrees of accuracy.
Round trip latency
The latency test can be used for measuring latency between capture and playback with the round robin scheduler SCHED_RR.
root@beaglebone:~/alsa_repo/alsa_lib/alsa-lib-1.1.1/test# ./latency -m 64 -M 64 -P hw:1,0 -C hw:1,0 -p -s 1 -r 48000 -f S32_LE
“When called, the test/latency.c program will attempt to set period/buffer sizes based on the latency entered, starting from the -m,--min option (or the default minimum latency = 64 if not specified). If the run succeeds without errors with that setting, the program exits; otherwise, the latency is increased and the run repeated. If the run is successful here, the program exits, else the process continues until the -M,--max latency is reached.”
The problem with this approach is that it does not consider system and audio stream stability when the system is under heavy load (CPU or Memory).
Real case-scenario measurement
A good measurement of system performance in an application context could be, for example, running a simple PD patch and some CPU stress program at the same time, and then reducing the audio processing block size at a given sampling rate until audio drops start to occur. This should give an idea of the threshold on overall system performance under heavy computational load.
There are basically two ways PD can handle buffering and latency tradeoffs:
- Setting the sample rate and audio processing block size will automatically set the buffering time.
- Setting the sample rate and the buffering time (in ms.) will set the audio processing block size. This is usually the most convenient approach since it is directly associable to latency.
The main block size can be set via command line as a parameter, but multiple block sizes can be used within the same patch for handling different time/frequency resolutions of the DSP algorithms.
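The relation between buffering time and block count is simple arithmetic. The Python sketch below assumes PD's default DSP block size of 64 samples and rounds up to whole blocks; how PD rounds internally is not something I have verified, so treat the rounding as an assumption.

```python
import math

PD_BLOCK = 64  # Pure Data's default DSP block size, in samples


def buffer_frames(buffer_ms: float, sample_rate_hz: int) -> int:
    """Number of sample frames needed to cover buffer_ms of audio."""
    return math.ceil(buffer_ms * sample_rate_hz / 1000.0)


def buffer_in_blocks(buffer_ms: float, sample_rate_hz: int, block: int = PD_BLOCK) -> int:
    """Whole DSP blocks needed to cover the requested buffering time."""
    return math.ceil(buffer_frames(buffer_ms, sample_rate_hz) / block)


def buffer_ms_for_blocks(n_blocks: int, sample_rate_hz: int, block: int = PD_BLOCK) -> float:
    """Buffering time in milliseconds represented by n_blocks DSP blocks."""
    return 1000.0 * n_blocks * block / sample_rate_hz


if __name__ == "__main__":
    for ms in (50, 10):
        print(ms, "ms @ 48 kHz ->", buffer_frames(ms, 48000), "frames,",
              buffer_in_blocks(ms, 48000), "blocks of", PD_BLOCK)
```

So a 50 ms buffer at 48 kHz corresponds to 2400 frames, and a 10 ms buffer to 480 frames.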
An example command line would be:
pd -nogui -alsa -audiooutdev 3 -rt -r 48000 -audiobuf 50 -verbose -stderr -noadc sinewave.pd
Here pd is called “headless” (no GUI), using the ALSA driver and device number 3 for output (pd -listdev gives the list of available sound devices), with real-time priority, a sampling rate of 48000 Hz, an audio buffer of 50 ms, no analog-to-digital converter (no input available) and errors redirected to stderr. The patch outputs a simple sine wave. Disabling the analog-to-digital input when it is not needed allows somewhat smaller buffering (lower latency) without audio thread overruns.
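When sweeping the buffer size to find the point where drops start, it helps to generate the command lines instead of typing them. This Python sketch only uses the pd flags that appear in this post; the function names and the sweep values are illustrative.

```python
import shlex


def pd_command(patch, rate=48000, audiobuf_ms=50, outdev=None, indev=None, realtime=True):
    """Build the argument list for a headless pd run using the ALSA driver."""
    cmd = ["pd", "-nogui", "-alsa"]
    if indev is None:
        cmd.append("-noadc")  # no audio input requested
    else:
        cmd += ["-audioindev", str(indev)]
    if outdev is not None:
        cmd += ["-audiooutdev", str(outdev)]
    if realtime:
        cmd.append("-rt")
    cmd += ["-r", str(rate), "-audiobuf", str(audiobuf_ms), "-verbose", "-stderr", patch]
    return cmd


def audiobuf_sweep(patch, start_ms=50, stop_ms=5, step_ms=5, **kwargs):
    """Command lines with decreasing -audiobuf values, largest first."""
    return [shlex.join(pd_command(patch, audiobuf_ms=ms, **kwargs))
            for ms in range(start_ms, stop_ms - 1, -step_ms)]


if __name__ == "__main__":
    for line in audiobuf_sweep("sinewave.pd", outdev=3):
        print(line)
```

Each printed line can be run by hand while the stress test is active, or the argument list from `pd_command` can be handed to `subprocess.run`.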
Once the PD patch is running, we can use a second terminal to launch a stress program that loads the CPU and see how the audio thread responds. A good stress test is provided by the rt-tests suite:
root@beaglebone:~/rt_test/rt-tests# ./pi_stress --rr --uniprocessor
The priority inversion test pi_stress produces a heavy CPU load within seconds and will immediately affect the audio thread if scheduling is not properly configured. The --rr switch selects the round robin real-time policy again, SCHED_RR, although SCHED_FIFO can also be used.
Useful result: for reference, without a real-time kernel, at 50 ms buffering @ 48 kHz in PD, audio starts glitching as soon as pi_stress is running. The non-RT kernel does not pay much attention to the assigned priority either.
Measuring under a real time kernel
A real-time patched kernel allows scheduling priorities to be set at runtime for individual processes. It is therefore useful to give everything related to audio (and, for instruments, any interfacing with sensors) a high priority. Priorities are numbered from 1 to 99, 99 being the highest priority for the scheduler. Additionally, as explained earlier, two real-time scheduling policies are available for each process: SCHED_FIFO and SCHED_RR.
This priority assignment must be done in runtime from within the program or externally. Issuing:
ps -e | grep usb
lists the processes currently running, and the grep filter keeps only those associated with USB traffic (in this case, the sound card). The process IDs can be read from this output. If the in/out processes associated with the USB sound card are, say, 68 and 69, executing:
chrt -f -p 98 68
chrt -f -p 98 69
will change these two processes to real-time priority 98. This number can be lower, around 96 or even less and can be tuned by hand depending on the other processes to be scheduled.
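The same assignment can also be done from a program, which is handy when it has to happen automatically at startup. This Python sketch filters `ps -e` style output and then calls os.sched_setscheduler, the call that corresponds to chrt -f; the sample process list and names are invented for illustration, and the call needs root privileges on Linux.

```python
import os

# Made-up `ps -e` style output (hypothetical process names).
SAMPLE_PS = """\
  PID TTY          TIME CMD
    1 ?        00:00:02 systemd
   68 ?        00:00:00 usb-audio-in
   69 ?        00:00:00 usb-audio-out
  412 ?        00:00:01 sshd
"""


def pids_matching(ps_output: str, needle: str) -> list:
    """PIDs from `ps -e` style output whose command contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric PID; this also skips the header row.
        if len(fields) >= 4 and fields[0].isdigit() and needle in " ".join(fields[3:]):
            pids.append(int(fields[0]))
    return pids


def set_fifo_priority(pid: int, priority: int) -> bool:
    """Give `pid` the SCHED_FIFO policy at `priority`; False if not permitted."""
    lo = os.sched_get_priority_min(os.SCHED_FIFO)
    hi = os.sched_get_priority_max(os.SCHED_FIFO)
    if not lo <= priority <= hi:
        raise ValueError(f"priority must be between {lo} and {hi}")
    try:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False  # no such process, or insufficient privileges
    return True


if __name__ == "__main__":
    for pid in pids_matching(SAMPLE_PS, "usb"):
        print(pid, "->", set_fifo_priority(pid, 98))
```

On the real system the process list would come from `subprocess.run(["ps", "-e"], capture_output=True, text=True).stdout` instead of the sample string.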
PD can also be assigned a higher priority when running in case it is needed. Issuing
pd -nogui -alsa -audioindev 1 -audiooutdev 1 -r 48000 -audiobuf 10 -verbose -stderr -rt sinewave.pd
will hopefully run in a stable manner even with stress tests going on at the same time.
Another possible stress test is the stress tool for Linux:
- To spawn N workers spinning on the sqrt() function, use the --cpu N option.
- To spawn N workers spinning on the sync() function, use the --io N option.
- To spawn N workers spinning on the malloc()/free() functions, use the --vm N option.
- To allocate memory per vm worker, use the --vm-bytes N option.
- Instead of freeing and reallocating memory resources, you can redirty memory by using the --vm-keep option.
- To sleep N seconds before freeing memory, use the --vm-hang N option.
- To spawn N workers spinning on the write()/unlink() functions, use the --hdd N option.
- To set a timeout after N seconds, use the --timeout N option.
- To set a wait factor of N microseconds before any work starts, use the --backoff N option.
- To show more detailed information when running stress, use the -v option.
- Use --help to view help for using stress, or view the manpage.
Issuing stress -c 120000 therefore renders the system unresponsive, but if PD also runs at high priority, the audio thread never overruns :). The current system load as well as the assigned priorities can be seen with the Linux top command.
R. Birkett, “Enhancing Real-time Capabilities with the PRU (Sitara™ ARM® Processors),” 2015.
M. Puckette, “Pure Data,” [Online]. Available: https://puredata.info/.
M. Puckette, “libpd,” Pure Data pdlib, 2016. [Online]. Available: https://puredata.info/downloads/libpd. [Accessed 13 June 2016].
eewiki.net, “Installing a RT kernel,” [Online]. Available: https://eewiki.net/display/linuxonarm/BeagleBone+Black.
ALSA Project, “ALSA,” [Online]. Available: http://www.alsa-project.org/main/index.php/Test_latency.c.
P. Krzyzanowski, “Process Scheduling,” Rutgers University, 2015. [Online]. Available: https://www.cs.rutgers.edu/~pxk/416/notes/07-scheduling.html.
C. Williams and J. Kacur, “Cyclictest,” [Online]. Available: https://rt.wiki.kernel.org/index.php/Cyclictest.
F. Rowand, “Using and Understanding the RT Cyclictest Benchmark,” Sony Mobile Communications, [Online]. Available: http://events.linuxfoundation.org/sites/events/files/slides/cyclictest.pdf.
M. Barr, “Introduction to Priority Inversion,” [Online]. Available: http://www.barrgroup.com/Embedded-Systems/How-To/RTOS-Priority-Inversion.
A. Kili, “How to Install ‘stress’ Tool in Linux,” [Online]. Available: http://www.tecmint.com/linux-cpu-load-stress-test-with-stress-ng-tool/.
“top(1) - Linux man page,” [Online]. Available: http://linux.die.net/man/1/top.
Note: the real-time switch -rt does not seem to work in some versions of PD. Nevertheless, with a real-time kernel the priority can be set externally, so it is not really an issue anymore.