Title: ‘Surround sound audio codecs in broadcasting – an introduction and latest results from independent listening tests’
Location: The Royal Academy of Engineering, London
Description: Lecture by David Marston, BBC R&D
Start Time: 18:30
Date: 14th April 2009
Abstract
Surround sound systems are now becoming a popular addition to many people’s homes. This means there is now a demand for surround sound content to be delivered to homes via broadcasting, Internet or recorded media. Whichever way it gets to its destination, it is going to require data reduction along its journey. This may be in the transmission end of a broadcast chain, or in the transport of audio from a studio out over a broadcaster’s network.
This data reduction uses audio coders designed for surround sound. There are currently numerous different audio coders available, often with different attributes and performance. Choosing which coder to use is not a simple choice, and one of the key factors in this choice is the sound quality. It is inevitable that for serious data reduction, the coder will have to be lossy and therefore compromise sound quality. Our work assessed the sound quality of a selection of audio coders using the most accurate instrument of measurement available: the human ear. Here we present the codecs tested, how the tests were done, and of course the results.
Meeting Report
This paper described the methodology used for a series of evaluation tests conducted by members of the EBU on a range of commercially available audio codecs.
In his introduction, David explained that perceptual audio coding systems cannot be measured using conventional objective measuring tools, as one would do for wow and flutter, for example. An objective measure based on psychoacoustic principles, such as PEAQ, can work reasonably well with MPEG-style stereo codecs, but there is nothing available yet for surround systems. A disadvantage of using such a measure is that any new method is likely to be incorporated into a codec’s design to ensure good test results.
The only effective test, therefore, is subjective listening using humans – a slow and expensive process if a good-sized sample is employed, although you do get useful results.
There are various parameters that can be looked at: overall quality, spatial quality in the case of surround sound, intelligibility, cascaded codecs, and so on. When a selection of different codecs and coding rates has to be tested in multiple combinations, the complexity increases further. In these instances a measurement system such as PEAQ can be used as a pre-filter.
The main subjective testing methods today are MUSHRA (MUlti Stimulus test with Hidden Reference and Anchors, BS1534), which is designed for mid-range to higher quality codecs, can test multiple codecs at the same time and was used in the EBU tests; BS1116, which is designed for high quality codecs but assesses only one at a time; and P800. The last of these is for speech and was not relevant for these tests.
MUSHRA produces a quality value and BS1116 an impairment value for each codec. On occasions it may be relevant to have more than one value, for example for temporal and spatial quality. A single value makes testing faster, as well as being easier for the listener and for analysis; however, it can hide differences in listeners’ perceptions.
Ensuring a gender balance has also been a problem as most of the listeners have been male. Training is important, whoever takes the tests. Listeners must be taught to identify coding artefacts and other problems, as well as how to use the assessment interface. For scoring, a numerical scale is useful because it avoids interpretations of words like ‘Fair’ or ‘Good’.
Each listener hears five codecs; any more would make the test too tiresome and could degrade the accuracy of results. During the MUSHRA test listeners are always given the reference, and the randomised sequence also includes a hidden low quality anchor, a 3.5 kHz low-pass filtered version of the original. In the EBU test another, spatially reduced, anchor was added. For BS1116, listeners hear one codec at a time, which is compared with a hidden reference and the known reference. This takes much longer, so each listener is limited to four codecs.
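As an illustration of the low quality anchor described above, the following is a minimal sketch of a 3.5 kHz low-pass filter applied to one channel of audio. The report does not say how the anchor was actually produced; the windowed-sinc design, the 101-tap length, the 48 kHz sample rate and the function names are all assumptions made for this example. For surround material the same filter would presumably be applied to each channel in turn.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Design a Hamming-windowed sinc low-pass filter (assumed design)."""
    fc = cutoff_hz / sample_rate_hz          # normalised cutoff (cycles/sample)
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]         # normalise to unity gain at DC


def apply_fir(taps, signal):
    """Filter one channel by direct convolution (slow but dependency-free)."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * signal[i - k]
        out.append(acc)
    return out


# Hypothetical usage: build the anchor for one channel sampled at 48 kHz.
SAMPLE_RATE = 48000
anchor_taps = lowpass_fir(3500, SAMPLE_RATE)
channel = [math.sin(2 * math.pi * 10000 * n / SAMPLE_RATE) for n in range(1500)]
anchor_channel = apply_fir(anchor_taps, channel)  # 10 kHz tone is strongly attenuated
```

Direct convolution keeps the sketch free of external libraries; a real tool would normally use an optimised filtering routine.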
It is important to select a cross-section of experienced and novice listeners. Some may prove to have poor listening skills, or have a hearing impairment, but it is not always possible to identify this in advance. So it is better to use them for the test and reject their findings afterwards, often based on their ability to rank the hidden reference and low quality anchor.
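The kind of post-screening described above might be sketched as follows. This is an illustration only: the report does not give the actual rejection rule, so the stimulus labels, the requirement that the hidden reference be scored at least as high as every codec and the anchor at least as low, and the 80 percent pass fraction are all assumptions.

```python
def ranks_extremes_correctly(trial):
    """trial maps a stimulus label to its score (0-100) for one test item.

    Returns True if the hidden reference is scored at least as high as
    every codec and the low quality anchor at least as low as every codec.
    The labels 'hidden_reference' and 'low_anchor' are hypothetical.
    """
    ref = trial["hidden_reference"]
    anchor = trial["low_anchor"]
    codecs = [score for label, score in trial.items()
              if label not in ("hidden_reference", "low_anchor")]
    return (ref > anchor
            and all(ref >= s for s in codecs)
            and all(anchor <= s for s in codecs))


def keep_listener(trials, min_fraction=0.8):
    """Keep a listener who ranks the extremes correctly often enough.

    min_fraction is an assumed threshold, not one taken from the tests.
    """
    correct = sum(ranks_extremes_correctly(t) for t in trials)
    return correct / len(trials) >= min_fraction


# Hypothetical scores for two listeners over five items each.
reliable = [{"hidden_reference": 100, "low_anchor": 10,
             "codec_a": 70, "codec_b": 55}] * 5
unreliable = [{"hidden_reference": 40, "low_anchor": 80,
               "codec_a": 70, "codec_b": 55}] * 5

print(keep_listener(reliable))    # True
print(keep_listener(unreliable))  # False
```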
David showed a slide of the MUSHRA test interface and explained how the listener can select each of the examples in order to make direct comparisons with the reference. He went on to describe the listening set-up at Kingswood Warren – soon to disappear!
Choosing the test material is always difficult. It must be critical, in order to highlight coding artefacts, but at the same time be unbiased, eg not material that is known to disadvantage a specific codec. The material must also be appropriate for the application: a mixture of music, speech and jingles (which will already have been compressed) for a broadcast codec, for example. The final choice of ten pieces of test material was made by a selection panel.
One of the techniques used by Institut für Rundfunktechnik (IRT), which analysed the results, was the Spearman Rank Correlation. This looks at the ranking of all the scores, and if anybody’s ranking was massively different from the average they were rejected. Around ten percent of listeners were eliminated at this stage.
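A rank-based screening of this kind can be sketched as below. It compares each listener's scores with the panel's mean scores using the Spearman rank correlation and rejects listeners who fall below a threshold. The threshold of 0.5, the listener names and the scores are assumptions for illustration; the report does not state the value IRT used or exactly how the comparison was set up.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1             # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0                           # all scores tied: no ranking information
    return cov / (var_x * var_y) ** 0.5


def screen_listeners(scores_by_listener, threshold=0.5):
    """Split listeners into (kept, rejected) by correlation with the panel mean."""
    n = len(next(iter(scores_by_listener.values())))
    panel_mean = [mean(s[i] for s in scores_by_listener.values()) for i in range(n)]
    kept, rejected = [], []
    for name, scores in scores_by_listener.items():
        target = kept if spearman(scores, panel_mean) >= threshold else rejected
        target.append(name)
    return kept, rejected


# Hypothetical scores for five conditions; listener D ranks them in reverse.
scores = {
    "A": [95, 80, 60, 40, 15],
    "B": [90, 85, 55, 45, 20],
    "C": [100, 70, 65, 35, 10],
    "D": [20, 40, 50, 80, 90],
}
kept, rejected = screen_listeners(scores)
print(kept, rejected)  # ['A', 'B', 'C'] ['D']
```

Because only the ordering matters, a listener who uses a narrower part of the scale than the rest of the panel is not penalised, whereas one who orders the conditions very differently is.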
There were three phases to this series of tests. The first two covered the most commonly used codecs for emission (transmission), the last link in the chain and usually the one with the lowest bit rate. Phase three looked at combinations of higher bit rate codecs used in the production/distribution chain – which are designed to be cascaded – combined with low bit rate emission codecs, and how they interact.
To ensure randomisation it was decided to split the codecs into three groups, based on their bit rates. Each listener’s five codecs contained at least one from each of these high, medium and low bit-rate groups, with the remaining two being from a single group to ensure a strong intra-group comparison within the test; eg a listener might hear one high, three medium and one low bit-rate codec.
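The allocation rule described above can be sketched as follows: one codec from each bit-rate group, plus two more from a single group chosen at random. The codec names, group sizes and function name are hypothetical, and the actual allocation procedure used in the tests may have differed, for example by balancing how often each codec was heard.

```python
import random


def allocate_codecs(groups, rng=random):
    """Pick five codecs: one from each group plus two extra from one group.

    groups maps a group name ('high', 'medium', 'low') to its codec names.
    """
    # Only a group with at least three codecs can supply the two extras.
    eligible = [name for name, codecs in groups.items() if len(codecs) >= 3]
    if not eligible:
        raise ValueError("no group has enough codecs for a three-way comparison")
    focus = rng.choice(eligible)
    chosen = []
    for name, codecs in groups.items():
        count = 3 if name == focus else 1
        chosen.extend(rng.sample(codecs, count))
    rng.shuffle(chosen)                      # randomise presentation order
    return chosen


# Hypothetical codec names in three bit-rate groups.
codec_groups = {
    "high": ["H1", "H2", "H3"],
    "medium": ["M1", "M2", "M3", "M4"],
    "low": ["L1", "L2", "L3"],
}
listener_set = allocate_codecs(codec_groups, random.Random(2009))
print(listener_set)  # five distinct codecs, three of them from one group
```

Passing a seeded `random.Random` instance makes an allocation reproducible, which is convenient when the listening sessions have to be documented afterwards.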
Ten test items were used covering a varied selection of material, including applause, harpsichord, sax and piano, a church organ and Robert Plant.
IRT carried out the analysis to produce the test results. Some listeners were rejected if they fell outside of the Spearman Rank Correlation threshold, which compares the ranking given by each listener with the overall rankings. After this process some codecs dropped below the minimum of 15 listeners and so extra listening tests had to be carried out.
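The check that triggered the extra listening tests can be sketched as a simple count of valid listeners per codec after rejection. The minimum of 15 comes from the report; the data layout, names and example figures are assumptions for illustration.

```python
from collections import Counter

MIN_LISTENERS = 15  # minimum number of valid listeners per codec, per the report


def codecs_needing_more_tests(assignments, rejected, minimum=MIN_LISTENERS):
    """Return {codec: shortfall} for codecs with too few valid listeners.

    assignments maps a listener name to the codecs that listener assessed;
    rejected is the set of listeners removed by post-screening.
    """
    valid = Counter()
    for listener, codecs in assignments.items():
        if listener not in rejected:
            valid.update(codecs)
    all_codecs = {c for codecs in assignments.values() for c in codecs}
    return {c: minimum - valid[c] for c in sorted(all_codecs) if valid[c] < minimum}


# Hypothetical small example with the minimum lowered to 2 for readability.
example_assignments = {
    "listener_1": ["codec_a", "codec_b"],
    "listener_2": ["codec_a", "codec_b"],
    "listener_3": ["codec_a"],
}
shortfall = codecs_needing_more_tests(example_assignments, {"listener_2"}, minimum=2)
print(shortfall)  # {'codec_b': 1}
```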
David went on to show the various test results and explained that some of the codecs used for the test were pre-production prototypes, or have since been upgraded. One common element was that the most difficult item to encode – usually the applause – normally ranked much lower than the mean. For example, one codec was rated 30 on applause but 90 on music, proving that perceptual coding is very content-dependent. [Note: This report does not list the codecs involved or their rankings due to the risk of misrepresenting the current performance of those codecs.]
The conclusion from Phase 1, as would be expected, was that higher bit rates produce better quality. The detailed results for each codec have hopefully given their developers something to work on in terms of improving their performance.
Phase 2 retained the applause sample from Phase 1 as a reference item but the other samples, although similar in terms of content type, were different. When results from Phases 1 and 2 were compared they were similar, proving that the testing methodology was valid. Phase 2 again showed that excellent quality can be achieved from low bit-rate codecs, but not for every type of content, and again it gave the developers guidance on areas where improvements can be made.
Phase 3 combined cascaded high bit-rate distribution codecs such as Dolby E, apt-x and Linear Acoustic with a selection of emission codecs. Ten items were selected from the samples used in the previous tests and these were cascaded five times through the same distribution codec before being passed through one or two different emission codecs. Various combinations were tested.
It was decided to use BS1116 rather than MUSHRA for this phase. Because this is an impairment scale, it was not possible to make any direct comparisons with the results of Phases 1 and 2. The conclusion was that distribution codecs still introduce some impairment, having the effect of creating a ‘ceiling’ to the overall quality attainable. The recommendation therefore is to use the highest bit rate possible.
Overall conclusions from these listening tests were that perceptual coding is still an imperfect art and there is room for improvement. Analysis is not easy, but these tests do reveal things that objective tests could never do, as well as uncovering things you wouldn’t expect.
Meeting report by Bill Foster
Location: Royal Academy of Engineering, London
Description: Lecture by John Vanderkooy, Audio Research Group, University of Waterloo, Canada, with Steyning Research Establishment, B&W Group Ltd, UK
Start Time: 18:30 for 19:00
Date: Tuesday 10th March, 2009