Meetings Archive – 2010

Remastering and Audio Restoration at Abbey Road Studios

Date: 11 May 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

Lecture by Simon Gibson of Abbey Road Studios

Abstract

EMI has an archive going back to 1898 and, since Abbey Road Studios opened in 1931, there has been a gradual increase in the remastering of that back catalogue for new formats. Starting with a potted history of EMI, the early years of recording and the work of Alan Blumlein, we move on to the emergence of remastering at Abbey Road and the systems and techniques used today. The talk will then concentrate on the use made of CEDAR Audio’s Retouch software in the audio restoration of The Beatles album remasters as well as its more unusual use in the creation of music for the video game The Beatles: Rock Band. Along the way we will hear rare audio extracts from EMI’s archive and clips from The Beatles’ recordings to demonstrate these remastering and restoration techniques.

We regret that due to the large amount of copyrighted material played during this lecture we are unable to provide a recording.

Lecture report

Simon’s work at Abbey Road Studios focuses on audio restoration. Much of this is for EMI’s own vast catalogue, which dates back to the first recording by The Gramophone Company (EMI’s predecessor) in 1898. EMI’s archive comprises hundreds of thousands of items, not just audio discs and tapes but also artwork, photographic records and other materials.

In the case of recordings from the pre-tape era, the preference is to transcribe from the metal masters as these deliver a superior quality to a shellac disc. Before 1925 there was no 78rpm standard so the only way to set the correct playback speed is by musical pitch. 1925 also saw the introduction of electrical recordings, with microphones replacing horns. An early example of a UK electrical recording was Handel’s Messiah conducted by Thomas Beecham.

A notable past EMI employee was Alan Blumlein. He joined The Gramophone Company in 1929 and was with them when Abbey Road Studios opened in 1931, the same year that Electrical Musical Industries (EMI) was formed from the merger of The Gramophone Company and the Columbia Gramophone Company. During his tenure with EMI this remarkable man developed moving coil microphones, a binaural cutter head and a stereo ribbon microphone. His wide-ranging stereo patent, lodged in 1931, expired in 1952 and, incredibly, it wasn’t renewed.

A stereo recording of the Royal Philharmonic Orchestra, again with Beecham as conductor, was made by EMI in 1934, while stereophonic tests by the team at the Hayes research laboratory included recording a passing train. However, EMI didn’t consider stereo important at that time! Blumlein died during World War II while testing an airborne radar system but his microphones are still used today.

Until the early 1950s recordings were made directly to discs using a pair of ‘gravity-fed’ cutting machines driven by weights. As is well documented, when the Allies liberated Germany in 1945 they discovered the Magnetophon, a recording device using ¼” tape. Ampex in the US and EMI developed their own versions, the EMI machine being the BTR-1, which ran at 30ips. Simon noted that test recordings made on that machine still sound good today.

Stereo recording at 15ips began in 1955. A recording made of a Beecham rehearsal at Kingsway Hall in 1958 used a Vortexion mixer, EMI BTR-2 and Reslo microphones. The reason? All other mics were in use at the time!

Having provided this potted history, Simon moved on to talk about restoration. Early transfers from disc at Abbey Road were made using EMI’s own (analogue) equipment which could remove some of the clicks and pops, the bigger clicks being edited out after the transfer to tape.

In the mid-80s computer technology started to be employed and is now widely used for audio restoration. Abbey Road has an extensive array of tools developed by Cambridge, UK-based CEDAR Audio. This includes what Simon referred to as “declickle”, a combination of de-click, de-crackle and broadband noise reduction. These tools have to be used with discretion, Simon noted, particularly where the human voice is concerned as it is prone to suffer if processing is overused.

Simon sees the role of a remastering engineer as being like that of a curator, re-presenting works for successive generations. He will recommend that a work is remastered from scratch where the technology has improved sufficiently to make a significant difference to the end result.

By far the biggest problem when dealing with material on analogue tape is old edits. The splices can come unstuck. Oxide shedding is another issue, often solved by baking the tape at 50°C for three days. EMI’s tapes are not usually a problem in this respect as they are stored in a good environment at the company’s library in Hayes.

The remastering process involves finding the best source – which in itself can be a painstaking process – transferring the material to digital, treating it and editing on SADiE. Lastly, some EQ or compression may be applied, but only where appropriate. An engineer’s ears are the final arbiter, Simon noted – knowing when to leave well alone.

Equipment employed includes TC Electronic’s System 6000 dynamic processor and, as mentioned earlier, restoration tools from CEDAR Audio, in particular, Retouch. A full description of this remarkable system is not possible here, but suffice to say that it acts a bit like an audio version of Photoshop. The different elements within a piece of audio are represented as differently coloured visual images and the engineer can then remove an individual element by effectively painting it out.

Simon went on to discuss a particular project which used Retouch in a rather unusual way to create the soundtrack for the Beatles’ Rock Band game title. This involved isolating various instruments and vocals to create individual tracks that those playing the game can reproduce by ‘playing’ their instruments.

Because the original Beatles recordings were made on 2-, 3- or 4-track machines, it is not possible to simply mute individual instruments or voices, as it would be today with the virtually unlimited number of tracks offered by hard disk recording systems. Each instrument or voice therefore had to be extracted on to a separate track by identifying its visual pattern on the Retouch display.

The results were impressive, but unfortunately the only way to hear them will be to play the game!

Report by Bill Foster


Improved methods for controlling touring loudspeaker arrays

Date: 9 Mar 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

Lecture by Ambrose Thompson, Martin Audio

Download the recording of this lecture here (12MB MP3)

The focus of this paper is a popular type of line array loudspeaker used for large- and medium-scale sound reinforcement. These systems are required to deliver very high SPL to a large audience area sometimes as far as 100m from the array, but typically in the 30-70m range. This class of line array is characterised by relatively widely spaced acoustic sources, each with high vertical directionality compared to the more traditional steered column loudspeaker where the acoustic sources are small and tightly spaced. These differences, together with the fact that large audience regions are typically in the near-field, preclude the use of the existing techniques to control linear arrays.

Currently successful methods of control were examined and found to be inadequate for meeting a new more stringent set of user requirements. This paper describes how users of the modern articulated line array loudspeakers used for high level sound reinforcement can control these systems with more precision, and explains how these requirements can be formed into a mathematical model of the system suitable for numerical optimisation. The primary design variable for optimisation was the complex transfer functions applied to each acoustic source. How the optimised transfer functions were implemented with IIR/FIR filters on typically available hardware is explained, and a comparison made between the predicted and measured output for a large array.
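
The paper’s own optimisation is not reproduced in this summary, but the general idea, treating the complex transfer function applied to each acoustic source as a free variable and solving for the set that best approximates a target pressure over the audience positions, can be sketched as a regularised least-squares problem. Everything in the sketch below (geometry, frequency, target and regularisation) is an illustrative assumption, not the method from the paper:

```python
import numpy as np

# Illustrative sketch only: solve for complex drive weights of an N-source
# vertical array so that the radiated pressure approximates a target over
# an audience region, via regularised least squares.

c = 343.0                      # speed of sound, m/s
f = 1000.0                     # analysis frequency, Hz (arbitrary choice)
k = 2 * np.pi * f / c          # wavenumber

# twelve acoustic sources hung vertically, top at 8 m above the ground
src_z = 8.0 - 0.35 * np.arange(12)
src = np.column_stack([np.zeros_like(src_z), src_z])

# audience ear positions from 10 m to 70 m, at 1.6 m height
aud_x = np.linspace(10.0, 70.0, 200)
aud = np.column_stack([aud_x, np.full_like(aud_x, 1.6)])

# free-field Green's function from every source to every audience point
r = np.linalg.norm(aud[:, None, :] - src[None, :, :], axis=2)
G = np.exp(-1j * k * r) / r

# target: equal pressure magnitude everywhere in the audience region
target = np.full(len(aud), np.abs(G).mean() * len(src_z), dtype=complex)

# Tikhonov-regularised least squares for the complex source weights
lam = 1e-3
w = np.linalg.solve(G.conj().T @ G + lam * np.eye(len(src_z)),
                    G.conj().T @ target)

achieved_spl = 20 * np.log10(np.abs(G @ w))
print("SPL spread across audience: %.1f dB" % np.ptp(achieved_spl))
```

At each frequency the resulting complex weights would then be approximated by the IIR/FIR filters mentioned above.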


Cutting Edge Research — from the University of Surrey

Date: 16 Nov 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

‘Psychoacoustic Engineering at the Institute of Sound Recording (IoSR)’

The IoSR is responsible for world-class research in audio-related subject areas, and offers postgraduate research-based MPhil and PhD programmes, as well as being home to the world-famous Tonmeister™ BMus undergraduate degree course in Music & Sound Recording.

Since the creation of the Institute of Sound Recording (IoSR) in 1998 it has become known internationally as a leading centre for research in psychoacoustic engineering, with world-class facilities and with significant funding from research councils (in particular EPSRC) and from industry (we have successfully completed projects in collaboration with Adrian James Acoustics, Bang & Olufsen, BBC R&D, Genelec, Harman-Becker, Institut für Rundfunktechnik, Meridian Audio, Nokia, Pharos Communications and Sony BPE). Additionally, the IoSR was a founding partner in the EPSRC-funded Digital Music Research Network (DMRN) and Spatial Audio Creative Engineering Network (SpACE-Net).

We are interested in human perception of audio quality, primarily of high-fidelity music signals. Our work combines elements of acoustics, digital signal processing, psychoacoustics (theoretical and experimental), psychology, sound synthesis, software engineering, statistical analysis and user-interface design, with an understanding of the aesthetics of sound and music.

One particular focus of our work is the development of tools to predict the perceived audio quality of a given soundfield or audio signal. If, for example, a new concert hall, hi-fi or audio codec is being designed, it is important to know how each candidate prototype would be rated by human listeners and how it would compare to other products which may be in competition. Traditional acoustic and electronic measurements (e.g. RT60, SNR, THD) can give some indication but a truly representative assessment requires lengthy listening tests with a panel of skilled human listeners. Such tests are time-consuming, costly and often logistically difficult. The tools that we are developing will describe the quality of the prototype without the need for human listeners.

An introduction to our research will be given by the Director of Research, Dr Tim Brookes, followed by demonstrations and posters from our postgraduate researchers. We welcome those working in industry and academia to attend the presentation and to discuss our recent findings and overall research goals.

For additional information please visit:

http://www.surrey.ac.uk/soundrec/


Headphone processing for a three-dimensional world

Date: 13 Jul 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

The recording is available for download here (16MB mp3).

Lecture by Ben Supper, Focusrite Audio Engineering Ltd.

The practice of processing audio signals to impose lifelike room acoustics on them for headphone presentation is called auralisation. The two most commercially exploited applications of auralisation are the conversion of headphone stereo listening into an experience more like loudspeaker stereo listening, and the simulation of proposed architectural spaces.

Although the tools required for auralisation are fairly well understood, experiments that test the response of the human auditory system to spatial cues are generally designed to investigate one changing parameter in a complicated sound field, and the way in which stimuli are synthesised is not standardised. These limitations mean that little has been written recently about the ways in which the various parts of the human auditory system interact to experience a spatial illusion presented over headphones.

This talk presents, informally, some observations learned from several years’ experience trying to analyse and deceive the spatial parts of the human auditory system. It discusses how we perceive the spatial cues present in direct sound and reverberation that are central to auralisation, and the most effective and efficient ways of presenting a convincing illusion without causing listening fatigue.
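
As a rough illustration of the auralisation process described above (and not Ben Supper’s own processing chain), a dry recording can be convolved with a pair of binaural room impulse responses so that, over headphones, the source appears to sit in a particular room and direction. The file names below are placeholders:

```python
import numpy as np
import soundfile as sf                   # assumed available; any WAV reader will do
from scipy.signal import fftconvolve

# Minimal auralisation sketch: convolve a dry mono recording with a
# left/right binaural room impulse response (BRIR) so that, over
# headphones, the source appears to sit in the measured or modelled room.
# "dry.wav" and "brir_LR.wav" are hypothetical file names.

dry, fs = sf.read("dry.wav")             # mono source signal
brir, fs_brir = sf.read("brir_LR.wav")   # 2-channel BRIR (left, right)
assert fs == fs_brir, "sample rates must match"

left = fftconvolve(dry, brir[:, 0])
right = fftconvolve(dry, brir[:, 1])

out = np.column_stack([left, right])
out /= np.max(np.abs(out))               # simple peak normalisation
sf.write("auralised.wav", out, fs)
```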


Active Acoustic Absorbers: A Feasibility Study

Date: 27 Apr 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

A recording of this lecture is available here (mp3; 18MiB)

Lecture by John Vanderkooy, University of Waterloo and B&W Loudspeakers Ltd.

An active acoustic absorber must sense the sound field in a space, and generate a signal to absorb energy from that field. In 1-D such absorbers work very well in situations such as ducts, and in 3-D systems they are effective if source and canceller are much closer than a wavelength. In actual rooms with loudspeakers, such conditions never apply. A tutorial is presented outlining the known theory and possibilities of active absorbers that work over a wide band. The self-pressure of the absorber complicates its operation, and an analysis is presented in which this self-pressure is cancelled by a signal related to cone motion. The resulting device may still suffer from implementation problems. Experiments are discussed that determine the required microphone signal that needs to be applied to the adjacent absorber driver. Active absorbers can also act as subwoofers. To conclude the talk, some FDTD calculations are presented which show how a subwoofer excites room resonances, and the influence of different configurations. Delay-and-cancel techniques lead to a very flat bass response, but that probably removes too much of the room acoustics.


When All the Songs Sound the Same: Insights into the Musical Brain

Date: 10 Jun 2010
Time: 19:00

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

A recording of the lecture is available here (69MB mp3)

Lecture by Dr Lauren Stewart, Goldsmiths University, London.

The ability to make sense of musical sound has been observed in every culture since the beginning of recorded history. In early infancy, it allows us to respond to the sing-song interactions from a primary caregiver and to engage in musical play. In later life it shapes our social and cultural identities and modulates our affective and emotional states. But a few percent of the population fail to develop the ability to make sense of or engage with music. The study of disordered musical development sets in sharp relief the perceptual and cognitive abilities which most of us take for granted, and gives us a unique chance to investigate how musical perceptual ability develops, from the level of the gene to the development of the brain and the emergence of a complex and fundamental human behaviour.

Lecture report

For most of us, music forms an integral part of our lives. Even if we’re not music fans as such we still associate particular pieces of music with events or periods in our lives, memories of which are evoked when we hear them. More than that, across all cultures since the beginning of recorded history music has played a role in the raising of children and in adult cultural interaction.

But a minority of people, suffering from a condition termed amusia, are unable to perform the sophisticated processes of analysis and prediction that make up human musical enjoyment. Amusia can result from brain injury, but congenital amusia – the condition that Dr Lauren Stewart of Goldsmiths, University of London, is studying and which she described in this lecture – is apparently a genetic condition passed from parent to child. Certainly, familial cases of amusia are common.

Amusia was first described by Grant Allen in an 1878 case report, which noted that the patient concerned required two sounds to have an unusually large pitch differential for that difference to be perceived, and that the perception of octave equivalence and differentiation of consonance and dissonance were missing. Isabelle Peretz of the University of Montreal named congenital amusia in 2003 and developed the Montreal Battery for the Evaluation of Amusia (MBEA), a series of tests to probe subjects’ impairment with respect to six different aspects of musical form: scale, contour, interval, rhythm, metre and memory. She also gathered normative data from non-affected subjects. This work showed that amusia is not merely an inability to sing but a symptom of a primary perceptual problem.

Lauren Stewart’s research in this field began with the creation of an online screening test at the University of Newcastle upon Tyne to identify subjects with amusia, which has now been completed by more than 170,000 people (www.delosis.com/listening/home.html). A normal score in the test is 26 or 27 out of 30; subjects who scored 21 or lower were identified as amusic and invited to attend a full battery of tests. A total of 40 people agreed, and control subjects were then matched for age and musical background. Questions the research hopes to address are:

1) What is the core underlying deficit (or deficits if more than one)?
2) Are there implications outside the musical domain, eg for speech perception?
3) Does musical appreciation depend upon intact perception?
4) Given that this is a genetic condition, are there brain structural correlates?

The detailed testing has shown that amusics’ difficulties do not arise from a pitch discrimination deficit – they are able to detect a pitch difference of a semitone – but they do perform less well than normal subjects on pitch direction tests. This ability to perceive the direction in which pitch changes is essential to distinguishing one tune from another. Thresholds for the detection of intensity change are currently being measured, initial results suggesting that amusics have a higher threshold – so their ‘contour’ problem may involve more than a shortfall in pitch direction discrimination.

Amusics perform well on speech perception, which has led to the suggestion that there may be a different pitch mechanism for speech and music. But the difference may also be explained by amusics being able to ‘tag’ certain words, which they listen out for. A new experiment using natural speech and gliding-tone analogues has confirmed that amusics have difficulty with discrimination of subtle pitch changes. This may represent less of a problem with non-tonal languages like English than with tonal languages like Chinese, where pitch varies rapidly within words and sentences, making pitch discrimination more important.

Although amusics generally use music less than control subjects, there is overlap between the groups with about a third of amusics using music as much as the controls. This might suggest the existence of different sub-groups of amusics but it has been observed that those who use music more in fact have similar perceptual deficits but are younger. So there may be a socio-cultural explanation, younger amusics using music for its associations rather than for enjoyment of the music itself. Or it could be that some amusics have problems with timbre as well as pitch, making appreciation of music more difficult still.

Structural neuroimaging studies have confirmed subtle differences in brain structure between the amusic and control groups, with both the frontal and temporal lobes displaying differences in grey matter density. Functional studies have shown that the frontal cortex is recruited for complex pitch discrimination tasks, so these structural differences may well be associated with amusics’ pitch perception difficulties. It appears that crucial connections are weaker in amusic subjects.

In closing out this fascinating lecture and inviting questions from the floor, Dr Stewart recommended David Huron’s book ‘Sweet Anticipation: Music and the Psychology of Expectation’ to those interested in learning more about current understanding of the perception of music. In this book Prof Huron, head of Ohio State University’s Cognitive and Systematic Musicology Laboratory, expands on the role of expectation in musical appreciation.

Report by Keith Howard

A brief description of congenital amusia, written by Dr Stewart, is available here.

Dr Stewart is Senior Lecturer and director of a new MSc course: Music, Mind and Brain at Goldsmiths, University of London

Lauren originally studied Physiological Sciences at Balliol College Oxford, but transferred from bodies to brains with an MSc in Neuroscience and doctoral and postdoctoral training at the Institute of Cognitive Neuroscience, the Wellcome Department of Imaging Neuroscience (both UCL) and Harvard Medical School.

Her current research interests range from studying those with congenital amusia, who have an inability to make sense of musical sound, to studying the acquisition of perceptual, cognitive and motor skills in trained musicians.


Who’s the bad guy now? Maintaining audio/video sync in today’s broadcast environment

Date: 12 Jan 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

A recording of this lecture is available here.

Lecture by Andy Quested, Head of Technology, BBC R&D.

To complain that “the audio is out of sync” was, in the past, doing audio an injustice. The use of visual effects units, time base correctors and other digital processing in the video chain, while the audio continued to pass through an analogue signal path, meant that it was, in fact, the video which was usually out of sync. However, the move to digital audio processing, and in particular surround sound broadcasting – which often requires six channels to be passed through a two-channel infrastructure – has significantly moved the goalposts. The advent of HD, with its more clearly defined imaging, has exacerbated the problem. Andy Quested will highlight some of the audio/video synchronisation issues that the BBC HD channel has had to deal with, and will outline the measures it is taking to put audio back into its rightful place.

Andy’s BBC blog provides some more background.

The lecture recording is available to download here (45MB mp3)

Meeting Report

Andy was joined for the lecture by a colleague from BBC Future Media & Technology, Rowan de Pomerai, who provided details of BBC HD’s audio/video transmission infrastructure and the points where sync errors can be introduced. This was comprehensively illustrated by slides showing block diagrams of the various elements in the chain, many of which can be found in an excellent white paper on the EBU’s website. Andy’s contribution was more anecdotal, highlighting the actual problems encountered, and this report will focus primarily on his part of the lecture.

Andy opened with some statistics on HD adoption in the UK. Sky has 1.8m HD subscribers, Virgin has 280,000 and 48,000 watch HD via Freesat. Freeview HD is launching and is expected to become the biggest single platform. In 2009, Wimbledon and Torchwood attracted HD audiences of 1.75m. Overall, 2009 was not a bumper year for sport but there will be plenty in 2010, including the Winter Olympics and the World Cup. Launched in April 2009, the HD iPlayer is now the most successful version of the BBC’s catch-up service.

In a recent survey, viewers were asked what they considered to be the most important elements of an HD channel. Not surprisingly, picture quality was placed top by 56 per cent, followed by choice of programming by 48 per cent. Sound quality was fourth at 34 per cent, a figure that hasn’t really changed since BBC HD was launched in 2007. Part of the problem is that, unlike cinemas which have laid-down standards for audio replay, home speaker layouts can vary enormously, particularly in the placement of the centre speaker. This can make it difficult to predict the listening experience.

Moving on to the specific topic of audio sync, Andy noted that the BBC HD channel suffered from several audio sync and metadata problems in the early days. Programmes affected were the Proms, Electric Proms, Olympics and Strictly Come Dancing.

One of the earliest instances of a major problem involved the 2008 Eurovision Song Contest. Andy was watching at home and immediately noticed that there was no music track on the HD broadcast, only vocals from the centre speaker. Somehow what should have been a 5.1 track was actually 1.0, which shouldn’t happen because the BBC HD channel is locked to 5.1 even when broadcasting stereo in order to prevent clicks or mutes which happen when some AV receivers switch modes.

Andy phoned the broadcast centre, which was unaware of the problem – they were hearing 5.1 all the way through the chain. Andy suspected a metadata issue but where was the problem occurring? The broadcast chain includes many elements, not helped by the BBC’s outsourcing policy which means that there are several companies involved (see Rowan’s white paper). The decision was made to switch to an upconverted BBC 1 feed with stereo audio while the problem was investigated because taking audio only would have resulted in sync problems.

At Andy’s request the Dolby encoder was checked and it was found to be set to disable the metadata (an option that has since been removed by a software update). With no metadata the Dolby decoders in the set-top boxes revert to their default mode, which is 1.0. This is a legacy from Dolby systems in cinemas where the centre dialogue channel is the most important element and is therefore the most logical default.

With regard to maintaining sync, BBC HD has taken the approach that audio and video should be in sync at every stage of the chain – known as in-sync encoded. However, this hasn’t stopped numerous complaints about audio/video sync from viewers.

Rowan de Pomerai explained that many of the problems are due to delays created within set-top boxes and flat panel displays, the latter creating a video delay of up to 100ms. Hearing audio before the video is counter-intuitive because light travels faster than sound and we’re therefore used to hearing the audio delayed relative to the picture, not vice-versa. Many set-top boxes have a delay function, but this has to be configured. The BBC has developed a sync test to assist in setup which is broadcast at regular intervals during the daytime. (For a full description of the test see Rowan’s white paper.)

Providing a sync test is a great idea but for it to work correctly it’s essential that the audio and video signals arriving at the set-top box are in sync. The broadcast chain was measured all the way through to the broadcast encoder and adjustments made for minor sync errors introduced throughout the system. A duplicate system at BBC R&D Kingswood Warren was also measured to verify the figures. However, the only way to check categorically that everything was OK was actually to broadcast a test.

The final problem was how to measure the sync off-air. A set-top box was not reliable enough so the solution was to record the MPEG transport stream, decode it offline and measure the analogue waveform and video frame numbers. The BBC was aiming for ±5ms – a quarter of the EBU’s recommendation – but the result of this test was measured to be ±2ms. “So, it’s no longer just ‘OK leaving me’, it’s also ‘OK arriving at you,’” Andy noted, adding that servers do drift so ±5ms is BBC HD’s target as an average. This is still an excellent figure when taking into account that there’s around 8ms sound delay between a TV and the viewer.

Before transmitting this test BBC HD received 20-30 complaints a week regarding sync but after the test these dropped to zero. The only complaints received since were for one live broadcast that actually was out of sync. In that instance BBC HD knew the feed was out of sync because they had the confidence that the broadcast chain was 100 per cent in sync.

In conclusion, Andy stressed that audience education is essential. The BBC receives about 90,000 hits on its website and 3,000 calls per month about HD. The days of just plugging everything in and it all working are gone. Users need to understand about 5.1, adjusting audio delay and speaker positions, and – very important – removing the SCART lead. Countless viewers are watching HD programmes in SD because pin 8 on their SCART has switched the TV from HDMI to the AV input!


An introduction to forensic audio

Date: 15 Apr 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

Lecture by Gordon Reid, CEDAR Audio Ltd.

Note: we are unable to provide a recording of this lecture. Some of CEDAR’s police and security customers place strict constraints on the public dissemination of audio clips and details of cases used in demonstrations of CEDAR’s forensic technology.

Lecture Report

Gordon Reid is the Managing Director of CEDAR Audio, a leading manufacturer of audio restoration and speech enhancement products. He kicked off his lecture with a scenario of how video surveillance, without audio content, can give ambiguous or even completely misleading indications of intent.

Audio forensics is a relatively new field that first entered common use in the 1960s/1970s. Thanks to the technology of companies like CEDAR, audio forensics is now an established field, and the most recent trend is for audio and video surveillance data to be integrated. Before the arrival of digital technology in the 1990s, audio forensics was relatively crude, using often poorly-maintained analogue tape recorders, no single-ended noise reduction, and often just analogue EQ and dynamics processes for clean-up.

Nowadays, recordings are mostly digital, and can be made using low-cost consumer equipment. But this brings some new problems. Recordings are often made by untrained people using small, cheap recorders: he highlighted a divorce litigation case in which a woman concealed a recorder at the bottom of her handbag, covered by a scarf and jumper to make sure it wasn’t found. Unsurprisingly, there was almost no discernible speech data on the resulting recording. So there are new problems to face, but fortunately, DSP algorithms and powerful computers can help get around many of these. But even these have limits: Gordon described a phenomenon known as the “CSI Effect”, whereby the public has unrealistic and fantasy-based expectations of surveillance restoration technology. He cited the apparently genuine example of a person who’d snapped a photo of the side of a speeding getaway vehicle on a mobile phone, and handed it to the police in the expectation that by rotating the side-on (and blurred, low-quality) image in a 3-D computer imaging system, they could read the licence plate! But absurd cases aside, there is an increasing problem: the bad guys are increasingly aware of surveillance techniques, making (for example) body wires impractical because criminals know how to frisk for them effectively. They also know to hold sensitive conversations in locations where there is loud, effective masking noise such as running water or TV noise.

Gordon broke noise reduction technologies for audio forensics into two main applications: real-time surveillance and non-real-time laboratory investigation. Surveillance systems have live listeners (typically police or security officers), who may need to make fast, accurate and life-critical decisions based on what is heard. The principal requirements are low latency, high intelligibility and low listener fatigue. Non-real-time systems are typically used to produce evidence admissible for the courts, so the requirements are for high transcription accuracy, the retrieval of otherwise unintelligible speech, and to reduce transcriptor fatigue. Also, jurors are not trained listeners and courtrooms typically have very poor acoustics, so the presence of background noise may affect their judgement. He cited the case of a defence lawyer who used the presence of modest traffic and street noise on an intelligible recording of incriminating statements to cast doubt on the transcription of the recorded speech — and won.

Gordon listed the long-established principles of good non-covert audio evidence: a suitable recorder, competent operator, authentic recordings, recordings preserved such that they are demonstrable in court, speakers identified, evidence made voluntarily and in good faith — and no edits or changes made. The last point is potentially problematic as, in principle, it could exclude the enhancement processes that render noisy evidence intelligible. This is a grey area, with the degree of processing admissible dependent on the judge, court and jurisdiction. Clearly, there is a need to demonstrate that the processing has not modified the meaning of the evidence. For example, it’s not possible for the microscopic editing of a real-time declicking algorithm to change phonemes, and so change the meaning, but the court may need to be convinced of this. Additionally, proposed UK government regulations on handling evidence may be applied to audio evidence, potentially causing substantial problems when regulations designed to protect physical items are applied to digital media.

Gordon moved on to talk about the specifics of the technology used: it’s usually some combination of noise reduction, equalisation and level processing (e.g. dynamics processing). Dialogue noise suppression is a technology originally developed for the film industry, and CEDAR’s first product in this field, a real-time and very easy to use device, was aimed at post-production for film, video and TV: the typical application was to save a take that had been spoiled by ambient sound intrusion. This was contrasted with lab systems: large computer-based systems intended for off-line batch processing rather than real-time use.

The use of declickers was demonstrated. The earliest algorithms in this field were originally developed for 78rpm archives, but have been developed much further and are now extremely helpful in removing GSM noise, the familiar buzzing/pulsing interference caused by mobile phones. GSM noise can be shown to comprise buzz at around 217Hz and a series of pulses. The declicker can remove the impulsive noises, and the buzz can be removed with a dedicated Debuzz algorithm. The results of this were demonstrated with a 999 call recording, originally almost completely inaudible, but which when processed revealed much more information and the presence of a second, previously-unheard speaker in the background — of crucial importance to the court case in which the recording was presented as evidence!

Gordon next discussed the use of adaptive filters. If the statistics of the noise are relatively constant, it’s possible to design a filter to separate speech (which tends to change rapidly) from the noise. Additional improvements can sometimes be achieved by treating low and mid frequencies differently to high frequencies, based on perceptual models of hearing and intelligibility.

Some of the interesting applications of adaptive filters include cleaning-up reverberant spaces such as holding cells and transfer vans, and removal of the 400Hz buzz from aircraft power systems that can degrade air traffic control recordings. And, in a reversal of the normal filtering, it was described how CEDAR removed the shouting from a cockpit voice recording in a helicopter that had just suffered a catastrophic mechanical failure, so the investigators could listen to the mechanical sounds to trace the cause of the accident.

Cross-channel adaptive filters can overcome steps taken to defeat surveillance, such as using loud radio or TV to mask a conversation. This type of filter exploits the correlation between the direct broadcast signal (if available) and the tonally altered broadcast signal present in the surveillance, and can effectively remove it from the surveillance signal. If there isn’t a convenient reference of the broadcast, use of multiple microphone locations causes some to have more speech and others to have more interfering signal, giving the cross-channel adaptive filter enough to work with. A reconstructed demonstration was played in which, when using a single mic recording of some speech in the presence of loud music from a radio, a transcription expert obtained approximately 30% accuracy. Adding a second mic positioned closer to the radio than the first and using this as the reference channel for the cross-channel adaptive filter, the intelligibility was hugely improved, and the transcription accuracy increased to 100%.
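
CEDAR’s implementations are proprietary, but the textbook form of such a cross-channel canceller is an adaptive filter that learns the path from the reference microphone to the surveillance microphone and subtracts its estimate of the masking signal, leaving the speech in the error signal. The NLMS sketch below uses synthetic signals purely to illustrate the principle:

```python
import numpy as np

# Illustrative NLMS cross-channel canceller: a reference channel carrying
# the masking signal (e.g. the radio) is adaptively filtered to match its
# tonally-altered version in the surveillance channel and subtracted,
# leaving the speech as the error signal. Signals here are synthetic.

rng = np.random.default_rng(0)
fs, n = 8000, 8000 * 10
masker = rng.standard_normal(n)                       # stands in for the radio
speech = 0.3 * np.sin(2 * np.pi * 300 * np.arange(n) / fs) * (rng.random(n) > 0.7)

room = rng.standard_normal(64) * np.exp(-np.arange(64) / 10.0)   # unknown path
surveillance = np.convolve(masker, room)[:n] + speech            # mic near talker

taps, mu, eps = 128, 0.5, 1e-6
w = np.zeros(taps)
out = np.zeros(n)
for i in range(taps, n):
    x = masker[i - taps:i][::-1]          # recent reference samples
    y = w @ x                             # estimate of the masker at the mic
    e = surveillance[i] - y               # error = cleaned signal
    w += mu * e * x / (x @ x + eps)       # NLMS coefficient update
    out[i] = e

# after convergence the residual is dominated by the speech component
print("masker suppression: %.1f dB" %
      (10 * np.log10(np.var(surveillance[-fs:]) / np.var(out[-fs:]))))
```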

The form of broadband noise reduction known as spectral subtraction is an impressive tool in music production and restoration, but in forensics its use can be more limited: although it improves listenability and reduces fatigue, the best that can be hoped for regarding intelligibility is that it doesn’t damage it. Nonetheless, it has significant other uses in audio forensics, such as removing the hiss that can be added by adaptive filters. EQ, despite its simplicity and ubiquity, has been a staple processor for forensics since long before the days of DSP and adaptive filters. Removal of low frequencies and the addition of a little boost in the upper mids can hugely increase intelligibility. Limiters are used to reduce the impact of sudden loud noises. By its nature, forensic audio can involve extreme dynamic ranges. When a surveillance officer or transcriptor is listening closely to very low-level signals at very high gain, loud sounds such as gunshots/vehicle crashes/etc. can, without limiting, damage the listener’s hearing. In other cases, such as a recording of a telephone conversation made using a hand-held recorder, balancing the levels of the local and remote speaker can help render the evidence more intelligible and therefore more useful in court.
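
The spectral subtraction Gordon described can be sketched generically (this is the textbook method, not CEDAR’s product): estimate an average noise magnitude spectrum from a noise-only stretch of the recording, subtract it from each short-time frame, and resynthesise by overlap-add.

```python
import numpy as np

def spectral_subtract(x, noise, frame=512, hop=256, floor=0.05):
    """Generic magnitude spectral subtraction (illustrative sketch).

    x     : noisy signal (1-D array)
    noise : a noise-only excerpt used to estimate the noise spectrum
    floor : spectral floor that limits 'musical noise' artefacts
    """
    win = np.hanning(frame)
    # average noise magnitude spectrum over the noise-only frames
    nmag = np.mean([np.abs(np.fft.rfft(noise[i:i + frame] * win))
                    for i in range(0, len(noise) - frame, hop)], axis=0)
    out = np.zeros(len(x))
    for i in range(0, len(x) - frame, hop):
        spec = np.fft.rfft(x[i:i + frame] * win)
        mag = np.abs(spec) - nmag                    # subtract the noise estimate
        mag = np.maximum(mag, floor * nmag)          # keep a small spectral floor
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
        out[i:i + frame] += clean * win              # overlap-add resynthesis
    return out

# usage (assuming the opening second of the recording contains only noise):
# cleaned = spectral_subtract(noisy, noisy[:sample_rate])
```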

Gordon mentioned the increasingly widespread suspicion that audio data mining is being deployed by security agencies: that is, mass interception of all voice communications with automatic recognition of certain key words (e.g. bomb, jihad, etc.). Gordon’s view is that this is not currently technically practical, but that its use may increase within a decade or two. What is currently feasible, and is being used to an ever greater degree, is automatic speaker recognition: commercial solutions are developing fast, but their robustness to voice signals that have been altered by enhancement processing is an ongoing research field. Another significant recent development is the prevalence of low bit-rate, highly-compressed perceptual codecs, which can make both enhancement and automatic speaker recognition more problematic.

Gordon concluded his lecture with a mention of spectrographic editing, which was invented by CEDAR. Time-domain editing can be recognised in a spectrograph, making this kind of evidence-tampering obvious. But spectrographic editing allows powerful manipulation of the signal, often invisible to future investigation. This tampering can be very dangerous in the wrong hands, but when used ethically can reduce or remove masking signals, making it a powerful enhancement tool.

Many thanks to Gordon for an eye-opening lecture, and his fascinating insights into the remarkable technology his company has created.

Report by Michael Page


Santa Baby, Come Creep a Codec Under the Tree for Me

Date: 7 Dec 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

Lecture by Prof. Jamie Angus, Professor of Audio Technology, University Of Salford.

A recording of the lecture is available here (50MB mp3)

Lecture Report

By the time it reaches the listener, most recorded music has been processed using at least one lossy encoder. Love it or hate it, such a process is central to the convenience of personal music devices, digital broadcasts, and domestic video technology. Countless people awoke on Christmas morning to discover a new codec, in one guise or another, beneath the tree.

So how does bit-rate reduction work, and what can be done to improve both its efficiency and the quality of its output? A codec can be seen as a four-stage process, as illustrated below:

Codec flowchart

The black arrows symbolise audio data, while the green arrows represent side data: information that informs neighbouring processes, but is not directly related to the audio samples.

Apart from the psychoacoustic model, every stage reduces the bit rate of incoming data. The signal redundancy remover is tailored to audio, and can be a process such as a discrete cosine transform or a predictive filter that alters the statistical distribution of the data to make it easier to compress. The entropy coder exploits the non-uniformity of the input data to represent it in a more compact way. Psychoacoustic quantisation removes data which is perceptually masked, and therefore inaudible to the listener. This stage makes the decoded output data non-identical to the input data, but enables considerably higher compression ratios to be obtained by discarding a proportion of the input signal.

We know from demonstrations of existing systems that digital audio can be compressed satisfactorily to a ratio of between 2:1 and 3:1 using lossless methods. To explain how these work, it is convenient to start at the end of the chain, the entropy coder, and approach the problem from the point of view of information theory.

Entropy coding

Change and surprise are what makes information interesting, and audio is no exception. A sine wave is not interesting to listen to. Silence is interesting only when it interrupts what has come before, or changes what follows. A human voice is more interesting when the speaker modulates pitch and speed, and is conveying information that is engaging. Our input data, then, is a background of predictable information, punctuated at intervals by unpredictable elements, and it is this unpredictability that we and our codecs work hardest to convey.

All information is composed of an alphabet of symbols, and the use of these symbols is seldom uniform. The 65 536 sample levels that comprise 16-bit audio are such an alphabet, and their use is very non-uniform. This is due partly to the statistical nature of sound, but also to our desire for dynamic novelty in music. The data below comes from two commercial recordings. A sine wave does not present such a distribution, and neither does white noise.

Graph: distribution of samples

This graph shows the distribution of samples in two CDs. Blue: Kind of Blue by Miles Davis. Red: Come To Daddy EP by Aphex Twin. The former is a re-mastering of a 1950s jazz recording. The latter, released in 1997, is an archetype of high-ratio compression and distortion. The distribution of samples is nonetheless similar. Although the dynamic range of the programme material under investigation may change the offset or initial slope of this graph, it is clear that the frequency of occurrence falls by approximately half for every linear increase of 2000 sampling intervals. Sinusoids do not behave like this, but sinusoids are not musically interesting.

We need 16 bits to convey 16-bit samples, but if we look at the frequency of use, the most common symbols are used with a probability of 1/2^11 (about twenty per second at 44.1kHz), and the least-used symbols with a probability of about 1/2^24 (fewer than one every six minutes). If we use symbols of variable lengths, with short symbols for the most frequently-used sample values, and longer symbols for the least-used ones, the size of the data is considerably reduced. The number of bits of information we would require to encode an arbitrary sample in the data set is referred to as the self-information of the data. This value is around 12.3 bits for the Miles Davis example above, and 13.2 bits for Aphex Twin.
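
Figures like these can be estimated for any recording by building a histogram of its sample values and computing the Shannon entropy. A minimal sketch, assuming 16-bit PCM read from a hypothetical file "album.wav":

```python
import numpy as np
import soundfile as sf   # assumed available; any 16-bit WAV reader will do

# Estimate the average self-information (Shannon entropy) of the 16-bit
# sample values of a recording. "album.wav" is a placeholder filename.
samples, fs = sf.read("album.wav", dtype="int16")
values, counts = np.unique(samples.ravel(), return_counts=True)
p = counts / counts.sum()
entropy_bits = -np.sum(p * np.log2(p))
print("average self-information: %.1f bits per sample" % entropy_bits)
```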

Huffman coding

The most commonly-used method to exploit self-information is Huffman coding. A binary tree is built from the bottom up using a recursive algorithm (a code sketch follows the list):

  1. Join the two symbols or structures with the lowest probability of occurring, so that ‘0’ symbolises the first and ‘1’ symbolises the second.
  2. Add their probabilities together: this is now the probability of that structure occurring.
  3. Repeat from stage 1, until all the symbols are connected.
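
A minimal sketch of that bottom-up construction in code, using a heap of (probability, subtree) pairs; the toy symbol probabilities are illustrative only, not taken from either recording:

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from {symbol: probability} (minimal sketch)."""
    tick = count()                       # tie-breaker so the heap never compares trees
    heap = [(p, next(tick), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, first = heapq.heappop(heap)     # the two least probable structures...
        p2, _, second = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (first, second)))  # ...are joined
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):            # internal node: recurse
            walk(node[0], prefix + "0")        # '0' symbolises the first branch
            walk(node[1], prefix + "1")        # '1' symbolises the second
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2])
    return codes

# toy distribution loosely shaped like the sample histograms above
probs = {0: 0.4, -1: 0.2, 1: 0.2, -2: 0.08, 2: 0.08, 3: 0.04}
print(huffman_code(probs))
```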

David A. Huffman, incidentally, was a graduate student when he was assigned this problem as an exercise by his professor, Robert M. Fano. Fano and Claude Shannon had together spent some years developing the theory of creating binary trees, and could not find a method that was optimal in every case. They were building their binary trees from the top down, tackling the highest probabilities first. After many months of effort, Huffman had a sudden realisation that the solution was to build the trees the other way. Professor Fano’s response to this revelation: ‘Is that all there is to it!’

Huffman binary tree: audio samples

This is a Huffman binary tree of a five-bit rendition of Kind of Blue. Starting at the top, a binary zero is used for movement left down the tree, and one for a movement right. The binary code for zero is thus given by 1; for -2 it is 0100; 5, which is used about one hundredth as often, is 0101010001. Data that does not have the same binomial distribution as audio will generate a bushier tree; for the purposes of illustration, a Huffman binary tree based on letter frequencies in the complete works of Shakespeare is shown below.

Huffman binary tree: complete works of Shakespeare

The average number of bits we would need to represent our data using the Huffman code is 1.55, which is fairly close to its self-information coefficient of 1.36 bits, and considerably less than the 4 bits that would be needed using plain sample data.

By encoding our data this way, we can remove a substantial proportion of storage demand without changing a single sample. However, we cannot easily use Huffman encoding for large sample sizes, as we need to distribute the binary tree along with the audio. The memory requirements for storing a deep binary tree quickly become unreasonable, since the tree doubles in size for every bit added.

We can instead use the distribution of data to our advantage in a slightly different way: first, to restrict our alphabet to a number of symbols of increasing length that get us to the region of interest, and then to convey the rest of the data in raw binary. This approach is known as Golomb-Rice coding. The data that conveys the region of interest is generally conveyed using a thermometer code: 0 for the first region, 10 for the second, then 110, 1110, and so on. This is the same code as would be encountered by rotating some of the branches of the calculated tree above, but is much simpler to manipulate.
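
A minimal Rice coder along these lines, with the quotient in the unary ‘thermometer’ part and a k-bit binary remainder (k = 4 is an arbitrary choice here, and signed samples would first need mapping to non-negative integers):

```python
def rice_encode(value, k):
    """Golomb-Rice encode a non-negative integer (illustrative sketch).

    The quotient is sent in unary ('thermometer') form, q ones closed by a
    zero, and the k-bit remainder follows as plain binary.
    """
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, "0{}b".format(k))

def rice_decode(bits, k):
    q = 0
    while bits[q] == "1":        # read the unary part
        q += 1
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) | r

# with k = 4, small (frequent) values cost few bits, large ones more:
for v in (3, 17, 100):
    code = rice_encode(v, 4)
    print(v, "->", code, "->", rice_decode(code, 4))
```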

Redundancy removal

Further savings can be obtained by considering the nature of audio, and by moving somewhat towards the frequency domain. Audio is to some extent predictable: it is the response of a number of resonant systems to an excitation. The resonance and excitation components may be conveyed separately, and fairly compactly, by sending the parameters of a predictive filter together with an excitation signal. Since human speech is also produced by an excited resonant system, this approach, called adaptive or predictive encoding, forms the basis of many speech compression algorithms. The disadvantage of such systems is that they are not robust to errors: an undetected error between the encoder and decoder will upset the filter coefficients. This causes instability, and makes the audio data diverge from its proper values.
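
A toy version of that predictive idea uses a fixed order-2 predictor (real codecs adapt the coefficients; this fixed choice is merely similar in spirit to the fixed predictors found in lossless codecs). Only the residual, which is small and therefore cheap to entropy-code, needs to be stored, and the decoder reverses the prediction exactly; it also shows why an undetected error is catastrophic, since every later sample depends on the reconstruction so far.

```python
import numpy as np

# Illustrative fixed linear predictor (order 2): predict each sample by a
# straight-line extrapolation of the previous two, keep only the residual.

def encode_residual(x):
    pred = np.zeros_like(x)
    pred[2:] = 2 * x[1:-1] - x[:-2]        # linear extrapolation
    return x - pred                        # residual: small for smooth audio

def decode_residual(res):
    x = np.zeros_like(res)
    x[0], x[1] = res[0], res[1]
    for n in range(2, len(res)):
        x[n] = res[n] + 2 * x[n - 1] - x[n - 2]   # same predictor, inverted
    return x

x = (8000 * np.sin(2 * np.pi * 220 * np.arange(2000) / 44100)).astype(np.int32)
res = encode_residual(x)
assert np.array_equal(decode_residual(res), x)     # lossless round trip
print("signal spread: %d, residual spread: %d" % (np.ptp(x), np.ptp(res)))
```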

The greatest economies, those found in MPEG encoders, are obtained when the frequency domain is considered. The discrete cosine transform (DCT) is used to shift to the frequency domain, and a number of tricks are then played to simplify the data in that domain. Spectral masking, where some content masks coincident, quieter content — particularly that at higher frequencies — allows many of the coefficients of the DCT to be ignored, or stored at a lower resolution than would otherwise be necessary. Similar economies are obtained using temporal masking, where a transient sound masks events that follow closely. Audio content assumed to be below the threshold of perception can also be removed. The coefficients of the simplified data are then compressed using Huffman encoding, which is rendered more effective by the greater simplicity of data distribution.
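
A toy illustration of that frequency-domain step (with a deliberately crude stand-in for a real psychoacoustic model): transform a frame with a DCT, discard coefficients that fall more than a fixed number of decibels below the strongest component as ‘masked’, and invert. The 30dB figure and the test signal are arbitrary.

```python
import numpy as np
from scipy.fft import dct, idct

def toy_frequency_quantise(frame, mask_db=30.0):
    """Keep only DCT coefficients within mask_db of the frame's strongest
    component; a crude stand-in for a real psychoacoustic model."""
    coeffs = dct(frame, norm="ortho")
    mags = np.abs(coeffs)
    threshold = mags.max() * 10 ** (-mask_db / 20.0)
    kept = np.where(mags >= threshold, coeffs, 0.0)    # 'masked' bins dropped
    return idct(kept, norm="ortho"), np.count_nonzero(kept)

fs = 44100
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(1024)
out, kept = toy_frequency_quantise(frame)
print("coefficients kept: %d of %d" % (kept, len(frame)))
```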

AAC includes an extra component to improve the transient response: quantisation error introduced by the bit-rate reduction process is fed back into the system via a noise-shaping filter to improve the result. This is temporal noise shaping, or TNS.

These processes form the basis of every audio bit-rate reduction system in use today. However, there is plenty of room for improvement. Assuming that Santa’s elves are handy at numerical methods and DSP algorithms, what codec should we wish for next Christmas? Even greater economies and better perceived quality would be obtained by informing our codec using the most recent developments in audio engineering, exploiting more sophisticated psychoacoustic models, auditory scene decomposition, dereverberation, and pitch tracking.

Report by Ben Supper


Lord’s Cricket Ground: Voice Alarm

Date: 14 Oct 2010
Time: 18:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

Lecture by Roland Hemming, RH Consulting.

A recording of the lecture is available here (13MB mp3)

Lecture Report

With a portfolio encompassing the Millennium Dome, Ascot Racecourse, Twickenham, and St. Pancras International station, Roland Hemming most recently turned his sound system design and project management expertise to Lord’s Cricket Ground, which is currently undergoing a multi-year renovation project.

Three interrelated aspects of the new system were covered in Roland’s talk. The first of these concerned the correct approach to standards compliance. The second covered the more general technical challenges involved in fitting out a large and complex sports venue, where the public address system is used routinely to entertain as well as inform. Thus the system must at once be versatile enough to cope with any conceivable situation, simple enough for a novice to use, and robust enough to withstand partial failures and still carry on working in an emergency. The third aspect is the diplomatic side of the work, and the importance of communication and commercial skills in managing a large installation project.

Voice alarm systems for public address are covered by a number of standards. These specify such things as the speech intelligibility of the system and the need for distributed redundancy of circuits and amplification to avoid any single point of failure. They also stipulate requirements for fire resistance, remote fault monitoring, and operability in the event of a power failure. Many of these needs are specified fairly loosely, providing scope for interpretation. Consequently, much of the skill in working with voice alarm systems is in knowing how much redundancy and fireproofing to build into the system, and where to put it. The latest of these standards is EN54, which will be enforced from 2011, and introduces product testing to the mix. Further to this, there are other standards that cover specific installations, including BS7827, which concerns sound systems for sports venues.

The best practice for voice alarm installation often diverges from that found in professional audio engineering, as the emphasis is on failsafe design that entrusts dynamic control of loudspeaker amplification and routing to pre-programmed paths that are self-regulating, without the need for human intervention. Redundancy is harder to achieve over data networks: unlike audio, Ethernet must be singly-connected and wired point-to-point, and cannot normally be run in loops, but this can be done in stadia using spanning tree technology.

Certain special cases are exempt from EN54, including self-powered loudspeakers and loudspeakers for ‘special’ applications (those, for example, with particular directivity characteristics). Also exempt are ‘kit systems’, made from individual elements of non-Voice Alarm equipment that together comprise the system: the discretion is then left to the project manager to justify the safety of the resulting system, and the local safety authority to approve it.

The Lord’s system comprises eight distributed digital rack rooms. This distribution not only assists redundancy, but also keeps down the length of cable runs. The system is designed to be truly expandable, which can mean anything from moving the walls around in a hospitality box to razing and rebuilding an entire stand. Four control stations provide the opportunity for live announcements, controlled by a touch-screen user interface that allows these to be directed appropriately. Although the venue will eventually divide into 165 sound zones, this is greatly simplified for normal operation so that the system can be employed by an announcer with just a few minutes of training. Hybrid transformers allow audio to be injected into individual areas to provide localised input where this is required. The system features Dante audio networking technology providing audio over IP, ASL Vipedia audio processors, Lab.gruppen amplifiers, and DAS loudspeakers. A local analogue loop provides emergency backup in each rack room, and the use of many small speakers allows every spectator to be reached without disturbing the neighbours.

One of the most difficult elements of any installation, not least one concerning such a historic venue in such an exclusive neighbourhood, is the balancing of the many vested interests. The local council, residents, the various committees, the operators themselves, and the safety team must all be satisfied with the system’s specifications and performance. Discussions prior to installation take years, and can continue after the major part of the installation is complete. The installation itself, being by far the most expensive stage of the operation, is often over in a matter of weeks and must therefore be planned with precision. The potential for catastrophe makes these large projects an exercise in risk management. Without satisfaction, there can be no compliance; without compliance, there can be no venue; without a venue, there can be no business.

Report by Ben Supper

For those who wish to know more about voice alarm, Roland has co-written a book on the subject that is available from avitas-global.com.


Synchronising the Synchronisation Standards

Date: 16 Feb 2010
Time: 19:30

Location: Royal Academy of Engineering
3 Carlton House Terrace
London SW1Y 5DG

Lecture by John Emmett.

Download recording of lecture here (20MB MP3)

Lecture Report

Dr Emmett opened the lecture by summarising the audio-video synchronisation challenges encountered when putting together a television programme. It is better to correct synchronisation problems as they occur in the broadcasting chain than to attempt to correct them all immediately prior to transmission, as the former practice greatly simplifies video editing. With this achieved, attention turns to keeping audio synchronised during broadcast transmission and reception. This is particularly important for human speech: humans are exquisitely sensitive to lip sync. We develop this facility almost as soon as we can see, and the psychological need for lip movement to be attached to speech is so great that it extends even to characters without mouths. Each Dalek needs a light that pulses in sync with its speech, to bond the dialogue to that character.

A number of techniques were employed in the days of purely analogue transmission to ensure that audio and video were kept in sync. It was not unusual for a programme’s video signal to be relayed via satellite and its audio via telephone, and a compensating audio delay had to be inserted to offset uplink and downlink delays. An example of this was in use at ITN in the early 1980s. An in-band masked ‘bong’ was timed to follow any video cut in the programme by exactly one second. It was possible then for engineers to adjust the audio delay manually to maintain sync, even where this varied during the programme. Similar timestamps must still be maintained in digital systems, although this facility is now generally accommodated within the channel code.

It is increasingly common for audio and video to be streamed by piggy-backing on a packet-based protocol and transmitting via existing IT infrastructure. This works as long as there is sufficient bandwidth. Otherwise, heavy-duty interleaving is required to compensate for dropped packets, which increases transmission delay, and the chances of sync loss and system failure. As with real piggy-backs, the heavier the payload, the slower the system, and the greater the likelihood of collapse.

Now consider what the word ‘standard’ means: this is where problems are compounded. The word has two distinct meanings. It can refer to an outgoing or obsolescent paradigm (such as ‘standard definition’), or to standard-bearing in its original sense — at the technological vanguard. We frequently encounter problems when it is necessary to choose between a plenitude of competing standards of different ages, some of which have yet to be adopted, and many of which should not. Standards are necessary only when the current best practice is unclear, but there are usually clues about which standards are ‘good’. A good standard must be fit for purpose, timely, and robustly defined: if the plug fits, the signal should work. There are caveats, too: not all standards are intended to be friendly (DRM systems are such an example), and even de facto standards undergo sudden and complete changes. Finally, although a standard needs to be owned by a company or committee to avoid obsolescence, it should contain no element for revenue generation.

The emergence of competing delivery standards in broadcasting has brought the synchronisation problem into the home. Many digital multichannel audio transport layers can be conveyed over S/PDIF channel code using IEC 61937 (Dolby Digital; DTS; linear PCM), and a home cinema amplifier may typically accommodate sixty connectors and a dozen multichannel formats. As for the picture, high-definition video formats such as 720p and 1080p co-exist with conventional 625-line 4:3 and 16:9 broadcasts. There are a number of video interconnection formats with different costs, advantages, and limitations. Any of four digital video broadcasting standards are in use in different regions throughout the world, encompassing several standard frame rates. Meanwhile, individual consumer products are designed for world markets, and are simultaneously compatible with many of these standards. In fact, UK broadcasters have been unable to rely on viewers possessing ‘standard’ receiving equipment since 625-line broadcasts began in the 1960s.

Now that it can take half a day for a professional engineer to set up a domestic television, it is quite likely that a set-top box in a typical home, set up without specialist expertise, may be configured to down-convert 720p video to standard definition, and transmit this signal over RGB SCART to a plasma television, which will then up-convert it to 1080p. Audio-video synchronisation is then at the mercy of equipment manufacturers.

Dr Emmett summarised his lecture with advice from Antoine de Saint-Exupéry: ‘No design is finished until the last superfluous item has been removed.’

Report by Ben Supper