Microphone array speech signal processing technology

- Jan 09, 2021-

The significance of microphone arrays for artificial intelligence

Spatial selectivity: technologies such as electronically scanned arrays can effectively locate the sound source. An intelligent device that knows exactly where the sound comes from can respond more intelligently and, through array algorithms, obtain a higher-quality speech signal.

A microphone array can automatically detect the position of a sound source and track the speaker, and it can handle multiple sources and follow a moving source. Wherever you go, the intelligent device enhances the speech arriving from your position and direction.

An array microphone adds spatial processing, which makes up for the shortcomings of a single microphone in noise suppression, echo suppression, reverberation suppression, sound source localization and speech separation. This lets intelligent devices obtain high-quality speech signals in complex environments and deliver a better voice experience.

Technical difficulties of microphone array technology

Traditional array signal processing techniques are often unsatisfactory when applied directly to a microphone array system, because microphone array processing has its own distinct characteristics.

Establishment of array model

A microphone mainly picks up speech, its pickup range is limited, and it mostly operates in the near field. This makes the conventional far-field plane-wave model used in radar and sonar array processing no longer applicable. The near field requires the more accurate spherical-wave model, and the different amplitude attenuation caused by each propagation path must be taken into account.
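As a minimal sketch of the near-field model, the snippet below (illustrative numbers: a 4-mic line array with 5 cm spacing and a source 1 m away, off-axis) computes the per-microphone delay and the 1/r spherical-wave amplitude factor that a plane-wave far-field model would ignore:

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def near_field(source_xy, mic_xy):
    """Per-microphone propagation delay (s) and relative 1/r amplitude
    factor for a spherical wave from a near-field point source."""
    r = np.linalg.norm(mic_xy - source_xy, axis=1)  # path length to each mic
    return r / C, r.min() / r                       # delay, relative amplitude

# hypothetical geometry: 4-mic line array, 5 cm spacing; source 1 m away, off-axis
mics = np.stack([np.arange(4) * 0.05, np.zeros(4)], axis=1)
delays, amps = near_field(np.array([0.3, 1.0]), mics)
# both the delays and the amplitudes differ across the array, unlike the
# far-field plane-wave model where amplitude differences are ignored
```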

Wideband signal processing

Conventional array signal processing is narrowband: the reception delays between array elements appear mainly as phase differences at the carrier frequency. A speech signal, however, is not modulated and has no carrier, and its ratio of highest to lowest frequency is large. The phase delay between elements is closely tied to the frequency content of the source itself, so traditional narrowband array methods are no longer fully applicable.

Nonstationary signal processing

Traditional array processing mostly deals with stationary signals, whereas a microphone array mostly processes non-stationary (or at best short-term stationary) signals. A microphone array therefore processes the signal in the short-time frequency domain, where each frequency bin corresponds to its own phase difference: the wideband signal is divided into subbands in the frequency domain, each subband is processed as a narrowband signal, and the results are recombined into a wideband spectrum.
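The subband idea can be sketched with a plain Hann-windowed STFT at 50% overlap (parameters here are illustrative): each FFT bin is a narrow subband carrying its own phase, and overlap-adding the inverse FFTs recombines the subbands into the wideband signal.

```python
import numpy as np

def stft_frames(x, n_fft=256, hop=128):
    """Split a signal into overlapping Hann-windowed frames and take the
    FFT of each: short-time processing for a non-stationary signal."""
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)  # periodic Hann
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop : i*hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # each row: one short-time spectrum

def istft_frames(spec, n_fft=256, hop=128, length=None):
    """Overlap-add the inverse FFTs back into a time-domain signal."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros((spec.shape[0] - 1) * hop + n_fft)
    for i, f in enumerate(frames):
        out[i*hop : i*hop + n_fft] += f
    return out if length is None else out[:length]

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
spec = stft_frames(x)                # narrow subbands: one complex bin per band
y = istft_frames(spec, length=4096)  # recombined wideband signal
# interior samples reconstruct exactly: a Hann window at 50% overlap sums to 1
err = np.max(np.abs(x[256:-256] - y[256:-256]))
```

A real system would modify `spec` per bin (weighting, phase alignment) before the inverse transform; the identity round-trip here just demonstrates the analysis/synthesis structure.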


Reverberation

Sound propagation is strongly affected by the acoustic space. Because of reflection and diffraction, the microphone receives not only the direct sound but also a superposition of multipath copies; this interference is reverberation. Indoors, reflections and diffraction from room boundaries and obstacles prolong the sound and can greatly reduce speech intelligibility.

Sound source localization

Sound source localization is widely used in artificial intelligence. The microphone array defines a spatial coordinate system, and depending on whether the array is linear, planar or volumetric, the source position in space is determined. Knowing the source position, an intelligent device can further enhance the speech coming from it, and it can combine the position information with other sensors for a richer experience: a robot can hear your call and come over to you, a video device can focus on and lock the speaker, and so on. Before looking at localization techniques, we need to understand the near-field and far-field models.

Near field model and far field model

The working distance of a microphone array is typically 1–3 m, so the array operates in the near-field model: it receives spherical waves rather than plane waves. A sound wave attenuates as it propagates, with an attenuation factor proportional to the distance travelled, so the amplitude received at each element also differs; the element signals therefore differ not only in phase delay but also in amplitude. In the far-field model, the path-length differences between the source and the elements are relatively small and can be ignored. The boundary between far field and near field is usually defined as 2L²/λ, where L is the array aperture and λ is the acoustic wavelength.
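A small helper (assuming a sound speed of 343 m/s; the aperture and frequencies are illustrative) makes the 2L²/λ criterion concrete:

```python
C = 343.0  # speed of sound, m/s

def critical_distance(aperture_m, freq_hz):
    """Far-field boundary 2*L**2/lambda: beyond this range the spherical
    wavefront is well approximated by a plane wave across the array."""
    wavelength = C / freq_hz
    return 2.0 * aperture_m ** 2 / wavelength

# a 10 cm aperture: the boundary grows with frequency as lambda shrinks
d_low = critical_distance(0.10, 1000.0)    # 1 kHz
d_high = critical_distance(0.10, 8000.0)   # 8 kHz
```

For this 10 cm aperture the boundary is roughly 6 cm at 1 kHz but about 47 cm at 8 kHz; larger apertures push it outward quadratically, which is why typical working distances can fall in the near field.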

Sound source localization technology

Sound source localization methods include beamforming, super-resolution spectral estimation and TDOA. They map the relationship between the source and the array onto a spatial beam, a spatial spectrum, and a time difference of arrival, respectively, and use that information for localization.

Scanning array

The beam formed by the array is scanned across space, and the direction is determined from the response at different angles: the weighting coefficients of the elements steer the array's output direction, sweeping the beam. When the scan finds the maximum output signal power, the corresponding beam direction is the DOA of the source, which localizes it. Beam scanning has limitations: it only suits a single source, since multiple sources falling inside the same main lobe of the array pattern cannot be distinguished. Its accuracy is also tied to the beam width: at a given frequency, the beam width is inversely proportional to the array aperture, and a large-aperture microphone array is hard to realize in hardware in many settings.
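A toy narrowband simulation shows the scan (illustrative setup: an 8-mic line array with 4 cm spacing and a 1 kHz tone arriving from 20°): steer the weights over candidate angles and pick the direction of maximum output power.

```python
import numpy as np

C, FS = 343.0, 16000.0
M, D, F = 8, 0.04, 1000.0          # mics, spacing (m), tone frequency (Hz)

def steering(theta_deg):
    """Narrowband steering vector for a line array (broadside = 0 deg)."""
    tau = np.arange(M) * D * np.sin(np.radians(theta_deg)) / C
    return np.exp(-2j * np.pi * F * tau)

# simulate a tone from 20 degrees plus sensor noise
rng = np.random.default_rng(1)
n = 2048
src = np.exp(2j * np.pi * F * np.arange(n) / FS)
X = np.outer(steering(20.0), src) \
    + 0.1 * (rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n)))

# scan: steer the beam over candidate angles and keep the output power
angles = np.arange(-90, 91)
power = [np.mean(np.abs(steering(a).conj() @ X) ** 2) for a in angles]
doa = int(angles[int(np.argmax(power))])
```

Note how broad the power peak is for this small aperture; that broadness is exactly the resolution limit the text describes.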

Super-resolution spectrum estimation

Methods such as MUSIC and ESPRIT eigendecompose the covariance (correlation) matrix to construct a spatial spectrum. They handle multiple sources, and their resolution is independent of the array size, breaking through the physical limitation, hence the name super-resolution. These methods can be extended to wideband processing, but they are very sensitive to errors such as microphone element mismatch and channel error; they assume a far-field model and involve a heavy matrix computation load.
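A compact MUSIC sketch (illustrative setup: an 8-element line array and two sources at -30° and 40°) shows the eigendecomposition of the sample covariance and the pseudo-spectrum built from the noise subspace:

```python
import numpy as np

C, F = 343.0, 2000.0
M, D = 8, 0.04                       # mics and spacing (< half wavelength)

def steer(theta_deg):
    tau = np.arange(M) * D * np.sin(np.radians(theta_deg)) / C
    return np.exp(-2j * np.pi * F * tau)

# two uncorrelated sources at -30 and +40 degrees, plus sensor noise
rng = np.random.default_rng(2)
n = 4000
S = rng.standard_normal((2, n)) + 1j * rng.standard_normal((2, n))
A = np.stack([steer(-30.0), steer(40.0)], axis=1)           # M x 2
X = A @ S + 0.1 * (rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n)))

R = X @ X.conj().T / n               # sample covariance (correlation) matrix
w, V = np.linalg.eigh(R)             # eigendecomposition, ascending eigenvalues
En = V[:, :M - 2]                    # noise subspace: M - K smallest eigenvectors
Pn = En @ En.conj().T

# MUSIC pseudo-spectrum: peaks where the steering vector is orthogonal
# to the noise subspace
angles = np.arange(-90, 91)
p = np.array([1.0 / np.real(steer(a).conj() @ Pn @ steer(a)) for a in angles])

i1 = int(np.argmax(p))               # strongest peak
masked = p.copy()
masked[max(0, i1 - 5):i1 + 6] = 0.0  # suppress its neighborhood
i2 = int(np.argmax(masked))          # second peak
doas = sorted([int(angles[i1]), int(angles[i2])])
```

Both sources are resolved even though they sit far inside a single conventional beamwidth for this aperture, which is the "super-resolution" property.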


TDOA

TDOA localization estimates the delay differences with which the sound reaches different microphones, converts the delays into distance differences, and then uses those distance differences together with the array's spatial geometry to determine the source position. It has two steps: TDOA estimation and TDOA positioning.

1. TDOA estimation

Commonly used methods are generalized cross-correlation (GCC) and LMS adaptive filtering.

Generalized cross-correlation

GCC is mainly used for the delay-estimation step of TDOA-based source localization. It is computationally simple, has low latency and good tracking ability, and is suitable for real-time applications. It performs well under moderate noise and low reverberation, but positioning accuracy degrades in strongly noisy or non-stationary noise environments.
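A minimal GCC-PHAT sketch (synthetic signals, with an assumed 23-sample delay): the cross-spectrum is whitened so only phase, i.e. delay, information remains, and the delay is read off the correlation peak.

```python
import numpy as np

def gcc_phat(sig, ref, max_lag):
    """Delay of sig relative to ref via generalized cross-correlation
    with PHAT weighting (unit magnitude per frequency bin)."""
    n = len(sig) + len(ref)                       # zero-pad to avoid wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                # PHAT: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])  # lags -max..+max
    return int(np.argmax(cc)) - max_lag

# synthetic test: mic 2 hears the same source 23 samples later, plus noise
rng = np.random.default_rng(3)
x = rng.standard_normal(4096)
delay = 23
y = np.concatenate([np.zeros(delay), x[:-delay]]) + 0.2 * rng.standard_normal(4096)
tdoa = gcc_phat(y, x, max_lag=64)
```

Dividing the estimated lag by the sampling rate converts it to seconds, and multiplying by the speed of sound gives the distance difference used in the positioning step.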

LMS adaptive filtering

Once converged, this method estimates the TDOA without prior knowledge of the signal or noise, but it is sensitive to reverberation. The two microphone signals serve as target and input: the input signal is filtered to approximate the target, and the TDOA is obtained from the adapted filter coefficients.
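The idea can be sketched as follows (synthetic signals, with an assumed 7-sample delay): one microphone signal is the filter input, the other is the target, and after convergence the dominant tap index of the LMS filter gives the delay in samples.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(20000)                      # mic 1: filter input
delay = 7
d = np.concatenate([np.zeros(delay), x[:-delay]])   # mic 2: target signal

L, mu = 32, 0.01                                    # filter length, step size
w = np.zeros(L)
for n in range(L, len(x)):
    u = x[n - L + 1 : n + 1][::-1]                  # most recent L input samples
    e = d[n] - w @ u                                # error: target minus estimate
    w += mu * e * u                                 # LMS coefficient update

# after convergence the filter approximates a pure delay: a single strong tap
tdoa_lms = int(np.argmax(np.abs(w)))
```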

2. TDOA positioning

The TDOA estimates are then used to locate the source. Three microphones are enough to determine a source position, and adding microphones improves accuracy. Positioning methods include maximum likelihood estimation (MLE), minimum variance, spherical interpolation and linear intersection. TDOA is relatively widely used: it offers high positioning accuracy, the smallest computational load, and good real-time performance, so it can track sources in real time. Most current intelligent positioning products use TDOA as their localization technology.
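A toy 2D sketch of the positioning step (illustrative geometry: four mics on the corners of a 20 cm square, noise-free delays; a real system would use one of the estimators named above rather than this brute-force search): given TDOAs relative to a reference mic, find the position whose predicted delay differences best match the measurements.

```python
import numpy as np

C = 343.0  # speed of sound, m/s
# hypothetical geometry: four mics on a 20 cm square, mic 0 is the reference
mics = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
src = np.array([1.0, 1.5])

def tdoas(p):
    """Predicted delay of each mic relative to mic 0 for a source at p."""
    r = np.linalg.norm(mics - p, axis=1)
    return (r[1:] - r[0]) / C

measured = tdoas(src)  # ideal, noise-free TDOA measurements

# least-squares grid search: keep the candidate position whose predicted
# TDOAs best match the measured ones
grid = np.linspace(0.0, 2.0, 101)
best, best_err = None, np.inf
for x in grid:
    for y in grid:
        cand = np.array([x, y])
        err = np.sum((tdoas(cand) - measured) ** 2)
        if err < best_err:
            best, best_err = cand, err
```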


Beamforming

Beamforming divides into conventional beamforming (CBF) and adaptive beamforming (ABF). CBF is the simplest, non-adaptive form: the beam is obtained by a weighted sum of the microphone outputs. In CBF the weight of each channel is fixed; its role is to suppress the sidelobe level of the array pattern and thereby filter out interference and noise arriving in the sidelobe region. ABF builds on CBF by adding spatial adaptive filtering against interference and noise: the amplitude weights of the channels are adjusted and optimized according to some optimality criterion, and different criteria yield different algorithms, such as LMS, LS, maximum SNR and LCMV (linearly constrained minimum variance). The LCMV criterion yields the MVDR (minimum variance distortionless response) beamformer: minimize the array output power while keeping the main-lobe gain toward the source unchanged, which minimizes the interference-plus-noise power at the output. It can equivalently be understood as a maximum-SINR criterion: receive the desired signal while suppressing noise and interference as much as possible.

CBF: conventional beamforming

Delay-and-sum beamforming is used for speech enhancement. Each microphone's received signal is delayed to compensate for the time difference between the source and that microphone, so that the signals from a chosen direction are in phase at the output. Incident signals from that direction then receive the maximum gain, the main beam points in the direction of maximum output power, and a spatial filter is formed that makes the array directionally selective.
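A minimal delay-and-sum sketch (synthetic signals with integer-sample delays for simplicity): realigning the channels before averaging keeps the desired signal coherent while the independent noise averages down.

```python
import numpy as np

rng = np.random.default_rng(5)
M, k = 4, 3                 # 4 mics; the wavefront reaches mic m, m*k samples late
s = rng.standard_normal(8000)

# simulate: each mic hears the signal with its own delay plus independent noise
channels = []
for m in range(M):
    delayed = np.concatenate([np.zeros(m * k), s[:len(s) - m * k]])
    channels.append(delayed + 0.5 * rng.standard_normal(len(s)))

# delay-and-sum: advance each channel to compensate its delay, then average
aligned = [np.concatenate([ch[m * k:], np.zeros(m * k)])
           for m, ch in enumerate(channels)]
out = np.mean(aligned, axis=0)

core = slice(0, len(s) - M * k)       # ignore edge samples affected by padding
noise_single = np.mean((channels[0][core] - s[core]) ** 2)
noise_beam = np.mean((out[core] - s[core]) ** 2)
```

With M microphones and independent noise, the residual noise power drops by roughly a factor of M relative to a single microphone.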

CBF + adaptive filter enhanced beamforming

Combining CBF with a Wiener filter improves the speech enhancement: the noisy speech is passed through a Wiener filter to estimate the clean speech signal based on an LMS criterion. Compared with plain CBF, the filter coefficients are updated and iterated continuously, which removes non-stationary noise more effectively.
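As an idealized sketch of the Wiener gain (using oracle per-bin PSDs, which a real system would have to estimate, e.g. during speech pauses): the gain H = Pss / (Pss + Pnn) attenuates the frequency bins where noise dominates.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4096
clean = np.sin(2 * np.pi * 0.05 * np.arange(n))   # stand-in for the speech signal
noise = 0.5 * rng.standard_normal(n)
noisy = clean + noise

# oracle Wiener gain per frequency bin: H = Pss / (Pss + Pnn)
X = np.fft.rfft(noisy)
Pss = np.abs(np.fft.rfft(clean)) ** 2
Pnn = np.abs(np.fft.rfft(noise)) ** 2
H = Pss / (Pss + Pnn + 1e-12)
denoised = np.fft.irfft(H * X, n=n)               # bins with low SNR are attenuated

err_noisy = np.mean((noisy - clean) ** 2)
err_out = np.mean((denoised - clean) ** 2)
```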

ABF: adaptive beamforming

GSLC (the generalized sidelobe canceller) is a method based on active noise cancellation (ANC). The noisy signal passes through a main channel and an auxiliary channel simultaneously; a blocking matrix in the auxiliary channel filters out the speech, leaving a reference signal containing only the multichannel noise. An adaptive filter then forms an optimal noise estimate from this reference, which is subtracted from the main channel to yield the clean speech estimate.
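A toy two-microphone GSLC sketch (illustrative: an in-phase broadside target and a tone interferer arriving 4 samples later at mic 2): the sum branch is the fixed beamformer, the difference is the blocking-matrix output, and an NLMS canceller subtracts the interference correlated with that reference.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
s = 0.5 * rng.standard_normal(n)                  # desired signal, in phase (broadside)
tone = 2.0 * np.sin(0.3 * np.pi * np.arange(n))   # off-axis interferer at mic 1
tone2 = 2.0 * np.sin(0.3 * np.pi * (np.arange(n) - 4.0))  # 4 samples later at mic 2

x1, x2 = s + tone, s + tone2

main = 0.5 * (x1 + x2)   # fixed beamformer: the in-phase target passes
ref = x1 - x2            # blocking matrix: target cancelled, interference kept

# adaptive canceller (NLMS) subtracts the interference correlated with ref
L, mu = 8, 0.1
w = np.zeros(L)
out = np.zeros(n)
for i in range(L, n):
    u = ref[i - L + 1 : i + 1][::-1]
    y = main[i] - w @ u                           # cleaned output sample
    w += mu * y * u / (u @ u + 1e-8)              # normalized LMS update
    out[i] = y

tail = slice(n - 5000, n)                         # evaluate after convergence
err_before = np.mean((main[tail] - s[tail]) ** 2)
err_after = np.mean((out[tail] - s[tail]) ** 2)
```

Because the blocking matrix removes the target from the reference, the adaptive filter can only cancel components correlated with the interference, leaving the speech estimate intact.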