Complex-valued processing has advanced deep learning-based speech enhancement and signal extraction beyond what real-valued processing allows. Models typically apply a time-frequency (TF) mask to a noisy spectrogram. Complex masks are generally preferred over real-valued masks because they can modify the phase of the signal as well as its magnitude.
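
To make the difference concrete, here is a minimal sketch of masking a single time-frequency bin. The bin value and both mask values are made-up numbers chosen for illustration; they do not come from the paper.

```python
import cmath

def apply_mask(noisy_bin: complex, mask: complex) -> complex:
    """Apply a time-frequency mask to one spectrogram bin by multiplication."""
    return mask * noisy_bin

# One hypothetical noisy TF bin: magnitude 2.0, phase +60 degrees.
noisy = cmath.rect(2.0, cmath.pi / 3)

# A real-valued mask can only scale the magnitude; the phase is unchanged.
real_masked = apply_mask(noisy, 0.5)

# A complex mask scales the magnitude and also rotates the phase (here by -30 degrees).
complex_masked = apply_mask(noisy, cmath.rect(0.5, -cmath.pi / 6))

print(abs(real_masked), cmath.phase(real_masked))
print(abs(complex_masked), cmath.phase(complex_masked))
```

Both results have the same magnitude, but only the complex mask moves the phase, which is what lets it correct phase distortions introduced by noise.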

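Recent research has replaced masks with complex filters. A mask multiplies each time-frequency bin by a single value, whereas a filter combines several neighboring time steps, both before and after the current one, which lets it exploit local correlations within each frequency band.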
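
A rough sketch of this idea, assuming the filter is a per-frequency weighted sum over a short window of frames. The function name, the `lookahead` parameter, and the zero padding at the edges are illustrative choices, and the tap values below are fixed by hand rather than predicted by a network.

```python
def deep_filter(spec, coefs, lookahead=0):
    """Filter a complex spectrogram along time, independently per frequency bin.

    spec[t][f]     : complex noisy spectrogram (frames x frequency bins)
    coefs[t][i][f] : complex filter tap i for frame t and bin f
    Output: Y[t][f] = sum_i coefs[t][i][f] * spec[t - i + lookahead][f]
    Frames outside the signal are treated as zero (an assumption of this sketch).
    """
    n_frames, n_freqs = len(spec), len(spec[0])
    order = len(coefs[0])
    out = [[0j] * n_freqs for _ in range(n_frames)]
    for t in range(n_frames):
        for f in range(n_freqs):
            acc = 0j
            for i in range(order):
                src = t - i + lookahead  # lookahead > 0 reaches into future frames
                if 0 <= src < n_frames:
                    acc += coefs[t][i][f] * spec[src][f]
            out[t][f] = acc
    return out

# Three frames, one frequency bin; two taps of 0.5 each, one frame of lookahead.
spec = [[1 + 0j], [0 + 1j], [2 + 0j]]
coefs = [[[0.5], [0.5]] for _ in spec]
filtered = deep_filter(spec, coefs, lookahead=1)
print(filtered)
```

With a single tap and no lookahead, the sum collapses to one multiplication per bin, so a complex mask is the special case of a filter of length one.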

The researchers propose DeepFilterNet, a two-stage framework. The first stage enhances the spectral envelope using human-like frequency perception. The second stage uses deep filtering to enhance periodic components in speech.
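
The first stage can be pictured as one real-valued gain per perceptual band rather than one per frequency bin. The sketch below uses the common Glasberg and Moore approximation of the ERB-rate scale and a simplified layout of non-overlapping bands; the band layout, the bin count, the band count, and the gain values are assumptions for illustration and not the paper's exact filterbank.

```python
import math

def hz_to_erb(f_hz):
    """ERB-rate scale (Glasberg and Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def erb_band_of_bins(n_bins, sample_rate, n_bands):
    """Map each linear frequency bin to one of n_bands equal-width ERB bands."""
    nyquist = sample_rate / 2
    max_erb = hz_to_erb(nyquist)
    bands = []
    for k in range(n_bins):
        f_hz = k * nyquist / (n_bins - 1)
        band = int(hz_to_erb(f_hz) / max_erb * n_bands)
        bands.append(min(band, n_bands - 1))  # the Nyquist bin falls in the last band
    return bands

def apply_erb_gains(frame, band_of_bin, gains):
    """Scale every complex bin of one frame by the real gain of its ERB band."""
    return [gains[b] * x for x, b in zip(frame, band_of_bin)]

# Illustrative sizes: 481 bins at 48 kHz grouped into 32 bands.
band_of_bin = erb_band_of_bins(n_bins=481, sample_rate=48000, n_bands=32)
gains = [0.0] + [1.0] * 31          # hypothetical gains: mute only the lowest band
frame = [1 + 1j] * 481              # dummy spectrum for one frame
enhanced = apply_erb_gains(frame, band_of_bin, gains)
print(band_of_bin.count(0), band_of_bin.count(31))
```

Low bands cover only a few bins while high bands cover many, so the model predicts far fewer values than a full-resolution mask while keeping fine resolution where hearing is most sensitive.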

The network is made sparse using separable convolutions and grouping layers. This helps keep the model complexity low.
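
To see why these choices reduce complexity, the following sketch counts weights for a standard versus a depthwise-separable 2D convolution, and for a dense versus a grouped linear layer. The layer sizes are arbitrary examples, biases are ignored, and the numbers are not DeepFilterNet's actual layer dimensions.

```python
def conv2d_params(c_in, c_out, kernel):
    """Weights in a standard 2D convolution (bias ignored)."""
    kh, kw = kernel
    return kh * kw * c_in * c_out

def separable_conv2d_params(c_in, c_out, kernel):
    """Depthwise convolution (one filter per input channel) plus 1x1 pointwise mixing."""
    kh, kw = kernel
    return kh * kw * c_in + c_in * c_out

def grouped_linear_params(n_in, n_out, groups):
    """Weights in a linear layer split into independent groups (bias ignored)."""
    if n_in % groups or n_out % groups:
        raise ValueError("n_in and n_out must be divisible by groups")
    return (n_in // groups) * (n_out // groups) * groups

standard = conv2d_params(64, 64, (3, 3))
separable = separable_conv2d_params(64, 64, (3, 3))
dense = grouped_linear_params(256, 256, groups=1)
grouped = grouped_linear_params(256, 256, groups=8)
print(standard, separable, dense, grouped)
```

For these example sizes the separable convolution needs roughly an eighth of the weights of the standard one, and the grouped layer needs exactly an eighth of the dense one.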

The two-stage filtering approach outperforms complex masks across different frequency resolutions and latencies. It also performs competitively with other state-of-the-art models.

In short, DeepFilterNet is a two-stage deep filtering network that is efficient yet effective at speech enhancement. It combines perceptual properties of human hearing with complex-valued filtering to improve on existing complex mask-based methods.

DeepFilterNet2 Architecture, Image Credits: ResearchGate

DeepFilterNet compares favourably to other speech enhancement models in the following ways:

• Performance - The researchers show that DeepFilterNet outperforms complex mask-based methods across different frequency resolutions and time latencies. It also performs competitively with other state-of-the-art speech enhancement models in terms of objective metrics like PESQ and STOI.

• Approach - DeepFilterNet uses a two-stage approach of firstly enhancing the spectral envelope followed by deep filtering of periodic components. This divides the enhancement process into simpler tasks that the model can learn effectively.

• Frequency perception - By modelling human-like frequency perception through ERB-scaled gains in the first stage, DeepFilterNet is able to more accurately enhance important spectral regions for speech intelligibility.

• Filtering vs masking - Complex filtering allows DeepFilterNet to exploit correlations within frequency bands by incorporating information from before and after each time step. This provides an advantage over complex mask-based methods.

• Model complexity - The use of separable convolutions and grouping layers helps keep DeepFilterNet's complexity low. This makes it more practical for real-world applications compared to larger speech enhancement models.

Overview of DeepFilterNet2 Algorithm, Stage 1, Image Credits: ResearchGate

Potential applications of DeepFilterNet in speech enhancement include:

• Noise reduction - The most direct application is reducing noise and background interference to improve speech intelligibility and quality. This could be used for enhancing noisy phone calls, recordings of lectures or meetings, audiobooks, etc.

• Speaker separation - DeepFilterNet's approach could potentially be adapted to separate the speech of different speakers in a mixture. This could enable applications like digitization and transcription of multi-speaker recordings.

• Hearing aids - By effectively enhancing speech signals, DeepFilterNet or similar models could potentially be integrated into digital hearing aids to improve speech intelligibility for users with hearing loss. This would depend on further optimization for low-latency on-device processing.

• Cochlear implants - For users with severe hearing loss, DeepFilterNet's complex spectral filtering approach may also benefit the processing of speech signals for cochlear implants. This could help improve speech recognition abilities for implant users.

• Voice commands - For applications involving voice commands in noisy environments, DeepFilterNet's noise suppression capabilities could enhance the accuracy of voice recognition systems. This could benefit smart assistants, in-car systems, etc.

• Teleconferencing - DeepFilterNet's speech enhancement could potentially improve the audio quality and intelligibility of teleconference calls with background noise. This would require integration into teleconferencing software.

Read More:

DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering
Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CM) are usually preferred over real-valued masks due to their…
