December learnings (Audio analysis)

Week of 12/26 – 12/31

  • Keras Input function
  • Normalization preprocessing layer (see the sketch after this list)
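
A minimal sketch tying the two together, assuming TensorFlow 2.x (Normalization moved out of experimental in TF 2.6). The 40-feature width and the random stand-in data are placeholders of mine, not from the referenced pages.

import numpy as np
import tensorflow as tf

# Stand-in training features: 100 examples, 40 features each (hypothetical)
data = np.random.rand(100, 40).astype("float32")

# Preprocessing layer: learns per-feature mean and variance from the data
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(data)

# Functional API: a symbolic Input tensor flows through the layers
inputs = tf.keras.Input(shape=(40,))
x = norm(inputs)                         # zero-mean, unit-variance features
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)  # the Model class
model.summary()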

References

  1. Keras
    1. Input object – tf.keras.Input(shape=( ))
    2. Preprocessing – Normalization
    3. Working with Preprocessing layers
    4. The Model Class
  2. What are Symbolic and Imperative APIs in TensorFlow 2.0?
  3. Understanding Sequential vs Functional API in Keras
  4. Masking and Padding with Keras

Week of 12/6 – 12/12

  • What is a Spectrogram? It is a visual representation of a signal that shows its frequency content over time, with time on the x-axis and frequency on the y-axis. It shows not only which frequencies are present at a specific time but also the amplitude of each frequency at that time, indicated by the color of the lines in the spectrogram. In other words, it tells us how the frequency content of the signal changes with time.
    • Spectrograms can be generated either by dividing the time-domain signal into short segments, taking the Fourier Transform of each segment, and stacking these transforms in time (this segmentation is called windowing), or by using a bank of band-pass filters.
  • Discrete Fourier Transform (DFT): converts a finite sequence of time-domain samples into the same number of frequency-domain coefficients.
  • Fast Fourier Transform (FFT): The fast Fourier transform (FFT) is an optimized implementation of the DFT. It converts a signal from the time domain to the frequency domain, a process called the Fourier Transform.
  • What is windowing? When the number of periods in the signal record is not an integer, the endpoints are discontinuous. These artificial discontinuities show up in the FFT as high-frequency components not present in the original signal. These frequencies can be much higher than the Nyquist frequency and are aliased between 0 and half of the sampling rate. The spectrum obtained from an FFT is therefore not the actual spectrum of the original signal, but a smeared version: it appears as if energy at one frequency leaks into other frequencies. This phenomenon is known as spectral leakage, and it causes fine spectral lines to spread into wider peaks. Leakage can be minimized by a technique called windowing: multiplying the time record by a finite-length window whose amplitude varies smoothly and gradually toward zero at the edges. This makes the endpoints of the waveform meet and therefore results in a continuous waveform without sharp transitions. This technique is also referred to as applying a window; see the leakage sketch after this list.
    • There are many types of window functions. In general, the Hanning (Hann) window is satisfactory in 95 percent of cases.
  • Short Term Fourier Transform (STFT): Typically, a spectrogram is calculated by computing the FFT over a series of overlapping windows extracted from the original signal. The process of dividing the signal into short-term sequences of fixed size and applying the FFT to each of them independently is called the STFT. The spectrogram is then calculated as the complex magnitude of the STFT. Extracting the short-term windows from the original signal affects the calculated spectrum by producing the leakage artifacts described above, so we use different types of window functions when extracting the windows to control the leakage; see the STFT sketch after this list.
A recommendation for choosing a window, per Reference 8 below: always experiment with different window types to see what works best for your problem. There is no universal answer.
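
A minimal sketch of spectral leakage and windowing (my own illustration, assuming only NumPy; the 52.3 Hz tone and 1024-sample record are arbitrary choices). A sine that does not complete an integer number of periods in the record leaks energy across the spectrum; a Hann window suppresses most of it.

import numpy as np

fs = 1000                                # sampling rate in Hz (chosen for illustration)
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 52.3 * t)         # 52.3 Hz: a non-integer number of periods

rect = np.abs(np.fft.rfft(x))                        # no window (rectangular)
hann = np.abs(np.fft.rfft(x * np.hanning(len(x))))   # Hann-windowed FFT

freqs = np.fft.rfftfreq(len(x), d=1 / fs)
bin_200 = np.argmin(np.abs(freqs - 200))             # a bin far from the tone
# Leakage far from 52.3 Hz is orders of magnitude lower after windowing
print(f"rectangular at 200 Hz: {rect[bin_200]:.4f}")
print(f"Hann at 200 Hz:        {hann[bin_200]:.4f}")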
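
And a short STFT sketch (again my own, assuming SciPy): the spectrogram is the complex magnitude of the STFT, and the window type is an explicit parameter to experiment with, per the recommendation above.

import numpy as np
from scipy import signal

fs = 1000
t = np.arange(8192) / fs
x = np.sin(2 * np.pi * 52.3 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)

# Overlapping Hann-windowed segments of 256 samples; swap window='hann' for
# 'hamming', 'blackman', etc. to compare leakage behavior
f, frames, Zxx = signal.stft(x, fs=fs, window='hann', nperseg=256, noverlap=192)
spectrogram = np.abs(Zxx)                # complex magnitude of the STFT
print(spectrogram.shape)                 # (frequency bins, time frames)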

References

  1. Seeing sound: What is a spectrogram?
  2. Spectrogram explained and more – BEST Explanation among all
  3. Spectrogram – an Introduction (video)
  4. Analyzing the Spectrogram (video)
  5. Spectrogram explained (video)
  6. What is a spectrogram?
  7. Understanding Windowing and FFT – GOOD SOURCE to refer to
  8. Audio Spectrogram – NVIDIA
  9. Theory behind MelSpectrogram – Great Youtube channel
  10. Extracting Mel Spectrogram in Python – Youtube video
  11. Understanding the Mel Spectrogram (https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53) – a good article on Mel spectrograms
  12. Mel-Spectrogram and MFCCs | Lecture 72 (Part 1) | Applied Deep Learning
  13. LSTMs
    1. Stateful LSTMs with Keras
      1. Code
    2. Bayesian Optimization Example
    3. Stateful LSTM model training in Keras
    4. Counting the number of parameters in an LSTM cell
    5. Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras – Jason Brownlee
    6. https://github.com/keras-team/keras/issues/6168 – Helpful discussion on stateful/stateless LSTM

Week of 11/29 – 12/5

  • Audio terminology and its analysis in Python
    • Amplitude, Phase, Frequency, Wavelength
    • Fourier Transform: converting a signal from time domain to frequency domain
    • Spectrogram: frequency vs time graph
    • Short Term Fourier Transform (STFT)
    • Audio feature extraction
      • Spectral Centroid: The spectral centroid indicates the frequency at which the energy of a spectrum is centered, like a magnitude-weighted mean of the frequencies. In librosa, the spectral centroid can be found as librosa.feature.spectral_centroid(y=x, sr=sr)[0], where x is the audio signal and sr is the sampling frequency.
      • Spectral Rolloff: It is a measure of the shape of the signal. It represents the frequency below which a specified percentage of the total spectral energy lies (85% by default in librosa). librosa.feature.spectral_rolloff computes the rolloff frequency for each frame in a signal.
      • Spectral Bandwidth: librosa.feature.spectral_bandwidth computes the order-p spectral bandwidth.
      • Zero Crossing Rate: the rate at which the signal changes sign. librosa.zero_crossings(x[n0:n1], pad=False)
      • Mel frequency Cepstral Coefficients (MFCC): The Mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of the spectral envelope. They model the characteristics of the human voice. librosa.feature.mfcc(y=x, sr=sr)
      • Chroma feature: A chroma feature or vector is typically a 12-element feature vector indicating how much energy of each pitch class, {C, C#, D, D#, E, …, B}, is present in the signal. In short, it provides a robust way to describe a similarity measure between music pieces. librosa.feature.chroma_stft is used for the computation of chroma features.
import librosa
import librosa.display
import sklearn.preprocessing
import matplotlib.pyplot as plt

# load the audio file (path elided in the original notes); librosa.load
# returns the samples and the sampling rate
x, sr = librosa.load('.../file.wav', sr=44000)

# Find the STFT of x and display it
X = librosa.stft(x) 
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

# convert to a log scale if needed
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')  
plt.colorbar()

# Find spectral centroid at each frame of the audio signal
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
print(spectral_centroids.shape)  # e.g. (775,)

# Computing the time variable for visualization
plt.figure(figsize=(12, 4))
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames)

# Normalizing the spectral centroid for visualization
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

# Plotting the spectral centroid along the waveform
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='b')

# Finding the spectral rolloff
spectral_rolloff = librosa.feature.spectral_rolloff(y=x + 0.01, sr=sr)[0]
plt.figure(figsize=(12, 4))
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_rolloff), color='r')

# Find the spectral bandwidth
spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(y=x + 0.01, sr=sr)[0]
spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(y=x + 0.01, sr=sr, p=3)[0]
spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(y=x + 0.01, sr=sr, p=4)[0]
plt.figure(figsize=(15, 9))
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_bandwidth_2), color='r')
plt.plot(t, normalize(spectral_bandwidth_3), color='g')
plt.plot(t, normalize(spectral_bandwidth_4), color='y')
plt.legend(('p = 2', 'p = 3', 'p = 4'))

# Zero Crossing Rate
# Plot the signal:
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)
# Zooming in
n0 = 9000
n1 = 9100
plt.figure(figsize=(14, 5))
plt.plot(x[n0:n1])
plt.grid()

zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
print(sum(zero_crossings))  # e.g. 16

# MFCC
mfccs = librosa.feature.mfcc(y=x, sr=sr)
print(mfccs.shape)  # e.g. (20, 97)

# Displaying the MFCCs:
plt.figure(figsize=(15, 7))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')

# Chroma feature (hop_length was not defined above; 512 is the librosa default)
hop_length = 512
chromagram = librosa.feature.chroma_stft(y=x, sr=sr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')

References

  1. Audio Analysis using Python – extracts features from the audios, creates a CSV file and constructs a simple neural network with dense layers.
  2. CNNs for audio classification
  3. Bidirectional LSTM for audio labeling with Keras
  4. An Overview of Automatic Audio Segmentation – paper
  5. Audio Augmentation for Speech Recognition – paper
  6. Exploring data augmentation for improved singing voice detection with neural networks – paper
  7. mixup: Beyond Empirical Risk Minimization – paper