What is a Spectrogram? A spectrogram is a visual representation of the frequency content of a signal, with time on the x-axis and frequency on the y-axis. It shows not only which frequencies are present at a given time but also the amplitude of each frequency at that time, indicated by the color at each point. In other words, it tells us how the frequency content of the signal changes over time.
Spectrograms can be generated either by dividing the signal into segments in the time domain, computing the Fourier Transform of each segment, and stacking these transforms in time (an approach based on windowing), or by using a bank of band-pass filters.
Discrete Fourier Transform (DFT): The DFT converts a finite sequence of time-domain samples into its frequency-domain representation.
Fast Fourier Transform (FFT): The FFT is an optimized algorithm for computing the DFT. It is used to convert a signal from the time domain to the frequency domain, a process called the Fourier Transform.
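As a quick illustration (using NumPy and a synthetic test signal; the frequencies and durations here are arbitrary choices), the FFT of a pure sine wave shows a peak at the sine's frequency:

```python
import numpy as np

# A synthetic 2-second test signal: a 50 Hz sine sampled at 1000 Hz.
fs = 1000
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t)

# FFT: time domain -> frequency domain (rfft keeps the non-negative frequencies).
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The magnitude spectrum peaks at the sine's frequency.
peak = freqs[np.argmax(np.abs(spectrum))]
print(peak)  # 50.0
```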
What is windowing? When the number of periods in the signal is not an integer, the endpoints are discontinuous. These artificial discontinuities show up in the FFT as high-frequency components not present in the original signal. These frequencies can be much higher than the Nyquist frequency and are aliased between 0 and half the sampling rate. The spectrum you get from an FFT, therefore, is not the actual spectrum of the original signal, but a smeared version: it appears as if energy at one frequency leaks into other frequencies. This phenomenon is known as spectral leakage, and it causes fine spectral lines to spread into wider lobes. It can be minimized by a technique called windowing. Windowing consists of multiplying the time record by a finite-length window whose amplitude varies smoothly and gradually toward zero at the edges. This makes the endpoints of the waveform meet, resulting in a continuous waveform without sharp transitions. This technique is also referred to as applying a window.
There are many types of window functions. In general, the Hanning (Hann) window is satisfactory in 95 percent of cases.
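To see the effect, here is a small NumPy sketch (the signal, the 200 Hz cutoff, and the 100x factor are arbitrary choices for illustration): a sine with a non-integer number of periods leaks energy across the whole spectrum, and multiplying the record by a Hann window suppresses that leakage by orders of magnitude:

```python
import numpy as np

fs = 1000
t = np.arange(0, 1, 1 / fs)
# 50.5 Hz: a non-integer number of periods in the record, so the
# endpoints are discontinuous and the FFT suffers spectral leakage.
x = np.sin(2 * np.pi * 50.5 * t)

window = np.hanning(len(x))
xw = x * window  # the windowed record tapers smoothly to zero at the edges

spec_rect = np.abs(np.fft.rfft(x))   # no window (rectangular)
spec_hann = np.abs(np.fft.rfft(xw))  # Hann window

# Energy far from 50.5 Hz (say, above 200 Hz) is pure leakage; the
# Hann window reduces it dramatically compared to no window at all.
freqs = np.fft.rfftfreq(len(x), 1 / fs)
far = freqs > 200
print(spec_rect[far].max() > 100 * spec_hann[far].max())  # True
```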
Short-Time Fourier Transform (STFT): Typically, a spectrogram is calculated by computing the FFT over a series of overlapping windows extracted from the original signal. The process of dividing the signal into short fixed-size segments and applying the FFT to each one independently is called the STFT. The spectrogram is then calculated as the complex magnitude of the STFT. Extracting short-term windows from the original signal distorts the calculated spectrum, producing the spectral leakage described above. To control the leakage, different types of window functions are applied when extracting the windows.
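The process can be sketched in plain NumPy (a minimal illustration of the idea, not librosa's implementation; the frame size, hop size, and test signal are arbitrary choices):

```python
import numpy as np

def stft(signal, frame_size=1024, hop=512):
    """Minimal STFT sketch: slide a Hann window along the signal,
    FFT each frame, and stack the frames column-wise."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        frames.append(np.fft.rfft(frame))
    # shape: (frequency bins, time frames), like librosa.stft
    return np.array(frames).T

# Usage with a synthetic 1-second, 440 Hz tone
fs = 8000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)
S = stft(x)
spectrogram = np.abs(S)  # complex magnitude of the STFT
print(spectrogram.shape)  # (513, 14)
```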
Fourier Transform: converting a signal from time domain to frequency domain
Spectrogram: frequency vs time graph
Short-Time Fourier Transform (STFT): FFTs of short, overlapping windows stacked in time
Audio feature extraction
Spectral Centroid: The spectral centroid indicates the frequency on which the energy of a spectrum is centered. It is a magnitude-weighted mean of the frequencies. In librosa, the spectral centroid can be found as: librosa.feature.spectral_centroid(x, sr=sr)[0], where x is the audio signal and sr is the sampling rate.
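The weighted mean can be computed by hand for a single frame (a sketch of the definition, not librosa's exact implementation; the synthetic frame and its parameters are arbitrary choices placed on an exact FFT bin so there is no leakage):

```python
import numpy as np

# Spectral centroid as a weighted mean, computed by hand for one frame.
fs = 22050
N = 2048
k = 100                        # an exact FFT bin, so there is no leakage
f0 = k * fs / N                # ~1076.7 Hz
frame = np.sin(2 * np.pi * f0 * np.arange(N) / fs)

magnitudes = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(N, 1 / fs)

# Centroid = sum(f * |X(f)|) / sum(|X(f)|): frequencies weighted by magnitude.
centroid = np.sum(freqs * magnitudes) / np.sum(magnitudes)
print(round(centroid, 1))  # ~1076.7, the frequency of the sine itself
```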
Spectral Rolloff: It is a measure of the shape of the spectrum. It represents the frequency below which a specified fraction of the total spectral energy lies (85% by default in librosa). librosa.feature.spectral_rolloff computes the rolloff frequency for each frame of a signal.
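The definition can be sketched for a single frame (a toy NumPy example, not librosa's implementation; the two-sine test signal and its frequencies are arbitrary choices):

```python
import numpy as np

# Spectral rolloff sketch: the frequency below which a given fraction
# of the cumulative spectral magnitude lies (librosa defaults to 85%).
def rolloff(magnitudes, freqs, roll_percent=0.85):
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    return freqs[np.searchsorted(cumulative, threshold)]

fs = 8000
t = np.arange(2048) / fs
# Two sines: most energy at 500 Hz, some at 2000 Hz.
frame = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 2000 * t)
mags = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(len(frame), 1 / fs)
# Crossing 85% of the total requires the 2000 Hz component as well,
# so the rolloff lands at ~2000 Hz.
print(rolloff(mags, freqs))  # ~2000.0
```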
Zero Crossing Rate: the rate at which the signal changes sign. librosa.zero_crossings(x[n0:n1], pad=False)
Mel Frequency Cepstral Coefficients (MFCC): The Mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of a spectral envelope. They model the characteristics of the human voice. librosa.feature.mfcc(x, sr=sr)
Chroma feature: A chroma feature or vector is typically a 12-element feature vector indicating how much energy of each pitch class, {C, C#, D, D#, E, …, B}, is present in the signal. In short, it provides a robust way to describe a similarity measure between music pieces. librosa.feature.chroma_stft is used for the computation of chroma features.
import librosa
import librosa.display
import sklearn
import matplotlib.pyplot as plt

# load the audio file (librosa.load returns both the samples and the sampling rate)
x, sr = librosa.load('.../file.wav', sr=44000)
# Find the STFT of x and display it
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()
# convert the frequency axis to a log scale if needed
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')
plt.colorbar()
# Find spectral centroid at each frame of the audio signal
spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
spectral_centroids.shape
(775,)
# Computing the time variable for visualization
plt.figure(figsize=(12, 4))
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames)
# Normalising the spectral centroid for visualisation
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)
# Plotting the Spectral Centroid along the waveform
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='b')
# Finding the spectral roll off
spectral_rolloff = librosa.feature.spectral_rolloff(x+0.01, sr=sr)[0]
plt.figure(figsize=(12, 4))
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_rolloff), color='r')
# Find the spectral bandwidth
spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(x+0.01, sr=sr)[0]
spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(x+0.01, sr=sr, p=3)[0]
spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(x+0.01, sr=sr, p=4)[0]
plt.figure(figsize=(15, 9))
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_bandwidth_2), color='r')
plt.plot(t, normalize(spectral_bandwidth_3), color='g')
plt.plot(t, normalize(spectral_bandwidth_4), color='y')
plt.legend(('p = 2', 'p = 3', 'p = 4'))
# Zero Crossing Rate
# Plot the signal:
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)
# Zooming in
n0 = 9000
n1 = 9100
plt.figure(figsize=(14, 5))
plt.plot(x[n0:n1])
plt.grid()
zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
print(sum(zero_crossings))  # 16
# MFCC
mfccs = librosa.feature.mfcc(x, sr=sr)
print(mfccs.shape)
(20, 97)
# Displaying the MFCCs:
plt.figure(figsize=(15, 7))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
# Chroma feature
hop_length = 512  # librosa's default hop length
chromagram = librosa.feature.chroma_stft(x, sr=sr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')