May learnings – Audio classification basics – My Research Work and Life Adventures

Week of 5/24/21 – 5/30/21

Audio Classification
Sample: A single measurement of the signal's amplitude at one point in time.
Sample Rate: The number of samples taken per second. CD-quality audio typically uses 44,100 samples per second (44.1 kHz).
Spectrum: The set of frequencies that combine to produce a signal.
Spectrogram: A representation of how a signal's spectrum changes over time, produced by applying the Fourier Transform to short, successive windows of the signal. Common variants:
- Mel Spectrograms
- Mel Frequency Cepstral Coefficients (MFCC)
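To make the sample rate and spectrum definitions concrete, here is a minimal sketch using only the Python standard library. It synthesizes a signal from two sine waves and recovers their frequencies with a naive discrete Fourier transform. The 8,000 Hz sample rate, the 0.1 s duration, and the function names are my own choices for illustration; a real pipeline would use an FFT from a library such as librosa or NumPy. Repeating the same transform on short overlapping windows is what produces a spectrogram.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed; kept low so the naive DFT stays fast)
DURATION = 0.1       # seconds of audio to synthesize (assumed)


def synthesize(freqs, sample_rate=SAMPLE_RATE, duration=DURATION):
    """Sum of unit-amplitude sine waves, sampled sample_rate times per second."""
    n = int(round(sample_rate * duration))
    return [sum(math.sin(2 * math.pi * f * t / sample_rate) for f in freqs)
            for t in range(n)]


def magnitude_spectrum(samples):
    """Naive O(N^2) DFT; returns the magnitude of bins 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def dominant_frequencies(samples, sample_rate, count):
    """Return the `count` strongest frequencies in Hz, in ascending order."""
    mags = magnitude_spectrum(samples)
    bin_hz = sample_rate / len(samples)   # width of one frequency bin in Hz
    top = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * bin_hz for k in top)


signal = synthesize([440, 1000])
peaks = dominant_frequencies(signal, SAMPLE_RATE, 2)
```

With 800 samples at 8,000 Hz each bin is 10 Hz wide, so both tones fall exactly on a bin and `peaks` contains the two original frequencies.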
How to convert from .mp3 to .wav, and how to download audio from YouTube videos
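One possible way to do the .mp3 to .wav conversion is to call the ffmpeg command-line tool from Python. This is an assumption on my part, not necessarily the method from the linked resource, and it requires ffmpeg to be installed and on the PATH. The function names and default values below are hypothetical.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=44100, channels=1):
    """Build the ffmpeg argument list for converting src to dst.

    -y  overwrite the output file if it exists
    -i  input file
    -ar output audio sample rate in Hz
    -ac number of output audio channels
    """
    return ["ffmpeg", "-y", "-i", src,
            "-ar", str(sample_rate), "-ac", str(channels), dst]


def convert(src, dst, **kwargs):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst, **kwargs), check=True)
```

Usage would look like `convert("clip.mp3", "clip.wav", sample_rate=22050)`, where the file names are placeholders.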

Python libraries for processing audio: Librosa, OpenL3
Representation Learning [3]: Learning how to represent input data, usually by applying transformations to it, so that downstream tasks become easier. Example tasks:
- Automatic Speech Recognition (ASR)
- Speaker Recognition (SR)
- Speech Emotion Recognition (SER)
Feature Extraction techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA), Multidimensional Scaling (MDS), Independent Component Analysis (ICA), and Non-Negative Matrix Factorization (NNMF)
Gaussian Mixture Model (GMM)
Hidden Markov Model (HMM)
Maximum Likelihood Estimation (MLE)
Deep Belief Networks (DBN)
Metrics for evaluation: Word Error Rate (WER) for speech recognition
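As a rough illustration of WER: it is the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer's hypothesis, divided by the number of words in the reference. The sketch below splits on whitespace only, with no text normalization, which is a simplification; the function name is my own.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.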
Training loss:
s-plane [4]
Fourier transform
Contrastive loss
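For the contrastive loss, a minimal sketch of the pairwise, margin-based form is below. I am assuming the common formulation in which pairs from the same class are pulled together and pairs from different classes are pushed apart until they are at least `margin` apart; other variants exist, and the 0.5 factor and the default margin are conventions rather than requirements.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pairwise contrastive loss for one pair of embeddings.

    same_class=True : penalize any distance between the pair.
    same_class=False: penalize only if the pair is closer than `margin`.
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

A dissimilar pair that is already farther apart than the margin contributes zero loss, so training effort goes to the pairs that are still confusable.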

References

Week of 5/17/21 – 5/23/21

Google Replaces BERT Self-Attention with Fourier Transform: 92% Accuracy, 7 Times Faster on GPUs
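The core idea, as I understand it from the FNet paper, is that the self-attention sublayer is replaced by a 2D discrete Fourier transform: one DFT along the hidden dimension, one along the sequence dimension, keeping only the real part. Below is a toy sketch with a naive DFT on nested lists. It is for intuition only; a real implementation would use an FFT on tensors, and the function names here are my own.

```python
import cmath
import math


def dft(values):
    """Naive 1D discrete Fourier transform of a list of numbers."""
    n = len(values)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(values))
            for k in range(n)]


def fourier_mixing(tokens):
    """Token mixing without attention.

    tokens: list of seq_len rows, each a list of hidden_dim floats.
    Returns a list of the same shape holding the real part of the 2D DFT.
    """
    # DFT along the hidden dimension (within each token's vector).
    rows = [dft(row) for row in tokens]
    # DFT along the sequence dimension (across tokens, per hidden index).
    cols = [dft(list(col)) for col in zip(*rows)]
    # Transpose back to (seq_len, hidden_dim) and keep the real part.
    return [[c.real for c in row] for row in zip(*cols)]


mixed = fourier_mixing([[1.0, 1.0, 1.0] for _ in range(4)])
```

The transform has no learned parameters, which is where the speed-up comes from: every output position still depends on every input token, but nothing has to be trained or computed per pair of tokens.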

References

Resources to prep for the GCP Professional ML Certification Exam.
FNet: Mixing Tokens with Fourier Transforms (paper)
FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained) (Video)
FNet: Mixing Tokens with Fourier Transforms (Unofficial implementation)
FNet: Mixing Tokens with Fourier Transforms – Paper Explained (Video)
https://www.jezzamon.com/fourier/index.html – excellent explanation of the Fourier Series
An Interactive Guide To The Fourier Transform

Week of 5/10/21 – 5/16/21

None

Week of 5/3/21 – 5/9/21

Audio classification:
- Raw audio wave and 1D convolutions
- Log-Mel spectrogram
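The two ingredients of a log-mel spectrogram can be sketched with the standard library alone: a mapping from Hz to the mel scale, and log compression of power values. I am assuming the widely used 2595 * log10(1 + f / 700) mel formula here; libraries such as librosa may default to a different variant, so the numbers will not match them exactly. The function names and the choice of 8 bands up to 8,000 Hz are only for illustration.

```python
import math


def hz_to_mel(hz):
    """Hz to mel, using the 2595 * log10(1 + f / 700) formula (assumed variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, f_min, f_max):
    """n_mels + 2 frequencies in Hz, equally spaced on the mel scale.

    Consecutive triples of these edges define the triangular filters
    of a mel filter bank.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def power_to_db(power, floor=1e-10):
    """Log compression: power to decibels, with a floor to avoid log(0)."""
    return 10.0 * math.log10(max(power, floor))


edges = mel_band_edges(n_mels=8, f_min=0.0, f_max=8000.0)
```

The band edges get wider as frequency increases, which mirrors how human hearing resolves low frequencies more finely than high ones.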
Libraries:
- librosa – Python library
Datasets:
- FSDKaggle2018: FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.
  - 41 categories are: “Acoustic_guitar”, “Applause”, “Bark”, “Bass_drum”, “Burping_or_eructation”, “Bus”, “Cello”, “Chime”, “Clarinet”, “Computer_keyboard”, “Cough”, “Cowbell”, “Double_bass”, “Drawer_open_or_close”, “Electric_piano”, “Fart”, “Finger_snapping”, “Fireworks”, “Flute”, “Glockenspiel”, “Gong”, “Gunshot_or_gunfire”, “Harmonica”, “Hi-hat”, “Keys_jangling”, “Knock”, “Laughter”, “Meow”, “Microwave_oven”, “Oboe”, “Saxophone”, “Scissors”, “Shatter”, “Snare_drum”, “Squeak”, “Tambourine”, “Tearing”, “Telephone”, “Trumpet”, “Violin_or_fiddle”, “Writing”.
- UrbanSound8K: This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy.
  - All excerpts are taken from field recordings uploaded to www.freesound.org. The files are pre-sorted into ten folds (folders named fold1-fold10) to help in the reproduction of and comparison with the automatic classification results reported in the article above.
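Since the files come pre-sorted into fold1-fold10, the natural evaluation is 10-fold cross-validation where each fold is held out once. A minimal sketch of building those splits from file paths is below. The `audio/fold3/clip0.wav` style paths are hypothetical stand-ins, and the function name is my own.

```python
def leave_one_fold_out(paths, n_folds=10):
    """Yield (test_fold, train_paths, test_paths) for each predefined fold.

    Assumes each path has a parent folder named fold1..foldN.
    """
    by_fold = {k: [] for k in range(1, n_folds + 1)}
    for p in paths:
        folder = p.replace("\\", "/").split("/")[-2]   # e.g. "fold3"
        by_fold[int(folder[len("fold"):])].append(p)

    for test_fold in range(1, n_folds + 1):
        train = [p for k, items in by_fold.items() if k != test_fold
                 for p in items]
        yield test_fold, train, by_fold[test_fold]


# Hypothetical file names, three clips per fold.
paths = [f"audio/fold{k}/clip{i}.wav" for k in range(1, 11) for i in range(3)]
splits = list(leave_one_fold_out(paths))
```

Keeping the predefined folds, rather than reshuffling the files, is what makes results comparable with previously reported numbers.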
Transforming audio:
- https://www.kaggle.com/tanulsingh077/audio-albumentations-transform-your-audio
- https://www.skytopia.com/software/sonicphoto/: Software to convert pictures to audio

References

Audio Classification : A Convolutional Neural Network Approach:
1. https://github.com/CVxTz/audio_classification
