May learnings – Audio classification basics

Week of 5/24/21 – 5/30/21

  • Audio Classification
  • Sample
  • Sample Rate: The number of samples captured per second. CD-quality audio typically uses 44,100 samples per second (44.1 kHz).
  • Spectrum: It represents the set of frequencies that are combined together to produce a signal.
  • Spectrogram: A plot of a signal's frequency content over time, produced by applying the Fourier Transform to short, successive windows of the signal
    • Mel Spectrograms
    • Mel Frequency Cepstral Coefficients (MFCC)
  • How to convert .mp3 to .wav, and how to download audio from YouTube videos
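To make the ideas of sample, sample rate, and spectrum above concrete, here is a minimal sketch using only the Python standard library. It generates a test tone, round-trips it through an in-memory .wav file, and finds the dominant frequency with a naive discrete Fourier transform. The 8,000 Hz sample rate, the 440 Hz tone, and all function names are assumptions chosen for illustration; in practice a library such as librosa would do this work.

```python
import cmath
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # samples per second (assumed; low so the naive DFT stays fast)
TONE_HZ = 440.0     # frequency of the test tone (assumed)
N_SAMPLES = 800     # 0.1 s of audio at 8000 samples per second


def make_tone(freq_hz, n_samples, sample_rate):
    """Return a sine tone as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def write_wav(samples, sample_rate):
    """Encode float samples as 16-bit mono PCM and return the WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767))
                                 for s in samples))
    return buf.getvalue()


def read_wav(wav_bytes):
    """Decode 16-bit mono WAV bytes into (float samples, sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)
    ints = struct.unpack("<%dh" % n_frames, raw)
    return [i / 32767 for i in ints], sample_rate


def magnitude_spectrum(samples):
    """Naive O(n^2) DFT; returns magnitudes for bins 0 .. n/2."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n // 2 + 1)]


wav_bytes = write_wav(make_tone(TONE_HZ, N_SAMPLES, SAMPLE_RATE), SAMPLE_RATE)
samples, sample_rate = read_wav(wav_bytes)

spectrum = magnitude_spectrum(samples)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
# Bin k corresponds to a frequency of k * sample_rate / n Hz.
peak_hz = peak_bin * sample_rate / len(samples)
print(sample_rate, peak_hz)
```

Running the same transform on short, successive windows of a longer signal, and stacking the resulting spectra side by side, gives a spectrogram.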
From Reference 1 below
  • Python libraries for processing audio: Librosa, OpenL2
  • Representation Learning[3]: How to represent input data, usually done through various transformations applied to the input data.
    • Automatic Speech Recognition (ASR)
    • Speaker Recognition (SR)
    • Speaker Emotion Recognition (SER)
  • Feature Extraction techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA), Multidimensional Scaling (MDS), Independent Component Analysis (ICA), and Non-Negative Matrix Factorization (NNMF)
  • Gaussian Mixture Model (GMM)
  • Hidden Markov Model (HMM)
  • Maximum Likelihood Estimation (MLE)
  • Deep Belief Networks (DBN)
  • Metrics for evaluation: Word Error Rate (WER) for speech recognition
  • Training loss:
  • s-plane [4]
  • Fourier transform
  • Contrastive loss
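As an illustration of the Word Error Rate metric listed above, here is a minimal sketch. WER is commonly defined as the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a hypothesis, divided by the number of words in the reference. The function name and the example sentences are made up for illustration, and tokenization is assumed to be a plain whitespace split.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.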

References

  1. Audio Deep Learning Made Simple (Part 1): State-of-the-Art Techniques
  2. https://towardsdatascience.com/audio-classification-with-pre-trained-vgg-19-keras-bca55c2a0efe
  3. Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends (paper)
  4. https://dsp.stackexchange.com/questions/40491/on-the-meaning-of-s-plane-and-its-link-to-a-transfer-function (s-plane)
  5. http://werner.yellowcouch.org/Papers/zvss/index.html (s-plane vs z-plane)
  6. Relation of Z-transform with Fourier and Laplace transforms – DSP

Week of 5/17/21 – 5/23/21

References

  1. Resources to prep for the GCP Professional ML Certification Exam.
  2. FNet: Mixing Tokens with Fourier Transforms (paper)
  3. FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained) (Video)
  4. FNet: Mixing Tokens with Fourier Transforms (Unofficial implementation)
  5. FNet: Mixing Tokens with Fourier Transforms – Paper Explained (Video)
  6. https://www.jezzamon.com/fourier/index.html – excellent explanation of the Fourier Series
  7. An Interactive Guide To The Fourier Transform

Week of 5/10/21 – 5/16/21

None

Week of 5/3/21 – 5/9/21

  • Audio classification:
  • Libraries:
    • librosa – python library
  • Datasets:
    • FSDKaggle2018: FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.
      • 41 categories are: “Acoustic_guitar”, “Applause”, “Bark”, “Bass_drum”, “Burping_or_eructation”, “Bus”, “Cello”, “Chime”, “Clarinet”, “Computer_keyboard”, “Cough”, “Cowbell”, “Double_bass”, “Drawer_open_or_close”, “Electric_piano”, “Fart”, “Finger_snapping”, “Fireworks”, “Flute”, “Glockenspiel”, “Gong”, “Gunshot_or_gunfire”, “Harmonica”, “Hi-hat”, “Keys_jangling”, “Knock”, “Laughter”, “Meow”, “Microwave_oven”, “Oboe”, “Saxophone”, “Scissors”, “Shatter”, “Snare_drum”, “Squeak”, “Tambourine”, “Tearing”, “Telephone”, “Trumpet”, “Violin_or_fiddle”, “Writing”.
    • UrbanSound8K: This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy.
      • All excerpts are taken from field recordings uploaded to www.freesound.org. The files are pre-sorted into ten folds (folders named fold1-fold10) to help in the reproduction of and comparison with the automatic classification results reported in the article above.
  • Transforming audios:
  • https://www.skytopia.com/software/sonicphoto/: Software to convert picture to audio
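The pre-sorted folds of UrbanSound8K described above suggest a leave-one-fold-out evaluation: train on nine folders, test on the remaining one, and repeat for each fold. The sketch below only enumerates those splits by folder name. The function name is hypothetical, and loading the audio files themselves is left out.

```python
def leave_one_fold_out(n_folds=10):
    """Yield (train_folds, test_fold) pairs over folders fold1..foldN."""
    folds = ["fold%d" % i for i in range(1, n_folds + 1)]
    for test_fold in folds:
        train_folds = [f for f in folds if f != test_fold]
        yield train_folds, test_fold


splits = list(leave_one_fold_out())
for train_folds, test_fold in splits:
    print("test on", test_fold, "train on", len(train_folds), "folds")
```

Keeping the predefined folds, rather than reshuffling the files, is what makes results comparable with other published numbers on the dataset.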

References

  1. Audio Classification : A Convolutional Neural Network Approach:
    1. https://github.com/CVxTz/audio_classification
