SoundAnalysisPreprocessing

A model that takes audio signal samples as input and outputs an array of preprocessed samples according to the specified preprocessing type.

message SoundAnalysisPreprocessing {

    // Specific preprocessing types for sound analysis

       Vggish: preprocesses input audio samples so that they can
       be fed to the Vggish feature extractor.
       c.f. https://arxiv.org/pdf/1609.09430.pdf

       The preprocessing takes as input single-channel (monophonic) audio,
       975 milliseconds long and sampled at 16 kHz, i.e., a 1D multiarray of
       15600 samples, and produces preprocessed samples as a multiarray of
       shape [1, 96, 64] (96 frames by 64 mel bins).

     (1) Splits the input audio samples into overlapping frames, where each
         frame is 25 milliseconds long and hops forward by 10 milliseconds.
         Any partial frames at the end are dropped.

     (2) Hann window: applies a periodic Hann window with a window length of
         25 milliseconds, which corresponds to 400 samples at a 16 kHz
         sampling rate:

         w(n) = 0.5 - 0.5 * cos(2*pi*n/window_length_samples),
         where 0 <= n <= window_length_samples - 1 and window_length_samples = 400

         The Hann window is then applied to each frame:

         windowed_frame(n) = frame(n) * w(n),
         where 0 <= n <= window_length_samples - 1 and window_length_samples = 400

     (3) Power spectrum: calculates the short-time Fourier transform
         magnitude, with an FFT length of 512.

     (4) Log Mel filter bank: calculates a log-magnitude mel-frequency
         spectrogram with a minimum frequency of 125 Hz, a maximum frequency
         of 7500 Hz, 64 mel bins, a log_offset of 0.01, and 64 spectrum bins.

    message Vggish {
        // no specific parameter
    }

    // Sound analysis preprocessing type
    oneof SoundAnalysisPreprocessingType {
        Vggish vggish = 20;
    }

}
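
Steps (1) and (2) of the Vggish description can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not the Core ML implementation: the function names and the 440 Hz test signal are hypothetical, and the hop of 160 samples is simply 10 milliseconds at 16 kHz. It shows why a 15600-sample input yields exactly 96 frames.

```python
import math

SAMPLE_RATE = 16000        # Hz, from the description
WINDOW_LENGTH = 400        # 25 ms at 16 kHz
HOP_LENGTH = 160           # 10 ms at 16 kHz
NUM_INPUT_SAMPLES = 15600  # 975 ms at 16 kHz


def periodic_hann(length):
    """w(n) = 0.5 - 0.5 * cos(2*pi*n / length) for 0 <= n <= length - 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def split_into_frames(samples, window_length=WINDOW_LENGTH, hop_length=HOP_LENGTH):
    """Return overlapping frames; a trailing partial frame is dropped."""
    if len(samples) < window_length:
        return []
    num_frames = 1 + (len(samples) - window_length) // hop_length
    return [
        samples[i * hop_length : i * hop_length + window_length]
        for i in range(num_frames)
    ]


def apply_window(frame, window):
    """windowed_frame(n) = frame(n) * w(n)."""
    return [x * w for x, w in zip(frame, window)]


# Hypothetical test signal: a 440 Hz sine, 975 ms long.
samples = [
    math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE)
    for n in range(NUM_INPUT_SAMPLES)
]
window = periodic_hann(WINDOW_LENGTH)
frames = split_into_frames(samples)
windowed = [apply_window(frame, window) for frame in frames]

print(len(windowed), len(windowed[0]))  # 96 400
```

The frame count follows from 1 + (15600 - 400) // 160 = 96, which matches the second dimension of the [1, 96, 64] output shape.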

SoundAnalysisPreprocessing.Vggish

message Vggish {
    // no specific parameter
}
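
Because Vggish carries no parameters, the constants in steps (3) and (4) of the SoundAnalysisPreprocessing description are fixed. The sketch below applies them to a single windowed frame using only the Python standard library. Several details are assumptions that the specification does not state: the HTK-style mel formula, the triangular filter shape, zero-padding the 400-sample frame to the FFT length, and mapping the 257 one-sided bins of a 512-point transform onto the 64 mel bins. All function names are hypothetical, and the naive DFT is chosen for readability rather than speed.

```python
import cmath
import math

SAMPLE_RATE = 16000
FFT_LENGTH = 512
NUM_MEL_BINS = 64
MEL_MIN_HZ = 125.0
MEL_MAX_HZ = 7500.0
LOG_OFFSET = 0.01


def magnitude_spectrum(windowed_frame, fft_length=FFT_LENGTH):
    """Magnitude of the one-sided DFT of a zero-padded frame (naive O(N^2))."""
    padded = list(windowed_frame) + [0.0] * (fft_length - len(windowed_frame))
    twiddle = [cmath.exp(-2j * math.pi * k / fft_length) for k in range(fft_length)]
    return [
        abs(sum(x * twiddle[(k * n) % fft_length] for n, x in enumerate(padded)))
        for k in range(fft_length // 2 + 1)
    ]


def hertz_to_mel(hz):
    # Assumed HTK-style mel scale; not specified in the description.
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_weights(num_bins, num_mel=NUM_MEL_BINS):
    """Triangular filters, indexed [spectrum_bin][mel_bin], evenly spaced in mel."""
    nyquist = SAMPLE_RATE / 2.0
    bin_mels = [hertz_to_mel(nyquist * i / (num_bins - 1)) for i in range(num_bins)]
    lo, hi = hertz_to_mel(MEL_MIN_HZ), hertz_to_mel(MEL_MAX_HZ)
    edges = [lo + (hi - lo) * j / (num_mel + 1) for j in range(num_mel + 2)]
    weights = []
    for m in bin_mels:
        row = []
        for j in range(num_mel):
            lower, center, upper = edges[j], edges[j + 1], edges[j + 2]
            rising = (m - lower) / (center - lower)
            falling = (upper - m) / (upper - center)
            row.append(max(0.0, min(rising, falling)))
        weights.append(row)
    return weights


def log_mel(spectrum, weights):
    """log(mel_energy + log_offset) for each mel bin."""
    return [
        math.log(sum(s * w[j] for s, w in zip(spectrum, weights)) + LOG_OFFSET)
        for j in range(len(weights[0]))
    ]


# Hypothetical input: one Hann-windowed 400-sample frame of a 1 kHz tone.
frame = [
    (0.5 - 0.5 * math.cos(2.0 * math.pi * n / 400))
    * math.sin(2.0 * math.pi * 1000.0 * n / SAMPLE_RATE)
    for n in range(400)
]
spectrum = magnitude_spectrum(frame)
features = log_mel(spectrum, mel_weights(len(spectrum)))

print(len(spectrum), len(features))  # 257 64
```

Repeating this for each of the 96 windowed frames would give a 96 by 64 array, consistent with the [1, 96, 64] output shape.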