MelSpectrogram class
keras.layers.MelSpectrogram(
    fft_length=2048,
    sequence_stride=512,
    sequence_length=None,
    window="hann",
    sampling_rate=16000,
    num_mel_bins=128,
    min_freq=20.0,
    max_freq=None,
    power_to_db=True,
    top_db=80.0,
    mag_exp=2.0,
    min_power=1e-10,
    ref_power=1.0,
    **kwargs
)
A preprocessing layer to convert raw audio signals to Mel spectrograms.
This layer takes a float32/float64 single or batched audio signal as
input and computes the Mel spectrogram using the Short-Time Fourier Transform
and Mel scaling. The input should be a 1D (unbatched) or 2D (batched) tensor
representing audio signals. The output will be a 2D or 3D tensor
representing Mel spectrograms.
A spectrogram is an image-like representation that shows the frequency spectrum of a signal over time. It uses the x-axis to represent time, the y-axis to represent frequency, and each pixel to represent intensity. Mel spectrograms are a special type of spectrogram that uses the mel scale, which approximates how humans perceive pitch. They are commonly used in speech and music processing tasks such as speech recognition, speaker identification, and music genre classification.
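To make the mel scale concrete, the sketch below converts between Hz and mels and spaces band edges evenly in mels. It assumes the widely used HTK-style formula `2595 * log10(1 + f / 700)`; this page does not state which formula the layer uses internally, and the helper names (`hz_to_mel`, `mel_to_hz`, `mel_band_edges`) are hypothetical, not part of the Keras API.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels using the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_mel_bins, min_freq, max_freq):
    """Return num_mel_bins + 2 edge frequencies in Hz, evenly spaced in mels."""
    low, high = hz_to_mel(min_freq), hz_to_mel(max_freq)
    step = (high - low) / (num_mel_bins + 1)
    return [mel_to_hz(low + i * step) for i in range(num_mel_bins + 2)]


# Edges for 4 illustrative bands between 20 Hz and 8000 Hz.
edges = mel_band_edges(num_mel_bins=4, min_freq=20.0, max_freq=8000.0)
# Distance in Hz between neighbouring edges.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Because the edges are equally spaced in mels, the entries of `widths` grow with frequency: low frequencies get narrow bands and high frequencies get wide ones, which mirrors the coarser pitch resolution of human hearing at high frequencies.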
Examples
Unbatched audio signal
>>> layer = keras.layers.MelSpectrogram(num_mel_bins=64,
...                                     sampling_rate=8000,
...                                     sequence_stride=256,
...                                     fft_length=2048)
>>> layer(keras.random.uniform(shape=(16000,))).shape
(64, 63)
Batched audio signal
>>> layer = keras.layers.MelSpectrogram(num_mel_bins=80,
...                                     sampling_rate=8000,
...                                     sequence_stride=128,
...                                     fft_length=2048)
>>> layer(keras.random.uniform(shape=(2, 16000))).shape
(2, 80, 125)
Input shape
1D (unbatched) or 2D (batched) tensor with shape: (..., samples).
Output shape
2D (unbatched) or 3D (batched) tensor with shape: (..., num_mel_bins, time).
Arguments
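fft_length: Integer, size of the FFT window.
sequence_stride: Integer, number of samples between successive STFT columns.
sequence_length: Integer, size of the window used for applying window to each audio frame. If None, defaults to fft_length.
window: String, name of the window function to use. Available values are "hann" and "hamming". If window is a tensor, it will be used directly as the window and its length must be sequence_length. If window is None, no windowing is used. Defaults to "hann".
sampling_rate: Integer, sample rate of the input signal.
num_mel_bins: Integer, number of mel bins to generate.
min_freq: Float, minimum frequency of the mel bins.
max_freq: Float, maximum frequency of the mel bins. If None, defaults to sampling_rate / 2.
power_to_db: If True, convert the power spectrogram to decibels.
top_db: Float, minimum negative cut-off, max(10 * log10(S)) - top_db.
mag_exp: Float, exponent for the magnitude spectrogram: 1 for magnitude, 2 for power, and so on. Defaults to 2.
ref_power: Float, the power is scaled relative to it: 10 * log10(S / ref_power).
min_power: Float, minimum value for power and ref_power.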
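The decibel-related arguments (power_to_db, top_db, min_power, ref_power) can be read as the conversion sketched below. This is a minimal pure-Python illustration built only from the two formulas quoted in the argument list; the helper name `power_to_db_sketch` is hypothetical, and the role of min_power as a lower clamp is an assumption, so the layer's actual implementation may differ in detail.

```python
import math


def power_to_db_sketch(power, ref_power=1.0, min_power=1e-10, top_db=80.0):
    """Convert a non-empty list of power values to decibels."""
    # Clamp the reference and each power value from below so that log10
    # never receives zero (assumed role of min_power).
    ref = max(abs(ref_power), min_power)
    db = [10.0 * math.log10(max(p, min_power) / ref) for p in power]
    # Apply the cut-off max(10 * log10(S)) - top_db.
    floor = max(db) - top_db
    return [max(v, floor) for v in db]


# Full-scale power, one tenth of it, and a value below min_power.
values = power_to_db_sketch([1.0, 0.1, 1e-20])
```

With the default settings, `values` comes out at roughly 0 dB, -10 dB, and -80 dB: the last entry would be -100 dB after the min_power clamp alone, but the top_db cut-off raises it to 80 dB below the maximum.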