My architecture contains a CNN as the feature extractor and dense layers that map into my final classes to perform the classification of audio clips. The CNN takes as input a ST