Automatic speaker profiling from short duration speech data

Kalluri S.B.; Vijayasenan D.; Ganapathy S.

Please use this identifier to cite or link to this item: https://idr.l2.nitk.ac.in/jspui/handle/123456789/15989

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kalluri S.B.
dc.contributor.author	Vijayasenan D.
dc.contributor.author	Ganapathy S.
dc.date.accessioned	2021-05-05T10:29:41Z	-
dc.date.available	2021-05-05T10:29:41Z	-
dc.date.issued	2020
dc.identifier.citation	Speech Communication, Vol. 121, p. 16-28	en_US
dc.identifier.uri	https://doi.org/10.1016/j.specom.2020.03.008
dc.identifier.uri	http://idr.nitk.ac.in/jspui/handle/123456789/15989	-
dc.description.abstract	Many paralinguistic applications of speech demand the extraction of information about speaker characteristics from as little speech data as possible. In this work, we explore the estimation of multiple physical parameters of the speaker from short-duration speech in a multilingual setting. We explore different feature streams for age and body build estimation, derived from the speech spectrum at different resolutions: the short-term log-mel spectrogram, formant features, and harmonic features of the speech. The statistics of these features over the speech recording are used to learn a support vector regression model for speaker age and body build estimation. Experiments on the TIMIT dataset show that each individual feature stream achieves results that outperform previously published results in height and age estimation. Furthermore, the estimation errors from the different feature streams are complementary, so combining the estimates from these streams further improves the results. For height estimation from short audio snippets, the combined system achieves a Mean Absolute Error (MAE) of 5.2 cm for male speakers and 4.8 cm for female speakers. Similarly, for age estimation the MAE is 5.2 years for male speakers and 5.6 years for female speakers. We also extend the same physical parameter estimation to other body build parameters, namely shoulder width, waist size, and weight, along with height, on a dataset we collected for speaker profiling. The duration analysis of the proposed scheme shows that state-of-the-art results can be achieved using only around 1–2 s of speech data. To the best of our knowledge, this is the first attempt to use a common set of features for estimating the different physical traits of a speaker. © 2020 Elsevier B.V.	en_US
dc.title	Automatic speaker profiling from short duration speech data	en_US
dc.type	Article	en_US
Appears in Collections:	1. Journal Articles
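
Illustrative sketch (not from the article): the abstract describes three steps that can be shown in miniature: summarising frame-level features into per-recording statistics, combining estimates from several feature streams, and scoring with Mean Absolute Error. The Python below is a minimal sketch under stated assumptions. The function names, the choice of mean and standard deviation as the statistics, and the unweighted averaging used for combination are all hypothetical; the record does not specify them, and the support vector regression step itself is omitted. The numbers in the usage part are toy values, not results from the paper.

```python
from statistics import mean, pstdev


def utterance_stats(frames):
    """Collapse frame-level feature vectors into one fixed-length vector.

    Assumption: per-dimension mean and (population) standard deviation
    stand in for "the statistics of these features over the recording".
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    return [mean(d) for d in dims] + [pstdev(d) for d in dims]


def fuse_estimates(stream_estimates):
    """Combine per-speaker estimates from several feature streams.

    Assumption: a plain unweighted average; the article's actual
    combination rule is not given in this record.
    """
    return [mean(per_speaker) for per_speaker in zip(*stream_estimates)]


def mean_absolute_error(targets, predictions):
    """MAE: the average of |target - prediction| over all speakers."""
    return mean(abs(t - p) for t, p in zip(targets, predictions))


# Toy usage with made-up numbers (heights in cm for two speakers).
true_heights = [175.0, 160.0]
stream_a = [170.0, 160.0]  # hypothetical estimates from one feature stream
stream_b = [174.0, 164.0]  # hypothetical estimates from another stream
fused = fuse_estimates([stream_a, stream_b])
print(fused, mean_absolute_error(true_heights, fused))
```

In a full pipeline, the vector returned by utterance_stats would be the input to a regression model trained separately for each target (height, age, and so on), and the fused prediction would be scored against the reference values with mean_absolute_error.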

Files in This Item:

There are no files associated with this item.
