My interest in offline TTS is actually entirely unrelated to the automation space:
I'm interested in Text to Speech for creative pursuits, such as video game voice dialogue and animated videos.
This is one of the reasons why the range & quantity of available voices is particularly important to me.
After all, you can't really have a scene set in a board room with nine characters[3] if you've only got three voices to go around. :)
I've actually been spending time this week on updating my "Dialogue Tool"[1] application (originally created to work with Larynx to help with narrative dialogue workflows such as voice "auditioning", intelligent caching & multiple voice recordings) to work with Piper.
Which is where I ran into the question of how to navigate/curate a collection of more than 900 voices.
The main approaches I'm using so far are:
(1) Random luck--just audition a bunch of different voices with your sample dialogue & see what you like.
(2) Curation/sorting based on quality-related meta-data from the original dataset.
(3) Generating a different dialogue line for each voice, one that includes the voice's speaker number for identification purposes and (hopefully) isn't tedious to listen to across 900+ voices. :)
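To make approach (3) a bit more concrete, here's a minimal sketch of the idea (not the actual Dialogue Tool code): rotate through a handful of line templates so consecutive voices don't all say the same thing, and embed the speaker number in each line. The template text and the `identification_lines` name are made up for illustration.

```python
import itertools

# Hypothetical templates; "{n}" is replaced with the speaker number.
TEMPLATES = [
    "Speaker {n} here, and I have a bad feeling about this meeting.",
    "You can call me number {n}. Everyone else does.",
    "This is voice {n}, reporting from the board room.",
]


def identification_lines(speaker_ids, templates=TEMPLATES):
    """Yield (speaker_id, line) pairs, cycling through the templates."""
    for speaker_id, template in zip(speaker_ids, itertools.cycle(templates)):
        yield speaker_id, template.format(n=speaker_id)


lines = list(identification_lines(range(5)))
for speaker_id, line in lines:
    print(speaker_id, line)
```

Each generated line would then be passed to the TTS engine with the matching speaker number; that step is left out here since it depends on how the voices are invoked.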
I haven't quite finished/uploaded results from (3) yet but example output based on approaches (3) & (2) can be heard here: https://rancidbacon.gitlab.io/piper-tts-demos/
The recording has two sets of 10 voices which had the lowest Word Error Rate scores in the original dataset--which doesn't mean the resulting voice model is necessarily good but is at least a starting point for exploring.
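As a rough sketch of that kind of selection (assuming the per-speaker metadata has already been loaded into a list of dicts; the `id` and `wer` field names and the values below are invented, not taken from the real dataset):

```python
def lowest_wer(speakers, count=10):
    """Return the `count` speakers with the lowest Word Error Rate."""
    return sorted(speakers, key=lambda s: s["wer"])[:count]


# Made-up metadata for illustration only (not real dataset values).
speakers = [
    {"id": 12, "wer": 0.08},
    {"id": 47, "wer": 0.03},
    {"id": 301, "wer": 0.15},
    {"id": 5, "wer": 0.05},
]

best = lowest_wer(speakers, count=2)
print([s["id"] for s in best])  # prints [47, 5]
```

To get two sets you'd filter the list on whatever attribute separates them before calling `lowest_wer` on each subset.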
I'd also like to explore more analysis-based approaches for grouping/curation (e.g. vocal characteristics such as "softer", "lower", "older") but as I'm not getting paid for this[2], that's likely a longer-term thing.
A different approach which I've previously found really interesting is to use voices as a prompt for writing narrative dialogue. It really helps to hear the dialogue as you write it and the nuances of different voices can help spur ideas for where a conversation goes next...
[1] See: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to... & https://gitlab.com/RancidBacon/larynx-dialogue/-/tree/featur...
[2] Am currently available/open to be though. :D
[3] Will try to upload some example audio of this scene because I found it pretty funny. :)