In several conversations I hear echoes of my own brain's responses to the AI mania: both fatigue and hype wariness, but also some curiosity and an itch to explore.
My own reckoning, I reckon, is that in comparison to the disruptive force of the early web, we were not clubbed over the head with the web in the early 1990s. There was almost an invitation to jump into the web, not a blaring of YOU WILL BE A REDUNDANT LUDDITE IF YOU FAIL TO LINE UP AND MARCH.
So here we are.
It’s not even realistic to talk about “it” as a singular thing, nor to just proclaim how bad / good / innovative / destructive “it” is as a whole. Don’t we have sufficient thought pieces and newsletter gunk? I guess not.
That’s why I appreciate so much the thoughtful writing (as always) of Jon Udell in his series of posts on LLMs; he is not talking generally from some podium, but sharing deeply what he is doing with the thing in actual real work. Line that up too with the Middlebury College Digital Detox on Demystifying AI, which offers hands-on, doing-type activities.
My curiosity was at least raised from its coma by a mention somewhere of the MyShell OpenVoice demo set up to play with at Hugging Face:
… a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. OpenVoice also achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set.

https://huggingface.co/spaces/myshell-ai/OpenVoice
Well, maybe that’s not so clear. There’s a bit more at the MyShell project page for OpenVoice. If I can do my un-artificial intelligence summary: OpenVoice uses a very short audio source clip (like 30 seconds) to turn any entered text into audio, but also with an ability to have it speak in different pseudo-emotion styles, AND supposedly it can also generate it in other languages. All of this apparently done in a smaller technical footprint.
I actually read none of that; I just charged in. I tried the opening sentence from a recent blog post.
I plopped that in the box on the left, expecting to get a “friendly” emoted version….
…. but apparently the “intelligence” thought my writings were not in English. Well, that is some reassuring feedback on my style! I went back to the blog pile once more for a different test phrase to generate some variants, using another opening line from What’s the Diff.
There you go, a deliberately vague blog post title that gives no real indication what this is about.
I start with anger!
Here are some variants:
This seems intriguing, as I recall Michael Branson-Smith describing his amazing work turning Dr. Oblivion into an AI-spewing audio bot, and how he had little control over the phrasing; somehow the AI he used interjected those pauses that make it seem more real.
Why not ask the brilliant Dr. O?
Your voice seems to sound the same emotion. Can your generative AI make you speak, say, more angry, or more sad, or even whisper?
To which he more or less asks, “Can you pick up what I’m generating down?”
Just because it’s easy and worthwhile to do with audio, I dropped this dulcet wisdom into MacWhisper to get a transcript. Yes, in his smug superior tone, he swats me down:
How fascinating that you’re curious about the emotional range of my generative AI.
While it does possess the capability to emulate different emotions in speech, that’s not the primary function I choose to focus on.
My goal with artificial intelligence is to explore the ways it can enhance our understanding of media, technology, and their impact on society.
So keep your emotions in check and let’s dive deep into the depths of artificial intelligence.

Dr. Oblivion responds to my question (I would not call it an answer)
I don’t know what I’ve really accomplished here, but I’ve played, and it was fun. When you play with OpenVoice, it’s impressive what it can do from really only training on that little bit of source audio. But then again, while I have played and poked, I have little understanding of what it is doing.
And this is where my botheredness rises up, and says…. “end this damned post”!