FAQs

What’s a diVA?

Your diVA is your digital Voice Avatar. It matches your chosen voice in the digital world!

What do these model names mean? diaLiMA-2? diaLiMA-1? diaVoCo?

We’ve made lots of models over the years! These are our flagship models for text-to-speech and speech-to-speech. They each have advantages that make them suited for different tasks.

diaLiMA stands for dialogger Linguistic Model Architecture. We use advanced AI frameworks to create the best possible AI voices, and we are always upgrading our systems to get better, faster, cheaper results!

diaLiMA-1 was our first major foray into the TTS world. It is fast, super stable, can handle long batches of audio, has great vocal adherence (staying in character, basically) and will never hallucinate!

diaVoCo is our Voice Converter. You can say a line exactly how you want it to sound and our AI will convert that recording into the chosen voice! The advantage is that you get much more precise control over specific timing and delivery, but you will have to do some voice work on your side to get the outputs you want.

diaLiMA-2 is our latest flagship model. It is highly contextually aware, super emotional and has amazing, life-like prosody. It’s capable of working in a streaming context or in batches. It is rock solid, and configurable to get some wild results if you increase the temperature!

What do the additional sliders do?

With diaLiMA-2, temperature and top_p are standard LLM-style sampling controls. Higher temperature means more randomness. Lower top_p means more stringent thresholding (the model picks from a smaller set of its most likely options) and therefore more coherent results.
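To make these two controls more concrete, here is a minimal Python sketch of how temperature and top_p are commonly applied in LLM-style sampling. It is an illustration only, not diaLiMA-2's actual implementation; the function names `apply_temperature` and `top_p_filter` and the example scores are assumptions made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature.

    Hypothetical helper: higher temperature flattens the distribution
    (more randomness), lower temperature sharpens it.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely options until their total probability reaches top_p.

    Hypothetical helper: returns {option_index: renormalized_probability}.
    A lower top_p keeps fewer options.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Made-up scores for three candidate options.
example_logits = [2.0, 1.0, 0.0]
example_probs = apply_temperature(example_logits, temperature=1.0)
narrow = top_p_filter(example_probs, top_p=0.5)  # keeps only the top option
wide = top_p_filter(example_probs, top_p=0.9)    # keeps the top two options
```

In this sketch, raising the temperature spreads probability across more options, while lowering top_p cuts the unlikely options off entirely before one is picked.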

With diaLiMA-1, voice and emotion are adherence sliders that tell the model to rely on the internal prompt more than on the base model characteristics. Generally, you’ll get more expressiveness at the furthest-right settings.

With diaVoCo, the sliders help refine the model’s performance as well. “Protect Consonants” shifts the audio output more towards voiceless and sibilant sounds; with the slider to the left, the model focuses more on voiced and vowel sounds. “Model adherence” makes the model focus more on the trained model’s pitch and prosody range; with the slider to the left, it relies more on the given performance. “Filter radius” sets how much smoothing is applied to the finished audio, and “pitch change” lets you manually shift the generated audio’s pitch center.
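As a rough illustration of what a pitch shift means numerically, the Python sketch below assumes the shift is expressed in semitones, which is a common convention but not something confirmed for the diaVoCo slider. The function names are hypothetical.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones.

    Assumes equal temperament: 12 semitones is one octave (a factor of 2).
    """
    return 2.0 ** (semitones / 12.0)


def shift_pitch_hz(frequency_hz, semitones):
    """Return the pitch center after shifting by the given number of semitones."""
    return frequency_hz * semitones_to_ratio(semitones)


# Example: a 220 Hz pitch center shifted up one octave (12 semitones).
shifted = shift_pitch_hz(220.0, 12)
```

Under that assumption, positive values raise the pitch center and negative values lower it.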

How can I get access to more tools?

More tools are available in higher subscription plans, including the ability to generate audio at up to 32-bit, 96 kHz, upsample, use custom fine-tuned voices, and pitch shift.

What does upsampling do?

Upsampling increases the number of samples in a given audio recording, which can improve the audio’s quality and flexibility in later processing. Some projects require higher sample rates, such as film and television productions, which typically use 24-bit, 48 kHz audio. Higher sample rates offer more samples to manipulate, which is especially helpful when pitch shifting or time stretching.
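The idea of "more samples" can be sketched with a few lines of Python. This is a deliberately simple linear-interpolation example and not a description of our upsampler; production resamplers generally use filtered, band-limited interpolation instead. The function name `upsample_linear` is an assumption made up for this example.

```python
def upsample_linear(samples, factor):
    """Upsample by an integer factor using linear interpolation.

    Hypothetical helper: inserts (factor - 1) evenly spaced values between
    each pair of neighboring samples. The original samples are preserved.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if len(samples) < 2:
        return list(samples)
    out = []
    for current, following in zip(samples, samples[1:]):
        for step in range(factor):
            out.append(current + (following - current) * step / factor)
    out.append(samples[-1])
    return out


# Doubling the sample count of a tiny three-sample signal.
doubled = upsample_linear([0.0, 1.0, 0.0], 2)
# doubled == [0.0, 0.5, 1.0, 0.5, 0.0]
```

The new in-between samples give later processing steps, such as pitch shifting or time stretching, a finer grid to work on.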

How can I clone my voice?

At higher subscription tiers, we can create a custom clone of your voice using a minimum of 10 minutes of user-submitted audio, or you can schedule a time to record with our expert audio engineers if you reside in or around Los Angeles, CA.

Can I upgrade my plan?

Yes, you can! Higher plans give you more features and more usage!