Spock AI

Generating an audio and a video talking head using AI models

Posted on: 2025-06-02

With AI deepfakes and other realistic videos going around, I was curious how easy and how costly it is to create a simple talking-head-style AI video using a real person's voice. The answer: it's very easy, and basically free.

To get started, you'll need 2 things:

  1. A still image of the person.
  2. A short audio clip of that person's voice.

The audio clip should be clear but doesn't need to be long. 15 seconds is fine. For this experiment, I used the image shown above of Spock from Star Trek, along with this short clip of him, both from an original episode of Star Trek:

The prompt I wanted him to say is the following:

AI is the next natural evolution of the human race. From ChatGPT to the advanced computer we have on the Enterprise, AI allows humanity to expand itself and prosper. It is only logical.


Making the audio

The first step is to create the audio. For that, you can use a tool called Chatterbox TTS, which is open source and can either be self-hosted or used directly on its Hugging Face page.

On that page, you need to:

  1. First, type the prompt that you want your character to say.
  2. Then upload the audio clip of the person.
  3. Finally, press the Generate button.

If you don't like the first result, you can use the Exaggeration slider to adjust how much emotion is in the voice, and the CFG slider to adjust the speed. Based on my experiments, with a good sample and the right settings, the voice sounds very authentic.
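For the self-hosted route, the same three steps can be sketched in Python. This is only a rough sketch under assumptions: it assumes the chatterbox-tts package and torchaudio are installed, and that ChatterboxTTS.from_pretrained and generate accept the arguments shown (audio_prompt_path, exaggeration, cfg_weight), which is my reading of the project's README and should be checked against the installed version. The helper names, the file names and the 0.5 defaults are illustrative, not part of the tool.

```python
from pathlib import Path


def build_generation_args(text, reference_clip, exaggeration=0.5, cfg_weight=0.5):
    """Validate the inputs and collect them as arguments for the model.

    exaggeration and cfg_weight stand in for the two sliders on the demo page.
    """
    text = text.strip()
    if not text:
        raise ValueError("the prompt text must not be empty")
    clip = Path(reference_clip)
    if not clip.is_file():
        raise FileNotFoundError(f"reference clip not found: {clip}")
    return {
        "text": text,
        "audio_prompt_path": str(clip),
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
    }


def clone_voice(text, reference_clip, output_path="spock.wav",
                exaggeration=0.5, cfg_weight=0.5, device="cpu"):
    """Generate speech in the reference voice and save it as a WAV file."""
    # Imported lazily so the validation helper works without the heavy
    # dependencies. Both imports are assumptions about the installed packages.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    args = build_generation_args(text, reference_clip, exaggeration, cfg_weight)
    model = ChatterboxTTS.from_pretrained(device=device)
    # The text goes in positionally, the remaining settings as keywords.
    wav = model.generate(args.pop("text"), **args)
    torchaudio.save(output_path, wav, model.sr)
    return output_path


# Hypothetical usage, with a made-up file name for the reference clip:
# clone_voice("It is only logical.", "spock_sample.wav", exaggeration=0.6)
```

Keeping the validation separate from the model call means a typo in the file name fails immediately, before any model weights are downloaded or loaded.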


Making the video

Once you have downloaded the audio as an MP3 file, you can use a second tool called SUTRA Avatar, which is also open source and available on its Hugging Face page.

Again, the instructions are very simple:

  1. First, upload your still image.
  2. Then upload the audio file you created with the previous model.
  3. Finally, press the Generate button.

You can try either the talk-head or the talk-neutral option, depending on the image you use. For a static, very simple portrait photo, it works decently well. However, it can quickly start looking unnatural. There are better lip-sync video models out there, such as Hedra, but those tend to cost money and to be more restrictive about which prompts and images they allow.


The final result