All source listed below is under MIT license if no LICENSE file stating different is available.

Research Regarding STT/TTS

This repository is a mess! It's my personal notepad — a pure collection of snippets and experiments that cost me blood, sweat, and many tears.

Special thanks to: Google. You know what you did.
To OpenAI: You're amazing! Quality stuff. Sadly, I'm not rich enough to run a 24/7 service with your pricing regarding STT/TTS, so I use only gpt4o-mini.

The end result of this repository is a working STT/TTS system that allows you to talk with ChatGPT.

To save money, I use TTS/STT from Google Cloud (paid). It's surprisingly cheap!

Do not take the way I communicate with the LLM too seriously — that wasnt the main focus. The implementation in this project has no context, memory, or system messages. Every call is treated as a new session.

If you're interested in this technology but get stuck due to lack of documentation, feel free to email me at retoor@molodetz.nl.


How to Play Immediately (Without Configuration)

You can get started in just 5 minutes:

  1. Create a virtual environment.
  2. Install the requirements file: pip install -r requirements.txt.
  3. Execute tts.py.

With these steps, you'll have a working gpt4o-mini model listening to you and responding in text.


Application Output (tts.py)

The output is speech, but heres how a typical conversation looks:

Adjusting for ambient noise, please wait... 
Listening... 
Recognized Text: what is the name of the dog of ga 
Response from gpt4o_mini: Please provide more context or details about what "GA" refers to, so I can assist you accurately. 
Recognized Text: Garfield the gas has a dog friends what is his name 
Response from gpt4o_mini: Garfield's dog friend is named Odie. 
Recognized Text: is FTP still used 
Response from gpt4o_mini: Yes, FTP (File Transfer Protocol) is still used for transferring files over a network, although more secure alternatives like SFTP (Secure File Transfer Protocol) and FTPS (FTP Secure) are often preferred due to security concerns. 
Recognized Text: why is Linux better than 
Response from gpt4o_mini: Please complete your question for a more specific comparison about why Linux might be considered better than another operating system or software.

Repository Structure

The repository contains:

  • play.py: For playing audio with Python.
  • gcloud.py: A wrapper around the Google Cloud SDK (this was the most time-consuming to build).
  • tts.py: Execute this script to talk with GPT.

Requirements and Preparation

  • A paid Google Cloud account

    • Google Cloud CLI
    • You get $300 and 90 days for free, but you'll need to attach a credit card. I used it extensively and didn't spend a cent!
    • The free credit barely depletes even with heavy usage.
  • Google Cloud SDK + CLI installed
    Important: These standalone applications affect the behavior of Python's Google library regarding authentication.

  • Python 3 and the following:

    • python3-venv
    • python3-pip

I initially installed a lot using apt-get, but I cant recall if it was all necessary in the end.


Installation Steps

  1. Activate the virtual environment:
    python3 -m venv venv && source venv/bin/activate
    
  2. Install the requirements:
    pip install -r requirements.txt
    

Testing the setup

  1. Check Google Authentication & TTS
    python gcloud.py
    
  • If successful, it will speak a sentence.
  • If not, you'll likely encounter some authentication issues — brace yourself for Google-related configuration struggles.
  1. Check Speech Recognition (No API Needed)
    python tts.py
    
  • This sends your text to the gpt4o-mini model and prints the response.
  • Requires no configuration and works out of the box.

Conclusion

Play stupid games, win stupid prizes. Figuring this out was a nightmare. If OpenAI's services were financially viable, I would have chosen them — better quality and much easier to implement.

Now, I have a fully operational project that communicates perfectly and even follows conversations. For example, I can:

  • Assign numbers.
  • Perform calculations (e.g., divide "the first number by the second").
  • Use the microphone full-time to ask or say anything I want. I have a wireless JBL GO speaker that's directly ready for the job when I turn it on.

I hope some people appreciate the snippets!

.gitignore
gcloud.py
play.py
README.md
requirements.txt
tts.py
ttsstt.html