This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Research Regarding STT/TTS
This repository is a mess! It's my personal notepad — a pure collection of snippets and experiments that cost me blood, sweat, and many tears.
**Special thanks to:** Google. *You know what you did.*
**To OpenAI:** You're amazing! Quality stuff. Sadly, I'm not rich enough to run a 24/7 service with your pricing regarding STT/TTS, so I use only `gpt4o-mini`.
The end result of this repository is a working **STT/TTS system** that allows you to talk with ChatGPT.
To save money, I use TTS/STT from Google Cloud (paid). It's surprisingly cheap!
Do not take the way I communicate with the LLM too seriously — that wasn’t the main focus. The implementation in this project has no context, memory, or system messages. Every call is treated as a new session.
If you're interested in this technology but get stuck due to lack of documentation, feel free to email me at **retoor@molodetz.nl**.
---
## How to Play Immediately (Without Configuration)
You can get started in just 5 minutes:
1. Create a virtual environment.
2. Install the requirements file: `pip install -r requirements.txt`.
3. Execute `tts.py`.
With these steps, you'll have a working `gpt4o-mini` model listening to you and responding in text.
---
## Application Output (`tts.py`)
The output is speech, but here’s how a typical conversation looks:
```
Adjusting for ambient noise, please wait...
Listening...
Recognized Text: what is the name of the dog of ga
Response from gpt4o_mini: Please provide more context or details about what "GA" refers to, so I can assist you accurately.
Recognized Text: Garfield the gas has a dog friends what is his name
Response from gpt4o_mini: Garfield's dog friend is named Odie.
Recognized Text: is FTP still used
Response from gpt4o_mini: Yes, FTP (File Transfer Protocol) is still used for transferring files over a network, although more secure alternatives like SFTP (Secure File Transfer Protocol) and FTPS (FTP Secure) are often preferred due to security concerns.
Recognized Text: why is Linux better than
Response from gpt4o_mini: Please complete your question for a more specific comparison about why Linux might be considered better than another operating system or software.
```
---
## Repository Structure
The repository contains:
- **`play.py`**: For playing audio with Python.
- **`gcloud.py`**: A wrapper around the Google Cloud SDK (this was the most time-consuming to build).
- **`tts.py`**: Execute this script to talk with GPT.
---
## Requirements and Preparation
- **A paid Google Cloud account**
- Google Cloud CLI
- You get $300 and 90 days for free, but you'll need to attach a credit card. I used it extensively and didn't spend a cent!
- The free credit barely depletes even with heavy usage.
- **Google Cloud SDK + CLI** installed
*Important:* These standalone applications affect the behavior of Python's Google library regarding authentication.
- **Python 3** and the following:
-`python3-venv`
-`python3-pip`
> I initially installed a lot using `apt-get`, but I can’t recall if it was all necessary in the end.
---
## Installation Steps
1. Activate the virtual environment:
```bash
python3 -m venv venv && source venv/bin/activate
```
2. Install the requirements:
```bash
pip install -r requirements.txt
```
## Testing the setup
1. Check Google Authentication & TTS
```bash
python gcloud.py
```
- If successful, it will speak a sentence.
- If not, you'll likely encounter some authentication issues — brace yourself for Google-related configuration struggles.
2. Check Speech Recognition (No API Needed)
```bash
python tts.py
```
- This sends your text to the gpt4o-mini model and prints the response.
- Requires no configuration and works out of the box.
## Conclusion
Play stupid games, win stupid prizes. Figuring this out was a nightmare. If OpenAI's services were financially viable, I would have chosen them — better quality and much easier to implement.
Now, I have a fully operational project that communicates perfectly and even follows conversations. For example, I can:
- Assign numbers.
- Perform calculations (e.g., divide "the first number by the second").
- Use the microphone full-time to ask or say anything I want. I have a wireless JBL GO speaker that's directly ready for the job when I turn it on.