Initial commit, hopefully the last.

2025-01-18 08:58:13 +01:00 · 2025-01-18 08:58:13 +01:00 · b8a517cc14
commit b8a517cc14
7 changed files with 448 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,11 @@
+__pycache__
+build
+dist
+*.egg-info
+*.egg
+*.pyc
+*.pyo
+venv
+.venv
+output.wav
+.backup*
--- a/README.md
+++ b/README.md
@@ -0,0 +1,107 @@
+# Research Regarding STT/TTS
+
+This repository is a mess! It's my personal notepad — a pure collection of snippets and experiments that cost me blood, sweat, and many tears. 
+
+**Special thanks to:** Google. *You know what you did.*  
+**To OpenAI:** You're amazing! Quality stuff. Sadly, I'm not rich enough to run a 24/7 service with your STT/TTS pricing, so I only use `gpt-4o-mini`.
+
+The end result of this repository is a working **STT/TTS system** that allows you to talk with ChatGPT. 
+
+To save money, I use TTS/STT from Google Cloud (paid). It's surprisingly cheap!
+
+Do not take the way I communicate with the LLM too seriously — that wasn’t the main focus. The implementation in this project has no context, memory, or system messages. Every call is treated as a new session.
+
+If you're interested in this technology but get stuck due to lack of documentation, feel free to email me at **retoor@molodetz.nl**.
+
+---
+
+## How to Play Immediately (Without Configuration)
+You can get started in just 5 minutes:
+1. Create a virtual environment.
+2. Install the requirements file: `pip install -r requirements.txt`.
+3. Execute `tts.py`.
+
+With these steps, you'll have a working `gpt-4o-mini` model listening to you and responding in text.
+
+---
+
+## Application Output (`tts.py`)
+
+The output is speech, but here’s how a typical conversation looks:
+
+```
+Adjusting for ambient noise, please wait... 
+Listening... 
+Recognized Text: what is the name of the dog of ga 
+Response from gpt4o_mini: Please provide more context or details about what "GA" refers to, so I can assist you accurately. 
+Recognized Text: Garfield the gas has a dog friends what is his name 
+Response from gpt4o_mini: Garfield's dog friend is named Odie. 
+Recognized Text: is FTP still used 
+Response from gpt4o_mini: Yes, FTP (File Transfer Protocol) is still used for transferring files over a network, although more secure alternatives like SFTP (Secure File Transfer Protocol) and FTPS (FTP Secure) are often preferred due to security concerns. 
+Recognized Text: why is Linux better than 
+Response from gpt4o_mini: Please complete your question for a more specific comparison about why Linux might be considered better than another operating system or software.
+```
+
+---
+
+## Repository Structure
+
+The repository contains:
+- **`play.py`**: For playing audio with Python.
+- **`gcloud.py`**: A wrapper around the Google Cloud SDK (this was the most time-consuming to build).
+- **`tts.py`**: Execute this script to talk with GPT.
+
+---
+
+## Requirements and Preparation
+
+- **A paid Google Cloud account**
+  - Google Cloud CLI  
+  - You get $300 and 90 days for free, but you'll need to attach a credit card. I used it extensively and didn't spend a cent!  
+  - The free credit barely depletes even with heavy usage.
+
+- **Google Cloud SDK + CLI** installed  
+  *Important:* These standalone applications affect the behavior of Python's Google library regarding authentication.
+
+- **Python 3** and the following:
+  - `python3-venv`
+  - `python3-pip`
+
+> I initially installed a lot using `apt-get`, but I can’t recall if it was all necessary in the end. At minimum, `play.py` shells out to `ffmpeg`, so it must be on your `PATH`.
+
+---
+
+## Installation Steps
+
+1. Create and activate the virtual environment:
+   ```bash
+   python3 -m venv venv && source venv/bin/activate
+   ```
+2. Install the requirements:
+   ```bash
+   pip install -r requirements.txt
+   ```
+## Testing the Setup
+1. Check Google Authentication & TTS
+   ```bash
+   python gcloud.py
+   ```
+   - If successful, it will speak a sentence.
+   - If not, you'll likely encounter some authentication issues, so brace yourself for Google-related configuration struggles.
+
+2. Check Speech Recognition (No API Needed)
+   ```bash
+   python tts.py
+   ```
+   - This sends the recognized text to the gpt-4o-mini model and prints the response.
+   - Requires no configuration and works out of the box.
+
+## Conclusion
+Play stupid games, win stupid prizes. Figuring this out was a nightmare. If OpenAI's services were financially viable, I would have chosen them — better quality and much easier to implement.
+
+Now, I have a fully operational project that communicates perfectly and even follows conversations. For example, I can:
+ - Assign numbers.
+ - Perform calculations (e.g., divide "the first number by the second").
+ - Use the microphone full-time to ask or say anything I want. I have a wireless JBL GO speaker that's directly ready for the job when I turn it on.
+ 
+I hope some people appreciate the snippets!
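
Note on the README above: the "Requirements and Preparation" and "Testing the Setup" sections could be backed by a small preflight check. The following is only a sketch and not part of this commit; the module list is taken from the imports in `gcloud.py`, `play.py`, and `tts.py`, and the helper names `missing_modules` and `missing_tools` are hypothetical.

```python
import importlib.util
import shutil

# Import names used by the scripts in this commit (assumption: adjust if they change).
REQUIRED_MODULES = ["pyaudio", "speech_recognition", "aiohttp", "google.auth"]
REQUIRED_TOOLS = ["ffmpeg"]  # play.py pipes audio through ffmpeg


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # ImportError is raised when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def missing_tools(names):
    """Return the executables that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for item in missing_modules(REQUIRED_MODULES):
        print(f"Missing module: {item}")
    for item in missing_tools(REQUIRED_TOOLS):
        print(f"Missing tool: {item}")
```

Running it inside the virtual environment before `python gcloud.py` would separate plain installation problems from the Google authentication problems the README warns about.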
--- a/gcloud.py
+++ b/gcloud.py
@@ -0,0 +1,125 @@
+# Written by retoor@molodetz.nl
+
+# This script interfaces with Google's Text-to-Speech API to synthesize spoken audio from text. 
+# It also includes functionality to handle Google authentication tokens.
+
+# External imports:
+# - aiohttp: Asynchronous HTTP requests.
+# - google-auth packages: For managing Google authentication tokens.
+# - play: Local module for playing audio.
+
+# MIT License
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+
+
+import aiohttp
+import asyncio
+from urllib.parse import urlencode
+import base64
+import sys
+from functools import cache
+from google.oauth2 import id_token
+from google.auth.transport import requests
+import google.auth
+from play import play_audio
+import google.oauth2.credentials
+import uuid
+import pathlib
+
+
+@cache
+def google_token():
+    gcloud_default, project = google.auth.default()
+    from google.oauth2 import _client as google_auth_client
+    import google.auth.transport.urllib3 as google_auth_urllib3
+    import urllib3
+    http = urllib3.PoolManager()
+    request = google_auth_urllib3.Request(http)
+    token_uri = 'https://oauth2.googleapis.com/token'
+    refresh_token = gcloud_default.refresh_token
+    client_id = gcloud_default.client_id
+    client_secret = gcloud_default.client_secret
+
+    scopes = ['https://www.googleapis.com/auth/cloud-platform']
+
+    access_token, _, _, _ = google_auth_client.refresh_grant(
+        request, token_uri, refresh_token, client_id, client_secret, scopes)
+    return access_token
+
+
+async def tts(text):
+    url = "https://texttospeech.googleapis.com/v1/text:synthesize"
+    text = text.replace("*", "").replace("#", "").replace("`", "").strip()
+    if not text:
+        return
+
+    headers = {
+        "Authorization": f"Bearer {google_token()}",
+        "Content-Type": "application/json",
+        "X-Goog-User-Project": "lisa-448004",
+    }
+    data = {
+        "input": {
+            "text": text
+        },
+        "voice": {
+            "languageCode": "nl-NL",
+            "name": "nl-NL-Standard-D",
+            "ssmlGender": "FEMALE"
+        },
+        "audioConfig": {
+            "audioEncoding": "MP3",
+            "speakingRate": 1.0,
+            "pitch": 0.0
+        }
+    }
+    async with aiohttp.ClientSession() as session:
+        response = await session.post(url, headers=headers, json=data)
+        response_json = await response.json()
+        audio_content = response_json.get("audioContent")
+        if not audio_content:  # error responses (e.g. failed auth) carry no audio
+            raise RuntimeError(f"TTS request failed: {response_json}")
+        file = pathlib.Path(str(uuid.uuid4()) + ".mp3")
+        file.write_bytes(base64.b64decode(audio_content))
+        play_audio(file)
+        file.unlink()
+
+
+def oud(file_path):
+    # "oud" is Dutch for "old": a legacy speech-to-text helper that nothing in this module calls.
+    from google.cloud import speech  # provided by the google-cloud-speech package
+    client = speech.SpeechClient()
+    with open(file_path, "rb") as audio_file:
+        content = audio_file.read()
+    audio = speech.RecognitionAudio(content=content)
+    config = speech.RecognitionConfig(
+        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
+        sample_rate_hertz=16000,
+        language_code="en-US",
+    )
+    response = client.recognize(config=config, audio=audio)
+    for result in response.results:
+        print("Transcript:", result.alternatives[0].transcript)
+
+
+async def main():
+    print(google_token())
+    await tts("If you hear this sentence, the google part works fine. Congrats.")
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
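
Note on `gcloud.py` above: `google_token()` is wrapped in `@cache`, so the first access token is reused for the life of the process, while `refresh_grant` also returns an expiry that the code discards. For a long-running service, a time-aware cache may be safer. The sketch below is an assumption-laden illustration, not the author's method: `ExpiringToken`, its `fetch` callable, and the `(token, lifetime_seconds)` return shape are all hypothetical.

```python
import time


class ExpiringToken:
    """Cache a token and fetch a fresh one shortly before it expires."""

    def __init__(self, fetch, margin_seconds=60, clock=time.monotonic):
        # fetch: hypothetical callable returning (token, lifetime_seconds)
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refetch on first use, or once we are within the safety margin of expiry.
        if self._token is None or now >= self._expires_at - self._margin:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = now + lifetime
        return self._token
```

In this repository, `fetch` could wrap the body of `google_token()` and derive the lifetime from the expiry value that `refresh_grant` returns; the injectable `clock` exists only to make the behaviour easy to test.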
--- a/play.py
+++ b/play.py
@@ -0,0 +1,66 @@
+# Written by retoor@molodetz.nl
+
+# This module plays audio files: it decodes the input with an ffmpeg subprocess and streams the raw PCM output through PyAudio.
+
+# Libraries imported: 'pyaudio', 'functools', 'os', 'subprocess', 'sys'
+
+# The MIT License (MIT)
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+
+import pyaudio
+import functools
+import os
+import subprocess
+import sys
+
+@functools.cache
+def get_py_audio():
+    return pyaudio.PyAudio()
+
+def play_audio(filename):
+    ffmpeg_cmd = [
+        "ffmpeg",
+        "-i", filename,
+        "-f", "s16le",
+        "-ar", "44100",
+        "-ac", "2",
+        "pipe:1"
+    ]
+    process = subprocess.Popen(ffmpeg_cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, bufsize=10**6)  # an unread stderr PIPE can fill up and stall ffmpeg
+
+    py_audio = get_py_audio()
+    stream = py_audio.open(
+        format=py_audio.get_format_from_width(2),
+        channels=2,
+        rate=44100,
+        output=True
+    )
+    chunk_size = 4096
+    try:
+        while True:
+            data = process.stdout.read(chunk_size)
+            if not data:
+                break
+            stream.write(data)
+    finally:
+        stream.stop_stream()
+        stream.close()
+        process.stdout.close()
+        process.wait()
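
Note on `play.py` above: the ffmpeg arguments (`s16le`, `-ar 44100`, `-ac 2`) and the PyAudio stream settings must describe the same PCM format, and together they determine how much audio each 4096-byte chunk holds. The arithmetic can be sketched as follows; the helper names are hypothetical and exist only for illustration.

```python
def bytes_per_second(rate_hz, channels, sample_width_bytes):
    """Raw PCM data rate: one sample per channel per tick of the sample clock."""
    return rate_hz * channels * sample_width_bytes


def chunk_duration_ms(chunk_bytes, rate_hz=44100, channels=2, sample_width_bytes=2):
    """Playback time covered by one chunk, in milliseconds."""
    return 1000 * chunk_bytes / bytes_per_second(rate_hz, channels, sample_width_bytes)


if __name__ == "__main__":
    # Defaults mirror play.py: 16-bit samples (2 bytes), stereo, 44.1 kHz.
    print(bytes_per_second(44100, 2, 2))       # 176400
    print(round(chunk_duration_ms(4096), 2))   # 23.22
```

Each `stream.write` call in the loop therefore hands PyAudio roughly 23 ms of audio, or 1024 stereo frames.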
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,6 @@
+pyaudio
+SpeechRecognition
+google-cloud-speech
+google-cloud-texttospeech
+google-auth
+aiohttp
--- a/tts.py
+++ b/tts.py
@@ -0,0 +1,61 @@
+# Written by retoor@molodetz.nl
+
+# This script listens to audio input via a microphone, recognizes speech using the Google API, sends the recognized text to a server for processing, and uses Google Cloud to convert the server response to speech.
+
+# Imports:
+# - speech_recognition: For speech recognition functionality.
+# - xmlrpc.client: To communicate with a remote server using the XML-RPC protocol.
+# - gcloud: Local module (gcloud.py) that calls the Google Cloud Text-to-Speech API and plays the result.
+
+# MIT License
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+# 
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+# 
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+
+import speech_recognition as sr
+from xmlrpc.client import ServerProxy
+import gcloud
+
+molodetz = ServerProxy("https://api.molodetz.nl/rpc")
+
+async def main():
+    recognizer = sr.Recognizer()
+
+    with sr.Microphone() as source:
+        print("Adjusting for ambient noise, please wait...")
+        recognizer.adjust_for_ambient_noise(source, duration=1)
+        print("Listening...")
+
+        while True:
+            try:
+                audio_data = recognizer.listen(source, timeout=10)
+                text = recognizer.recognize_google(audio_data, language="en-US")
+                print(f"Recognized Text: {text}")
+                response_llm = molodetz.gpt4o_mini(text)
+                print(f"Response from gpt4o_mini: {response_llm}")
+                await gcloud.tts(response_llm)
+            except sr.WaitTimeoutError:
+                continue
+            except sr.UnknownValueError:
+                continue
+            except sr.RequestError:
+                continue
+
+if __name__ == "__main__":
+    import asyncio
+    asyncio.run(main())
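
Note on `tts.py` above: `main()` is a coroutine, but `recognizer.listen`, `recognizer.recognize_google`, and the XML-RPC call are blocking and run directly on the event loop. With a single task that is harmless; it would only matter once other tasks share the loop. One possible approach, sketched here with a hypothetical stand-in `blocking_call` rather than the real recognizer, is `asyncio.to_thread` (Python 3.9+).

```python
import asyncio
import time


def blocking_call(text):
    """Stand-in for a blocking step such as recognizer.listen() or the XML-RPC call."""
    time.sleep(0.05)
    return text.upper()


async def main():
    # A second task that needs the event loop while the blocking call is running.
    ticker = asyncio.create_task(asyncio.sleep(0.01, result="tick"))
    # to_thread runs the function in a worker thread, so the loop stays responsive.
    reply = await asyncio.to_thread(blocking_call, "hello")
    return reply, await ticker


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Applied to the loop above, the same pattern would wrap each blocking call, for example `await asyncio.to_thread(recognizer.listen, source, timeout=10)`.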
--- a/ttsstt.html
+++ b/ttsstt.html
@@ -0,0 +1,73 @@
+<html>
+    <head>
+    </head>
+    <body>
+        <div id="messages"></div>
+        <script>
+
+function speak(text) {
+    // Create a new SpeechSynthesisUtterance instance
+    const utterance = new SpeechSynthesisUtterance(text);
+
+    // Set voice properties (optional)
+    utterance.lang = 'en-US'; // Set language (adjust as needed)
+    utterance.pitch = 1;      // Adjust pitch (0 to 2)
+    utterance.rate = 1;       // Adjust rate/speed (0.1 to 10)
+    utterance.volume = 1;     // Adjust volume (0 to 1)
+
+    // Speak the text
+    window.speechSynthesis.speak(utterance);
+}
+let recognition;
+
+function startSpeechRecognition() {
+    // Check if the Web Speech API is supported
+    if (!('SpeechRecognition' in window || 'webkitSpeechRecognition' in window)) {
+        console.error('Web Speech API is not supported in this browser.');
+        return;
+    }
+
+    // Initialize SpeechRecognition
+    recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
+
+    // Configure SpeechRecognition
+    recognition.lang = 'en-US'; // Set the desired language
+    recognition.continuous = true; // Allow continuous recognition
+    recognition.interimResults = true; // Capture interim results
+
+    // Event listener for speech recognition results
+    recognition.onresult = (event) => {
+        let transcript = '';
+        for (let i = event.resultIndex; i < event.results.length; i++) {
+            transcript += event.results[i][0].transcript;
+        
+
+        }
+
+        console.log('Recognized Speech:', transcript);
+    };
+
+    // Handle errors
+    recognition.onerror = (event) => {
+        console.error('Speech Recognition Error:', event.error);
+    };
+
+    // Automatically restart recognition if it stops
+    recognition.onend = () => {
+        console.log('Speech recognition stopped. Restarting...');
+        recognition.start();
+    };
+
+    // Start speech recognition
+    recognition.start();
+    console.log('Speech recognition started.');
+}
+
+// Start the speech recognition loop
+startSpeechRecognition();
+
+
+
+        </script>
+    </body>
+</html>