2024-08-13

This Simple AI-powered Python Script will Completely Change How You Work

Vintage alarm clock standing on a newspaper

Photo by Ales Krivec on Unsplash

As programmers, we often find ourselves in situations where our wrists need a break. Whether it is a long coding session or the wish for a more ergonomic workflow, being able to dictate text instead of typing it can make a real difference. In this tutorial, we will build a voice-to-text transcription tool in Python on top of the speed and accuracy of Groq’s Whisper API.

The aim is a script that runs in the background and lets us trigger voice input in any application by holding down a key. When the key is released, the script transcribes whatever was spoken and automatically pastes it into the active text input field. With this approach, we get a voice mode for almost any application on the system.

Prerequisites

Before we dive in, make sure you have Python installed on your system. You’ll also need to install the following libraries:

pip install keyboard pyautogui pyperclip groq pyaudio

Each of these libraries serves a specific purpose:

  1. PyAudio: To handle audio input from the microphone.
  2. Keyboard: To detect and respond to keyboard events.
  3. PyAutoGUI: For simulating keyboard input to paste the transcribed text.
  4. Pyperclip: To interact with the system clipboard.
  5. Groq: The Groq API client for accessing their Whisper implementation.

Additionally, you’ll need a Groq API key. If you haven’t already, head over to https://console.groq.com/keys to register for a free API key.

The Code

For the complete code, please refer to this GitHub Project. We’ll be breaking down the key components of this script and exploring how they work together to create our voice-to-text tool.

Just this once, we will not be using the Atomic Agents library, but if you are looking for some kick-ass agentic AI stuff, have a look at Streamlining AI Workflows with Modular AI: 5 Extremely Useful Atomic Agents.

Setting Up the Environment

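We start by importing the necessary libraries and setting up the Groq client. The client is initialized with an API key read from the GROQ_API_KEY environment variable. This is a best practice for handling sensitive information like API keys, as it keeps them out of your source code. Note that the snippet below only reads the process environment: if you keep the key in a .env file, that file has to be loaded into the environment before the client is created (for example with a package such as python-dotenv, which is not in the install list above), or you can export the variable in your shell instead.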

import os
import tempfile
import wave
import pyaudio
import keyboard
import pyautogui
import pyperclip
from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
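
When the variable is missing, os.environ.get simply returns None. A minimal sketch of an explicit check that stops with a readable message instead, using only the standard library; the helper name require_env is hypothetical and not part of the Groq client:

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or exit with a clear message."""
    # environ can be passed in for testing; by default the real environment is used.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise SystemExit(f"{name} is not set. Export it before starting the script.")
    return value


# Usage (commented out so the sketch runs without a key):
# client = Groq(api_key=require_env("GROQ_API_KEY"))
```

Failing at startup is friendlier than discovering the missing key after the first recording.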

Recording Audio

The record_audio function, as its name implies, is responsible for capturing audio input:

def record_audio(sample_rate=16000, channels=1, chunk=1024):
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,  # 16-bit samples
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk,
    )

    print("Press and hold the PAUSE button to start recording...")
    frames = []
    try:
        keyboard.wait("pause")  # Block until PAUSE is pressed
        print("Recording... (Release PAUSE to stop)")
        while keyboard.is_pressed("pause"):
            # The stream was already open while we waited, so its buffer may
            # have overflowed; do not raise on that, just keep reading.
            data = stream.read(chunk, exception_on_overflow=False)
            frames.append(data)
        print("Recording finished.")
    finally:
        # Release the audio device even if we are interrupted (e.g. Ctrl+C)
        stream.stop_stream()
        stream.close()
        p.terminate()
    return frames, sample_rate

We use a sample rate of 16000 Hz, which suits our use case well. Whisper resamples its input to 16000 Hz anyway, so a higher sample rate adds no information for the model. It only makes the file larger, and since the upload is limited by file size, a larger file means fewer seconds of speech per request than at 16000 Hz.
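
To make the trade-off concrete, here is a small back-of-the-envelope sketch for uncompressed 16-bit mono WAV. The 25 MiB limit is an illustrative assumption, not a figure from this article, so check the limit that applies to your account; the function names are hypothetical as well:

```python
def wav_bytes_per_second(sample_rate, channels=1, sample_width=2):
    """Bytes of raw PCM produced per second of audio."""
    return sample_rate * channels * sample_width


def max_recording_seconds(limit_bytes, sample_rate, channels=1, sample_width=2):
    """Whole seconds of audio that fit under limit_bytes, allowing for the WAV header."""
    header_bytes = 44  # size of the canonical PCM WAV header
    return (limit_bytes - header_bytes) // wav_bytes_per_second(
        sample_rate, channels, sample_width
    )


ASSUMED_LIMIT = 25 * 1024 * 1024  # illustrative only; check your plan's real limit

for rate in (16000, 48000):
    print(
        rate, "Hz:",
        wav_bytes_per_second(rate), "bytes/s,",
        max_recording_seconds(ASSUMED_LIMIT, rate), "s under the assumed limit",
    )
```

Under that assumed limit, tripling the sample rate cuts the maximum recording length to a third without giving the model anything extra to work with.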

The function sets up a PyAudio stream and waits for the PAUSE button to be pressed. It then records audio in chunks while the button is held down. We chose the PAUSE button because it’s rarely used in modern applications, but you could easily modify this to use a different key if desired.
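
The hold-to-record loop can also be written without hard-coding the key or the audio source, which makes it easy to swap the key and to try the logic without a microphone. This is only a sketch: record_while and max_chunks are hypothetical names, and in the real script you would pass something like lambda: keyboard.is_pressed("f9") and lambda: stream.read(chunk).

```python
def record_while(is_held, read_chunk, max_chunks=None):
    """Collect chunks from read_chunk() for as long as is_held() returns True."""
    frames = []
    while is_held():
        frames.append(read_chunk())
        if max_chunks is not None and len(frames) >= max_chunks:
            break  # safety cap so a stuck key cannot record forever
    return frames


# Simulated key that stays "held" for three polls and is then released.
polls = iter([True, True, True, False])
frames = record_while(lambda: next(polls), lambda: b"\x00\x00" * 4)
print(len(frames))  # 3
```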

Saving Audio to a Temporary File

Once we have recorded audio, we need to save it to a temporary file for processing:

def save_audio(frames, sample_rate, channels=1):
    # delete=False keeps the file on disk after the with block; the caller removes it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio:
        with wave.open(temp_audio, "wb") as wf:
            wf.setnchannels(channels)
            # 16-bit samples -> 2 bytes; no PyAudio instance is needed for this lookup
            wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
            wf.setframerate(sample_rate)
            wf.writeframes(b"".join(frames))
        return temp_audio.name

This function creates a temporary WAV file using the tempfile module. A temporary file fits here because we only need the audio briefly for transcription. Since the file is created with delete=False, it is not removed automatically; we delete it ourselves in the main loop.
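
One way to tie the file's lifetime to the code that uses it is a context manager. The following is a sketch using only the standard library, not the approach taken in the script itself; temporary_wav is a hypothetical name, and the sample width is hard-coded to 2 bytes on the assumption of 16-bit audio:

```python
import contextlib
import os
import tempfile
import wave


@contextlib.contextmanager
def temporary_wav(frames, sample_rate, channels=1, sample_width=2):
    """Write frames to a temporary WAV file and delete it when the block exits."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        with wave.open(tmp, "wb") as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
            wf.setframerate(sample_rate)
            wf.writeframes(b"".join(frames))
    try:
        yield tmp.name
    finally:
        os.unlink(tmp.name)  # runs even if the body of the with block raises


# One second of silence at 16000 Hz, split into two chunks.
silence = [b"\x00\x00" * 8000, b"\x00\x00" * 8000]
with temporary_wav(silence, 16000) as path:
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    print(path, duration)
print(os.path.exists(path))
```

Inside the with block the file can be handed to the transcription call; once the block ends, the file is gone whether or not transcription succeeded.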

Transcribing Audio with Groq

The heart of our tool is the transcription process, handled by the transcribe_audio function:

def transcribe_audio(audio_file_path):
    try:
        with open(audio_file_path, "rb") as file:
            transcription = client.audio.transcriptions.create(
                file=(os.path.basename(audio_file_path), file.read()),
                model="whisper-large-v3",
                prompt="""The audio is by a programmer discussing programming issues, the programmer mostly uses python and might mention python libraries or reference code in his speech.""",
                response_format="text",
                language="en",
            )
        return transcription
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

The function sends the audio file to the Groq API for transcription. We use the “whisper-large-v3” model, which offers high accuracy for speech recognition, and Groq serves it very quickly. The prompt parameter gives the model context about the audio content. Here it says that the speaker is a programmer who may mention Python libraries, which makes the model slightly better at transcribing things like library names. If anything goes wrong, the function prints the error and returns None, so the caller can tell a failed transcription from a successful one.
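
Since the prompt mainly helps with vocabulary, it can be handy to generate it from a list of terms you expect to say. This is a sketch under assumptions: build_vocabulary_prompt is a hypothetical helper, and the cap of 40 terms is an arbitrary guess to keep the prompt short, not a documented limit.

```python
def build_vocabulary_prompt(terms, max_terms=40):
    """Build a short context prompt that mentions the expected technical terms."""
    # Strip whitespace, drop empty entries, and remove duplicates while keeping order.
    unique_terms = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    selected = unique_terms[:max_terms]
    base = "The audio is by a programmer discussing programming issues."
    if not selected:
        return base
    return base + " Terms that may come up: " + ", ".join(selected) + "."


print(build_vocabulary_prompt(["PyAudio", "pyperclip", "PyAudio", " Groq "]))
```

The resulting string could be passed as the prompt argument in place of the fixed text.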

Handling Transcription Output

Once we have the transcribed text, we need to get it into the active application:

def copy_transcription_to_clipboard(text):
    pyperclip.copy(text)
    pyautogui.hotkey("ctrl", "v")

This function uses pyperclip to copy the transcribed text to the clipboard and then simulates a “Ctrl+V” keystroke using pyautogui to paste the text into the active application. Because it relies on the standard paste shortcut, the tool works in practically any text input field that accepts Ctrl+V. Keep in mind that macOS uses the Command key for pasting, and that whatever was on the clipboard before gets overwritten.
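
If the script should also run on macOS, the modifier key can be chosen from sys.platform. A minimal sketch; paste_hotkey and paste_text are hypothetical names, and the GUI libraries are imported lazily so the key selection can be checked without a desktop session:

```python
import sys


def paste_hotkey(platform=None):
    """Return the key names for the paste shortcut on the given platform."""
    platform = sys.platform if platform is None else platform
    modifier = "command" if platform == "darwin" else "ctrl"
    return (modifier, "v")


def paste_text(text):
    """Copy text to the clipboard and paste it with the platform's shortcut."""
    # Imported here so that paste_hotkey() can be used without a GUI session.
    import pyautogui
    import pyperclip

    pyperclip.copy(text)
    pyautogui.hotkey(*paste_hotkey())


print(paste_hotkey("darwin"))  # ('command', 'v')
print(paste_hotkey("linux"))   # ('ctrl', 'v')
```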

The Main Loop

The main() function ties everything together:

def main():
    while True:
        # Record audio
        frames, sample_rate = record_audio()

        # Save audio to temporary file
        temp_audio_file = save_audio(frames, sample_rate)
        try:
            # Transcribe audio
            print("Transcribing...")
            transcription = transcribe_audio(temp_audio_file)
            # Copy transcription to clipboard and paste it
            if transcription:
                print("\nTranscription:")
                print(transcription)
                print("Copying transcription to clipboard...")
                copy_transcription_to_clipboard(transcription)
                print("Transcription copied to clipboard and pasted into the application.")
            else:
                print("Transcription failed.")
        finally:
            # Clean up temporary file, even if something above raised
            os.unlink(temp_audio_file)
        print("\nReady for next recording. Press PAUSE to start.")


if __name__ == "__main__":
    main()

This function runs in an infinite loop, allowing the user to make multiple recordings without restarting the script. Here’s what happens in each iteration:

  • The script records audio when the PAUSE button is pressed and held.
  • The recorded audio is saved to a temporary file.
  • The audio is transcribed using Groq’s Whisper API.
  • If transcription is successful, the text is copied to the clipboard and pasted into the active application.
  • The temporary audio file is deleted to save space.

Be sure to star the project on GitHub or give me a follow if you enjoyed it!
