Building My Own Productivity Voice Agent

Voice assistants are one of the biggest opportunities in productivity software right now. The way we interact with our tools - tasks, notes, calendars, docs - is still dominated by tapping, typing, and switching apps. A voice-native AI that can actually do things on your behalf changes that fundamentally.

OpenAI's new GPT Realtime 2 model, only available via API at the time of writing, shows what this could look like: a realtime voice model for speech-to-speech interactions with configurable reasoning effort, stronger instruction following, and more reliable tool use for complex voice-agent workflows.

That last part is crucial. This is no longer a transcription pipeline duct-taped to a chat model and a TTS engine. It's a single model that hears you, reasons, calls tools across your productivity stack, and talks back. The latency drops, the conversation feels natural, and suddenly your assistant is something you actually want to use to get work done.

What I learned building Jarvis

To see how far this could go, I built my own application called Jarvis - a voice agent with access to Todoist and Notion via MCP.

Figure: the Jarvis productivity voice agent interface.

Jarvis can:

Find my tasks, create new ones, and change their priority or project in Todoist.
Create and update pages in Notion, and move them between databases.
Do all of this inside a natural, free-flowing conversation.

It genuinely feels like the next step in productivity, because the interaction is effortless. No tapping through menus, no context switching - just thinking out loud and watching the system keep up.

Prototype vs. product

Building Jarvis was also a sharp reminder of how much harder it is to productize this than to prototype it.

Wiring up tools for myself was simple. Doing the same thing as a SaaS product is a different problem entirely: every user needs their own authenticated connections to their own tools, managed securely and reliably.
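To make the per-user problem concrete, here is a minimal sketch of what a connection store might look like. It is not how Jarvis works (I only ever had one user: me), and every name in it (`Connection`, `ConnectionStore`, the fields) is a hypothetical I made up for illustration. A real service would also need encryption at rest, token refresh, and revocation.

```python
from dataclasses import dataclass
from typing import Optional
import time


@dataclass(frozen=True)
class Connection:
    """One user's authenticated link to one tool (all fields are hypothetical)."""
    user_id: str
    tool: str            # e.g. "todoist" or "notion"
    access_token: str    # a real service would encrypt this at rest
    expires_at: float    # Unix timestamp


class ConnectionStore:
    """In-memory stand-in for a secure, per-user credential store."""

    def __init__(self) -> None:
        self._by_key = {}  # maps (user_id, tool) -> Connection

    def save(self, conn: Connection) -> None:
        self._by_key[(conn.user_id, conn.tool)] = conn

    def get(self, user_id: str, tool: str, now: Optional[float] = None) -> Connection:
        """Return the caller's own connection, or raise if missing or expired."""
        now = time.time() if now is None else now
        conn = self._by_key.get((user_id, tool))
        if conn is None:
            raise LookupError(f"{user_id} has not connected {tool}")
        if conn.expires_at <= now:
            raise PermissionError(f"{tool} token for {user_id} expired; re-auth needed")
        return conn
```

The point of keying on both user and tool is that one user's lookup can never return another user's token, which is the property a single-user prototype never has to think about.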

On top of that, real-world robustness is its own challenge:

Picking the right tool, with the right arguments, every time.
Handling bad microphone quality.
Surviving network interruptions and changing connectivity.
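For the first of these, one defensive measure is to check every tool call the model proposes before running it. The sketch below is only an illustration under my own assumptions: the tool names, argument names, and `TOOL_SPECS` table are hypothetical and do not come from the Todoist or Notion MCP servers.

```python
from typing import Any, Dict, List

# Hypothetical tool specs: required argument names and their expected types.
TOOL_SPECS: Dict[str, Dict[str, type]] = {
    "create_task": {"content": str, "priority": int},
    "move_page": {"page_id": str, "database_id": str},
}


def validate_tool_call(name: str, args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks safe to run."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]

    problems = []
    for arg, expected in spec.items():
        if arg not in args:
            problems.append(f"missing argument: {arg}")
        elif type(args[arg]) is not expected:
            # Exact type match, so True/False is not accepted where an int is expected.
            problems.append(
                f"{arg} should be {expected.__name__}, got {type(args[arg]).__name__}"
            )
    for arg in args:
        if arg not in spec:
            problems.append(f"unexpected argument: {arg}")
    return problems
```

If the list comes back non-empty, the agent can hand the problems back to the model and ask it to try again, rather than sending a malformed request to a real tool.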

Getting to a prototype that lets you feel the power of a native voice model is exciting and relatively fast. Turning that into something I'd rely on every day for my productivity is a much bigger lift.

Why OpenAI should embed this into ChatGPT

This is exactly why I hope OpenAI embeds its realtime voice model directly into the ChatGPT app - and in doing so, turns ChatGPT into the productivity agent everyone uses every day.

They've already solved the hardest non-model part of this:

Server-side integrations. ChatGPT authenticates your productivity tools once and keeps the connection alive on the server.
Cross-device, cross-modality. Those connections are available everywhere you use ChatGPT - the ChatGPT app, Codex on the Mac, and so on.

If they extend that same integration layer to the realtime voice model, every productivity tool you've already connected becomes instantly usable by voice. That's the unlock: a single, always-available productivity agent you can just talk to, anywhere, that already knows your tasks, notes, and calendar.
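As a rough illustration of that idea, the sketch below takes the integrations a user has already connected and returns the functions a voice session could be allowed to call. The catalog, the integration names, and the function names are all hypothetical. I have no knowledge of how ChatGPT's integration layer is actually built, so treat this as a thought experiment in code.

```python
from typing import List, Set

# Hypothetical catalog: which callable functions each integration exposes.
CATALOG = {
    "todoist": ["find_tasks", "create_task", "update_task"],
    "notion": ["create_page", "update_page", "move_page"],
    "calendar": ["list_events"],
}


def tools_for_voice_session(connected: Set[str]) -> List[str]:
    """List the functions a voice session may call for this user."""
    tools = []
    for integration in sorted(connected):  # sorted for a stable order
        # Integrations missing from the catalog simply contribute nothing.
        for function_name in CATALOG.get(integration, []):
            tools.append(f"{integration}.{function_name}")
    return tools


# A user who has connected Todoist and Notion, but not a calendar:
print(tools_for_voice_session({"todoist", "notion"}))
```

Nothing new has to be authenticated here. The voice session only reuses what the user has already connected.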

The one missing piece I'd add: a watchOS app, so I could capture a task or check my day on the go without my phone.

Who can actually pull this off

This is a massive opportunity, and OpenAI is uniquely positioned for it. They have:

Their own state-of-the-art voice model.
The application infrastructure across devices and modalities.
The distribution to roll it out at scale right now.

The other contenders fall into a few buckets.

Model and software players:

Anthropic doesn't have a strong voice model of its own. Even its existing transcription is lacking.
Google / DeepMind is a realistic contender. It has comparable standing in the market, and it has its own AI models.
Notion is the wildcard. It has invested heavily in AI infrastructure and in a user experience built around AI workflows, and it already owns a big chunk of the productivity stack its users live in. Notion doesn't need its own voice model either - it can pick up a model like GPT Realtime 2 and wire it into the product. That's a very different, but very credible, path to the same outcome.

Hardware-first players:

Amazon, Apple, and Sonos don't need to invent a voice model. They need to properly productize the ones already available and plug them into the productivity tools people use. Amazon is clearly already going down this path with Alexa+. The hardware footprint is a real moat: there's a meaningful difference between "I have to pick up my phone" and "I just say it into the room."

The bottom line

Voice-native, tool-using AI isn't a gimmick. It's the most natural interface productivity software has ever had, and it's about to become a real product category of its own.

The winners won't just be whoever has the best voice model. They'll be whoever combines a great voice model with deep integrations into the productivity tools people actually use, broad device coverage, and the distribution to put it in everyone's pocket - and on their wrist.