Portfolio

Instantly Translate Every Customer Voice – MOHA Meeting Co-Pilot

INDUSTRY
Other
TECHNOLOGIES USED
AI (Deep Learning), AWS
Category
AI Development, Software, Web App

 MOHA’s in-house AI Meeting Co-Pilot brings real-time transcription, live translation, and voice-cloned responses directly into every Google Meet session. By embedding language AI inside day-to-day collaboration, the platform eliminates “Can you repeat that?” moments, accelerates decisions, and safeguards sensitive project discussions under strict security controls. What began as an engineering side-project is now a core productivity service supported by a clear roadmap toward automated minutes, action-item extraction, and context-aware reply suggestions.

Business Overview

  • Global collaboration: Development squads span Vietnam, Japan, and North America, with clients joining calls from additional regions.
  • High stakes: Requirements misunderstandings or missed action items ripple into re-work, delayed releases, and frustrated stakeholders.
  • Strategic mandate: MOHA’s leadership asked the AI Center of Excellence to prototype an internal tool that would remove language friction, preserve confidentiality, and fit seamlessly into existing Google Workspace workflows.

 

Challenge

Pre-Co-Pilot Reality

  • Language barriers: Mixed-language calls forced frequent clarifications; nuanced feedback was sometimes lost.
  • Cognitive load: Team members took manual notes while listening, speaking, and screen-sharing, leading to omissions.
  • Meeting efficiency: Repeated “Could you say that again?” exchanges stretched 30-minute calls into 45-minute sessions.
  • Security & compliance: Third-party caption extensions were off-limits for client projects containing NDA-protected information.

Design Principles

  • Speech Recognition Layer (Google Speech-to-Text, streaming): Multichannel audio is mixed locally and streamed with word-level timestamps; supports custom phrase hints for domain jargon (see the sketch after this list).
  • Translation Engine (Google Cloud Translation): Optional; invoked per listener to render captions in their preferred language.
  • Voice-Cloning Synthesizer (ElevenLabs): Receives short user text prompts and returns speaker-style audio in under 20 seconds; model voices are trained from five clean samples.
  • Secure Orchestration (AWS Lambda + API Gateway): Stateless services route audio packets, captions, and synthesis requests; IAM roles restrict cross-project access (see the handler sketch below).
  • Storage (AWS S3 with server-side KMS encryption): Transcripts and synthesized clips are stored with lifecycle policies that auto-purge content after a configurable retention period.
  • Front-End Overlay (Chrome extension injected into the Meet DOM): Displays captions, a language selector, and a “Reply with Voice” button; all UI logic runs client-side to minimize round-trip lag.
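To make the speech-recognition and translation layers above concrete, the sketch below shows how streaming captions with word-level timestamps, phrase hints, and per-listener translation could be wired together. It assumes the official google-cloud-speech and google-cloud-translate Python clients; the language codes, phrase hints, and audio source are illustrative placeholders rather than MOHA's production code.

```python
# Caption pipeline sketch: streaming recognition with phrase hints,
# then per-listener translation of each finalized caption.
# Client-library versions vary; treat signatures as illustrative.
from google.cloud import speech
from google.cloud import translate_v2 as translate

speech_client = speech.SpeechClient()
translate_client = translate.Client()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16_000,
        language_code="en-US",
        enable_word_time_offsets=True,          # word-level timestamps
        speech_contexts=[speech.SpeechContext(  # phrase hints for domain jargon
            phrases=["MOHA", "Co-Pilot", "sprint review"],
        )],
    ),
    interim_results=True,  # emit partial captions for lower perceived latency
)

def stream_captions(audio_chunks, listener_lang="ja"):
    """Yield (original, translated) caption pairs for one listener."""
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks
    )
    for response in speech_client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if not result.is_final:
                continue  # only translate finalized segments
            text = result.alternatives[0].transcript
            translated = translate_client.translate(
                text, target_language=listener_lang
            )["translatedText"]
            yield text, translated
```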

End-to-end latency from spoken word to on-screen caption averages under two seconds—fast enough to follow rapid-fire technical debates.
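As one illustration of the stateless orchestration layer described above, the following sketch shows a Lambda handler behind API Gateway that routes caption and synthesis requests. The event shape, action names, and responses are assumptions for illustration, not the actual service contract.

```python
# Orchestration sketch: a stateless Lambda handler behind API Gateway.
# Action names and payload fields are illustrative only.
import json

def handler(event, context):
    """Route caption, voice-reply, and transcript requests."""
    body = json.loads(event.get("body") or "{}")
    action = body.get("action")

    if action == "caption":
        # Hand a finalized caption segment to per-listener translation.
        return _respond(200, {"status": "caption queued"})
    if action == "voice_reply":
        # Forward the typed reply to the voice-cloning synthesizer.
        return _respond(202, {"status": "synthesis started"})
    return _respond(400, {"error": f"unknown action: {action}"})

def _respond(status_code, payload):
    # API Gateway proxy integrations expect statusCode plus a JSON string body.
    return {"statusCode": status_code, "body": json.dumps(payload)}
```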

User Workflow

Before the Call

  1. Host toggles the “Enable Co-Pilot” switch in the Meet extension.
  2. Participants choose a caption language (default: spoken language detected).
  3. Pre-meeting checklist confirms recording consent where required by client contracts.

During the Call

  • Live captions scroll beneath video tiles; overlapping speakers are colour-coded by audio source ID.
  • Any participant can click “Voice-Reply” next to a caption, type a short answer, and hear it spoken back in their own cloned voice, instantly translated if the recipient uses a different language (see the sketch after this list).
  • A status chip shows encryption is active and data is being written only to MOHA’s S3 bucket.
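The “Voice-Reply” path boils down to one synthesis call per typed answer. Below is a minimal sketch assuming the public ElevenLabs text-to-speech REST endpoint; the voice ID, model name, and API-key environment variable are placeholders, and field names should be checked against current ElevenLabs documentation.

```python
# Voice-reply sketch: turn a typed answer into audio in the user's cloned voice.
# Endpoint, headers, and fields follow ElevenLabs' public REST API; verify
# against current documentation before relying on them.
import os
import requests

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def synthesize_reply(text: str, voice_id: str) -> bytes:
    """Return audio bytes spoken in the participant's cloned voice."""
    response = requests.post(
        ELEVENLABS_TTS_URL.format(voice_id=voice_id),
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    response.raise_for_status()
    return response.content  # MPEG audio, ready to play back into the call
```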

After the Call

  • The host receives a private transcript link in Slack (permissioned to invited attendees).

  • Participants can highlight paragraphs to mark action items; these flags feed the upcoming minutes-generation module.
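Because transcripts are written only to MOHA's isolated bucket, storage and retention come down to a few lines of infrastructure code. Here is a minimal sketch assuming boto3; the bucket name, KMS key alias, object key, and retention period are placeholders.

```python
# Storage sketch: encrypted transcript upload plus a retention-based purge rule.
# Bucket, key alias, prefix, and retention days are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "moha-meeting-copilot"  # hypothetical bucket name

transcript_json = '{"meeting": "sprint-review", "segments": []}'  # example payload

# Write the transcript encrypted with a customer-managed KMS key.
s3.put_object(
    Bucket=BUCKET,
    Key="transcripts/sprint-review.json",
    Body=transcript_json.encode("utf-8"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/moha-copilot",
)

# Lifecycle rule: auto-purge transcripts after the configured retention window.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "purge-transcripts",
            "Filter": {"Prefix": "transcripts/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }]
    },
)
```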

Impact and Early Feedback

  • Product managers report that cross-border sprint reviews finish closer to their scheduled time with far fewer clarifications.
  • Junior engineers use transcripts as learning material, picking up domain vocabulary faster than before.
  • Clients appreciate instantaneous translated summaries when decision points arise, reducing follow-up email chains.
  • Information-security auditors accepted the solution without exceptions, citing its isolated storage and detailed audit trail.

Lessons Learned

  • Latency budgets matter. Even half-second hiccups break conversational flow; local buffering and WebRTC optimizations paid off.
  • Explainable AI builds trust. Icons that clearly label synthesized speech prevent confusion and increase user comfort.
  • Early security sign-off is essential. Involving infosec during the prototype phase saved weeks of re-architecture later.

The near-term roadmap for the AI Meeting Co-Pilot focuses on transforming the tool from a real-time translator into a fully fledged meeting assistant. First, the team is building one-click generation of polished minutes that automatically highlight decisions and convert flagged action items into Jira tasks, eliminating post-call busywork. In parallel, engineers are adding a contextual reply engine that scans Confluence pages, design docs, and recent code comments to suggest draft responses users can approve or tweak before they’re spoken in the participant’s own voice. Accuracy will improve through enhanced speaker diarization—particularly when multiple voices overlap or network conditions are poor—while latency and data-sovereignty concerns will be addressed by evaluating an on-premise Whisper v3 model fine-tuned on MOHA’s internal audio corpus.
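For the data-sovereignty track, the snippet below is a minimal sketch of what local transcription could look like, assuming the open-source openai-whisper package; the model size and audio file are placeholders, and fine-tuning on MOHA's internal corpus is outside its scope.

```python
# On-premise transcription sketch using the open-source openai-whisper package.
# Model size and audio path are placeholders; fine-tuning is not shown here.
import whisper

model = whisper.load_model("large-v3")          # runs entirely on local hardware
result = model.transcribe("sprint_review.wav")  # no audio leaves the network

for segment in result["segments"]:
    print(f'[{segment["start"]:7.1f}s] {segment["text"].strip()}')
```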

 

Together, these enhancements will deepen the co-pilot’s integration with day-to-day tooling, shorten feedback loops, and further reduce the cognitive load of multilingual, geographically distributed meetings.
