Portfolio

Instantly Translate Every Customer Voice – MOHA Meeting Co-Pilot

INDUSTRY
Other
TECHNOLOGIES USED
AI (Deep Learning), AWS
Category
AI Development, Software, Web App

 MOHA’s in-house AI Meeting Co-Pilot brings real-time transcription, live translation, and voice-cloned responses directly into every Google Meet session. By embedding language AI inside day-to-day collaboration, the platform eliminates “Can you repeat that?” moments, accelerates decisions, and safeguards sensitive project discussions under strict security controls. What began as an engineering side-project is now a core productivity service supported by a clear roadmap toward automated minutes, action-item extraction, and context-aware reply suggestions.

Business Overview

  • Global collaboration: Development squads span Vietnam, Japan, and North America, with clients joining calls from additional regions.
  • High stakes: Requirements misunderstandings or missed action items ripple into re-work, delayed releases, and frustrated stakeholders.
  • Strategic mandate: MOHA’s leadership asked the AI Center of Excellence to prototype an internal tool that would remove language friction, preserve confidentiality, and fit seamlessly into existing Google Workspace workflows.

 

Challenge

Pre-Co-Pilot Reality

  • Language barriers: Mixed-language calls forced frequent clarifications; nuanced feedback was sometimes lost.
  • Cognitive load: Team members took manual notes while listening, speaking, and screen-sharing, leading to omissions.
  • Meeting efficiency: Repeated “Could you say that again?” exchanges stretched 30-minute calls into 45-minute sessions.
  • Security & compliance: Third-party caption extensions were off-limits for client projects containing NDA-protected information.

Design Principles

  • Speech Recognition Layer (Google Speech-to-Text, streaming): Multichannel audio is mixed locally and streamed with word-level timestamps; supports custom phrase hints for domain jargon (see the sketch after this list).
  • Translation Engine (Google Cloud Translation): Optional; invoked per listener to render captions in their preferred language.
  • Voice-Cloning Synthesizer (ElevenLabs): Receives short user text prompts and returns speaker-style audio in under 20 seconds; model voices are trained from five clean samples.
  • Secure Orchestration (AWS Lambda + API Gateway): Stateless services route audio packets, captions, and synthesis requests; IAM roles restrict cross-project access (see the handler sketch below).
  • Storage (AWS S3 with server-side KMS encryption): Transcripts and synthesized clips are stored with lifecycle policies that auto-purge content after a configurable retention period.
  • Front-End Overlay (Chrome extension injected into the Meet DOM): Displays captions, a language selector, and a “Reply with Voice” button; all UI logic runs client-side to minimize round-trip lag.
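To make the speech-recognition and translation layers above concrete, the sketch below shows how streaming captions with word-level timestamps, phrase hints, and per-listener translation could be wired together. It assumes the official google-cloud-speech and google-cloud-translate Python clients; the language codes, phrase hints, and audio source are illustrative placeholders rather than MOHA's production code.

```python
# Caption pipeline sketch: streaming recognition with phrase hints,
# then per-listener translation of each finalized caption.
# Client-library versions vary; treat signatures as illustrative.
from google.cloud import speech
from google.cloud import translate_v2 as translate

speech_client = speech.SpeechClient()
translate_client = translate.Client()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16_000,
        language_code="en-US",
        enable_word_time_offsets=True,          # word-level timestamps
        speech_contexts=[speech.SpeechContext(  # phrase hints for domain jargon
            phrases=["MOHA", "Co-Pilot", "sprint review"],
        )],
    ),
    interim_results=True,  # emit partial captions for lower perceived latency
)

def stream_captions(audio_chunks, listener_lang="ja"):
    """Yield (original, translated) caption pairs for one listener."""
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks
    )
    for response in speech_client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if not result.is_final:
                continue  # only translate finalized segments
            text = result.alternatives[0].transcript
            translated = translate_client.translate(
                text, target_language=listener_lang
            )["translatedText"]
            yield text, translated
```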

End-to-end latency from spoken word to on-screen caption averages under two seconds—fast enough to follow rapid-fire technical debates.
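As one illustration of the stateless orchestration layer described above, the following sketch shows a Lambda handler behind API Gateway that routes caption and synthesis requests. The event shape, action names, and responses are assumptions for illustration, not the actual service contract.

```python
# Orchestration sketch: a stateless Lambda handler behind API Gateway.
# Action names and payload fields are illustrative only.
import json

def handler(event, context):
    """Route caption, voice-reply, and transcript requests."""
    body = json.loads(event.get("body") or "{}")
    action = body.get("action")

    if action == "caption":
        # Hand a finalized caption segment to per-listener translation.
        return _respond(200, {"status": "caption queued"})
    if action == "voice_reply":
        # Forward the typed reply to the voice-cloning synthesizer.
        return _respond(202, {"status": "synthesis started"})
    return _respond(400, {"error": f"unknown action: {action}"})

def _respond(status_code, payload):
    # API Gateway proxy integrations expect statusCode plus a JSON string body.
    return {"statusCode": status_code, "body": json.dumps(payload)}
```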

User Workflow

Before the Call

  1. Host toggles the “Enable Co-Pilot” switch in the Meet extension.
  2. Participants choose a caption language (default: spoken language detected).
  3. Pre-meeting checklist confirms recording consent where required by client contracts.

During the Call

  • Live captions scroll beneath video tiles; overlapping speakers are colour-coded by audio source ID.
  • Any participant can click “Voice-Reply” next to a caption, type a short answer, and hear it spoken back in their own cloned voice, instantly translated if the recipient uses a different language (see the sketch after this list).
  • A status chip shows encryption is active and data is being written only to MOHA’s S3 bucket.
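The “Voice-Reply” path boils down to one synthesis call per typed answer. Below is a minimal sketch assuming the public ElevenLabs text-to-speech REST endpoint; the voice ID, model name, and API-key environment variable are placeholders, and field names should be checked against current ElevenLabs documentation.

```python
# Voice-reply sketch: turn a typed answer into audio in the user's cloned voice.
# Endpoint, headers, and fields follow ElevenLabs' public REST API; verify
# against current documentation before relying on them.
import os
import requests

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def synthesize_reply(text: str, voice_id: str) -> bytes:
    """Return audio bytes spoken in the participant's cloned voice."""
    response = requests.post(
        ELEVENLABS_TTS_URL.format(voice_id=voice_id),
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    response.raise_for_status()
    return response.content  # MPEG audio, ready to play back into the call
```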

After the Call

  • The host receives a private transcript link in Slack (permissioned to invited attendees).

  • Participants can highlight paragraphs to mark action items; these flags feed the upcoming minutes-generation module.
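Because transcripts are written only to MOHA's isolated bucket, storage and retention come down to a few lines of infrastructure code. Here is a minimal sketch assuming boto3; the bucket name, KMS key alias, object key, and retention period are placeholders.

```python
# Storage sketch: encrypted transcript upload plus a retention-based purge rule.
# Bucket, key alias, prefix, and retention days are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "moha-meeting-copilot"  # hypothetical bucket name

transcript_json = '{"meeting": "sprint-review", "segments": []}'  # example payload

# Write the transcript encrypted with a customer-managed KMS key.
s3.put_object(
    Bucket=BUCKET,
    Key="transcripts/sprint-review.json",
    Body=transcript_json.encode("utf-8"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/moha-copilot",
)

# Lifecycle rule: auto-purge transcripts after the configured retention window.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "purge-transcripts",
            "Filter": {"Prefix": "transcripts/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }]
    },
)
```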

Impact and Early Feedback

  • Product managers report that cross-border sprint reviews finish closer to their scheduled time with far fewer clarifications.
  • Junior engineers use transcripts as learning material, picking up domain vocabulary faster than before.
  • Clients appreciate instantaneous translated summaries when decision points arise, reducing follow-up email chains.
  • Information-security auditors accepted the solution without exceptions, citing its isolated storage and detailed audit trail.

Lessons Learned

  • Latency budgets matter. Even half-second hiccups break conversational flow; local buffering and WebRTC optimizations paid off.
  • Explainable AI builds trust. Icons that clearly label synthesized speech prevent confusion and increase user comfort.
  • Early security sign-off is essential. Involving infosec during the prototype phase saved weeks of re-architecture later.

The near-term roadmap for the AI Meeting Co-Pilot focuses on transforming the tool from a real-time translator into a fully fledged meeting assistant. First, the team is building one-click generation of polished minutes that automatically highlight decisions and convert flagged action items into Jira tasks, eliminating post-call busywork. In parallel, engineers are adding a contextual reply engine that scans Confluence pages, design docs, and recent code comments to suggest draft responses users can approve or tweak before they’re spoken in the participant’s own voice. Accuracy will improve through enhanced speaker diarization—particularly when multiple voices overlap or network conditions are poor—while latency and data-sovereignty concerns will be addressed by evaluating an on-premise Whisper v3 model fine-tuned on MOHA’s internal audio corpus.
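For the data-sovereignty track, the snippet below is a minimal sketch of what local transcription could look like, assuming the open-source openai-whisper package; the model size and audio file are placeholders, and fine-tuning on MOHA's internal corpus is outside its scope.

```python
# On-premise transcription sketch using the open-source openai-whisper package.
# Model size and audio path are placeholders; fine-tuning is not shown here.
import whisper

model = whisper.load_model("large-v3")          # runs entirely on local hardware
result = model.transcribe("sprint_review.wav")  # no audio leaves the network

for segment in result["segments"]:
    print(f'[{segment["start"]:7.1f}s] {segment["text"].strip()}')
```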

 

Together, these enhancements will deepen the co-pilot’s integration with day-to-day tooling, shorten feedback loops, and further reduce the cognitive load of multilingual, geographically distributed meetings.
