Microsoft Goes Beyond LLMs With New Voice, Image Models
The new AI models signal a stronger push toward Microsoft-developed AI systems.
Microsoft on Thursday unveiled three new AI models under its Microsoft AI (MAI) division, expanding beyond conventional large language models into multimodal, in-house capabilities.
The release includes MAI-Transcribe-1, a new speech-to-text system, alongside the voice generation model MAI-Voice-1 and the image model MAI-Image-2. All three are available on Microsoft Foundry and the MAI Playground.
MAI-Transcribe-1 is Microsoft’s first dedicated transcription model, designed to convert audio into text across 25 languages. Potential applications include video captioning, meeting transcriptions and voice-enabled agents.
According to Microsoft, the model can operate at speeds up to 2.5 times faster than its existing Azure Fast transcription model.
MAI-Voice-1, meanwhile, is designed for high-quality speech generation.
The model can generate up to a minute of audio in a single second, with an emphasis on natural, emotional tone and speaker personality.
The third release, MAI-Image-2, represents the second generation of Microsoft’s in-house image model. The company says it offers at least twice the generation speed of its predecessor while providing more realistic details, such as skin tone, lighting and textures.
The model is aimed at the creative industries and is already being rolled out across Microsoft products, with integrations planned for the Bing search engine and PowerPoint.
Early customers include marketing and communications firm WPP, Microsoft said.
“MAI-Image-2 is a genuine game-changer,” Rob Reilly, global chief creative officer at WPP, said in a MAI blog post on the launch. “It’s a platform that not only responds to the intricate nuance of creative direction, but deeply respects the sheer craft involved in generating real-world, campaign-ready images.”
In the post, Microsoft said the updates come as it pursues a more "humanist" AI.
“We have a distinct view when creating our AI models -- putting humans at the center, optimizing for how people actually communicate, training for practical use,” the company said.
The launches also reflect a broader strategic shift as Microsoft looks to diversify its AI portfolio and reduce reliance on external partners such as OpenAI. It is also aiming to strengthen its competitive standing against rivals such as Google and Amazon, both of which have been investing heavily in proprietary AI stacks.
About the Author
Contributing Writer
Scarlett Evans is a freelance writer with a focus on emerging technologies and the minerals industry. Previously, she served as assistant editor at IoT World Today, where she specialized in robotics and smart city technologies. Scarlett also has a background in the mining and resources sector, with experience at Mine Australia, Mine Technology and Power Technology. She joined Informa in April 2022 before transitioning to freelance work.