TASTE: Text-Aligned Speech Tokenization and Embedding
for Spoken Language Modeling

[Paper] [Code] [Model]

MediaTek Research     National Taiwan University

Liang-Hsuan Tseng*, Yi-Chang Chen*, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
*Equal contribution

Abstract: Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human–LLM interaction. Joint speech–text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. The alignment is achieved through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by applying Low-Rank Adaptation to the pre-trained text LLM. Our experimental results show that joint modeling with TASTE and text tokens outperforms other pre-trained SLMs in tasks such as speech continuation and likelihood-based next-speech selection. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling.
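To make the attention-based aggregation in the abstract more concrete, here is a minimal sketch in plain Python. It assumes that text-token embeddings act as queries and speech-frame features act as keys and values, so that each text token pools the frames it attends to. That assignment, the function names, and the toy numbers are illustrative assumptions, not details taken from this page.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(text_queries, speech_keys, speech_values):
    """Return one pooled speech vector per text token (hypothetical helper).

    text_queries:  N vectors of size d, one per text token
    speech_keys:   T vectors of size d, one per speech frame
    speech_values: T vectors, one per speech frame
    """
    dim = len(text_queries[0])
    value_dim = len(speech_values[0])
    pooled = []
    for query in text_queries:
        # Scaled dot-product score between this text token and every frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in speech_keys
        ]
        weights = softmax(scores)
        # Attention-weighted average of the frame values.
        pooled.append([
            sum(w * value[j] for w, value in zip(weights, speech_values))
            for j in range(value_dim)
        ])
    return pooled

# Toy input: 2 text tokens, 6 speech frames, 2-dimensional features.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
speech_frames = [
    [4.0, 0.0], [5.0, 0.0], [4.0, 1.0],
    [0.0, 5.0], [1.0, 4.0], [0.0, 4.0],
]
aligned = aggregate(text_queries, speech_frames, speech_frames)
print(len(speech_frames), "frames ->", len(aligned), "text-aligned vectors")
```

The output length follows the number of text tokens rather than the number of speech frames, which is the property that shortens the speech token sequence.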

↪︎ The overview of TASTE for spoken language modeling.

Stage 1: TASTE Speech Tokenizer

↪︎ The illustration of using TASTE for speech reconstruction.



Demo 1.1 - Speech Reconstruction (on LibriSpeech)

Method: Bitrate (Samples 1 to 5 for each method)

Ground Truth: 256k bps
Encodec: 3000 bps
SpeechTokenizer: 2000 bps
Mimi: 1000 bps
S3 token (topline): 600 bps
Text-only (baseline): ~50 bps
TASTE (ours): ~150 bps
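The bitrates listed for Demo 1.1 can be compared directly. The short script below computes each method's bitrate relative to TASTE. It assumes "256k bps" means 256,000 bits per second and treats the approximate values (~50 and ~150 bps) as exact for the arithmetic.

```python
# Bitrates in bits per second, as listed for Demo 1.1.
# Assumption: "256k" is read as 256,000; "~" values are taken at face value.
bitrates_bps = {
    "Ground Truth": 256_000,
    "Encodec": 3000,
    "SpeechTokenizer": 2000,
    "Mimi": 1000,
    "S3 token (topline)": 600,
    "Text-only (baseline)": 50,
    "TASTE (ours)": 150,
}

reference = bitrates_bps["TASTE (ours)"]
ratios = {method: bps / reference for method, bps in bitrates_bps.items()}

for method, ratio in ratios.items():
    print(f"{method:22s} {bitrates_bps[method]:>7d} bps  {ratio:8.2f}x TASTE")
```

Under these assumptions, the S3 topline uses 4 times the bitrate of TASTE and Encodec uses 20 times, while TASTE itself uses about 3 times the bitrate of the text-only baseline.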



Demo 1.2 - Text-Aligned Speech Editing with TASTE

↪︎ The illustration of using TASTE for text-aligned speech editing.
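Because TASTE tokens are aligned with text tokens, one plausible way to read the editing demos below is as a swap of per-token speech embeddings between two utterances. The sketch below illustrates that idea with string placeholders instead of real embeddings. The function `swap_span` and the span-based interface are hypothetical, and this page does not confirm that the authors' editing procedure works this way.

```python
def swap_span(tokens_a, tokens_b, start, end):
    """Exchange the text-aligned speech tokens in [start, end) between A and B.

    Hypothetical helper: each list holds one speech embedding per text token,
    so index i in either list refers to the i-th text token of that utterance.
    """
    if not 0 <= start <= end <= min(len(tokens_a), len(tokens_b)):
        raise ValueError("span must lie inside both utterances")
    edited_a = tokens_a[:start] + tokens_b[start:end] + tokens_a[end:]
    edited_b = tokens_b[:start] + tokens_a[start:end] + tokens_b[end:]
    return edited_a, edited_b

# Placeholders standing in for per-token speech embeddings.
original_a = ["A0", "A1", "A2", "A3"]
original_b = ["B0", "B1", "B2", "B3"]

edited_a, edited_b = swap_span(original_a, original_b, start=1, end=3)
print(edited_a)
print(edited_b)
```

In this reading, the text tokens of each utterance stay in place, and only the speech-side information attached to the selected positions is exchanged.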



Duration

Original A Original B Edited A Edited B

The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader.




Prosody

Original A Original B Edited A Edited B

In my opinion, she is a hero.


Setting seven timers, what timer duration do you want?




Ablation - content mismatch

(Even when the two utterances differ in content, the substitution does not break the output, and the prosody may even carry over.)

Original A: The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader.
Original B: Any news on the dancer's assault case?
Edited A: The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader.
Edited B: Any news on the dancer's assault case?

Original A: Setting seven timers, what timer duration do you want?
Original B: You have to have the products for their needs.
Edited A: Setting seven timers, what timer duration do you want?
Edited B: You have to have the products for their needs.


Stage 2: Text-Aligned Spoken Language Modeling (TASLM)

↪︎ The illustration of using TASTE for text-aligned spoken language modeling (TASLM).
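The abstract states that joint modeling applies Low-Rank Adaptation (LoRA) to the pre-trained text LLM. As a reminder of what that means for a single linear layer, here is a minimal plain-Python sketch of the standard LoRA forward pass, y = W x + (alpha / r) B (A x), where W stays frozen and only the low-rank factors A and B are trained. The matrix sizes, rank, and alpha below are toy values chosen for illustration and are not the settings used for TASLM.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(weight, lora_a, lora_b, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for one linear layer."""
    rank = len(lora_a)                      # A has shape r x d_in
    base = matvec(weight, x)                # frozen pre-trained path
    update = matvec(lora_b, matvec(lora_a, x))  # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Toy layer: frozen 3x3 identity weight, rank-1 adapter.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[0.5, 0.5, 0.0]]                  # r x d_in
x = [1.0, 2.0, 3.0]

# With B initialised to zero, the adapted layer matches the frozen layer.
zero_b = [[0.0], [0.0], [0.0]]              # d_out x r
y_init = lora_forward(weight, lora_a, zero_b, x, alpha=2.0)

# After B moves away from zero, the output shifts by the low-rank update.
trained_b = [[1.0], [0.0], [-1.0]]
y_adapted = lora_forward(weight, lora_a, trained_b, x, alpha=2.0)
print(y_init, y_adapted)
```

Starting B at zero is the usual LoRA initialisation, so training begins from the behaviour of the unmodified text LLM.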



Demo 2.1 - Speech Continuation with TASLM

Prompt Speech Continuation

The dark mystery of

The dark mystery of the lost city of Atlantis has fascinated mankind for centuries. It is said that the city was located in the Atlantic Ocean, off the coast of modern-day Portugal, and was home to a civilization that flourished for thousands of years.

I've managed to save some

I've managed to save some money on my phone bill, but I'm still paying a lot for my internet. I'm not sure if I should get a new phone or just get a new internet plan.

I reserve your services

I reserve your services for a long time. I am very happy with the result. I will definitely recommend you to my friends.

This powder must not

This powder must not be used in the presence of water or other liquids. It is a very fine powder and should be used with a fine powder applicator.

I’ll take the armchair

I’ll take the armchair in the corner and just sit there. It’ll feel better. I’ll try to not sleep.


Demo 2.2 - Speech Generation with Text-only Prompt

Text-only Prompt Speech Generation

And then later on, fully acquiring that company.

It’s a very good example of how you can use the power of a platform to create a business that's going to be successful.

Rainfall alone does not constitute a storm.

It is the combination of Rainfall and Wind that causes the most damage. In the case of a storm, the wind is the most important factor.

Other sweet herbs are cultivated

for their medicinal properties. They are used to treat a variety of ailments, including coughs, colds, and other respiratory problems.

Tom: Hi! How are you today? END Jerry: I'm good, thanks! How about you? END Tom: Not bad. Just a bit tired from work. END Jerry: Me too. END Tom:

I've got a few minutes to talk to you.


Demo 2.3 - Few-shot learning to facilitate question answering

prompt: "Q: which countries border the us? A: Canada. Q: what is charles darwin famous for? A: Evolution Q: who is robert downey jr wife? A: Susan Downey Q: [Question-Audio-Here] A:"
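The few-shot prompt above follows a simple template: a list of text question and answer pairs, followed by a final question slot where the spoken question is inserted. A small sketch of assembling that string is shown below. The helper name `build_prompt` is hypothetical, and how the audio is actually spliced in at the placeholder is not described on this page.

```python
def build_prompt(examples, placeholder="[Question-Audio-Here]"):
    """Join text Q/A shots and leave a final slot for the spoken question."""
    shots = [f"Q: {question} A: {answer}" for question, answer in examples]
    shots.append(f"Q: {placeholder} A:")
    return " ".join(shots)

# The three text-only shots used in the prompt shown above.
examples = [
    ("which countries border the us?", "Canada."),
    ("what is charles darwin famous for?", "Evolution"),
    ("who is robert downey jr wife?", "Susan Downey"),
]

prompt = build_prompt(examples)
print(prompt)
```

The prompt ends with an open "A:" so that the model's continuation is the answer to the spoken question.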

Question Audio (Input) Generated Audio

Who discovered the theory of gravity?

Isaac Newton

Who painted the famous painting the Starry Night?

Vincent van Gogh

Who was the first woman to fly solo across the Atlantic Ocean?

Amelia Earhart

What is the largest mammal?

The blue whale

What is the largest canyon in the world?

Grand Canyon

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.