TASTE: Text-Aligned Speech Tokenization and Embedding
for Spoken Language Modeling
[Paper]
[Code]
[Model]
MediaTek Research, National Taiwan University
Liang-Hsuan Tseng*, Yi-Chang Chen*, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
*Equal contribution
Abstract: Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human–LLM interaction. Joint speech–text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We achieve this through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by using Low-Rank Adaptation on the pre-trained text LLM. Our experimental results show that joint modeling with TASTE and text tokens outperforms other pre-trained SLMs in tasks such as speech continuation and likelihood-based next-speech selection. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling.
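As a rough illustration of the attention-based aggregation mentioned in the abstract, the sketch below uses single-head cross-attention in NumPy so that each text token gathers information from all speech frames and the output has one vector per text token. The choice of text tokens as queries and speech frames as keys and values, the function name `aggregate_to_text`, and all dimensions are assumptions made for illustration; this is not the authors' implementation and it omits quantization and training.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def aggregate_to_text(text_queries, speech_keys, speech_values):
    """Single-head cross-attention (illustrative, hypothetical helper).

    text_queries:  (n_text, d)    one query per text token
    speech_keys:   (n_frames, d)  one key per speech frame
    speech_values: (n_frames, d)  one value per speech frame
    Returns an (n_text, d) array plus the (n_text, n_frames) attention weights.
    """
    d = text_queries.shape[-1]
    scores = text_queries @ speech_keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # each row sums to 1 over frames
    return weights @ speech_values, weights


# Toy inputs: 6 text tokens, 150 speech frames, 16-dim features (all assumed).
rng = np.random.default_rng(0)
n_text, n_frames, d = 6, 150, 16
q = rng.normal(size=(n_text, d))
k = rng.normal(size=(n_frames, d))
v = rng.normal(size=(n_frames, d))

aligned, weights = aggregate_to_text(q, k, v)
print(aligned.shape)  # one vector per text token, regardless of frame count
```

The point of the sketch is the shape change: the output length follows the text sequence rather than the number of speech frames, which is what makes the sequence much shorter.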
↪︎ The overview of TASTE for spoken language modeling.
Stage 1: TASTE Speech Tokenizer
↪︎ The illustration of using TASTE for speech reconstruction.
Demo 1.1 - Speech Reconstruction (on LibriSpeech)
| Method | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground Truth (256k bps) | | | | | |
| Encodec (3000 bps) | | | | | |
| SpeechTokenizer (2000 bps) | | | | | |
| Mimi (1000 bps) | | | | | |
| S3 token (topline, 600 bps) | | | | | |
| Text-only (baseline, ~50 bps) | | | | | |
| TASTE (ours, ~150 bps) | | | | | |
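To put the bitrates in the table in context, the following sketch shows the usual arithmetic for uncompressed audio and for a codebook-based tokenizer. The 16 kHz, 16-bit PCM setting and the 75 Hz, 4-codebook, 1024-entry configuration are assumptions chosen because they reproduce the 256k bps and 3000 bps figures; the page itself does not state how each number was derived.

```python
import math


def pcm_bitrate(sample_rate_hz, bits_per_sample):
    """Bits per second of uncompressed single-channel PCM audio."""
    return sample_rate_hz * bits_per_sample


def codec_bitrate(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second of a discrete codec: frames/s x codebooks x bits per code."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Assumed settings, not taken from the page.
ground_truth_bps = pcm_bitrate(16_000, 16)
codec_bps = codec_bitrate(75, 4, 1024)

print(ground_truth_bps)  # 256000
print(codec_bps)         # 3000.0
```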
Demo 1.2 - Text-Aligned Speech Editing with TASTE
↪︎ The illustration of using TASTE for text-aligned speech editing.
Duration
| Original A | Original B | Edited A | Edited B |
|---|---|---|---|
| The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. |
| The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. |
| The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. |
Prosody
| Original A | Original B | Edited A | Edited B |
|---|---|---|---|
| In my opinion, she is a hero. | In my opinion, she is a hero. | In my opinion, she is a hero. | In my opinion, she is a hero. |
| Setting seven timers, what timer duration do you want? | Setting seven timers, what timer duration do you want? | Setting seven timers, what timer duration do you want? | Setting seven timers, what timer duration do you want? |
Ablation - content mismatch
(Even when the content being swapped differs between the two utterances, the edit does not break, and the prosody may still carry over.)
| Original A | Original B | Edited A | Edited B |
|---|---|---|---|
| The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | Any news on the dancer's assault case? | The captain's face had been buried in a pile of papers, but now Murdoch came around to stare at the gang leader. | Any news on the dancer's assault case? |
| Setting seven timers, what timer duration do you want? | You have to have the products for their needs. | Setting seven timers, what timer duration do you want? | You have to have the products for their needs. |
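One way to picture the text-aligned editing demos above: because each utterance is represented by one embedding per text token, an edit can be expressed as exchanging the embeddings at chosen token positions between utterance A and utterance B before decoding. The sketch below shows only that exchange step. The helper `swap_span`, the array shapes, and the requirement that both utterances have the same number of tokens are assumptions for illustration; the page does not specify the editing procedure, and the content-mismatch case with different lengths is not covered here.

```python
import numpy as np


def swap_span(emb_a, emb_b, start, end):
    """Exchange text-aligned embeddings for token positions [start, end).

    emb_a, emb_b: (n_tokens, d) arrays, one row per text token (assumed layout).
    Returns edited copies; the inputs are left unchanged.
    """
    if emb_a.shape != emb_b.shape:
        raise ValueError("this sketch assumes both utterances have the same shape")
    edited_a = emb_a.copy()
    edited_b = emb_b.copy()
    edited_a[start:end] = emb_b[start:end]
    edited_b[start:end] = emb_a[start:end]
    return edited_a, edited_b


# Toy stand-ins for two utterances with 5 text tokens and 3-dim embeddings.
utt_a = np.zeros((5, 3))
utt_b = np.ones((5, 3))

edited_a, edited_b = swap_span(utt_a, utt_b, start=1, end=3)
print(edited_a[:, 0])  # rows 1 and 2 now come from utterance B
```

In the demos, the edited embeddings would then be passed to the speech decoder; that step is outside this sketch.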
Stage 2: Text-Aligned Spoken Language Modeling (TASLM)
↪︎ The illustration of using TASTE for text-aligned spoken language modeling (TASLM).
Demo 2.1 - Speech Continuation with TASLM
| Prompt | Speech Continuation |
|---|---|
| The dark mystery of | The dark mystery of the lost city of Atlantis has fascinated mankind for centuries. It is said that the city was located in the Atlantic Ocean, off the coast of modern-day Portugal, and was home to a civilization that flourished for thousands of years. |
| I've managed to save some | I've managed to save some money on my phone bill, but I'm still paying a lot for my internet. I'm not sure if I should get a new phone or just get a new internet plan. |
| I reserve your services | I reserve your services for a long time. I am very happy with the result. I will definitely recommend you to my friends. |
| This powder must not | This powder must not be used in the presence of water or other liquids. It is a very fine powder and should be used with a fine powder applicator. |
| I’ll take the armchair | I’ll take the armchair in the corner and just sit there. It’ll feel better. I’ll try to not sleep. |
Demo 2.2 - Speech Generation with Text-only Prompt
| Text-only Prompt | Speech Generation |
|---|---|
| And then later on, fully acquiring that company. | It’s a very good example of how you can use the power of a platform to create a business that's going to be successful. |
| Rainfall alone does not constitute a storm. | It is the combination of Rainfall and Wind that causes the most damage. In the case of a storm, the wind is the most important factor. |
| Other sweet herbs are cultivated | for their medicinal properties. They are used to treat a variety of ailments, including coughs, colds, and other respiratory problems. |
| Tom: Hi! How are you today? END Jerry: I'm good, thanks! How about you? END Tom: Not bad. Just a bit tired from work. END Jerry: Me too. END Tom: | I've got a few minutes to talk to you. |
Demo 2.3 - Few-shot learning to facilitate question answering
prompt: "Q: which countries border the us? A: Canada. Q: what is charles darwin famous for? A: Evolution Q: who is robert downey jr wife? A: Susan Downey Q: [Question-Audio-Here] A:"
| Question Audio (Input) | Generated Audio |
|---|---|
| Who discovered the theory of gravity? | Isaac Newton |
| Who painted the famous painting the Starry Night? | Vincent van Gogh |
| Who was the first woman to fly solo across the Atlantic Ocean? | Amelia Earhart |
| What is the largest mammal? | the blue whale |
| What is the largest canyon in the world? | Grand Canyon |
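The few-shot prompt used in Demo 2.3 follows a simple repeated "Q: ... A: ..." pattern. A minimal sketch of assembling it is shown below; the helper `build_few_shot_prompt` is hypothetical, and the placeholder is kept as literal text because the page does not describe how the question audio is inserted at that position.

```python
def build_few_shot_prompt(examples, placeholder="[Question-Audio-Here]"):
    """Join (question, answer) pairs and end with an open question slot."""
    parts = [f"Q: {question} A: {answer}" for question, answer in examples]
    parts.append(f"Q: {placeholder} A:")
    return " ".join(parts)


# The three in-context examples quoted on this page.
examples = [
    ("which countries border the us?", "Canada."),
    ("what is charles darwin famous for?", "Evolution"),
    ("who is robert downey jr wife?", "Susan Downey"),
]

prompt = build_few_shot_prompt(examples)
print(prompt)
```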
Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.