TASTE Speech Homepage

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

[Paper] [Code] [Model]

Liang-Hsuan Tseng*, Yi-Chang Chen*, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
*Equal contribution

Abstract: Large Language Models (LLMs) excel in text-based natural language processing tasks but remain constrained by their reliance on textual inputs and outputs. To enable more natural human-LLM interaction, recent work has focused on deriving a spoken language model (SLM) that can not only listen but also generate speech. To achieve this, a promising direction is to conduct speech-text joint modeling. However, recent SLMs still lag behind text LLMs due to the modality mismatch. One significant mismatch is the difference in sequence length between speech and text tokens. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We achieve this through a special aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. Furthermore, by leveraging TASTE, we can adapt text-based LLMs into effective SLMs with parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA). Experimental results on benchmark tasks, including SALMON and StoryCloze, demonstrate that TASTE-based SLMs perform similarly to previous full-finetuning methods. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling.

Stage 1: TASTE Speech Tokenization

Figure 1. The overall framework of our text-aligned speech tokenization and embedding.
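
The abstract describes aligning speech tokens with the text transcription through an aggregation mechanism, so that the speech sequence becomes as short as the text sequence. This page does not spell out that mechanism, so the following is only a minimal sketch of one plausible reading: attention-style pooling in which each text token acts as a query over the speech frames. The function names, the toy vectors, and the choice of scaled dot-product attention are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_aligned_pool(text_queries, speech_frames):
    """Hypothetical helper: return one pooled speech vector per text token.

    Each text token vector is used as a query; the speech frames serve as
    both keys and values (an assumption, chosen to keep the sketch small).
    """
    dim = len(speech_frames[0])
    scale = math.sqrt(dim)
    pooled = []
    for query in text_queries:
        # Scaled dot-product score between this text token and every frame.
        scores = [sum(q * f for q, f in zip(query, frame)) / scale
                  for frame in speech_frames]
        weights = softmax(scores)
        # Weighted average of the frames: one vector for this text token.
        pooled.append([sum(w * frame[d] for w, frame in zip(weights, speech_frames))
                       for d in range(dim)])
    return pooled

# Toy data (made up): 6 speech frames and 2 text tokens, 2-dim features.
speech_frames = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                 [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]]
text_queries = [[4.0, 0.0], [0.0, 4.0]]

aligned = text_aligned_pool(text_queries, speech_frames)
print(len(speech_frames), "frames ->", len(aligned), "text-aligned vectors")
```

The output length always equals the number of text tokens, regardless of how many speech frames come in, which is the sequence-length property the abstract emphasizes.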

Demo: In-domain Speech Reconstruction

Original Speech (from LibriTTS)	Reconstruction w/ TASTE Tokenization (ours)	Reconstruction w/o TASTE Tokenization
I never expect to have any more.
He examines the horizon all round with his glass and folds his arms with the air of an injured man.

Demo: Out-of-domain Speech Reconstruction

Original Speech (from Expresso)	Reconstruction w/ TASTE Tokenization (ours)	Reconstruction w/o TASTE Tokenization
Given the circumstances, isn't this a little unorthodox?
[confused] Given the circumstances, isn't this a little unorthodox?
[enunciated] Given the circumstances, isn't this a little unorthodox?
[happy] Given the circumstances, isn't this a little unorthodox?
[laughing] Given the circumstances, isn't this a little unorthodox?
[sad] Given the circumstances, isn't this a little unorthodox?
[whisper] Given the circumstances, isn't this a little unorthodox?

Stage 2: TASTE Spoken Language Modeling

Figure 2. The overall framework of spoken language modeling with TASTE. The left part shows how we tackle the vocabulary mismatch issue before conducting text-aligned spoken language modeling; the right part illustrates the simple and effective spoken language modeling with text-aligned tokenization and embedding.
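
The abstract states that text-based LLMs are adapted into SLMs with parameter-efficient fine-tuning such as Low-Rank Adaptation (LoRA). As a reminder of what that entails, here is a small sketch of the standard LoRA forward pass, y = W x + (alpha / r) B (A x), where W stays frozen and only the low-rank matrices A and B are trained. The matrix sizes, rank, alpha value, and numbers are toy assumptions; this page does not say which layers, ranks, or scaling the authors use.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Standard LoRA forward pass: W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight (d_out x d_in);
    A (r x d_in) and B (d_out x r) are the trainable low-rank factors.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Toy example (all values are illustrative assumptions).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]            # frozen 3 x 3 weight
A = [[0.5, 0.5, 0.0]]            # rank r = 1
B_zero = [[0.0], [0.0], [0.0]]   # B starts at zero, so training starts from W
B_trained = [[1.0], [0.0], [-1.0]]
x = [2.0, 4.0, 6.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)

full_params = len(W) * len(W[0])               # parameters in W
lora_params = len(A) * (len(W[0]) + len(W))    # parameters in A and B
print(y_init, y_tuned, full_params, lora_params)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the text LLM's original behavior. The parameter saving is negligible at this toy size but grows quickly with layer width.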

Demo: Speech Continuation

Prompt	Speech Continuation
The dark mystery of	The dark mystery of the lost city of Atlantis has fascinated mankind for centuries. It is said that the city was located in the Atlantic Ocean, off the coast of modern-day Portugal, and was home to a civilization that flourished for thousands of years.
I've managed to save some	I've managed to save some money on my phone bill, but I'm still paying a lot for my internet. I'm not sure if I should get a new phone or just get a new internet plan.
I reserve your services	I reserve your services for a long time. I am very happy with the result. I will definitely recommend you to my friends.
This powder must not	This powder must not be used in the presence of water or other liquids. It is a very fine powder and should be used with a fine powder applicator.
I’ll take the armchair	I’ll take the armchair in the corner and just sit there. It’ll feel better. I’ll try to not sleep.

Demo: Speech Generation with Text-only Prompt

Text-only Prompt	Speech Generation
And then later on, fully acquiring that company.	It’s a very good example of how you can use the power of a platform to create a business that's going to be successful.
Rainfall alone does not constitute a storm.	It is the combination of Rainfall and Wind that causes the most damage. In the case of a storm, the wind is the most important factor.
Other sweet herbs are cultivated	for their medicinal properties. They are used to treat a variety of ailments, including coughs, colds, and other respiratory problems.
Tom: Hi! How are you today? END Jerry: I'm good, thanks! How about you? END Tom: Not bad. Just a bit tired from work. END Jerry: Me too. END Tom:	I've got a few minutes to talk to you.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.