Date: 2023-09-10

Large Language Models (LLMs) have proven their effectiveness in solving multi-modal tasks. However, the question arises of whether they can also serve as creators of dynamic multimedia content. This article introduces a novel system called WavJourney that utilizes LLMs for creating compositional audio guided by language instructions.

Multimedia content creation involves producing digital media in various forms, such as text, images, and audio. While past efforts have used generative models to synthesize audio content from specific conditions, such as speech or music descriptions, they often struggle to generate diverse audio content beyond those conditions, which limits their real-world applicability.

WavJourney addresses this limitation by using LLMs to generate audio scripts that follow a predefined structure covering speech, music, and sound effects. These scripts account for the spatio-temporal relationships between the acoustic elements, so a complex auditory scene is broken down into individual acoustic components and their corresponding acoustic layouts. The audio script is then compiled into a computer program that invokes task-specific audio generation models, audio I/O functions, or computational operations to generate the desired audio content.
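The script-then-compile idea can be illustrated with a small Python sketch. This is not WavJourney's actual script format or compiler: the field names (`type`, `layout`, `prompt`, `seconds`, `spans`), the generator names, and the timing rule are all assumptions made up for illustration. The sketch only shows how a structured script might be turned into an ordered list of timed calls to task-specific generators.

```python
from dataclasses import dataclass

# Hypothetical script, as an LLM might emit it. The schema is illustrative
# only and is not the format used by WavJourney.
SCRIPT = [
    {"type": "music", "layout": "background", "prompt": "calm piano", "spans": 2},
    {"type": "speech", "layout": "foreground", "prompt": "Welcome to the show.", "seconds": 3.0},
    {"type": "sound_effect", "layout": "foreground", "prompt": "audience applause", "seconds": 2.0},
]

# Placeholder names standing in for task-specific audio generation models.
GENERATORS = {
    "speech": "text_to_speech",
    "music": "text_to_music",
    "sound_effect": "text_to_audio",
}


@dataclass
class Call:
    generator: str  # which placeholder model to invoke
    prompt: str     # text condition passed to that model
    start: float    # start time on the shared timeline, in seconds
    end: float      # end time on the shared timeline, in seconds


def compile_script(script):
    """Turn a script into timed generator calls.

    Assumed timing rule: foreground items play back to back; a background
    item starts with the next foreground item and lasts for the following
    `spans` foreground items.
    """
    calls = []
    pending = []  # [background call, foreground items still to cover]
    cursor = 0.0  # current position on the foreground timeline
    for item in script:
        if item["type"] not in GENERATORS:
            raise ValueError(f"unknown audio type: {item['type']}")
        generator = GENERATORS[item["type"]]
        if item["layout"] == "background":
            call = Call(generator, item["prompt"], cursor, cursor)
            pending.append([call, item["spans"]])
        else:
            start = cursor
            cursor += item["seconds"]
            call = Call(generator, item["prompt"], start, cursor)
            # Stretch every active background item to the new cursor.
            for entry in pending:
                entry[0].end = cursor
                entry[1] -= 1
            pending = [entry for entry in pending if entry[1] > 0]
        calls.append(call)
    return calls


if __name__ == "__main__":
    for call in compile_script(SCRIPT):
        print(f"{call.start:4.1f}s to {call.end:4.1f}s  {call.generator}({call.prompt!r})")
```

In this sketch the background music is stretched to cover the speech and the applause that follow it, which is one simple way to encode the kind of temporal relationship between acoustic elements described above. A real system would replace the placeholder generator names with calls to actual models and mix the resulting waveforms.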

This design offers several notable benefits. Firstly, WavJourney taps into LLMs’ comprehension and broad knowledge to craft audio scripts featuring diverse sound elements and engaging audio narratives. Secondly, it adopts a compositional strategy that allows different audio generation models to be plugged in, which sets it apart from end-to-end methods. Thirdly, WavJourney requires neither training audio models nor fine-tuning LLMs, which saves computational resources. Lastly, it facilitates co-creation between humans and machines in real-world audio production.

WavJourney’s capabilities are exemplified by the sample results presented in the article. These case studies compare WavJourney with state-of-the-art generation approaches, showcasing its effectiveness.

The proposed system opens new possibilities in the field of multi-modal AI, offering exciting potential in personalized entertainment, improved accessibility features, and more. For further details on WavJourney and the research behind it, please refer to the links provided in the original article.

Definitions:

– Large Language Models (LLMs): AI models trained on large amounts of text to understand and generate natural language; they are increasingly applied to multi-modal tasks as well.

– Multi-modal AI: A field of AI that combines visual, auditory, and textual data.

– Compositional audio creation: Creating audio content that combines multiple elements, such as speech, music, and sound effects, arranged according to their spatio-temporal relationships.

