MUSIC AI ARCHITECTURE

Nov 14, 2023

By Ivan Linn

In recent years, intelligent music creation has become a focus of attention in the industry. Steady technological progress enables artificial intelligence to better comprehend, analyse, and generate music. Deep learning on vast quantities of music data has enabled AI to imitate and generate many categories of music. This brings significant convenience and innovation to the process of creating music, and it establishes intelligent music creation as a crucial direction for the future of the music industry. How do AI music creators operate? How is the database that a machine learning model learns from structured? What happens once the patterns have been identified, and how does the process differ during production? This article investigates these questions and the workings behind AI music databases.

Function of AI in music production

Artificial intelligence (AI) can generate music in many ways. One of the most widely adopted approaches uses machine learning algorithms, specifically deep neural networks, to analyse extensive datasets of pre-existing music and then produce novel compositions based on that analysis.

AI music generation requires a machine learning (ML) algorithm to be trained on a dataset of pre-existing music, which may consist of an extensive collection of tracks belonging to a specific genre or style. The algorithm analyses the compositions’ structures and patterns, including instrumentation, beats, rhythms, harmonies, and melodies, and then generates new compositions with a similar structure and style.[¹]
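
As a minimal sketch of this idea (using an invented toy corpus rather than a real dataset, and a simple Markov chain in place of a deep neural network), the following Python snippet “learns” which notes tend to follow which in the training tracks and then samples a new sequence with similar local patterns:

    import random
    from collections import defaultdict, Counter

    # Toy corpus: each "track" is a sequence of note names (a stand-in for a real dataset).
    corpus = [
        ["C4", "E4", "G4", "E4", "C4", "G4", "C5"],
        ["C4", "G4", "E4", "C4", "E4", "G4", "C5"],
    ]

    # "Training": count how often each note follows each other note across the corpus.
    transitions = defaultdict(Counter)
    for track in corpus:
        for current, nxt in zip(track, track[1:]):
            transitions[current][nxt] += 1

    def generate(start="C4", length=8):
        """Sample a new sequence that reuses the local patterns of the corpus."""
        sequence = [start]
        for _ in range(length - 1):
            counts = transitions.get(sequence[-1])
            if not counts:
                break
            notes, weights = zip(*counts.items())
            sequence.append(random.choices(notes, weights=weights)[0])
        return sequence

    print(generate())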

Machine learning language structures

The principal challenge in developing AI-powered music is converting musical elements into a format that machine learning models can comprehend. Many AI music creation applications currently on the market use spectrograms and waveforms as the data their algorithmic models learn from. The well-known music AI project Riffusion, for example, operates on an indexed collection of spectrograms, where each spectrogram is annotated with keywords describing the musical genre it depicts. After training on this collection, Riffusion creates melodies using the same technology as the image-generating AI Stable Diffusion: in response to a textual prompt from the user, it produces a spectrogram image with characteristics comparable to the annotated spectrograms, which is then rendered as audio.[²]

There are many more examples, such as MusicLM, launched by researchers at Google. Although the underlying technical model remains confidential for business reasons, related reports indicate that the updated MusicLM version also uses the image-generation engine Stable Diffusion to convert written prompts into spectrograms and music.[³] MusicLM draws on a collection of datasets and models for music language modelling, including MusicCaps, AudioSet, and MuLan, and uses SoundStream for audio encoding and decoding. SoundStream is a neural audio codec that converts audio input into an encoded signal, compresses it, and then reconverts it back to audio with a decoder; the effect is roughly comparable to shrinking a large WAV file down to a much smaller MP3.[⁴]
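
As a rough illustration of what “spectrograms as training data” means in practice, the snippet below converts an audio clip into the kind of mel-spectrogram image such models learn from. The choice of librosa and matplotlib, the file names, and the parameter values are assumptions made for this sketch, not details taken from Riffusion or MusicLM:

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    # Hypothetical input clip; any short WAV file works.
    y, sr = librosa.load("example_clip.wav", sr=22050)

    # Mel spectrogram: time on one axis, mel-scaled frequency on the other,
    # with each cell holding the energy at that time/frequency.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # convert power to decibels for display

    librosa.display.specshow(mel_db, sr=sr, hop_length=512, x_axis="time", y_axis="mel")
    plt.colorbar(format="%+2.0f dB")
    plt.title("Mel spectrogram (the kind of image a diffusion-style model trains on)")
    plt.savefig("example_clip_spectrogram.png")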

There is another form of language that machine learning can understand: the MIDI format, which is commonly employed for the storage, transport, and retrieval of music sequences. The use of MIDI as a dataset is not uncommon, as Arturo Rey has detailed in his published article.[⁵] MIDI files are organised files containing structured data about musical notes, rhythmic variations, and beats per minute (BPM), among other elements, and they can serve as a kind of native language for training music models.[⁶] Wavv also uses MIDI to build its music database and train its Music AI.
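
The sketch below shows what that structured data looks like when a MIDI file is read programmatically. The mido library and the file name are assumptions chosen for illustration; any MIDI reader exposes the same kinds of events:

    import mido

    midi = mido.MidiFile("example_song.mid")   # hypothetical file

    for track in midi.tracks:
        for msg in track:
            if msg.type == "set_tempo":
                # Tempo is stored as microseconds per beat; convert it to BPM.
                print("BPM:", round(mido.tempo2bpm(msg.tempo)))
            elif msg.type == "note_on" and msg.velocity > 0:
                # Each note event carries a pitch, a velocity (loudness), and a delta
                # time in ticks since the previous event: exactly the kind of symbolic
                # data a sequence model can be trained on.
                print(f"note={msg.note} velocity={msg.velocity} delta_ticks={msg.time}")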

Machine learning algorithms and models

For models that learn directly from waveforms, WaveNet is a good example. WaveNet is a deep-learning-based generative model for raw audio developed by Google DeepMind. It is a neural network that predicts successive amplitude values: the audio wave is broken into equal-sized chunks, and the output at each step depends only on past information, which is known as an autoregressive task. In the inference phase, the model starts from a randomly selected array of sample values, predicts a probability distribution over the next amplitude value, appends the most probable value to the array of samples, and repeats the process to generate new audio.
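
That inference loop can be sketched as follows. Here predict_next is only a placeholder standing in for a trained WaveNet, so it is the structure of the loop, not the output, that reflects the description above:

    import numpy as np

    QUANTISATION_LEVELS = 256            # WaveNet-style 8-bit quantised amplitudes
    rng = np.random.default_rng(0)

    def predict_next(context):
        """Placeholder for a trained network: return a probability distribution
        over the next quantised sample value given the samples generated so far."""
        logits = rng.normal(size=QUANTISATION_LEVELS)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    # Start from a short random seed, then repeatedly append the most probable next value.
    samples = list(rng.integers(0, QUANTISATION_LEVELS, size=16))
    for _ in range(1000):
        probs = predict_next(samples)
        samples.append(int(np.argmax(probs)))   # greedy choice, as described in the text

    print(len(samples), "quantised amplitude values generated")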

The WaveNet architecture is built from causal dilated 1D convolution layers. Like a Long Short-Term Memory (LSTM) model, a causal 1D convolution captures sequential information from an input sequence, but it trains faster than a GRU or an LSTM. An ordinary, non-causal 1D convolution, however, looks at future timesteps and therefore violates the autoregressive principle, while a plain causal convolution has a small receptive field. This led to the concept of dilated 1D convolution.

Dilated 1D convolution is a causal 1D convolution layer with holes or spaces between kernel values, defined by the dilation rate. Increasing the dilation rate at every hidden layer grows the receptive field exponentially, so the output is influenced by all of the inputs, giving a very large receptive field. Within each block, the element-wise addition of a skip connection and the output of the causal dilated 1D convolution forms the residual.[⁷]
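
A sketch of such a residual block is given below, written with tf.keras for illustration rather than taken from DeepMind’s implementation; the filter counts, kernel sizes, and dilation rates are arbitrary assumptions:

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters, dilation_rate):
        # Gated activation unit: a tanh branch multiplied by a sigmoid branch.
        tanh_out = layers.Conv1D(filters, kernel_size=2, dilation_rate=dilation_rate,
                                 padding="causal", activation="tanh")(x)
        sigm_out = layers.Conv1D(filters, kernel_size=2, dilation_rate=dilation_rate,
                                 padding="causal", activation="sigmoid")(x)
        gated = layers.Multiply()([tanh_out, sigm_out])

        skip = layers.Conv1D(filters, kernel_size=1)(gated)    # skip connection
        residual = layers.Add()([x, skip])                     # element-wise addition
        return residual, skip

    inputs = tf.keras.Input(shape=(None, 64))
    x = layers.Conv1D(64, kernel_size=2, padding="causal")(inputs)
    skips = []
    for dilation in (1, 2, 4, 8, 16):    # doubling the dilation grows the receptive field
        x, s = residual_block(x, filters=64, dilation_rate=dilation)
        skips.append(s)

    out = layers.Activation("relu")(layers.Add()(skips))
    out = layers.Conv1D(256, kernel_size=1, activation="softmax")(out)  # next-sample distribution
    model = tf.keras.Model(inputs, out)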

Music21 is a good illustration of how to use MIDI files as a database. It is a Python toolkit developed at MIT for reading and understanding music data. With music21, the dataset is converted into a list of sequences of notes and chords: the MIDI files are gathered as a list of paths to the songs, each file is parsed into a sequence of notes and chords, and the resulting list is converted into a machine-readable format.
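
The parsing step can be sketched with music21 as follows, in the spirit of the cited tutorials; the directory name is a hypothetical placeholder:

    import glob
    from music21 import converter, instrument, note, chord

    notes = []
    for midi_path in glob.glob("midi_songs/*.mid"):
        score = converter.parse(midi_path)
        parts = instrument.partitionByInstrument(score)
        elements = parts.parts[0].recurse() if parts else score.flat.notes

        for element in elements:
            if isinstance(element, note.Note):
                notes.append(str(element.pitch))          # e.g. "E4"
            elif isinstance(element, chord.Chord):
                # Represent a chord as its dot-joined pitch classes, e.g. "4.7.11".
                notes.append(".".join(str(n) for n in element.normalOrder))

    print(len(notes), "note/chord tokens extracted")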

The process includes importing music21, converting the MIDI files into a list of notes, and saving the list for future runs. A processing function then splits the notes into chunks of n_sequences + 1 columns of data: the first n_sequences entries of each chunk form the network input and the final entry is the target. The resulting network_input can be inspected by reshaping it into an n-by-m matrix and viewing it as a DataFrame. This preparation is what allows the machine to predict the next note or chord in a sequence. To train the model, the MIDI files are collected and loaded into memory, converted into a list of sequenced notes/chords, and transformed into an input matrix and a target vector.[⁸] In this project, the researchers develop an automatic music generation model using Long Short-Term Memory (LSTM) networks. Notes are extracted from all of the music files and fed into the model for predictive analysis, and a MIDI file is ultimately generated from the predicted notes. To predict the next note, the project applies recurrent neural networks (RNNs), an established and easily replicable method, to the music21-parsed data. This approach can produce an unlimited amount of music by iteratively feeding each generated note back into the model.[⁹]
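
A compact sketch of that pipeline is shown below, continuing from the notes list produced in the previous snippet. It uses tf.keras, and the window length, layer sizes, and training settings are assumed values rather than the cited projects’ exact configurations:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.utils import to_categorical

    n_sequences = 32                                  # length of each input window
    pitch_names = sorted(set(notes))                  # `notes` comes from the parsing step
    note_to_int = {p: i for i, p in enumerate(pitch_names)}
    n_vocab = len(pitch_names)

    network_input, network_output = [], []
    for i in range(len(notes) - n_sequences):
        window = notes[i:i + n_sequences + 1]         # n_sequences inputs + 1 target
        network_input.append([note_to_int[n] for n in window[:-1]])
        network_output.append(note_to_int[window[-1]])

    X = np.reshape(network_input, (len(network_input), n_sequences, 1)) / float(n_vocab)
    y = to_categorical(network_output, num_classes=n_vocab)

    model = Sequential([
        LSTM(256, input_shape=(n_sequences, 1)),
        Dense(n_vocab, activation="softmax"),         # probability of each next note/chord
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    model.fit(X, y, epochs=50, batch_size=64)

    # Generation: repeatedly predict the next token and feed it back into the model.
    seed = X[0]
    generated = []
    for _ in range(100):
        pred = model.predict(seed[np.newaxis, ...], verbose=0)
        index = int(np.argmax(pred))
        generated.append(pitch_names[index])
        seed = np.vstack([seed[1:], [[index / float(n_vocab)]]])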

How different databases affect the music produced

By comparing the machine learning algorithms required by these different databases, we can better understand the likely drawbacks of the music produced under each operating mechanism.

Consider first the use of spectrograms as the data behind music generation. Like the majority of AI-generated music, such software produces brief, low-fidelity audio files lasting only a few seconds, and a prevalent issue in this field is that the neural networks frequently drift significantly and abruptly away from their initial prompt.[¹⁰] As a result, the generated music is difficult to enjoy and expensive to produce. At the same time, the limits of current neural network computation make it hard to satisfy user requirements with precision. Overcoming this requires more robust computing power and more stable technology to support deeper learning algorithms and stable diffusion models capable of joining short-duration waveforms and spectrograms into longer audio clips.

Secondly, the use of spectrograms for data analysis depends on pre-existing copyrighted music. A spectrogram is a sound wave visualised and mapped onto a graph whose axes are time and frequency, with the colour of each pixel representing the amplitude of the waveform at that point.[¹¹] In music data analysis, spectrograms are frequently employed to reveal the frequency and energy distributions of the music. Copyrighted music must be collected before a spectrogram can be created, since the material for data analysis can only consist of previously published music. A generative music AI trained on spectrograms therefore needs substantial music data at a very early stage if it is to learn adequately and generate listenable works that match user instructions, so the initial expenses of database learning and model operation can escalate significantly. The music data may consist of publicly accessible music libraries or licensed audio samples. Once the music data has been acquired, a sequence of preprocessing operations must be carried out, including data smoothing, removal of invalid data, and filling of data gaps. These analytical techniques help to clarify the energy and frequency distributions, as well as the structure and properties, of the music, and this deeper understanding makes it easier to direct the practical task.

However, the use of pre-existing copyrighted compositions increases the likelihood of unauthorised duplication. By contrast, data analysis based on MIDI files concentrates on the functional model connecting the digital symbols and restructures their arrangement and combination according to probability. This significantly diminishes the resemblance to the source files and facilitates the generation of novel compositions.

Wavv understands the merits and drawbacks of training a model exclusively on spectrograms or exclusively on MIDI. It has devised a music language model known as “Musica” with the intention of combining the strengths of both in automation-driven music production. By arranging and combining various musical elements in the MIDI format during data-input learning, the Musica model avoids the drawbacks of spectrogram training, significantly boosts music output, reduces sample costs, and eliminates the possibility of infringement. Later phases of the software will optimise the music further by combining spectrograms with the MIDI representation. Artificial intelligence music will be an entirely new experience.

[1] Pal, Kaushik. n.d. “How Can an AI Model Create Music.” Techopedia. Accessed October 26, 2023. https://www.techopedia.com/how-can-an-ai-model-create-music.

[2] Roberts, Rachel. 2022. “New AI Project, Riffusion, Generates Spectrograms to Produce Music.” MusicTech. December 22, 2022. https://musictech.com/news/industry/ai-riffusion-spectogram/.

[3] Dominguez, Daniel. 2023. “Google Unveils MusicLM, an AI That Can Generate Music from Text Prompts.” InfoQ. February 1, 2023. https://www.infoq.com/news/2023/02/google-musiclm-ai-music/.

[4] Sandzer-Bell, Ezra. 2023. “Google’s AI Music Datasets: MusicCaps, AudioSet and MuLan.” AudioCipher (blog). May 17, 2023. https://www.audiocipher.com/post/musiccaps-audioset-mulan.

[5] Rey, Arturo. 2022. “How to Generate Music Using Machine Learning.” MLearning.ai. December 9, 2022. https://medium.com/mlearning-ai/how-to-generate-music-using-machine-learning-72360ba4a085.

[6] Schmidt, Mariano. n.d. “How to Generate Music with AI.” Rootstrap (blog). Accessed October 25, 2023. https://www.rootstrap.com/blog/how-to-generate-music-with-ai.

[7] Pai, Aravind. n.d. “Want to Generate Your Own Music Using Deep Learning? Here’s a Guide to Do Just That!” Analytics Vidhya. Accessed October 26, 2023. https://www.analyticsvidhya.com/blog/2020/01/how-to-perform-automatic-music-generation/.

[8] Rey, Arturo. 2022. “How to Generate Music Using Machine Learning.” MLearning.ai. December 9, 2022. https://medium.com/mlearning-ai/how-to-generate-music-using-machine-learning-72360ba4a085.

[9] “Automatic Music Generation Project Using Deep Learning.” 2021. DataFlair. December 24, 2021. https://data-flair.training/blogs/automatic-music-generation-lstm-deep-learning/.

[10] Sandzer-Bell, Ezra. 2022. “Riffusion: Generate AI Music from Spectrograms.” AudioCipher (blog). December 16, 2022. https://www.audiocipher.com/post/riffusion.

[11] Sandzer-Bell, Ezra. 2022. “Riffusion Generates AI Music from Spectrograms.” AudioCipher (blog). https://www.audiocipher.com/post/riffusion.
