Music Analysis for Automatic Music Composition: Source Separation and Music Transcription

AI needs a lot of music examples to learn to compose music. The quality and diversity of the music examples can be the key to the success of the AI. Typically, researchers begin with training an AI music composition model by learning from symbolic music data such as MIDI files. This is how we developed the AI Jazz bass player introduced in our last blog post.


However, relying on the MIDI files as the major data source has a few clear limitations. First, not all the music out there has MIDI files that are publicly and widely available. This is especially the case for certain music genres such as Jazz, which features improvisation. Second, MIDI files are notoriously noisy [1]. A great effort is needed in preprocessing and cleansing the MIDI data before they can be used to train a machine learning model. Such a process may come with assumptions, simplifications, and imprecisions that limit the performance of the resulting AI model. Third, not all MIDI files contain performance-level attributes of music such as the velocities (dynamics) and microtimings (timing offsets) of the musical notes. The music generated may sound mechanical and not expressive enough [2].

To free Yating from such limitations, we have a team of data engineers, machine learning engineers, and musicians working on tasks that can broadly be referred to as music analysis, or music information retrieval. Our goal is to enable Yating to learn to compose and perform music directly from audio recordings of music performances, an approach the Google Magenta team is also exploring [3]. This new approach, when successful, can unlock much of the potential of AI music composition models.

While the Google Magenta team dealt exclusively with piano-only music in [3], we are interested in building a data processing pipeline that allows us to learn from music played by any instrument.

In doing so, we are building an “AI Listener” that can (one day) comprehend the content of arbitrary music signals as well as well-trained human listeners do. The first two music analysis tasks we are focusing on now are “source separation” and “music transcription,” because the output of such models, after some other processing, can be used to train AI music composition models.

A core task of source separation [4] is to isolate out the sounds of specific instruments from an audio mixture. For example, a Jazz piano trio usually consists of the sounds played by a pianist, a bass player and a drummer. While human ears can focus on the sounds from one of the instruments while listening to the music, it may be hard for a machine to do so, as the sounds from these instruments (musical sources) have been mixed together in the audio signal.

The task of music transcription [5], on the other hand, can be described as converting music from the audio domain (audio signal) to the symbolic domain (e.g., a MIDI file). For single-instrument music, we may want to transcribe the pitch, onset/offset timings, and even the velocity of all the musical notes. For multi-instrument music, the task is even more challenging, as we need to decide which note is played by which instrument.
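To make the "symbolic domain" a little more concrete, here is a minimal sketch of how a list of transcribed notes could be turned into a pianoroll matrix. It is only an illustration and not our transcription model: the function name `notes_to_pianoroll`, the note tuple layout, and the frame rate of 100 frames per second are all assumptions made for this example.

```python
import numpy as np

def notes_to_pianoroll(notes, fps=100, n_pitches=128):
    """Render (pitch, onset_sec, offset_sec, velocity) tuples as a pianoroll.

    The tuple layout and the frame rate are assumptions for illustration.
    Returns an array of shape (n_pitches, n_frames) holding velocities.
    """
    if not notes:
        return np.zeros((n_pitches, 0), dtype=np.uint8)
    end_time = max(offset for _, _, offset, _ in notes)
    n_frames = int(np.ceil(end_time * fps))
    roll = np.zeros((n_pitches, n_frames), dtype=np.uint8)
    for pitch, onset, offset, velocity in notes:
        start = int(round(onset * fps))
        # Every note occupies at least one frame.
        stop = max(start + 1, int(round(offset * fps)))
        roll[pitch, start:stop] = velocity
    return roll

# Two overlapping notes: middle C (60) and the E above it (64).
example_notes = [(60, 0.0, 0.5, 80), (64, 0.25, 1.0, 100)]
example_roll = notes_to_pianoroll(example_notes)
```

Each row of the matrix is one MIDI pitch and each column is one time frame, which is the same layout used by the pianoroll pictures in this post.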

For now, we have built a source separation model to isolate the piano track from an audio mixture, and a music transcription model to convert the (separated) piano track from the audio domain to the symbolic domain. We focus on the piano now because there are more public-domain datasets for piano transcription (such as the MAESTRO dataset [3] and the MAPS database [6]). But, thanks to the source separation model, we can learn not only from piano-only music but also from multi-instrument music that contains piano.

In other words, the two models are cascaded to transcribe the piano part of an audio mixture. The transcription result can then be used to train an AI composition model. This process is illustrated below.


We present below four examples showing the performance of our models in isolating out and transcribing the piano. In each set of audio files, we show the original audio mixture first, then the separated piano track, and finally the transcribed result. The transcribed result is rendered using an electric piano sound font by a VSTi. We also show the pianoroll demonstrating the transcription result for each song. 








(Please note that, because our music transcription model does not predict the usage of sustain pedal thus far (this is a function we will add soon), we occasionally apply sustain pedal by hand to the transcribed result in the above examples.)

In general, the separation result is fairly good. The separation model removes the sounds from other instruments, and the remaining piano sounds do not suffer from distortion or other artefacts. This is quite remarkable, as this may be the first demonstration of a successful piano source separation model in the world. People working on musical source separation usually aim to isolate the singing voice, drums, and bass (see the SiSEC challenge [7] for example), not the piano. We are currently extending the model to deal with other instruments, such as the guitar.

The transcription result is not perfect, yet it already seems good enough to be used for training music composition models.

While leveraging such transcription results for training AI music composition models is still ongoing work, our in-house musicians have already found ways to play with the separated piano tracks. Check the video below to see how they used the output of our separation model to make hip-hop style music.



[1] C Raffel and DPW Ellis, “Extracting Ground-Truth Information from MIDI Files: A MIDIfesto,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016.

[2] B Wang and YH Yang, “PerformanceNet: Score-to-audio Music Generation with Multi-band Convolutional Residual Network,” in Proc. AAAI Conference on Artificial Intelligence (AAAI), 2019.

[3] C Hawthorne, A Stasyuk, A Roberts, I Simon, CZ Anna Huang, S Dieleman, E Elsen, J Engel, and D Eck, “Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset,” in Proc. International Conference on Learning Representations (ICLR), 2019.

[4] JY Liu and YH Yang, “Dilated Convolution with Dilated GRU for Music Source Separation,” in Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2019.

[5] C Hawthorne, E Elsen, J Song, A Roberts, I Simon, C Raffel, J Engel, S Oore, and D Eck, “Onsets and Frames: Dual-Objective Piano Transcription,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018.

[6] V Emiya, R Badeau, and B David, “Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech and Language Processing, 2010.




By: Jen-Yu Liu, Chung-Yang Wang, Tsu-Kuang Hsieh and Yi-Hsuan Yang (Taiwan AI Labs Yating/Music Team)

AI Jazz Bass Player: Bass Accompaniment in A Jazz Piano Trio Setting

November 2018 marks the debut of Yating, an AI Pianist that learns to compose and perform keyboard-style music by means of the AI technology we are developing here at the Taiwan AI Labs.  Instead of playing pre-existing music, Yating listens to you and composes original piano music on-the-fly in response to the affective cues found in your voice input.  This is done with a combination of our technology in automatic speech recognition, affective computing, human-computer interaction, and automatic music composition.  In November 2018 Yating gave a public concert as her debut at the Taiwan Social Innovation Lab (社會創新實驗中心) in Taipei (see the trailer here; in Mandarin).  Now, you can download the App we developed (both iOS and Android versions are available) to listen to the piano music Yating creates for you any time through your smartphone.

Yating has kept growing her skill set since then. One of the most important skills we want Yating to have is the ability to create original multi-track music, i.e., a music piece that is composed of multiple instruments. Unlike the previous case of composing keyboard-only, single-instrument music, composing multi-track music demands consideration of the relationship among the multiple tracks/instruments that are involved in the piece of music. Each track must sound “right” on its own, and collectively the tracks interact with one another closely and unfold over time interdependently.

We begin with the so-called Jazz Piano Trio setting, which is composed of a pianist playing the melody and chord, a double bass player playing the bass, and a drummer that plays the drums.  We find this setting interesting, because it involves a reasonable number of tracks with different roles, and because it’s a direct extension of the previous piano-only setting.  Our goal here is therefore to learn to compose original music with these three tracks in the style of Jazz.

In this blog post, we share how Yating learns to play the part of the bass player. We may talk about the parts of the pianist and drummer in the near future.

Specifically, we consider the case of bass accompaniment over some given chords and rhythm. This can be understood as the case where the pianist plays only the chords (but not the melody), the drummer plays the rhythm, and the bass player has to compose the bassline over the provided harmonic and rhythmic progression. In this case, we only have to compose music for one specific track, but while composing the track we need to take into account the interdependence among all three tracks.

We use an in-house collection of 1,500+ eight-bar phrases segmented from MIDI files of Jazz piano trio music to train a deep recurrent neural network to do this. For MIDI I/O, we use the pretty_midi library.

You can first listen to a few examples of the bass this model composes given human-composed chords and rhythm.


You can find below a video demonstrating a music piece our in-house musicians created in collaboration with this AI bass player.

The architecture of our bass composition neural network is shown in the figure below. It can be considered a many-to-many recurrent network. The input to the model comprises a chord progression and a drum pattern, both eight bars long, and the intended tempo of the music. The target output of the model is a bass solo that is also eight bars long, comprising the pitch and velocity (which is related to the loudness) associated with each note.

The drum pattern and chord sequence are processed by separate stacks of two recurrent layers of bidirectional long short-term memory (BiLSTM) units.

The input drum pattern is represented by a sequence of eight 16-dimensional vectors, one vector for each bar. Each element in the vector represents the activity of drums for each 16th beat of the bar, calculated mainly by counting the number of active drums for that 16th beat over the following nine drums: kick drum, snare drum, closed/open hi-hat, low/mid/high toms, crash cymbal, and ride cymbal. We weight the kick drum a bit more to differentiate it from the other drums. The output of the last BiLSTM layer of the drum branch is another sequence of eight K1-dimensional vectors, again one vector for each bar. Here, K1 denotes the number of hidden units of the last BiLSTM layer of the drum branch.
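The drum representation described above can be sketched roughly as follows. This is a simplified illustration rather than our actual preprocessing code: the drum names, the `(bar, step, drum)` input format, and in particular the kick weight of 2.0 are assumptions, since the exact weight is not stated here.

```python
import numpy as np

# The nine drums that are counted; the string names are illustrative.
DRUMS = ("kick", "snare", "closed_hihat", "open_hihat",
         "low_tom", "mid_tom", "high_tom", "crash", "ride")
KICK_WEIGHT = 2.0  # assumed value; the post only says "a bit more"

def drum_activity(hits, n_bars=8, steps_per_bar=16, kick_weight=KICK_WEIGHT):
    """Turn (bar, step, drum) hits into an (n_bars, steps_per_bar) array."""
    activity = np.zeros((n_bars, steps_per_bar), dtype=np.float32)
    seen = set()
    for bar, step, drum in hits:
        if drum not in DRUMS:
            raise ValueError(f"unknown drum: {drum}")
        if (bar, step, drum) in seen:
            continue  # count each drum at most once per 16th beat
        seen.add((bar, step, drum))
        activity[bar, step] += kick_weight if drum == "kick" else 1.0
    return activity

# Kick plus closed hi-hat on the downbeat, snare on the second beat.
pattern = drum_activity([(0, 0, "kick"), (0, 0, "closed_hihat"), (0, 4, "snare")])
```

Each row of `pattern` is the 16-dimensional vector for one bar, so the whole array is the sequence of eight vectors that is fed to the drum branch.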

The input chord progression, on the other hand, is represented by a sequence of thirty-two 24-dimensional vectors, one vector for each beat.  We use a higher temporal resolution here to reflect the fact that the chord may change every beat (while the rhythm may be more often perceived at the bar level).  Each vector is composed of two parts, a 12-dimensional multi-hot “pitch class profile” (PCP) vector representing the activity of the twelve pitch classes (C, C#, …, B, in a chromatic scale) in that beat, and another 12-dimensional one-hot vector marking the pitch class of the bass note (not the root note) in that chord.  The output of the last BiLSTM layer of the chord branch is a sequence of thirty-two K2-dimensional vectors, again one vector for each beat.  Here, K2 denotes the number of hidden units of the last BiLSTM layer of the chord branch.
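As a small illustration of the 24-dimensional chord vector, the sketch below encodes one beat of a chord from its MIDI pitches. The function name `chord_vector` and the ordering (PCP first, bass note second) are assumptions made for the example.

```python
import numpy as np

def chord_vector(chord_pitches, bass_pitch):
    """Encode one beat as a 12-dim multi-hot PCP plus a 12-dim one-hot bass."""
    pcp = np.zeros(12, dtype=np.float32)
    for pitch in chord_pitches:
        pcp[pitch % 12] = 1.0          # pitch class: C=0, C#=1, ..., B=11
    bass = np.zeros(12, dtype=np.float32)
    bass[bass_pitch % 12] = 1.0        # lowest sounding note, not the root
    return np.concatenate([pcp, bass])

# C major in first inversion (C/E): pitches E3, C4, E4, G4 with E in the bass.
c_over_e = chord_vector([52, 60, 64, 67], bass_pitch=52)
```

For C/E, the first half marks the pitch classes C, E, and G, while the second half marks E rather than the root C, which is exactly the distinction between bass note and root note made above.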

The tempo of the 8-bar segment, which is available from the MIDI file, is represented as a 35-dimensional one-hot vector after quantizing (non-uniformly) the tempo into 35 bins (the number of bins was chosen quite arbitrarily). The vector is used as the input to a fully-connected layer to get a K3-dimensional vector representing the tempo information for the whole segment.
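The tempo quantization can be sketched as below. The bin edges are purely hypothetical (5-BPM steps up to 180 BPM, 10-BPM steps above, and open-ended bins at both extremes); they were chosen only so that the result has 35 non-uniform bins and do not reflect the edges we actually use.

```python
import numpy as np

# Hypothetical edges: 34 boundaries give 35 bins.
TEMPO_EDGES = np.concatenate([np.arange(60, 180, 5), np.arange(180, 280, 10)])

def tempo_one_hot(bpm):
    """Map a tempo in BPM to a 35-dimensional one-hot vector."""
    index = int(np.digitize(bpm, TEMPO_EDGES))  # 0 .. len(TEMPO_EDGES)
    vec = np.zeros(len(TEMPO_EDGES) + 1, dtype=np.float32)
    vec[index] = 1.0
    return vec

medium_swing = tempo_one_hot(120)
```

Anything slower than the first edge falls into the first bin and anything at or above the last edge falls into the last bin, so every tempo maps to exactly one of the 35 positions.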

Compared to the drums and chords, we use an even finer temporal resolution for the bass generator (the second half of the bass composition model shown in the figure above): we aim to generate the bass for every 16th beat. The input of the bass generator is therefore a sequence of 128 (K1+K2+K3)-dimensional vectors, one vector for each 16th beat (8 bars × 16 positions per bar). Each vector is obtained by concatenating the outputs of the drum branch, chord branch, and tempo branch for the corresponding 16th beat. The bass generator is again implemented by a stack of two BiLSTM layers. From the output of the last BiLSTM layer, we aim to generate a 39-dimensional one-hot vector representing the pitch and a 14-dimensional one-hot vector representing the velocity used by the bass for that 16th beat. Here, the pitch vector is 39-dimensional because we consider 37 pitches (MIDI numbers 28 to 64, which correspond to the pitch range of the double bass) plus one rest token and one “repeat-the-note” token. The velocity vector is 14-dimensional because we quantize (non-uniformly) the velocity value (which originally ranges from 0 to 127) into 14 bins. Because the model has to predict both the pitch and the velocity of the bass, it can be said that the model is doing multi-task learning.
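The alignment of the three branches and the pitch vocabulary can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the model code: the function names are made up, the branch outputs are plain arrays standing in for BiLSTM outputs, and the positions of the rest and repeat tokens (indices 37 and 38) are an assumed ordering.

```python
import numpy as np

N_BARS, BEATS_PER_BAR, STEPS_PER_BAR = 8, 4, 16
N_STEPS = N_BARS * STEPS_PER_BAR                # 128 sixteenth beats

LOWEST_PITCH, HIGHEST_PITCH = 28, 64            # double bass range in MIDI
N_PITCHES = HIGHEST_PITCH - LOWEST_PITCH + 1    # 37
REST_TOKEN = N_PITCHES                          # assumed index 37
REPEAT_TOKEN = N_PITCHES + 1                    # assumed index 38
PITCH_DIM = N_PITCHES + 2                       # 39

def build_generator_input(drum_out, chord_out, tempo_out):
    """Align per-bar, per-beat, and per-segment features to 128 steps."""
    drum_steps = np.repeat(drum_out, STEPS_PER_BAR, axis=0)           # (8, K1) -> (128, K1)
    chord_steps = np.repeat(chord_out, STEPS_PER_BAR // BEATS_PER_BAR, axis=0)  # (32, K2) -> (128, K2)
    tempo_steps = np.tile(tempo_out, (N_STEPS, 1))                    # (K3,) -> (128, K3)
    return np.concatenate([drum_steps, chord_steps, tempo_steps], axis=1)

def encode_pitch(event):
    """One-hot encode a MIDI pitch (28..64), "rest", or "repeat"."""
    if event == "rest":
        index = REST_TOKEN
    elif event == "repeat":
        index = REPEAT_TOKEN
    elif LOWEST_PITCH <= event <= HIGHEST_PITCH:
        index = event - LOWEST_PITCH
    else:
        raise ValueError(f"pitch out of range: {event}")
    vec = np.zeros(PITCH_DIM, dtype=np.float32)
    vec[index] = 1.0
    return vec

# Toy branch outputs with K1=5, K2=7, K3=3.
x = build_generator_input(np.zeros((8, 5)), np.ones((32, 7)), np.full(3, 2.0))
```

With these toy sizes, `x` has shape (128, 15), i.e., one (K1+K2+K3)-dimensional vector per 16th beat.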

After training the model for tens of epochs, we find that it starts to generate reasonable bass lines, but the pitch contour is sometimes too fragmented. It might be possible to further improve the result by collecting more training data, but we decided to apply some simple postprocessing rules based on music knowledge. We are in general happy with the current result: the bass fits the drum pattern nicely and has a pleasant groove.

You can listen to more music we generated below.

This is just the beginning of Yating’s journey in learning to compose multi-track music.  The bass accompaniment model itself can be further improved, but for now we’d like to move on and have fun learning to compose the melody, chords, and drum in the setting of Jazz piano trio.



By: Yin-Cheng Yeh, Chung-Yang Wang, Yi-Pai Liang, and Yi-Hsuan Yang (Taiwan AI Labs Yating/Music Team)

Open source blockchain for AI Data Justice

[ Link to the Chinese-language report by 換日線 (Crossing) ]

The beginning of the “Data Justice” movement

By collaborating with online citizens and social science workers in Taiwan, Taiwan AILabs promotes “Data Justice” through the following principles:

  1. Prioritize Privacy and Integrity with goodwill for applications before data collection
    • In addition to enforcing Privacy Protection Acts, review tech giants for potential abuse of a monopoly position that forces users to give up their privacy, or for misuse of user content and data for other purposes. In particular, organizations that have become monopolies in the market should be reviewed regularly by the local administration to determine whether data is being abused when users unwillingly give up their privacy.
  2. Users’ data and activities belong to users
    • The platform should remain neutral to avoid the misuse of user data and user creations.
  3. Public data collected should be open for public research
    • The government organization holding the data is responsible for its openness, provided privacy and integrity are secured. Examples include health insurance data for public health research and smart city data for traffic research.
  4. Regulate mandatory data openness
    • For data critical to major public welfare that is controlled by a monopoly private agency, we shall give the administration the power to mandate data openness.
    • For example, Taipower’s electric power usage data in Taiwan.

“Monopoly now is worse than oil monopoly”

In 1882, the American oil magnate John D. Rockefeller founded the Standard Oil Trust, uniting 40 oil-related companies to control prices. In 1890, the U.S. Congress passed the Sherman Antitrust Act, and the government later sued Standard Oil to break up its unfair monopoly. Antitrust laws were formulated to ensure fair trade and fair competition and to prevent price manipulation. Governments of various countries followed the movement and established their own anti-monopoly laws. In 1984, AT&T, a telecom giant, was split into several companies under antitrust law. Microsoft was sued in 1998 for bundling Internet Explorer with its operating system.

In 2003, the Network Neutrality principle mandated that ISPs (Internet Service Providers) treat all data on the Internet the same. The FCC (Federal Communications Commission) successfully stopped Comcast, AT&T, Verizon, and other giants from slowing down or throttling traffic at the application or domain level. Apple FaceTime, Google YouTube, and Netflix benefited from the principle. After 10 years, oil companies and ISPs are no longer among the top 10 most valuable companies in the world. Instead, the Internet companies protected by Network Neutrality a decade ago have become the new giants. In the US, these companies dominate market share in many areas. In February 2018, Apple reached 50% of the smartphone market, Google handled more than 60% of search traffic, and Facebook controlled nearly 70% of social traffic. Together, Facebook and Google controlled 73% of the online ads market. Amazon is on a path to capture 50% of online shopping revenue. In China, the situation is even worse. AliPay is owned by Alibaba, and WeChat Pay is owned by Tencent, the company behind WeChat. Together, the two companies account for 90% of China’s payment market.

When data becomes a weapon, innovations and users become easy prey

After a series of AI breakthroughs in the 2010s, big data became as important as crude oil. In the Internet era, users grant Internet companies permission to collect their personal data, out of convenience, in exchange for connecting with credible users and content. For example, a magazine publishes articles on Facebook because Facebook allows users to subscribe to its articles. At the same time, the publisher can manage its relationship with subscribers through the messenger system. The recommendation system helps rank users and the content they publish. All the free services are sponsored by advertisements, which pay for the cost of Internet space and traffic. This model has encouraged more users to join the platform, and the users and content accumulated on the platform have in turn attracted even more users. With the 4G mobile era, mobile users are always online, which pushed data aggregation to a whole new level. After mergers and acquisitions among Internet companies, a few companies now stand out and dominate users’ daily data. New initiatives can no longer reach users easily by launching a new website or an app. Internet giants, on the other hand, can easily release a copycat of an innovation and leverage their traffic, funding, and data resources to take over the territory. Startups have little choice but to be acquired or to burn out under unfair competition. There are fewer and fewer stories about innovation from garages, and more and more stories about tech giants copying startup ideas before they take shape. A widely quoted statement in China, for example, goes: “Be acquired or die; a new startup today will never bypass the giants.” The phenomenon of monopoly also limits users’ choices. If a user does not consent to the data collection policy, there is usually no alternative platform.

Net Neutrality repealed, giants eat the world

Nasim Aghdam’s anger at YouTube casts a nightmarish shadow over how the platform deals with creators and advertisers. She opened fire at the YouTube headquarters, injuring three people, and killed herself in the end. At the beginning of the Internet era, innovative content creators could be reasonably rewarded for their own creations. However, after the platforms became monopolies, content providers found that their content was ranked by opaque algorithms that pushed it farther and farther away from their loyal subscribers. Before subscribers can reach the content, poor advertising and fake news stand in the way. If publishers want to retain their original popularity, they also need to pay for advertising. Suddenly, reputable content providers are being charged for reaching their own loyal subscribers. Even worse, their subscribers’ information and user behavior are consumed by the platform’s machine learning algorithms to serve targeted ads. At the same time, the platform does not effectively screen advertisers, so low-quality fake news and fake ads are served. This has been seen in scams and in elections. After the Facebook scandal, users discovered that their own private data was being used, through analysis tools, to manipulate their minds. However, during the #deletefacebook movement, users found no alternative platform because of the monopoly of the tech giants: their friends and fellow users are all on the platform.

In December 2017, the FCC voted to repeal the Net Neutrality principle on the grounds that the US had failed to achieve Net Neutrality. ISPs are not the ones to blame. After a decade, the Internet companies that benefited from Net Neutrality are now the monopoly giants, and Net Neutrality could not be applied to their private ranking and censorship algorithms. Facebook, for example, offers mobile access to selected sites on its platform at a different data service charge, which was widely panned for violating net neutrality principles. It is still active in 63 other countries around the world. The situation is getting worse in the era of AI. Tech giants have leveraged their data power and stepped into the automotive, medical, home, manufacturing, retail, and financial sectors. Through acquisitions, the giants are rapidly accumulating new types of vertical data and forcing traditional industries to open up their data ownership. Traditional industries are facing an even larger and smarter technology monopoly than the ISPs or oil companies of past decades.

Taiwan’s experience may mitigate global data monopoly

Starting from the root cause, from the vertical point of view, “the user who contributed the data” was motivated by “trust” in their friends or in a reputable content provider. For convenience and better service, the user consents to the collection of their private data and grants the platform the right to analyze it further. “The user who contributed the content” consents to publishing their creations on the platform because the users are already on the platform. The platform now holds the power over data and content that should originally belong to the users and publishers. For privacy, safety, and convenience purposes, the platform prevents other platforms or users from consuming the data. Over time, this results in an exclusive platform for users and content providers.

From the horizontal point of view, in order to reach users and obtain data and traffic, a startup signs an unfair agreement with the platform. In the end, good innovations are usually swallowed by the platform, because the platform also owns the data and traffic those innovations depend on. The platform therefore grows larger and larger by either acquiring or copying good innovations.

To break this vicious cycle and create a fair competition environment for AI research, Taiwan AILabs gave a talk at the Taipei Global Smart City Expo on March 27, 2018, and joined a panel at the Taiwan German 2018 Global Solution Workshop on March 28 with visiting experts and scholars on data policy making. Taiwan AILabs shared Taiwan’s unique experience with Data Justice. In the discussion, we identified opportunities that could potentially break the cycle.

The opportunities come from the following observations in Taiwan. Currently, the mainstream online social network platforms in the world are provided by private companies and optimized for advertising revenue. Taiwan has a mature network of users, open source workers, and open data campaigns. “Internet users” in Taiwan are closer to “online citizens”. The Taiwanese Internet platform PTT, for example, is not run for profit. The users elect the managers directly. Over the years, this culture has not cooled down, and PTT is still dominant. Because of its equality of voice, it is difficult to manipulate through ad spending. Fake news and fraud can be easily detected through the evidence posted online. In Taiwan, PTT is more of a major platform for public opinion than Facebook is. Through the collaboration between PTT and Taiwan AILabs, it now has its own AI news writer that reports news based on its users’ activities. The AI-based news writer can minimize editors’ bias. Another nonprofit group in Taiwan focuses on citizen science and technology. It promotes the transparency and openness of government organizations through hackathons. It has collaborated with the government, academia, non-governmental organizations, and international organizations on opening public data through open source collaboration in various fields.

Introducing the project: using blockchain for “Data Justice” in the AI era

PTT is Taiwan’s most impactful online platform running for 23 years. It has its own digital currency – P coin, instant messaging, e-mail, users, elections and administrators elected by users. However, the services hosting the online platform are still relatively centralized. 

In the past, users chose a trusted platform for trusted information. For convenience and Internet space, users and content providers consented to unfair data collection. Blockchain technology offers a new direction for avoiding centralized data storage. Blockchain can certify users and content through its chain of trust. The credit system is not built on top of a single owner, and the content storage system is likewise built on top of the chain. This avoids control by a single organization that becomes a superpower. The project is a research effort that starts by learning from PTT’s data economy, combines it with the latest blockchain encryption technology, and implements it with a decentralized approach.

The mainstream social network platforms in China and the United States created a new superpower of data through the creations of users and their friends. They will continue to collect more information by horizontally merging with industries that hold unequal data power. The launch of the project reflects thinking about data ownership in a different direction. We hope to study how to upgrade the PTT system in the era of AI, and to use this platform as the basis for enabling more industries to cooperate with data platforms. It gives control of data back to users and mitigates the ongoing data monopoly. The project will also collaborate with leading players in the automotive, medical, smart home, manufacturing, retail, and financial sectors who are interested in creating an open community platform.

Currently, the experimentation with the technology has started on an independent platform. It does not yet involve the operation or migration of the current PTT. Please follow the project’s latest news.


[2018/10/24 Updates]:

The open source project is on github now:

[2019/4/2 Updates]:

More open source projects are on github now:



Humanity with Privacy and Integrity is Taiwan AI Mindset

The 2018 Smart City Summit & Expo (SCSE), along with three sub-expos, took place at the Taipei Nangang Exhibition Center on March 27th, with 210 exhibitors from around the world exhibiting a diversity of innovative applications and solutions for building a smart city. Taiwan is known for its friendly and healthy business environment, ranked 11th by the World Bank. With 40+ years in ICT manufacturing and top-level embedded systems, companies form a vigorous ecosystem in Taiwan. With an openness toward innovation, 17 out of 22 Taiwan cities have made it to the top in the Intelligent Community Forum (ICF).

Ethan Tu, Taiwan AILabs Founder, gave a talk titled “AI in Smart Society for City Governance” and laid out Taiwan’s position on AI: smart cities are for “humanity with privacy and integrity” in addition to “safety and convenience”. He said, “AI in Taiwan is for humanity. Privacy and integrity will also be protected.” The maturity of crowd participation, transparency, and an open data mindset are the key assets that drive Taiwan’s smart cities to deliver humanity with privacy and integrity. Taiwan AILabs took a social-participation, AI-collaboratively edited open-source news site as an example. City governments are now consuming the news to detect social events happening in Taiwan in real time, relying on the robustness and reliability of the AI news at scale. AILabs collaborated with Tainan City on an AI drone project to simulate “Beyond Beauty” director Chi Po-lin, who died in a helicopter crash. AILabs also established the “Taipei Traffic Density Network (TTDN)” for Taipei City, which supports real-time traffic detection and prediction with citizens’ privacy secured: no person or car can be identified unless necessary.

The Global Solutions (GS) Taipei Workshop 2018, themed “Shaping the Future of an Inclusive Digital Society”, took place at the Ambassador Hotel in Taipei on March 28, 2018. It was co-organized by the Chung-Hua Institution for Economic Research (CIER) and the Kiel Institute for the World Economy. The “Using Big Data to Support Economic and Societal Development” panel session was hosted by Dennis Görlich, Head of the Global Challenges Center, Kiel Institute for the World Economy. Chien-Chih Liu, Founder of the Asia IoT Alliance (AIOTA); Thomas Losse-Müller, Senior Fellow at the Hertie School of Governance; and Reuben Ng, Assistant Professor at the Lee Kuan Yew School of Public Policy, National University of Singapore, all participated in the discussion. Big data has been identified as the oil for AI and economic growth. Ethan shared his vision in the panel: “We don’t have to sacrifice for safety or convenience. On the other hand, Facebook movement is a good example that the tech giants who overlook privacy and integrity will be dumped.”

Ethan explained three key principles from Taiwan’s society on big data collection. These principles come from the mature open Internet communities and movements in Taiwan. AILabs will promote them as fundamental guidelines for data collection on medical records, government records, open communities, and so on.

1. Data produced by users belongs to users. Policy makers shall ensure that no single authority, such as a social media platform, is so dominant that it can force users to give up data ownership.

2. Data collected by public agencies belongs to the public. Policy makers shall ensure that public agencies provide a roadmap for opening the data they collect to the general public for research. There is, for example, an NPO dedicated to the open data movement.

3. “Net Neutrality” applies not only to ISPs but also to social media and content hosting services. One such platform, for example, persists in equality of voice without ads. Over time, equality of voice has overcome fake news by letting the evidence stand out.

“Humanity is the direction for AILabs. Privacy and Integrity are what we insist on,” said Ethan.

Smart City workshop with Amsterdam Innovation Exchange Lab from the Netherlands

SITEC from Malaysia visiting

AI music composition passed Turing test

Music composition by computers has long been of great research interest. Many techniques, such as rules, grammars, probabilistic graphical models, neural networks, and evolutionary methods, have been applied to automatic music generation. In this article we describe our approach and the corresponding results.

AI music recognition test

Before describing our method, let’s test whether you can distinguish AI music from human music. 5 AI tunes and 5 human tunes are gathered and shuffled, and you are encouraged to select the 5 that you consider more machine-made. The true composers will be revealed later in this article.

Breaking music into components

To compose a tune using computers, we break the tune into several components and generate each component individually (but dependently). A musical work, e.g., a classical work or a modern pop song, usually consists of several voices, played by several instruments. In some works we can easily recognize one voice as the main melody and the other voices as accompaniment. In this article, we focus on generating monophonic main melodies.

A monophonic melody is a sequence of notes, each with a pitch and a duration. Collecting the pitches of all notes gives what is called the voice leading, and collecting the durations yields the rhythm. There is usually another musical element underlying the main melody, called a chord progression, which controls the primary transitions of mood. One can think of the chord progression as supporting branches and the melody as blooming flowers.
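The decomposition described above can be sketched in a few lines of Python. This is only an illustration: the `Note` class, the use of MIDI note numbers for pitch, and the example notes are assumptions, not part of our system.

```python
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int       # assumption: MIDI note number, where 60 is middle C
    duration: float  # assumption: length measured in beats


# A made-up four-note monophonic melody, used only for illustration.
melody = [Note(60, 1.0), Note(62, 0.5), Note(64, 0.5), Note(67, 2.0)]

# Collecting the pitches gives the voice leading ...
voice_leading = [note.pitch for note in melody]

# ... and collecting the durations gives the rhythm.
rhythm = [note.duration for note in melody]
```

Zipping `voice_leading` and `rhythm` back together recovers the melody, which is why the two components can be generated separately and combined afterwards.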

Techniques for musical components

Above, we introduced three musical components: chord progression, rhythm, and voice leading. Our composition method generates chord progressions and voice leading with probabilistic graphical models, and rhythms with rules.

The procedure for generating a song is as follows. The time configuration, such as how long the song is and how many chords there are in the chord progression, is decided by a human. The chord progression and the rhythm are then generated independently. Finally, the voice leading is generated to fit the chord progression and the rhythm, completing the composition.
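The order of the steps above can be sketched as follows. This is a toy illustration under loud assumptions, not our actual models: a first-order Markov chain stands in for the probabilistic graphical model, and the transition table, chord tones, and rhythm rule are all invented for the example.

```python
import random

# Hypothetical chord transition probabilities (illustration only).
CHORD_TRANSITIONS = {
    "C": {"F": 0.4, "G": 0.4, "Am": 0.2},
    "F": {"G": 0.5, "C": 0.3, "Am": 0.2},
    "G": {"C": 0.6, "Am": 0.3, "F": 0.1},
    "Am": {"F": 0.5, "G": 0.3, "C": 0.2},
}
# Hypothetical chord tones as MIDI note numbers.
CHORD_TONES = {"C": [60, 64, 67], "F": [65, 69, 72],
               "G": [67, 71, 74], "Am": [69, 72, 76]}


def generate_chords(n_chords, rng, start="C"):
    """Sample a chord progression from the Markov chain."""
    chords = [start]
    while len(chords) < n_chords:
        options = CHORD_TRANSITIONS[chords[-1]]
        chords.append(rng.choices(list(options), weights=list(options.values()))[0])
    return chords


def generate_rhythm(beats, rng):
    """Rule: fill the span with durations from a fixed menu (beats is a multiple of 0.5)."""
    durations, remaining = [], beats
    while remaining > 0:
        d = rng.choice([x for x in (0.5, 1.0, 2.0) if x <= remaining])
        durations.append(d)
        remaining -= d
    return durations


def generate_voice_leading(chords, rhythms, rng):
    """Pick a chord tone for every duration, favouring small leaps from the previous pitch."""
    melody, prev = [], None
    for chord, rhythm in zip(chords, rhythms):
        for duration in rhythm:
            tones = CHORD_TONES[chord]
            weights = [1.0 if prev is None else 1.0 / (1 + abs(t - prev)) for t in tones]
            prev = rng.choices(tones, weights=weights)[0]
            melody.append((prev, duration))
    return melody


def compose(n_chords=4, beats_per_chord=4.0, seed=0):
    rng = random.Random(seed)
    chords = generate_chords(n_chords, rng)                              # step 1
    rhythms = [generate_rhythm(beats_per_chord, rng) for _ in chords]    # step 2, independent of step 1
    return chords, generate_voice_leading(chords, rhythms, rng)          # step 3 fits both
```

Here `n_chords` and `beats_per_chord` play the role of the human-chosen time configuration. Only the last step looks at the output of the other two.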

The answer of the AI music recognition test

Now we come back to the AI music recognition test. In the track list earlier in this article, A, D, H, I, and J were composed by computers with the procedure described above. The others are excerpts from Johann Sebastian Bach’s Well-Tempered Clavier Volume 1, as listed below.

B: Prelude No.2, bar 18.

C: Prelude No.10, bar 33.

E: Prelude No.2, bar 25.

F: Prelude No.5, bar 25.

G: Prelude No.2, bar 5.

Statistics of the AI music recognition test

Did you guess all the composers right? Let’s see how other people performed. We held this test on Taiwan’s PTT Bulletin Board System and had 85 participants. The resulting statistics are gathered below.

# correct guesses (out of 5) |  0 |  1 |  2 |  3 |  4 |  5 | total
# participants               |  6 |  9 | 37 | 24 |  6 |  3 |    85

Tune id | Composer | # participants judging it right | % participants judging it right
A       | AI       | 51                              | 60%
B       | Bach     | 48                              | 56%
C       | Bach     | 24                              | 28%
D       | AI       | 43                              | 51%
E       | Bach     | 41                              | 48%
F       | Bach     | 39                              | 46%
G       | Bach     | 42                              | 49%
H       | AI       | 44                              | 52%
I       | AI       | 19                              | 22%
J       | AI       | 37                              | 44%
Average |          |                                 | 46%

Most people gave 2 to 3 correct guesses out of 5, which is about the same accuracy as random selection. Even the test holder mixes them up when not paying attention. So don’t be too blue if you were fooled.
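To make the comparison with random selection concrete, here is a small worked calculation. It assumes that every participant picked exactly 5 of the 10 tunes, so that a purely random pick follows a hypergeometric distribution. The counts are the ones reported above, and the function name is ours.

```python
from math import comb

# Number of participants for each number of correct guesses, as reported above.
counts = {0: 6, 1: 9, 2: 37, 3: 24, 4: 6, 5: 3}
n_participants = sum(counts.values())

# Average number of correct guesses actually observed.
observed_mean = sum(k * c for k, c in counts.items()) / n_participants


def random_pick_pmf(k, n_ai=5, n_human=5, picks=5):
    """Probability of hitting exactly k AI tunes when picking `picks` tunes at random."""
    return comb(n_ai, k) * comb(n_human, picks - k) / comb(n_ai + n_human, picks)


# Average number of correct guesses expected from random picking.
expected_mean = sum(k * random_pick_pmf(k) for k in range(6))
```

Under this assumption, random picking gives 2.5 correct guesses on average, while the participants averaged about 2.28, so they did not outperform chance.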


Featured Photo by Brandon Giesbrecht / CC BY 2.0

Doppelgänger app – Can someone unlock your iPhone?

Could your doppelgänger trick your iPhone’s facial recognition feature into believing that you are the same person? The answer might lie within our newly-built facial recognition software "Doppelgänger app" at

One of social media's hottest topics is "How can two celebrities, without any blood relation, look identical?" This discussion went viral on PTT, one of Taiwan's largest bulletin board systems (BBS), right after Apple released the "Face ID" feature with the iPhone X in November 2017. Many people were wondering: Can Elva Hsiao (蕭亞軒) unlock Landy Wen (溫嵐)'s iPhone?



Meet JARVIS – The Engine Behind AILabs

At Taiwan AI Labs, we are constantly teaching computers to see the world, hear the world, and feel the world so that computers can make sense of it and interact with people in exciting new ways. The process requires moving a large amount of data through various training and evaluation stages, and each stage consumes a substantial amount of computing resources. In other words, the computations we perform are both CPU/GPU bound and I/O bound.

This poses a tremendous challenge in engineering such a computing environment, as conventional systems are either CPU bound or I/O bound, but rarely both.

We recognized this need and crafted our own computing environment from day one. We call it Jarvis internally, named after the system that runs everything for Iron Man. It primarily comprises a frontdoor endpoint that accepts media and control streams from the outside world, a cluster master that manages bare metal resources within the cluster, a set of streaming and routing endpoints that are capable of muxing and demuxing media streams for each computing stage, and a storage system to store and feed data to cluster members.

The core system is written in C++ with a Python adapter layer to integrate with various machine learning libraries.



The design of Jarvis emphasizes realtime processing capability. The core of Jarvis lets data streams flow between computing processors with minimal latency, and each processing stage is engineered to achieve a required throughput per second. For a long, complex procedure, we break it down into smaller sub-tasks and use Jarvis to form a computing pipeline that achieves the target throughput. We also use muxing and demuxing techniques to process portions of the data stream in parallel, which further increases throughput without incurring too much latency. Once the computational tasks are defined, the blueprint is handed over to the cluster master, which allocates the underlying hardware resources and dispatches tasks to run on them. The allocation algorithm has to take special care with GPUs, as they are scarce resources that cannot be virtualized at the moment.
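The demux, process, and mux idea can be sketched in Python as below. This is not the Jarvis API (the core is written in C++). Every function name is hypothetical, and a thread pool stands in for the cluster members.

```python
from concurrent.futures import ThreadPoolExecutor


def demux(stream, n_lanes):
    """Split a stream into n_lanes sub-streams, tagging each item with its sequence number."""
    lanes = [[] for _ in range(n_lanes)]
    for seq, item in enumerate(stream):
        lanes[seq % n_lanes].append((seq, item))
    return lanes


def process_lane(lane, stage):
    """Run one processing stage over a single lane."""
    return [(seq, stage(item)) for seq, item in lane]


def mux(lanes):
    """Merge the lanes back into one stream in the original order."""
    merged = sorted((pair for lane in lanes for pair in lane), key=lambda p: p[0])
    return [item for _, item in merged]


def run_stage_parallel(stream, stage, n_lanes=4):
    """Process portions of the stream in parallel for one stage."""
    lanes = demux(stream, n_lanes)
    with ThreadPoolExecutor(max_workers=n_lanes) as pool:
        processed = list(pool.map(lambda lane: process_lane(lane, stage), lanes))
    return mux(processed)


def pipeline(stream, stages, n_lanes=4):
    """Chain several sub-tasks into a computing pipeline."""
    for stage in stages:
        stream = run_stage_parallel(stream, stage, n_lanes)
    return stream
```

For example, `pipeline(range(10), [lambda x: x + 1, lambda x: x * 2])` runs two sub-tasks over a ten-item stream. The sequence numbers are what allow the mux step to restore the original order after the lanes finish at different times.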

Altogether, Jarvis is a powerful yet agile platform for machine learning tasks. It handles a huge amount of work with minimal overhead. Moreover, Jarvis can be scaled horizontally with little effort, just by adding new machines to the cluster. It suits our needs well. We have re-engineered Jarvis several times in the past few months, and will continue to evolve it. Jarvis is our engine for moving fast in this fast-changing AI field.


Featured image by Nathan Rupert / CC BY

Face Recognition – The essential part of “Face ID”

When we see a person, the first thing we notice is the person’s face. The human face plays an important role in our daily life when we interact and communicate with others. Unlike other biometrics such as fingerprints, identifying a person by their face can be a non-contact process. We can easily acquire face images of a person from a distance and recognize the person without interacting with them directly. As a result, it is intuitive to use the human face as the key for building a Face Recognition system.



Over the last ten years, Face Recognition was a popular research area only within computer vision. However, with the rapid development of deep learning techniques in recent years, Face Recognition has become an AI topic, and more and more people are interested in this field. Many companies, such as Google, Microsoft, and Amazon, have developed their own Face Recognition tools and applications. In late 2017, Apple also introduced the iPhone X with Face ID, a Face Recognition system aimed at replacing the fingerprint-scanning Touch ID for unlocking the phone.


What can Face Recognition be used for?

  • automated border system for arrival and departure in the airport
  • access control system for a company
  • criminal surveillance system for government
  • transaction certification for consumer
  • unlocking system for phone or computer


How does Face Recognition work?

A Face Recognition system can be divided into three parts:

  • Face Detection : tell where the face is in the image
  • Face Representation : encode the facial features of a face image
  • Face Classification : determine which person it is

Face Detection

Face Detection locates the face in the image and finds its size. It is essentially an object-class detection problem for the class of human faces. In traditional object detection in computer vision, a set of features is first extracted from the image, and classifiers or localizers are then run in a sliding window across the whole image to find potential bounding boxes, which is time-consuming and complex. With deep learning, object detection can be accomplished by a single neural network that goes from image pixels to bounding box coordinates and class probabilities, with the benefits of end-to-end training and real-time prediction. YOLO, an open source real-time object detection system, is what we use for Face Detection in our Face Recognition pipeline.
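To give a feel for why the sliding-window approach is expensive, the sketch below simply counts how many windows a classifier would have to examine. The image size, window sizes, and stride are arbitrary assumptions chosen for illustration.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield every (x, y, w, h) square window that fits inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)


def count_windows(img_w, img_h, sizes, stride):
    """Count the candidate windows over several window sizes."""
    return sum(1 for win in sizes for _ in sliding_windows(img_w, img_h, win, stride))


# Assumed setup: a 640x480 frame, three window sizes, and a stride of 8 pixels.
n_candidates = count_windows(640, 480, sizes=(64, 128, 256), stride=8)
```

With these assumed settings, a single frame already produces more than 8,000 candidate windows, each needing its own classifier run. A single-network detector such as YOLO predicts all boxes in one forward pass instead.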


Face Representation

When the goal is to compare two faces, computing the distance between two face images pixel by pixel is impractical because of the computing time and resources required. Instead, we need to extract facial features to represent each face image.

“The distance between your eyes and ears” and “The size of your nose and mouth”...

These facial features give us an easy way to measure whether two unknown faces represent the same person. Eigenfaces and genetic algorithms were used in the old days to help discover these features. With newer deep learning techniques, a deep neural network projects each face image onto a 128-dimensional unit hypersphere and generates a feature vector for each image.

For transforming face images into face representations, OpenFace and DLIB are two commonly used models for generating feature vectors. We ran experiments on both models and found that the face representation from the DLIB model is more consistent from frame to frame for the same person, and that it outperformed the OpenFace model in our accuracy test. As a result, we chose DLIB as our face representation model.


Each vertical slice represents the face representation of a specific person from one image frame. The x-axis is the timestamp of each video frame. These results show that the DLIB model does a better job of producing a consistent image-to-representation transformation for the same person’s face from frame to frame.
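As a rough illustration of how such feature vectors can be compared, the sketch below measures the Euclidean distance between two vectors on the unit hypersphere, and the average frame-to-frame distance as a simple consistency score. The function names and the 0.6 threshold are assumptions made for this example, not values taken from our system.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_person(a, b, threshold=0.6):
    """Decide whether two vectors are close enough (threshold is an assumed value)."""
    return distance(l2_normalize(a), l2_normalize(b)) < threshold


def frame_consistency(embeddings):
    """Mean distance between consecutive frames; lower means more consistent."""
    unit = [l2_normalize(e) for e in embeddings]
    gaps = [distance(p, q) for p, q in zip(unit, unit[1:])]
    return sum(gaps) / len(gaps)
```

Real feature vectors would have 128 dimensions, but the functions work for any length. A model whose vectors for the same person drift less from frame to frame gets a lower `frame_consistency` score.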


Face Classification

Once the face representations of each person have been gathered into a face database, a classifier can be trained to identify each person. To stabilize the final classification results, we introduced a “weighted moving average” into our system: when determining the current classification result, we also take the classification results from previous frames into consideration. We found that this mechanism smooths the final classification results and performs better in the accuracy test than classifying from a single image.
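A minimal sketch of such a weighted moving average is shown below. The class name, the window size, and the linear weights (newer frames count more) are assumptions made for illustration. The weights used in our system may differ.

```python
from collections import deque


class WeightedMovingAverageClassifier:
    """Smooth per-frame classification results over the most recent frames."""

    def __init__(self, window=5):
        # Keep only the last `window` frames of class probabilities.
        self.history = deque(maxlen=window)

    def update(self, probs):
        """Add one frame's {label: probability} dict and return (label, smoothed scores)."""
        self.history.append(probs)
        # Assumed linear weights: the oldest frame gets 1, the newest gets len(history).
        weights = range(1, len(self.history) + 1)
        total = sum(weights)
        scores = {}
        for weight, frame in zip(weights, self.history):
            for label, p in frame.items():
                scores[label] = scores.get(label, 0.0) + weight * p / total
        return max(scores, key=scores.get), scores
```

If one noisy frame briefly favours the wrong person, the earlier frames still outweigh it, so the reported identity does not flicker.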


Featured image by Ars Electronica / CC BY


AI frontdesk – improve office security and working conditions

Imagine that someone in your office serves as a doorkeeper, takes care of visitors, and even cares about your working conditions, 24-7. One of our missions is to explore AI solutions that address society’s problems and improve people’s quality of life, so we have developed an AI-powered front desk to do all of the tasks mentioned above.

According to the 2016 annual report from Taiwan’s MOL (Ministry of Labor), the average Taiwanese employee works 2106 hours per year. Compared with OECD statistics, this number ranks No. 3 in the world, just below Mexico and Costa Rica.

On December 4, 2017, the first review of the Labor Standards Act revision was passed. The new version of the law will allow flexible work-time arrangements and raise the monthly maximum work hours to 300. Other major changes in the amendment include conditionally allowing employees to work 12 days in a row and reducing the minimum break between shifts from 11 hours to 8 hours. The ruling party plans to finish the second and third readings of this revision early next year (2018), and it will put 9 million Taiwanese workers in a worse working environment. To get rid of the bad reputation of “Taiwan – The Island of Overwork,” we need a system that notifies both employee and employer when someone has been working extreme overtime, and whose attendance reports cannot easily be manipulated.

In May 2017, Luo Yufen, an employee of Pxmart, one of Taiwan’s major supermarket chains, died from prolonged overwork after 7 days in a coma. However, the OSHA (Occupational Safety and Health Administration) initially found no evidence of overwork after reviewing the clocking report provided by Pxmart, which looked ‘normal’. It wasn’t until August, when further investigation of Luo’s case was requested, that her real working hours before her death proved that she had been overworked.
