In early July, Ailab.tw released Copycat (記者快抄), an AI reporter that produces news articles covering content from PTT, Taiwan’s largest online forum. It works faster and produces more content than its human colleagues, in real time.
Copycat now automatically writes about 500 news articles on popular topics every day.
The Requirements of the Media Industry Today
Attracting readers’ attention to content, and ranking higher on social networks and search engines, are becoming increasingly important for the media industry. To meet these goals, reporters need to produce as many articles as they can, update them quickly, and search for interesting material from all over the world. Copycat (記者快抄), an AI reporter, can take on this task as well by generating news based on the most discussed topics on PTT, Taiwan’s largest online forum.
In the beginning, this was a side project. However, we found that people were interested in the website, so we put some effort into improving it.
PTT, the largest non-commercial forum in Taiwan.
Generate News Automatically
PTT is the largest terminal-based bulletin board system (BBS) in Taiwan. It has more than 1.5 million registered users, with over 150,000 users online at peak times. This BBS is a non-commercial, open-source online platform with over 20,000 boards covering a multitude of topics, and it generates 500,000 comments every day.
Our system now fetches important articles and posts from PTT every 30 minutes, parses them, and posts the results on the dashboard. Likes and boos are also collected and displayed on each post, indicating the general public’s reactions.
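As a rough illustration of the collection step, the sketch below tallies likes and boos for already-parsed posts and ranks the posts by how much discussion they attracted. It is only a minimal sketch: the `(tag, text)` comment format, the post dictionaries, the function names, and the ranking rule are all assumptions made for this example, and the reaction markers "推" (like) and "噓" (boo) are assumed to be the tags that appear on PTT comments. Fetching and parsing the forum itself is not shown.

```python
from collections import Counter

# Assumed reaction markers: "推" for a like and "噓" for a boo.
# Other tags (for example a neutral "→") are simply ignored.
LIKE, BOO = "推", "噓"

# The article states that posts are fetched every 30 minutes.
FETCH_INTERVAL_SECONDS = 30 * 60


def tally_reactions(comments):
    """Count likes and boos in a list of (tag, text) comment pairs."""
    counts = Counter(tag for tag, _ in comments)
    return {"likes": counts[LIKE], "boos": counts[BOO]}


def rank_posts(posts, top_n=3):
    """Return the titles of the top_n posts, ordered by number of comments.

    Ranking by comment count is an illustrative choice, not Copycat's rule.
    """
    ranked = sorted(posts, key=lambda p: len(p["comments"]), reverse=True)
    return [p["title"] for p in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical, hand-written sample data.
    posts = [
        {"title": "Quiet post", "comments": [("→", "ok")]},
        {"title": "Hot post", "comments": [("推", "great"), ("噓", "no"), ("推", "agree")]},
    ]
    for title in rank_posts(posts):
        post = next(p for p in posts if p["title"] == title)
        print(title, tally_reactions(post["comments"]))
```

A scheduler would call the (unshown) fetch step once every `FETCH_INTERVAL_SECONDS` and pass the parsed posts to these functions.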
Three Steps to Generate News Articles
First, summarization. Based on the popular posts on the PTT forum, we describe the main idea in a few sentences. Article content is broken down into sentences, and each sentence is given a score that represents how tightly it connects with the other sentences in the article. In addition, deep learning techniques such as word embeddings are used to support the algorithm.
AI-generated news from PTT
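The sentence scoring used in the summarization step can be sketched as follows. This is not Copycat's actual algorithm: it swaps in a plain bag-of-words cosine similarity where the article mentions word embeddings, and it scores each sentence by the sum of its similarities to every other sentence. The function names are hypothetical, and the simple regex tokenizer assumes space-delimited text; Chinese posts would need a proper word segmenter.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score_sentences(sentences):
    """Score each sentence by its total similarity to all other sentences."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    return [
        sum(cosine(vi, vj) for j, vj in enumerate(vectors) if j != i)
        for i, vi in enumerate(vectors)
    ]


def summarize(sentences, k=2):
    """Keep the k best-connected sentences, in their original order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    post = [
        "The cat sat on the mat.",
        "The cat ate the fish.",
        "Stock prices fell today.",
    ]
    print(summarize(post, k=2))
```

Sentences that share vocabulary with many others score highest, so an off-topic sentence tends to drop out of the summary.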
Second, with a list of candidate sentences, we algorithmically pick and compile them into an article. We have collected some widely used news templates so that Copycat can mix the key sentences with these templates and turn out a typical daily news story.
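Mixing key sentences into a template might look roughly like the sketch below, which uses Python's standard `string.Template`. The template wording, the field names, and the `compile_article` helper are invented for illustration; the article does not show Copycat's real templates.

```python
from string import Template

# Hypothetical news template; Copycat's real templates are not published here.
NEWS_TEMPLATE = Template(
    'A post titled "$title" on the $board board of PTT is drawing attention. '
    "The author wrote: $key_sentences "
    "So far the post has received $likes likes and $boos boos."
)


def compile_article(title, board, key_sentences, likes, boos):
    """Fill the template with a post's metadata and its selected key sentences."""
    return NEWS_TEMPLATE.substitute(
        title=title,
        board=board,
        key_sentences=" ".join(key_sentences),
        likes=likes,
        boos=boos,
    )


if __name__ == "__main__":
    # Hand-written sample values.
    print(
        compile_article(
            title="Typhoon day off?",
            board="Gossiping",
            key_sentences=["The rain has not stopped.", "Schools may close tomorrow."],
            likes=120,
            boos=8,
        )
    )
```

Keeping several such templates and choosing among them per topic would be one way to avoid every article reading the same.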
The last part is to make the news article more readable. PTT users often write posts in their own styles and formats, with unexpected line breaks and spaces, for example. This makes it hard for a machine to read and understand the content. To deal with this problem, we train a model on newspaper text to act as a grammar corrector that teaches Copycat how to write like a professional reporter.
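The learned grammar corrector itself is beyond a short example, but the kind of formatting noise it has to cope with can be reduced by a simple rule-based pass first. The sketch below is an assumption about what such a pre-processing step could look like, not a description of Copycat's pipeline: it collapses repeated spaces and joins lines that were broken mid-paragraph, while keeping blank lines as paragraph breaks. The `normalize_post` name is hypothetical.

```python
import re


def normalize_post(text):
    """Clean up stray line breaks and spaces in a forum post."""
    text = text.replace("\r\n", "\n")
    # Collapse runs of spaces, tabs, and full-width spaces into one space.
    text = re.sub(r"[ \t\u3000]+", " ", text)
    # Blank lines separate paragraphs; single line breaks are treated as noise.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        # Joining with a space suits English; Chinese text would join with "".
        if lines:
            cleaned.append(" ".join(lines))
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    raw = "Hello   world,\nthis is\n  a test.\n\n\nNew   paragraph."
    print(normalize_post(raw))
```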
Feature Image Selection
Text alone is not enough. A news article should have images. Posts on the PTT forum often include image links, which can be a great resource. However, many posts do not have an associated image.
To search for an image the way a human editor does, we trained a multi-layer document retrieval RNN model as an image search engine. This engine retrieves an image by comparing the text similarity between the image’s description and the news content.
Now our AI reporter Copycat can not only copy images from the original post but also find a related image when needed.
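The idea of matching an image's description against the news text can be illustrated with a far simpler stand-in than the RNN retrieval model described above. The sketch below uses Jaccard overlap between word sets, purely as an assumption for demonstration; the function names, the `(image_url, description)` candidate format, and the example URLs are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_image(news_text, candidates):
    """Pick the image whose description best overlaps the news text.

    candidates is a list of (image_url, description) pairs. Returns the
    best-matching URL, or None when no description shares any word.
    """
    news_tokens = tokens(news_text)
    best_url, best_score = None, 0.0
    for url, description in candidates:
        score = jaccard(news_tokens, tokens(description))
        if score > best_score:
            best_url, best_score = url, score
    return best_url


if __name__ == "__main__":
    # Placeholder URLs and descriptions, written for this example.
    images = [
        ("https://example.com/rain.jpg", "Heavy rain in Taipei during typhoon"),
        ("https://example.com/game.jpg", "Baseball game highlights"),
    ]
    print(pick_image("Typhoon brings heavy rain to Taipei", images))
```

A learned model would replace `jaccard` with a similarity computed from text representations, but the selection loop around it would stay much the same.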
This figure was auto-selected by Copycat based on the text content
More to Come
The original categories on PTT and the topics extracted by Copycat are useful tags that help people find related news articles. The discussions and re-posts on the forum are potential data for showing further, differing standpoints on a given topic.
After importing our face and speech recognition module, Copycat can search video clips across the Internet for celebrities’ comments on a specific topic. This news knowledge graph can also benefit human reporters.
We believe that artificial intelligence will be a support rather than a threat, helping reporters produce higher-quality news. By automating the process of picking topics and generating articles online, reporters can move the needle on the content generation process and focus on creating insights or stories for readers.
Copycat is constantly improving and is on its way to becoming a better reporter.