A Brief Introduction to Text Summarization

Text summarization is the process of condensing a longer text into a shorter summary that preserves the key elements and meaning of the original. Doing this manually is time-consuming and strenuous. However, driven by advances in data availability and AI, automating this task is becoming increasingly popular.

We can distinguish two types of text summarization: extraction and abstraction.

Extractive Summarization

Extractive summarization is the simpler approach to automatic text summarization, as it requires little linguistic analysis. Sentences are picked directly from the document, based on a score assigned to each of them, and are then put together to form a summary. In other words, important sections of the text are identified, cropped out, and stitched together to produce a condensed version of the full document.

Extractive summarization consists of three steps:

The first step is to construct an intermediate representation of the input text. There are two ways to do this: a topic representation or an indicator representation.
In a topic representation, the text is transformed into constituent topics. The techniques used for this differ in terms of their complexity and representation model, and are divided into frequency-driven approaches, topic word approaches, latent semantic analysis, and Bayesian topic models.
In an indicator representation, each sentence is represented as a list of indicators of importance (sentence length, location in the document, presence of certain phrases, etc.). Examples of indicator representations are graph-based models and machine learning models.
In the second step, each sentence in the representation is assigned a score that indicates its importance.
For topic representations, the score is usually related to how well the sentence expresses some of the most important topics in the document or to what extent it combines information about the different topics.
For indicator representations, the score of each sentence is determined by combining the outcome of the different indicators.
In the final step, the summarizer selects the best combination of important sentences to form a summary. Usually, the highest-scoring sentences are put together until the summary reaches the desired length. Ideally, the system tries to maximize overall importance, minimize redundancy between sentences, and maximize coherence.
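
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a frequency-driven topic representation, a naive regex sentence splitter, a tiny hand-picked stop-word list, and a simple word-overlap (Jaccard) check for redundancy. The function names (split_sentences, content_words, summarize) and the parameters max_sentences and redundancy_threshold are hypothetical and not taken from any particular library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on",
              "the", "to", "for", "with", "that", "this"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of a sentence with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=2, redundancy_threshold=0.5):
    sentences = split_sentences(text)

    # Step 1: intermediate (topic) representation as document-wide word counts.
    tokens = [content_words(s) for s in sentences]
    freq = Counter(w for words in tokens for w in words)

    # Step 2: score each sentence by the average frequency of its content words.
    scores = [sum(freq[w] for w in words) / len(words) if words else 0.0
              for words in tokens]

    # Step 3: greedily take the highest-scoring sentences, skipping any whose
    # word overlap (Jaccard similarity) with a chosen sentence is too high.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == max_sentences:
            break
        words_i = set(tokens[i])
        redundant = any(
            len(words_i & set(tokens[j])) / max(1, len(words_i | set(tokens[j])))
            > redundancy_threshold
            for j in chosen
        )
        if not redundant:
            chosen.append(i)

    # Emit the chosen sentences in their original order to help coherence.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling summarize(document_text, max_sentences=3) would return up to three sentences from the input. Restoring the original sentence order is a cheap stand-in for the coherence goal mentioned above; it does not guarantee a coherent summary.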

Abstractive Summarization

Abstractive summarization requires more advanced NLP techniques, as it aims to produce a summary by interpreting the text. In abstractive summarization, AI models take the important information and generate new, rephrased sentences, parts of which may not appear in the original text. The generated summaries tend to be more fluent than extractive ones and closer to summaries written by humans.

Abstractive summarization can be regarded as a “sequence mapping task”, in which the source text is mapped to the target summary. As such, it can take advantage of advancements in deep learning and “sequence-to-sequence models”. Just like machine translation models, these sequence-to-sequence models consist of an encoder and a decoder: a neural network reads the source text, encodes it, and then generates the target text.
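
To make the encoder and decoder roles more concrete, the sketch below shows only the control flow of generation: the source is encoded once, and the decoder then emits one token at a time until it produces an end marker. The neural networks are replaced by toy stand-ins (toy_encode and toy_next_token_scores) that merely count words, so the output is not a real abstractive summary. All names here, including greedy_decode and the START and END markers, are hypothetical.

```python
from collections import Counter

START, END = "<s>", "</s>"


def toy_encode(source_tokens):
    """Stand-in encoder: represents the source as token counts.
    A real encoder would be a neural network producing vectors."""
    return Counter(source_tokens)


def toy_next_token_scores(context, generated, target_len=3):
    """Stand-in decoder step: returns a score for every candidate next token.
    It prefers frequent source tokens that were not generated yet, and it
    prefers END once target_len tokens exist or nothing is left to say."""
    emitted = set(generated)
    scores = {tok: float(n) for tok, n in context.items() if tok not in emitted}
    produced = len(generated) - 1  # do not count the START marker
    scores[END] = float("inf") if produced >= target_len or not scores else 0.0
    return scores


def greedy_decode(encode, next_token_scores, source_tokens, max_len=20):
    """Encode the source once, then generate greedily, token by token."""
    context = encode(source_tokens)
    output = [START]
    for _ in range(max_len):
        scores = next_token_scores(context, output)
        best = max(scores, key=scores.get)  # greedy choice: highest score
        if best == END:
            break
        output.append(best)
    return output[1:]  # drop the START marker
```

In a real system, encode and next_token_scores would be backed by a trained model, and the greedy choice is often replaced by a search over several candidate sequences. The loop structure, however, stays broadly the same.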

Because it involves complex language modeling, automatically generating human-like abstractive summaries remains a challenging task.

Some free online tools are available for automated extractive and abstractive summarization, such as SummarizeBot, Resoomer, SMMRY, TextSummarization, and Text Compactor.
