Acquiring high-quality parallel corpora is essential for training well-performing MT engines. There are a number of publicly available multilingual corpora, such as the proceedings of the European Parliament (Europarl) or the transcribed TED Talks distributed through the OPUS collection. Owing to their size and confirmed high quality, these have been used by researchers as sources of large-scale parallel language data.
One of the most common ways to access or generate parallel corpora is web scraping, which makes use of the immense amount of multilingual data offered on the web. In the NLP community, web scraping is often used to collect text data, both bilingual and multilingual, as well as monolingual. For example, the BERT language model, widely used for a variety of NLP tasks such as text classification and question answering, was trained in part on text data extracted from Wikipedia.
If you want to scrape the web for parallel data, where do you start and how do you approach this task? This article aims to give some direction to those interested in web scraping. A related term, web crawling, is often used synonymously with web scraping. Both involve automatic browsing of the web, but web scraping places the emphasis on data extraction.
Typically, scraping for parallel data consists of three parts: finding suitable multilingual content, scraping the data itself, and aligning and cleaning the results.
The first step is to study the web for potential parallel content. This can be a website that offers its content in multiple languages, or two websites or web pages that cover the same topic in different languages. A good example is Wikipedia, which offers the majority of its content in multiple languages.
The next step is to study the structure of a candidate website to figure out which part of it might be available in parallel. You could start by looking at recent articles or sections that have a language bar offering the same content in another language, and check whether there is a consistent pattern of some part or section being offered in multiple languages. Chances are this content has been translated and can be scraped to create a parallel corpus.
A good idea at this point is to check how close the original and translated articles are in terms of content. If you are familiar with the languages you are dealing with, you can use your own judgment; otherwise, you can run the text through an MT system of your choice to get a general understanding. Articles that are translated too loosely, or summarized rather than translated, might not be fit for your purposes.
Once a multilingual website or websites have been identified as suitable for parallel data scraping, a number of web scraping tools are available for the scraping itself. Some of the popular and efficient tools are Scrapy, implemented in Python, and Selenium WebDriver, which can be used with Java, Python and a few other programming languages.
These tools offer broad functionality for scraping all kinds of data from the web, including text. Scrapy is a versatile framework that allows users to create their own scraper, often called a spider, that will scrape the website of their choice.
Once provided with a starting URL address, or a list of URLs, a Scrapy spider will extract the information from them in a structured manner as defined by the user. For instance, a Scrapy spider can be programmed to parse the HTML structure of every fetched webpage and extract only the text in a specific part of the page, while ignoring images or menu bars.
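As a minimal sketch of what such a spider might look like (the start URL and the CSS selectors are placeholders for whatever your target site actually uses):

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    """Minimal sketch of a text-scraping spider; selectors are hypothetical."""
    name = "articles"
    start_urls = ["https://example.com/news/"]  # placeholder starting point

    def parse(self, response):
        # Keep only the article text, ignoring menus, images and other markup
        paragraphs = response.css("div.article-body p::text").getall()
        yield {
            "url": response.url,
            "text": " ".join(p.strip() for p in paragraphs),
        }
        # Follow links to further articles on the same site
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Saved as article_spider.py, this could be run with a command like scrapy runspider article_spider.py -o articles.jl to write the extracted items to a JSON-lines file.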
In addition to Scrapy's functionality, Selenium also allows data specialists to mimic and automate how a user would normally interact with a website through a web browser: for example, to click on pop-ups, fill out forms, or scroll down the page to load all of its content. By mimicking this behavior, a scraper can collect more complete data from the list of URLs it was provided with.
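A small sketch of that kind of browser automation with Selenium (Selenium 4 syntax; the element IDs and selectors are again hypothetical):

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("https://example.com/news/")  # placeholder URL

# Dismiss a cookie pop-up if the site shows one
try:
    driver.find_element(By.ID, "accept-cookies").click()
except Exception:
    pass

# Scroll down a few times so lazily loaded content gets rendered
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

# Collect the visible article text
text = driver.find_element(By.CSS_SELECTOR, "div.article-body").text
driver.quit()
```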
However, it takes additional design effort on the data specialist's part to scrape this data in such a way that it can be aligned by language and in parallel fashion, article by article or page by page. One common decision is to scrape the data per language. The URL structure of a webpage often provides information about the language it is published in: it can be a country-code domain like .de or .ru, or a language code somewhere in the URL address. The example below illustrates how the European Commission's webpages on Research and Innovation in German and English differ only by the de/en language code in the URL address:
https://ec.europa.eu/info/research-and-innovation_de
https://ec.europa.eu/info/research-and-innovation_en
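Building on such a pattern, a small sketch of how per-language start URLs could be derived (the list of language codes is an assumption):

```python
# Derive per-language start URLs from a shared URL pattern,
# as in the European Commission example above.
BASE_URL = "https://ec.europa.eu/info/research-and-innovation_{lang}"
LANGUAGES = ["en", "de"]  # assumed target languages

start_urls_per_language = {lang: [BASE_URL.format(lang=lang)] for lang in LANGUAGES}
# {'en': ['https://ec.europa.eu/info/research-and-innovation_en'],
#  'de': ['https://ec.europa.eu/info/research-and-innovation_de']}
```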
Leveraging this information, a data specialist can launch a separate scraping process for each domain or language code and save the results by language for further preprocessing. Often the HTML structure of the webpage will also contain an element with information about its language, for example a lang attribute on the root html tag, such as <html lang="de">.
Extracting this element is also something a web scraper like Scrapy can do for you, allowing you to save the scraping results in a language-dependent manner, as in the sketch below.
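For instance, inside a spider's parse() method (the content selector is again a placeholder):

```python
def parse(self, response):
    # Read the page language from the root element's lang attribute, if present
    page_language = response.xpath("//html/@lang").get(default="unknown")
    paragraphs = response.css("div.article-body p::text").getall()
    yield {
        "url": response.url,
        "language": page_language,
        "text": " ".join(p.strip() for p in paragraphs),
    }
```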
If you are dealing with a less structured website where language information is not available, language identification tools like fastText by Facebook or cld3 by Google can help you sort the scraped data by language before you proceed to alignment and post-processing.
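A minimal sketch with fastText, assuming its pretrained language identification model (lid.176.bin) has been downloaded:

```python
import fasttext

# Pretrained language identification model covering 176 languages
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> str:
    # predict() returns labels such as "__label__en" together with a confidence
    labels, scores = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

print(detect_language("Dies ist ein deutscher Satz."))  # expected: "de"
```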
Once you have scraped your website of interest and saved the data by language, the next step is to align the collected text data sentence by sentence.
If the website is well structured and the URL address or HTML structure provides clues as to which page in the chosen source language corresponds to which page in the target language (as in the EC web page example above), the collected documents can be matched based on this information, for example as sketched below.
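A possible sketch of such URL-based matching, assuming the scraped documents are stored as URL-to-text dictionaries and the language code is a trailing _xx suffix as in the EC example:

```python
import re

def url_key(url: str) -> str:
    # Strip a trailing "_en" / "_de" style language code so that parallel
    # pages end up sharing the same key
    return re.sub(r"_[a-z]{2}$", "", url)

def pair_documents(source_docs: dict, target_docs: dict) -> list:
    # source_docs / target_docs map URL -> extracted text
    src_by_key = {url_key(u): text for u, text in source_docs.items()}
    tgt_by_key = {url_key(u): text for u, text in target_docs.items()}
    return [(src_by_key[k], tgt_by_key[k]) for k in src_by_key if k in tgt_by_key]
```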
However, sometimes an intermediate step is needed, in which you align the documents themselves before you can proceed to align them at the sentence level. Traditionally, translation-based approaches have been applied to document alignment, and as the results of the WMT 2016 document alignment shared task show, techniques based on tf-idf information from the candidate texts, as well as on n-gram and neural language models, work well and demonstrate high accuracy.
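A rough sketch of the tf-idf idea, under the simplifying assumption that the target-language documents have first been machine-translated into the source language so that all texts share one vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_documents(source_docs, translated_target_docs):
    # Vectorize all documents in a shared tf-idf space
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(source_docs + translated_target_docs)
    src_vectors = vectors[: len(source_docs)]
    tgt_vectors = vectors[len(source_docs):]
    # For each source document, pick the most similar target document
    similarities = cosine_similarity(src_vectors, tgt_vectors)
    return similarities.argmax(axis=1)
```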
Once the data has been document-aligned, you can proceed to the final step of creating the actual parallel corpus, namely aligning at the sentence level. Sentence alignment tools like Bleualign and Vecalign are both good choices, boasting high accuracy. Bleualign requires the additional effort of machine-translating the source sentences into the target language; the tool then calculates pairwise BLEU scores between the target sentences and the MT-translated source sentences and decides which sentences should be aligned based on the best BLEU score.
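The underlying idea can be illustrated with a toy greedy matcher (this is not Bleualign itself, just a sketch of the BLEU-based scoring it relies on):

```python
import sacrebleu

def align_by_bleu(mt_translated_source, target_sentences):
    # For every MT-translated source sentence, pick the target sentence with
    # the highest sentence-level BLEU score
    pairs = []
    for src_index, hypothesis in enumerate(mt_translated_source):
        scores = [
            sacrebleu.sentence_bleu(hypothesis, [target]).score
            for target in target_sentences
        ]
        best_index = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((src_index, best_index, scores[best_index]))
    return pairs
```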
Vecalign works in a similar manner, except that instead of calculating BLEU scores, it calculates the cosine similarity between source and target sentence embeddings and makes an alignment decision based on the highest similarity found. In order to calculate this similarity, Vecalign requires sets of pre-generated sentence embeddings for both the source and target files.
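The similarity computation at the heart of this approach can be sketched as follows; Vecalign itself expects precomputed embeddings (typically from LASER), so the multilingual sentence-transformers model used here merely stands in for that embedding step:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# A multilingual embedding model; any model that maps different languages
# into a shared vector space would do
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

source_sentences = ["The weather is nice today."]
target_sentences = ["Das Wetter ist heute schön.", "Ich trinke gerne Kaffee."]

src_embeddings = model.encode(source_sentences)
tgt_embeddings = model.encode(target_sentences)

# Highest cosine similarity indicates the most likely translation
similarity = cosine_similarity(src_embeddings, tgt_embeddings)
print(similarity.argmax(axis=1))  # -> [0]
```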
Both Bleualign and Vecalign are great choices when it comes to sentence alignment for your corpora; however, Vecalign has the additional advantage of being able to detect one-to-many and many-to-one alignments, i.e. cases where one sentence on one side corresponds to two or more sentences on the other. This is a very useful feature that allows more data to be extracted for the final corpus. It also comes at a higher computational cost, since it requires sentence embeddings generated for all sets of sentence overlaps on both the source and target sides.
Once the sentences have been aligned, the parallel corpus is almost ready! However, there might still be some noise in it, requiring additional post-processing and cleaning. At this stage, you might want to remove repeated sentences, short segments of one or two words that are not helpful for your purposes, sentences in languages other than your intended language pair, and potentially misaligned sentences. NLP offers a number of useful tools that can help with these cleaning steps. As already mentioned in this article, fastText and cld3 models can help remove unintended languages, while sentence embeddings can be used to run a sanity check on how well-aligned the corpus is and to remove badly aligned sentence pairs.
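A simple cleaning pass along those lines might look like this (again a sketch: the thresholds and the fastText model path are assumptions):

```python
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> str:
    labels, _ = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

def clean_pairs(pairs, src_lang="en", tgt_lang="de", min_words=3):
    seen = set()
    for source, target in pairs:
        key = (source.strip(), target.strip())
        if key in seen:
            continue  # drop repeated sentence pairs
        seen.add(key)
        if len(source.split()) < min_words or len(target.split()) < min_words:
            continue  # drop very short segments
        if detect_language(source) != src_lang or detect_language(target) != tgt_lang:
            continue  # drop pairs in unintended languages
        yield source, target
```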
At TAUS we have experience both in developing scraping frameworks and in building efficient post-processing and cleaning pipelines with the help of the toolkits described in this article.
Lisa is a Data Curator on the NLP Team at TAUS. Using her background in linguistics and her experience in the translation industry, she helps TAUS optimize its data offering and create new data solutions.