Software & Data
TAB: Text anonymization is the task of editing a text document to prevent the disclosure of personal information. The field currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. To address this shortage, we have developed a new dataset called the Text Anonymization Benchmark (TAB). The dataset comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR), enriched with comprehensive annotations of the personal information appearing in each document, including semantic categories, identifier types, confidential attributes, and co-reference relations.
Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models are available under an open-source license.
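To give a flavour of how masking decisions can be scored against such annotations, here is a minimal sketch of a token-level recall metric over character spans. This is an illustration of the general idea, not the official TAB evaluation script; the function name and the span representation are my own.

```python
def masking_recall(gold_spans, system_spans):
    """Fraction of gold-annotated characters that the system actually masked.

    gold_spans and system_spans are lists of (start, end) character offsets,
    with end exclusive. Simplified illustration only: the real evaluation
    distinguishes direct and quasi identifiers and handles co-reference.
    """
    gold = set()
    for start, end in gold_spans:
        gold.update(range(start, end))
    masked = set()
    for start, end in system_spans:
        masked.update(range(start, end))
    if not gold:
        return 1.0
    return len(gold & masked) / len(gold)

# Gold asks to mask offsets 0-5 and 10-20; the system only masks 0-5 and 10-15,
# so a third of the sensitive characters remain exposed.
print(masking_recall([(0, 5), (10, 20)], [(0, 5), (10, 15)]))
```

Recall is the privacy-critical direction here: every gold span left unmasked is a potential re-identification risk, whereas over-masking only costs utility.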
skweak: How can I obtain labelled data for my NLP task? This is perhaps one of the most common challenges encountered by NLP practitioners. Our new Python toolkit skweak (pronounced /skwi:k/) provides a solution to this problem using weak supervision. The key idea is simple: instead of labelling texts by hand, we rely on labelling functions to automatically label our documents, and then use a statistical model to aggregate the results of those labelling functions, taking into account the estimated accuracy and confusions of each labelling function.
skweak can be applied to both sequence labelling and text classification, and comes with a complete API that makes it possible to create, apply and aggregate labelling functions with just a few lines of code. The toolkit is also tightly integrated with spaCy, which makes it easy to incorporate into existing NLP pipelines. Give it a try!
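The core idea can be sketched in plain Python: several noisy labelling functions vote on each document, and an aggregation step resolves their disagreements. The toy functions and the majority-vote aggregator below are illustrative only; skweak's own aggregation models are richer (e.g. a hidden Markov model that also estimates each function's accuracy).

```python
from collections import Counter

# Three toy labelling functions for classifying short messages as SPAM/HAM.
# Each returns a label or None (abstain), mirroring the weak-supervision idea.

def lf_money(text):
    return "SPAM" if "$" in text or "money" in text.lower() else None

def lf_urgent(text):
    return "SPAM" if "urgent" in text.lower() else None

def lf_greeting(text):
    return "HAM" if text.lower().startswith(("hi", "hello")) else None

def aggregate(text, lfs):
    """Majority vote over the labelling functions that fired;
    abstentions (None) are ignored."""
    votes = Counter(v for lf in lfs if (v := lf(text)) is not None)
    return votes.most_common(1)[0][0] if votes else None

lfs = [lf_money, lf_urgent, lf_greeting]
print(aggregate("URGENT: send money now", lfs))  # SPAM
print(aggregate("Hi, lunch tomorrow?", lfs))     # HAM
```

A statistical aggregator improves on this simple vote by learning, from the functions' patterns of agreement, which ones to trust on which labels.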
OpenSubtitles: Together with Jörg Tiedemann (University of Helsinki), I worked on the release and maintenance of the OpenSubtitles dataset. The OpenSubtitles 2018 dataset contains over 3.7 million subtitles from movies and TV series covering 60 languages. The full dataset amounts to over 3.4 billion sentences (22.2 billion tokens).
The subtitles are aligned with one another across languages at the document and sentence levels, yielding a total of 1782 bitexts. The dataset is currently the world's largest open collection of parallel corpora and is used by hundreds of institutions across the globe for research on dialogue modelling, machine translation and cross-lingual NLP.
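In the plain-text ("Moses") distribution format commonly used on OPUS, a bitext is simply a pair of files with one sentence per line, where line i of the source file translates line i of the target file. The helper below is a minimal sketch of reading such a pair; the file names in the commented usage line are illustrative.

```python
def read_bitext(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned
    plain-text files (one sentence per line, UTF-8)."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

# Hypothetical usage with an English-French bitext:
# pairs = list(read_bitext("OpenSubtitles.en-fr.en", "OpenSubtitles.en-fr.fr"))
```

The same dataset is also available in richer formats (e.g. XML with alignment links), which preserve document structure and timing information.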
OpenDial: For my PhD thesis, I developed OpenDial, a Java-based, domain-independent toolkit for developing spoken dialogue systems, released under an MIT open-source license. The primary focus of OpenDial is on dialogue management, but it can also be used to build full-fledged, end-to-end dialogue systems, integrating e.g. speech recognition, language understanding, generation, speech synthesis, multimodal processing and situation awareness.
The representation of dialogue domains is based on probabilistic rules encoded in a simple XML format, allowing dialogue developers to combine the benefits of logical and statistical approaches to dialogue modelling within a single, unified framework. In addition to the download packages, the OpenDial website also includes some practical examples of dialogue domains and a step-by-step documentation on how to use the toolkit.
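As a schematic sketch of what such a probabilistic rule might look like (element and variable names here are illustrative rather than copied from the official schema): if the user utterance contains "hello", the system's next action is a greeting with probability 0.9.

```xml
<!-- Illustrative sketch of a probabilistic rule, not the exact OpenDial schema -->
<rule>
  <case>
    <condition>
      <if var="u_u" relation="contains" value="hello"/>
    </condition>
    <effect prob="0.9">
      <set var="a_m" value="Greet"/>
    </effect>
  </case>
</rule>
```

The logical condition gives the rule its symbolic side, while the probability attached to the effect gives it its statistical side, which is how the framework unifies the two approaches.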