The WiCoPaCo Corpus

News

January 2011
Technical document on rewriting typology (in French)

April 30, 2010
First release of the corpus

Table of contents

Introduction
Licence
File Format
Corpus Download
Spelling Error Labels
People Involved
Publications
Acknowledgements

Wikipedia Correction and Paraphrase Corpus

Introduction

WiCoPaCo is a corpus of natural rewritings extracted from the revision history of Wikipedia. It includes spelling corrections, reformulations, and other local text transformations, which can be of great interest for many NLP applications, such as text correction, text normalization, paraphrasing or summarization.

The following table gives examples of the rewritings found in the WiCoPaCo corpus. In each example, the arrow separates the original segment from its replacement.

Correction
Normalizations	Son 2ème disque → Son deuxième disque
Non-Word Error Corrections	c’est-à-dire la dernrière → dernière année avant l’ère chrétienne
Diacritics Error Corrections	la jeune Natascha Kampusch, agée → âgée de 18 ans
Real-Word Error Corrections	dans le but de sensibilisé → sensibiliser sur les changements
Reformulation
Close Meaning	Le tritium existe dans la nature . Il est produit → se forme naturellement dans l' atmosphère
	“Gimme Gimme Gimme” et “I Have A Dream” contribueront au gigantesque succès de → viendront alimenter la gloire d' Abba
Different Meaning	alors que l' ordinateur → qu'un processeur de la famille x86 reconnaîtra ce que l' instruction machine
	Le principal du collège M. Desdouets → Un de ses professeurs dit de lui
	Des opérations de base sont disponibles dans tous les → la plupart des jeux d' instructions
Spam
Obvious Agrammatical Spamming	Süleyman Ier s' empare de l' Arabie et fait entrer dans l' → emp kikoo c moi ca va loll ' empire ottoman Médine et La Mecque
Subtle Grammatical Spamming	pour promouvoir la justice , la solidarité et la paix → l'apéro dans le monde

Licence

WiCoPaCo is released under the GNU Free Documentation License (GFDL).

Any research using this corpus for running experiments should include the following citation:

Aurélien Max and Guillaume Wisniewski, Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History, LREC 2010.

Here is the BibTeX entry:

@InProceedings{max10wicopaco,
    author = {Aurélien Max and Guillaume Wisniewski},
    title = {Mining Naturally-occurring Corrections and Paraphrases from 
             Wikipedia’s Revision History},
    booktitle = {Proceedings of the Seventh conference on International 
                 Language Resources and Evaluation (LREC'10)},
    year = {2010},
    month = {may},
    date = {19-21},
    address = {Valletta, Malta},
    editor = {Nicoletta Calzolari and Khalid Choukri and Bente Maegaard and 
              Joseph Mariani and Jan Odijk and Stelios Piperidis and 
              Mike Rosner and Daniel Tapias},
    publisher = {European Language Resources Association (ELRA)},
    isbn = {2-9517408-6-7},
    language = {english}
}

File Format

The corpus is available in a simple XML format. The following WiCoPaCo entry illustrates all the information available for a modification:

<modif id="23" wp_page_id="7" wp_before_rev_id="4649540"
    wp_after_rev_id="4671967" wp_user_id="0" wp_user_num_modif="1096911"
    wp_comment="Définition"> 
    <before>On nomme <m num_words="1">Algebre</m> linéaire...</before> 
    <after>On nomme <m num_words="1">Algèbre</m> linéaire...</after>  
</modif>
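As a minimal sketch of how such an entry might be read, the Python snippet below parses the example above with the standard library. The function name parse_modif and the keys of the returned dictionary are illustrative choices, not part of the corpus. It also assumes that each before and after element contains exactly one m element, as in the example; the XML schema file should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# The example entry shown above, copied as a string.
ENTRY = """<modif id="23" wp_page_id="7" wp_before_rev_id="4649540"
    wp_after_rev_id="4671967" wp_user_id="0" wp_user_num_modif="1096911"
    wp_comment="Définition">
    <before>On nomme <m num_words="1">Algebre</m> linéaire...</before>
    <after>On nomme <m num_words="1">Algèbre</m> linéaire...</after>
</modif>"""


def parse_modif(xml_text):
    """Return the ids and rewritten segments of one <modif> entry.

    Assumption: <before> and <after> each hold a single <m> element.
    """
    modif = ET.fromstring(xml_text)
    before = modif.find("before")
    after = modif.find("after")
    return {
        "id": int(modif.get("id")),
        "before_rev": int(modif.get("wp_before_rev_id")),
        "after_rev": int(modif.get("wp_after_rev_id")),
        "comment": modif.get("wp_comment"),
        # The text inside <m> is the segment that was rewritten.
        "before_segment": before.find("m").text,
        "after_segment": after.find("m").text,
        # itertext() joins the surrounding context with the markup removed.
        "before_text": "".join(before.itertext()),
        "after_text": "".join(after.itertext()),
    }


entry = parse_modif(ENTRY)
print(f'{entry["before_segment"]} → {entry["after_segment"]}')
```

For the entry above, this prints the rewritten segment pair in the same arrow notation as the examples in the introduction.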

Corpus Download

File	Description	Size	Date updated
[xsd]	Correction and paraphrase corpus XML schema file	tiny	Nov. 8, 2009
[xml.gz]	Correction and paraphrase corpus French language	235,962 entries, 30 MB	Nov. 8, 2009
[xml.gz]	Correction and paraphrase corpus French language	408,816 entries, 52 MB	Jan. 25, 2010
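Since the corpus files are gzip-compressed and hold several hundred thousand entries, it may be convenient to stream them rather than load them whole. The sketch below is one possible way to do so in Python. The function name iter_modifs is hypothetical. It assumes that the modif elements carry no XML namespace and that the rewritten segment is the first m element under before and after. Neither the root element name nor the file name is assumed.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_modifs(path):
    """Yield (id, before_segment, after_segment) for each <modif> entry.

    Assumptions: the file is gzip-compressed XML and <modif> elements
    are not namespaced; the root element name does not matter.
    """
    with gzip.open(path, "rb") as handle:
        # "end" events fire once an element has been read completely.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "modif":
                continue
            before = elem.find("before/m")
            after = elem.find("after/m")
            yield (
                elem.get("id"),
                before.text if before is not None else None,
                after.text if after is not None else None,
            )
            # Drop the entry's content once processed to limit memory use.
            elem.clear()
```

A typical use would be a loop such as `for modif_id, old, new in iter_modifs(path): ...`, where path points to one of the downloaded xml.gz files.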

Spelling Error Labels

We have extracted from the WiCoPaCo corpus all the modifications that correct misspelled words. This extraction is currently performed semi-automatically using two heuristics. A complete description of the extraction process can be found in [2].

Spelling errors are classified as either non-word errors or real-word errors.

The labels are stored in a simple XML file that associates a WiCoPaCo modification id (the id attribute in the wrhc* file) with a label (either real_word_error or non_word_error).

The current version of the labels refers to WiCoPaCo v2.0.

File	Description	Size	Date updated
[xml]	Spelling error labels	138,875 entries	June 7, 2010
[xml]	Spelling error labels	146,595 entries	April 22, 2010

People Involved

Delphine BERNHARD
Houda BOUAMOR
Julien BOULET
Camille DUTREY
Martine HURAULT-PLANTET
Aurélien MAX
Guillaume WISNIEWSKI

Publications

[1] Aurélien Max and Guillaume Wisniewski, Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History, LREC 2010, Valletta, Malta [pdf]
[2] Guillaume Wisniewski, Aurélien Max and François Yvon, Recueil et analyse d'un corpus écologique de corrections orthographiques extrait des révisions de Wikipédia, TALN 2010, Montréal, Canada [pdf]
[3] Camille Dutrey, Houda Bouamor, Delphine Bernhard and Aurélien Max, Typologie des modifications dans les révisions de Wikipédia, LIMSI Technical Report 2011-01 [pdf]

Acknowledgements

This work was funded by a LIMSI AI grant and the ANR Trace project.