News
January 2011
Technical document on rewriting typology (in French)

April 30, 2010
First release of the corpus


Wikipedia Correction and Paraphrase Corpus

Introduction

WiCoPaCo is a corpus of natural rewritings extracted from the revision history of Wikipedia. It includes spelling corrections, reformulations, and other local text transformations, which can be of great interest for many NLP applications, such as text correction, text normalization, paraphrasing or summarization.

The following table provides examples of the rewritings that can be found in the WiCoPaCo corpus. In each example, an arrow (→) separates the original wording from its replacement.

Correction
    Normalizations
        Son 2ème disque → Son deuxième disque
    Non-Word Error Corrections
        c’est-à-dire la dernrière → dernière année avant l’ère chrétienne
    Diacritics Error Corrections
        la jeune Natascha Kampusch, agée → âgée de 18 ans
    Real-Word Error Corrections
        dans le but de sensibilisé → sensibiliser sur les changements

Reformulation
    Close Meaning
        Le tritium existe dans la nature . Il est produit → se forme naturellement dans l' atmosphère
        “Gimme Gimme Gimme” et “I Have A Dream” contribueront au gigantesque succès de → viendront alimenter la gloire d' Abba
    Different Meaning
        alors que l' ordinateur → qu'un processeur de la famille x86 reconnaîtra ce que l' instruction machine
        Le principal du collège M. Desdouets → Un de ses professeurs dit de lui
        Des opérations de base sont disponibles dans tous les → la plupart des jeux d' instructions

Spam
    Obvious Agrammatical Spamming
        Süleyman Ier s' empare de l' Arabie et fait entrer dans l' → emp kikoo c moi ca va loll ' empire ottoman Médine et La Mecque
    Subtle Grammatical Spamming
        pour promouvoir la justice , la solidarité et la paix → l'apéro dans le monde

Licence

WiCoPaCo is released under the GNU Free Documentation License (GFDL).

Any research using this corpus for running experiments should include the following citation:

Aurélien Max and Guillaume Wisniewski, Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History, LREC 2010.

Here is the BibTeX entry:

@InProceedings{max10wicopaco,
    author = {Aurélien Max and Guillaume Wisniewski},
    title = {Mining Naturally-occurring Corrections and Paraphrases from 
             Wikipedia’s Revision History},
    booktitle = {Proceedings of the Seventh conference on International 
                 Language Resources and Evaluation (LREC'10)},
    year = {2010},
    month = {may},
    date = {19-21},
    address = {Valletta, Malta},
    editor = {Nicoletta Calzolari and Khalid Choukri and Bente Maegaard and
              Joseph Mariani and Jan Odijk and Stelios Piperidis and
              Mike Rosner and Daniel Tapias},
    publisher = {European Language Resources Association (ELRA)},
    isbn = {2-9517408-6-7},
    language = {english}
}
             

File Format

The corpus is available in a simple XML format. The following entry from the WiCoPaCo corpus illustrates all the available information:

<modif id="23" wp_page_id="7" wp_before_rev_id="4649540"
    wp_after_rev_id="4671967" wp_user_id="0" wp_user_num_modif="1096911"
    wp_comment="Définition"> 
    <before>On nomme <m num_words="1">Algebre</m> linéaire...</before> 
    <after>On nomme <m num_words="1">Algèbre</m> linéaire...</after>  
</modif>
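
As an illustration, the entry above can be read with Python's standard library. This is only a minimal sketch: the function name parse_modif and the keys of the returned dictionary are our own choices, and the code assumes that each <before> and <after> element contains at most one <m> span, as in the example.

```python
import xml.etree.ElementTree as ET

# The sample entry shown above, embedded so that the sketch is self-contained.
ENTRY = """<modif id="23" wp_page_id="7" wp_before_rev_id="4649540"
    wp_after_rev_id="4671967" wp_user_id="0" wp_user_num_modif="1096911"
    wp_comment="Définition">
    <before>On nomme <m num_words="1">Algebre</m> linéaire...</before>
    <after>On nomme <m num_words="1">Algèbre</m> linéaire...</after>
</modif>"""


def _span_text(side):
    """Return the text of the first <m> span of a <before>/<after> element."""
    span = side.find("m")
    return "".join(span.itertext()) if span is not None else ""


def parse_modif(modif):
    """Collect the id, the comment, the full texts and the rewritten spans."""
    before = modif.find("before")
    after = modif.find("after")
    return {
        "id": modif.get("id"),
        "comment": modif.get("wp_comment"),
        # itertext() joins the text around and inside the <m> markup.
        "before_text": "".join(before.itertext()),
        "after_text": "".join(after.itertext()),
        "before_span": _span_text(before),
        "after_span": _span_text(after),
    }


entry = parse_modif(ET.fromstring(ENTRY))
print(entry["before_span"], "->", entry["after_span"])
```

The <m> elements mark the rewritten segment on each side, so comparing before_span with after_span gives the local modification, while before_text and after_text give its context.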

Corpus Download

File      Description                                                          Size    Date updated
[xsd]     Correction and paraphrase corpus XML schema file                     tiny    Nov. 8, 2009
[xml.gz]  Correction and paraphrase corpus (French language; 235,962 entries)  30 MB   Nov. 8, 2009
[xml.gz]  Correction and paraphrase corpus (French language; 408,816 entries)  52 MB   Jan. 25, 2010
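
Since the corpus files are fairly large once decompressed, it may be convenient to read them entry by entry rather than loading them whole. The sketch below shows one way to do this in Python. The function name iter_modifs is hypothetical, and the code assumes that the archive contains <modif> elements as described in the File Format section, without an XML namespace. It makes no assumption about the name of the root element.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_modifs(path):
    """Yield (id, before_text, after_text) for each <modif> of a gzipped file."""
    with gzip.open(path, "rb") as handle:
        # iterparse reports an element once its closing tag has been read,
        # so each <modif> is complete when it is processed here.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "modif":
                continue
            before = elem.find("before")
            after = elem.find("after")
            yield (
                elem.get("id"),
                "".join(before.itertext()) if before is not None else "",
                "".join(after.itertext()) if after is not None else "",
            )
            # Drop the text and children of the processed entry to limit
            # memory use; the emptied element stays attached to the root.
            elem.clear()
```

The function would be called with the path of the downloaded archive, for example `for modif_id, before, after in iter_modifs("corpus.xml.gz")`, where the file name is a placeholder.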

Spelling Error Labels

We have extracted from the WiCoPaCo corpus all the modifications that correct misspelled words. For the moment, this extraction is performed semi-automatically using two heuristics. A complete description of the extraction process can be found in [2].

Spelling errors are classified as either non-word errors or real-word errors.

The labels are stored in a simple XML file that associates a WiCoPaCo modification id (the id attribute in the wrhc* file) with a label (either real_word_error or non_word_error).

The current version of the labels refers to WiCoPaCo v2.0.

File   Description            Size             Date updated
[xml]  Spelling error labels  138,875 entries  June 7, 2010
[xml]  Spelling error labels  146,595 entries  April 22, 2010

People Involved

  • Delphine BERNHARD
  • Houda BOUAMOR
  • Julien BOULET
  • Camille DUTREY
  • Martine HURAULT-PLANTET
  • Aurélien MAX
  • Guillaume WISNIEWSKI

Publications

  • [1] Aurélien Max and Guillaume Wisniewski, Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History, LREC 2010, Valletta, Malta [pdf]
  • [2] Guillaume Wisniewski, Aurélien Max and François Yvon, Recueil et analyse d'un corpus écologique de corrections orthographiques extrait des révisions de Wikipédia, TALN 2010, Montréal, Canada [pdf]
  • [3] Camille Dutrey, Houda Bouamor, Delphine Bernhard and Aurélien Max, Typologie des modifications dans les révisions de Wikipédia, LIMSI Technical Report 2011-01 [pdf]

Acknowledgements

This work was funded by a LIMSI AI grant and the ANR Trace project.