

Introduction
WiCoPaCo is a corpus of natural rewritings extracted from the revision history of Wikipedia. It includes spelling corrections, reformulations, and other local text transformations, which can be of great interest for many NLP applications, such as text correction, text normalization, paraphrasing or summarization.
The following table provides examples of the corrections that can be found in the WiCoPaCo corpus.
Category | Type | Example |
---|---|---|
Correction | Normalizations | Son 2ème disque → Son deuxième disque |
Correction | Non-Word Error Corrections | c’est-à-dire la dernrière → dernière année avant l’ère chrétienne |
Correction | Diacritics Error Corrections | la jeune Natascha Kampusch, agée → âgée de 18 ans |
Correction | Real-Word Error Corrections | dans le but de sensibilisé → sensibiliser sur les changements |
Reformulation | Close Meaning | Le tritium existe dans la nature . Il est produit → se forme naturellement dans l' atmosphère |
Reformulation | Close Meaning | “Gimme Gimme Gimme” et “I Have A Dream” contribueront au gigantesque succès de → viendront alimenter la gloire d' Abba |
Reformulation | Different Meaning | alors que l' ordinateur → qu'un processeur de la famille x86 reconnaîtra ce que l' instruction machine |
Reformulation | Different Meaning | Le principal du collège M. Desdouets → Un de ses professeurs dit de lui |
Reformulation | Different Meaning | Des opérations de base sont disponibles dans tous les → la plupart des jeux d' instructions |
Spam | Obvious Agrammatical Spamming | Süleyman Ier s' empare de l' Arabie et fait entrer dans l' → emp kikoo c moi ca va loll ' empire ottoman Médine et La Mecque |
Spam | Subtle Grammatical Spamming | pour promouvoir la justice , la solidarité et la paix → l'apéro dans le monde |
Licence
WiCoPaCo is released under the GNU Free Documentation License (GFDL).
Any research using this corpus for running experiments should include the following citation:
Aurélien Max and Guillaume Wisniewski, Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History, LREC 2010.
Here is the Bibtex entry:
@InProceedings{max10wicopaco,
  author = {Aurélien Max and Guillaume Wisniewski},
  title = {Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}
File Format
The corpus is available in a simple XML format. Here is an example entry from the WiCoPaCo corpus that illustrates all the available information:
<modif id="23" wp_page_id="7" wp_before_rev_id="4649540" wp_after_rev_id="4671967" wp_user_id="0" wp_user_num_modif="1096911" wp_comment="Définition">
  <before>On nomme <m num_words="1">Algebre</m> linéaire...</before>
  <after>On nomme <m num_words="1">Algèbre</m> linéaire...</after>
</modif>
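As an illustration of how such an entry might be read, here is a minimal Python sketch using the standard library. It relies only on the element and attribute names visible in the example above; the function name `parse_modif` and the keys of the returned dictionary are hypothetical choices, and the sketch assumes each `before` and `after` element contains one `m` span, as in the example.

```python
import xml.etree.ElementTree as ET

# The example entry shown above, copied verbatim.
SAMPLE = '''<modif id="23" wp_page_id="7" wp_before_rev_id="4649540" wp_after_rev_id="4671967" wp_user_id="0" wp_user_num_modif="1096911" wp_comment="Définition">
<before>On nomme <m num_words="1">Algebre</m> linéaire...</before>
<after>On nomme <m num_words="1">Algèbre</m> linéaire...</after>
</modif>'''


def parse_modif(elem):
    """Return the entry id, the full before/after text and the modified spans.

    Assumes one <m> span per side, as in the example entry.
    """
    before = elem.find("before")
    after = elem.find("after")
    return {
        "id": elem.get("id"),
        "comment": elem.get("wp_comment"),
        # itertext() joins the text around and inside the <m> span.
        "before_text": "".join(before.itertext()),
        "after_text": "".join(after.itertext()),
        # The <m> element marks the words that were modified.
        "before_span": before.find("m").text or "",
        "after_span": after.find("m").text or "",
    }


entry = parse_modif(ET.fromstring(SAMPLE))
print(entry["before_span"], "->", entry["after_span"])
```

Identifiers are kept as strings here, since nothing in the format description says they must be treated as numbers.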
Corpus Download
File | Description | Size | Date updated |
---|---|---|---|
[xsd] | Correction and paraphrase corpus XML schema file | tiny | Nov. 8, 2009 |
[xml.gz] | Correction and paraphrase corpus, French language | 235,962 entries, 30Mb | Nov. 8, 2009 |
[xml.gz] | Correction and paraphrase corpus, French language | 408,816 entries, 52Mb | Jan. 25, 2010 |
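Since the corpus files are gzipped and contain several hundred thousand entries, it may be convenient to stream them rather than load them whole. The following sketch shows one possible way to do this in Python. The function name `iter_modifs` is hypothetical, the name of the root element is not assumed, and the code assumes the file uses no XML namespace (otherwise the tag comparison would need adjusting).

```python
import gzip
import xml.etree.ElementTree as ET


def iter_modifs(path):
    """Yield (id, before_text, after_text) for each <modif> in a gzipped corpus file."""
    with gzip.open(path, "rb") as handle:
        # "end" events fire once an element and all its children are complete.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "modif":
                continue
            before = elem.find("before")
            after = elem.find("after")
            yield (
                elem.get("id"),
                "".join(before.itertext()) if before is not None else "",
                "".join(after.itertext()) if after is not None else "",
            )
            # Drop the entry's children once it has been consumed.
            elem.clear()
```

A caller would loop over `iter_modifs(...)` with the path of a downloaded `xml.gz` file and process one entry at a time.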
Spelling Error Labels
We have extracted from the WiCoPaCo corpus all the modifications that correct misspelled words. For the moment, this extraction is done semi-automatically, using two heuristics. A complete description of the extraction process can be found in [2].
Spelling errors are classified as either non-word errors or real-word errors.
The labels are stored in a simple XML file that associates a WiCoPaCo modification id (the id of an entry in the wrhc* file) with a label (either real_word_error or non_word_error).
The current version of the labels refers to WiCoPaCo v2.0.
File | Description | Size | Date updated |
---|---|---|---|
[xml] | Spelling error labels | 138,875 entries | June 7, 2010 |
[xml] | Spelling error labels | 146,595 entries | April 22, 2010 |
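The association between modification ids and labels could be used to select a subset of the corpus, for instance all non-word errors. The sketch below illustrates the idea in Python. The layout of the label file is not described on this page, so the element name `modif` and the attribute names `id` and `label` used here are purely hypothetical defaults, as is the sample content; they should be adjusted after inspecting the real file or its schema.

```python
import xml.etree.ElementTree as ET


def load_labels(xml_text, entry_tag="modif", id_attr="id", label_attr="label"):
    """Map modification id -> label.

    The tag and attribute names are assumptions, not the documented format.
    """
    labels = {}
    for elem in ET.fromstring(xml_text).iter(entry_tag):
        labels[elem.get(id_attr)] = elem.get(label_attr)
    return labels


def select_by_label(modif_ids, labels, wanted):
    """Keep the modification ids whose label equals `wanted`."""
    return [i for i in modif_ids if labels.get(i) == wanted]


# Hypothetical label file content, for illustration only.
LABELS_XML = '''<labels>
  <modif id="23" label="non_word_error"/>
  <modif id="42" label="real_word_error"/>
</labels>'''

labels = load_labels(LABELS_XML)
selected = select_by_label(["23", "42", "99"], labels, "non_word_error")
print(selected)
```

Ids without a label (such as "99" above) are simply skipped, which matches the fact that only spelling corrections are labelled.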
People Involved
- Delphine BERNHARD
- Houda BOUAMOR
- Julien BOULET
- Camille DUTREY
- Martine HURAULT-PLANTET
- Aurélien MAX
- Guillaume WISNIEWSKI
Publications
- [1] Aurélien Max and Guillaume Wisniewski, Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History, LREC 2010, Valletta, Malta [pdf]
- [2] Guillaume Wisniewski, Aurélien Max and François Yvon, Recueil et analyse d'un corpus écologique de corrections orthographiques extrait des révisions de Wikipédia, TALN 2010, Montréal, Canada [pdf]
- [3] Camille Dutrey, Houda Bouamor, Delphine Bernhard and Aurélien Max, Typologie des modifications dans les révisions de Wikipédia, LIMSI Technical Report 2011-01 [pdf]
Acknowledgements
This work was funded by a LIMSI AI grant and the ANR Trace project.