Multilingual sentence-level bias detection in Wikipedia

Desislava Aleksandrova; François Lareau; Pierre André Ménard

doi:10.26615/978-954-452-056-4_006

University of Montreal
Computer Research Institute of Montreal

Research output: Contribution to Book/Report types › Contribution to conference proceedings › peer-review

16 Citations (Scopus)

Abstract

We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.
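The extraction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is in the repository linked above). The tag pattern `NPOV_TAG`, the naive sentence splitter, and all function names are assumptions made for illustration; revisions are assumed to be plain wikitext strings in chronological order.

```python
import difflib
import re

# Hypothetical tag pattern; the real per-language tag lists are not given in this record.
NPOV_TAG = re.compile(r"\{\{\s*(POV|NPOV)\b[^}]*\}\}", re.IGNORECASE)


def is_tagged(text):
    """True if the revision text carries a (hypothetical) neutrality tag."""
    return bool(NPOV_TAG.search(text))


def find_snapshot_pair(revisions):
    """Return (last tagged, first untagged) revisions, or None if no such transition exists."""
    for before, after in zip(revisions, revisions[1:]):
        if is_tagged(before) and not is_tagged(after):
            return before, after
    return None


def split_sentences(text):
    """Naive splitter on sentence-final punctuation; a real pipeline would use a proper tokenizer."""
    text = NPOV_TAG.sub("", text)
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def removed_or_rewritten(before, after):
    """Sentences of `before` that were deleted or replaced in `after`."""
    old, new = split_sentences(before), split_sentences(after)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            changed.extend(old[i1:i2])
    return changed
```

Calling `find_snapshot_pair` on an article's revision list and passing the resulting pair to `removed_or_rewritten` yields candidate biased sentences. As the abstract notes, such output is noisy and needs manual evaluation.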

Original language	English
Title of host publication	International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings
Editors	Galia Angelova, Ruslan Mitkov, Ivelina Nikolova, Irina Temnikova
Publisher	Incoma Ltd
Pages	42-51
Number of pages	10
ISBN (Electronic)	9789544520557
DOIs	https://doi.org/10.26615/978-954-452-056-4_006
Publication status	Published - 2019
Externally published	Yes
Event	12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019 - Varna, Bulgaria
Duration	2 Sept 2019 → 4 Sept 2019

Publication series

Name	International Conference Recent Advances in Natural Language Processing, RANLP
Volume	2019-September
ISSN (Print)	1313-8502

Conference

Conference	12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019
Country/Territory	Bulgaria
City	Varna
Period	2/09/19 → 4/09/19

Access to Document

10.26615/978-954-452-056-4_006

Cite this

Aleksandrova, D., Lareau, F., & Ménard, P. A. (2019). Multilingual sentence-level bias detection in Wikipedia. In G. Angelova, R. Mitkov, I. Nikolova, & I. Temnikova (Eds.), International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings (pp. 42-51). (International Conference Recent Advances in Natural Language Processing, RANLP; Vol. 2019-September). Incoma Ltd. https://doi.org/10.26615/978-954-452-056-4_006

Aleksandrova, Desislava ; Lareau, François ; Ménard, Pierre André. / Multilingual sentence-level bias detection in Wikipedia. International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings. editor / Galia Angelova ; Ruslan Mitkov ; Ivelina Nikolova ; Irina Temnikova. Incoma Ltd, 2019. pp. 42-51 (International Conference Recent Advances in Natural Language Processing, RANLP).

@inproceedings{6b8ad883a5fc468fa7ac5d10bf0136e3,
title = "Multilingual sentence-level bias detection in Wikipedia",
abstract = "We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.",
author = "Desislava Aleksandrova and Fran{\c c}ois Lareau and M{\'e}nard, {Pierre Andr{\'e}}",
note = "Publisher Copyright: {\textcopyright} 2019 Association for Computational Linguistics (ACL). All rights reserved.; 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019 ; Conference date: 02-09-2019 Through 04-09-2019",
year = "2019",
doi = "10.26615/978-954-452-056-4_006",
language = "English",
series = "International Conference Recent Advances in Natural Language Processing, RANLP",
publisher = "Incoma Ltd",
pages = "42--51",
editor = "Galia Angelova and Ruslan Mitkov and Ivelina Nikolova and Irina Temnikova",
booktitle = "International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings",
}

Aleksandrova, D, Lareau, F & Ménard, PA 2019, Multilingual sentence-level bias detection in Wikipedia. in G Angelova, R Mitkov, I Nikolova & I Temnikova (eds), International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings. International Conference Recent Advances in Natural Language Processing, RANLP, vol. 2019-September, Incoma Ltd, pp. 42-51, 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, 2/09/19. https://doi.org/10.26615/978-954-452-056-4_006

Multilingual sentence-level bias detection in Wikipedia. / Aleksandrova, Desislava; Lareau, François; Ménard, Pierre André.
International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings. ed. / Galia Angelova; Ruslan Mitkov; Ivelina Nikolova; Irina Temnikova. Incoma Ltd, 2019. p. 42-51 (International Conference Recent Advances in Natural Language Processing, RANLP; Vol. 2019-September).

TY - GEN

T1 - Multilingual sentence-level bias detection in Wikipedia

AU - Aleksandrova, Desislava

AU - Lareau, François

AU - Ménard, Pierre André

PY - 2019

Y1 - 2019

N2 - We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.

AB - We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.

UR - https://www.scopus.com/pages/publications/85076482100

U2 - 10.26615/978-954-452-056-4_006

DO - 10.26615/978-954-452-056-4_006

M3 - Contribution to conference proceedings

AN - SCOPUS:85076482100

T3 - International Conference Recent Advances in Natural Language Processing, RANLP

SP - 42

EP - 51

BT - International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings

A2 - Angelova, Galia

A2 - Mitkov, Ruslan

A2 - Nikolova, Ivelina

A2 - Temnikova, Irina

PB - Incoma Ltd

T2 - 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019

Y2 - 2 September 2019 through 4 September 2019

ER -

Aleksandrova D, Lareau F, Ménard PA. Multilingual sentence-level bias detection in Wikipedia. In Angelova G, Mitkov R, Nikolova I, Temnikova I, editors, International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings. Incoma Ltd. 2019. p. 42-51. (International Conference Recent Advances in Natural Language Processing, RANLP). doi: 10.26615/978-954-452-056-4_006
