TY - GEN
T1 - Multilingual sentence-level bias detection in Wikipedia
AU - Aleksandrova, Desislava
AU - Lareau, François
AU - Ménard, Pierre André
N1 - Publisher Copyright: © 2019 Association for Computational Linguistics (ACL). All rights reserved.
PY - 2019
Y1 - 2019
AB - We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.
UR - https://www.scopus.com/pages/publications/85076482100
U2 - 10.26615/978-954-452-056-4_006
DO - 10.26615/978-954-452-056-4_006
M3 - Contribution to conference proceedings
AN - SCOPUS:85076482100
T3 - International Conference Recent Advances in Natural Language Processing, RANLP
SP - 42
EP - 51
BT - International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings
A2 - Angelova, Galia
A2 - Mitkov, Ruslan
A2 - Nikolova, Ivelina
A2 - Temnikova, Irina
PB - Incoma Ltd
T2 - 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019
Y2 - 2 September 2019 through 4 September 2019
ER -