Metric of article popularity on English Wikipedia
Popular articles vs non-popular articles: important?
How do we beat top-ranking Wikipedia content?
We suppose that Wikipedia faces the same problem as pages on Google: highly popular articles are popular partly because they have a lot of backlinks or pages that redirect to them. Imagine you're a journalist or an active blogger writing an article on Afghanistan news. You google the topic and open the Afghanistan article on Wikipedia to get resources quickly. Entering Wikipedia articles through search engines such as Google can be tricky because, even if it yields a decent amount of content, you will never find an unpopular Wikipedia page. Once you find enough content on Wikipedia, you'll return to your adored search engine. So why would you keep searching for that unpopular, seemingly nonexistent page?
Our goal is to showcase less popular articles so that you can learn about facts and topics you may search for or need on English Wikipedia. This can help promote less visible articles so that they can be improved, edited and viewed, and thus contribute to our knowledge of ongoing events!
Also, for a journalist, it may be really helpful: instead of covering heavily advertised topics, they may get the opportunity to be the first to write their own story about an uncovered subject. Their research can improve the article's visibility and give more importance to a hidden world.
We hypothesize that unknown articles are not necessarily unimportant!
In order to estimate the popularity of Wikipedia articles, we came up with some metrics. They don't suddenly pop into your head and bang! Instead, good ol' intuition and some good references (1) and (2) guided the selection of our candidates.
Four characteristics of an article could be used to identify its popularity: the number of references, the article length, the number of views and the number of links.
As a pilot phase, we focus solely on articles that share a common subject: civilian attack, civil conflict or military conflict.
If you want to know more, sit back & enjoy the ride!
Translating that intuition into a number isn't easy. One thing we might try is looking at the number of references in an article. References are used by the writer of the article to justify and cite its content. References are thus synonymous with quality in the context of a collaborative and openly editable encyclopedia such as Wikipedia.
We think that the references found at the bottom of an article's page could partially encapsulate the popularity of that article. So we analyze the references for each article.
Plotting the distribution of the number of references, also in log-log scale, we notice that the distribution seems to follow a power law, which is somewhat intuitive: a lot of pages do not have many references, and only a few have a great many.
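One rough way to check the power-law impression is to fit a straight line to the frequencies in log-log space. The sketch below is only an illustration: the helper name `loglog_slope` and the toy data are assumptions, not the code or data behind the plot. A roughly constant negative slope is suggestive of a power law, but it is not a formal test.

```python
import math
from collections import Counter

def loglog_slope(ref_counts):
    """Least-squares slope of log10(frequency) against log10(reference count)."""
    # Count how many articles have each reference count; zeros are dropped
    # because log(0) is undefined.
    freq = Counter(c for c in ref_counts if c > 0)
    xs = [math.log10(k) for k in freq]
    ys = [math.log10(freq[k]) for k in freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up toy data: roughly 1000 / k**2 articles have k references.
toy_counts = [k for k in range(1, 21) for _ in range(round(1000 / k**2))]
slope = loglog_slope(toy_counts)
print(f"estimated log-log slope: {slope:.2f}")
```

On the toy data the slope comes out close to -2 by construction; on real reference counts the value would have to be read from the data.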
This leads to another question!
What kind of references are hidden among this data?
Let's have a look at the most commonly occurring references in the data. Also what is the trend?
For recent conflict-related Wikipedia articles (after 1995), most of the references come from American, English and Arab media. American and English sources are no surprise, as we are working with the English version of Wikipedia. In the top 5, Reuters comes first! Interestingly, almasdarnews is a close second. This online media source mostly covers conflicts in the Middle East: Syria, Yemen, and Iraq.
That's one down, three more to go!
Article length speaks for itself. The longer an article, the more likely it was edited and developed to fully cover its topic. The number of views per page is also an important factor in the popularity of an article. Finally, the number of external links corresponds to the incoming links from other articles within Wikipedia.
Why not follow the trend?
Exploring our other indicators of popularity ('article length', 'number of views', 'number of external links'), let's see if these choices actually contribute to the popularity of an article! To do so, we plot each indicator against the others and check whether the variables move together monotonically or are independent, i.e. whether the number of external links increases or stays flat when article length increases. Imagine if 'article length' increased while 'number of views' decreased! Summing them into a popularity score would then require a negative coefficient; otherwise the effect of one would partially cancel out the effect of the other.
| | article_lenght | views | refs_count | link_count |
|---|---|---|---|---|
| article_lenght | 1.000000 | 0.619499 | 0.715575 | 0.892395 |
| views | 0.619499 | 1.000000 | 0.477095 | 0.625089 |
| refs_count | 0.715575 | 0.477095 | 1.000000 | 0.582192 |
| link_count | 0.892395 | 0.625089 | 0.582192 | 1.000000 |
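A table like this one can be produced by computing a correlation coefficient for every pair of columns. The sketch below assumes Pearson correlation (the text does not say which coefficient was used) and runs on made-up numbers; only the column names are taken from the table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Nested dict holding the pairwise correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values for five hypothetical articles (not the project's data).
toy = {
    "article_lenght": [1200, 5400, 800, 15000, 3000],
    "views": [300, 900, 150, 20000, 400],
    "refs_count": [4, 30, 2, 120, 12],
    "link_count": [10, 60, 5, 400, 25],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, " ".join(f"{value:.3f}" for value in row.values()))
```

The diagonal is always 1 and the matrix is symmetric, which is why the table reads the same along rows and columns.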
All variables seem more or less correlated; 'views' is the least correlated with the others (coefficients between about 0.48 and 0.63). Also, the ranges are completely different! To be able to measure the influence of each component, the features need to be comparable, and thus on the same range, before computing the popularity score. For this reason, we transform the features by scaling each one to a range between 0 and 1 using MinMaxScaler. We chose this simple linear rescaling because it maintains the structure of the data, i.e. it preserves the shape of each feature's distribution.
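The rescaling itself is a one-line formula, x' = (x - min) / (max - min). The plain-Python sketch below illustrates it on made-up view counts; it is not the scikit-learn `MinMaxScaler` call used in the project, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no ranking information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up view counts for five hypothetical articles.
views = [150, 300, 400, 900, 20000]
scaled = min_max_scale(views)
print([round(v, 4) for v in scaled])
```

Because the map is linear, the order of the articles and the relative spacing between them are unchanged; a single very large value (like 20000 here) still pushes the rest of the feature close to 0.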
Having chosen our four features for quantifying the popularity of a page, and having rescaled them to the same range, we can compute the popularity score of an article.
For each article x, the popularity score is defined as:
$score(x) = length(x) + views(x) + refs(x) + links(x)$
where each term is the corresponding feature of article x after scaling to the range between 0 and 1.
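As a worked example of the formula, here is a minimal sketch that sums four already-scaled features. The feature keys and the values are hypothetical and chosen only to show the arithmetic.

```python
FEATURES = ("length", "views", "refs", "links")

def popularity_score(scaled_features):
    """Sum of the four features, each assumed to be already scaled to [0, 1]."""
    return sum(scaled_features[name] for name in FEATURES)

# Hypothetical scaled features for a single article.
article = {"length": 0.40, "views": 0.10, "refs": 0.25, "links": 0.30}
score = popularity_score(article)
print(f"{score:.2f}")  # prints 1.05
```

Since every term lies between 0 and 1, the score itself lies between 0 and 4, and each feature contributes with equal weight.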
We want to double-check that our score is correlated with what people think. We wrote a survey to verify that our metric is accurate.
We collected 2088 answers from more than 20 people. For each pair of proposed conflicts, the user chooses left, right or 'Skip' to indicate the more popular one, i.e. it is a binary survey.
The coverage of the survey is not incredible (12%) because we have more than 17 000 articles that can be used to write the survey, and some are really unknown. Therefore, we randomly selected only articles from the middle to the top of our popularity ranking.
We counted the number of correct answers for each popularity duel and found that 94% of the answers to the questionnaire match our popularity score. Not bad!
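The agreement rate can be computed by comparing each answer with the side that has the higher popularity score. The sketch below is an assumption about how such a count could work: the record layout, the handling of 'Skip' answers (ignored here) and the toy scores are all hypothetical.

```python
def agreement_rate(duels, scores):
    """Fraction of non-skipped duels where the answer matches the higher score.

    duels: iterable of (left_article, right_article, choice), where choice is
    "left", "right" or "skip".
    """
    correct = 0
    total = 0
    for left, right, choice in duels:
        # Skipped duels and exact score ties cannot be judged, so ignore them.
        if choice == "skip" or scores[left] == scores[right]:
            continue
        expected = "left" if scores[left] > scores[right] else "right"
        total += 1
        if choice == expected:
            correct += 1
    return correct / total if total else float("nan")

# Made-up popularity scores and survey answers.
toy_scores = {"A": 2.0, "B": 0.5, "C": 1.0}
toy_duels = [
    ("A", "B", "left"),   # matches the score
    ("B", "C", "right"),  # matches the score
    ("A", "C", "right"),  # disagrees with the score
    ("C", "B", "skip"),   # ignored
    ("C", "A", "right"),  # matches the score
]
rate = agreement_rate(toy_duels, toy_scores)
print(f"{rate:.0%}")  # prints 75%
```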
The metric thus seems to capture the popularity of the article. Within the 6% of mismatches, most seem to come from a misclick or confusion rather than from an error in our metric. For instance, two users thought that World War I is less popular than 2009 Jaipur fire or Battle of Adwa, which seems a bit unrealistic. Another group of errors arises from duels between two unknown conflicts, like 2008 Bin Salman mosque bombing vs the Battle of Marawi. Having established that the metric seems realistic, we continue our investigation by looking at what our metric can reveal: spotting important unpopular conflicts!
Not surprisingly, World War II is the most popular page! It has the highest popularity score, 2.59.
World War II being the most popular article hints that our metrics may be enough to encapsulate the popularity of a page. Because who hasn't studied World War II in school and looked up this article on Wikipedia?
We wonder whether the popularity score is partially influenced by the event date and the importance of the article. By event date, we mean the actual end date of the conflict the article refers to (as our articles are about conflicts). For now, the importance of an article is quantified by the number of deaths during that event.
Let's first have a quick look at the distribution of the article end dates.
The plot shows a very broad range of conflict end dates. Since we are interested in more recent conflicts, as a first approach why don't we have a look at articles mentioning a conflict end date after 1910?
Roughly five distinct populations of articles are observed. The first peak around 1914 corresponds to World War I. Indeed, 1113 of our filtered articles have a conflict end date between 1910 and 1930. The second peak around 1945 corresponds to World War II, with more than 2300 articles sharing that conflict end date. We can see that around an important conflict end date such as 1945, a lot of articles discuss that conflict, making it even more significant on Wikipedia!
Since we are more interested in recent events, let's fast-forward to 2017-2018. We are looking at articles describing ongoing conflicts.
Here, we look at the recent ongoing conflicts and analyze the popularity of each article between 2017 and 2018. The most popular articles deal with the Yemen and Syria conflicts.
Let's have a closer look at the unpopular articles. We want to understand if unpopular articles are unpopular because they are not "important" enough. Therefore, we decide to look at the number of deaths associated with each conflict.
Interestingly, the number of deaths does not seem to correlate with the popularity of the articles discussing ongoing conflicts: most of the ongoing conflicts have a low popularity score (< 0.6) but a high number of deaths. Even with all articles (not only ongoing conflicts), the number of deaths does not seem to correlate with the popularity of the article. All the articles follow the same trend!
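The visual impression of "no correlation" could be backed by a rank correlation, which only asks whether articles with more deaths tend to rank higher in popularity. The sketch below uses Spearman's coefficient as one possible choice (the text does not name a statistic) and made-up numbers.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up deaths and popularity scores for six hypothetical conflicts.
toy_deaths = [100, 18000, 500, 25000, 3000, 40000]
toy_scores = [0.90, 0.15, 0.30, 1.20, 0.20, 0.10]
rho = spearman(toy_deaths, toy_scores)
print(f"Spearman rho: {rho:.2f}")
```

A value near 0 would support the reading that deaths and popularity are unrelated; values near +1 or -1 would contradict it.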
Looking at articles from recent years, we found that some unpopular articles may not be completely insignificant, as these conflicts involve a considerable number of deaths. To live in a better world, people must be aware of unpopular conflicts that have a high number of deaths! People will not easily have access to such articles because they are not cited very often, they do not appear in Google searches, and they do not even appear in the first pages of Wikipedia search results. One way to overcome this barrier could be to showcase these articles monthly on the Wikipedia front page. With this project we proposed an easy way (4 factors) of finding unpopular articles which could be improved in order to raise awareness of unknown conflicts involving victims.
From 1988 to 1998 | Deaths: 20,000 | Popularity score: 0.119432
From 1991 to 1995 | Deaths: 21,000 | Popularity score: 1.315572