What I’m talking about in 2016

Authority and authoritative sources, critical data studies, digital methods, the travel of facts online, bot politics and social media and politics. These are some of the things I’m talking about in the first six months of 2016. (Just in case you thought the #sunselfies only indicated fun and aimless loafing.)

15 January Fact factories: How Wikipedia’s logics determine what facts are represented online. Wikipedia 15th birthday event, Oxford Internet Institute. [Webcast, OII event page, OII’s Medium post]

29 January Wikipedia and me: A story in four acts. TEDx Leeds University. [Video, TEDx Leeds University site]

Abstract: This is a story about how I came to be involved in Wikipedia and how I became a critic. It’s a story about hope and friendship and failure, and what to do afterwards. In many ways this story represents the relationship that many others like me have had with the Internet: a story about enormous hope and enthusiasm followed by disappointment and despair. Although similar, the uniqueness of these stories is in the final act – the act where I tell you what I now think about the future of the Internet after my initial despair. This is my Internet love story in four acts: 1) Seeing the light 2) California rulz 3) Doubting Thomas 4) Critics unite. 

17 February. Add data to methods and stir. Digital Methods Summer School. CCI, Queensland University of Technology, Brisbane [QUT Digital Methods Summer School website]

Abstract: Are engagements with real humans necessary to ethnographic research? In this presentation, I argue for methods that connect data traces to the individuals who produce them by exploring examples of experimental methods featured on the site ‘EthnographyMatters.net’, such as live fieldnoting, collaborative mapmaking and ‘sensory postcards’.  This presentation will serve as an inspiration for new work that expands beyond disciplinary and methodological boundaries and connects the stories we tell about our things with the humans who create them.  

10 March. Situating Innovations in Digital Measures. University of Leeds, Leeds Critical Data Studies Inaugural Event.  

Abstract: Drawing on case studies presented last month at the Digital Methods Summer School (Digital Media Research Centre, Queensland University of Technology) in Brisbane, Australia, as well as on experimental methods contributed by authors from the Ethnography Matters community, this seminar will present a host of inspiring methodological tools that researchers of digital culture and politics are using to explore questions about the role of digital technologies in modern life. Instead of data-centric models and methodologies, the seminar focuses on human-centric models that also engage with the opportunities afforded by digital technologies.

21-22 April. Ode to the infobox. Streams of Consciousness: Data, Cognition and Intelligent Devices Conference. University of Warwick.

Abstract: Also called a ‘fact box’, the infobox is a graphic design element that highlights summarised statements or facts about the world contained within it. Infoboxes are important structural elements in the design of digital information. They usually hang in the right-hand corner of a webpage, calling out to us that the information contained within them is special and somehow apart from the rest. The infobox satisfies our rapid information-seeking needs. We’ve been trained to look to the box to discover, not just another set of informational options, but an authoritative statement of seemingly condensed consensus emerging out of the miasma of data about the world around us.

When you start to look for them, you’ll see infoboxes wherever you look. On Google, these boxes contain results from Google’s Knowledge Graph; on Wikipedia they are contained within articles and host summary statistics and categories; and on the BBC, infoboxes highlight particular facts and figures about the stories that flow around them.

The facts represented in the infoboxes are no longer as static as the infoboxes of old. Now they are the result of algorithmic processes that churn thousands, sometimes millions, of data points according to rulesets that produce relatively unique encounters for each new user.

In this paper, I trace the multitude of instructions and sources, institutions and people that constitute the assemblage that results in different facts for different groups at different times. Investigating infoboxes on Wikipedia and Google through intermediaries such as Wikidata, I build a portrait of the pipes, processes and people that feed these living, dynamic frames. The infobox, humble as it seems, turns out to be a powerful force in today’s deeply connected information ecosystem. By celebrating the infobox, I hope to reveal its hidden power – a power with consequences far beyond the efficiency that it promises.

29 April. How facts travel in the digital age. Social Media Lab Guest Speaker Series, Ryerson University, Social Media Lab, Toronto, Canada. [Speaker series website]

Abstract: How do facts travel through online systems? How is it that some facts gather steam and gain new adherents while others languish in isolated sites? This research investigates the travel of two sets of facts through Wikipedia’s networks and onto search engines like Google. The first: facts relating to the 2011 Egyptian Revolution; the second: facts relating to “surr”, a sport played by men in the villages of Northern India. While the Egyptian Revolution became known to millions across the world as events were reported on multiple Wikipedia language versions in early 2011, the facts relating to surr faced enormous challenges as their companions attempted to propel them through Wikipedia’s infrastructure. Following the facts as they travelled through Wikipedia gives us an insight into the source of systemic biases of Internet infrastructures and the ways in which political actors are changing their strategies in order to control narratives around political events.

8 June. Politicians, Journalists, Wikipedians and their Twitter bots. Algorithms, Automation and Politics. (Heather Ford, Elizabeth Dubois, Cornelius Puschmann) ICA Pre-Conference, Fukuoka, Japan. [Event website]

Abstract selection: Recent research suggests that automated agents deployed on social media platforms, particularly Twitter, have become a feature of the modern political communication environment (Samuel, 2015, Forelle et al, 2015, Milan, 2015). Haustein et al (2016) cite a range of studies that put the percentage of bots among all Twitter accounts at 10-16% (p. 233). Governments have been shown to employ social media experts to spread pro-governmental messages (Baker, 2015, Chen 2015), political parties pay marketing companies to create or manipulate trending topics (Forelle et al, 2015), and politicians and their staff use bots to augment the number of account followers in order to provide an illusion of popularity to their accounts (Forelle et al, 2015). The assumption in these analyses is that bots have a direct influence on public opinion and that they can act as credible and competent sources of information (Edwards et al, 2014). There is still, however, little empirical evidence of the link between bots and political discourse, the material consequences of such changes or how social groups are reacting. [continued] 

11 June. Wikipedia: Moving Between the Whole and its Traces. In ‘Drowning in Data: Industry and Academic Approaches to Mixed Methods in “Holistic” Big Data Studies’ panel. International Communication Association Conference. Fukuoka, Japan. [ICA website]

Abstract: In this paper, I outline my experiences as an ethnographer working with data scientists to explore various questions surrounding the dynamics of Wikipedia sources and citations. In particular, I focus on the moments at which we were able to bring the small and the large into conversation with one another, and moments when we looked, wide-eyed, at one another, unable to articulate what had gone wrong. Inspired by Latour’s (2010) reading of Gabriel Tarde, I argue that a useful analogy for conducting mixed methods for studies about which large datasets and holistic tools are available is the process of life drawing – a process of moving up close to the easel and standing back (or to the side) as the artist looks at both their subject and the canvas in a continual motion.

Wikipedia’s citation traces can be analysed in their aggregate – piled up, one on top of the other to indicate the emergence of new patterns, new vocabulary, new authorities of knowledge in the digital information environment. But citation traces take a particular shape and form, and without an understanding of the behaviour that lies behind such traces, the tendency is to count what is available to us, rather than to think more critically about the larger questions that Wikipedia citations help to answer.

I outline a successful conversation which happened when we took a large snapshot of 67 million source postings from about 3.5 million Wikipedia articles and attempted to begin classifying the citations according to existing frameworks (Ford 2014). In response, I conducted a series of interviews with editors by visualising their citation traces and asking them questions about the decision-making and social interaction that lay behind such performances (Dubois and Ford 2015). I also reflect on a less successful moment when we attempted to discover patterns in the dataset on the basis of findings from my ethnographic research into the political behaviour of editors. Like the artist who had gotten their proportions wrong when scaling up the image on the canvas, we needed to re-orient ourselves and remember what we were trying to ultimately discover.

13 June. The rise of expert amateurs in the realm of knowledge production: The case of Wikipedia’s newsworkers. In ‘Dialogues in Journalism Studies: The New Gatekeepers’ panel. International Communication Association Conference. Fukuoka, Japan. [ICA website]

Abstract: Wikipedia has become an authoritative source about breaking news stories as they happen in many parts of the world. Although anyone can technically edit a Wikipedia article, recent evidence suggests that some have significantly more power than others when it comes to being able to have edits sustained over time. In this paper, I suggest that the theory of co-production, elaborated upon by Sheila Jasanoff, is a useful way of framing how, rather than a removal of the gatekeepers of the past, Wikipedia demonstrates two key trends. The first is the rise of a new set of gatekeepers in the form of experienced Wikipedians who are able to deploy coded objects effectively in order to stabilize or destabilize an article, and the second is a reconfiguration in the power of traditional sources of news and information in the choices that Wikipedia editors make when writing about breaking news events.



Max Klein on Wikidata, “botpedia” and gender classification

Max Klein defines himself on his blog as a ‘Mathematician-Programmer, Wikimedia-Enthusiast, Burner-Yogi’ who believes in ‘liberty through wikis and logic’. I interviewed him a few weeks ago when he was in the UK for Wikimania 2014. He then wrote up some of his answers so that we could share it with others. Max is a long-time Wikipedia volunteer who has occupied a wide range of roles, both as a volunteer and as a Wikipedian in residence for OCLC, among others. He has been working on Wikidata from the beginning but it hasn’t always been plain sailing. Max is outspoken about his ideas and he is respected for that, as well as for his patience in teaching those who want to learn. This interview serves as a brief introduction to Wikidata and some of its early disagreements.

Max Klein in 2011. CC BY SA, Wikimedia Commons

How was Wikidata originally seeded?
In the first days of Wikidata we used to call it a ‘botpedia’ because it was basically just an echo chamber of bots talking to each other. People were writing bots to import information from infoboxes on Wikipedia. A heavy focus of this was data about persons from authority files.

Authority files?
An authority file is a Library Science term that is basically a numbering system to assign authors unique identifiers. The point is to avoid a “which John Smith?” problem. At last year’s Wikimania I said that Wikidata itself has become a kind of “super authority control” because now it connects so many other organisations’ authority control (e.g. Library of Congress and IMDB). In the future I can imagine Wikidata being the one authority control system to rule them all.

In the beginning, each Wikipedia project was supposed to be able to decide whether it wanted to integrate Wikidata. Do you know how this process was undertaken?
It actually wasn’t decided site-by-site. At first only Hungarian, Italian, and Hebrew Wikipedias were progressive enough to try. But once English Wikipedia approved the migration to use Wikidata, soon after there was a global switch for all Wikis to do so (see the announcement here).

Do you think it will be more difficult to edit Wikipedia when infoboxes are linking to templates that derive their data from Wikidata? (both editing and producing new infoboxes?)
It would seem to complicate matters that infobox editing becomes opaque to those who aren’t Wikidata aware. However, at Wikimania 2014, two Sergeys from Russian Wikipedia demonstrated a very slick gadget that made this transparent again – it allowed editing of the Wikidata item from the Wikipedia article. So with the right technology this problem is a nonstarter.

Can you tell me about your opposition to the ways in which Wikidata editors decided to structure gender information on Wikidata?
In Wikidata you can put a constraint on which values a property can have. When I came across it, the “sex or gender” property said “only one of ‘male, female, or intersex’”. I was opposed to this because I believe that whichever way the Wikidata community structures the gender options, we are going to imbue it with our own bias. For instance, the property is already called “sex or gender”, which shows a lack of distinction between the two, which some people would consider important. So I spent some time arguing that at least we should allow any value. So if you want to say that someone is “third gender” or even that their gender is “Sodium” that’s now possible. It was just an early case of heteronormativity sneaking into the ontology.

Wikidata uses a CC0 license which is less restrictive than the CC BY SA license that Wikipedia is governed by. What do you think the impact of this decision has been in relation to others like Google who make use of Wikidata in projects like the Google Knowledge Graph?
Wikidata being CC0 at first seemed very radical to me. But one thing I noticed was that, increasingly, where the Google Knowledge Graph now credits its “info-cards” to Wikipedia, the attribution will just start disappearing. This seems mostly innocent until you consider that Google is a funder of the Wikidata project. So in some way it could seem like they are just paying to remove a blemish on their perceived omniscience.

But to nip my pessimism I have to remind myself that if we really believe in the Open Source, Open Data credo then this rising tide lifts all boats.

Code and the (Semantic) City

Mark Graham and I have just returned from Maynooth in Ireland where we participated in a really great workshop called Code and the City organised by Rob Kitchin and his team at the Programmable City project. We presented a draft paper entitled ‘Semantic Cities: Coded Geopolitics and Rise of the Semantic Web’, in which we trace how the city of Jerusalem is represented across Wikipedia and through Wikidata and Freebase to Google’s Knowledge Graph in order to answer questions about how linked data and the semantic web change a user’s interactions with the city. We’re indebted to the folks from all of these projects who have helped us navigate questions about the history and affordances of these projects so that we can better understand the current Web ecology. The paper is currently being revised and will be available soon, we hope!

Infoboxes and cleanup tags: Artifacts of Wikipedia newsmaking

Infobox from the first version of the 2011 Egyptian Revolution (then ‘protests’) article on English Wikipedia, 25 January, 2011

My article about Wikipedia infoboxes and cleanup tags and their role in the development of the 2011 Egyptian Revolution article has just been published in the journal ‘Journalism: Theory, Practice and Criticism’ (a pre-print is available on Academia.edu). The article forms part of a special issue of the journal edited by C W Anderson and Juliette De Maeyer, who organised the ‘Objects of Journalism’ pre-conference at the International Communication Association conference in London that I attended last year. The issue includes a number of really interesting articles from a variety of periods in journalism’s history – from pica sticks to interfaces, timezones to software, some of which we covered in the August 2013 edition of ethnographymatters.net.

My article is about infoboxes and cleanup tags as objects of Wikipedia journalism, objects that have important functions in the coordination of editing and writing by distributed groups of editors. Infoboxes are summary tables on the right-hand side of an article that enable readability and quick reference, while cleanup tags are notices at the head of an article warning readers and editors of specific problems with the article. When added to an article, both tools simultaneously notify editors about missing or weak elements of the article and add the article to particular categories of work.

The article contains an account of the first 18 days of the protests that resulted in the resignation of then-president Hosni Mubarak based on interviews with a number of the article’s key editors as well as traces in related articles, talk pages and edit histories. Below is a selection from what happened on day 1:

Day 1: 25 January, 2011 (first day of the protests)

The_Egyptian_Liberal published the article on English Wikipedia on the afternoon of the first day of what would become a wave of protests that would lead to the unseating of President Hosni Mubarak. A template was used to insert the ‘uprising’ infobox to house summarised information about the event, including fields for its ‘characteristics’ and the number of injuries and fatalities. This template was chosen from a range of other infoboxes relating to history and events on Wikipedia, but has since been deleted in favor of the more recently developed ‘civil conflict’ infobox with fields for ‘causes’, ‘methods’ and ‘results’.

The first draft included the terms ‘demonstration’, ‘riot’ and ‘self-immolation’ in the ‘characteristics’ field and was illustrated by the Latuff cartoon of Khaled Mohamed Saeed and Hosni Mubarak with the caption ‘Khaled Mohamed Saeed holding up a tiny, flailing, stone-faced Hosni Mubarak’. Khaled Mohamed Saeed was a young Egyptian man who was beaten to death, reportedly by Egyptian security forces, and the subject of the Facebook group ‘We are all Khaled Said’ moderated by Wael Ghonim that contributed to the growing discontent in the weeks leading up to 25 January, 2011. The image slot would ideally have been filled by a photograph of the protests, but the cartoon was used because the article was uploaded so soon after the first protests began. The cartoon also has significant emotive power and clearly represented the perspective of the crowd of anti-Mubarak demonstrators in the first protests.

Upon publishing, three prominent cleanup tags were automatically appended to the head of the article. These included the ‘new unreviewed article’ tag, the ‘expert in politics needed’ tag and the ‘current event’ tag, warning readers that information on the page may change rapidly as events progress. These three lines of code that constituted the cleanup tags initiated a complex distribution of tasks to different groups of users located in work groups throughout the site: page patrollers, subject experts and those interested in current events.

The three cleanup tags automatically appended to the article when it was published at UTC 13:27 on 25 January, 2011

Looking at the diffs in the first day of the article’s growth, it becomes clear that the article is by no means a ‘blank slate’ that editors fill progressively with prose. Much of the activity in the first stage of the article’s development consisted of editors inserting markers or frames in the article that acted to prioritize and distribute work. Cleanup tags alerted others about what they believed to be priorities (to improve weak sections or provide political expertise, for example) while infoboxes and tables provided frames for editors to fill in details iteratively as new information became available.

By discussing the use of these tools in the context of Bowker and Star’s theories of classification (2000), I argue that these tools are not only material but also conceptual and symbolic. They facilitate collaboration by enabling users to fill in details according to a pre-defined set of categories and by catalyzing notices that alert others to the work that they believe needs to be done on the article. Their power, however, cannot only be seen in terms of their functional value. These artifacts are deployed and removed as acts of social and strategic power play among Wikipedia editors who each want to influence the narrative about what happened and why it happened. Infoboxes and tabular elements arise as clean, simple, well-referenced numbers out of the messiness and conflict that gave rise to them. When cleanup tags are removed, the article develops an implicit authority, appearing to rise above uncertainty, power struggles and the impermanence of the compromise that it originated from.

This categorization practice enables editors to collaborate iteratively with one another because each object signals work that needs to be done by others in order to fill in the gaps of the current content. In addition to this functional value, however, categorization also has a number of symbolic and political consequences. Editors are engaged in a continual practice of iterative summation that contributes to an active construction of the event as it happens rather than a mere assembling of ‘reliable sources’. The deployment and removal of cleanup tags can be seen as an act of power play between editors that affects readers’ evaluation of the article’s content. Infoboxes are similar sites of struggle whose deployment and development result in an erasure of the contradictions and debates that gave rise to them. These objects illuminate how this novel journalistic practice has important implications for the way that political events are represented.

Diary of an internet geography project #4

Reblogged from ‘Connectivity, Inclusivity and Inequality’

Continuing with our series of blog posts exposing the workings behind a multidisciplinary big data project, we talk this week about the process of moving between small data and big data analyses. Last week, we did a group deep dive into our data. Extending the metaphor: Shilad caught the fish and dumped them on the boat for us to sort through. We wanted to know whether our method of collecting and determining the origins of the fish was working by looking at a bunch of randomly selected fish up close. Working out how we would do the sorting was the biggest challenge. Some of us liked really strict rules about how we were identifying the fish. ‘Small’ wasn’t a good enough description; better would be that small = 10-15cm diameter after a maximum of 30 minutes out of the water. Through this process we learned a few lessons about how to do this close-looking as a team.

Step 1: Randomly selecting items from the corpus

We wanted to know two things about the data that we were selecting through this ‘small data’ analysis: Q1) Were we getting every citation in the article or were we missing/duplicating any? Q2) What was the best way to determine the location of the source?

Shilad used the WikiBrain software library he developed with Brent to identify all of the roughly one million geotagged Wikipedia articles. He then collected all external URLs (about 2.9 million unique URLs) appearing within those articles and used this data to create two samples for coding tasks. He sampled about 50 geotagged articles (to answer Q1) and selected a few hundred random URLs cited within particular articles (to answer Q2).

  • Batch 1 for Q1: 50 documents each containing an article title, url, list of citations, empty list of ‘missing citations’
  • Batch 2 for Q2: Spreadsheet of 500 random citations occurring in 500 random geotagged articles.
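
As a rough illustration of this sampling step, the two batches could be drawn along the following lines. This is only a sketch in Python and not the project's actual code, which used WikiBrain: the function name `draw_samples`, the dictionary input format, the fixed seed and the choice of one citation per sampled article for Batch 2 are all assumptions made for the example.

```python
import random


def draw_samples(article_urls, n_articles=50, n_citations=500, seed=0):
    """Draw two coding batches from a {article_title: [url, ...]} mapping.

    Hypothetical helper: the input format, names and seed are assumptions
    for illustration, not part of WikiBrain or the project's pipeline.
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    # Keep only articles that cite at least one URL; sort the titles so the
    # draw does not depend on dictionary ordering.
    titles = sorted(t for t, urls in article_urls.items() if urls)

    # Batch 1 (Q1): whole articles with their extracted citations, plus an
    # empty list where coders can record citations the extraction missed.
    batch1 = [
        {"title": t, "citations": list(article_urls[t]), "missing_citations": []}
        for t in rng.sample(titles, min(n_articles, len(titles)))
    ]

    # Batch 2 (Q2): one randomly chosen citation from each of n_citations
    # randomly chosen articles, kept as (title, url) pairs.
    batch2 = [
        (t, rng.choice(article_urls[t]))
        for t in rng.sample(titles, min(n_citations, len(titles)))
    ]
    return batch1, batch2
```

Sampling without replacement at the article level keeps any single heavily cited article from dominating the close-looking exercise, and the fixed seed means everyone on the team can regenerate exactly the same batches.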

Continue reading

Wikipedia and breaking news: The promise of a global media platform and the threat of the filter bubble

I gave this talk at Wikimania in London yesterday. 

In the first years of Wikipedia’s existence, many of us said that, as an example of citizen journalism and journalism by the people, Wikipedia would be able to avoid the gatekeeping problems faced by traditional media. The theory was that because we didn’t have the burden of shareholders and the practices that favoured elite viewpoints, we could produce a media that was about ‘all of us’ and not just ‘some of us’.

Dan Gillmor (2004) wrote that Wikipedia was an example of a wave of citizen journalism projects initiated at the turn of the century in which ‘news was being produced by regular people who had something to say and show, and not solely by the “official” news organizations that had traditionally decided how the first draft of history would look’ (Gillmor, 2004: x).

Yochai Benkler (2006) wrote that projects like Wikipedia enable ‘many more individuals to communicate their observations and their viewpoints to many others, and to do so in a way that cannot be controlled by media owners and is not as easily corruptible by money as were the mass media.’ (Benkler, 2006: 11)

I think that at that time we were all really buoyed by the idea that Wikipedia and peer production could produce information products that were much more representative of “everyone’s” experience. But the idea that Wikipedia could avoid bias completely, I now believe, is fundamentally wrong. Wikipedia presents a particular view of the world while rejecting others. Its bias arises both from its dependence on sources which are themselves biased and from its own policies and practices, which favour particular viewpoints. Although Wikipedia is as close to a truly global media product as we have probably ever come*, like every media product it is a representation of the world and is the result of a series of editorial, technical and social decisions made to prioritise certain narratives over others. Continue reading

Big Data and Small: Collaborations between ethnographers and data scientists

This article first appeared in Big Data and Society journal published by Sage and is licensed by the author under a Creative Commons Attribution license. [PDF]


In the past three years, Heather Ford—an ethnographer and now a PhD student—has worked on ad hoc collaborative projects around Wikipedia sources with two data scientists from Minnesota, Dave Musicant and Shilad Sen. In this essay, she talks about how the three met, how they worked together, and what they gained from the experience. Three themes became apparent through their collaboration: that data scientists and ethnographers have much in common, that their skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to their success.

Continue reading