Max Klein on Wikidata, “botpedia” and gender classification

Max Klein defines himself on his blog as a ‘Mathematician-Programmer, Wikimedia-Enthusiast, Burner-Yogi’ who believes in ‘liberty through wikis and logic’. I interviewed him a few weeks ago when he was in the UK for Wikimania 2014, and he later wrote up some of his answers so that we could share them with others. Max is a long-time Wikipedia volunteer who has occupied a wide range of roles, including Wikipedian in Residence for OCLC. He has been working on Wikidata from the beginning, but it hasn’t always been plain sailing. Max is outspoken about his ideas and he is respected for that, as well as for his patience in teaching those who want to learn. This interview serves as a brief introduction to Wikidata and some of its early disagreements.

Max Klein in 2011. CC BY SA, Wikimedia Commons

How was Wikidata originally seeded?
In the first days of Wikidata we used to call it a ‘botpedia’ because it was basically just an echo chamber of bots talking to each other. People were writing bots to import information from infoboxes on Wikipedia. A heavy focus of this was data about persons from authority files.

Authority files?
An authority file is a library science term for a numbering system that assigns authors unique identifiers. The point is to avoid a “which John Smith?” problem. At last year’s Wikimania I said that Wikidata itself has become a kind of “super authority control” because it now connects so many other organisations’ authority control systems (e.g. the Library of Congress and IMDb). In the future I can imagine Wikidata being the one authority control system to rule them all.
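To make the “super authority control” idea concrete, here is a minimal sketch of Wikidata acting as a crosswalk between authority files. The entity JSON below is an abridged, hypothetical fragment shaped like what the Wikidata API returns; the property IDs P214 (VIAF), P244 (Library of Congress) and P345 (IMDb) are real external-identifier properties, but the values and the helper function are illustrative only.

```python
# Abridged, hypothetical fragment of a Wikidata entity in API-like shape.
# The identifier values here are purely illustrative.
entity = {
    "id": "Q42",
    "claims": {
        "P214": [{"mainsnak": {"datavalue": {"value": "113230702"}}}],
        "P244": [{"mainsnak": {"datavalue": {"value": "n80076765"}}}],
        "P345": [{"mainsnak": {"datavalue": {"value": "nm0010930"}}}],
    },
}

# Real Wikidata external-identifier properties and their human labels.
AUTHORITY_PROPS = {"P214": "VIAF", "P244": "Library of Congress", "P345": "IMDb"}

def external_ids(entity):
    """Collect the external authority-file identifiers attached to one item."""
    ids = {}
    for prop, label in AUTHORITY_PROPS.items():
        for claim in entity["claims"].get(prop, []):
            ids[label] = claim["mainsnak"]["datavalue"]["value"]
    return ids

print(external_ids(entity))
```

One item ID thus resolves the same person across several organisations’ numbering systems, which is exactly the “which John Smith?” problem the authority files were invented to solve.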

In the beginning, each Wikipedia project was supposed to be able to decide whether it wanted to integrate Wikidata. Do you know how this process was undertaken?
It actually wasn’t decided site-by-site. At first only Hungarian, Italian, and Hebrew Wikipedias were progressive enough to try. But once English Wikipedia approved the migration to use Wikidata, soon after there was a global switch for all Wikis to do so (see the announcement here).

Do you think it will be more difficult to edit Wikipedia when infoboxes are linking to templates that derive their data from Wikidata? (both editing and producing new infoboxes?)
It might seem to complicate matters, since infobox editing becomes opaque to those who aren’t Wikidata-aware. However, at Wikimania 2014 two Sergeys from Russian Wikipedia demonstrated a very slick gadget that makes this transparent again – it allows editing of the Wikidata item from within the Wikipedia article. So with the right technology this problem is a nonstarter.

Can you tell me about your opposition to the ways in which Wikidata editors decided to structure gender information on Wikidata?
In Wikidata you can put a constraint on the values a property can have. When I came across it, the “sex or gender” property said “only one of ‘male’, ‘female’, or ‘intersex’”. I was opposed to this because I believe that any way the Wikidata community structures the gender options, we are going to imbue it with our own bias. For instance, the property is already called “sex or gender”, which collapses a distinction between the two that some people would consider important. So I spent some time arguing that at least we should allow any value. So if you want to say that someone is “third gender” or even that their gender is “Sodium”, that’s now possible. It was just an early case of heteronormativity sneaking into the ontology.
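The effect of the old “one-of” constraint can be sketched in a few lines. This is a hypothetical illustration, not Wikidata’s actual constraint-checking code: the Q-IDs for male (Q6581097), female (Q6581072) and intersex (Q1097630) are the real items the constraint once named, and any value outside that set – such as non-binary (Q48270) – would have been flagged as a violation.

```python
# The allowed-value set that the "one-of" constraint on P21 once enforced.
ORIGINAL_ONE_OF = {"Q6581097", "Q6581072", "Q1097630"}  # male, female, intersex

def violates_one_of(value_qid, allowed=ORIGINAL_ONE_OF):
    """Under the old constraint, any value outside the set was flagged."""
    return value_qid not in allowed

# A "non-binary" (Q48270) statement would have been flagged as a violation;
# after the constraint was dropped, any item can be used as the value.
print(violates_one_of("Q48270"))
```

Removing the constraint didn’t change the data model at all – P21 always pointed at an arbitrary item – it only stopped the community from flagging values outside a hand-picked list.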

Wikidata uses a CC0 license which is less restrictive than the CC BY SA license that Wikipedia is governed by. What do you think the impact of this decision has been in relation to others like Google who make use of Wikidata in projects like the Google Knowledge Graph?
Wikidata being CC0 at first seemed very radical to me. But one thing I noticed is that, increasingly, this will mean that where the Google Knowledge Graph now credits its “info-cards” to Wikipedia, the attribution will just start disappearing. This seems mostly innocent until you consider that Google is a funder of the Wikidata project. So in some way it could seem like they are just paying to remove a blemish on their perceived omniscience.

But to nip my pessimism in the bud, I have to remind myself that if we really believe in the Open Source, Open Data credo, then this rising tide lifts all boats.

February 2013: The Openness Edition


First published on ethnographymatters.net.

Last month on Ethnography Matters, we started a monthly thematic focus in which each of the EM contributing editors elicits posts about a particular theme. I kicked us off with ‘The Openness Edition’, in which we investigated what openness means for the ethnographic community. I ended up editing some wonderful posts on the topic of openness last month – from Rachelle Annechino’s great post questioning what “informed consent” means in health research, to Jenna Burrell’s post about open access journals related to ethnography, and Sarah Kendzior’s stimulating piece about the legitimacy and place of Internet research by anthropologists. We also had two really wonderful pieces sharing methods for more open, transparent research: by Juliano Spyer (YouTube “video tags” as an open survey tool) and by Jeff Hall, Elizabeth Gin and An Xiao in their inspiring piece about how they facilitated story-building exercises with homeless youth in Boyle Heights (complete with PDF instructions!). Below is the editorial that I wrote at the beginning of the month, where I try to tease out some of the complexities of my own relationship with the open access/open content movement. Comments welcome!

On Saturday the 12th of January, almost a month ago, I woke to news of Aaron Swartz’s death the previous day. In the days that followed, I experienced the mixed emotions that accompany such horrific moments: sadness for him and the pain he must have gone through in struggling with depression and anxiety, anger at those who had waged an exaggerated legal campaign against him, uncertainty as I posted about his death on Facebook and felt like I was trying to claim some part of him and his story, and finally resolution that I needed to clarify my own policy on open access.

DataEDGE: A conversation about the future of data science

First posted at the Google Policy blog.

With all the hype around “Big Data” lately, you may be inclined to shrug it off as a business fad. But there is more to it than a buzzword. Data science is emerging as a new field, changing the ways that companies get to know their customers, governments their citizens, and relief organizations their constituents. It is a field that will demand entirely new skill sets and information professionals trained to collect, curate, combine, and analyze massive amounts of data.

Today, we create data both actively—as we socialize, conduct business, and organize online—and passively—via a host of remote sensing devices. McKinsey projects a 40% growth in global data generated annually. Companies and organizations are racing to find new ways to make sense of this data and use it to drive decision-making. In the health sector, that includes investigating the clinical and cost effectiveness of new drugs using large datasets. (McKinsey estimates that the efficient and effective use of data could provide as much as $300 billion in value to the United States healthcare sector.) In the public sector, it could mean using historical unemployment data to reduce the amount of time it takes unemployed workers to find new employment. And in the retail sector, it leads to tools that help suppliers understand demand in stores so they know when they should restock items.

Why the muggle doesn’t like the term “bounded crowdsourcing”

Patrick Meier just wrote a post explaining why the term he coined, “bounded crowdsourcing” is ‘important for crisis mapping and beyond’. He likens “bounded crowdsourcing” to “snowball sampling”, where a few trusted individuals invite other individuals who they ‘fully trust and can vouch for… And so on and so forth at an exponential rate if desired’.

I like the idea of trusted networks of people working together (actually, it seems that this technique has been used for decades in the activism community) but I have some problems with the term that has been “coined”. I guess I will be called a “muggle” but I am willing to take the plunge because a) I have never been called a “muggle” and I would like to know what it feels like and b) the “crowdsourcing” term is one I feel is worthy of a duel.

Firstly, I don’t agree with the way that Meier likens “crowdsourcing” work like Ushahidi to statistical methods. I see why he’s trying to make the comparison (to prove crowdsourcing’s value, perhaps?) but I think that it is inaccurate and actually de-values the work involved in building an Ushahidi instance. Working on an Ushahidi deployment is not the same as answering a question through statistical methods. With statistical methods, a researcher (or group of researchers) tries to answer a question or test a hypothesis. ‘Do the majority of Hispanic Americans want Obama to win a second term?’ for example. Or ‘What do Kenyans think is the best place to go on holiday?’

But Ushahidi has never been about gaining a statistically significant understanding of a question or hypothesis. It has been designed as a way for a group of concerned citizens to provide a platform for people to report on what was happening to them or around them. Sure, in many cases, we can get a general feel about the mood of a place by looking at reports, but the lack of a single question (and the power differential between those asking and those being asked), the prevalence of unstructured reports and the skewed distribution of reporters towards those most likely to reply using the technology (or attempting to game the system) make the differences much greater than the similarities.

The other problem is that the term lacks a useful definition. Meier seems to suggest that the “bounded” part refers to the fact that the work is not completely open and is limited to a network of trusted individuals. More useful would be to understand under what conditions and for what types of work different levels of openness are useful, because no crowdsourcing project is entirely “unbounded”. Meier says that he ‘introduced the concept of bounded crowdsourcing to the field of crisis mapping in response to concerns over the reliability of crowd sourced information.’ But if this means that “crowdsourced” information is unreliable, then it would be useful to understand how and when it is unreliable.

If we take the very diverse types of work required of an Ushahidi deployment, we might say that they include the need to customize the design, build the channels (SMS short codes, Twitter hashtags, etc.), designate the themes, advertise the map, curate the reports, verify the reports, and find related media reports, among others. Once we’ve broken down the different types of work, we can then decide what level of openness is required for each of these job types. I certainly don’t want to restrict the advertising of my map to the world, so I want to keep that as “unbounded” as possible. I want to ensure that there are enough people with some “ownership” of the map to keep them supporting and talking about it, so I want to give them some jobs that keep them involved. Tagging reports as “verified” is probably a more sensitive activity because it requires a transparent set of rules and is one of the key ways that others come to trust the map or not. So I want to ensure that trusted people, or at least those over whom I have some recourse, do this type of work. I also want to get feedback on themes and hashtags to keep it close to the people, since in the end, a map is only as good as the network that supports it. Now if I have different levels of openness for different areas of work, is my project an example of “bounded” or “unbounded” crowdsourcing?

Although I am always in favor of adding new words to the English language, I feel that the term “bounded crowdsourcing” is unhelpful in leading us towards any greater understanding of the nuances of online work like this. Actually, I’m always surprised at the use of the term “crowdsourcing” over “peer production” in the crisis mapping community, since crowdsourcing implies monetarily or commercially incentivized work rather than the non-monetary incentives that characterise peer production projects like Wikipedia (see an expanded definition + examples here). I can’t imagine anyone ever “coining” the term “bounded peer production” (but I seem to be continually surprised, so I shouldn’t completely discount it from happening), and I think that this is indicative of the problems with the term.

So, yes, if we’re talking about different ways of improving the reliability of information produced on the Ushahidi platform, I’m excited to learn more about using trusted networks. I just think that if a term is being coined, it should be one that advances our understanding of what the theory is here. Is it that: if you restrict the numbers of people who can take part in writing reports, you get a more reliable result? Where do you restrict? What kind of work should be open? What do we mean by open? Automatic acceptance of Twitter reports with a certain hashtag? Or an email address that you can use to request membership? Is there a certain number that you should limit a team to (as the Skype example suggests)?

This “muggle” thinks that the term doesn’t get us any further towards understanding these (really important) questions. The “muggle” will now squeeze her eyes shut and duck.

Why I won’t support Creative Commons or Wikipedia this year

It’s that time of the year again. Creative Commons and Wikipedia are working towards their fundraising goals for the coming year and asking users to donate to support the cause.

I spent the last five years working on building a global perspective on the commons and will probably spend the next working out what I did wrong. I worked directly with both organisations during this time, so it’s really sad for me to say this (and probably not very politically astute), but I feel like the only way we’re ever going to attack the problem of a lack of global agenda and global solidarity is through the funding issue. Here are my reasons in brief:

– Creative Commons (despite pressure from its international volunteers) still has a mostly male, mostly white, almost all American leadership. If CC is really committed to an international agenda, then they must at least attempt to involve a more diverse leadership in planning for the future.

– I know it’s a fundraising campaign but statements like this by Hal Abelson: ‘By supporting Creative Commons, you are helping to realize the promise of the Internet to uplift all of humanity’ leave me speechless. Until we have an international *common* agenda, until ‘all of humanity’ or at least major parts of it have ownership of this agenda (South Africa is the only African country in the CC International stable), we should feel ashamed to make statements like this.

– Wikipedia plans to spend $9.4 million in the 2009–10 financial year (up 53% from last year) and has, at last, a plan for spreading the wealth: a new $295,000 grantmaking program (only 3% of spending goes to chapters, but that’s better than almost 0). The problem is that this money seems to be going only to existing chapters (there are no chapters in Africa). This means that if you wanted money to go specifically to outreach on the African continent, you couldn’t do it, since you can only donate to Wikipedia or to existing Wikipedia chapters.

I think that one of the worst things that organisations who have global goals can do is to stop people from countries who are left out of the agenda from donating money. Even if it’s just a small amount, CC and Wikipedia are perpetuating the myth that we don’t care about these issues in Africa.

My small contribution has, instead, gone to Global Voices. They spread the small amount of money that they receive pretty widely and their leadership team reaches each region at least.