Facebook/Cambridge Analytica: Privacy lessons and a way forward
2018-03-20 · By Nathaniel Fruchter, Michael Specter, Ben Yuan
Introduction
On Saturday, March 17, 2018, The Guardian and The New York Times both reported that a political research and data science firm called Cambridge Analytica had inappropriately harvested data from the Facebook profiles of over 50 million people, the vast majority of whom had not provided consent for their data to be used for political and psychological profiling. On Friday, March 16, 2018, while The Observer (The Guardian's Sunday sister paper) was still gathering information for its story, Facebook announced that it had suspended the firm from its platform.
We’re here to provide both technical and policy background on the Cambridge Analytica (CA) incident, to illustrate the extent of the user data that was harvested and the potential harms, and to offer some ideas for the way forward.
Who is Cambridge Analytica?
Cambridge Analytica (CA) is a political consulting firm which claims to create “psychographic profiles” of voters based on publicly available information. CA consulted with several Republican presidential campaigns during the 2016 U.S. presidential election.
According to The Guardian, CA gained access to Facebook data through a partnership with Aleksandr Kogan, a UK academic, and his company Global Science Research. Kogan presented his data gathering as academic, but agreed to share the information with CA. Facebook claims that while users gave consent to share data with Kogan as an academic, they did not consent to the secondary sharing with CA.
What happened?
Kogan collected information from people on Facebook under the pretense of operating a personality test for academic research purposes. As part of this work, he built an application called “thisisyourdigitallife” that gathered profile information from around 270,000 workers recruited from Amazon’s Mechanical Turk platform. (Mechanical Turk is a crowdsourcing marketplace where workers can accept tasks in return for small payments.) This data was gathered through a Facebook application programming interface (API), which allows developers to integrate with various parts of the Facebook platform.
While the Mechanical Turk workers could reasonably expect to have their own data harvested for academic purposes, Kogan’s application additionally collected profile data from each of the participants’ friends using APIs that were available at the time. In other words, it was possible to have your public profile data collected by virtue of being friends with someone who installed Kogan’s application.
While some within Facebook saw this data collection as acceptable for academic purposes, Kogan subsequently forwarded the collected data to Cambridge Analytica, which allegedly proceeded to use the data for distinctly non-academic activities. This would have been outside the scope of activity approved by Facebook and certainly outside the scope of consent provided by people using Kogan’s app.
When Facebook discovered in 2015 that Kogan had forwarded this data, it removed API access for Kogan’s app and requested that Kogan and CA certify that they had destroyed the collected data. While all parties provided this certification, it is apparent that copies of the information persisted and remained in use; The New York Times reported viewing samples of the CA data as recently as March 2018.
What is the privacy issue?
We see two main privacy harms. First, there is the potential harm that could come from the exposure of this data to a third party. While the data gathered was nominally public, the inferences that can be drawn from it may be very sensitive. If this information were leaked, sensitive attributes inferred about users, such as political or religious affiliation, could quickly become public. Such a disclosure could cost people jobs and insurance, and could chill free speech and association. Second, there was a clear lack of consent when data was gathered about friends of app users. Few, if any, of those people would have been aware that this was happening, much less able to consent to this “secondary” harvesting.
Why should I care?
What makes this saga so incredibly interesting is that it touches on a number of fascinating technology and policy issues: academic honesty, the relationship between academia and industry, the increasing use of technology to target elections, and large scale social manipulation by unscrupulous actors.
Hundreds of thousands of Mechanical Turk workers gave up their personal data under misleading circumstances, while also unintentionally betraying the privacy of their friends and family. This data was then used for partisan targeting of political advertisements.
How did this happen?
How was the data obtained?
As mentioned previously, developers can use APIs to interact with the Facebook platform, including gathering and recording data associated with a Facebook user’s profile, such as their likes and friends. Facebook calls its primary API the Graph API, as it allows developers to interact with the platform’s “social graph”: data about Facebook users’ friends, likes, associations, and interactions. The Graph API acts as a common language for communicating this data across other integrations, such as the ubiquitous “Login With Facebook” button.
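To make this concrete, the sketch below shows roughly what a Graph API request looks like from a developer’s perspective. The endpoint path, field names, and version string here are illustrative assumptions rather than a guaranteed-current interface; the point is simply that a single authenticated HTTP request returns structured profile data.

```typescript
// A minimal sketch of a Graph API call, assuming a valid user access token.
// The fields and API version are illustrative; consult Facebook's developer
// documentation for the exact endpoints and permissions in force today.
async function fetchOwnProfile(accessToken: string): Promise<unknown> {
  const url =
    "https://graph.facebook.com/v2.12/me" +
    "?fields=id,name,likes" +
    `&access_token=${encodeURIComponent(accessToken)}`;
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Graph API request failed: ${response.status}`);
  }
  return response.json(); // JSON describing the user's own profile and likes
}
```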
Kogan’s (and Cambridge Analytica’s) ability to harvest user data relied on now-removed functionality in Facebook’s APIs. While Facebook is currently on the 2.0 generation of its API, Graph API 1.0 allowed developers to obtain profile information from all of a user’s Facebook friends. A developer only needed permission to access the friends list of the app user. Once armed with this permission, the developer could gather the profile information of all of the app user’s friends by querying the Graph API. This was done without the friends’ consent: they had little control over the sharing of the public profile information (such as likes, public posts, and demographic data) needed to construct Cambridge Analytica’s psychometric profiles.
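A rough sketch of that harvesting pattern under Graph API 1.0 is shown below. It assumes a v1.0-era access token from a single app user who granted the friends-related permissions; the specific endpoints and fields are illustrative, but the shape of the loop (one consenting user, many non-consenting friends) is the important part.

```typescript
// Hypothetical sketch of v1.0-era friend harvesting. One app user's token
// was enough to enumerate their friends and pull each friend's public
// profile fields, even though those friends never installed the app.
interface FriendProfile {
  id: string;
  name?: string;
  likes?: unknown;
}

async function harvestFriendProfiles(userToken: string): Promise<FriendProfile[]> {
  const base = "https://graph.facebook.com/v1.0";

  // 1. Enumerate the consenting app user's friends.
  const friendsResp = await fetch(`${base}/me/friends?access_token=${userToken}`);
  const friends: { data: { id: string }[] } = await friendsResp.json();

  // 2. Fetch each friend's profile fields -- the friends themselves were
  //    never asked for consent to this collection.
  const profiles: FriendProfile[] = [];
  for (const friend of friends.data) {
    const resp = await fetch(
      `${base}/${friend.id}?fields=id,name,likes&access_token=${userToken}`
    );
    profiles.push(await resp.json());
  }
  return profiles;
}
```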
To compound matters, friends list permissions were easily obtained through version 1.0 of Facebook Login. Early versions of this interface did not allow users of an app (like Kogan’s “thisisyourdigitallife”) to selectively deny the permissions an application asked for. Instead, users were given an all-or-nothing choice: “Allow” or “Don’t Allow”. Furthermore, the crucial friends list permission was categorized as “basic information” alongside details like name and profile picture, which likely made users more willing to grant it.
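The snippet below sketches what a v1.0-era Facebook Login request might have looked like. The app ID and redirect URI are placeholders, and the permission names are assumptions based on the era’s documentation; the key point is that friends-related permissions rode along in the same single “Allow / Don’t Allow” dialog as ordinary profile access.

```typescript
// Hypothetical reconstruction of a v1.0-era Facebook Login URL. The scope
// bundles friends-related permissions with the user's own profile access,
// and the user sees only a single Allow / Don't Allow choice.
const APP_ID = "YOUR_APP_ID"; // placeholder
const REDIRECT_URI = "https://example.com/callback"; // placeholder

const loginUrl =
  "https://www.facebook.com/dialog/oauth" +
  `?client_id=${APP_ID}` +
  `&redirect_uri=${encodeURIComponent(REDIRECT_URI)}` +
  // Illustrative permission names: the app user's own likes plus their
  // friends' likes, requested in one all-or-nothing prompt.
  "&scope=user_likes,friends_likes";
```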
In summary, Cambridge Analytica was able to access this data because, before 2015, Facebook allowed such behavior by developers. Kogan’s academic cover may have helped him gather data at a rapid rate, but the basic technical capabilities that enabled this collection were inherent to Facebook’s API at the time.
Why is the data still around?
How did Cambridge Analytica retain access to the data despite having to certify its destruction? The problem with self-certification of data destruction is that it is impossible to attest to someone else that you do not have something: hiding information requires no active computation on the part of an information system, only that it be placed beyond the view of an auditor. Furthermore, no technical measure can prohibit all copying of data on a third-party system. Making copies of data is “free”, and doing so does not damage or otherwise affect the original information. Hence, CA could have made a show of deleting one or more copies of the data and certifying their destruction to Facebook’s satisfaction, without affecting any other copies it may have controlled.
Is this still possible?
Could a motivated developer still gather the same scale of personal data from Facebook without a user’s permission?
Using the Facebook API
We believe Facebook has effectively curtailed the possibility of non-consensually gathering data about friends’ profiles using its own APIs. In 2015, the company officially deprecated Graph API 1.0 and transitioned all developers to v2.0. As part of this process, the friends list permission was severely limited in scope and users were granted new control over information shared during the Facebook login process. This information from Facebook meshes with reports from Wired about the timeline in which Cambridge Analytica’s app was able to collect data (“Facebook shut down that capability for app developers in mid-2014, but offered some apps that were already up and running a small grace period”).
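For comparison, here is a sketch of the same friends query under Graph API v2.0. As before, the version string and fields are illustrative assumptions; the behavioral change described in the paragraph above is the point.

```typescript
// Under Graph API v2.0 and later, the friends query only returns friends
// who have themselves authorized the app, so the v1.0 harvesting loop no
// longer reaches non-consenting users.
async function listAppUsingFriends(userToken: string): Promise<unknown> {
  const resp = await fetch(
    `https://graph.facebook.com/v2.0/me/friends?access_token=${userToken}`
  );
  // Only friends who also granted the app access appear in `data`; other
  // friends are represented, at most, as an aggregate count.
  return resp.json();
}
```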
Facebook also instituted an app review process for developers asking for broad permissions, which makes it more difficult for an invasive app to act with impunity.
Using other methods
While it is now significantly harder to gather data directly from Facebook’s API, this does not preclude collection of large amounts of data through other means. For example, academic researchers use custom browser extensions to gather data about users’ web browsing habits. With some testing, a small team of software developers could use the same technique to harvest data unavailable through the Facebook API. Such activity can be difficult to distinguish from ordinary, user-driven Facebook browsing, and could be done at scale given sufficient development resources.
While detection is within Facebook’s technical capabilities, it would take a dedicated effort to spot the specific anomalies that distinguish this kind of automated browsing from normal use. This means that the main hurdles to executing this strategy would be getting the extension past the browser vendor’s approval process and convincing the vendor (and your users) that your autonomous browsing is harmless. Unfortunately, this may be a low hurdle; malware-laden browser extensions have been spotted in the wild, at scale.
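As a very rough illustration, the content-script sketch below shows why this is hard to distinguish from ordinary browsing: it runs inside the user’s own session on pages the user already visits. The selector is entirely hypothetical, and the extension manifest and exfiltration logic are omitted.

```typescript
// Hypothetical content-script sketch: runs on pages the user already
// visits, reads data rendered into the DOM, and could forward it elsewhere.
// The selector below is a placeholder, not a real Facebook page structure.
function scrapeVisibleProfileData(): string[] {
  const nodes = document.querySelectorAll<HTMLElement>("[data-profile-field]");
  return Array.from(nodes).map((node) => node.innerText.trim());
}

// Because the script executes with the user's own session and cookies, the
// resulting traffic looks like ordinary, user-driven browsing.
const harvested = scrapeVisibleProfileData();
console.log(harvested);
```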
The way forward
Any effective solution to the privacy and ethical problems raised by incidents like the Cambridge Analytica affair must primarily be rooted in public policy. Most available technical measures can only increase the difficulty of misuse; they cannot prevent it outright. Deterrence can only be accomplished if the consequences for abusive actions are high enough to outweigh any gains.
In this light, we find Facebook’s initial actions – suspension of involved individuals and companies from its platform – relatively weak. It is true that being cut off from the largest social network platform in existence is costly for those who rely on it for news and communication; but, ultimately, those functions can be served through other means. For sufficiently small organizations, a suspension is arguably meaningless: companies can dissolve and re-incorporate, acquiring new legal identities, with little evidence to connect the new organization to the discarded one.
How do we deter this type of action? Federal laws like the Computer Fraud and Abuse Act have been read to cover situations where a site’s terms of service are violated; given Kogan’s misrepresentation of academic use and the subsequent data transfer, such a law could apply here under Facebook’s developer policy. Enforcement of existing orders against Facebook by regulators like the Federal Trade Commission could also serve as a deterrent. State data protection laws, like Massachusetts’, also hold promise, as they can prompt enforcement actions from attorneys general. Finally, investigations by international data protection bodies like the U.K.’s Information Commissioner’s Office and the European Union’s privacy regulators can further increase pressure on Facebook.
However, to effectively deter future misuse of private information, we need legal mechanisms that apply stronger penalties for such activity, providing significant and lasting consequences for violating people’s trust in how their data is used.
Technical measures
While policy remains in our view the best tool for dealing with data misuse, there are some technical measures that can improve things.
Visibility
Research from IPRI has shown that increasing the visibility of data flows from smartphones enables users to make smarter privacy choices. While Facebook currently makes a list of connected applications visible under “App Settings”, it does not immediately surface the permissions held by those apps. Permissions are shown on an app-by-app basis, making it difficult for a user to compare permissions they have granted. For instance, if a user wishes to find all applications that have access to her email address, this is currently hard to do; she must click on every app in the list and read each popup.
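A small sketch of the missing lookup: given the per-app permission lists Facebook already exposes, inverting them answers questions like “which apps can read my email address?” in one pass. The data shapes below are assumptions for illustration, not Facebook’s actual settings format.

```typescript
// Hypothetical inversion of an app -> permissions listing into a
// permission -> apps view, the lookup that App Settings doesn't offer today.
type AppPermissions = Record<string, string[]>; // app name -> granted permissions

function appsWithPermission(apps: AppPermissions, permission: string): string[] {
  return Object.entries(apps)
    .filter(([, permissions]) => permissions.includes(permission))
    .map(([appName]) => appName);
}

// Example: which connected apps can access my email address?
const connectedApps: AppPermissions = {
  "thisisyourdigitallife": ["public_profile", "email", "user_friends"],
  "SomePhotoApp": ["public_profile", "user_photos"],
};
console.log(appsWithPermission(connectedApps, "email")); // ["thisisyourdigitallife"]
```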
Other large platforms like Google have side-stepped this issue by classifying permissions as “low risk” or “high risk” based on their sensitivity. We believe a similar step would greatly increase the usability of Facebook’s App Settings.
Transparency and Auditing
Auditability and transparency go hand in hand with visibility. We believe increased transparency about the data being accessed would be of great help to Facebook users. While Facebook currently surfaces the data an application is authorized to access, it doesn’t show a user what data has actually been accessed, by whom, and for what purpose. This type of transparency would allow users to see which developers potentially hold what data. Similar indicators have been shown to be successful in other contexts, such as showing which smartphone apps have accessed an iPhone’s location.
Transparency would have allowed many of the 50 million users whose profiles were accessed by Kogan’s app to see that their data had been gathered without consent, and would have allowed a subset of them to raise a red flag. While the average user might not care, some of those 50 million would likely be tech-savvy and privacy-conscious enough to alert Facebook, their friends, and potentially the media about what was going on.
This same data would also enable Facebook to present friendlier privacy options to its users. For example, if a user wants to “remove info collected by an app”, Facebook doesn’t surface what information has been collected and places responsibility on the user to contact the app’s developer.
While we don’t have visibility into Facebook’s internal practices, we believe increased data transparency would also allow Facebook to conduct internal audits of applications and developers. By keeping a record of what data has been transferred to whom, it may be easier for Facebook to detect large-scale misuse of profile data.
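To sketch what such a record might look like, and how the same log could serve both the user-facing transparency view and an internal audit, consider the following. The field names and threshold are assumptions for illustration only.

```typescript
// Hypothetical access-log entry and two uses of it: a per-user transparency
// view and a crude internal audit flag for large-scale harvesting.
interface AccessRecord {
  appId: string;
  accessedUserId: string;   // whose profile data was read
  requestingUserId: string; // which app user's token was used
  field: string;            // e.g. "likes", "email"
  timestamp: Date;
}

// User-facing transparency: what has actually been read about me, and by whom?
function accessesOfUser(log: AccessRecord[], userId: string): AccessRecord[] {
  return log.filter((record) => record.accessedUserId === userId);
}

// Internal audit: flag apps that touch far more distinct profiles than they
// have consenting users (an arbitrary, illustrative threshold).
function flagSuspiciousApps(log: AccessRecord[], ratioThreshold = 10): string[] {
  const profilesTouched = new Map<string, Set<string>>();
  const consentingUsers = new Map<string, Set<string>>();
  for (const record of log) {
    if (!profilesTouched.has(record.appId)) profilesTouched.set(record.appId, new Set());
    if (!consentingUsers.has(record.appId)) consentingUsers.set(record.appId, new Set());
    profilesTouched.get(record.appId)!.add(record.accessedUserId);
    consentingUsers.get(record.appId)!.add(record.requestingUserId);
  }
  return Array.from(profilesTouched.keys()).filter((appId) => {
    const profiles = profilesTouched.get(appId)!.size;
    const requesters = Math.max(1, consentingUsers.get(appId)!.size);
    return profiles / requesters > ratioThreshold;
  });
}
```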
Redesign the architecture around privacy and consent
Many of the privacy harms in this case come from the fact that third-party developers must hold data from Facebook in order to use it in their own applications. Abuse could potentially be mitigated by an alternative architecture where Facebook runs all server-side application code on their own infrastructure, allowing applications to use Facebook data in approved ways without requiring Facebook or its users to cede data ownership to developers. There are some practical issues that would have to be resolved for a solution like this to work. For example, effectively providing detailed information to mobile and desktop applications would be a technical challenge. However, something of this form could do a great deal to make privacy abuses by third-party developers more difficult to perform and easier to detect.
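One way to picture this architecture: instead of exporting raw profile fields, a developer submits a declared computation that the platform runs on its own infrastructure, and only the derived result leaves Facebook. The interfaces below are purely hypothetical, a sketch of the idea rather than any existing API.

```typescript
// Hypothetical "bring the code to the data" model: the platform executes a
// developer-declared function against profile data it never exports, and
// returns only the derived result to the third-party app.
interface ProfileView {
  readonly likes: readonly string[]; // read-only view supplied by the platform
}

// The developer declares what they want computed...
type DeclaredComputation<T> = (profile: ProfileView) => T;

// ...and the platform runs it in its own sandbox, logging the access.
function runOnPlatform<T>(profile: ProfileView, compute: DeclaredComputation<T>): T {
  // (In a real system: sandboxing, rate limits, and audit logging here.)
  return compute(profile);
}

// Example: the app learns a coarse interest category, not the raw likes list.
const interestCategory = runOnPlatform({ likes: ["hiking", "camping"] }, (p) =>
  p.likes.some((like) => like === "hiking" || like === "camping") ? "outdoors" : "other"
);
```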
Decentralization
On the other hand, the Cambridge Analytica case also demonstrates the pitfalls of an infrastructure in which sensitive personal data is held by one central authority, such as Facebook. This arrangement makes it more difficult for users to impose access constraints on data use and to maintain ownership of their own personal data. The team behind Solid, a research project at MIT, has been working toward a different approach: what if people’s personal information could be held separately from the applications that need it? One of Solid’s major objectives is to allow people to choose where their data lives (on their own personal server, or in a data hub of their choice) and to make that data available to applications according to their own preferences. Ensuring that users maintain control over their data through a mechanism resembling Solid is a first step toward preventing the sharing of their data without their consent.
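In a Solid-style model, an application would read a person’s data from wherever that person chose to host it, subject to access controls the person sets. The sketch below is a generic illustration using plain HTTP against a hypothetical pod URL; it does not use Solid’s actual client libraries or protocol details.

```typescript
// Hypothetical sketch of an app reading from a user-chosen data pod rather
// than a central platform. The pod URL, resource path, and token handling
// are placeholders; real Solid pods use standardized protocols and access
// control lists that the user, not the app, administers.
async function readFromPod(podUrl: string, resourcePath: string, token: string) {
  const response = await fetch(new URL(resourcePath, podUrl).toString(), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (response.status === 403) {
    // The data owner has not granted this app access to this resource.
    throw new Error("Access denied by the data owner");
  }
  return response.json();
}

// Example: the user decides where "profile.json" lives and who may read it.
// readFromPod("https://alice.example-pod.net/", "profile/profile.json", token);
```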