Data Access

ISD Explainers

July 4, 2023

ISD Germany, ISD UK

Tech Accountability and Safety

Ensuring public interest researchers have what is considered ‘meaningful access’ to social media data (data access) is essential for evidence gathering, informed decision-making, and platform accountability. Yet, researchers continue to face barriers to accessing the data needed to establish a complete picture of platforms’ content moderation and curation systems – and, more broadly, their impact on users and society. A data access infrastructure, backed by enforcement powers that are enshrined in legislation and co-regulatory frameworks, should aim to grapple with the information asymmetry (lack of equal access) between platforms and researchers.

Glossary

Application Programming Interfaces (APIs) are software intermediaries that allow two applications to communicate with each other. APIs have a huge range of uses, but in the context of this Explainer, they allow researchers to access certain types of data from some online platforms via requests. As an intermediary, APIs also provide an additional layer of security by not allowing direct access to data, alongside logging, managing and controlling the volume and frequency of requests.

Blockchain technology – the technology that underpins cryptocurrencies such as Bitcoin – commonly refers to a digital database that stores and distributes data in a decentralised and public peer-to-peer network. It stands out from other technologies due to its transparent and decentralised structure, in which data is stored in multiple locations and continuously compared and updated. Blockchain technology allows for pseudonymous transactions and communication, which makes it attractive for malign use.

Data donations (including crowdsourcing and surveying methods) involve users of platforms voluntarily reporting certain content to researchers through mechanisms such as browser extensions or reporting forms.

Data scraping is the process of collecting data directly and independently from a platform, typically by writing code to automatically process a website’s HTML/CSS (the code that the website’s visual interface is written in).

The Fediverse is an interconnected group of servers hosted by a multitude of individuals rather than one central company. Together the servers form a decentralised network.

Public interest research commonly refers to research with the explicit aim to develop society’s collective knowledge. Regulatory precedent suggests that public interest research must be independent of commercial interests and reveal the source of its funding. Public interest researchers are not necessarily linked to academic institutions and can also include researchers affiliated to non-profit or media organisations.

Sock puppet accounts impersonate users on a platform. Researchers use sock puppet accounts to understand what a particular user profile, or set of user profiles, may experience on a platform. The data generated by the platform in response to the programmed (fictitious) users is recorded and analysed.
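To make the glossary's definition of data scraping more concrete, the sketch below parses a page's HTML and pulls out the text of each post. It is a minimal illustration only: the `post` class name, the sample markup and the `extract_posts` helper are assumptions made for this example and do not describe any real platform's page structure. Real scraping also raises the Terms of Service and privacy questions discussed later in this Explainer.

```python
from html.parser import HTMLParser


class PostTextParser(HTMLParser):
    """Collects the text of every <div class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []    # extracted post texts, in page order
        self._depth = 0    # <div> nesting depth inside the current post
        self._buffer = []  # text fragments of the current post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested <div> inside a post: track it so we find the right closing tag.
            self._depth += 1
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Normalise whitespace before storing the post text.
                self.posts.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_posts(html):
    """Return the text of each post found in an HTML string."""
    parser = PostTextParser()
    parser.feed(html)
    parser.close()
    return parser.posts


# Invented sample page, standing in for HTML a researcher has already downloaded.
SAMPLE_HTML = """
<html><body>
  <div class="post"><p>First <b>public</b> post</p></div>
  <div class="sidebar">Navigation</div>
  <div class="post featured">Second post</div>
</body></html>
"""

print(extract_posts(SAMPLE_HTML))
```

Because the parser depends on the page's markup, even a small redesign of the interface can break a scraper, which is one reason scraped datasets are harder to maintain than data delivered through an API.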

Regulatory and co-regulatory frameworks

EU: The Digital Services Act (DSA) introduces harmonised rules across the European Union (EU) for intermediary services to ensure a safe and accountable online environment, including the effective protection of users’ fundamental rights online.

EU: Strengthened Code of Practice on Disinformation 2022 (CoPD) empowers industry to adhere to self-regulatory standards to combat disinformation.

EU: European Digital Media Observatory’s (EDMO) draft Code of Conduct on platform-to-researcher data access includes guidelines on how platforms can share data with independent researchers while protecting users’ rights, as described under Article 40 of the General Data Protection Regulation (GDPR).

US: Platform Accountability and Transparency Act (PATA), introduced by Senator Coons (D) alongside Senators Cassidy (R), Klobuchar (D), Cornyn (R), Blumenthal (D), and Romney (R), is a bipartisan bill that would support research about the impact of digital communication platforms on society by providing privacy-protected, secure pathways for independent research on data held by large internet companies.

US: Social Media Disclosure And Transparency of Advertisements Act of 2021 (Social Media DATA Act), introduced by Representative Trahan (D), is a bill that would require certain consumer-facing websites and mobile applications to maintain advertisement libraries and make them available to academic researchers and the Federal Trade Commission.

US: Digital Services Oversight and Safety Act of 2022 (DSOSA), introduced by Representative Trahan (D), is a bill that would provide for, among other things, the facilitation of independent research on covered platforms through the Federal Trade Commission.

US: American Data Privacy and Protection Act (ADPPA), co-sponsored by Representatives Pallone (D) and McMorris Rodgers (R) and Senator Wicker (R), is a bipartisan bill that would create a comprehensive federal consumer privacy framework, including exceptions for “publicly available information” and for “research” purposes.

A changing landscape – New rules versus restricted access

The lack of data access not only undermines the development of knowledge about human experiences, societal phenomena and trends in the online ecosystem, but will also soon affect the compliance assessments of obligations introduced by the EU’s Digital Services Act (DSA). The DSA, which entered into force in November 2022, is the first major legislation containing provisions on data access for researchers. These will allow researchers to access data from very large online platforms (VLOPs), meaning those with more than 45 million monthly active users in the EU, to conduct research on systemic risks. Such research may comprise monitoring platform actions to tackle illegal content, such as illegal hate speech, as well as a range of other societal risks such as the spread of disinformation, and risks that may affect users’ fundamental rights.

In the US, members of Congress have introduced numerous bills with provisions calling for external researcher access to platform data, including the Platform Accountability and Transparency Act, the Social Media DATA Act, and the Digital Services Oversight and Safety Act. If passed, these bills could address concerns about platforms’ content moderation – ranging from claims that platforms are biased and remove too much content, to criticism that platforms are not doing enough to tackle illegal hate speech.

While data access will be essential for national regulators who hold enforcement powers, as well as independent researchers who scrutinise platform action, some tech companies have started to restrict data access by cutting off or raising the costs of access to APIs. In February 2023, Twitter announced that it would no longer support free access to its API. The Coalition for Independent Technology Research released an open letter, criticising Twitter’s API plans, saying they “will devastate public interest research” – and noting that over 250 projects will be jeopardised by ending free and low-cost API access, including research into the spread of harmful content, mis- and disinformation, news consumption, public health, and elections.

Ultimately, data access for the purpose of public interest research is needed to support scrutiny and accountability of both platform action and government intervention – to ensure they are fit for purpose to protect users against harassment, abuse and incitement, while not setting precedent that threatens users’ rights of privacy and freedom of expression. Current barriers to access – including technological, legal and ethical challenges – often stand in the way of public interest research that aims to understand the spectrum of actors, content and behaviour on both mainstream and emerging alternative platforms.

Barriers to data access

Barriers to access currently risk undermining researchers’ ability to conduct causal research over time, for example, understanding the effects of platforms’ recommendations on user experiences. This is especially challenging when it comes to ‘disinformation studies’ which have been criticised at times for seeming to favour “rapid and attention-grabbing results over those deriving from more time-consuming and rigorous approaches.” Barriers have required researchers to resort to an array of independent data collection methods (such as sock puppet accounts or data donations), rather than accessing data directly from the platforms. This, combined with a lack of common data documentation practices and quality standards in the field, has made advancing cumulative research and peer-reviewing results more difficult. In sum, researchers are facing three types of barriers to data access:

Technological barriers may arise from platforms deliberately restricting access to data or having technological features which inadvertently create barriers. For example, certain content formats, particularly audio or audiovisual content, are hard to search through systematically, making video-sharing platforms such as YouTube or image-based platforms such as Instagram difficult to analyse at scale. Moreover, the use of blockchain technology by platforms such as Odysee and the emergence of decentralised networks such as the Fediverse pose relatively unexplored territory for systematic data collection.
Legal and/or ethical barriers may arise from platforms’ data privacy concerns. For example, using third-party technologies (such as browser extensions) for data donation purposes could lead to legal action where the platforms’ Terms of Service prohibit them. Furthermore, platform efforts to prevent automated data collection through scraping or other methods inadvertently hinder researchers from verifying platforms’ compliance with their own guidelines. Data retention is a further issue, since some research requires access to deleted content, including content that platforms removed for violating their Community Guidelines. For example, some types of research require examining deleted content that could provide evidence of criminal activities. Ethical barriers also arise from uncertain expectations of user privacy when the platform interface lies in a grey space between private and public, such as large WhatsApp or Telegram groups, as well as from emerging questions of informed consent from users.
Fragmentation barriers may arise when publicly available data is scattered across a vast number of sources or features that cannot be searched systematically via platform-wide functions or through an API. For example, Discord’s public groups can only be searched server-by-server (individual communities on the messaging platform are known as servers) and not in a systematic way. Moreover, platforms use metrics with varying definitions and opaque methodologies for how they are tallied. For example, how individual ‘views’ are counted and what they describe can differ between platforms. This adds to the difficulty of comparing behaviour and content across platforms.

Categorisation – What types of social media data are needed?

As the online communication and information ecosystems evolve, so too does the nature of conducting research. Before outlining the reasoning behind data access and potential research questions, it is important to explain what types of social media data the research community may be interested in:

User-generated data includes information about user content and behaviour on a platform. This comprises statistics gathered from content such as posts and comments, as well as user behaviour, including likes, shares and other types of engagement. This data can be ‘public’ (for example, a post accessible to any member of the public) or ‘private’ (for example, a post shared in a private or closed chat) – though researchers lack a common definition of ‘private online spaces’, which creates additional ethical barriers and considerations. Data may include non-aggregated, individual-level data (personal data) as well as aggregated data on reach (the number of unique users who saw a post at least once), impressions (the number of times a post was seen) and engagement (likes, comments, shares). Platform APIs may enable access to public data with varying requirements and restrictions.
Platform curation data includes information relating to how human and algorithmic systems moderate and sort (e.g. boost or demote) content on a platform. This also includes community and recommendation guidelines (e.g. content moderation policies), and how they are enforced, including by means of content removal, content demotion or account suspension. Transparency reporting of curation data may include aggregated information about content moderation decisions at scale, specifying the type of content, the detection method, the type of restriction applied, and whether removal or suspension was due to the Community Guidelines, legal requirements or government removal requests. This type of data is usually available in a non-machine-readable format (such as PDF documents containing tables of data), making further analysis difficult. On a granular level, platform curation data can include signals or ‘tags’ associated with specific types of content or accounts that are used by content moderation systems as well as recommendation algorithms.
Platform decision-making data includes information about internal decision-making, including decisions related to the introduction of new features on the platform or experiments conducted by platforms to test and evaluate the ranking algorithms of the recommender systems. For example, such data may include information about changes intended to increase certain types of engagement. Concretely, this can be the quantitative figures from the outcome of experiments with ranking systems. Information about methodology and decision-making would be accessible in the form of qualitative information. Researchers would rely on access to platform employees or company leadership, either through on-site inspections and interviews or access to internal documents, decision-making processes, and communications. In part, the so-called ‘Twitter Files’ uncovered this type of data, albeit with caveats regarding its selectivity and verifiability.
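The aggregated metrics named under user-generated data (reach, impressions and engagement) can be illustrated with a short calculation. The sketch below is a minimal example under stated assumptions: the event log, its `(post_id, user_id, action)` layout and the `summarise` function are invented for illustration, and real platforms may count ‘views’ and engagement differently, as noted under fragmentation barriers.

```python
from collections import defaultdict

# Actions counted as engagement in this sketch (an assumption, following the
# "likes, comments, shares" wording used in this Explainer).
ENGAGEMENT_ACTIONS = {"like", "comment", "share"}


def summarise(events):
    """Aggregate per-post reach, impressions and engagement.

    events: iterable of (post_id, user_id, action) tuples (hypothetical log format).
    """
    viewers = defaultdict(set)     # post_id -> set of users who saw the post
    impressions = defaultdict(int) # post_id -> total number of views
    engagement = defaultdict(int)  # post_id -> likes + comments + shares

    for post_id, user_id, action in events:
        if action == "view":
            viewers[post_id].add(user_id)
            impressions[post_id] += 1
        elif action in ENGAGEMENT_ACTIONS:
            engagement[post_id] += 1

    post_ids = sorted(set(viewers) | set(engagement))
    return {
        post_id: {
            "reach": len(viewers.get(post_id, ())),     # unique viewers
            "impressions": impressions.get(post_id, 0), # every view counts
            "engagement": engagement.get(post_id, 0),
        }
        for post_id in post_ids
    }


# Invented individual-level events; "alice" views post p1 twice.
EVENTS = [
    ("p1", "alice", "view"),
    ("p1", "alice", "view"),
    ("p1", "bob", "view"),
    ("p1", "bob", "like"),
    ("p1", "carol", "share"),
    ("p2", "alice", "view"),
    ("p2", "alice", "comment"),
]

for post_id, metrics in summarise(EVENTS).items():
    print(post_id, metrics)
```

In this invented log, post p1 has a reach of 2 but 3 impressions, because one user saw it twice. The example also shows why aggregated figures alone cannot be re-checked: the individual-level events are needed to reproduce them.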

Reasoning behind data access – Advancement of collective knowledge

Researchers from multiple disciplines are interested in understanding a range of emerging social phenomena such as the spread of health-related misinformation, growing distrust in institutions or news consumption. This in turn requires a better (long-term) understanding of the relationship between the use of social media platforms and observed trends. Without reliable and searchable data access, researchers lack the resources to monitor and externally assess compliance with regulations such as the EU’s DSA that cover a range of risks and platform functionalities. Beyond compliance with regulation, social media data can serve as a proxy to assess multiple societal phenomena and trends, as well as human behaviour, attitudes or opinions. Scholars such as Nate Persily have argued that access to social media data has become a prerequisite to investigating and understanding most contemporary problems “in the real world” – whether in the context of election cycles, foreign interference, public health, or societal attitudes towards climate change, immigration or LGBTQ+ rights. Social media data can further offer timely and comprehensive datasets compared with traditional, retrospective social science methods, especially in crisis situations such as a global pandemic, conflict, natural disaster or terrorist attack. The table below outlines indicative research questions, both directly and indirectly linked to compliance with platform regulation.

	Directly linked to compliance with current platform regulation	Indirectly linked to compliance with current platform regulation
Primarily user-generated data required	What is the prevalence of content that could be classified as “incitement to hatred” under the German penal code on Facebook? How many views did video clips of RT and Sputnik broadcasting activities receive on YouTube one month prior and one month after Russia’s invasion of Ukraine?	How do discussions around the COVID-19 pandemic differ across Facebook and Twitter? What online news outlets are shared most prominently among German-language influencers on Instagram?
Primarily platform curation data required	How effective are warning labels from independent fact-checkers or authoritative sources in reducing the spread of misinformation on Twitter? What types of users are more likely to be exposed to content categorised as hate speech? Do moderation decisions about what content is allowed on a platform affect some user groups disproportionately? Are Instagram’s ‘Explore’ page algorithms systematically amplifying the visibility of cyber-abuse content? What is the proportion of so-called ‘superusers’ that show hyperactive and abusive behaviour on Facebook? How can we measure the effect of ‘superusers’ on algorithmic feeds?	How does historical user behaviour impact YouTube ‘Shorts’ recommendation algorithms? What is the role and impact of feedback loops between user behaviour and algorithmic recommendations? How do users adapt their posting behaviour in response to a changed choice architecture of a platform (referring to the platform design)? For example, how did user interactions change when Facebook introduced the ‘angry’ reaction? To what extent does revealing the source of factual interventions affect the likelihood of users sharing misinformation? Does context added to posts such as Twitter’s Community Notes mitigate the spread of false and misleading information? To what extent do people from different points of view find them helpful? Does opting for a reverse-chronological timeline over an algorithmic feed alter the ‘stickiness’ of social media platforms (i.e. whether users spend more time, or engage more on a platform)?
Primarily platform decision-making data required	Are high-profile users treated preferentially in content moderation processes? Are TikTok’s algorithms intentionally demoting Black Lives Matter activists, i.e., reducing how frequently their videos appear on the ‘For You’ feed? Are users able to silence others through the misuse of moderation tools or through systemic harassment designed to censor certain viewpoints?	Is it possible to generate a quantitative estimate of the proportion of reach and engagement resulting from algorithmic ‘amplification’? How could platforms and researchers assess user behaviour in a ‘counterfactual’ scenario, e.g. comparing user groups engaging with algorithmic vs. reverse-chronological feeds? How are Meta’s Oversight Board decisions received by company leadership? What effect do these decisions have on content moderation practices of other companies? How do ranking and product teams at social media companies decide on and use experiments to test and evaluate changes to the algorithms?

Privacy-compliant access to data

Access to social media data, including personal data, can trigger data privacy obligations under the EU’s GDPR, in particular its special regime for data processing for research purposes. The European Digital Media Observatory’s (EDMO) Working Group on Platform-to-Researcher Data Access has published a draft Code of Conduct, which establishes a process under which researchers can be given access to personal data in compliance with the GDPR. Specifically, the Code notes that the GDPR provides exceptions to the limitations on processing personal data for research purposes, provided that appropriate safeguards are implemented. Safeguards can relate to informed consent, data storage (retention periods and criteria), pseudonymisation (processing personal data in such a way that it can no longer be attributed to a specific individual without additional information that is kept separately), commingling of data (i.e., combining crowdsourced data with API data) or sharing with third parties (such as research partners).
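As an illustration of one of these safeguards, the sketch below shows how pseudonymisation could be applied to a dataset before analysis. It uses keyed hashing (HMAC-SHA256), which is one common technique and not necessarily the approach prescribed by the draft Code of Conduct. The record layout, the key value and the function names are assumptions made for this example. The secret key plays the role of the separately held information needed to link pseudonyms back to accounts, so it would have to be stored apart from the research data.

```python
import hashlib
import hmac


def pseudonymise(user_id, secret_key):
    """Replace a user identifier with a keyed hash (HMAC-SHA256).

    The same user_id and key always give the same pseudonym, so a user's
    records can still be linked to each other within the dataset.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_records(records, secret_key):
    """Return copies of the records with 'user_id' replaced by its pseudonym."""
    return [
        {**record, "user_id": pseudonymise(record["user_id"], secret_key)}
        for record in records
    ]


# Hypothetical key: in practice it would be generated randomly and kept
# separately from the research dataset, with restricted access.
SECRET_KEY = b"example-key-stored-separately"

# Invented sample records.
RECORDS = [
    {"user_id": "alice_1990", "post": "Example post A"},
    {"user_id": "bob_the_builder", "post": "Example post B"},
    {"user_id": "alice_1990", "post": "Example post C"},
]

PSEUDONYMISED = pseudonymise_records(RECORDS, SECRET_KEY)
for record in PSEUDONYMISED:
    print(record["user_id"][:12], record["post"])
```

Note that this only replaces the identifier field. Post text, timestamps or profile details can still reveal who a person is, which is why pseudonymised data generally remains personal data under the GDPR and the other safeguards continue to apply.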

In the media

NPR: Online extremist communities fuel new radicalization pattern targeting minors

Gaming platforms exploited to promote hate and violence among minors

One in 5 young people in the UK have a positive view of Andrew Tate, percentages higher among minorities
