Beatriz Botero Arcila: “Who Owns Generative AI Training Data? Mapping The Issue And A Way Forward”

The Network Law Review is pleased to present a symposium entitled “Dynamics of Generative AI,” where lawyers, economists, computer scientists, and social scientists gather their knowledge around a central question: what will define the future of AI ecosystems? To bring all this expertise together, a conference co-hosted by the Weizenbaum Institute and the Amsterdam Law & Technology Institute will be held on March 22, 2024. Be sure to register in order to receive the recording.

This contribution is signed by Beatriz Botero Arcila, Assistant Professor of Law at Sciences Po and an Affiliate at the Berkman Klein Center at Harvard University. The entire symposium is edited by Thibault Schrepel (Vrije Universiteit Amsterdam) and Volker Stocker (Weizenbaum Institute).


1. Introduction

Access to data is a critical factor in developing and training artificial intelligence systems, and it plays a key role in AI progress.[1] With the rise of generative artificial intelligence (GAI) – the type of AI that powers popular applications like ChatGPT, DALL-E or Stable Diffusion – the demand for data has increased and the issue of data access and ownership has become more contentious. This raises questions about the entitlements over the information produced in our digital environment and society, and the best way to govern information and data produced online.

GAI generally refers to machine learning systems that are 1) “generative,” that is, they generate text, images, or other forms of output based on an input prompt, and 2) “foundation models,” a form of neural network model trained on very large data sets that can be adapted to a wide range of tasks.[2] Large language models like GPT or LLaMA work essentially by taking a sequence of words as an input and then predicting the next word to generate text.[3] Similarly, GAI models that produce images take labeled images and find statistical patterns in them, so they can then generate images that resemble the ones they were trained on.[4] In both examples, data is essential for models to learn the structure of language – its semantics and grammar – or the structure, features and characteristics of images. The more comprehensive and high-quality the data, the more capable the GAI model becomes at producing coherent outputs.[5]
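The next-word-prediction mechanism described above can be sketched in miniature. The following toy bigram model is only an illustrative analogy – real LLMs learn these statistics with neural networks over billions of parameters trained on vast corpora – but the core task is the same: given a context, predict the most likely continuation.

```python
from collections import Counter, defaultdict

# A tiny "training corpus"; real models train on trillions of words.
corpus = "the cat sat on the mat and the cat slept".split()

# Count which word follows each word in the training text (a bigram model).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation seen in training, if any."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # "cat" — it followed "the" twice, vs. "mat" once
```

The sketch makes tangible why data quantity and quality matter: the model can only reproduce patterns present in its training text, which is precisely why the provenance of that text raises the legal questions discussed below.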

Policymakers and scholars around the world, myself included, have advocated for governments and even the private sector to make information (safely) available for re-use.[6] Data, we have argued, should be governed more like a commons and more collectively (with some exceptions to mitigate privacy and security concerns), than as a proprietary resource.[7] A key premise of this idea is that, as a non-rivalrous good, data’s use by one actor does not prevent its use by someone else, and it can thus be widely shared to create value without losing its quality or being polluted.[8] At the same time, it is also generally understood that data is an excludable good, often because of different legal and technical interventions – such as technical decisions to store data in ways that are hard for third parties to access, or the establishment of individual rights over it, such as data protection rights.[9]

Data governance, we have also argued, should thus focus on encouraging open data, access to data and data sharing to help new entrants develop new applications and AI systems in a highly concentrated digital economy.[10]

There are, however, some novel issues in the use of data to train GAI models that make it worth revisiting this conversation:

First, though companies developing GAI are not particularly transparent about the data they use to train their models,[11] it is well known that one of the best sources of training data is the Internet. Techniques like web crawling have allowed actors like OpenAI and Google to train their models using data from sources as varied as social media, Wikipedia, or the news.[12] That data is “available” online does not, however, mean that it is always in the public domain and available for anyone to use.[13] Rights and entitlements over data online are governed by a mixture of regimes: personal data protection in some jurisdictions, IP protections and their many exceptions, and the terms of service of platforms and websites.[14] OpenAI and other companies are being sued in court for the unlawful use of this information, which potentially puts at risk the business models of rights holders, such as online news outlets.[15] With the appearance of GAI, often spearheaded by companies that are attracting billions of dollars in funding, like OpenAI, or that are already powerful and established actors, like Meta, it is worth examining the distributive effects of letting all actors freely use informational resources to which others hold entitlements in order to train their models, and how those effects relate to broader societal and policy goals and outcomes.[16]

Second, there is the effect of some of the outputs of GAI on our wider informational environment – which will then be used again by new and old technology companies to develop new services and products. The assumption of open data advocates is that reusing data will produce new and valuable outputs for society, but we have assumed that data sources remain the same. Some scholars and journalists have pointed out that the publishing of poor-quality and often inaccurate data and information produced by GAIs may be corrupting and poisoning these spaces by, for example, further replicating the biases or hallucinations often reproduced by these models.[17] Journalists have argued that untruths generated by chatbots that ended up online were then shown by search engines as facts, making search harder to trust.[18] If this is shown to be true and significant, GAI, and the main actors behind it, could be affecting the quality of the data “out there,” perhaps challenging some of the assumptions of open data scholars. Similarly, a December 2023 lawsuit by The New York Times argues that GAI risks the survival of independent journalism, which is widely recognized to be vital to democracy.[19] If GAI chatbots can compete against the media while relying on copying and using millions of articles by The Times and many other news outlets, GAI developers would be free-riding on news producers’ investment in journalism and risking their economic survival – and the provision of reliable and trustworthy information in what is already a deteriorated information environment.[20]

The importance of these latter, potentially systemic, effects should not be underestimated. Commons, and open access commons specifically, are institutional structures that rely on the premise that opening their usage to the public, including substantial usage, produces enough positive externalities to compensate for congestion or pollution costs.[21] Information and other non-congestible goods are indeed the easy cases of open access commons.[22] If AI developers are allowed to use the information available on the internet to train their models, however, and this leads to the congestion or pollution of some of that information – or the degradation of the broader information environment – then there are policy reasons for governments to intervene and safeguard the quality of the information environment. This, at the same time, would have to be balanced against the interest in encouraging or enabling the development of the AI industry.

Empirical research is still needed to assess the actual effect of GAI on the digital economy and the information ecosystem.[23] But the questions are worth considering. This Essay thus maps the main legal and policy questions at issue so far and then situates them within a wider conversation well known to internet scholars: the governance of data and the internet commons, this time in a world of GAI. In the end, this short piece argues that a focus on data governance is important – to ensure the renewability and the quality of the resource – but that focusing solely on it for economic purposes may be insufficient, as it overlooks broader societal goals like innovation and ensuring the quality of the information ecosystem.

2. Entitlements over the inputs: Who owns GAI inputs?

There seem to be three main disputed individual legal interests regarding the data that GAI companies are using to train their models: (1) data protection and privacy interests; (2) copyright interests; and (3) interests over platform-generated data, governed mainly by terms of service.

2.1. Privacy and data protection interests

The training of GAI raises privacy and data protection issues because, given the vast quantities of data required to train these models and the various sources used by technology companies to do so, it is generally likely that personal data is used in their training. This, in turn, raises questions over the fairness and transparency of these processing operations and the protections applicable to personal data publicly available on the web.[24]

Data protection laws, such as the EU’s General Data Protection Regulation (GDPR), typically require that companies have a legal basis for the processing, storage and sharing of personal data.[25] Legal bases usually include that the data subject has given consent for a limited type of processing; that processing is necessary to perform a contract to which the data subject is a party; that processing is necessary for compliance with a legal obligation to which the controller is subject; and that processing is necessary for the purposes of pursuing a legitimate interest of the controller which, in any case, cannot be one that ends up overriding the essence of data protection.[26]

This has already been an issue in Europe. In March 2023, for example, the Italian Data Protection Authority (DPA) requested additional information from OpenAI and eventually ordered a temporary ban on ChatGPT over alleged privacy violations.[27] The authority believed that the company lacked a legal basis justifying “the mass collection and storage of personal data … to ‘train’ the algorithms” of ChatGPT.[28] OpenAI clarified that it processes users’ personal data based on the legal basis of its legitimate interest, but it implemented a series of measures intended to mitigate the effects on Europeans’ rights.[29] It published a comprehensive information notice on its website, offered an option to export data and, interestingly, enabled an option to opt out of the processing of personal data by creating a form that allows EU users to do so when they can provide relevant prompts that result in the model mentioning the data subject.[30]

Though the Italian authority acknowledged the steps taken by OpenAI and the service was made available again in Italy,[31] fact-finding continues under the umbrella of an ad-hoc task force set up by the European Data Protection Board in April 2023. In addition to the work of the task force, as of the end of November 2023 several other DPAs have ongoing investigations into OpenAI regarding the lawful processing of personal data, and many are having discussions on the legal basis applicable to AI.[32]

2.2. Copyright interests 

The training of GAI raises copyright infringement issues because copyrighted materials have been used to train these models.[33] The question is, thus, whether accessing, preparing, analyzing, and mining data is an act of copyright infringement, and if so, whether there are any applicable defenses. Copyright risks are accentuated as GAI models have been shown to be capable of regenerating close-to-exact copies of the copyrighted text, source code, and images they were trained with.[34]

Copyright covers creative works such as texts, photographs, music, and artworks, but generally not facts, numbers, and general information.[35] When a work is protected, any use or collection of it without permission infringes on the author’s exclusive rights to decide on the reproduction, distribution, and adaptation of their work.[36] Copyright law, however, also provides for certain exceptions to this rule. Thus, when materials that are covered by copyright protections are used to train AI systems, the main legal question is whether any exception to the exclusive rights of authors or other rights holders applies.[37]

In the US, much of the conversation falls upon the interpretation of the fair use doctrine and the transformative nature of the work: Under the fair use doctrine, individuals are allowed to use sections of a copyrighted work, like quotations, for purposes like analysis, critique, news coverage, and scholarly publications. Building Google Books without authorization from the authors was considered fair use because of the “highly transformative purpose” of Google’s actions, as they transformed various copyrighted books into a useful search tool that didn’t compete with the original works.[38] A limit to this defense, however, is when the replication of someone else’s work has the potential to serve as a substitute for the original work. The US Supreme Court has thus highlighted the importance of considering the effect that the allegedly infringing work might have on the market for the original work, advising lower courts to weigh this factor heavily.[39]

As of November 2023, several copyright lawsuits have been filed against the main GAI developers. In September 2023, for example, seventeen fiction writers filed a lawsuit against OpenAI in the United States for copyright infringement, arguing that ChatGPT has read everything they have written and used it to hone its own writing skills, producing what they described as “derivative works.”[40] The authors – like other artists – also point out that ChatGPT cannot just summarize their work but can also imitate them, potentially taking paid opportunities from them.[41] Similarly, in late October 2023, the News/Media Alliance sent comments to the US Copyright Office focusing on GAI, highlighting that LLMs have been shown to reproduce the content they were trained with.[42] In a November 2023 ruling, however, a judge dismissed a copyright claim, arguing that the plaintiffs had not sufficiently alleged at the pleading stage that the outputs (or portions of the outputs) of Meta’s LLaMA were similar enough to their works to be infringing derivative works.[43]

In late December 2023, The New York Times filed a lawsuit for copyright infringement against OpenAI and Microsoft, one of OpenAI’s main investors, arguing that millions of articles by the NYT were used to train ChatGPT, which now competes with The Times “as a source of reliable information.”[44] The lawsuit also shows that, sometimes, ChatGPT reproduces NYT content exactly – so-called regurgitation.[45] This, according to Andrés Guadamuz, sets the lawsuit apart from previous ones, as these “regurgitations” could be the substantial reproductions needed to prove a copyright infringement.[46] OpenAI replied by arguing that AI training is fair use, but that it provides an opt-out option; that regurgitation is rare and that it is working to fix it; and that it had been working with The Times before the lawsuit to strike a partnership as part of broader efforts to partner with leading news organizations to support a healthy news ecosystem. In its response to the lawsuit, OpenAI also emphasized that no single source, such as the NYT, contributes meaningfully to the training of any of its models or is sufficient to impact future training.[47]

It is worth noting that not all creators or copyright holders find the use of their work for AI training problematic. In a piece for The Atlantic, author Ian Bogost considered this type of use a natural evolution of internet culture.[48] He argued that permissionless uses of creative works are part of the creative endeavor and proposed that “one of the facts (and pleasures) of authorship is that one’s work will be used in unpredictable ways.”[49] And indeed, whether establishing semi-absolute copyright rights is a policy beneficial to rights holders is a contested question, one long opposed by information scholars favorable to more commons-oriented forms of information governance.[50] In the EU, for example, the 2019 Directive on Copyright in the Digital Single Market (CDSM) introduced two exceptions to copyright for text and data mining. Article 3 provides an exemption for reproductive acts ordinarily protected by copyright law when it comes to text and data mining by research or cultural heritage organizations for the purpose of scientific research. Article 4 further expands the exemption to all actors for “reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.”[51] This exception applies so long as ‘opt-out’ or ‘contract-out’ options are provided for rightsholders.[52] Recital 8 of the Directive explains that these exceptions are intended to give organizations working on research and innovation legal certainty pertaining to the legality of data mining, a prevalent practice across the digital economy.[53] This would, at least for now, allow the mining of copyrighted works so long as there is an opt-out option for rightsholders.

Things may not stay this way, though. Already in September 2023, a draft bill proposed in France suggesting that authors should have rights over works generated by AI systems without direct human intervention raised significant concerns among commentators.[54] While the intention to protect authors is understandable, critics argue that such a broad approach could impede the development of AI systems and make jurisdictions adopting such measures less appealing to AI developers.[55]

2.3. Terms of service

Lastly, there are the websites that, increasingly, tend to make semi-proprietary claims over data via their terms of service. Reddit, for example, has recently made some changes to its APIs to prevent its content from being used to train AI tools and, according to the new terms, developers who use its APIs in ways intended for commercial usage are required to enter into a separate agreement with the company, most likely for a fee.[56]

The legality of web scraping, particularly of non-personal and copyrighted material, remains a contested issue. In hiQ Labs, Inc. v. LinkedIn Corp., the US Ninth Circuit initially held, in 2019, that hiQ Labs’ scraping of LinkedIn users’ public data likely did not violate the Computer Fraud and Abuse Act (CFAA), bringing into question the reach of the CFAA over web scraping. On remand, however, hiQ Labs was eventually found to be prohibited from scraping based on LinkedIn’s terms of service, which forbade it from doing so. Critics of that outcome argued that such a broad reading of LinkedIn’s terms of service limited competition and consumer choice in the data-driven economy and concentrated corporate power.[57]

3. The systemic question over the governance of information available online

The above issue-spotting exercise simply shows that existing regimes over data entitlements capture tensions between policy objectives as varied as protecting individual privacy, protecting creators and incentivizing (human) creation, enabling and facilitating new ventures, supporting a healthy information environment, and fostering scientific and commercial innovation. In several of these areas, judges and policymakers are facing legal and policy choices to fine-tune these entitlements and balance them for an era of GAI development. Different choices will have different distributive effects.

The second layer of this question, however, is that GAI outputs are increasingly part of the internet, and of the internet-supported information environment. This environment, in turn, is today central and structural to democracies and market economies. Indeed, researchers have shown that GAI models sometimes regenerate the data they were trained with, which can lead to the exposure of personal information in the training data set.[58] Copyright holders argue that GAI outputs sometimes reproduce their creations and can directly compete against them.[59] Importantly, there is a chance GAI outputs are making the internet, and the information available online, less reliable – a reliability that is necessary for functional democracies and market economies. The publishing of poor-quality and often inaccurate data and information produced by GAIs may be (further) corrupting these spaces by, for example, further replicating the biases or hallucinations often reproduced by these models.[60]

In this context, this section examines some of these issues in the light of the open access and information commons literature, which has been very influential in the data governance policy conversation in recent years. The question this literature often seeks to answer is how to govern data, as an economic good, to facilitate the provision of better data for the digital economy, but also how to do so with ethical and fairness considerations in mind.[61] To that end, it first introduces the two main theories in conversation in commons governance and then discusses whether there are relevant insights that may be applicable to the input-data context of GAI.

3.1. Key elements of commons and open access commons

Yochai Benkler explains that Garrett Hardin’s Tragedy of the Commons established the framework that later sparked and framed the conversation over the commons. According to Hardin, resources to which anyone has access would be overused and underinvested in, and eventually depleted. To avoid that, the resources had to be either privately owned or regulated by the state.[62] In response to this theory, two main commons schools developed. First, the Ostrom school showed that groups can solve these problems of collective action via their own collectively created systems of rules, defined neither by the state nor by the market.[63] Crucial to these arrangements, however, is that members of the clearly defined group have a bundle of rights that includes the right to exclude nonmembers from using the common resource.[64] Second, the school of the open commons developed around information and the internet and showed that certain resources, but especially certain informational resources, can be managed on the basis of symmetric use privileges open to all, rather than on exclusive collective or individual proprietary rights.[65]

Scholars like Carol Rose, Yochai Benkler, Brett Frischmann and Larry Lessig developed the intellectual and institutional history of open access commons with careful analysis of resources like unlicensed spectrum, the internet protocol, free and open-source software (FOSS), and Wikipedia, but also roads and public squares.[66] Jason Potts has integrated institutional economics as well, providing insights on how commons operate in innovation and entrepreneurship.[67] Yochai Benkler describes open access commons as “a family of institutional solutions that respond to three practical problems under certain resource conditions. The three practical problems are (a) high persistent positive externalities, of which nonrivalry in information goods is an extreme case; (b) uncertainty, under which exploration trumps appropriation and has its primary impact in innovation; and (c) (…) the risk that markets will drive resource utilization in ways that will lead to social instability or political intervention.”[68]

FOSS, for example, is a “licensing practice that voluntarily ‘contracts out’ of a proprietary regime” over software and adopts instead an open access commons.[69] As explained by Benkler, FOSS licenses create an open access regime for the software developed, which means that anyone can copy the code, modify it, use it, and redistribute the modifications. Some licenses create a simple open access regime, but some, like the GPL, impose “a reciprocity condition on the rights of any user who modifies and distributes the software.”[70] This essentially imposes a requirement to replenish the commons as a condition for making intensive uses of it.[71] Today, 97% of all software incorporates open-source software, and it is supported by a varied community of individual volunteers, nonprofits and companies.[72] Despite its success, recent scholarship has shown that open source is significantly vulnerable to cybersecurity threats (though not necessarily more vulnerable than proprietary code).[73] Chinmayi Sharma identifies a lack of institutions securing the code and has suggested a few avenues (as well as tradeoffs) for creating incentives to do so.[74]

Wikipedia, the most accurate encyclopedia in the world, is similarly an open access commons in the sense that anyone can edit it, whether they are logged in or not, and no one is paid for doing so. It has, however, a sophisticated system of governance that helps its community of editors guarantee its high quality. There is a set of rules – which include, for example, keeping a neutral point of view, assuming good faith, and ignoring all the rules that would “prevent you from improving, or preventing harm to, the encyclopedia”[75] – and a robust community of editors who object, question, and revert edits, which together guarantee the high quality of the articles.[76] Benkler explains that many other open access commons that are rival and require continuous reinvestment or maintenance, like roads or the electricity grid, are often integrated with forms of public provisioning or some other form of payment for use, while still maintaining the symmetric use privileges.[77] This echoes, in some ways, both Sharma’s call for some form of institutional arrangement that focuses on investing in securing open-source software, and Wikipedia’s strict rules to guarantee that the resource is well maintained.

3.2. Insights from the governance of open access commons to the governance of data

Nadya Purtova and Gijs van Maanen have recently mapped the academic and policy literature on data governance that draws from the literature on the commons.[78] Much of this literature departs from an assertion about data’s properties as an economic good – that it is a non-rival good – and offers different analyses of governance options based on enhancing or facilitating data access, sharing or collective governance. At the same time, it is recognized that, unlike information, data is excludable: data is non-rivalrous because it can be used infinitely, but it is excludable because access to data can be restricted through technical methods like encryption, in addition to legal measures that facilitate exclusion from digital data access.[79] From this insight, Purtova and Van Maanen submit that data is not in itself a common-pool resource, but that data can be seen as a part of other common-pool resources, such as scientific knowledge, data protection, or, perhaps, “the information environment.”[80] Thus, they argue that focusing solely on data governance for economic purposes is unproductive if the goal extends beyond ensuring an adequate quantity and quality of data. Models too focused on the economic qualities of data may secure data provision but fail to address broader societal goals like innovation, the digital economy, and privacy. Therefore, governance of the digital society should not exclusively revolve around data; otherwise, it may divert attention from other critical digital challenges.[81]

Because of this, Purtova and Van Maanen have significant reservations about whether Benkler’s commons-based peer production and governance framework can be applied to data, among other reasons because it was developed in relation to other goods – mainly information, culture and knowledge – and some of its assumptions about how different actors work voluntarily to produce and disseminate different goods may not apply to open data.[82] The key insight from the above examples of open access commons, however, is that open symmetric access – as in free to enter – does not need to mean free as in free riding. Rather, many kinds of open access commons have developed rules, and sometimes pay-to-use regimes, to guarantee the quality and safety of the common resource. Thus, as in the cases of Wikipedia and FOSS, the governance of the information environment in the GAI context, as it relates to data that is somewhat available online, could – and perhaps should – include rules about how to use data, how to “give back” to the information environment, and how to maintain the quality of the original resource: data. These requirements could, and perhaps should, be especially imposed on some of the largest actors, who are reaping the most economic benefits of the GAI economy.

At the same time, Purtova and Van Maanen’s larger insight on the limits of focusing on data governance should lead academics and policymakers to consider the limits of a data-focused governance regime when trying to balance interests such as innovation and the protection of established interests. As Andrés Guadamuz comments on the lawsuit by The New York Times, even if it is true that independent media is crucial for democracies, it is also true that The Times is one of the few outlets that have managed to develop a successful business model in the digital economy, and that many other news outlets have already faced important challenges in the digital economy, leading to a decline in trustworthy information. It may be that the economic reality of traditional media places them at a disadvantage vis-à-vis tech companies, but also that supporting a healthy information environment requires more structural attention that goes well beyond data governance (and the scope of this piece).[83]

4. Conclusion

This Essay mapped the key legal and governance questions regarding the training data of GAI. It suggested that regulators and judges are facing important questions about the individual entitlements – privacy, copyright, or via terms of service – that are to be recognized over the data GAI companies use to train their models. The answers to those questions may have important distributive effects and may affect the future of AI development, but also the rights of creators and individuals, and the dominance of GAI companies.

At the same time, it also suggested that, beyond the adjudication or recognition of individual entitlements, a wider question over the governance of the information available on the internet, and of the internet-supported information environment, may be at issue. As scholars and policymakers continue to observe how the GAI economy and ecosystem develop, it may be worth considering whether rules of access and maintenance should also be imposed on GAI developers. This could be independent of the eventual adjudication of the individual rights described in Part I of this Essay, and could be tailored to, for example, create obligations for bigger players who use more of the common resource than others, more in line with the literature presented in Part II. What that could look like is, however, for future research to develop.


Citation: Beatriz Botero Arcila, Who Owns Generative AI Training Data? Mapping The Issue And A Way Forward, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2023.


Special thanks to Giovanna Hajdu Hungria Da Custódia for her helpful research assistance. I used ChatGPT for writing assistance; all mistakes are of course mine.


  • [1] See European Commission, ‘Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions: A European strategy for data’ COM(2020) 66 final (Brussels, 19 February 2020), accessed May 5, 2023.
  • [2] Saffron Huang and Divya Siddarth, Generative AI and the Digital Commons (20 March 2023), accessed November 30, 2023.
  • [3] Meta, Introducing LLaMA: A foundational, 65-billion-parameter large language model (Feb. 24, 2023), accessed November 30, 2023; Simon Willison, Think of language models like ChatGPT as a “calculator for words,” Simon Willison’s Weblog (April 2, 2023), accessed November 30, 2023.
  • [4] Tom Hartsfield, How do DALL-E, Midjourney, Stable Diffusion, and other forms of generative AI work?, Big Think (Sept. 23, 2022), accessed November 30, 2023.
  • [5] Bowles C et al., GAN Augmentation: Augmenting Training Data Using Generative Adversarial Networks (25 October 2018), accessed November 30, 2023.
  • [6] See OECD, Data-Driven Innovation: Big Data for Growth and Well-Being (OECD Publishing, 2015), at 18, accessed 18 May 2023 [hereinafter OECD, Data-Driven Innovation]; OECD, Responding to Societal Challenges with Data: Access, Sharing, Stewardship and Control (OECD Publishing, 2022) [hereinafter OECD, Responding to Societal Challenges with Data]; Stefaan Verhulst et al., The Emergence of a Third Wave of Open Data: How To Accelerate the Re-Use of Data for Public Interest Purposes While Ensuring Data Rights and Community Flourishing, Open Data Policy Lab (2020); Beatriz Botero Arcila, ‘Future-Proofing Transparency: Re-Thinking Public Record Governance For the Age of Big Data’, Mich. St. L. Rev. (forthcoming, 2024).
  • [7] See OECD, Data-Driven Innovation, supra note 6; see in general Nadya Purtova & Gijs van Maanen, Data as an economic good, data as a commons and data governance, Law, Innovation and Technology (Nov. 01, 2023); Yochai Benkler, ‘Open Access Information Commons,’ Oxford Handbook of Law and Economics: Private and Commercial Law (Francesco Parisi, ed. 2016).
  • [8] Néstor Duch-Brown, Bertin Martens, and Frank Mueller-Langer, The Economics of Ownership, Access and Trade in Digital Data (JRC Digital Economy Working Paper 2017-01, 17 February 2017), at 11; see also European Commission, supra note 1; OECD, Data-Driven Innovation, supra note 6.
  • [9] See Angelina Fisher & Thomas Streinz, Confronting Data Inequality, 60(3) Columbia Journal of Transnational Law 829-956 (2022).
  • [10] See below discussion of Copyright Directive.
  • [11] See Rishi Bommasani et al., The Foundation Model Transparency Index (Oct. 19, 2023).
  • [12] Emilia David, Now you can block OpenAI’s web crawler, The Verge (Aug. 7, 2023); Andres Guadamuz, A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs (February 26, 2023) [hereinafter Guadamuz, Copyright Liability and Exceptions in Artificial Intelligence]; Twitter Grok.
  • [13] The public domain is a term from copyright law referring to creative works to which no exclusive intellectual property rights apply. Because no one holds the exclusive rights, anyone can legally use or reference those works without permission. Here, I use the term broadly, to refer also to data to which no data protection rights apply.
  • [14] See Beatriz Botero Arcila and Teodora Groza, ‘The New Law of the European Data Market: Demystifying the European Data Strategy’ (September 22, 2023).
  • [15] See Michael M. Grynbaum and Ryan Mac, The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work, The New York Times (Dec. 27, 2023).
  • [16] Jon Gertner, Wikipedia’s Moment of Truth, The New York Times (July 18, 2023), accessed Nov. 30, 2023.
  • [17] Benkler supra note 7.
  • [18] Will Knight, Chatbot Hallucinations Are Poisoning Web Search, Wired (Oct. 5, 2023), accessed Nov. 30, 2023; Melissa Heikkilä, How AI-generated text is poisoning the internet, MIT Technology Review (Dec. 20, 2022); Carol Rose, The Comedy of the Commons: Custom, Commerce, and Inherently Public Property, 53 University of Chicago Law Review 711 (1986).
  • [19] As Andrés Guadamuz emphasizes, however, “the economic reality is that traditional media needs more income sources, leading them to approach tech companies from a disadvantageous position.” Andrés Guadamuz, The Times lawsuit: The case and its wider implications, TechnoLlama (Jan. 5, 2024) [hereinafter Guadamuz, The Times lawsuit].
  • [20] See Grynbaum and Mac, supra note 15; The New York Times Company v. Microsoft Corporation et al., Case 1:23-cv-11195, Document 1, Filed 12/27/23 (Complaint).
  • [21] See Rose, supra note 18; Benkler, supra note 7.
  • [22] Benkler supra note 7.
  • [23] See Arvind Narayanan & Sayash Kapoor, The LLaMA is out of the bag. Should we expect a tidal wave of disinformation?, Knight First Amendment Institute at Columbia University (March 6, 2023), accessed Nov. 30, 2023.
  • [24] See Commission Nationale de l’Informatique et des Libertés, Artificial intelligence: the action plan of the CNIL (May 16, 2023), accessed November 30, 2023.
  • [25] See Article 6 GDPR
  • [26] Id.
  • [27] Garante per la Protezione dei Dati Personali, Intelligenza artificiale: il Garante blocca ChatGPT. Raccolta illecita di dati personali. Assenza di sistema per la verifica dell’età dei minori [Artificial intelligence: the Garante blocks ChatGPT. Unlawful collection of personal data. Absence of an age verification system for minors] (March 31, 2023), accessed November 30, 2023.
  • [28] Clothilde Goujard, Italian privacy regulator bans ChatGPT, Politico (March 31, 2023), accessed November 30, 2023.
  • [29] Garante per la Protezione dei Dati Personali, ChatGPT: OpenAI riapre la piattaforma in Italia garantendo più trasparenza e più diritti a utenti e non utenti europei [ChatGPT: OpenAI reopens the platform in Italy, guaranteeing more transparency and more rights to European users and non-users] (April 28, 2023), accessed November 30, 2023.
  • [30] Id.; OpenAI, OpenAI Personal Data Removal Request, accessed November 30, 2023.
  • [31] Natasha Lomas, ChatGPT resumes service in Italy after adding privacy disclosures and controls, TechCrunch (April 28, 2023), accessed November 30, 2023.
  • [32] CNIL, supra note 24; European Data Protection Board, EDPB resolves dispute on transfer by Meta and creates task force on ChatGPT, EDPB (April 13, 2023); László Pók, LinkedIn post (Nov. 23).
  • [33] Guadamuz, Copyright Liability and Exceptions in Artificial Intelligence, supra note 12.
  • [34] See Nicholas Carlini et al., ‘Extracting Training Data from Diffusion Models’ (Jan. 30, 2023), accessed November 30, 2023.
  • [35] See Guadamuz, Copyright Liability and Exceptions in Artificial Intelligence, supra note 12, at 13.
  • [36] See Guadamuz, Copyright Liability and Exceptions in Artificial Intelligence, supra note 12, at 15.
  • [37] Guadamuz, Copyright Liability and Exceptions in Artificial Intelligence, supra note 12.
  • [38] Howard Hogan et al., Copyright Liability for Generative AI Pivots on Fair Use Doctrine, Bloomberg Law (Sept. 22, 2023), accessed Nov. 30, 2023.
  • [39] Id.; Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. ___ (2023).
  • [40] Brittany Loggins, 3 things to know about why authors are suing OpenAI, Fast Company (Sept. 22, 2023), accessed Nov. 30, 2023.
  • [41] The Authors Guild, Press Releases: The Authors Guild, John Grisham, Jodi Picoult, David Baldacci, George R.R. Martin, and 13 Other Authors File Class-Action Suit Against OpenAI, The Authors Guild (Sept. 20, 2023); Lauren Leffer, Your Personal Information Is Probably Being Used to Train Generative AI Models, Scientific American (Oct. 19, 2023), accessed Nov. 30, 2023.
  • [42] News Media Alliance, White Paper: How the Pervasive Copying of Expressive Works to Train and Fuel Generative Artificial Intelligence Systems Is Copyright Infringement and Not a Fair Use, News Media Alliance (Oct. 31, 2023), accessed Nov. 30, 2023.
  • [43] Kadrey et al. v. Meta Platforms, Inc., United States District Court, Northern District of California (Nov. 20, 2023).
  • [44] Grynbaum and Mac, supra note 15; The New York Times Company v. Microsoft Corporation et al., supra note 20.
  • [45] Id.
  • [46] Guadamuz, The Times lawsuit, supra note 19.
  • [47] OpenAI, OpenAI and journalism, OpenAI (Jan. 8, 2024).
  • [48] Ian Bogost, My Books Were Used to Train Meta’s Generative AI. Good., The Atlantic (Sept. 27, 2023), accessed Nov. 30, 2023.
  • [49] Id.
  • [50] See James Boyle, The Second Enclosure Movement and the Construction of the Public Domain, 66 Law and Contemporary Problems 33 (2003).
  • [51] Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC (Text with EEA relevance), PE/51/2019/REV/1, OJ L 130, 17.5.2019, Art. 4 [hereinafter 2019 Directive on Copyright in the Digital Single Market (CDSM)].
  • [52] Thomas Margoni and Martin Kretschmer, A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology, 71 GRUR International 685–701 (2022).
  • [53] 2019 Directive on Copyright in the Digital Single Market (CDSM).
  • [54] Assemblée nationale, Proposition de loi no. 1630 visant à encadrer l’intelligence artificielle par le droit d’auteur [Bill no. 1630 aiming to regulate artificial intelligence through copyright] (September 12, 2023).
  • [55] Christophe Geiger and Vincenzo Iaia, Generative AI, Digital Constitutionalism and Copyright: Towards a Statutory Remuneration Right Grounded in Fundamental Rights, The Digital Constitutionalist (2023), accessed Nov. 30, 2023.
  • [56] Umar Shakir, Reddit’s upcoming API changes will make AI companies pony up, The Verge (April 18, 2023), accessed Nov. 30, 2023.
  • [57] See e.g. Ram Bhadra, LinkedIn: A Case Study into How Tech Giants like Microsoft Abuse Their Dominant Market Position to Create Unlawful Monopolies in Emerging Industries, 13 Hastings Sci. & Tech. L.J. 3 (2022).
  • [58] Carlini et al., supra note 34.
  • [59] News Media Alliance supra note 42.
  • [60] See Huang and Siddarth, supra note 2; Heikkilä, supra note 18; Knight, supra note 18; but see Sayash Kapoor & Arvind Narayanan, How to Prepare for the Deluge of Generative AI on Social Media, Knight First Amendment Institute at Columbia University (June 16, 2023) (arguing that social media remains a key factor in the distribution of GAI-generated content).
  • [61] See Purtova and Van Maanen supra note 7, in general explaining the limits of this focus on data as an economic good.
  • [62] Garrett Hardin, The Tragedy of the Commons, 162 Science 1243 (1968).
  • [63] See Elinor Ostrom, Background on the Institutional Analysis and Development Framework, 39 Pol Stud J 7 (2011).
  • [64] Elinor Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action (1990).
  • [65] Benkler supra note 7, at 1.
  • [66] Benkler supra note 7, at 10.
  • [67] See Jason Potts, Innovation Commons: The Origin of Economic Growth (2019).
  • [68] Benkler supra note 7, at 15.
  • [69] Benkler supra note 7, at 11.
  • [70] Id.
  • [71] Id.
  • [72] Synopsys, 2022 Open Source Security and Risk Analysis Report, accessed Nov. 30, 2023.
  • [73] See Chinmayi Sharma, Tragedy of the Digital Commons, 101 North Carolina Law Review 1129 (2023).
  • [74] Chinmayi Sharma, Open-Source Security: How Digital Infrastructure Is Built on a House of Cards, Lawfare (July 25, 2022), accessed Nov. 30, 2023.
  • [75] ‘Wikipedia: Eight simple rules for editing our encyclopedia’, accessed Nov. 30, 2023.
  • [76] See also Jane Im et al., Deliberation and Resolution on Wikipedia: A Case Study of Requests for Comments, Proceedings of the ACM on Human-Computer Interaction, Vol. 2, No. CSCW, Article 74 (November 2018).
  • [77] Benkler supra note 7, at 16.
  • [78] See Purtova and Van Maanen supra note 7, at 7; see also e.g. Brett M. Frischmann, Michael J. Madison and Katherine J. Strandburg (eds.), Governing Knowledge Commons (2014).
  • [79] Purtova and Van Maanen supra note 7, at 14; citing Charles I Jones and Christopher Tonetti, Nonrivalry and Economics of Data, 110 American Economic Review 2819 (2020).
  • [80] Purtova and Van Maanen supra note 7, at 28-29.
  • [81] Purtova and Van Maanen supra note 7, at 5.
  • [82] Purtova and Van Maanen supra note 7, at 31.
  • [83] See Guadamuz, The Times lawsuit, supra note 19.
