The Fight for Open Source in Generative AI

The fight for permissionless innovation is a timely and recurring theme. Today, it translates into the fight for open source in generative AI ecosystems. After addressing the technological (what open source is) and economic (how open source creates competitive pressure) aspects of the issue, I’ll offer my take on what lawyers should (and, in my view, shouldn’t) do about open source in generative AI.

1. The technology

Open source (“OS”) technologies play a critical, often invisible, role in today’s digital ecosystem. MySQL and PostgreSQL are popular open-source databases used to manage and store data. Browsers such as Mozilla Firefox and Chromium (behind Google Chrome) are built on open-source foundations. Android is based on a modified version of the Linux kernel. The WordPress content management system powers over 40% of all websites on the Internet, and over 60% of those built on a known content management system. Many programming languages, runtimes, libraries and frameworks are open source, including Python, Ruby, Node.js and TensorFlow. Many of the world’s computers also rely on open-source bootloaders to load the operating system, one of the most famous being GRUB (GNU GRand Unified Bootloader).

In the field of generative AI, the proliferation of open-source foundation models, starting with Google’s BERT, is giving rise to a thriving ecosystem, now led by Hugging Face (BigScience and BigCode) and EleutherAI. Meta has also joined the open-source movement in recent months by releasing the weights of its LLaMA model. Meanwhile, open-access (“OA”) foundation models are also emerging, where the company releases an API but not the model or training data. OpenAI’s GPT models are one such example. Hugging Face hosts an Open LLM Leaderboard that lists dozens of open-access models.

Taking a step back, OS/OA models have different value propositions than closed-source models. Sure, closed-source models tend to have better performance and commercial support. But OS/OA models are more easily audited. In a world where AI can be used to seize power, having the ability to check AI code cannot hurt. And OS/OA also brings dynamic competition to AI ecosystems, which brings me to the market.

2. The market

As Sandy Pentland and I wrote in a recent working paper, the existence of open-source and open-access models could make a notable difference compared to the early days of Web2 giants’ core services, such as search, social media, etc. The openness of these models means that they can spread easily. They are up for grabs. And people are in fact grabbing them, as evidenced by the exponential growth of the Open LLM Leaderboard.

The momentum generated by open source is frightening big tech companies. In a leaked internal memo, Google cited open source as a key reason why the company is unable to win the race for foundation AI models. Google being scared is a good thing. It means Google is innovating as fast as it can to stay ahead. It means there is true competition.

Looking ahead, the question is whether OS and OA solutions have a viable path to deliver compelling features and continue to improve, or whether a handful of private companies are likely to take over. I would argue (as we did here) that answering this question first requires distinguishing between the different types of foundation models underlying GenAI. I will focus here on general public foundation models (and the products built on top of them), such as ChatGPT. Second, we need to consider two separate stages: (1) the ability to “reach quality,” and (2) the ability to benefit from increasing returns.

When it comes to “reaching quality,” no AI product can survive if users are highly disappointed by its poor features. In turn, companies that train AI systems must have access to large and unique datasets, and they must be able to afford the cost of compute. (Will these costs continue to fall and eventually become irrelevant? Nvidia, whose shares are up 200% since the beginning of the year, claims that using its GPUs has reduced the price of training LLMs from $10 million to just $400,000.)

But that is not enough. AI products with the best features won’t necessarily survive. These products can only survive if they benefit from strong increasing returns. And what is the source of these increasing returns? Users. Having users means having revenue, which allows you to cover the cost of compute, buy access to unique datasets, and hire great employees. It also means having companies develop compatible products (see ChatGPT store)… All of which leads to a better product, which attracts more users, etc.
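
For readers who prefer to see the loop in motion, here is a toy sketch of the cycle just described (users bring revenue, revenue improves the product, a better product attracts users). Everything in it is an assumption made for illustration: the `Product` class, the `reinvest_rate` and `total_new_users` parameters, and the numbers themselves are hypothetical and not drawn from any dataset or from our working paper.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    quality: float  # abstract quality score (hypothetical units)
    users: float    # current user base


def step(products, reinvest_rate=0.05, total_new_users=1000.0):
    """Advance the toy market by one period.

    Assumption 1: new users split across products in proportion to quality.
    Assumption 2: revenue is proportional to the user base, and a fixed
    share of it is reinvested into quality.
    """
    total_quality = sum(p.quality for p in products)
    for p in products:
        p.users += total_new_users * p.quality / total_quality
    for p in products:
        p.quality += reinvest_rate * p.users / 1000.0


def simulate(products, periods):
    """Run the feedback loop for a number of periods and return the products."""
    for _ in range(periods):
        step(products)
    return products


# One product starts with better features, the other with more users.
better_features = Product("better_features", quality=1.2, users=100.0)
bigger_base = Product("bigger_base", quality=1.0, users=5000.0)

for p in simulate([better_features, bigger_base], periods=10):
    print(f"{p.name}: quality={p.quality:.2f}, users={p.users:.0f}")
```

In this toy setup, the product with the larger user base ends up with the higher quality score, even though it started with weaker features. That is only what these assumed rules produce, but it illustrates why the best features at launch are not enough on their own.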

What I just described is the “natural” cycle of competition within generative AI. Open-source and open-access solutions have a shot. But that window of survival will close if regulation pushes OS/OA out of the market. Which brings me to the law.

3. The law

The fact that incumbents feel threatened by open-source generative AI creates competition. That is a good thing. But it also creates an irresistible desire for regulatory capture.

We are already seeing a push to denigrate open source. OpenAI’s co-founder and chief scientist insisted that “at some point, it will be quite easy, if one wanted, to cause a great deal of harm with [open-source] models.” A researcher from the same company tweeted that “an important test for humanity will be whether we can collectively decide not to open-source LLMs that can reliably survive and spread on their own.” Nothing less.

I expect the pressure to intensify. As Elting E. Morison described in “Men, Machines, and Modern Times,” those who benefit from current technologies generally welcome new ones in three stages: first, they ignore new technologies; second, they downplay their chances of success with seemingly rational arguments; third, they engage in name-calling. Applied to our topic, it did not take long for some companies and politicians to stop ignoring the new wave of AI (stage 1). The advent of ChatGPT marked the beginning of the hostilities. Some are now busy explaining that AI cannot scale, that AI is stupid and limited, etc. (stage 2). And others have moved directly to a fear strategy (stage 3). They call AI dangerous, a threat to human life, to our democracies, etc.

A little (insider) bird tells me (…) that efforts to denigrate generative AI will increasingly focus on open-source solutions in 2024. On the social front, open-source solutions will be labeled anti-democratic because they promote freedom of speech to a degree that is generally disputed. On the economic front, open source will be described as unfair, because open-source models are often trained on data without compensation, and because emerging business models are not easily controlled.

We are already seeing some large corporations pushing for regulations that require all companies to pay licences and royalties to content providers whose publications are used to train AI. Conveniently, large companies can afford these costs, while smaller and open-source competitors often cannot. Companies that lean toward closed-source AI are beginning to market themselves as “the good guys” who are willing to pay content providers, as opposed to the “bad open-source guys” who just “steal” content. They are calling for regulation of generative AI that favors the good guys by increasing compliance costs. For example, Sam Altman is calling for the US government to issue licences “for development and release of AI models above a crucial threshold of capabilities,” which OS/OA models won’t be able to get without an army of lawyers.

To be clear, I am not saying that regulation should be avoided at all costs because it always comes with compliance costs. But I am saying that we should consider whether the objective any regulation is trying to achieve is worth eliminating the competitive pressure exerted by OS/OA players. And we should be transparent about the trade-offs.

Now, there is a way to achieve transparency in this area: have policymakers publish impact assessments, have the industry respond to those assessments, and have policymakers respond publicly to the arguments raised by industry. These sets of responses will make the trade-offs explicit. This will at least let us skip Morison’s three stages and move on to what I would call stage four: sensible regulation for a complex world. Oh, but one small detail: policymakers must have the right incentives to engage in such dialogue. As Charlie Munger once said, “show me the incentive and I will show you the outcome.” Let’s start thinking about that.

Thibault Schrepel

***

Citation: Thibault Schrepel, The Fight for Open Source in Generative AI, Network Law Review, Winter 2024.
