Stephen Dnes: “Generative AI: The Input Data Riddle”

The Network Law Review is pleased to present a symposium entitled “Dynamics of Generative AI,” where lawyers, economists, computer scientists, and social scientists gather their knowledge around a central question: what will define the future of AI ecosystems? To bring all this expertise together, a conference co-hosted by the Weizenbaum Institute and the Amsterdam Law & Technology Institute will be held on March 22, 2024. Be sure to register in order to receive the recording.

This contribution is signed by Stephen Dnes, Partner, Dnes & Felver PLLC; Associated Senior Lecturer, University of Surrey; and Visiting Lecturer, Royal Holloway, University of London. The entire symposium is edited by Thibault Schrepel (Vrije Universiteit Amsterdam) and Volker Stocker (Weizenbaum Institute).

***

Abstract

Generative AI systems create new outputs from input data. This raises an issue under existing laws, notably the GDPR, which focus chiefly on the conditions of data input. Particular issues arise with consent-based Personal Data input models and the degree of control over the Personal Data inputs they envisage. The issue with risk-based evidence for identity revelation in input data is also assessed with reference to pertinent legal developments in the EU, UK and USA. This paper will assess the theoretical context surrounding these aspects of the existing laws and their implications for generative AI applications. It will suggest approaches to avoiding some of the issues experienced with the current generation of laws as they are updated for AI.

1. Introduction

The promise of generative AI is that by creating novel outputs, in some sense, it provides the unknown. This is so whether the output created is new, or is existing knowledge assembled through new and thus formerly unknown pathways. This fundamentally differs from existing computing systems which improved the efficiency of data handling without themselves generating a formerly unknown sophisticated output.

In principle, an innovation machine is a good thing for any society that, on balance, favors pro-innovation policy positions. As in the title of the recent book, we have all been waiting for the flying car for much too long.[1] However, the existing generation of laws and regulations on data handling starts from assumptions which do not, typically, encompass a machine making its own transformative use. Instead, they focus chiefly on the input process by which data enters the system. Thus, the possibility of a machine which, effectively, creates its own inputs is not directly catered to.

The aim of this paper is to survey these tensions and provide recommendations for their resolution. First, the paper reviews the seminal paper by Latanya Sweeney which attempted to show that a specific individual’s identity could sometimes be revealed from three non-personal data points. Second, it assesses how different laws have approached this identity revelation risk. Third, the paper assesses how identity revelation relates to privacy and the possibility that generative AI systems will create identity-linked outputs without necessarily having been fed identity-linked input data. This is referred to as the unknown future uses issue, and the paper considers how it is addressed in the current nascent AI laws. Finally, the paper concludes with some infrequently asked questions relevant to the role of data in AI systems.

2. Identity revelation: Risks to specific individuals

The starting point for conventional analysis of harm from data is the question of identity-linkage. That is, there must be a data point related to a particular living individual. This distinguishes laws giving control over data from more general regulation of statistical insight. Whereas a general statistical property is simply a reflection of societal conditions, information about someone gives rise to regulation arising from a different premise: that one should be able to control, or at least know of and be able to correct, information about oneself, including in systems owned or controlled by others.

This right is a classic Hohfeldian right: it creates duties in others.[2] They must give access to systems which would otherwise sit beyond reach. The liberty of the system operators is thus constrained to give effect to the right. This is significant: property rights are attenuated and some sticks in the bundle of rights are effectively handed to others where information is about them.

Thus, familiar property rights issues emerge. Most obviously, there is the numerus clausus problem, that is, the classic property law notion that property rights exist in a closed list. The distinguishing features of property rights – especially, their ability to bind others – cannot simply be conferred by individuals at their “fancy and caprice”[3] but must instead correlate to existing recognized rights.

For data about people to be protected in the way intended, this is arguably necessary: it could hardly work without an exception to private property rights of at least some sort since without the exception to property rights there would not be the desired ability, at least for Personal Data, for the individual to control information where it is in somebody else’s system.

So, if it is accepted that there are fundamental rights in Personal Data, then the key question is how broad such an exception should be, and the basis of it. For instance, is there a property rule or a liability rule in the data?[4]

A discretionary open-ended right would effectively give control of the information systems owned by others even where there is no link to a particular individual. This may be desired in some quarters, and could certainly be abused, especially if industrial policy considerations were to creep in. There is also the visible tendency of some data protection complainants to omit analysis of the likelihood of individual identifiability from their complaints.[5] There are arguments about the role of data in society and specifically the role of statistical insight.[6] While these are interesting questions, it is submitted that they are properly addressed as policy questions or legal questions about the use of data, and which data is actually Personal Data, rather than the scope for individuals to control systems owned by others.

Thus, to avoid an unprincipled exception to property rights doctrines, the critical question for the data protection laws and regulations is – at least at the threshold – whether data relating to an individual is involved. In itself, this is a simple binary question: About Person X? Y/N. But given the richness of data sets, this question quickly moves to the closely adjacent issue of how much certainty is permitted in potential identity-linking. This is where the law has responded to a seminal contribution from Latanya Sweeney, to which we now turn.

2.1. Identity linking risks: Match keys across data sets

In 2002, Latanya Sweeney published an important article on the concept of anonymity.[7] The paper used 1990 US census records and influentially demonstrated that just three data points – 5-digit ZIP, gender, and date of birth – would result in a unique or nearly unique data point for 87% of the population. Memorably, she identified the Governor of Massachusetts in the list.

Thus, data combinations could result in uniquely identifiable data points. Whether these data points relate to individuals is, however, a further question. As we shall see below, some laws have taken a precautionary approach to the risk that identity linking can occur given the ubiquity of unique data points.

However, it bears emphasis that this does not follow directly from the noted result in the Sweeney paper. The 87% statistic calls to mind dossiers and files about particular people: as though 87% of the population has a secret file about them. But this does not follow from the result. The question is whether the unique data point is matched to an individual.[8] There is also emphasis in the original Sweeney paper on the fact that it is the particular data sets (census data, ZIP code and date of birth) that drive uniqueness. The result may be confined to data sets with rich uniqueness properties.
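The distinction between uniqueness and identity linkage can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the records, the register, and the function names `unique_fraction` and `link` are invented for illustration and are not taken from the Sweeney paper. It assumes the three match keys discussed above (ZIP, gender, date of birth) and shows that a record can be unique in a data set without being attributable to anyone until a second, identity-linked data set with the same keys is introduced.

```python
from collections import Counter

# Hypothetical de-identified records: (zip, gender, date_of_birth, attribute).
deidentified = [
    ("02138", "M", "1945-07-31", "diagnosis A"),
    ("02138", "F", "1962-03-14", "diagnosis B"),
    ("02139", "F", "1971-11-02", "diagnosis C"),
    ("02139", "F", "1971-11-02", "diagnosis D"),
]

# Hypothetical identity-linked register sharing the same three match keys.
register = {
    ("02138", "M", "1945-07-31"): "Person X",
}


def unique_fraction(records):
    """Share of records whose (zip, gender, dob) combination occurs exactly once."""
    keys = [r[:3] for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def link(records, identified):
    """Attribute a record to a person only if its match key is unique in the
    data set AND the same key appears in an identity-linked data set."""
    counts = Counter(r[:3] for r in records)
    return {
        identified[r[:3]]: r[3]
        for r in records
        if counts[r[:3]] == 1 and r[:3] in identified
    }


print(unique_fraction(deidentified))  # share of unique records
print(link(deidentified, register))   # linkage with an identified register
print(link(deidentified, {}))         # no register, so nothing is linked
```

In this toy example half of the records are unique, yet only one of them can be attributed to a person, and none can be when no identity-linked register is available. That is the sense in which uniqueness alone does not amount to identification.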

Despite this care to contextualize the issue in the original paper, the paper has come to support a much broader data-truncation agenda than was perhaps ever intended, as explored further below in relation to the GDPR.

Consider the following examples of data points, which would be likely to fall within X% uniqueness:

Had a coffee at 7:00 AM, wears red socks, likes to run.
Listens to classical music, reads a newspaper every day, watches documentaries.
Does grocery shopping on a Friday, a school run every day, and listens to songs about dinosaurs on the way.

All of these are useful data points for various value-adding purposes. Without any link to identity, they are simply data points. As is sometimes missed, however, the point of the paper is to consider contexts in which uniqueness is a problem – and when that is so, it will be a significant issue, e.g., for health-related data. In those instances, particularly “identifiable” data points, e.g., date of birth, should be stripped out. But this is far from a global property of data sets and the paper does not set out to regulate cross-correlations between music, beverage and shopping data as in the examples above.[9]

Thus, we can advance an important proposition: it is not the uniqueness of data in itself that drives any harm. The identity linking is the true risk factor as regards revealing identity, and it requires different analysis.[10]

This is significant because many perceived risks related to generative AI arise because of impacts on individuals in terms of their identity, and not the mere creation of data, however finely grained that data might be. Some of the most prominent examples – deep-fakes; identity theft – are harmful only because they are identity-linked. In such cases, they can be very harmful. But the move from harmless to harmful hinges entirely on the linking of identity, as distinct from other online harms which are not linked to identity and do not therefore engage the regulation of Personal Data.

Whatever boundary is created on data use based on the concept of identity linkage will circumscribe the data set and therefore its utility. The decision as to ex ante precautionary vs ex post evidence-based analysis will trade off the richness of utility against the level of protection – and not just in terms of static insights but unknown future dynamic ones too. This is particularly so for generative AI applications where the AI may learn how to link to identity and where safeguards against this outcome can help support a richer data set, with more insight, while still preventing harm to individuals.

Thus, there is a need to weigh the risk of identity revelation and the utility of data sets. We shall now turn to whether the law does so.

2.2. Legal and regulatory approaches to identity linkage risks

The input data identifiability debate is essentially a discussion as to the relative roles of risk- and hazard-based interpretation. This then affects the use of Generative AI systems to the extent that the data for them is truncated.

There is no single approach to “privacy” worldwide. Some jurisdictions have chosen, deliberately, not to regulate some species of data, or not to regulate them as strictly, because of a greater emphasis on the ability to handle data, and the pro-consumer innovations this can bring.[11] Any global system would need to take care to preserve this diversity, and to avoid inadvertently applying rules that are stricter than a jurisdiction has chosen.

Differences are subtle but significant. A prominent example is the different treatment of pseudonymization and re-identification risks under the EU’s General Data Protection Regulation (“GDPR”) and the California Consumer Privacy Act (“CCPA”). Whereas the GDPR takes a precautionary approach, providing jurisdiction to regulate even potential re-identification under the definition of Personal Data in Art 4(1),[12] the CCPA takes a risk-based approach under which data is not regulated provided that reasonable risk-based safeguards are in place under the definition of exemptions in 1798.145. This reflects different underlying regulatory philosophies on ex ante vs ex post regulation, although that can be an overgeneralization: as we shall see, a German settlement with Google is very similar to a US one in this regard. So, there will always be differences and patterns of convergence and divergence on these critical points.

Still, on balance, it is clear from the GDPR and CCPA that there is a tendency towards more ex ante regulation in the European Union, whereas in the United States, there is more emphasis on business liberty, provided that the business is taking reasonable care in context. This finds expression in the choice of ex post reviews (e.g., FTC cases) in the US, and the reasonable risk-based approach to identity revelation safeguards in the CCPA, whereas the GDPR in the EU has tried instead to set ground rules on a somewhat precautionary basis.

***

Examples of divergence on the data control assumption

CCPA

CCPA is a significant case in terms of different approaches to privacy: as explored further below, taking reasonable care with regard to organizational and technical measures to de-identify a data set can be enough to discharge the duty on the business, because CCPA takes a risk-based approach to identity revelation. In the EU, by contrast, there is always residual jurisdiction over “legitimacy” of Personal Data handling under Art 6(1)(f) GDPR. The logic following cases such as SRB and Breyer (explored further below) is that identity linkage is presumed unless disproved. This results in a hazard-based approach to identity revelation from even potential data relinking. This is visible in debates such as whether and when pseudonymized data remain regulated. These debates do not have clear answers, and compared with CCPA, the GDPR approach is more cautious.

No general regulation of data handling in the USA

It remains the case that, for now, the US does not have a federal equivalent to GDPR, and this decision not to require user “control” over input Personal Data where risks are low should also not be assumed away. This is important because if transaction costs are high, then requiring consent in low risk cases can chill innovation for no commensurate benefit.

Proposals such as the Gillibrand and Klobuchar bills for a federal privacy law would have applied different approaches than GDPR and do not necessarily mandate control. For example, the Gillibrand bill proposed regulating high-risk activities but would have left processing liberty unless there was evidence of high risk. The Klobuchar bill emphasizes transparency (such as disclosures), and frames its opt out right so that a service provider can still condition access to a system on the provision of data if the system is otherwise “inoperable”.

This is not the same thing as giving end users a control or consent right in all cases. Instead, the bill chooses to prioritize the ability to run a data-driven business model over “control” in cases where this is necessary for “operability.” It deliberately does not align to GDPR in this regard, reflecting a different prioritization of consumer interests.

UK rethink on GDPR after Brexit

Another significant example can be seen in UK proposals to alter the UK implementation of GDPR following independence from the European Union. It remains to be seen what form this would take, but the UK Government’s consultation on the changes spoke to easier data handling for innocuous use and moves towards the risk-based approach seen under laws such as CCPA. This is developed further below in relation to the Data Protection Bill (No. 2) currently before Parliament, which in rejecting the current hazard-based approach would move to a risk-based approach to identity revelation.

***

A major example of this divergence can be seen in assumptions on data control. Whereas there is an ex ante right to control data in the EU under the GDPR (a hazard-based approach), there is instead a risk-based approach looking to reasonable safeguards under the leading US state privacy law, the California Consumer Privacy Act.

Looking at these partially diverging and partially converging approaches, what then is the exact scope of identity-linked data for the purposes of data regulation? And by extension, what is the scope to use input Personal Data under AI law where safeguards such as de-identification are used?

This critical legal question receives subtly but importantly different answers in different jurisdictions. This section will survey the treatment of identity linkage risk in the EU, the UK, and the US, which reflects different risk postures and evidence requirements.

2.2.1. EU: GDPR

In the EU, the GDPR applies and is predominantly an input-based model. The relevant gating definition is that of Personal Data, which turns on the identifiability of the data subject:

Personal Data: Art 4(1) GDPR

any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

Significantly, there is also specific provision for pseudonymous data use that might address the possible harm of identity-linkage from a data set: Article 25 GDPR encourages pseudonymization as a “privacy by design” measure, but it does not affirmatively allow this expensive step to be omitted where there is no consumer interest in privacy; nor does it create a safe harbor where pseudonymization is employed. Instead, the GDPR simply encourages pseudonymization: in a law, a somewhat meaningless exercise as it does not pin down the prioritization of interests in cases of conflict.

Instead, a hazard-based approach is taken. This is seen in recital 26 of the GDPR which sits somewhat in tension with Article 25 and states: “data which have undergone pseudonymization, which could be attributed to a natural person by the use of additional information, should be considered to be information on an identifiable natural person.”

This provision is stylistically verbose (“should be considered to be” = is). It would be helpful to tighten the language of such a key provision for the data driven economy.

Moving past style to substance, the provision strips pseudonymization of safe harbor status since the risk of re-attribution is still a gate to liability, even if the theoretical possibility of re-attribution poses no material consumer risk.[13] The critical reference is to data that “could be” re-attributed.

Article 25 GDPR also refers to the context of processing, including cost, but still affirmatively requires “state of the art” protection, despite the possibility that this is not merited. It would be a brave business who would argue that “state of the art” can be interpreted to be zero, even where this is the pro-consumer outcome. In other words, at least if read at face value, there is an assumption that safeguards are always required, and should always be the most costly, even where this is harmful to those using the system (as in a low risk case).

The recital 26 wording is also significantly in tension with the definition of “pseudonymization” in Article 4(5):

(5) ‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;

This definition seems to have in mind actual separation of the match key back to identity from the other data points. But that is not what recital 26 says, which instead applies the “could be” language as above.[14]
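The separation that the definition appears to contemplate can be sketched in a few lines of Python. This is an illustration only: keyed hashing with HMAC is one common pseudonymization technique chosen here for brevity, not a method prescribed by the GDPR, and the key, the e-mail addresses and the function names `pseudonymise` and `reattribute` are hypothetical. The working data set holds only tokens; the key that allows attribution back to a person is the “additional information” assumed to be held separately.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256) token."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def reattribute(token: str, candidates, secret_key: bytes):
    """Attempt to attribute a token back to a person. This only succeeds
    for a party that holds the separately stored key."""
    for candidate in candidates:
        if hmac.compare_digest(pseudonymise(candidate, secret_key), token):
            return candidate
    return None


# Hypothetical key, assumed to be stored apart from the working data set.
key = b"held-separately-under-organisational-measures"

# Hypothetical source records: (direct identifier, attribute).
records = [
    ("alice@example.com", "listens to classical music"),
    ("bob@example.com", "does grocery shopping on a Friday"),
]

# The working data set contains tokens and attributes, no direct identifiers.
working_set = [(pseudonymise(email, key), attr) for email, attr in records]

emails = [email for email, _ in records]
print(reattribute(working_set[0][0], emails, key))           # key holder can re-attribute
print(reattribute(working_set[0][0], emails, b"wrong-key"))  # without the key: None
```

On this sketch, whether the working data set “could be” attributed to a person depends on who has access to the separately held key, which is the practical question that a risk-based reading would ask.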

There are also some significant drafting ambiguities. The essential concept is the possibility of reattribution. However, as this is defined with reference to an unclear boundary, vagueness creeps in on a critical boundary to the law.

The fundamental issue here is that Personal Data and de-identification are each defined with reference to the other. This drives circularity: because each definition depends on the other, neither is clearly defined. There should instead be clearer differentiation based on evidence of risks, to avoid the artificial truncation of data sets on a non-evidenced basis.

It would be more helpful to define the concept using risk-based criteria. There is a serious risk of curtailing input data, or over-regulating it, even where identity revelation is not at risk, simply because of ambiguity rather than any affirmative policy choice.

Shortfalls in the input based approach

Regulation of data in thin air

There are also some curiosities which seem to be unduly legalistic. The fact that the possibility of data linking gives rise to regulation at the threshold stage has some odd implications for the downstream scope of regulation. The essential point is that it denies a safe harbor for innovation, even though there is no risk of identity revelation in a de-identified data set without the addition of further, identity-linked information.

This is a powerful impediment to generative AI innovation. In a typical generative AI use there is a large database whose contents are partially unknown, at least at the detail level. Read at face value, the GDPR places that large database under regulation in that it must have a legal basis, such as consent or the loaded term “legitimacy” (according to the regulator).[15] The Article 6(1)(a) GDPR test is particularly restrictive of this unknown innovation because it requires a “specific” purpose consent. While this could theoretically apply to an unknown future use, it seems very unlikely that a regulator would accept the argument that the “consent” meant by the law could encompass consent to innovation.

As a result, entire datasets can drop out of use because of the difficulty in complying with Art 6 GDPR’s requirements, even where there is no harm in the internal processing inherent in AI because there is no link to identity revelation.

This fails important theories on innovation, as noted by the Nobel prize winning economist Phelps:

“Innovations… are not determinate from current knowledge, thus are not foreseeable. Being new, they could not have been known before.”[16]

In other words, the GDPR, in its informed consent requirement, applies the fallacy that innovations can be foreseen. This gives little to no weight to innovative use from novel data sets even where they only contain de-identified data points.[17]

A good example of this can be seen in rectification rights. As AI can make errors, there is theoretical scope for a need to rectify the output. If the view is taken that the AI’s linking of the output to input data is regulated because it might reveal identity, then the requirements of Article 6 would apply even though it is only the output that is causing a problem. This is highly precautionary and appears to be unmerited in the absence of evidence of harm at the output stage.

This might be put as strongly as the level of a category error. From a consumer welfare perspective, the issue is the harm from the misuse of Personal Data. While a rectification right could be sound in the case of actual Personal Data – the right to have corrections of things about oneself – it should not be extended to the de-identified use case at issue here. That would essentially be a right to correct information about User123. But it is only in relation to material risk of linking to an individual that rectification would actually make sense.

Thus, it is unclear why a precautionary gating definition is required, there being no utility to correcting a de-identified data fragment about User123 except in instances where there is a material risk of linkage back to identity. It might not even be possible to do so: how is the data point identified for rectification, other than via evidence of an identity link? That would be a rectification right in thin air.

It follows that in the case of de-identified data, use is chilled at the margin, for a purely hypothetical benefit. A tighter, risk-based definition of Personal Data would avoid this because there would then be a link between risks to individuals and the duty of care. Expending resources, and curtailing data availability, would only take place in cases where there is material risk.

Arbitrary discrimination based on device vs cloud use

The same trend has been seen more recently in relation to debates about where processing takes place. This has major implications for generative AI: this is usually cloud based, at least on present technology, so rules favouring silos of data on devices amount to a tilt in favour of the vertically integrated Web 2.0 model, discriminating against Web 3.0.

Consider the draft EDPB guidance on the e-Privacy Directive. Significantly, there is a partial carve out for on-device storage. This risks a tilt towards those controlling devices, unless the rules are technologically neutral. The proposal is to capture movement into and out of local storage:

“The use of such information by an application would not be subject to Article 5(3) ePD as long as the information does not leave the device, but when this information or any derivation of this information is accessed through the communication network, Article 5(3) ePD may apply.”

This is a question of where, not what, is done with the data, and it is irrelevant from the consumer perspective. It artificially segments the market in favor of on-device processing. That would be a particularly damaging limitation if applied to AI systems.[18]

The missing point, from a risk-based perspective, is to engage with definitions of harm. That is simply missing from the definition above, reflecting an assumption towards fundamental rights treatment of data. The issue with such a view, however, is that it understates the role of the definition in cases where fundamental rights are not reasonably involved, as where there is no significant persistence or harm from data. In such cases, the definition brings in data uses which are de-identified and low risk with no gate to distinguish the higher risk cases in which fundamental rights are more directly engaged.[19]

In short, the EU-level experience would suggest that a fundamental rethink is required in relation to AI law. There may well be a case simply to dispense with the inputs-based conceptions of the GDPR, at least as regards AI, and replace them with an outputs-based measure.

2.2.2. Germany: FCO settlement with Google

By contrast, under the German Federal Cartel Office’s settlement with Google, there is a much clearer distinction between directly linked and de-identified data. Applying the same logic by which de-identified data would be differentiated from Personal Data would allow more use of data for generative AI applications.

On 5 October 2023 Google settled on the basis of drawing a distinction between Personal Data and non-authenticated Users:

“A differentiation is made between users signed into a Google account and non-authenticated users… The settings of non-authenticated users are stored via a cookie….”[20]

As the Decision summarized:

Google will no longer apply data processing terms to users allowing Google to:

combine personal data from a service covered by the Commitments with personal data from other Google services (with the exception of the relevant core platform services under the DMA) or with personal data from third-party services, or

[interoperate] personal data from a covered service in other services provided separately by Google (with the exception of the relevant core platform services under the DMA) and vice versa without giving users sufficient choice options to consent to or decline consent to such cross-service data processing (para. 1(1)).[21]

Furthermore, Google committed not to use terms allowing it to combine Personal Data with third-party Personal Data, or to cross-use Personal Data from other covered services, without first giving adequate consent choices.

So, in Germany, it is necessary to give consent to allow cross-service Personal Data processing. Very significantly, as the restrictions on combining data sources are cast in terms of Personal Data, there is differentiation between Personal Data and de-identified data (although it should be noted that there is not a deregulated state for either).

This provides a balance between processing freedoms and personal data protection and the differentiation based on personal data is a welcome distinction. Indeed, it may be an overdue clarification in the case of the GDPR with which the settlement must comply.

2.2.3. US

Two key examples are advanced in relation to the US position. The first is the definition of data in the CCPA, which has been touched upon above.

CCPA

It is notable that the CCPA is somewhat more precise, and provides a safe harbor for de-identified use. By defining de-identified data on a reasonable risk approach, the CCPA avoids the vagueness issues with the hazard-based approach of the GDPR.[22]

Settlement with State Attorneys General

A further Google settlement, this time with 41 US State Attorneys General, provides further insights on the boundaries of large scale de-identified data sets.[23] This settlement draws a distinction between logged in and non-authenticated data, which is essentially a practical and risk-based distinction between identifiable and de-identified data, since the log in allows linkage to identity (i.e., an authenticated User Account with Google).

Geolocation data provides a core example. This is a very useful data set for optimization, but it is also potentially one of the most invasive. The settlement requires opt-in consent when Google links such data to authenticated User Accounts, but it requires only transparency in relation to de-identified data. This is similar to (although not identical to) Google’s position in Germany – since the choice requirement is broader in the German settlement.

2.2.4. UK

In the UK, relevant ICO Guidance drafts, an AdTech Opinion of November 2021, and a Joint Statement with UK CMA provide further examples of moving towards a risk-based approach towards de-identified data. Of particular note is the Data Protection Bill (No. 2) currently before Parliament. In the Guidance drafts, the UK ICO has explicitly stated that the goal of data protection laws is not to reduce risk to zero.

This would move away from the EU’s hazard-based approach to privacy regulation. A new s.3A of the Data Protection Act would be added which is based on a reasonable risk analysis.[24] This is a particularly striking example, as it shows that the democratic process sometimes moves towards less, rather than more, control over data flows.

3. Conclusion: the impact on Generative AI applications

Drawing these cases together, there is a clear development in favour of risk-based approaches to de-identified data. It would be helpful to apply a risk-based approach to the output data from generative AI systems, rather than analysis of the input data. This would be a fundamental re-think. The emphasis would be on risk-based outputs from generative AI, rather than hazard-based inputs. Control rights would be attenuated to the valid cases of true risks of Personal Data revelation, and a de-identified data set used in generative AI would not be regulated as Personal Data unless there are re-identified outputs. This would enable innovation and allow an evidence-based approach that prioritizes innovation from the use of large data sets over the hypothetical risk that identity re-linking will occur.

Compliance efforts could then focus on managing the true consumer-facing risk, which is that of reidentification in the outputs – there being no interest in the “innards” of a giant innovative AI data set for its own sake. Nonetheless, regulating the “innards” from the input data onwards is the position of the GDPR, and it has for now been inherited, with potential chilling effects on innovation from generative AI-based systems.

It is unfortunate that the latest developments in AI governance do not provide a clear framework on this crucial point. For example, the Biden Administration’s Executive Order refers to Privacy Enhancing Technologies, which can be taken to imply a need for global restrictions even on de-identified data (e.g., separate “clean room” processing even where risks are low). It will be essential to prevent the issues seen in the GDPR from recurring, and statements of this sort should be framed more precisely in terms of Personal Data (EU/UK) or Personally Identifiable Information (US) risks.[25]

This is critical because otherwise a large data set of harmless, de-identified data points will be lost, with knock-on losses of innovation in generative AI systems. Transaction costs will increase and vertical integration will be encouraged. A competitive ecosystem of AI vendors is therefore diminished, simply because de-identified data would not be available to them, without any principled basis as to why this needs to be so.

There is still the possibility that generative AI systems will create identity-linked outputs without necessarily having been fed identity-linked input data. This is referred to as the unknown future uses issue, and it is a real issue. However, it follows from the above analysis that the conceptualization of it ought to be based on outputs and not inputs. Otherwise, the de-identified use case preserved by the emerging differentiation in the GDPR cases would be lost. That would be to fail to learn the lessons of the GDPR in relation to AI. It is, in fact, a category error, as the category of de-identified data that has recently emerged in data law would be lost.

4. Infrequently asked questions about AI systems and the choice of input vs output data regulation

The paper concludes with some infrequently asked questions (IAQs) about input vs output data as the focal point for generative AI systems:

Should analysis be risk based, or hazard based? The two are often used interchangeably but, as above, they are very different across the input and output divide. This follows from the impossibility of ex ante regulation of very large data sets, and the need instead to adopt a risk-based approach using safeguards (e.g., de-identification).
Should there be a fundamental right to control within a machine? In “conventional” data law this derives from the Fair Information Principles (FIPs), but it is based on much earlier technology. Is it applicable to generative AI systems, which appear to require output-based analysis? The FIPs appear to assume a static model and make a critical assumption that there is a link to individual identity.
What about data unrelated to personal identity? Even the most trivial insight could bar the upload of data. Perhaps some contextualization of the Personal Data concepts is also required – a reminder that a strong de-identification regime is not the only fulcrum of analysis for a risk-based approach. Indeed, the main risk factor for harm may well be the sensitivity of the subject matter rather than the potential for identity-linkage; if so, would the GDPR-based model require a fundamental rethink?
If data is de-identified, is there still a need for an ex ante regime? Would ex post enforcement, perhaps applying a punishment to close any resulting enforcement gaps, suffice? It is notable that the EU AI Act applies strong ex ante regulation regardless of the risk of the underlying data set.
International integration: will the convergence between California, Germany and the UK on the recognition of de-identified data continue? Or will the earlier rift between the EU and the US re-emerge? From a consumer welfare perspective, a coherent risk-based approach would be very helpful.

Analysis of these and other questions will bear fruit as generative AI systems and legal analysis of them continue to develop.

***

Citation: Stephen Dnes, Generative AI: The Input Data Riddle, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2024.

Note

I am grateful to Joshua Koran and Saule Sabaliauskaite for comments on an earlier draft, but the usual disclaimer applies.

References

[1] If data policy fails to get these questions right, future consumers might come to ask: Where’s My Flying Car? A memoir of the future past (J Storrs Hall, Stripe Press, 2021).
[2] Wesley Newcomb Hohfeld, Fundamental Legal Conceptions as Applied in Judicial Reasoning and Other Legal Essays (1913 and 1919, pub’d Yale UP 1964). As early as 1980, this was said to be “…now a standard part of legal thinking.” Walker, Oxford Companion to Law p. 575 (OUP, 1980). Nonetheless, the GDPR did not apply clear demarcations of the Hohfeldian rights, despite passage at a time when these points had become extremely well known.
[3] Keppel v Bailey 39 ER 1042 (1834) (owner cannot create “incident of a novel kind [to be] devised and attached to property at the fancy and caprice of any owner.”)
[4] Calabresi, Guido, and A. Douglas Melamed. “Property Rules, Liability Rules, and Inalienability: One View of the Cathedral.” Harvard Law Review 85, no. 6 (1972): 1089–1128. https://doi.org/10.2307/1340059.
[5] See especially the complaints of Privacy International available at https://privacyinternational.org/advocacy/2426/our-complaints-against-acxiom-criteo-equifax-experian-oracle-quantcast-tapad which do not address the important point of identifiability vs de-identified data and safeguards.
[6] See e.g. S Wachter, “Normative Challenges of Identification in the Internet of Things: Privacy, Profiling, Discrimination, and the GDPR” 34(3) Computer Law and Security Review 436-449 (asserting net harm from GDPR from a broad perspective of fairness, implicitly trusting regulators to achieve the correct cost benefit analysis).
[7] L Sweeney, “K-Anonymity: A Model for Protecting Privacy.” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10.5 (2002): 557-570.
[8] Indeed, in Sweeney’s analysis only the attacker possessing identity could match this back to a specific individual using the deidentified set of data points. The point is that 87% of the data points are unique. It is only with the addition of a matching system (a match key) back to identity that identity would then be revealed. It follows that the identity revelation risk arises from the application of the match key.
[9] Another understated aspect of the paper is that restrictions such as disclosure control and query restrictions are designed to address the uniqueness issue, but much less focus has been given to these.
[10] In particular, the well-known prescription to strip data sets of data unless there are at least two instances (“k-anonymity, k=2”) will be unduly restrictive of data handling in low risk contexts and may not have been intended to be applied outside of true sensitive data instances.
[11] See especially D Solove and W Hartzog. “The FTC and the new common law of privacy.” Colum. L. Rev. 114 (2014): 583 (noting emergence of common law precedent-based approach to US FTC privacy cases).
[12] We note that GDPR Art 25 (privacy by design) does however rely on a reasonable balance of interest test of cost of precaution relative to the likelihood and severity of risk to a specific individual, as explored further below.
[13] Following Case T-557/20 SRB v EDPS there is scope to argue that technical and organizational measures can suffice for de-identified data not to be Personal Data. However, the posture is still to assume identity linkage unless proven otherwise, and there is the difficult question of how far SRB applies and will be applied (for example, there is so far relatively little impact at the EU Member State level).
[14] The issue is compounded by the Article 29 Working Party Opinion 5/2014 on Anonymization Techniques 8 (April 10, 2014) which seems to require not even the slightest possibility of re-identification: See J Polonetsky, O Tene, and K Finch, “Shades of Gray: Seeing the Full Spectrum of Practical Data De-identification,” 56 Santa Clara L. R. 593 (2016), citing K El Emam and C Alvarez, “A Critical Appraisal of the Article 29 Working Party Opinion 05/2014 on Data Anonymization Techniques,” Int’l Data Privacy Law (Dec. 13, 2014).
[15] Art 6(1) GDPR.
[16] Edmund Phelps, Mass Flourishing, p.32 (Princeton Univ. Press, 2013).
[17] It is very striking in this context that leading commentators have asserted that this is a good thing without an evidential basis, e.g., Purtova, “The law of everything. Broad concept of personal data and future of EU data protection law” 10 J. Law, Innovation and Technology, 40-81 (2018) (“At present, the broad notion of personal data is not problematic and is even welcome.”). In fairness, the author also notes the problem of possible “system overload” – but again, that is a governance interest: there is no clear articulation of societal interests. There is certainly no emphasis on innovation.
[18] See draft Guidelines 2/2023 on Technical Scope of Art.5(3) of the ePrivacy Directive (in places applying distinctions based on processing location which are arbitrary from the consumer point of view).
[19] Interestingly, however, there have been at least two cases which have suggested a more risk-based approach: Case T-557/20 SRB v EDPS (use of match key not necessarily Personal Data) and Case C-582/14 Breyer (IP address not necessarily Personal Data). The cases stop short of providing a clear safe harbour, however, as they are contradicted in other cases and do not definitively state that de-identified data is protected, in contrast with the risk-based approach of the CCPA.
[20] Case B7-70/21 Decision pursuant to Section 19a(2) sentence 4 in conjunction with Section 32b(1) GWB. See especially: 5. Google commits in this regard in particular to design the choice options to be offered to Users pursuant to para. 1 for cross-service data processing in a transparent manner…. [Obligation Paragraph 1] For Covered Services [limited to Google’s consumer-facing services], Google will not use Data Processing Terms that provide Google with the possibility to 1. combine Personal Data from a Covered Service with Personal Data from other Non-CPSs or with Personal Data from third party services; or 2. cross-use Personal Data from a Covered Service in other Non-CPSs provided separately by Google and vice versa;… A sufficient choice option is given when Users have been presented with the specific choice to permit or decline the cross-service data processing under para 1(a) and (b) and can give consent within the meaning of Article 4 no. 11 and Article 7 of the GDPR…. “Personal Data” means personal data within the meaning ascribed to it in the General Data Protection Regulation (GDPR – Regulation (EU) 2016/679). “User(s)” means signed-out end users (B2C) that access Google’s services with a German IP address and signed-in end users whose Google Account location is Germany.
[21] Case B7-70/21 Decision pursuant to Section 19a(2) sentence 4 in conjunction with Section 32b(1) GWB, para 62.
[22] § 1798.145(a)(6) – Deidentified Data defined as information that “the business that possesses the information: (1) takes reasonable measures…; (2) publicly commits to maintain and use the information in deidentified form…; (3) Contractually obligates any recipients of information to [not reidentify].” CCPA defines Personal Information as any information linked to a Consumer, who is “a natural person who is a California resident” or Household, “a group, however identified, of consumers”. There is also a more robust definition of pseudonymization in § 1798.140(r): “[M]eans the processing of personal information in a manner that renders the personal information no longer attributable to a specific consumer without the use of additional information, provided that the additional information is kept separately and is subject to technical and organizational measures to ensure that the personal information is not attributed to an identified or identifiable consumer.”
[23] See especially the summary from the Oregon Department of Justice, “Google: AG Rosenblum Announces Largest AG Consumer Privacy Settlement in U.S. History” (state.or.us). (See especially Order #17, reference to account name, as a gating definition for consent requirements, but the application of a transparency framework elsewhere).
[24] See especially the proposed new s.3A(2) and (3) which would consider the reasonable risk of reidentification “by reasonable means at the time of processing” (s.3A(2)) (emphasis added) or where another person “will, or is likely to, obtain” information from processing and there is also a risk that someone else will re-identify based on “reasonable means at the time of processing.” (s.3A(3)). Technical and organizational measures are required, as is normal in data-driven systems (s.3A(4)). But very significantly, and in contrast with the GDPR, the “reasonable means” of reidentification are defined to include evidence measures including the “time, efforts and costs” of reidentification and “technology and resources” of the third party (s.3A(6)). Effectively, this confers a risk-based safe harbor on de-identified data use which did not exist under the EU membership era Data Protection Act.
[25] Particular needs for definitional clarity on the point arise in the EU’s AI Act and the Biden Administration’s Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, neither of which currently provides a clear distinction between input and output data.
