Generative Artificial Intelligence, Data Minimization, and Today’s Gold Rush

Lexis Nexis

This article discusses the principle of data minimization in the context of commercial applications of generative artificial intelligence (GenAI) technology and tools.


In the United States, the principle of data minimization is firmly embedded in the Federal Trade Commission (FTC) Act, in FTC enforcement activities, and in the host of state-level privacy laws and rules that have proliferated in recent years.

The explosive emergence in recent months of commercial applications of GenAI technology and tools, their need to train on very large data sets, and their need for continued development using user-generated data supplied in GenAI prompts (prompt data) present challenges in applying this principle.

Now is the time to take stock of your data-minimization strategies to ensure that your technology and tools based on GenAI are resilient, can withstand regulatory scrutiny, and can position your organization to compete effectively in a market estimated to experience a compound annual growth rate of over 35% through 2030, more than 10 times the growth rate of the U.S. economy.[1]

Data Minimization Laws

In general, the data-minimization principle holds that controllers should only collect and process the personal information they need to accomplish a disclosed purpose or a contextually compatible purpose, should only transfer such data consistent with those purposes, and should only maintain personal information as long as is necessary for those purposes.

The FTC’s enforcement posture has changed dramatically over the past 11 years. As far back as 2012, the FTC advocated reasonable collection limitation.[2] Now, according to the FTC, using an interface to steer consumers to an option to provide more information than the context makes necessary may be considered a dark pattern, in violation of Section 5.[3]

Focusing more narrowly on AI and machine learning in a recent case, all three sitting commissioners stated that “machine learning is no excuse to break the law. Claims from businesses that data must be indefinitely retained to improve algorithms do not override legal bans on indefinite retention of data. The data you use to improve your algorithms must be lawfully collected and lawfully retained.” In a clear warning shot far beyond the contours of the case at hand, the FTC continued, “companies would do well to heed this lesson.”[4]

The FTC’s Commercial Surveillance Advance Notice of Proposed Rulemaking makes clear that the FTC is considering codifying data minimization into federal law.[5] In the meantime, the FTC has already brought a number of enforcement actions focused on data minimization. These cases allege that companies violated laws enforced by the FTC when they:

  • Collected more personal information than they disclosed or needed for the purposes for which it was collected[6]
  • Used[7] or shared[8] personal information for incompatible purposes
  • Retained the information in violation of their own representations, or beyond the period for which the data was required for the purposes for which it was collected[9]

U.S. Laws

The California Consumer Privacy Act, as amended by the California Privacy Rights Act, was the first comprehensive privacy law in the United States to reduce the data-minimization principle to codified law. Collection of personal information must be reasonably necessary and proportionate to the purpose for which it was collected, or to another disclosed purpose that is compatible with the context of collection.[10] New laws taking effect this year in Colorado,[11] Connecticut,[12] and Virginia,[13] as well as laws passed this legislative cycle that take effect in 2024 and beyond in Indiana,[14] Iowa,[15] Tennessee,[16] Montana,[17] and Texas,[18] all share common principles. In short, it is now black-letter law in the United States that personal information can only be collected for disclosed and contextually relevant purposes.

Contracts

One risk associated with licensing GenAI technology is that it may have been trained on data sets including personal information or sensitive personal information—or both. Companies can limit their risk in this regard by focusing their attention on the representations, warranties, limitations of liability, and indemnity provisions. In the GenAI context, these terms are not yet standard. The market is still developing. But savvy organizations are familiar with risk shifting. Do not let the rush-to-market period we’re in now expose your organization to undue risk. Regulators have shown a willingness to seek algorithmic disgorgement—the death penalty that could cripple your GenAI rollout—for algorithms based on data improperly collected.[19] Do your best to make sure that you are building your tool on a solid foundation and that you are protected against downside risk.

What about prompt data? Consider whether this data will go to the GenAI technology developer itself, and for what purposes. Will it be used to continue the development of the tool just for your organization, or for others as well? If the toolmaker will use the data just for you, can the toolmaker be your service provider or processor just for this purpose? Appropriate data-processor or service-provider agreements under the new state laws may give your organization some control over the further use and disclosure of user prompt data, and such agreements may limit your risk to that extent. Your processor or service-provider agreement should define the uses the GenAI technology developer may make of prompt data, and those uses should match the purposes you disclose at the point of collection and in your privacy policy. You should also make sure that the toolmaker is equipped to assist you in responding to consumer rights requests.

Your Disclosures: Proximate to the Prompt and Privacy Policy

Because privacy laws place an emphasis on disclosed and contextually relevant purposes, it is critical to have clear and conspicuous disclosures proximate to the prompt field. These disclosures should make clear that data submitted as a GenAI prompt will be used by your organization and (if applicable) the AI technology developer to generate content and to train the tool (and, if applicable, the underlying GenAI technology) on an ongoing basis. The company’s privacy policy should also contain the same disclosures.

These disclosures should also explain that the user may prevent this use by not entering any personal information into the prompt field. If possible, end users should have an opportunity to opt out of the processing of prompt data for further development of the GenAI tool and the underlying technology. But before you offer that, be sure you can honor it.
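Honoring an opt-out usually means excluding the affected prompts before any data reaches a training or fine-tuning pipeline. The following is a minimal Python sketch of that filtering step, not a definitive implementation. The names `PromptRecord` and `select_training_prompts` are hypothetical, and the sketch assumes your organization already records each user's opt-out choice somewhere it can be looked up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    """One stored prompt and the user who submitted it (hypothetical schema)."""
    user_id: str
    text: str


def select_training_prompts(records, opted_out_user_ids):
    """Return only the prompts whose users have not opted out of training use."""
    opted_out = set(opted_out_user_ids)
    return [record for record in records if record.user_id not in opted_out]


records = [
    PromptRecord("u1", "Draft a lease summary"),
    PromptRecord("u2", "Summarize my medical bill"),
]

# User u2 has opted out, so only u1's prompt is eligible for development use.
training_set = select_training_prompts(records, {"u2"})
```

A filter like this only addresses prompts going forward. Whether an opt-out must also reach data already shared with the toolmaker is a question for your contract and for counsel.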

De-identifying Prompt Data

Because GenAI’s fuel is data, and because of the expansive definitions of personal information and personal data in the state privacy laws, it may not be feasible over time to sort through all of your organization’s prompt data to delete all personal information before the data is used for GenAI product development. But what about de-identification? The California Consumer Privacy Act (CCPA) excludes de-identified data,[20] and it contains a typical standard that organizations must meet to enjoy this protection, borrowed from FTC enforcement and policy work.

Section 1798.140(m) of the CCPA states:

“Deidentified” means information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer provided that the business that possesses the information:

  1. Takes reasonable measures to ensure that the information cannot be associated with a consumer or household.

  2. Publicly commits to maintain and use the information in deidentified form and not to attempt to reidentify the information, except that the business may attempt to reidentify the information solely for the purpose of determining whether its deidentification processes satisfy the requirements of this subdivision.

  3. Contractually obligates any recipients of the information to comply with all provisions of this subdivision.[21]

Well-known work by the National Institute of Standards and Technology[22] and the U.S. Dept. of Health & Human Services[23] serves as a tactical guidepost. The point is to do what you can to maintain the volume of data needed to develop GenAI tools while avoiding data minimization risks associated with prompt data.
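As one illustration of a "reasonable measure," obvious direct identifiers can be masked in prompt text before it is retained for development. The Python sketch below is a simplified example under stated assumptions: the patterns and the function name `redact_prompt` are illustrative, they cover only U.S.-style email addresses, phone numbers, and Social Security numbers, and pattern matching alone is unlikely to satisfy the statutory standard quoted above, which also requires public commitments and contractual controls.

```python
import re

# Illustrative patterns only; real prompt data contains many other identifiers
# (names, addresses, account numbers) that these patterns will not catch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_prompt(text):
    """Replace matched identifiers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


redacted = redact_prompt("Email jane.doe@example.com or call 555-123-4567.")
# redacted == "Email [EMAIL] or call [PHONE]."
```

Masking of this kind preserves the volume and shape of the prompt data while removing some of the fields most likely to link a prompt to a particular consumer.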

Conclusion

Privacy law has long wrestled with the urge to collect and keep data for future use. What’s new is that with GenAI, what was once a question of “I may want to use the data in the future” has now become “I will need to use the data in the future.” Data-minimization standards do not act as a ban on the use of training data and prompt data for the development of commercial GenAI technology and tools. In fact, if you apply them with care, data-minimization standards can serve both as a shield to avoid regulatory scrutiny and as a sword to distinguish your GenAI tools from others in an almost limitless market.


[10] Cal. Civ. Code § 1798.100(c). (“A business’ collection, use, retention, and sharing of a consumer’s personal information shall be reasonably necessary and proportionate to achieve the purposes for which the personal information was collected or processed, or for another disclosed purpose that is compatible with the context in which the personal information was collected, and not further processed in a manner that is incompatible with those purposes.”)
