A California law firm, on behalf of 16 plaintiffs and others, is seeking an injunction against OpenAI and Microsoft for allegedly using millions of people’s data, scraped from the internet over years without permission, to train the models at the core of ChatGPT and other AI products that are quickly being incorporated into software, including legal technology.
“Defendants … use stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge,” says the lawsuit, which seeks class action status for users of ChatGPT and Microsoft products as well as millions of nonusers whose personal information has been used without their knowledge.
“Defendants continue to unlawfully collect and feed additional personal data from millions of unsuspecting consumers worldwide, far in excess of any reasonably authorized use, in order to continue developing and training the Products,” plaintiffs allege in the lawsuit, filed by Clarkson Law Firm.
Although the charges focus on the unauthorized use of data, the lawsuit sounds the alarm over where AI is heading and the ability of OpenAI or any developer to exercise control over the software as it evolves.
“Novel capabilities often emerge in more powerful models,” the plaintiffs say, drawing on concerns raised by technology leaders, including OpenAI’s founder, Sam Altman. “Some [capabilities] that are particularly concerning are the ability to create and act on long term plans, to accrue power and resources, and to exhibit behavior that is increasingly ‘agentic,’” a term referring to a version of AI that acts in its own self-interest.
The claim that OpenAI has been stealing data for years to populate the large language models it needs to train its tools is not unlike claims against Clearview AI, a company that has been embroiled in similar charges for scraping the internet for images to power its facial-recognition software used by law enforcement agencies.
“Clearview AI is the poster boy for what not to be doing,” Paul Reinitz, director of advocacy and legal operations counsel at Getty, has said. “Their position is they can just crawl the web and it’s public.”
The lawsuit against OpenAI cites the experience of the facial-recognition company, which has since become a registered data broker in California and Vermont.
“Clearview’s illegal scraping practices also went undetected for years,” the lawsuit says. “The public was rightfully upset, as were state and federal regulators.”
Unlike Clearview, OpenAI has yet to register as a data broker, a failure the lawsuit counts among the company’s several alleged violations of law.
“Plaintiffs and the Classes have a right to opt out of this ongoing scraping of internet information but no mechanism to exercise that right, absent the injunctive relief sought in this Action,” the plaintiffs say.
From just one of the sources OpenAI is said to have used in training its models, the company has amassed a dataset of almost a trillion words, indicating the scope of its scraping.
Charges include larceny
The lawsuit lists 15 counts against OpenAI and its biggest investor, Microsoft, which has agreements with the company to use its software in its products, including Outlook, Teams and Bing.
The counts focus on privacy and other violations of the federal Electronic Communications Privacy Act and the Computer Fraud and Abuse Act, and violations under California law, including the state’s Invasion of Privacy Act, Unfair Competition Law, right to exclusion laws and its constitutional privacy protections.
The counts also include alleged violations of Illinois law, including the state’s Biometric Information Privacy Act and Consumer Fraud and Deceptive Business Practices Act.
One count alleges larceny under California’s penal code.
“Defendants stole the contents of the internet,” the lawsuit says. “Everything individuals posted, information about the individuals, personal data, medical information, and other information — all used to create their Products to generate massive profits. At no point did Defendants have individuals consent to take/scrape this information in order to train their AI Products.”
OpenAI didn’t have to go down this road, the lawsuit says. The company started in 2015 as a nonprofit dedicated to the open and transparent development of AI and only veered into questionable territory when it launched a for-profit subsidiary in 2019 that quickly ballooned in value to almost $30 billion after Microsoft made a series of investments in it.
“OpenAI’s shift in organizational structure has raised eyebrows,” the lawsuit says.
Vered Horesh, chief of strategic AI partnerships at AI company Bria, says it’s possible to train AI models using data without violating people’s protections.
“Viable alternatives exist that respect copyright laws and personal privacy, indicating the potential for a sustainable data economy,” Horesh said in an emailed statement provided to Legal Dive. “Bria AI has been staunchly supporting this ethos since its inception.”
OpenAI did not respond immediately to a request for comment.