Microsoft, OpenAI sued for copyright infringement by nonfiction book authors in class action claim
(www.cnbc.com)
from L4s@lemmy.world to technology@lemmy.world on 08 Jan 2024 02:00
https://lemmy.world/post/10446079
Microsoft, OpenAI sued for copyright infringement by nonfiction book authors in class action claim::The new copyright infringement lawsuit against Microsoft and OpenAI comes a week after The New York Times filed a similar complaint in New York.
All the grifters coming out to feed 🫣
I'm not a huge fan of Microsoft or even OpenAI by any means, but all these lawsuits just seem so... lazy and greedy?
It isn't like ChatGPT is just spewing out the entirety of their works in a single chat. In that context, I fail to see how seeing snippets of said work returned in a Google summary is any different than ChatGPT or any other LLM doing the same.
Should OpenAI and other LLM creators use ethically sourced data in the future? Absolutely. They should've been doing so all along. But to me, rich chumps like George R. R. Martin complaining that their data was stolen and profited off of without their knowledge just feels a little ironic.
Welcome to the rest of the 6+ billion people on the Internet who've been spied on, data mined, and profited off of by large corps for the last two decades. Where's my goddamn check? Maybe regulators should've put tougher laws and regulations in place long ago to protect all of us against this sort of shit, not just businesses and wealthy folk able to afford launching civil suits on shaky grounds. It's not like deep learning models are anything new.
Edit:
Already seeing people come in to defend these suits. I just see it like this: AI is a tool, much like a computer or a pencil are tools. You can use a computer to infringe copyright all day, just like a pencil can. To me, an AI is only going to be plagiarizing or infringing if you tell it to. How often does AI plagiarize without a user purposefully trying to get it to do so? That's a genuine question.
Regardless, the cat's out of the bag. Multiple LLMs are already out in the wild and more variations are made each week, and there's no way in hell they're all going to be reined in. I'd rather AI not exist, personally, as I don't see protections coming for normal workers over the next decade or two against further evolutions of the technology. But, regardless, good luck to these companies fighting the new Pirate Bay-esque legal wars for the next couple of decades.
If I want to be able to argue that having any copyleft stuff in the training dataset makes all the output copyleft (and I do) then I necessarily have to also side with the rich chumps as a matter of consistency. It's not ideal, but it can't be helped. ¯\_(ツ)_/¯
In your mind are the publishers the rich chumps, or Microsoft?
For copyleft to work, copyright needs to be strong.
I was just repeating the language the parent commenter used (probably should've quoted it in retrospect). In this case, the "rich chumps" are George R.R. Martin and other authors suing Microsoft.
Wait. I first thought this was sarcasm. Is this sarcasm?
No. I really do think that all AI output should be required to be copyleft if there's any copyleft in the training dataset (edit for clarity: unless there's also something else with an incompatible license in it, in which case the output isn't usable at all, but protecting copyleft is the part I care about).
Huh. Obviously, you don't believe that a copyleft license should trump other licenses (or lack thereof). So, what are you hoping to achieve with this?
I'm not sure what you mean. No license "trumps" any other license; that's not how it works. You can only make something that's a derivative work of multiple differently-licensed things if the terms of all the licenses allow it, something the FSF calls "compatibility." Obviously, a proprietary license can never be compatible with a copyleft one, so what I'm hoping to achieve is a ruling that says any AI whose training dataset included both copyleft and proprietary items has completely legally-unusable output. (And also that any AI whose training dataset includes copyleft items along with permissively-licensed and public domain ones must have its output be copyleft.)
Yes, but what do you hope to achieve by that?
I hear those kinds of arguments a lot, though usually from the exact same people who claimed nobody would be convicted of fraud for NFT and crypto scams when those were at their peak. The days of the wild west internet are long over.
Theft in the digital space is a very real thing in the eyes of the law, especially when it comes to copyright infringement. It's wild to me how many people seem to think Microsoft will just get a freebie here because they helped pioneer a new technology for personal gain. Copyright holders have a very real case here, and I'd argue even a strong one.
Even using user data (that they own legally) for machine learning could get them into trouble in some parts of the developed world, because users 10 years ago couldn't anticipate it would be used that way, and so couldn't give their full consent.
Personally, I think public info is fair game - consent or not, it's public. They're not sharing the source material, and the goal was never plagiarism. There was a period where it became coherent enough to get very close to plagiarism, but it's been moving past that phase very quickly.
Microsoft, especially with how they scraped private GitHub repos (and the things I'm sure Google and Facebook just haven't gotten caught doing with private data), is way over the line for me. But I see that more as being bad stewards of private data - they shouldn't be looking at it, their AI shouldn't be looking at it, the public shouldn't be able to see it, and they probably failed on all counts.
Granted, I think copyright is a bullshit system. Normal people don't get any protection, because you need to pay to play. Being unable to defend it means you lose it, and in most situations you're going to spend way more on legal costs than you could possibly get back.
I also think the most important thing is that this tech is spread everywhere, because we can't have one group in charge of the miracle technology... It's too powerful.
Google has all the data they could need, they've bullied the web into submission... They don't have to worry about copyright; they control the largest ad network and dominate search (at least for now).
It sucks that you can take any artist's visual work and fine-tune a network to replicate endless rough facsimiles in a few days. I genuinely get how violating that must feel.
But they're going to be screwed when the corporate work dries up for a much cheaper option, and they're going to have to deal with the flood of AI work... Copyright won't help them; it's too late for it to even slow things down.
If companies did something wrong, have it out in court. My concern is that they're going to pass laws on this that claim it's for the artists, but effectively gatekeep AI to tech giants.
Where, for example?
The European Union, for example.
That's not right. It explicitly is legal in the EU.
That is not how the EU works. Member states can get together to tariff and sanction behavior, but just because the EU generally allows something doesn't mean all member states have to abide. Different constitutions and all. Besides, I'd like to know where exactly any EU resolution explicitly allows corporations to throw any data they have at any technology, or LLMs specifically, even when nobody ever gave consent to that. Corporations have to be quite specific about how they process your data, and broadly saying "machine learning stuff" 10 years ago isn't really waterproof.
No. EU legislation often has so-called opening clauses that allow member states to tune "EU laws" to their needs, but it's not the default behavior.
You seem to have the GDPR in mind. It regulates personal data, meaning data that can be tied to a person. If that is not possible, the GDPR has no objections.
It's wild to me how so many people seem to have got it into their heads that cheering for the IP laws that corporations fought so hard for is somehow left wing and sticking up for the little guy.
And your argument boils down to "Hitler was a vegetarian, therefore all vegetarians are fascists". IP laws are a huge stifle on human creativity, designed to allow corporate entities to capture, control, and milk innate human culture for profit. The fact that sometimes some corporate interests end up opposing them when it suits them does not change that.
I already have:
I thought that was a prima facie reason for why they are bad. And no, I do not believe all copyright law is bad with no nuance, as you would have seen if you had stalked deeper into my profile rather than just picking one comment that you thought you could have fun with.
There are plenty from people who actually study this stuff.
I don't have a significant opinion on the Disney case, though I will note that it stems from the fact that corporations are able to buy and sell rights to works as pieces of capital (in this case, Disney buying it from Lucasfilm).
Stifling a writing tool because GRRM wants a payday, on the basis that it can spit out small parts of his work if you specifically ask it to, is the opposite of advancing the art.
...yet allowing individuals to build upon existing works. It's literally the rest of the statement you put in bold; stop trying not to see it on purpose.
I'm clearly talking about the technology when I say tool (large language models) and not the company itself.
If we can't freely use copyrighted material to train, it completely and unequivocally kills any kind of open source or even small to medium model. Only a handful of companies would have enough data or funds to build LLMs. And since AI will be able to do virtually all desk jobs in the near future, it would guarantee Microsoft and Google owning the economy.
So no, I'm not taking the side of the corporations. The corporations want more barriers and more laws; it kills competition and broadens their moat.
I don't think GRRM is evil, just a greedy asshole that's willingly playing into their hand. I also don't think loss of potential profit, because the domain has been made even more competitive, equals stealing. Nothing was stolen; the barrier for entry has been lowered.
This isn't helping anyone except big-name authors, the owners of publishing houses, and Microsoft. The small-time authors and artists aren't getting a dime. Why should literally the rest of us get screwed so a couple of fat cats can have another payday? How is this advancing the arts?
It doesn't matter what the subject is about; I'm clearly not saying OpenAI the company when I use the term "writing tool".
I'm advocating for us and society as a whole. If only Google and Microsoft hold the keys to AI, we all end up paying a surtax on everything we buy, because every business will be forced into a subscription model to use it and stay competitive.
There is too much data involved to ask for consent; you would just end up with big players trading with each other. The small artists wouldn't get a dime, only Getty and Adobe. It's literally not feasible.
Nothing was stolen except future potential jobs. You can't own a style or anything of the kind.
The small artists aren't going to get any kind of benefit out of these lawsuits. It sucks that it's an even more saturated market, but the good ones learn to use these tools (LLMs and img/vid gen) to elevate their own art and push the boundaries.
Just a heads-up: "libertarian" is usually understood, in the American sense, as meaning right-libertarian, including so-called anarcho-capitalists. It's understood to mean people who believe that the right to own property is absolutely fundamental. Many libertarians don't believe in intellectual property, but some do. Which is to say that in American parlance, the label "libertarian" would probably include you. Just FYI.
Also, I don't know what definition of "left" you are using, but it's not a common one. Left ideologies typically favor progress, including technological progress. They also tend to be critical of property, and (AFAIK universally) reject forms of property that allow people to draw unearned rents. They tend to side with the wider interests of the public over an individual's right to property. The grandfather comment is perfectly consistent with left ideology.
Sure. Trickle-down FTW.
Just because it was available on the public internet doesn't mean it was available legally. Google has a way to remove it from their index when asked, while it seems that OpenAI has no way to do so (or no will to do so).
You are misrepresenting the issue. The issue here is not whether a tool just happens to be usable for copyright infringement in the hands of a malicious entity. The issue here is whether LLM outputs are just derivative works of their training data. This is something you cannot compare to tools like pencils and PCs, which are much more general-purpose and which are not built on stolen copyrighted works. Notice also how AI companies bring up "fair use" in their arguments. This means that they are not arguing that they are not using copyrighted works without permission, nor that the output of the LLM does not contain any copyrighted part of its training data (they can't do that, because you can't trace the flow of data through an LLM), but rather that their use of the works is novel enough to be an exception. And that is a really shaky argument when their services are actually not novel at all. In fact, they are designing services that are as close as possible to the services provided by the original work creators.
I disagree, and I feel like you're equally misrepresenting the issue, if I must be as well. LLMs can do far more than simply write stories. They can write stories, but that is just one capability among numerous. Can it write stories in the style of GRRM? I suppose, but honestly, doesn't GRRM also borrow a lot of inspiration from other authors? Any writer claiming to be so unique that they aren't borrowing from other writers is full of shit.
I'm not a lawyer or legal expert; I'm just giving a layman's opinion on a topic. I hope Sam Altman and his merry band get nailed to the wall, I really do. It's going to be a clusterfuck of endless legal battles for the foreseeable future, especially now that OpenAI isn't even pretending to be nonprofit anymore.
This story is about a non-fiction work.
What is the purpose of a non-fiction work? It's to give the reader further knowledge on a subject.
Why does an LLM manufacturer train their model on a non-fiction work? To be able to act as a substitute source of the knowledge.
The end result is that not only have they stolen their work, they've stolen their income and reputation.
If you're using an LLM as any form of authoritative source (and literally any LLM specifically warns NOT to do that), then you're going to have a bad time. No one is using them to learn in any serious capacity. Ideally, the AI should absolutely be citing its sources, and if someone is able to figure out how to do that reliably, they'll be made quite rich, I'd imagine. In my opinion, the fiction writers have a stronger case than the non-fiction ones (I believe the fiction writers' class action against OpenAI from September is still ongoing).
For someone who claimed to not be a fan of OpenAI, you sure do know all the fan arguments against regulation for AI.
I'm not here to argue the finer points, and in general I simply try to aim for the practical actions that lead to better circumstances. I agree with many of your points.
This lawsuit won't fix anything, but it will slow down the progress of OpenAI and their ability to loot culture and content for all its value. I see it as a foot in the door for less economically capable artists and such.
Lawsuits are not isolated incidents. The outcome of this will have far-reaching impacts on the future of how people's work is treated in regards to AI and training data.
There's a big difference between borrowing inspiration and just using entire paragraphs of text or images wholesale. If GRRM uses entire paragraphs of JK Rowling with just the names changed, and uses the same cover with a few different colors, you have the same fight. LLMs can do the first, but also do the second.
The "in the style of" question is a different issue that's being debated, as style isn't protected by law. But apparently, if you ask for something "in the style of", the LLM can get lazy and produce parts of the (copyrighted) source material instead of something original.
Just as with the right query you could get an LLM to output a paragraph of copyrighted material, with the right query you can get Google to give you a link to copyrighted material. Does that make all search engines illegal?
Legally, it's very different. One is a link, the other content. It's the same difference as pointing someone to the street where the dealers hang out versus opening your coat and asking how many grams they want.
Websites that provide links to copyrighted material are illegal in the US. It's why torrent sites are taken down and need to be hosted in countries with different copyright laws.
So Google can be used to pirate, but that's not its intention. It requires careful queries to get Google to show pirate links. Making a tool that could be used for unintentional copyright violation illegal makes all search engines illegal.
It could even make all programming languages illegal. I could use C to write a program to add two numbers or to crawl the web and return illegal movies.
Oh. Linking and even downloading torrents is legal in my place. Hosting and sharing is not. My bad.
As I understand it, the copyright holders want the LLM to do at least the same as Google does against torrents: check that no part of the source material is in the output.
What does this mean? I don't care what you (claim) your model "could" do, or what LLMs in general could do. What we've got are services trained on images that make images, services trained on code that write code, etc. If AI companies want me to judge the AI as if that is the product, then let them give us all equal and unrestricted access to it. Then maybe I would entertain the "transformative use" argument. But what we actually get are very narrow services, where the AI just happens to be a tool used in the backend and not part of the end product the user receives.
Talking about "style" is misleading, because "style" cannot be copyrighted. It's probably impractical to even define "style" in a legal context. But an LLM doesn't copy styles; it copies patterns, whatever they happen to be. Some patterns are copyrightable, e.g. a character name and description. And it's not obvious what is OK to copy and what isn't. Is a character's action copyrightable? It depends: is the action opening a door, or is it throwing a magical ring into a volcano? If you tell a human to do something in the style of GRRM, they would try to match the medieval fantasy setting and the mood, but they would know to make their own characters and story arcs. The LLM will parrot anything with no distinction.
This is a false equivalence between how an LLM works and how a person works. The core ideas expressed here are that we should treat products and humans equivalently, and that how an LLM functions is basically how humans think. Both of these are objectively wrong.
For one, humans are living beings with feelings. The entire point of our legal system is to protect our rights. When we restrict human behavior, it is justified because it protects others; at least that's the formal reasoning. We (mostly) judge people based on what they've done and not what we know they could do. This is not how we treat products, and that makes sense. We regulate weapons because they could kill someone, but we only punish a person after they have committed a crime. Similarly, a technology designed to copy can be regulated, whereas a person copying someone else's works could be (and often is) punished for it after it is proven that they did it. Even if you think that products and humans should be treated equally, it is a fact that our justice system doesn't work that way.
People also have many more functions and goals than an LLM. At this point it is important to remember that an LLM does literally one thing: for every word it writes, it chooses the one that would "most likely" appear next based on its training data. I put "most likely" in quotes because it sounds like a form of prediction, but actually it is based on the occurrences of words in the training data only. It has nothing else to incorporate into its output, and it has no other need. It doesn't have ideas or a need to express them. An LLM can't build upon or meaningfully transform the works it copies; its only trick is mixing together enough data to make it hard for you to determine the sources. That can make it sometimes look original, but the math is clear: it is always trying to maximize the similarity to the training data, if you consider choosing the "most likely" word at every step to be a metric of similarity. Humans are generally not trying to maximize their works' similarity to other people's works. So when a creator is inspired by another creator's work, we don't automatically treat that as an infringement.
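To make the "choose the most likely next word" loop above concrete, here is a toy bigram sketch. This is a drastic simplification (real LLMs use neural networks over long token contexts, not raw word counts), and the function names and corpus are made up for illustration:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str):
    """Count, for each word, which words follow it in the training text."""
    model = defaultdict(Counter)
    words = corpus.split()
    for cur, nxt in zip(words, words[1:]):
        model[cur][nxt] += 1
    return model

def generate(model, start: str, length: int = 5) -> str:
    """Greedily emit the statistically 'most likely' next word, repeatedly."""
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break  # word never seen mid-text; nothing to predict
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)
```

With a tiny corpus like "the cat sat on the mat and the cat sat on the bed", `generate(model, "the")` just parrots its training text back, which is the point being made about similarity maximization.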
But even though comparing human behavior to LLM behavior is wrong, I'll give you an example to consider. Imagine that you write a story "in the style of GRRM". GRRM reads this and thinks that some of the similarities are a violation of his copyright, so he sues you. So far it hasn't been determined that you've done something wrong. But you go to court and say the following:
How do you think the courts would view any similarities between your works? You basically confessed that anything that looks similar was deliberately copied.
I wish the protections placed on corporate control of cultural and intellectual assets were placed on the average person's privacy instead.
Like, I really don't care that someone's publicly available book or movie from the last century is analyzed and used to create tools, but I do care that, without people's actual knowledge, an intense surveillance apparatus is being built to collect every minute piece of data about their lives and the lives of those around them, to be sold without ethical oversight or consent.
IP is bull, but privacy is a real concern. No one is going to use an extra copy of a NY Times article to hurt someone, but surveillance is used by authoritarians to oppress and harass innocent people.
If it's not infringement to input copyrighted materials, then it's not infringement to take the output.
Copyright can be enforced at both ends or neither end, not one or the other.
Because... why?
A better question is: Why not?
If copyright doesn't protect what goes in, why should it protect what comes out?
The part that you're apparently having trouble understanding is that a language model is not a human mind, and a human mind is not a language model.
Because sometimes it spits it out verbatim, and sometimes GPLed code gets spat out in the case of Copilot.
See: the time Copilot spat out the Quake inverse square root algorithm, comments and all.
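For context, the algorithm referenced there is short enough to sketch. This is a Python transliteration of the well-known Quake III fast inverse square root (the original is C; the bit-reinterpretation is done here with `struct`):

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the Quake III bit-level trick."""
    # Reinterpret the 32-bit float's bits as a signed 32-bit integer.
    i = struct.unpack(">l", struct.pack(">f", x))[0]
    # The famous magic-constant shift-and-subtract gives an initial guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack(">f", struct.pack(">l", i))[0]
    # One iteration of Newton's method refines the guess.
    y = y * (1.5 - 0.5 * x * y * y)
    return y
```

The point being made is that this exact routine is distinctive enough (magic constant and all) that reproducing it verbatim, comments included, is clearly memorized output rather than independent synthesis.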
Also, if it's legal to disregard libre/open-source licenses for this, then why isn't it legal for me to look at leaked code, which I also do not have permission to use, and use the knowledge gained from that to write something else?
Well, that sounds perfectly legal. However, mind that "leaked" implies unauthorized copying and/or a violation of trade secrets. But it's not a given that looking at such code violates any law.
And if they're not going to respect the copyleft, they are also performing unauthorised copying.
"Copyleft" means certain types of copyright licenses. Since these licenses generally allow and encourage public distribution/copying, such code is certainly not leaked. Laws pertaining to trade secrets cannot be involved in principle.
I think the copies made during AI training would be typically allowed under copyleft licenses. In any case, as it is a copyright license, it is subject to the same limitations.
Public distribution and copying are allowed, but only if the license in its entirety is respected.
And when the license is void, it's all rights reserved, right?
Sure. Is there a problem with any copyleft license?
? I'm not sure what your point is.
You asked?
Which is exactly why the output of an AI trained on copyrighted inputs should not be copyrightable. It should not become the private property of whichever company owns the language model. That would be bad for a lot more reasons than the potential for laundering open source code.