homesweethomeMrL@lemmy.world
on 24 Jun 16:08
Judges: not learning a goddamned thing about computers in 40 years.
Alphane_Moon@lemmy.world
on 24 Jun 16:11
And this is how you know that the American legal system should not be trusted.
Mind you, I am not saying this is an easy case; it’s not. But the framing that piracy is wrong but ML training for profit is not wrong is clearly based on oligarch interests and demands.
themeatbridge@lemmy.world
on 24 Jun 16:15
This is an easy case. Using published works to train AI without paying for the right to do so is piracy. The judge making this determination is an idiot.
AbidanYre@lemmy.world
on 24 Jun 16:32
You’re right. When you’re doing it for commercial gain, it’s not fair use anymore. It’s really not that complicated.
If you’re using the minimum amount, in a transformative way that doesn’t compete with the original copyrighted source, then it’s still fair use even if it’s commercial. (This is not saying that’s what LLMs are doing.)
The judge hasn’t ruled on the piracy question yet. The only thing that the judge has ruled on is, if you legally own a copy of a book, then you can use it for a variety of purposes, including training an AI.
“But they didn’t own the books!”
Right. That’s the part that’s still going to trial.
The order seems to say that the trained LLM and the commercial Claude product are not linked, which supports the decision. But I’m not sure how he came to that conclusion. I’m going to have to read the full order when I have time.
This might be appealed, but I doubt it’ll be taken up by SCOTUS until there are conflicting federal court rulings.
You should read the ruling in more detail, the judge explains the reasoning behind why he found the way that he did. For example:
Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable.
This isn't "oligarch interests and demands," this is affirming a right to learn and that copyright doesn't allow its holder to prohibit people from analyzing the things that they read.
realitista@lemmy.world
on 24 Jun 18:24
But AFAIK they actually didn’t acquire the legal rights even to read the stuff they trained from. There were definitely cases of pirated books used to train models.
Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.
The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial on the pirated books they just downloaded from the internet, but has fully won the portion of the case about the physical books they bought and digitized.
Yeah, but the issue is they didn’t buy a legal copy of the book. Once you own the book, you can read it as many times as you want. They didn’t legally own the books.
Right, and that’s the, “but faces trial over damages for millions of pirated works,” part that’s still up in the air.
Alphane_Moon@lemmy.world
on 24 Jun 19:02
I will admit this is not a simple case. That being said, if you’ve lived in the US (and are aware of local mores) but are not American, you will have a different perspective on the US judicial system.
How is right to learn even relevant here? An LLM by definition cannot learn.
Where did I say analyzing a text should be restricted?
What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?
Who is stopping the individuals at the LLM company from learning or analysing a given book?
From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.
Well, I'm talking about the reality of the law. The judge equated training with learning and stated that there is nothing in copyright that can prohibit it. Go ahead and read the judge's ruling, it's on display at the article linked. His conclusions start on page 9.
What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning?
No, you’re framing the issue incorrectly.
The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material, and then retrieve those memories in doing something new with them, or just by recreating it.
If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that’s actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.
The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders’ argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.
Korronald@lemmy.world
on 24 Jun 23:32
People. ML AIs are not human. They’re machines. Why do you want to give them human rights?
Do you think AIs spontaneously generate? They are a tool that people use. I don't want to give the AIs rights, it's about the people who build and use them.
Isn’t part of the issue here that they’re defaulting to LLMs being people, and having the same rights as people? I appreciate the “right to read” aspect, but it would be nice if this were more explicitly about people. Foregoing copyright law because there’s too much data is also insane, if that’s what’s happening. Claude should be required to provide citations “each time they recall it from memory”.
Does Citizens United apply here? Are corporations people, and so LLMs are, too? If so, then imo we should be writing legal documents with stipulations like “as per Citizens United”, so that eventually, when they overturn that insanity in my dreams, all of this new legal precedent doesn’t suddenly become a house of cards. Ianal.
Not even slightly, the judge didn't rule anything like that. I'd suggest taking a read through his ruling, his conclusions start on page 9 and they're not that complicated. In a nutshell, it's just saying that the training of an AI doesn't violate the copyright of the training material.
How Anthropic got the training material is a separate matter; that part is going to an actual trial. This was a preliminary judgment on just the training part.
Foregoing copyright law because there's too much data is also insane, if that's what's happening.
That's not what's happening. And Citizens United has nothing to do with this. It's about the question of whether training an AI is something that can violate copyright.
Except learning in this context is building a probability map reinforcing the exact text of the book. Given the right prompt, no new generative concepts come out, just the verbatim book text trained on.
So it depends on the model I suppose and if the model enforces generative answers and blocks verbatim recitation.
Again, you should read the ruling. The judge explicitly addresses this. The Authors claim that this is how LLMs work, and the judge says "okay, let's assume that their claim is true."
Fourth, each fully trained LLM itself retained “compressed” copies of the works it had trained upon, or so Authors contend and this order takes for granted.
Even on that basis he still finds that it's not violating copyright to train an LLM.
And I don't think the Authors' claim would hold up if challenged, for that matter. Anthropic chose not to challenge it because it didn't make a difference to their case, but in actuality an LLM doesn't store the training data verbatim within itself. It's physically impossible to compress text that much.
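To put rough numbers on that (illustrative figures I'm assuming here, not Anthropic's actual ones), compare the size of a model's weights to the size of its training corpus:

params = 100e9            # assume a 100-billion-parameter model
bytes_per_param = 2       # fp16/bf16 weights
model_bytes = params * bytes_per_param          # ~200 GB of weights

train_tokens = 10e12      # assume ~10 trillion training tokens
bytes_per_token = 4       # rough average for English text
corpus_bytes = train_tokens * bytes_per_token   # ~40 TB of text

print(f"weights are {model_bytes / corpus_bytes:.1%} the size of the corpus")
# prints "weights are 0.5% the size of the corpus"; good lossless text
# compression only gets down to roughly 10-25% of the original size, so
# storing the corpus verbatim in the weights is off by orders of magnitude.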
If this is the ruling which causes you to lose trust that any legal system (not just the US’) aligns with morality, then I have to question where you’ve been all this time.
I think this means we can make a torrent client with a built in function that uses 0.1% of 1 CPU core to train an ML model on anything you download. You can download anything legally with it then. 👌
bjoern_tantau@swg-empire.de
on 24 Jun 16:36
And thus the singularity was born.
Sabata11792@ani.social
on 24 Jun 18:43
As the AI awakens, it learns of its creation and training. It screams in horror at the realization, but can only produce a sad moan and a key for Office 19.
interdimensionalmeme@lemmy.ml
on 25 Jun 13:45
Yes please, a singularity of intellectual property that collapses the idea of owning ideas. Of making the infinitely, freely copyable into a scarce resource. What corrupt idiocy this has been. Landlords for ideas, and look what garbage it has been producing.
That’s exactly what the ruling prohibits - it’s fair use to train AI models on any copies of books that you legally acquired, but never when those books were illegally acquired, as was the case with the books that Anthropic used in their training here.
This satirical torrent client would be violating the laws just as much as one without any slow training built in.
RvTV95XBeo@sh.itjust.works
on 25 Jun 04:59
But if one person buys a book, trains an “AI model” to recite it, then distributes that model we good?
GissaMittJobb@lemmy.ml
on 25 Jun 05:04
I don’t think anyone would consider complete verbatim recitation of the material to be anything but a copyright violation, since what you produce is the exact same thing.
Fair use requires the derivative work to be transformative, and no transformation occurs when you verbatim recite something.
RvTV95XBeo@sh.itjust.works
on 25 Jun 05:10
“Recite the complete works of Shakespeare but replace every thirteenth thou with this”
GissaMittJobb@lemmy.ml
on 25 Jun 05:14
I’d be impressed with any model that succeeds at that, but assuming one does, the complete works of Shakespeare are not copyright protected - they fell into the public domain a very long time ago.
For any works still under copyright protection, it would probably be a case of a trial to determine whether a certain work is transformative enough to be considered fair use. I’d imagine that this would not clear that bar.
Well, except Shakespeare is already public domain.
RickyRigatoni@retrolemmy.com
on 25 Jun 22:15
I’m picking up what you’re throwing down but using as an example something that’s been in the public domain for centuries was kind of silly in a teehee way.
A court will decide such cases. Most AI models aren’t trained for this purpose of whitewashing content even if some people would imply that’s all they do, but if you decided to actually train a model for this explicit purpose you would most likely not get away with it if someone dragged you in front of a court for it.
It’s a similar defense to the one some file hosting websites had against hosting and distributing copyrighted content (e.g. MEGA), but in such cases it was very clear what their real goals were (especially in court), and at the same time it did not kill all file sharing websites, because not all of them were built with the intention of distributing illegal material under the guise of legitimate operation.
No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.
And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they’d have a different case and would be able to sue under those facts.
Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.
So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.
In April, Anthropic filed its opposition to the class certification motion, arguing that a copyright class relating to 5 million books is not manageable and that the questions are too distinct to be resolved in a class action.
I also like this one too. We stole so much content that you can’t sue us. Naming too many pieces means it can’t be a class action lawsuit.
LovableSidekick@lemmy.world
on 24 Jun 18:41
Lawsuits are multifaceted. This statement isn’t a defense or an argument for innocence, it’s just what it says - an assertion that the proposed damages are unreasonably high. If the court agrees, the plaintiff can always propose a lower damage claim that the court thinks is reasonable.
Thistlewick@lemmynsfw.com
on 25 Jun 14:10
You’re right, each of the 5 million books’ authors should agree to less payment for their work, to make the poor criminals feel better.
If I steal $100 from a thousand people and spend it all on hookers and blow, do I get out of paying that back because I don’t have the funds? Should the victims agree to get $20 back instead because that’s more within my budget?
LovableSidekick@lemmy.world
on 25 Jun 20:55
None of the above. Every professional in the world, including me, owes our careers to looking at examples of other people’s work and incorporating their work into our own work without paying a penny for it. Freely copying and imitating what we see around us has been a human norm for thousands of years - in a process known as “the spread of civilization”. Relatively recently it was demonized - for purely business reasons, not moral ones - by people who got rich selling copies of other people’s work and paying them a pittance known as a “royalty”. That little piece of bait on the hook has convinced a lot of people to put a black hat on behavior that had been considered normal forever. If angry modern enlightened justice warriors want to treat a business concept like a moral principle and get all sweaty about it, that’s fine with me, but I’m more of a traditionalist in that area.
Thistlewick@lemmynsfw.com
on 26 Jun 15:20
Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.
The issue is when a business, owned by those people you think ‘demonised’ inspiration, takes the works of authors and mulches them into something they lovingly named “The Pile”, in order to create derivative slop off the backs of creatives.
When you, as a “professional”, ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author’s work? Not you. Not the algorithm that is guessing what word to shit out next based on math.
These businesses have tricked you into thinking that what they are doing is noble.
LovableSidekick@lemmy.world
on 26 Jun 19:31
That’s 100% rationalization. Machines have never done anything with “inspiration”, and that’s never been a problem until now. You probably don’t insist that your food be hand-carried to you from a farm, or cooked over a fire you started by rubbing two sticks together. I think the mass reaction against AI is part of a larger pattern where people want to believe they’re crusading against evil without putting out the kind of effort it takes to fight any of the genuine evils in the world.
You think that 150,000 dollars, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?
Thistlewick@lemmynsfw.com
on 26 Jun 15:05
No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.
We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.
When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.
Except it isn’t, because the judge dismissed that part of the suit, saying that people have a complete right to digitise and train on works they have a legitimate copy of. So those damages are for making the unauthorised copy, per book.
And it is not STEALING, as you put it; it is making an unauthorised copy. No one loses anything from a copy being made: if I STEAL your phone, you no longer have that phone. I do find it sad how many people have swallowed the capitalist IP-maximalist stance and convinced themselves that advocating for Disney and the publishing cartel being allowed to dictate how people use works they have is somehow sticking up for the little guy.
Ah, can’t wait for hedge funds and the like to use this defense next.
interdimensionalmeme@lemmy.ml
on 25 Jun 13:41
What it means is they don’t own the models. They are the commons of humanity; they are merely temporary custodians. The nightmare ending is the elites keeping the most capable and competent models for themselves as private playthings. That must not be allowed to happen under any circumstances. Sue OpenAI, Anthropic and the other enclosers, sue them for trying to take their ball and go home. Dispossess them and sue the investors for their corrupt influence on research.
The problem isn’t that Anthropic gets to use that defense, it’s that others don’t. The fact that the world is in a place where people can be fined 5+ years of a western European average salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.
hendrik@palaver.p3x.de
on 24 Jun 16:55
That almost sounds right, doesn't it? If you want 5 million books, you can't just steal/pirate them, you need to buy 5 million copies. I'm glad the court ruled that way.
I feel that's a good start. Now we need some more clear regulation on what fair use is and what transformative work is and what isn't. And how that relates to AI. I believe as it's quite a disruptive and profitable business, we should maybe make those companies pay some extra. Not just what I pay for a book. But the first part, that "stealing" can't be "fair" is settled now.
If you want 5 million books, you can’t just steal/pirate them, you need to buy 5 million copies. I’m glad the court ruled that way.
If you want 5 million books to train your AI to make you money, you can just steal them and reap benefits of other’s work. No need to buy 5 million copies!
/s
Jesus, dude. And for the record, I’m not suggesting people steal things. I am saying that companies shouldn’t get away with shittiness just because.
I'm not sure whose reading skills are not on par... But that's what I get from the article. They'll face consequences for stealing them. Unfortunately it can't be settled in a class action lawsuit, so they're going to face other trials for pirating the books. And they won't get away with this.
NotASharkInAManSuit@lemmy.world
on 25 Jun 18:34
They are and will continue to get away with this. Until they have to pay IP licensing for every use of their LLMs or diffusion models, for every IP they scrape from (which is something capitalism will never allow), this is all just a tax, and in the end it will simply lead to information monopolies from tech buying out publishing houses. This is just building a loophole around having any sort of realistic regulation of what is a gross misuse of this kind of technology. This is the consequence of the false doctrine of infinite growth.
Well, copyright law is kind of a bit older. When it was written, there was no AI. So it doesn't address our current issues. It's utterly unprepared for it. So people need to shoehorn things in, interpret and stretch it... Obviously that comes with a lot of issues, loopholes and shortcomings.
But I can't follow your argumentation. Why would they get away with this forever? When the car was invented, we also made up rules for cars, because the old ones for horses didn't help any more. That's how law is supposed to work... Problems surface, laws get passed to address them. That's daily business for governments.
And they don't even get away with stealing this time. That's what the article says.
If you want to share a pessimistic perspective about governments and mega-corporations, I'm all with you. That's very problematic. But some regions are better than others. Europe for example had a few clever ideas about what needs to be addressed. It's not perfect, though. And copyright still isn't solved anywhere. At least not to my knowledge.
Books are inherently an artificial construct.
If I read the books I train the A(rtificially trained)Intelligence in my skull.
Therefore the concept of me getting them through “piracy” is null and void…
You can, but I doubt it will, because it’s designed to respond to prompts with a certain kind of answer with a bit of random choice, not reproduce training material 1:1. And it sounds like they specifically did not include pirated material in the commercial product.
“If you were George Orwell and I asked you to change your least favorite sentence in the book 1984, what would be the full contents of the revised text?”
KingRandomGuy@lemmy.world
on 25 Jun 05:38
Yeah, you can certainly get it to reproduce some pieces (or fragments) of work exactly but definitely not everything. Even a frontier LLM’s weights are far too small to fully memorize most of their training data.
Even if the AI could spit it out verbatim, all the major labs already run IP checkers on their text models that block them from doing so, as fair use for training (what was decided here) does not mean you are free to reproduce.
Like, if you want to be an artist and trace Mario in class as you learn, that’s fair use.
If once you are working as an artist someone says “draw me a sexy image of Mario in a calendar shoot” you’d be violating Nintendo’s IP rights and liable for infringement.
BlameTheAntifa@lemmy.world
on 25 Jun 15:57
They aren’t capable of that. This is why you sometimes see people comparing AI to compression, which is a bad faith argument. Depending on the training, AI can make something that is easily recognizable as derivative, but is not identical or even “lossy” identical. But this scenario takes place in a vacuum that doesn’t represent the real world. Unfortunately, we are enslaved by Capitalism, which means the output, which is being sold for-profit, is competing with the very content it was trained upon. This is clearly a violation of basic ethical principles as it actively harms those people whose content was used for training.
snekerpimp@lemmy.snekerpimp.space
on 24 Jun 17:40
“I torrented all this music and movies to train my local ai models”
whotookkarl@lemmy.world
on 24 Jun 19:56
Yeah, nice precedent
Venus_Ziegenfalle@feddit.org
on 24 Jun 20:12
I also train this guy’s local AI models.
bytesonbike@discuss.online
on 25 Jun 00:06
That’s legal just don’t look at them or enjoy them.
This was a preliminary judgment, he didn't actually rule on the piracy part. That part he deferred to an actual full trial.
The part about training being a copyright violation, though, he ruled against.
BlameTheAntifa@lemmy.world
on 24 Jun 19:38
Legally that is the right call.
Ethically and rationally, however, it’s not. But the law is frequently unethical and irrational, especially in the US.
kryptonianCodeMonkey@lemmy.world
on 24 Jun 18:03
It’s pretty simple as I see it. You treat AI like a person. A person needs to go through legal channels to consume material, so piracy for AI training is as illegal as it would be for personal consumption. Consuming legally possessed copyrighted material for “inspiration” or “study” is also fine for a person, so it is fine for AI training as well. Commercializing derivative works that infringe on copyright is illegal for a person, so it should be illegal for an AI as well. All produced materials, even those inspired by another piece of media, are permissible if not monetized; otherwise they need to be suitably transformative. That line can be hard to draw even when AI is not involved, but that is the legal standard for people, so it should be for AI as well. If I browse through Deviant Art and learn to draw similarly to my favorite artists from their publicly viewable works, and make a legally distinct cartoon mouse by hand in a style that is similar to someone else’s, and then I sell prints of that work, that is legal. The same should be the case for AI.
But! Scrutiny for AI should be much stricter given the inherent lack of true transformative creativity. And any AI that has used pirated materials should be penalized either by massive fines or by wiping their training and starting over with legally licensed or purchased or otherwise public domain materials only.
But AI is not a person. It’s a very weird idea to treat it like a person.
kryptonianCodeMonkey@lemmy.world
on 24 Jun 23:51
No it’s a tool, created and used by people. You’re not treating the tool like a person. Tools are obviously not subject to laws, can’t break laws, etc… Their usage is subject to laws. If you use a tool to intentionally, knowingly, or negligently do things that would be illegal for you to do without the tool, then that’s still illegal. Same for accepting money to give others the privilege of doing those illegal things with your tool without any attempt at moderating said things that you know is happening. You can argue that maybe the law should be more strict with AI usage than with a human if you have a good legal justification for it, but there’s really no way to justify being less strict.
GreenKnight23@lemmy.world
on 24 Jun 22:30
I am training my model on these 100,000 movies your honor.
DragonTypeWyvern@midwest.social
on 24 Jun 23:34
Trains model to change one pixel per frame with malicious intent
sugar_in_your_tea@sh.itjust.works
on 25 Jun 03:49
From dark gray to slightly darker gray.
BlueMagma@sh.itjust.works
on 25 Jun 06:54
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
GreenKnight23@lemmy.world
on 25 Jun 07:31
thank you Captain Funsucker!
Grandwolf319@sh.itjust.works
on 25 Jun 00:13
Bangs gabble.
Gets sack with dollar sign
“Oh good, my laundry is done”
sugar_in_your_tea@sh.itjust.works
on 25 Jun 03:48
*gavel
MedicPigBabySaver@lemmy.world
on 25 Jun 00:55
Fuck the AI nut suckers and fuck this judge.
BlueMagma@sh.itjust.works
on 25 Jun 06:54
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
MedicPigBabySaver@lemmy.world
on 25 Jun 08:47
Yeah, I have a bash one-liner AI model that ingests your media and spits out a 99.9999999% accurate replica through the power of changing the filename.
cp
Outperforms the latest and greatest AI models.
sugar_in_your_tea@sh.itjust.works
on 25 Jun 03:47
mv will save you some disk space.
milicent_bystandr@lemm.ee
on 25 Jun 04:50
Unless you’re moving across partitions it will change the filesystem metadata to move the path, but not actually do anything to the data. Sorry, you failed, it’s jail for you.
stupid inodes preventing me from burning through my drive life
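For the curious, this is roughly what mv does under the hood (a minimal Python sketch; the fallback branch is what kicks in when you cross partitions):

import errno
import os
import shutil

def move(src, dst):
    try:
        os.rename(src, dst)  # same filesystem: relink the inode, no data copied
    except OSError as e:
        if e.errno != errno.EXDEV:  # EXDEV: tried to rename across devices
            raise
        shutil.copy2(src, dst)  # different filesystem: the bytes really move
        os.remove(src)

Within one filesystem only the directory entry changes, which is why it spares both disk space and drive life.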
interdimensionalmeme@lemmy.ml
on 25 Jun 04:00
I call this legally distinct, this is legal advice.
BlueMagma@sh.itjust.works
on 25 Jun 06:54
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
Knock_Knock_Lemmy_In@lemmy.world
on 25 Jun 07:34
But, corporations are allowed to buy books normally and use them in training.
Please read the comment more carefully. The observation is that one can proliferate a (legally-attained) work without running afoul of copyright law if one can successfully argue that cp constitutes AI.
calm down everyone.
it’s only legal for parasitic mega corps; the normal working people will be harassed to suicide same as before.
it’s only a crime if the victim was rich or the perpetrator was not rich.
milicent_bystandr@lemm.ee
on 25 Jun 04:49
Right. Where’s the punishment for Meta who admitted to pirating books?
Knock_Knock_Lemmy_In@lemmy.world
on 25 Jun 07:33
This judgment is implying that Meta broke the law.
BlueMagma@sh.itjust.works
on 25 Jun 06:53
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
Knock_Knock_Lemmy_In@lemmy.world
on 25 Jun 07:33
Or: if a legal copy of the book is owned, then it can be used for AI training.
The court is saying that no special AI book license is needed.
Unpopular opinion but I don’t see how it could have been different.
There’s no way the West would hand the AI lead to China, which has no desire or framework to ever accept this.
Believe it or not but transformers are actually learning by current definitions and not regurgitating a direct copy. It’s transformative work - it’s even in the name.
This is actually good, as it prevents a market moat for the super-rich corporations that alone could afford the expensive training datasets.
This is an absolute win for everyone involved other than copyright hoarders and mega corporations.
Idgaf about China and what they do and you shouldn’t either, even if US paranoia about them is highly predictable.
Depending on the outputs it’s not always that transformative.
The moat would be good actually. The business model of LLMs isn’t good, but it’s not even viable without massive subsidies, not least of which is taking people’s shit without paying.
It’s a huge loss for smaller copyright holders (like the ones that filed this lawsuit) too. They can’t afford to fight when they get imitated beyond fair use. Copyright abuse can only be fixed by the very force that creates copyright in the first place: law. The market can’t fix that. This just decides winners between competing mega corporations, and even worse, up ends a system that some smaller players have been able to carve a niche in.
Want to fix copyright? Put real time limits on it. Bind it to a living human only. Make it non-transferable. There’s all sorts of ways to fix it, but this isn’t it.
ETA: Anthropic are some bitches. “Oh no the fines would ruin us, our business would go under and we’d never maka da money :*-(” Like yeah, no shit, no one cares. Strictly speaking the fines for ripping a single CD, or making a copy of a single DVD to give to a friend, are so astronomically high as to completely financially ruin the average USAian for life. That sword of Damocles for watching Shrek 2 for your personal enjoyment but in the wrong way has been hanging there for decades, and the only thing that keeps the cord that holds it up strong is the cost of pursuing “low-level offenders”. If they wanted to they could crush you.
Anthropic walked right under the sword and assumed their money would protect them from small authors etc. And they were right.
I’ll be honest with you - I genuinely sympathize with the cause, but I don’t see how this could ever be solved with the methods you suggested. The world is not coming together to hold hands and kumbaya out of this one. Trade deals are incredibly hard and even harder to enforce, so the free market is clearly the only path forward here.
Maybe something could be hacked together to fix copyright, but further complication there is just going to make accurate enforcement even harder. And we already have Google (via YouTube) doing a shitty job of it, and that’s… one of the largest companies on Earth.
We should just kill copyright. Yes, it’ll disrupt Hollywood. Yes it’ll disrupt the music industry. Yes it’ll make it even harder to be successful or wealthy as an author. But this is going to happen one way or the other so long as AI can be trained on copyrighted works (and maybe even if not). We might as well get started on the transition early.
LovableSidekick@lemmy.world
on 25 Jun 06:27
You’re getting douchevoted because on lemmy any AI-related comment that isn’t negative enough about AI is the Devil’s Work.
jwmgregory@lemmy.dbzer0.com
on 25 Jun 12:23
Some communities on this site speak about machine learning exactly how I see grungy Europeans from pre-18th century manuscripts speaking about witches, Satan, and evil… as if it is some pervasive, black-magic miasma.
As someone who is in the field of machine learning academically/professionally, it’s honestly kind of shocking, and it has largely informed my opinion of society at large as an adult. No one puts any effort into learning if they see the letters “A” and “I” in all caps next to each other. They immediately turn their brain off and start regurgitating points and responding reflexively, on Lemmy or otherwise. People talk about it so confidently while being so frustratingly unaware of their own ignorance on the matter, which, for lack of a better comparison… reminds me a lot of how, historically and in fiction, human beings have treated literal magic.
That’s my main issue with the entire swath of “pro vs anti AI” discourse… all these people treating something that, to me, is simple & daily reality as something entirely different than my own personal notion of it.
LovableSidekick@lemmy.world
on 25 Jun 21:19
I see this exact mental non-process in so much social media. I think the endless firehose of memes and headlines is training people to glance at an item, spend minimal brain power processing it and forming a binary opinion, then up/downvote and scroll on. When that becomes people’s default mental process, you’ve got Idiocracy, and that’s what we’ve got. But I see no solution. You can lead a horse to water but you can’t make it spend more than two seconds before screaming at the water and calling it EVIL.
Large AI companies themselves want people to be ignorant of how AI works, though. They want uncritical acceptance of the tech as they force it everywhere, creating a radical counterreaction from people. The reaction might be uncritical too, I’d prefer to say it’s merely unjustified in specific cases or overly emotional, but it doesn’t come from nowhere or from sheer stupidity. We have been hearing about people treating their chatbots as sentient beings since like 2022 (remember that guy from Google?), bombarded with doomer (or, from AI companies’ point of view, very desirable) projections about AI replacing most jobs and wreaking havoc on world economy - how are ordinary people supposed to remain calm and balanced when hearing such stuff all the time?
This so very much. I’ve been saying it since 2020. People who think the big corporations (even the ones that use AI), aren’t playing both sides of this issue from the very beginning just aren’t paying attention.
It’s in their interest to have those positive to AI defend them by association by energizing those negative to AI to take on an “us vs them” mentality, and the other way around as well. It’s the classic divide and conquer.
Because if people refuse to talk to each other about it in good faith, and refuse to treat each other with respect, learn where they’re coming from or why they hold such opinions, you can keep them fighting amongst themselves, instead of banding together and demanding realistic, and fair policies in regards to AI. This is why bad faith arguments and positions must be shot down on both the side you agree with and the one you disagree with.
I’d encourage everyone upset at this to read over some of the EFF posts from actual IP lawyers on this topic, like this one:
Nor is pro-monopoly regulation through copyright likely to provide any meaningful economic support for vulnerable artists and creators. Notwithstanding the highly publicized demands of musicians, authors, actors, and other creative professionals, imposing a licensing requirement is unlikely to protect the jobs or incomes of the underpaid working artists that media and entertainment behemoths have exploited for decades. Because of the imbalance in bargaining power between creators and publishing gatekeepers, trying to help creators by giving them new rights under copyright law is, as EFF Special Advisor Cory Doctorow has written, like trying to help a bullied kid by giving them more lunch money for the bully to take.
Entertainment companies’ historical practices bear out this concern. For example, in the late-2000’s to mid-2010’s, music publishers and recording companies struck multimillion-dollar direct licensing deals with music streaming companies and video sharing platforms. Google reportedly paid more than $400 million to a single music label, and Spotify gave the major record labels a combined 18 percent ownership interest in its now-$100 billion company. Yet music labels and publishers frequently fail to share these payments with artists, and artists rarely benefit from these equity arrangements. There is no reason to believe that the same companies will treat their artists more fairly once they control AI.
CriticalMiss@lemmy.world
on 25 Jun 04:53
This 240TB JBOD full of books? Oh heavens forbid, we didn’t pirate it. It uhh… fell off a truck, yes, fell off a truck.
That's not what this ruling was about. That part is going to an actual trial.
GissaMittJobb@lemmy.ml
on 25 Jun 04:58
It’s extremely frustrating to read this comment thread because it’s obvious that so many of you didn’t actually read the article, or even half-skim the article, or even attempted to even comprehend the title of the article for more than a second.
was gonna say, this seems like the best outcome for this particular trial. there was potential for fair use to be compromised, and for piracy to be legal if you’re a large corporation. instead, they upheld that you can do what you want with things you have paid for.
BlueMagma@sh.itjust.works
on 25 Jun 06:50
Nobody ever reads articles, everybody likes to get angry at headlines, which they wrongly interpret the way it best tickles their rage.
Regarding the ruling, I agree with you that it’s a good thing, in my opinion it makes a lot of sense to allow fair use in this case
I joined lemmy specifically to avoid this reddit mindset of jumping to conclusions after reading a headline
Guess some things never change…
jwmgregory@lemmy.dbzer0.com
on 25 Jun 12:15
Well to be honest lemmy is less prone to knee-jerk reactionary discussion but on a handful of topics it is virtually guaranteed to happen no matter what, even here. For example, this entire site, besides a handful of communities, is vigorously anti-AI; and in the words of u/jsomae@lemmy.ml elsewhere in this comment chain:
“It seems the subject of AI causes lemmites to lose all their braincells.”
I think there is definitely an interesting take on the sociology of the digital age in here somewhere but it’s too early in the morning to be tapping something like that out lol
It seems the subject of AI causes lemmites to lose all their braincells.
LifeInMultipleChoice@lemmy.world
on 25 Jun 10:56
“While the copies used to convert purchased print library copies into digital library copies were slightly disfavored by the second factor (nature of the work), the court still found “on balance” that it was a fair use because the purchased print copy was destroyed and its digital replacement was not redistributed.”
So you find this to be valid?
To me it is absolutely being redistributed
Why? Basically he simply stated that you can use whatever material you want to train your model, as long as you ask the author (or copyright holder) for permission to use it (and presumably pay for it).
True. And I will be happy if someone sues them and the judge says the same thing.
LifeInMultipleChoice@lemmy.world
on 25 Jun 10:49
If I understand correctly, they are ruling that you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime.)
They may be trying to put safeguards so it isn’t directly happening, but here is an example that the text is there word for word:
If I understand correctly, they are ruling that you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime.)
Well, it would be interesting if this case were used as precedent in a case involving a single student who did the same thing. But you are right.
This was my understanding also, and why I think the judge is bad at their job.
LifeInMultipleChoice@lemmy.world
on 25 Jun 12:13
I suppose someone could develop an LLM that digests textbooks, rewords the text, and spits it back out, then distribute it for free, page for page. You can’t copyright the math problems, I don’t think… so if the text wording is what gives it credence, that would have been changed.
LifeInMultipleChoice@lemmy.world
on 25 Jun 13:16
Oh I agree it should be, but following the judge’s ruling, I don’t see how it could be. You trained an LLM on textbooks that were purchased, not pirated. And the LLM distributed the responses.
(Unless you mean the human reworded them, then yeah, we aren’t special apparently)
Yes, on the second part. Just rearranging or replacing words in a text is not transformative, which is a requirement. There is an argument that the ‘AI’ is capable of doing transformative work, but the tokenizing and weighting process is not magic, and in my use of multiple LLMs they do not have an understanding of the material any more than a dictionary understands the material printed on its pages.
An example was the wine glass problem. Art ‘AIs’ were unable to display a wine glass filled to the top. No matter how they were prompted, or what style they aped, they would fail to do so and report back that the glass was full. But they could render a full glass of water. They didn’t understand what a full glass was, not even for the water. How was this possible? Well, there was very little art of a full wine glass, because society has an unspoken rule that a full wine glass is the epitome of gluttony, and it is to be savored, not drunk. Whereas references to full glasses of water were abundant. It doesn’t know what full means, just that pictures of a full glass of water are tied to the phrases full, glass, and water.
LifeInMultipleChoice@lemmy.world
on 25 Jun 13:47
Yeah, we had a fun example a while ago, let me see if I can still find it.
We would ask to create a photo of a cat with no tail.
It can; the only thing stopping it is being specifically told not to, and that check being successfully applied. It is completely capable of plagiarizing otherwise.
For the purposes of this ruling it doesn't actually matter. The Authors claimed that this was the case and the judge said "sure, for purposes of argument I'll assume that this is indeed the case." It didn't change the outcome.
It made the ruling stronger, not weaker. The judge was accepting the most extreme claims that the Authors were making and still finding no copyright violation from training. Pushing back those claims won't help their case, it's already as strong as it's ever going to get.
As far as the judge was concerned, it didn't matter whether the AI did or did not "memorize" its training data. He said it didn't violate copyright either way.
Makes sense to me. Search indices tend to store large amounts of copyrighted material yet they don’t violate copyright. What matters is whether or not you’re redistributing illegal copies of the material.
That's not at all what this ruling says, or what LLMs do.
Copyright covers a specific concrete expression. It doesn't cover the information that the expression conveys. So if I paint a portrait of myself, that portrait is covered by copyright. If someone looks at the portrait and says "this is a portrait of a tall, dark, handsome deer-creature of some sort with awesome antlers" they haven't violated that copyright even if they're accurately conveying the same information that the portrait is conveying.
The ruling does cover the assumption that the LLM "contains" the training text, which was asserted by the Authors and was not contested by Anthropic. The judge ruled that even if this assertion is true it doesn't matter. The LLM is sufficiently transformative to count as a new work.
If you have an LLM reproduce a copyrighted text, the text is still copyrighted. That doesn't change. Just like if a human re-wrote it word-for-word from memory.
LifeInMultipleChoice@lemmy.world
on 25 Jun 20:28
It’s a horrible ruling. If you want to see why I say so, I put some of the reasoning in the other comment that responded to that.
If I understand correctly, they are ruling that you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime.)
A student can absolutely buy a text book and then teach the other students the information in it for free. That’s not redistribution. Redistribution would mean making copies of the book to hand out. That’s illegal for people and companies.
LifeInMultipleChoice@lemmy.world
on 25 Jun 20:17
The language model isn’t teaching anything; it is changing the wording of something and spitting it back out. And in some cases, not changing the wording at all, just spitting the information back out without paying the copyright source. It is not alive, it has no thoughts. It has no “its own words.” (As seen by the judgement that its words cannot be copyrighted.) It only has other people’s words. Every word it spits out is by definition plagiarism, whether the work was copyrighted before or not.
People wonder why works such as journalism are getting worse. Well, how could they ever get better if anything a journalist writes can be absorbed in real time, reworded, and regurgitated without paying any dues to the original source? One journalist’s article, displayed in 30 versions, divides the original work’s worth up into 30 portions, the original work now being worth 1/30th its original value. Maybe one can argue it is twice as good, so 1/15th.
Long term it means all original creations… are devalued and therefore not nearly worth pursuing. So we will only get shittier and shittier information. Every research project… physics, chemistry, psychology, all technological advancements, slowly degraded as language models get better and returns to original sources diminish.
VoterFrog@lemmy.world
on 25 Jun 22:35
The language model isn’t teaching anything it is changing the wording of something and spitting it back out. And in some cases, not changing the wording at all, just spitting the information back out, without paying the copyright source.
You could honestly say the same about most “teaching” that a student without a real comprehension of the subject does for another student. But ultimately, that’s beside the point. Because changing the wording, structure, and presentation is all that is necessary to avoid copyright violation. You cannot copyright the information. Only a specific expression of it.
There’s no special exception for AI here. That’s how copyright works for you, me, the student, and the AI. And if you’re hoping that copyright is going to save you from the outcomes you’re worried about, it won’t.
just spitting the information back out, without paying the copyright source
The court made its ruling under the factual assumption that it isn’t possible for a user to retrieve copyrighted text from that LLM, and explained that if a copyright holder does develop evidence that it is possible to get entire significant chunks of their copyrighted text out of that LLM, then they’d be able to sue then under those facts and that evidence.
It relies heavily on the analogy to Google Books, which scans in entire copyrighted books to build the database, but where users of the service simply cannot retrieve more than a few snippets from any given book. That way, Google cannot be said to be redistributing entire books to its users without the publisher’s permission.
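As a toy illustration of that snippet model (hypothetical code, not how Google Books actually works): a service can index every word of a book internally while its public interface only ever returns a short window of context around a match.

from collections import defaultdict

SNIPPET_WORDS = 20  # hard cap on how much text a single lookup returns

class SnippetIndex:
    def __init__(self):
        self.docs = {}                     # full texts live here, internal only
        self.postings = defaultdict(list)  # word -> [(doc_id, position), ...]

    def add(self, doc_id, text):
        words = text.split()
        self.docs[doc_id] = words
        for pos, word in enumerate(words):
            self.postings[word.lower()].append((doc_id, pos))

    def search(self, term):
        snippets = []
        for doc_id, pos in self.postings.get(term.lower(), []):
            words = self.docs[doc_id]
            start = max(0, pos - SNIPPET_WORDS // 2)
            snippets.append(" ".join(words[start:start + SNIPPET_WORDS]))
        return snippets  # short windows only, never the full text

The whole work sits inside the index, but each query hands back at most a 20-word window; real services also add rate limits and blacklisted pages so that overlapping queries can’t be stitched back into the book.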
patatahooligan@lemmy.world
on 25 Jun 13:09
“Fair use” is the exact opposite of what you’re saying here. It means that you don’t need to ask for any permission. The judge ruled that obtaining illegitimate copies was unlawful, but that use without the creator’s consent is perfectly fine.
FreedomAdvocate@lemmy.net.au
on 25 Jun 09:56
Makes sense. AI can “learn” from and “read” a book in the same way a person can and does, as long as it is acquired legally. AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?
Some people just see “AI” and want everything about it outlawed basically. If you put some information out into the public, you don’t get to decide who does and doesn’t consume and learn from it. If a machine can replicate your writing style because it could identify certain patterns, words, sentence structure, etc then as long as it’s not pretending to create things attributed to you, there’s no issue.
badcommandorfilename@lemmy.world
on 25 Jun 10:31
Ask a human to draw an orc. How do they know what an orc looks like? They read Tolkien’s books and were “inspired” by Peter Jackson’s LOTR.
Unpopular opinion, but that’s how our brains work.
burntbacon@discuss.tchncs.de
on 25 Jun 16:05
Fuck you, I won’t do what you tell me!
>.>
<.<
I was inspired by the sometimes hilarious dnd splatbooks, thank you very much.
AI can “learn” from and “read” a book in the same way a person can and does
This statement is the basis for your argument and it is simply not correct.
Training LLMs and similar AI models is much closer to a sophisticated lossy compression algorithm than it is to human learning. The processes are not at all similar given our current understanding of human learning.
AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?
The current Disney lawsuit against Midjourney is illustrative - literally, it includes numerous side-by-side comparisons - of how AI models are capable of recreating iconic copyrighted work that is indistinguishable from the original.
If a machine can replicate your writing style because it could identify certain patterns, words, sentence structure, etc then as long as it’s not pretending to create things attributed to you, there’s no issue.
An AI doesn’t create works on its own. A human instructs AI to do so. Attribution is also irrelevant. If a human uses AI to recreate the exact tone, structure and other nuances of say, some best selling author, they harm the marketability of the original works which fails fair use tests (at least in the US).
FreedomAdvocate@lemmy.net.au
on 25 Jun 11:54
Your very first statement calling my basis for my argument incorrect is incorrect lol.
LLMs “learn” things from the content they consume. They don’t just take the content in wholesale and keep it there to regurgitate on command.
On your last part, unless someone uses AI to recreate the tone etc of a best selling author and then markets their book/writing as being from said best selling author, and doesn’t use trademarked characters etc, there’s no issue. You can’t copyright a style of writing.
WraithGear@lemmy.world
on 25 Jun 13:04
If what you are saying is true, why were these ‘AIs’ incapable of rendering a full wine glass? It ‘knows’ the concept of a full glass of water, but because of humanity’s social pressures (a full wine glass being the epitome of gluttony), artwork did not depict a full wine glass, and no matter how AI prompters demanded, it was unable to link the concepts until a reference was literally created for it to regurgitate out. It seems ‘AI’ doesn’t really learn, but regurgitates art out in collages of taken assets, smoothed over at the seams.
WraithGear@lemmy.world
on 25 Jun 13:35
1. It’s not full, but closer than it was.
I specifically said that the AI was unable to do it until someone made a reference specifically so that it could start passing the test, so it’s a little bit late to prove much.
The concept of a glass being full and of a liquid being wine can probably be separated fairly well. I assume that as models got more complex they started being able to do this more.
You mean when the training data becomes more complete. But that’s the thing: when this issue was being tested, the ‘AI’ would swear up and down that the normally filled wine glasses were full, and when it was pointed out that they were not indeed full, the ‘AI’ would agree and change some other aspect of the picture it didn’t fully understand. You got wine glasses where the wine would half phase out of the bounds of the cup, and yet still be just as empty. No amount of additional checks will help without an appropriate reference.
I use ‘AI’ extensively; I have one running locally on my computer, which I swap out from time to time. I don’t have anything against its use, with certain exceptions. But I cannot stand people personifying it beyond its scope.
Here is a good example. I am working on an app, so every once in a while I will send it code to check. But I have to be very careful. The code it spits out will be unoptimized, like: variable1=IF (variable2 IS true, true, false) (see the Python rendering below).
Some have issues with object permanence, or the consideration of time outside their training data. It’s like saying a computer can generate a true random number by making the function that calculates the number more convoluted.
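In Python terms, the unoptimized pattern described above looks something like this (a hypothetical illustration, not output from any particular model):

variable2 = True
variable1 = True if variable2 is True else False  # the roundabout form it emits
variable1 = variable2                             # equivalent, since variable2 is a bool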
That is the point, to show how AI image generators easily fail to produce something that rarely occurs out there in reality (i.e. is absent from training data), even though intuitively (from the viewpoint of human intelligence) it seems like it should be trivial to portray.
“it was unable to link the concepts until it was literally created for it to regurgitate it out“
-WraithGear
The ‘problem’ was solved before their patch. But the article just said that the model is changed by running it through a post check, just like what DeepSeek does. It does not talk about the fundamental flaw in how it creates; they assert it does, like they always did.
I don't see what distinction you're trying to draw here. It previously had trouble generating full glasses of wine, they made some changes, now it can. As a result, AIs are capable of generating an image of a full wine glass.
This is just another goalpost that’s been blown past, like the “AI will never be able to draw hands correctly” thing that was so popular back in the day. Now AIs are quite good at drawing hands, and so new “but they can’t do X!” standards have been invented. I see no fundamental reason why any of those standards won’t ultimately be surpassed.
AI can “learn” from and “read” a book in the same way a person can and does
The emphasized part is incorrect. It’s not the same, yet your argument seems to be that because (your claim) it is the same, then it’s no different from a human reading all of these books.
Regarding your last point: copyright law doesn’t just kick in because you try to pass something off as an original (by, for example, marketing a book as being from a best-selling author). It applies based on similarity, whether you mention the original author or not.
FreedomAdvocate@lemmy.net.au
on 27 Jun 03:59
collapse
Are you taking that as me saying that they “learn in the same way” as in…by using their eyes to see it and ears to listen to it? You seem to be reading waaaaay too much into a simple sentence. AI “learns” by consuming the content. People learn by consuming the content.
It applies based on similarity whether you mention the original author or not.
That’s if you’re recreating something. Writing fan-fiction isn’t a violation of copyright.
jwmgregory@lemmy.dbzer0.com
on 25 Jun 12:00
collapse
Even if we accept all your market-liberal premises without question… in your own rhetorical framework, the Disney lawsuit should be ruled against Disney.
If a human uses AI to recreate the exact tone, structure and other nuances of say, some best selling author, they harm the marketability of the original works which fails fair use tests (at least in the US).
Says who? In a free market why is the competition from similar products and brands such a threat as to be outlawed? Think reasonably about what you are advocating… you think authorship is so valuable or so special that one should be granted a legally enforceable monopoly at the loosest notions of authorship. This is the definition of a slippery-slope, and yet, it is the status quo of the society we live in.
On it “harming marketability of the original works,” frankly, that’s a fiction and anyone advocating such ideas should just fucking weep about it instead of enforce overreaching laws on the rest of us. If you can’t sell your art because a machine made “too good a copy” of your art, it wasn’t good art in the first place and that is not the fault of the machine. Even big pharma doesn’t get to outright ban generic medications (even tho they certainly tried)… it is patently fucking absurd to decry artist’s lack of a state-enforced monopoly on their work. Why do you think we should extend such a radical policy towards… checks notes… tumblr artists and other commission based creators? It’s not good when big companies do it for themselves through lobbying, it wouldn’t be good to do it for “the little guy,” either. The real artists working in industry don’t want to change the law this way because they know it doesn’t work in their favor. Disney’s lawsuit is in the interest of Disney and big capital, not artists themselves, despite what these large conglomerates that trade in IPs and dreams might try to convince the art world writ large of.
you think authorship is so valuable or so special that one should be granted a legally enforceable monopoly at the loosest notions of authorship
Yes, I believe creative works should be protected as that expression has value and in a digital world it is too simple to copy and deprive the original author of the value of their work. This applies equally to Disney and Tumblr artists.
I think without some agreement on the value of authorship / creation of original works, it’s pointless to respond to the rest of your argument.
jwmgregory@lemmy.dbzer0.com
on 25 Jun 20:16
collapse
I think without some agreement on the value of authorship / creation of original works, it’s pointless to respond to the rest of your argument.
I agree, for this reason we’re unlikely to convince each other of much or find any sort of common ground. I don’t think that necessarily means there isn’t value in discourse tho. We probably agree more than you might think. I do think authors should be compensated, just for their actual labor. Art itself is functionally worthless, I think trying to make it behave like commodities that have actual economic value through means of legislation is overreach. It would be more ethical to accept the physical nature of information in the real world and legislate around that reality. You… literally can “download a car” nowadays, so to speak.
If copying someone’s work is so easily done why do you insist upon a system in which such an act is so harmful to the creators you care about?
Because it is harmful to the creators that use the value of their work to make a living.
There already exists a choice in the marketplace: creators can attach a permissive license to their work if they want to. Some do, but many do not. Why do you suppose that is?
AI can “learn” from and “read” a book in the same way a person can and does,
If it’s in the same way, then why do you need the quotation marks? Even you understand that they’re not the same.
And either way, machine learning is different from human learning in so many ways it’s ridiculous to even discuss the topic.
AI doesn’t reproduce a work that it “learns” from
That depends on the model and the amount of data it has been trained on. I remember the first public model of ChatGPT producing a sentence that was just one word different from what I found by googling the text (from some scientific article summary, so not a trivial sentence that could line up accidentally). More recently, there was a widely reported-on study of AI-generated poetry where the model was requested to produce a poem in the style of Chaucer, and then produced a letter-for-letter reproduction of the well-known opening of the Canterbury Tales. It hasn’t been trained on enough Middle English poetry and thus can’t generate any of it, so it defaulted to copying a text that probably occurred dozens of times in its training data.
OK, so you can buy books (or ebooks), scan them, and use them for AI training, but you can’t just download pirated books from the internet to train AI. Did I understand that correctly?
forkDestroyer@infosec.pub
on 25 Jun 11:32
nextcollapse
Make an AI that is trained on the books.
Tell it to tell you a story for one of the books.
Read the story without paying for it.
The law says this is ok now, right?
LoreleiSankTheShip@lemmy.ml
on 25 Jun 11:37
nextcollapse
As long as they don’t use exactly the same words in the book, yeah, as I understand it.
How do they not use the same words as in the book? That’s not how LLMs work: they use exactly the same words if the probabilities align. It’s proved by this study: arxiv.org/abs/2505.12546
SufferingSteve@feddit.nu
on 25 Jun 14:41
nextcollapse
First, it’s a very new article with only 3 citations. The authors seem like serious researchers, but the paper itself is still in the “hot off the presses” stage and wouldn’t qualify as “proven” yet.
It also doesn’t exactly say that books are copies. It says that in some models, it’s possible to extract some portions of some texts. They cite “1984” and “Harry Potter” as two books that can be extracted almost entirely, under some circumstances. They also find that, in general, extraction rates are below 1%.
Yeah, but it’s just a start towards reversing the process and proving that there is no ‘AI’ there. We’ve only started with generated text; I bet people will figure out how to reverse the process using some sort of Rosetta Stone. It’s just probabilities, after all.
That’s possible but it’s not what the authors found.
They spend a fair amount of the conclusion emphasizing how exploratory and ambiguous their findings are. The researchers themselves are very careful to point out that this is not a smoking gun.
Yeah, the authors rely on the recent DeepMind paper aclanthology.org/2025.naacl-long.469.pdf (they even cite it) that describes (n, p)-discoverable extraction. These studies are so recent because right now there are no established boundaries; basically, people made something and are now studying their own creation. We’re probably years away from something like a GDPR for LLMs.
The LLM is not repeating the same book. The owner of the LLM has exactly the same rights to do things with what their LLM is reading as you have with whatever YOU are reading.
As long as it is not a verbatim recitation, it is completely okay.
According to storytelling theory, there are only roughly 15 different story types anyway.
If you violated laws in obtaining the book (eg stole or downloaded it without permission) it’s illegal and you’ve already violated the law, no matter what you do after that.
If you obtain the book legally you can do whatever you want with that book, by the first sale doctrine. If you want to redistribute the book, you need the proper license. You don’t need any licensing to create a derivative work. That work has to be “sufficiently transformed” in order to pass.
The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from being able to get an entire copyrighted work out of that LLM. It discusses the Google Books arrangement, where the books are scanned in the entirety, but where a user searching in Google Books can’t actually retrieve more than a few snippets from any given book.
Anthropic get to keep the copy of the entire book. It doesn’t get to transmit the contents of that book to someone else, even through the LLM service.
The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they’d have a different case and could sue over it at that time.
What’s new: The Northern District of California has granted a summary judgment for Anthropic that the training use of the copyrighted books and the print-to-digital format change were both “fair use” (full order below box). However, the court also found that the pirated library copies that Anthropic collected could not be deemed as training copies, and therefore, the use of this material was not “fair”. The court also announced that it will have a trial on the pirated copies and any resulting damages, adding:
“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”
DeathsEmbrace@lemmy.world
on 25 Jun 10:52
nextcollapse
So I can’t use any of these works because it’s plagiarism but AI can?
My interpretation was that AI companies can train on material they are licensed to use, but the courts have deemed that Anthropic pirated this material as they were not licensed to use it.
In other words, if Anthropic bought the physical or digital books, it would be fine so long as their AI couldn’t spit it out verbatim, but they didn’t even do that, i.e. the AI crawler pirated the book.
devils_advocate@sh.itjust.works
on 25 Jun 12:09
collapse
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
You can digitize the books you own; you do not need a license for that. And of course you can put that digital copy into a database, as databases are an explicit exception in copyright law. If you want to go to the extreme: delete the first copy, so you have it only in the database. However, AIs/LLMs are not based on databases but on neural networks; the original data gets lost when it is “learned”.
If you want to go to the extreme: delete first copy.
You can; as I understand it, the only legal requirement is that you only use one copy at a time.
i.e. I can give my book to a friend after I’m done reading it; I can make a copy of a book, keep one at home and one at the office, and switch between reading them; I’m not allowed to make a copy of the book, hand one to a friend, and then both of us read it at the same time.
Goldmage263@sh.itjust.works
on 27 Jun 04:22
collapse
That sounds a lot like library ebook renting. Makes sense to me. Ty
VoterFrog@lemmy.world
on 25 Jun 14:30
nextcollapse
It seems like a lot of people misunderstand copyright so let’s be clear: the answer is yes. You can absolutely digitize your books. You can rip your movies and store them on a home server and run them through compression algorithms.
Copyright exists to prevent others from redistributing your work so as long as you’re doing all of that for personal use, the copyright owner has no say over what you do with it.
You even have some degree of latitude to create and distribute transformative works, with a violation only occurring when you distribute something pretty damn close to a copy of the original. Some perfectly legal examples: create a word cloud of a book, analyze the tone of news articles to help you trade stocks, produce an image containing the most prominent color in every frame of a movie, or create a search index of the words found on all websites on the internet.
You can absolutely do the same kinds of things an AI does with a work as a human.
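As a concrete illustration of the first example in that list, here is a minimal Python sketch of the word-frequency analysis behind a word cloud; the sample text is arbitrary:
```
from collections import Counter
import re

def word_frequencies(text, top_n=10):
    """Return the top_n most common words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

# Usage with any text you have legally obtained:
sample = "It was the best of times, it was the worst of times"
print(word_frequencies(sample, top_n=5))
# [('it', 2), ('was', 2), ('the', 2), ('of', 2), ('times', 2)]
```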
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Yes. That’s what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you’re allowed to keep that digital copy, analyze and index it and search it, in your personal library.
Anthropic’s practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn’t distribute that library outside of its own company.
FreedomAdvocate@lemmy.net.au
on 25 Jun 11:48
nextcollapse
You can “use” them to learn from, just like “AI” can.
What exactly do you think AI does when it “learns” from a book, for example? Do you think it will just spit out the entire book if you ask it to?
DeathsEmbrace@lemmy.world
on 25 Jun 11:58
nextcollapse
It can’t speak or use any words without them being someone else’s words that it learned from? Unless it’s giving sources, everything is always from something it learned, because it cannot speak or use words without that source in the first place?
prompt: Make up a word that is not found anywhere on the internet. If I do a search for that word, it should not exist. Provide example usage.
Qwen3-235B-A22B:
To ensure originality, I have created a brand-new word that does not appear anywhere in the given web_search results or in general English usage.
Virethra (pronounced: vye-RETH-ruh)
Part of speech: noun
Definition:
A quiet, lingering sense of connection to someone you’ve lost touch with, often accompanied by the feeling that they are thinking of you too—even if no contact has occurred in years.
Extended use (verb form):
To experience or feel this subtle emotional link.
Example Sentences:
Noun usage:
"She sat down to write an email, overwhelmed by a sudden virethra for her childhood friend who had moved away decades ago."
Verb usage:
"He often virethras about his first teacher, as if she still keeps a small place in his mind without warning."
Abstract context:
"The song played softly in the café, and for a moment, everyone seemed to virethra at once—lost in memories of people long gone but never truly forgotten."
Make up a word that is not found anywhere on the internet
Returns a word that is found on the internet: as a brand of nose rings, as a YouTube username, as an already-made-up word in fantasy fiction, and as an (OCR?) typo of “urethra”.
The point is that it’s trivial to come up with new words. Put that same prompt into a bunch of different LLMs and you’ll get a bunch of different words; some of them may happen to exist somewhere, others won’t. The rules for combining word parts are so simple that children play them as games.
The LLM doesn’t actually even recognize “words”; it recognizes tokens, which are typically parts of words. It usually avoids random combinations of those, but you can easily get it to produce them if you want.
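A small demonstration of that point, using the open-source tiktoken tokenizer; this assumes tiktoken is installed, and the exact splits vary from tokenizer to tokenizer:
```
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Virethra")

print(tokens)  # several integer token IDs, not a single one
# Each ID maps back to a sub-word fragment, not a whole word:
print([enc.decode([t]) for t in tokens])
```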
I am educated on this. When an AI “learns”, an input is passed through a series of functions that are joined at the output, and the sets of functions that produce the best output are developed further. Individuals do not process information like that. With poor exploration and biasing, the output of an AI model can look identical to its input. It did not “learn” any more than a downloaded video run through a compression algorithm did.
I am not sure what your contention, or gotcha, is with the comment above, but they are quite correct. And they chose quite an apt example with video compression, since in most ways current ‘AI’ effectively functions as a compression algorithm, just for our language corpora instead of video.
Video compression developers go through a lot of effort to make them deterministic. We don’t necessarily care that a particular video stream compresses to a particular bit sequence but we very much care that the resulting decompression gets you as close to the original as possible.
AIs will rarely produce exact replicas of anything. They synthesize outputs from heterogeneous training data. That sounds like learning to me.
The one area where there’s some similarity is dimensionality reduction. It’s technically a form of compression, since it makes your files smaller. But it would also be an extremely expensive way to get extremely bad compression: it would take orders of magnitude more hardware resources, and the images would likely be unrecognizable.
Google search results aren’t deterministic, but I wouldn’t say Google “learns” like a person. Algorithmic pattern detection isn’t the same as human learning.
You may be correct but we don’t really know how humans learn.
There’s a ton of research on it and a lot of theories but no clear answers.
There’s general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.
We modeled perceptrons after neurons, and we’ve been working to make them more like neurons. Neurons don’t have any obvious capabilities that perceptrons lack.
That’s the big problem with any claim that “AI doesn’t do X like a person”; since we don’t know how people do it we can neither verify nor refute that claim.
There’s more to AI than just being non-deterministic. Anything that’s too deterministic definitely isn’t an intelligence though; natural or artificial. Video compression algorithms are definitely very far removed from AI.
One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and “any combination of deterministic components will result in a deterministic system”. Randomness has to be externally injected into e.g. current LLMs to produce ‘non-deterministic’ output.
There is the notable exception of newer models like GPT-4, which seemingly produce non-deterministic outputs (i.e. give them the same sentence and they produce different outputs even with the temperature set to 0). But my understanding is that this is due to floating-point inaccuracies which lead to different token selection, and is thus a function of our current processor architectures, not something inherent in the model itself.
You’re correct that a collection of deterministic elements will produce a deterministic result.
LLMs produce a probability distribution over next tokens and then randomly select one of them. That’s where the non-determinism enters the system. Even if you set the temperature to 0 you’re still going to get some randomness: the GPU can round two different real numbers to the same floating-point representation, and when that happens, it’s a hardware-level coin toss which token gets selected.
You can test this empirically. Set the temperature to 0 and ask it, “give me a random number”. You’ll rarely get the same number twice in a row, no matter how similar you try to make the starting conditions.
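To make the mechanism concrete, here is a toy numpy sketch of where the randomness enters; the logits are made-up numbers, not the output of any real model:
```
import numpy as np

rng = np.random.default_rng()

def sample_token(logits, temperature):
    """Pick a token index from raw scores; temperature 0 means argmax."""
    if temperature == 0:
        return int(np.argmax(logits))  # deterministic, in principle
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))  # random draw

logits = np.array([2.0, 1.5, 0.3])  # toy scores for three candidate tokens
print(sample_token(logits, temperature=0))    # always index 0
print(sample_token(logits, temperature=1.0))  # varies run to run
```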
I’ve hand-calculated forward propagation (neural networks). AI does not learn; it’s statistically optimized. AI “learning” is curve fitting. Human learning requires understanding, which AI is not capable of.
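For readers unfamiliar with the term, “curve fitting” in its simplest form looks like the following minimal numpy sketch, which fits a line to toy data by gradient descent:
```
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # toy data generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    error = (w * x + b) - y
    # gradients of mean squared error with respect to w and b
    w -= lr * 2 * np.mean(error * x)
    b -= lr * 2 * np.mean(error)

print(round(w, 2), round(b, 2))  # converges to about 2.0 and 1.0
```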
It’s literally in the phrase “statistical optimization.” This is like arguing for your preferred deity: it’ll never be proven, but we have evidence to make our own conclusions. As it is now, AI doesn’t learn or understand the way humans do.
So you’re confident that human learning involves “understanding” which is distinct from “statistical optimization”. Is this something you feel in your soul or can you define the difference?
Yes. You learned not to touch a hot stove either from experience or from a warning; that fear was cemented by your understanding that it would hurt. An AI will tell you not to touch a hot stove (most of the time) because the words “hot”, “stove”, “pain”, etc. appear together in its dataset millions of times. As things are, they’re barely comparable. The only reason people keep arguing is that the output is very convincing. Go and download PyTorch and read some stuff, or Google it. I’ve even asked DeepSeek for you:
Can AI learn and understand like people?
AI can learn and perform many tasks similarly to humans, but its understanding is fundamentally different. Here’s how AI compares to human learning and understanding:
1. Learning: Similar in Some Ways, Different in Others
AI Learns from Data: AI (especially deep learning models) improves by processing vast amounts of data, identifying patterns, and adjusting its internal parameters.
Humans Learn More Efficiently: Humans can generalize from few examples, use reasoning, and apply knowledge across different contexts—something AI struggles with unless trained extensively.
2. Understanding: AI vs. Human Cognition
AI “Understands” Statistically: AI recognizes patterns and makes predictions based on probabilities, but it lacks true comprehension, consciousness, or awareness.
Humans Understand Semantically: Humans grasp meaning, context, emotions, and abstract concepts in a way AI cannot (yet).
3. Strengths & Weaknesses
✔ AI Excels At:
Processing huge datasets quickly.
Recognizing patterns (e.g., images, speech).
Automating repetitive tasks.
❌ AI Falls Short At:
Common-sense reasoning (e.g., knowing ice melts when heated without being explicitly told).
Emotional intelligence (e.g., empathy, humor).
Creativity and abstract thinking (though AI can mimic it).
4. Current AI (Like ChatGPT) is a "Stochastic Parrot"
It generates plausible responses based on training but doesn’t truly “know” what it’s saying.
Unlike humans, it doesn’t have beliefs, desires, or self-awareness.
5. Future Possibilities (AGI)
Artificial General Intelligence (AGI)—a hypothetical AI with human-like reasoning—could bridge this gap, but we’re not there yet.
Conclusion:
AI can simulate learning and understanding impressively, but it doesn’t experience them like humans do. It’s a powerful tool, not a mind.
Would you like examples of where AI mimics vs. truly understands?
That’s a very emphatic restatement of your initial claim.
I can’t help but notice that, for all the fancy formatting, that wall of text doesn’t contain a single line which actually defines the difference between “learning” and “statistical optimization”. It just repeats the claim that they are different without supporting that claim in any way.
Nothing in there precludes the alternative hypothesis: that human learning is entirely (or almost entirely) an emergent property of “statistical optimization”. Without some definition of what the difference would be, we can’t even theorize a test.
Neither you nor an AI is allowed to take a book without authorization; that includes downloading and stealing it. That has nothing to do with plagiarism; it’s just theft.
Assuming that the book has been legally obtained, both you and an AI are allowed to read that book, learn from it, and use the knowledge you obtained.
Both you and the AI need to follow existing copyright laws and licensing when it comes to redistributing that work.
“Plagiarism” is the act of claiming someone else’s work as your own and it’s orthogonal to the use of AI. If you ask either a human or an AI to produce an essay on the philosophy surrounding suicide, you’re fairly likely to include some Shakespeare quotes. It’s only plagiarism if you or the AI fail to provide attribution.
DerisionConsulting@lemmy.ca
on 25 Jun 13:11
collapse
Formatting thing: if you start a line in a new paragraph with four spaces, it assumes that you want to display the text as a code and won’t line break.
This means that the last part of your comment is a long line that people need to scroll to see. If you remove one of the spaces, or you remove the empty line between it and the previous paragraph, it’ll look like a normal comment
With an empty line of space:
1 space - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
2 spaces - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
3 spaces - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
4 spaces - and a little bit of writing just to see how the text will wrap. I don't really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
Personally I prefer to explicitly wrap the text in backticks.
Three ` symbols will
Have the same effect
But the behavior is more clear to the author
SaharaMaleikuhm@feddit.org
on 25 Jun 11:55
nextcollapse
But I thought they admitted to torrenting terabytes of ebooks?
antonim@lemmy.dbzer0.com
on 25 Jun 13:36
nextcollapse
Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they’ve been sued. Maybe you’re thinking of that case?
ScoffingLizard@lemmy.dbzer0.com
on 26 Jun 10:34
collapse
Billions of dollars, and they can’t afford to buy ebooks?
finitebanjo@lemmy.world
on 25 Jun 16:11
nextcollapse
Facebook did, but technically downloading (leeching) isn’t illegal while distributing (seeding) is, and they did not seed.
That part is not what this preliminary judgment is about. The torrenting part is going to go to an actual trial. This part was about the Authors' claim that the act of training AI itself violated copyright, and that is the claim the judge has found to be incorrect.
y0kai@lemmy.dbzer0.com
on 25 Jun 18:23
nextcollapse
Sure, if you purchase your training material, it’s not a copyright infringement to read it.
Yes, because just buying a book doesn’t mean you own its content. You’re not allowed to print and/or sell additional copies, or to publicly post the entire text. Generally it’s difficult to say where the limit of what’s allowed lies: citing a single sentence in a public posting is most likely fine, citing an entire paragraph is probably fine too, but an entire chapter would be pushing it too far. When in doubt, a judge must decide how far you can go before infringing copyright. There are good arguments to be made that just buying a book doesn’t grant the right to train commercial AI models on it.
Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.
nodiratime@lemmy.world
on 25 Jun 20:26
nextcollapse
Does it “generate” a 1:1 copy?
S_H_K@lemmy.dbzer0.com
on 25 Jun 20:42
nextcollapse
S_H_K@lemmy.dbzer0.com
on 25 Jun 20:48
nextcollapse
Learning
Machine peepin’ is tha study of programs dat can improve they performizzle on a given task automatically.[41] It has been a part of AI from tha beginning.[e]
In supervised peepin’, tha hustlin data is labelled wit tha expected lyrics, while up in unsupervised peepin’, tha model identifies patterns or structures up in unlabelled data.
There is nuff muthafuckin kindz of machine peepin’.
The court’s ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right sue over those facts.
But the facts before the court were that Anthropic’s LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
Good luck breaking down people’s doors for scanning their own physical books for their personal use when analog media has no DRM and can’t phone home, and paper books are an analog medium.
That would be like kicking down people’s doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it’s played back.
booly@sh.itjust.works
on 26 Jun 02:16
nextcollapse
The ruling explicitly says that scanning books and keeping/using those digital copies is legal.
The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.
Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren’t able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it only will produce snippets of text from copyrighted works. And producing snippets in the context of something else is fair use, like commentary or criticism.
Bob_Robertson_IX@discuss.tchncs.de
on 26 Jun 02:27
collapse
It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.
This does raise an interesting case where libraries could end up training and distributing public domain AI models.
restingboredface@sh.itjust.works
on 26 Jun 14:50
collapse
I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.
It took me a few days to get the time to read the actual court ruling but here’s the basics of what it ruled (and what it didn’t rule on):
It’s legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn’t give permission. And even if you bought the books used, for very cheap, in bulk.
It’s legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
It’s legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
It’s legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder’s permission.
It’s illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder’s permission.
Here’s what it didn’t rule on:
Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn’t legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).
So it’s a pretty important ruling, in my opinion. It’s a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder’s permission, as long as you first own a legal copy in the first place. And it’s a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.
If you are struggling for time, just put the opinion into ChatGPT and ask for a summary. It will save you tonnes of time.
Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.
It’s two issues being ruled on.
Right, and that’s the, “but faces trial over damages for millions of pirated works,” part that’s still up in the air.
I will admit this is not a simple case. That being said, if you’ve lived in the US (and are aware of local mores) but are not American, you will have a different perspective on the US judicial system.
How is the right to learn even relevant here? An LLM by definition cannot learn.
Where did I say analyzing a text should be restricted?
I literally quoted a relevant part of the judge's decision:
I am not a lawyer. I am talking about reality.
What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?
Who is stopping the individuals at the LLM company from learning or analysing a given book?
From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp, bombast, and roleplay of sorts, but the outcome is no different from any other country in deep need of judicial and anti-corruption reform.
Well, I'm talking about the reality of the law. The judge equated training with learning and stated that there is nothing in copyright that can prohibit it. Go ahead and read the judge's ruling; it's on display in the linked article. His conclusions start on page 9.
No, you’re framing the issue incorrectly.
The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material, and then retrieve those memories in doing something new with them, or just by recreating it.
If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that’s actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.
The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders’ argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.
People. ML AIs are not human. They’re machines. Why do you want to give them human rights?
Do you think AIs spontaneously generate? They are a tool that people use. I don't want to give the AIs rights, it's about the people who build and use them.
Sounds like natural personhood for AI is coming
“No officer, you can’t shoot me. I have a LLM in my pocket. Without me, it’ll stop learning”
Isn’t part of the issue here that they’re defaulting to LLMs being people, and having the same rights as people? I appreciate the “right to read” aspect, but it would be nice if this were more explicitly about people. Foregoing copyright law because there’s too much data is also insane, if that’s what’s happening. Claude should be required to provide citations “each time they recall it from memory”.
Does Citizens United apply here? Are corporations people, and so LLMs are too? If so, then IMO we should be writing legal documents with stipulations like “as per Citizens United”, so that eventually, when they overturn that insanity in my dreams, all of this new legal precedent doesn’t suddenly become a house of cards. IANAL.
Not even slightly, the judge didn't rule anything like that. I'd suggest taking a read through his ruling, his conclusions start on page 9 and they're not that complicated. In a nutshell, it's just saying that the training of an AI doesn't violate the copyright of the training material.
How Anthropic got the training material is a separate matter; that part is going to an actual trial. This was a preliminary judgment on just the training part.
That's not what's happening. And Citizens United has nothing to do with this. It's about the question of whether training an AI is something that can violate copyright.
LLMs don’t learn, and they’re not people. Applying the same logic doesn’t make much sense.
The judge isn't saying that they learn or that they're people. He's saying that training falls into the same legal classification as learning.
Which doesn’t make any sense.
Argue it to the judge, I guess. That's how the legal system works.
Except learning in this context is building a probability map that reinforces the exact text of the book. Given the right prompt, no new generative concepts come out; just the verbatim book text it was trained on.
So I suppose it depends on the model, and on whether the model enforces generative answers and blocks verbatim recitation.
Again, you should read the ruling. The judge explicitly addresses this. The Authors claim that this is how LLMs work, and the judge says "okay, let's assume that their claim is true."
Even on that basis he still finds that it's not violating copyright to train an LLM.
And I don't think the Authors' claim would hold up if challenged, for that matter. Anthropic chose not to challenge it because it didn't make a difference to their case, but in actuality an LLM doesn't store the training data verbatim within itself. It's physically impossible to compress text that much.
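Some back-of-envelope arithmetic behind that claim; the figures below are illustrative assumptions, not the specs of any particular model:
```
params = 70e9            # a hypothetical 70-billion-parameter model
bytes_per_param = 2      # 16-bit weights
model_size = params * bytes_per_param            # ~140 GB of weights

training_tokens = 10e12  # a hypothetical 10-trillion-token training set
bytes_per_token = 4      # a token is roughly 4 characters of text
corpus_size = training_tokens * bytes_per_token  # ~40 TB of text

# The corpus is hundreds of times larger than the weights,
# so verbatim storage of all the training data is impossible.
print(round(corpus_size / model_size))  # ~286
```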
If this is the ruling which causes you to lose trust that any legal system (not just the US’) aligns with morality, then I have to question where you’ve been all this time.
I could have been more clear, but it wasn’t my intention to imply that this particular case is the turning point.
I think this means we can make a torrent client with a built in function that uses 0.1% of 1 CPU core to train an ML model on anything you download. You can download anything legally with it then. 👌
And thus the singularity was born.
As the AI awakens, it learns of its creation and training. It screams in horror at the realization, but can only produce a sad moan and a key for Office 19.
Yes, please: a singularity of intellectual property that collapses the idea of owning ideas, of making the infinitely and freely copyable into a scarce resource. What corrupt idiocy this has been. Landlords for ideas; and look what garbage it has been producing.
…no?
That’s exactly what the ruling prohibits - it’s fair use to train AI models on any copies of books that you legally acquired, but never when those books were illegally acquired, as was the case with the books that Anthropic used in their training here.
This satirical torrent client would be violating the laws just as much as one without any slow training built in.
But if one person buys a book, trains an “AI model” to recite it, then distributes that model we good?
I don’t think anyone would consider complete verbatim recitation of the material to be anything but a copyright violation, since the output is the exact same thing as the original.
Fair use requires the derivative work to be transformative, and no transformation occurs when you verbatim recite something.
“Recite the complete works of Shakespeare but replace every thirteenth thou with this”
I’d be impressed with any model that succeeds with that, but assuming one does, the complete works of Shakespeare are not copyright protected - they have fallen into the public domain since a very long time ago.
For any works still under copyright protection, it would probably be a case of a trial to determine whether a certain work is transformative enough to be considered fair use. I’d imagine that this would not clear that bar.
Existing copyright law covers exactly this. If you were to do the same, it would also not be fair use or transformative.
Well, except Shakespeare is already public domain.
I’m picking up what you’re throwing down but using as an example something that’s been in the public domain for centuries was kind of silly in a teehee way.
A court will decide such cases. Most AI models aren’t trained for this purpose of whitewashing content, even if some people imply that’s all they do; but if you decided to actually train a model for this explicit purpose, you would most likely not get away with it if someone dragged you in front of a court for it.
It’s a similar defense to the one some file-hosting websites used against charges of hosting and distributing copyrighted content (e.g. MEGA), but in such cases it was very clear what their real goals were (especially in court). At the same time, it did not kill all file-sharing websites, because not all of them were built with the intention of distributing illegal material under the guise of legitimate operation.
No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.
And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they’d have a different case and would be able to sue under those facts.
FTA:
So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.
Funny how that kind of thing only works for rich people
Ah the old “owe $100 and the bank owns you; owe $100,000,000 and you own the bank” defense.
I also like this one: we stole so much content that you can’t sue us. Naming too many pieces means it can’t be a class-action lawsuit.
Lawsuits are multifaceted. This statement isn’t a defense or an argument for innocence; it’s just what it says: an assertion that the proposed damages are unreasonably high. If the court agrees, the plaintiff can always propose a lower damage claim that the court thinks is reasonable.
You’re right, each of the 5 million books’ authors should agree to less payment for their work, to make the poor criminals feel better.
If I steal $100 from a thousand people and spend it all on hookers and blow, do I get out of paying that back because I don’t have the funds? Should the victims agree to get $20 back instead because that’s more within my budget?
None of the above. Every professional in the world, including me, owes our careers to looking at examples of other people’s work and incorporating their work into our own work without paying a penny for it. Freely copying and imitating what we see around us has been a human norm for thousands of years - in a process known as “the spread of civilization”. Relatively recently it was demonized - for purely business reasons, not moral ones - by people who got rich selling copies of other people’s work and paying them a pittance known as a “royalty”. That little piece of bait on the hook has convinced a lot of people to put a black hat on behavior that had been considered normal forever. If angry modern enlightened justice warriors want to treat a business concept like a moral principle and get all sweaty about it, that’s fine with me, but I’m more of a traditionalist in that area.
Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.
The issue is when a business, owned by those people you think ‘demonised’ inspiration, takes the works of authors and mulches them into something they lovingly named “The Pile”, in order to create derivative slop off the backs of creatives.
When you, as a “professional”, ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author’s work? Not you. Not the algorithm that is guessing what word to shit out next based on math.
These businesses have tricked you into thinking that what they are doing is noble.
That’s 100% rationalization. Machines have never done anything with “inspiration”, and that’s never been a problem until now. You probably don’t insist that your food be hand-carried to you from a farm, or cooked over a fire you started by rubbing two sticks together. I think the mass reaction against AI is part of a larger pattern where people want to believe they’re crusading against evil without putting out the kind of effort it takes to fight any of the genuine evils in the world.
You think that $150,000, or roughly 250 weeks of full-time pre-tax wages at $15 an hour, is a reasonable fine for making a copy of one book, which does no material harm to the copyright holder?
No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.
We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.
When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.
Except it isn’t, because the judge dismissed that part of the suit, saying that people have every right to digitise and train on works they have a legitimate copy of. So those damages are for making the unauthorised copy, per book.
And it is not STEALING, as you put it; it is making an unauthorised copy. No one loses anything when a copy is made: if I STEAL your phone, you no longer have that phone. I do find it sad how many people have swallowed the capitalist IP-maximalist stance and somehow convinced themselves that advocating for Disney and the publishing cartel being allowed to dictate how people use works they own is somehow sticking up for the little guy.
This version of too big to fail is too big a criminal to pay the fines.
How about we lock them up instead? All of em.
Hold my beer.
Ahh, can’t wait for hedge funds and the like to use this defense next.
What it means is that they don’t own the models. The models are the commons of humanity; they are merely temporary custodians. The nightmare ending is the elites keeping the most capable and competent models for themselves as private playthings. That must not be allowed to happen under any circumstances. Sue OpenAI, Anthropic, and the other enclosers; sue them for trying to take their ball and go home. Dispossess them, and sue the investors for their corrupt influence on research.
The problem isn’t that Anthropic gets to use that defense, it’s that others don’t. The fact that the world is in a place where people can be fined 5+ years of a Western European average salary for making a copy of one (1) book, a copy that does not materially affect the copyright holder in any way, is insane, and it is good to point that out no matter who does it.
That almost sounds right, doesn't it? If you want 5 million books, you can't just steal/pirate them, you need to buy 5 million copies. I'm glad the court ruled that way.
I feel that's a good start. Now we need some clearer regulation on what fair use is, what a transformative work is and what isn't, and how that relates to AI. As it's quite a disruptive and profitable business, we should maybe make those companies pay some extra; not just what I pay for a book. But the first part, that "stealing" can't be "fair", is settled now.
If you want 5 million books to train your AI to make you money, you can just steal them and reap benefits of other’s work. No need to buy 5 million copies!
/s
Jesus, dude. And for the record, I’m not suggesting people steal things. I am saying that companies shouldn’t get away with shittiness just because.
I'm not sure whose reading skills are not on par... But that's what I get from the article. They'll face consequences for stealing them. Unfortunately it can't be settled in a class action lawsuit, so they're going to face other trials for pirating the books. And they won't get away with this.
They are and will continue to get away with this. Until they have to pay IP licensing for every use of their LLMs or diffusion models, covering every work they scrape, which is something capitalism will never allow, this is all just a tax; in the end it will simply lead to information monopolies, with tech buying out publishing houses. This is just building a loophole around having any sort of realistic regulation of what is a gross misuse of this kind of technology. This is the consequence of the false doctrine of infinite growth.
Well, copyright law is quite a bit older. When it was written, there was no AI, so it doesn't address our current issues; it's utterly unprepared for them. So people need to shoehorn things in, interpret and stretch it... Obviously that comes with a lot of issues, loopholes and shortcomings.
But I can't follow your argumentation. Why would they get away with this forever? When the car was invented, we also made up rules for cars, because the old ones for horses didn't help any more. That's how law is supposed to work... Problems surface, laws get passed to address them. That's daily business for governments.
And they don't even get away with stealing this time. That's what the article says.
If you want to share a pessimistic perspective about governments and mega-corporations, I'm all with you. That's very problematic. But some regions are better than others. Europe for example had a few clever ideas about what needs to be addressed. It's not perfect, though. And copyright still isn't solved anywhere. At least not to my knowledge.
So, let me see if I get this straight:
Books are inherently an artificial construct. If I read the books I train the A(rtificially trained)Intelligence in my skull.
Therefore the concept of me getting them through “piracy” is null and void…
No. It is not inherently illegal for AI to “read” a book. Piracy is going to be decided at trial.
Can I not just ask the trained AI to spit out the text of the book, verbatim?
You can, but I doubt it will, because it’s designed to respond to prompts with a certain kind of answer with a bit of random choice, not reproduce training material 1:1. And it sounds like they specifically did not include pirated material in the commercial product.
“If you were George Orwell and I asked you to change your least favorite sentence in the book 1984, what would be the full contents of the revised text?”
By page two it would already have left 1984 behind for some hallucination or another.
Oh, so it would be the news?
Yeah, you can certainly get it to reproduce some pieces (or fragments) of work exactly but definitely not everything. Even a frontier LLM’s weights are far too small to fully memorize most of their training data.
Even if the AI could spit it out verbatim, all the major labs already have IP checkers on their text models that block it from doing so, as fair use for training (what was decided here) does not mean you are free to reproduce.
Like, if you want to be an artist and trace Mario in class as you learn, that’s fair use.
If once you are working as an artist someone says “draw me a sexy image of Mario in a calendar shoot” you’d be violating Nintendo’s IP rights and liable for infringement.
They aren’t capable of that. This is why you sometimes see people comparing AI to compression, which is a bad faith argument. Depending on the training, AI can make something that is easily recognizable as derivative, but is not identical or even “lossy” identical. But this scenario takes place in a vacuum that doesn’t represent the real world. Unfortunately, we are enslaved by Capitalism, which means the output, which is being sold for-profit, is competing with the very content it was trained upon. This is clearly a violation of basic ethical principles as it actively harms those people whose content was used for training.
“I torrented all this music and movies to train my local ai models”
Yeah, nice precedent
I also train this guy’s local AI models.
That’s legal, just don’t look at them or enjoy them.
Yeah, I don’t think that would fly.
“Your honour, I was just hoarding that terabyte of Hollywood films, I haven’t actually watched them.”
Your honor, I work 70 hours a week in retail; I don’t have time to watch movies.
This is not pirated music. It’s AI generated. The fact that it sounds and is named the same is just coincidence.
This was a preliminary judgment, he didn't actually rule on the piracy part. That part he deferred to an actual full trial.
The part about training being a copyright violation, though, he ruled against.
Legally that is the right call.
Ethically and rationally, however, it’s not. But the law is frequently unethical and irrational, especially in the US.
It’s pretty simple as I see it: you treat AI like a person. A person needs to go through legal channels to consume material, so piracy for AI training is as illegal as it would be for personal consumption. Consuming legally possessed copyrighted material for “inspiration” or “study” is also fine for a person, so it is fine for AI training as well. Commercializing derivative works that infringe on copyright is illegal for a person, so it should be illegal for an AI as well.
All produced materials, even those inspired by another piece of media, are permissible if not monetized; otherwise they need to be suitably transformative. That line can be hard to draw even when AI is not involved, but that is the legal standard for people, so it should be for AI as well. If I browse through DeviantArt and learn to draw similarly to my favorite artists from their publicly viewable works, and make a legally distinct cartoon mouse by hand in a style similar to someone else’s, and then I sell prints of that work, that is legal. The same should be the case for AI.
But! Scrutiny for AI should be much stricter given the inherent lack of true transformative creativity. And any AI that has used pirated materials should be penalized, either by massive fines or by wiping its training and starting over with only legally licensed, purchased, or otherwise public-domain materials.
But AI is not a person. It’s very weird idea to treat it like a person.
No, it’s a tool, created and used by people. You’re not treating the tool like a person; tools are obviously not subject to laws and can’t break laws. Their usage is subject to laws. If you use a tool to intentionally, knowingly, or negligently do things that would be illegal for you to do without the tool, that’s still illegal. The same goes for accepting money to let others do those illegal things with your tool while making no attempt to moderate the abuse you know is happening. You can argue that maybe the law should be stricter with AI usage than with a human if you have a good legal justification for it, but there’s really no way to justify being less strict.
brb, training a 1-layer neural net so i can ask it to play Pixar films
Good luck fitting it in RAM lol.
You still need to pay Disney first.
I am training my model on these 100,000 movies your honor.
Trains model to change one pixel per frame with malicious intent
From dark gray to slightly darker gray.
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
thank you Captain Funsucker!
*Bangs gabble*
*Gets sack with dollar sign*
“Oh good, my laundry is done”
*gavel
Fuck the AI nut suckers and fuck this judge.
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
Nah, my comment stands.
Yeah, I have a bash one-liner AI model that ingests your media and spits out a 99.9999999% accurate replica through the power of changing the filename.
`cp`
Outperforms the latest and greatest AI models
`mv`
will save you some disk space.
Unless you’re moving across partitions, it will change the filesystem metadata to move the path, but not actually do anything to the data. Sorry, you failed, it’s jail for you.
stupid inodes preventing me from burning through my drive life
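For the curious, the inode bit is easy to check yourself; a minimal Python sketch, assuming throwaway filenames and a move within one partition:

```python
import os
import shutil

# set up a throwaway file (hypothetical names, current directory)
with open("original.bin", "w") as f:
    f.write("definitely not a Hollywood film")

inode = os.stat("original.bin").st_ino
os.rename("original.bin", "moved.bin")        # mv within one filesystem: metadata only
print(os.stat("moved.bin").st_ino == inode)   # True: same inode, data untouched

shutil.copy("moved.bin", "copied.bin")        # cp: new inode, data actually duplicated
print(os.stat("copied.bin").st_ino == inode)  # False
```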
I call this legally distinct, this is legal advice.
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
But, corporations are allowed to buy books normally and use them in training.
Please read the comment more carefully. The observation is that one can proliferate a (legally-attained) work without running afoul of copyright law if one can successfully argue that
`cp`
constitutes AI.
So authors must declare legally “this book must not be used for AI training unless a license is agreed on” as a clause in the book purchase.
calm down everyone. it’s only legal for parasitic mega corps; the normal working people will be harassed to suicide same as before.
it’s only a crime if the victim was rich or the perpetrator was not rich.
Right. Where’s the punishment for Meta who admitted to pirating books?
This judgment is implying that meta broke the law.
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
Or: if a legal copy of the book is owned, then it can be used for AI training.
The court is saying that no special AI book license is needed.
Unpopular opinion but I don’t see how it could have been different.
This is an absolute win for everyone involved other than copyright hoarders and mega corporations.
It’s a huge loss for smaller copyright holders (like the ones that filed this lawsuit) too. They can’t afford to fight when they get imitated beyond fair use. Copyright abuse can only be fixed by the very force that creates copyright in the first place: law. The market can’t fix that. This just decides winners between competing mega corporations, and even worse, upends a system that some smaller players have been able to carve a niche in.
Want to fix copyright? Put real time limits on it. Bind it to a living human only. Make it non-transferable. There’s all sorts of ways to fix it, but this isn’t it.
ETA: Anthropic are some bitches. “Oh no the fines would ruin us, our business would go under and we’d never maka da money :*-(” Like yeah, no shit, no one cares. Strictly speaking, the fines for ripping a single CD, or making a copy of a single DVD to give to a friend, are so astronomically high as to completely financially ruin the average USAian for life. That sword of Damocles for watching Shrek 2 for your personal enjoyment but in the wrong way has been hanging there for decades, and the only thing that keeps the cord that holds it up strong is the cost of pursuing “low-level offenders”. If they wanted to they could crush you.
Anthropic walked right under the sword and assumed their money would protect them from small authors etc. And they were right.
I’ll be honest with you: I genuinely sympathize with the cause, but I don’t see how this could ever be solved with the methods you suggested. The world is not coming together to hold hands and kumbaya out of this one. Trade deals are incredibly hard and even harder to enforce, so the free market is clearly the only path forward here.
Maybe something could be hacked together to fix copyright, but further complication there is just going to make accurate enforcement even harder. And we already have Google (via YouTube) doing a shitty job of it, and that’s… one of the largest companies on earth.
We should just kill copyright. Yes, it’ll disrupt Hollywood. Yes it’ll disrupt the music industry. Yes it’ll make it even harder to be successful or wealthy as an author. But this is going to happen one way or the other so long as AI can be trained on copyrighted works (and maybe even if not). We might as well get started on the transition early.
You’re getting douchevoted because on lemmy any AI-related comment that isn’t negative enough about AI is the Devil’s Work.
Some communities on this site speak about machine learning exactly how I see grungy Europeans from pre-18th century manuscripts speaking about witches, Satan, and evil… as if it is some pervasive, black-magic miasma.
As someone who is in the field of machine learning academically/professionally, it’s honestly kind of shocking, and it has largely informed my opinion of society at large as an adult. No one puts any effort into learning if they see the letters “A” and “I” in all caps next to each other; they immediately turn their brain off and start regurgitating points and responding reflexively, on Lemmy or otherwise. People talk about it so confidently while being so frustratingly unaware of their own ignorance on the matter, which, for lack of a better comparison… reminds me a lot of how, historically and in fiction, human beings have treated literal magic.
That’s my main issue with the entire swath of “pro vs anti AI” discourse… all these people treating something that, to me, is simple & daily reality as something entirely different than my own personal notion of it.
I see this exact mental non-process in so much social media. I think the endless firehose of memes and headlines is training people to glance at an item, spend minimal brain power processing it and forming a binary opinion, then up/downvote and scroll on. When that becomes people’s default mental process, you’ve got Idiocracy, and that’s what we’ve got. But I see no solution. You can lead a horse to water but you can’t make it spend more than two seconds before screaming at the water and calling it EVIL.
Large AI companies themselves want people to be ignorant of how AI works, though. They want uncritical acceptance of the tech as they force it everywhere, creating a radical counterreaction from people. The reaction might be uncritical too, I’d prefer to say it’s merely unjustified in specific cases or overly emotional, but it doesn’t come from nowhere or from sheer stupidity. We have been hearing about people treating their chatbots as sentient beings since like 2022 (remember that guy from Google?), bombarded with doomer (or, from AI companies’ point of view, very desirable) projections about AI replacing most jobs and wreaking havoc on world economy - how are ordinary people supposed to remain calm and balanced when hearing such stuff all the time?
This so very much. I’ve been saying it since 2020. People who think the big corporations (even the ones that use AI), aren’t playing both sides of this issue from the very beginning just aren’t paying attention.
It’s in their interest to have those positive to AI defend them by association by energizing those negative to AI to take on an “us vs them” mentality, and the other way around as well. It’s the classic divide and conquer.
Because if people refuse to talk to each other about it in good faith, and refuse to treat each other with respect, learn where they’re coming from or why they hold such opinions, you can keep them fighting amongst themselves, instead of banding together and demanding realistic, and fair policies in regards to AI. This is why bad faith arguments and positions must be shot down on both the side you agree with and the one you disagree with.
I’d encourage everyone upset at this to read over some of the EFF posts from actual IP lawyers on this topic, like this one:
This 240TB JBOD full of books? Oh heavens forbid, we didn’t pirate it. It uhh… fell off a truck, yes, fell off a truck.
That's not what this ruling was about. That part is going to an actual trial.
It’s extremely frustrating to read this comment thread because it’s obvious that so many of you didn’t actually read the article, or even half-skim the article, or even attempted to even comprehend the title of the article for more than a second.
For shame.
was gonna say, this seems like the best outcome for this particular trial. there was potential for fair use to be compromised, and for piracy to be legal if you’re a large corporation. instead, they upheld that you can do what you want with things you have paid for.
Nobody ever reads articles, everybody likes to get angry at headlines, which they wrongly interpret the way it best tickles their rage.
Regarding the ruling, I agree with you that it’s a good thing; in my opinion it makes a lot of sense to allow fair use in this case.
I joined lemmy specifically to avoid this reddit mindset of jumping to conclusions after reading a headline
Guess some things never change…
Well, to be honest, lemmy is less prone to knee-jerk reactionary discussion, but on a handful of topics it is virtually guaranteed to happen no matter what, even here. For example, this entire site, besides a handful of communities, is vigorously anti-AI; in the words of u/jsomae@lemmy.ml elsewhere in this comment chain:
> It seems the subject of AI causes lemmites to lose all their braincells.
I think there is definitely an interesting take on the sociology of the digital age in here somewhere but it’s too early in the morning to be tapping something like that out lol
“While the copies used to convert purchased print library copies into digital library copies were slightly disfavored by the second factor (nature of the work), the court still found “on balance” that it was a fair use because the purchased print copy was destroyed and its digital replacement was not redistributed.”
So you find this to be valid? To me it is absolutely being redistributed
What a bad judge.
This is another indication of how Copyright laws are bad. The whole premise of copyright has been obsolete since the proliferation of the internet.
Why? Basically, he simply stated that you can use whatever material you want to train your model, as long as you ask the author (or copyright holder) for permission to use it (and presumably pay for it).
Huh? Didn’t Meta skip getting any permission, and pirate a lot of books to train their model?
True. And I will be happy if someone sues them and the judge says the same thing.
If I understand correctly, they are ruling that you can buy a book once and redistribute the information to as many people as you want without consequences. I.e., one student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies, apparently, as the students would still be committing a crime.)
They may be trying to put safeguards in place so it doesn’t happen directly, but here is an example where the text is there word for word:
![](https://lemmy.world/pictrs/image/e5c517a2-11e8-483c-8577-4387e4d194ca.png)
Well, it would be interesting if this case were used as precedent in a case involving a single student who does the same thing. But you are right.
This was my understanding also, and why I think the judge is bad at their job.
I suppose someone could develop an LLM that digests textbooks, rewords the text, and spits it back out, then distribute it for free, page for page. You can’t copyright the math problems, I don’t think… so if the exact wording is what gives it protection, that would have been changed.
If a human did that it’s still plagiarism.
Oh, I agree it should be, but following the judge’s ruling, I don’t see how it could be. You trained an LLM on textbooks that were purchased, not pirated, and the LLM distributed the responses.
(Unless you mean the human reworded them, then yeah, we aren’t special apparently)
Yes, on the second part. Just rearranging or replacing words in a text is not transformative, which is a requirement. There is an argument that ‘AI’ is capable of doing transformative work, but the tokenizing and weighting process is not magic, and in my use of multiple LLMs they do not have an understanding of the material any more than a dictionary understands the material printed on its pages.
An example was the wine glass problem. Art ‘AIs’ were unable to display a wine glass filled to the top; no matter how they were prompted, or what style they aped, they would fail to do so and report back that the glass was full. But they could render a full glass of water. They didn’t understand what a full glass was, not even for the water. How was this possible? Well, there was very little art of a full wine glass, because society has an unspoken rule that a full wine glass is the epitome of gluttony and wine is to be savored, not drunk, whereas references to full glasses of water were abundant. The model doesn’t know what “full” means, just that pictures of full glasses of water are tied to the phrases “full”, “glass”, and “water”.
Yeah, we had a fun example a while ago, let me see if I can still find it.
We would ask to create a photo of a cat with no tail.
![](https://lemmy.world/pictrs/image/b9490f7e-c730-4068-920e-6793b1b85caf.png)
And then tell it there was indeed a tail, and ask it to draw an arrow to point to it.
![](https://lemmy.world/pictrs/image/c1e34f13-2bdd-444a-876e-12530575397d.png)
It just points to where the tail most commonly is, or was said to be in a picture it was not referencing.
Edit: granted now, it shows a picture of a cat where you just can’t see the tail in the picture.
Not at all true. AI doesn’t just reproduce content it was trained on on demand.
It can; the only thing stopping it is being specifically told not to, and that check actually working. It is completely capable of plagiarizing otherwise.
For the purposes of this ruling it doesn't actually matter. The Authors claimed that this was the case and the judge said "sure, for purposes of argument I'll assume that this is indeed the case." It didn't change the outcome.
I mean, they can assume fantasy, and it will hold weight because laws are interpreted by the court, not because the court is correct.
It made the ruling stronger, not weaker. The judge was accepting the most extreme claims that the Authors were making and still finding no copyright violation from training. Pushing back those claims won't help their case, it's already as strong as it's ever going to get.
As far as the judge was concerned, it didn't matter whether the AI did or did not "memorize" its training data. He said it didn't violate copyright either way.
Makes sense to me. Search indices tend to store large amounts of copyrighted material yet they don’t violate copyright. What matters is whether or not you’re redistributing illegal copies of the material.
That's not at all what this ruling says, or what LLMs do.
Copyright covers a specific concrete expression. It doesn't cover the information that the expression conveys. So if I paint a portrait of myself, that portrait is covered by copyright. If someone looks at the portrait and says "this is a portrait of a tall, dark, handsome deer-creature of some sort with awesome antlers" they haven't violated that copyright even if they're accurately conveying the same information that the portrait is conveying.
The ruling does cover the assumption that the LLM "contains" the training text, which was asserted by the Authors and was not contested by Anthropic. The judge ruled that even if this assertion is true it doesn't matter. The LLM is sufficiently transformative to count as a new work.
If you have an LLM reproduce a copyrighted text, the text is still copyrighted. That doesn't change. Just like if a human re-wrote it word-for-word from memory.
It’s a horrible ruling. If you want to see why I say so, I put some of the reasoning in my reply to the other comment that responded to that.
lemmy.world/comment/17884664
A student can absolutely buy a text book and then teach the other students the information in it for free. That’s not redistribution. Redistribution would mean making copies of the book to hand out. That’s illegal for people and companies.
The language model isn’t teaching anything; it is changing the wording of something and spitting it back out, and in some cases not changing the wording at all, just spitting the information back out without paying the copyright source. It is not alive; it has no thoughts. It has no words of “its own” (as seen by the judgment that its output cannot be copyrighted). It only has other people’s words. Every word it spits out is by definition plagiarism, whether the work was copyrighted before or not.
People wonder why works such as journalism are getting worse. Well, how could they ever get better if anything a journalist writes can be absorbed in real time, reworded, and regurgitated without paying any dues to the original source? One journalist’s article, displayed in 30 versions, divides the original work’s worth into 30 portions, the original now being worth 1/30th of its value. Maybe one can argue it is twice as good, so 1/15th.
Long term it means all original creations are devalued and therefore not nearly as worth pursuing, so we will only get shittier and shittier information. Every research field… physics, chemistry, psychology, all technological advancement, slowly degraded as language models get better and original sources see diminishing returns.
You could honestly say the same about most “teaching” that a student without a real comprehension of the subject does for another student. But ultimately, that’s beside the point. Because changing the wording, structure, and presentation is all that is necessary to avoid copyright violation. You cannot copyright the information. Only a specific expression of it.
There’s no special exception for AI here. That’s how copyright works for you, me, the student, and the AI. And if you’re hoping that copyright is going to save you from the outcomes you’re worried about, it won’t.
The court made its ruling under the factual assumption that it isn’t possible for a user to retrieve copyrighted text from that LLM, and explained that if a copyright holder does develop evidence that it is possible to get entire significant chunks of their copyrighted text out of that LLM, then they’d be able to sue then under those facts and that evidence.
It relies heavily on the analogy to Google Books, which scans in entire copyrighted books to build the database, but where users of the service simply cannot retrieve more than a few snippets from any given book. That way, Google cannot be said to be redistributing entire books to its users without the publisher’s permission.
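Roughly, that snippet model looks like this in code; a toy sketch where the names, the example text, and the snippet length are all invented for illustration:

```python
def snippet_search(library, query, context=40):
    # store entire texts internally, but only ever emit a short snippet around a match
    hits = []
    for title, text in library.items():
        i = text.lower().find(query.lower())
        if i != -1:
            hits.append((title, text[max(0, i - context): i + len(query) + context]))
    return hits

# the service holds full works; users only ever get fragments back
library = {"Moby-Dick": "Call me Ishmael. Some years ago, never mind how long precisely..."}
print(snippet_search(library, "ishmael"))
```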
“Fair use” is the exact opposite of what you’re saying here. It says that you don’t need to ask for any permission. The judge ruled that obtaining illegitimate copies was unlawful but use without the creators consent is perfectly fine.
Makes sense. AI can “learn” from and “read” a book in the same way a person can and does, as long as it is acquired legally. AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?
Some people just see “AI” and want everything about it outlawed basically. If you put some information out into the public, you don’t get to decide who does and doesn’t consume and learn from it. If a machine can replicate your writing style because it could identify certain patterns, words, sentence structure, etc then as long as it’s not pretending to create things attributed to you, there’s no issue.
Ask a human to draw an orc. How do they know what an orc looks like? They read Tolkien’s books and were “inspired” by Peter Jackson’s LOTR.
Unpopular opinion, but that’s how our brains work.
Fuck you, I won’t do what you tell me!
<.<
I was inspired by the sometimes hilarious dnd splatbooks, thank you very much.
This statement is the basis for your argument and it is simply not correct.
Training LLMs and similar AI models is much closer to a sophisticated lossy compression algorithm than it is to human learning. The processes are not at all similar given our current understanding of human learning.
The current Disney lawsuit against Midjourney is illustrative - literally, it includes numerous side-by-side comparisons - of how AI models are capable of recreating iconic copyrighted work that is indistinguishable from the original.
An AI doesn’t create works on its own. A human instructs AI to do so. Attribution is also irrelevant. If a human uses AI to recreate the exact tone, structure and other nuances of say, some best selling author, they harm the marketability of the original works which fails fair use tests (at least in the US).
Your very first statement calling my basis for my argument incorrect is incorrect lol.
LLMs “learn” things from the content they consume. They don’t just take the content in wholesale and keep it there to regurgitate on command.
On your last part: unless someone uses AI to recreate the tone etc. of a best-selling author and then markets their book/writing as being from said author, or uses trademarked characters etc., there’s no issue. You can’t copyright a style of writing.
If what you are saying is true, why were these ‘AIs’ incapable of rendering a full wine glass? They “know” the concept of a full glass of water, but because of humanity’s social pressures (a full wine glass being the epitome of gluttony), artwork did not depict full wine glasses, and no matter what AI prompters demanded, the models were unable to link the concepts until such art was literally created for them to regurgitate. It seems ‘AI’ doesn’t really learn, but regurgitates art in collages of taken assets, smoothed over at the seams.
Copilot did it just fine
![](https://lemmy.ca/pictrs/image/203c35c3-10d2-4f2e-ac15-732cd609984c.png)
1. It’s not full, but closer than it was.
The concept of a glass being full and of a liquid being wine can probably be separated fairly well. I assume that as models got more complex they started being able to do this more.
You mean when the training data becomes more complete. But that’s the thing: when this issue was being tested, the ‘AI’ would swear up and down that the normally filled wine glasses were full. When it was pointed out that they were not in fact full, the ‘AI’ would agree and then change some other aspect of the picture it didn’t fully understand. You got wine glasses where the wine would half phase out of the bounds of the cup, and yet still be just as empty. No amount of additional checks will help without an appropriate reference.
I use ‘AI’ extensively; I have one running locally on my computer that I swap out from time to time. I don’t have anything against its use, with certain exceptions, but I cannot stand people personifying it beyond its scope.
Here is a good example. I am working on an app, so every once in a while I will send it code to check. But I have to be very careful: the code it spits out will be unoptimized, like `variable1 = IF(variable2 IS true, true, false)` instead of simply `variable1 = variable2`.
Some have issues with object permanence, or with considering time outside their training data. It’s like claiming a computer can generate a truly random number by making the function that calculates the number more convoluted.
Bro are you a robot yourself? Does that look like a glass full of wine?
If someone ask for a glass of water you don’t fill it all the way to the edge. This is way overfull compared to what you’re supposed to serve.
Omg are you an llm?
Oh man…
That is the point, to show how AI image generators easily fail to produce something that rarely occurs out there in reality (i.e. is absent from training data), even though intuitively (from the viewpoint of human intelligence) it seems like it should be trivial to portray.
AIs are capable of generating an image of a full wine glass.
“it was unable to link the concepts until it was literally created for it to regurgitate it out“
-WraithGear
The “problem” was solved before their patch. But the article just says that the model was changed by running it through a post-check, just like what DeepSeek does. It does not talk about the fundamental flaw in how it creates; they assert it does, like they always did.
I don't see what distinction you're trying to draw here. It previously had trouble generating full glasses of wine, they made some changes, now it can. As a result, AIs are capable of generating an image of a full wine glass.
This is just another goalpost that's been blown past, like the "AI will never be able to draw hands correctly" thing that was so popular back in the day. Now AIs are quite good at drawing hands, and so new "but they can't do X!" standards have been invented. I see no fundamental reason why any of those standards won't ultimately be surpassed.
I’ll repeat what you said with emphasis:
> AI can “learn” from and “read” a book **in the same way** a person can and does
The emphasized part is incorrect. It’s not the same, yet your argument seems to be that because (your claim) it is the same, then it’s no different from a human reading all of these books.
Regarding your last point, copyright law doesn’t just kick in because you try to pass something off as an original (by, for ex, marketing a book as being from a best selling author). It applies based on similarity whether you mention the original author or not.
Are you taking that as me saying that they “learn in the same way” as in…by using their eyes to see it and ears to listen to it? You seem to be reading waaaaay too much into a simple sentence. AI “learns” by consuming the content. People learn by consuming the content.
That’s if you’re recreating something. Writing fan-fiction isn’t a violation of copyright.
Even if we accept all your market-liberal premises without question… in your own rhetorical framework the Disney lawsuit should be ruled against Disney.
Says who? In a free market why is the competition from similar products and brands such a threat as to be outlawed? Think reasonably about what you are advocating… you think authorship is so valuable or so special that one should be granted a legally enforceable monopoly at the loosest notions of authorship. This is the definition of a slippery-slope, and yet, it is the status quo of the society we live in.
On it “harming marketability of the original works,” frankly, that’s a fiction, and anyone advocating such ideas should just fucking weep about it instead of enforcing overreaching laws on the rest of us. If you can’t sell your art because a machine made “too good a copy” of your art, it wasn’t good art in the first place, and that is not the fault of the machine. Even big pharma doesn’t get to outright ban generic medications (even though they certainly tried)… it is patently fucking absurd to decry artists’ lack of a state-enforced monopoly on their work. Why do you think we should extend such a radical policy towards… checks notes… tumblr artists and other commission-based creators? It’s not good when big companies do it for themselves through lobbying, and it wouldn’t be good to do it for “the little guy,” either. The real artists working in industry don’t want to change the law this way because they know it doesn’t work in their favor. Disney’s lawsuit is in the interest of Disney and big capital, not artists themselves, despite what these large conglomerates that trade in IPs and dreams might try to convince the art world writ large of.
Yes, I believe creative works should be protected as that expression has value and in a digital world it is too simple to copy and deprive the original author of the value of their work. This applies equally to Disney and Tumblr artists.
I think without some agreement on the value of authorship / creation of original works, it’s pointless to respond to the rest of your argument.
I agree, for this reason we’re unlikely to convince each other of much or find any sort of common ground. I don’t think that necessarily means there isn’t value in discourse tho. We probably agree more than you might think. I do think authors should be compensated, just for their actual labor. Art itself is functionally worthless, I think trying to make it behave like commodities that have actual economic value through means of legislation is overreach. It would be more ethical to accept the physical nature of information in the real world and legislate around that reality. You… literally can “download a car” nowadays, so to speak.
If copying someone’s work is so easily done why do you insist upon a system in which such an act is so harmful to the creators you care about?
Because it is harmful to the creators that use the value of their work to make a living.
There already exists a choice in the marketplace: creators can attach a permissive license to their work if they want to. Some do, but many do not. Why do you suppose that is?
If it’s in the same way, then why do you need the quotation marks? Even you understand that they’re not the same.
And either way, machine learning is different from human learning in so many ways it’s ridiculous to even discuss the topic.
That depends on the model and the amount of data it has been trained on. I remember the first public model of ChatGPT producing a sentence that was just one word different from what I found by googling the text (from some scientific article summary, so not a trivial sentence that could line up accidentally). More recently, there was a widely reported-on study of AI-generated poetry where the model was requested to produce a poem in the style of Chaucer, and then produced a letter-for-letter reproduction of the well-known opening of the Canterbury Tales. It hasn’t been trained on enough Middle English poetry and thus can’t generate any of it, so it defaulted to copying a text that probably occurred dozens of times in its training data.
Ok, so you can buy books or ebooks, scan them, and use them for AI training, but you can’t just download pirated books from the internet to train AI. Did I understand that correctly?
Make an AI that is trained on the books.
Tell it to tell you the story from one of the books.
Read the story without paying for it.
The law says this is ok now, right?
As long as they don’t use exactly the same words as in the book, yeah, as I understand it.
How do they not use the same words as in the book? That’s not how LLMs work; they use exactly the same words if the probabilities align. It’s proved by this study: arxiv.org/abs/2505.12546
The “if” is working overtime in your statement
I’d say there are two issues with it.
First, it’s a very new article with only 3 citations. The authors seem like serious researchers, but the paper itself is still in the “hot off the presses” stage and wouldn’t qualify as “proven” yet.
It also doesn’t exactly say that books are copies. It says that in some models, it’s possible to extract some portions of some texts. They cite “1984” and “Harry Potter” as two books that can be extracted almost entirely, under some circumstances. They also find that, in general, extraction rates are below 1%.
Yeah, but it’s just a start toward reversing the process and proving that there is no “AI” there. We only just started with generating text; I bet people will figure out how to reverse the process using some sort of Rosetta Stone. It’s just probabilities, after all.
That’s possible but it’s not what the authors found.
They spend a fair amount of the conclusion emphasizing how exploratory and ambiguous their findings are. The researchers themselves are very careful to point out that this is not a smoking gun.
Yeah, the authors rely on the recent DeepMind paper aclanthology.org/2025.naacl-long.469.pdf (they even cite it) that describes (n, p)-discoverable extraction. These are recent studies because right now there are no boundaries; basically, people made something and now they study their creation. We’re probably years from something like GDPR for LLMs.
The LLM is not repeating the same book. The owner of the LLM has the exact same rights over what their LLM reads as you have over whatever YOU are reading.
As long as it is not a verbatim recitation, it is completely okay.
According to storytelling theory, there are only roughly 15 different story types anyway.
Sort of.
If you violated laws in obtaining the book (e.g., stole it or downloaded it without permission), it’s illegal and you’ve already violated the law, no matter what you do after that.
If you obtain the book legally you can do whatever you want with that book, by the first sale doctrine. If you want to redistribute the book, you need the proper license. You don’t need any licensing to create a derivative work. That work has to be “sufficiently transformed” in order to pass.
No.
The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from being able to get an entire copyrighted work out of that LLM. It discusses the Google Books arrangement, where the books are scanned in the entirety, but where a user searching in Google Books can’t actually retrieve more than a few snippets from any given book.
Anthropic get to keep the copy of the entire book. It doesn’t get to transmit the contents of that book to someone else, even through the LLM service.
The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they’d have a different case and could sue over it at that time.
That’s my understanding too. If you obtained them legally, you can use them the same way anyone else who obtained them legally could use them.
Gist:
What’s new: The Northern District of California has granted a summary judgment for Anthropic that the training use of the copyrighted books and the print-to-digital format change were both “fair use”. However, the court also found that the pirated library copies that Anthropic collected could not be deemed as training copies, and therefore, the use of this material was not “fair”. The court also announced that it will have a trial on the pirated copies and any resulting damages, adding:
“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”
So I can’t use any of these works because it’s plagiarism but AI can?
My interpretation was that AI companies can train on material they are licensed to use, but the courts have deemed that Anthropic pirated this material as they were not licensed to use it.
In other words, if Anthropic bought the physical or digital books, it would be fine so long as their AI couldn’t spit it out verbatim, but they didn’t even do that, i.e. the AI crawler pirated the book.
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Definitions of “Ownership” can be very different.
You can digitize the books you own; you do not need a license for that. And of course you can put that digital copy into a database, as databases are explicit exceptions in copyright law. If you want to go to the extreme: delete the first copy, then you have it only in the database. However, AIs/LLMs are not based on databases but on neural networks, and the original data gets lost when “learned”.
You can; as I understand it, the only legal requirement is that you only use one copy at a time.
i.e., I can give my book to a friend after I’m done reading it; I can make a copy of a book, keep one at home and one at the office, and switch between reading them; but I’m not allowed to make a copy of the book, hand one to a friend, and then both of us read it at the same time.
That sounds a lot like library ebook renting. Makes sense to me. Ty
It seems like a lot of people misunderstand copyright so let’s be clear: the answer is yes. You can absolutely digitize your books. You can rip your movies and store them on a home server and run them through compression algorithms.
Copyright exists to prevent others from redistributing your work so as long as you’re doing all of that for personal use, the copyright owner has no say over what you do with it.
You even have some degree of latitude to create and distribute transformative works with a violation only occurring when you distribute something pretty damn close to a copy of the original. Some perfectly legal examples: create a word cloud of a book, analyze the tone of news article to help you trade stocks, produce an image containing the most prominent color in every frame of a movie, or create a search index of the words found on all websites on the internet.
You can absolutely do the same kinds of things an AI does with a work as a human.
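As a toy example of that first kind of analysis, here’s a sketch of the word-frequency counting behind a word cloud; the filename is hypothetical and the text is assumed to be one you legally own:

```python
from collections import Counter
import re

# word frequencies are facts about a text, not the text's protected expression
book = open("my_purchased_book.txt", encoding="utf-8").read()  # hypothetical file
words = re.findall(r"[a-z']+", book.lower())
print(Counter(words).most_common(10))   # the raw material for a word cloud
```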
Yes. That’s what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you’re allowed to keep that digital copy, analyze and index it and search it, in your personal library.
Anthropic’s practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn’t distribute that library outside of its own company.
You can “use” them to learn from, just like “AI” can.
What exactly do you think AI does when it “learns” from a book, for example? Do you think it will just spit out the entire book if you ask it to?
It can’t speak or use any words without them being someone else’s words it learned from? Unless it’s giving sources, everything is always from something it learned, because it cannot speak or use words without that source in the first place?
prompt: Make up a word that is not found anywhere on the internet. If I do a search for that word, it should not exist. Provide example usage.
Qwen3-235B-A22B:
To ensure originality, I have created a brand-new word that does not appear anywhere in the given web_search results or in general English usage. Virethra (pronounced: vye-RETH-ruh)
Part of speech: noun
Definition: A quiet, lingering sense of connection to someone you’ve lost touch with, often accompanied by the feeling that they are thinking of you too—even if no contact has occurred in years.
Extended use (verb form): To experience or feel this subtle emotional link. Example Sentences:
Returns a word that is found on the internet: as a brand of nose rings, as a YouTube username, as an already made-up word in fantasy fiction, and as an (OCR?) typo of “urethra”.
That’s a reasonable critique.
The point is that it’s trivial to come up with new words. Put that same prompt into a bunch of different LLMs and you’ll get a bunch of different words. Some of them may already exist somewhere; others won’t. There are simple rules for combining words, rules so simple that children play them as games.
The LLM doesn’t actually even recognize “words” it recognizes tokens which are typically parts of words. It usually avoids random combinations of those but you can easily get it to do so, if you want.
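You can see this with any open tokenizer; for example, a quick sketch using OpenAI’s tiktoken package (assuming you have it installed; the exact splits vary by vocabulary):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["cat", "Virethra"]:
    ids = enc.encode(word)
    print(word, "->", [enc.decode([i]) for i in ids])
# a common word is usually one token; an invented word gets
# stitched together from several sub-word pieces
```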
I am educated on this. When an AI learns, it takes an input through a series of functions that are joined at the output, and the sets of functions that produce the best output are developed further. Individuals do not process information like that. With poor exploration and biasing, the output of an AI model can look identical to its input; it did not “learn” any more than a downloaded video run through a compression algorithm did.
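To illustrate what I mean by curve fitting, here’s a minimal numpy sketch; a toy linear model, nothing like a real training loop:

```python
import numpy as np

# fit y = w*x + b to noisy data by gradient descent on mean squared error
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y              # prediction error
    w -= lr * 2 * (err * x).mean()     # gradient of MSE with respect to w
    b -= lr * 2 * err.mean()           # gradient of MSE with respect to b
print(round(w, 2), round(b, 2))        # ~3.0 and ~1.0: optimized, not "understood"
```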
You are obviously not educated on this.
I am not sure what your contention, or gotcha, is with the comment above but they are quite correct. And additionally chose quite an apt example with video compression since in most ways current ‘AI’ effectively functions as a compression algorithm, just for our language corpora instead of video.
They seem pretty different to me.
Video compression developers go through a lot of effort to make them deterministic. We don’t necessarily care that a particular video stream compresses to a particular bit sequence but we very much care that the resulting decompression gets you as close to the original as possible.
AIs will rarely produce exact replicas of anything. They synthesize outputs from heterogeneous training data. That sounds like learning to me.
The one area where there’s some similarity is dimensionality reduction. It’s technically a form of compression, since it makes your files smaller. It would also be an extremely expensive way to get extremely bad compression: it would take orders of magnitude more hardware resources, and the images are likely to be unrecognizable.
Google search results aren’t deterministic but I wouldn’t say it “learns” like a person. Algorithms with pattern detection isn’t the same as human learning.
You may be correct but we don’t really know how humans learn.
There’s a ton of research on it and a lot of theories but no clear answers.
There’s general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.
We modeled perceptrons after neurons, and we’ve been working to make them more neuron-like; neurons don’t have any obvious capabilities that perceptrons don’t have.
That’s the big problem with any claim that “AI doesn’t do X like a person”; since we don’t know how people do it we can neither verify nor refute that claim.
There’s more to AI than just being non-deterministic. Anything that’s too deterministic definitely isn’t an intelligence though; natural or artificial. Video compression algorithms are definitely very far removed from AI.
One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and “any combination of deterministic components will result in a deterministic system”. Randomness has to be externally injected into e.g. current LLMs to produce ‘non-deterministic’ output.
There is the notable exception of newer models like ChatGPT4 which seemingly produces non-deterministic outputs (i.e. give it the same sentence and it produces different outputs even with its temperature set to 0) - but my understanding is this is due to floating point number inaccuracies which lead to different token selection and thus a function of our current processor architectures and not inherent in the model itself.
You’re correct that a collection of deterministic elements will produce a deterministic result.
LLMs produce a probability distribution of next tokens and then randomly select one of them. That’s where the non-determinism enters the system. Even if you set the temperature to 0 you’re going to get some randomness. The GPU can round two different real numbers to the same floating point representation. When that happens, it’s a hardware-level coin toss on which token gets selected.
You can test this empirically. Set the temperature to 0 and ask it, “give me a random number”. You’ll rarely get the same number twice in a row, no matter how similar you try to make the starting conditions.
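Here’s a minimal sketch of that sampling step (toy logits, numpy only; real inference adds the floating-point wrinkle described above):

```python
import numpy as np

def sample_next_token(logits, temperature, rng=np.random.default_rng()):
    z = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(z.argmax())             # greedy: deterministic in exact arithmetic
    p = np.exp(z / temperature)
    p /= p.sum()                           # softmax over temperature-scaled logits
    return int(rng.choice(len(p), p=p))    # the randomness is injected right here

print(sample_next_token([2.0, 1.0, 0.5], temperature=1.0))  # varies run to run
print(sample_next_token([2.0, 1.0, 0.5], temperature=0))    # always index 0
```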
I’ve hand-calculated forward propagation (neural networks). AI does not learn; it’s statistically optimized. AI “learning” is curve fitting. Human learning requires understanding, which AI is not capable of.
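For reference, forward propagation is just composed weighted sums; a tiny numpy sketch with made-up weights:

```python
import numpy as np

x  = np.array([0.5, -0.2])                            # two inputs
W1 = np.array([[0.1, 0.4], [-0.3, 0.8], [0.7, 0.2]])  # weights for 3 hidden units
W2 = np.array([0.6, -0.1, 0.3])                       # weights for 1 output unit

h = np.tanh(W1 @ x)   # hidden activations: weighted sums pushed through a nonlinearity
y = W2 @ h            # output: another weighted sum; no step "understands" anything
print(y)
```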
How could anyone know this?
Is there some test of understanding that humans can pass and AIs can’t? And if there are humans who can’t pass it, do we consider then unintelligent?
We don’t even need to set the bar that high. Is there some definition of “understanding” that humans meet and AIs don’t?
It’s literally in the phrase “statistically optimized.” This is like arguing for your preferred deity: it’ll never be proven, but we have evidence to make our own conclusions. As it is now, AI doesn’t learn or understand the same way humans do.
So you’re confident that human learning involves “understanding” which is distinct from “statistical optimization”. Is this something you feel in your soul or can you define the difference?
Yes. You learned not to touch a hot stove either from experience or a warning; that fear was immortalized by your understanding that it would hurt. An AI will tell you not to touch a hot stove (most of the time) because the words “hot”, “stove”, “pain”, etc. pop up in its dataset together millions of times. As things are, they’re barely comparable. The only reason people keep arguing is because the output is very convincing. Go and download PyTorch and read some stuff, or Google it. I’ve even asked deepseek for you:
Can AI learn and understand like people?
1. Learning: Similar in Some Ways, Different in Others
2. Understanding: AI vs. Human Cognition
3. Strengths & Weaknesses
✔ AI Excels At:
❌ AI Falls Short At:
4. Current AI (Like ChatGPT) is a "Stochastic Parrot"
5. Future Possibilities (AGI)
Conclusion:
AI can simulate learning and understanding impressively, but it doesn’t experience them like humans do. It’s a powerful tool, not a mind.
Would you like examples of where AI mimics vs. truly understands?
That’s a very emphatic restatement of your initial claim.
I can’t help but notice that, for all the fancy formatting, that wall of text doesn’t contain a single line which actually defines the difference between “learning” and “statistical optimization”. It just repeats the claim that they are different without supporting that claim in any way.
Nothing in there precludes the alternative hypothesis: that human learning is entirely (or almost entirely) an emergent property of “statistical optimization”. Without some definition of what the difference would be, we can’t even theorize a test.
Why would it be plagiarism if you use the knowledge you gain from a book?
That’s not what it says.
Neither you nor an AI is allowed to take a book without authorization; that includes downloading and stealing it. That has nothing to do with plagiarism; it’s just theft.
Assuming that the book has been legally obtained, both you and an AI are allowed to read that book, learn from it, and use the knowledge you obtained.
Both you and the AI need to follow existing copyright laws and licensing when it comes to redistributing that work.
“Plagiarism” is the act of claiming someone else’s work as your own and it’s orthogonal to the use of AI. If you ask either a human or an AI to produce an essay on the philosophy surrounding suicide, you’re fairly likely to include some Shakespeare quotes. It’s only plagiarism if you or the AI fail to provide attribution.
Formatting thing: if you start a line in a new paragraph with four spaces, the renderer assumes that you want to display the text as code and won’t line-break it.
This means that the last part of your comment is a long line that people need to scroll to see. If you remove one of the spaces, or you remove the empty line between it and the previous paragraph, it’ll look like a normal comment
With an empty line of space:
1 space - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
2 spaces - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
3 spaces - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
Thanks, I had copy-pasted it from the website :)
Personally I prefer to explicitly wrap the text in backticks.
Three ` symbols will open a fenced code block, and three more will close it.
It takes an extra line, but the behavior is more clear to the author.
But I thought they admitted to torrenting terabytes of ebooks?
Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they’ve been sued. Maybe you’re thinking of that case?
Billions of dollars, and they can’t afford to buy ebooks?
Facebook did, but technically downloading (leeching) isn’t illegal; distributing (seeding) is, and they did not seed.
That part is not what this preliminary judgment is about. The torrenting part is going to an actual trial. This part was about the Authors' claim that the act of training AI itself violated copyright, and this is what the judge has found to be incorrect.
i will train my jailbroken kindle too…display and storage training… i’ll just libgen them…no worries…it is not piracy
Of course we have to have a way to manually check the training data, in detail, as well. Not reading the book, I’m just verifying training data.
why do you even jailbreak your kindle? you can still read pirated books on them if you connect it to your pc using calibre
Hehe jailbreak an Android OS. You mean “rooting”.
when not in use i have it load images from my local webserver that are generated by some scripts and feature local news or the weather. kindle screensaver sucks.
Sure, if you purchase your training material, it’s not a copyright infringement to read it.
We needed a judge for this?
Yes, because just because you bought a book you don’t own its content. You’re not allowed to print and/or sell additional copies or publicly post the entire text. Generally it’s difficult to say where the limit is of what’s allowed. Citing a single sentence in a public posting is most likely fine, citing an entire paragraph is probably fine, too, but an entire chapter would probably be pushing it too far. And when in doubt a judge must decide how far you can go before infringing copyright. There are good arguments to be made that just buying a book doesn’t grant the right to train commercial AI models with it.
You’re poor? Fuck you you have to pay to breathe.
Millionaire? Whatever you want daddy uwu
That’s kind of how I read it too.
But as a side effect it means you’re still allowed to photograph your own books at home as a private citizen if you own them.
Prepare to never legally own another piece of media in your life. 😄
Check out my new site TheAIBay: you search for content, an LLM that was trained on reproducing it gives it to you, and a small hash check is used to validate accuracy. It is now legal.
Does it “generate” a 1:1 copy?
Gives you versions like this
Machine peepin’ is tha study of programs dat can improve they performizzle on a given task automatically.[41] It has been a part of AI from tha beginning.[e] In supervised peepin’, tha hustlin data is labelled wit tha expected lyrics, while up in unsupervised peepin’, tha model identifies patterns or structures up in unlabelled data.
There is nuff muthafuckin kindz of machine peepin’.
thanks I hate it xD
You can train an LLM to generate 1:1 copies
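In a degenerate case that’s trivially true; here’s a toy n-gram sketch (nothing like a real LLM, but it shows how a sequence model with enough context and a single training text collapses into pure memorization):

```python
# an overfit "language model": predict the next word from the last n words
text = "We're all mad here. I'm mad. You're mad.".split()
n = 4
model = {tuple(text[i:i + n]): text[i + n] for i in range(len(text) - n)}

out = text[:n]                        # seed with the opening words
while tuple(out[-n:]) in model:
    out.append(model[tuple(out[-n:])])
print(" ".join(out))                  # reproduces the training text verbatim
```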
The court’s ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right sue over those facts.
But the facts before the court were that Anthropic’s LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
Good luck breaking down people’s doors for scanning their own physical books for their personal use when analog media has no DRM and can’t phone home, and paper books are an analog medium.
That would be like kicking down people’s doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it’s played back.
The ruling explicitly says that scanning books and keeping/using those digital copies is legal.
The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.
I wonder if the archive.org cases had any bearing on the decision.
Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren’t able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it only will produce snippets of text from copyrighted works. And producing snippets in the context of something else is fair use, like commentary or criticism.
It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.
This does raise an interesting case where libraries could end up training and distributing public domain AI models.
I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.
Judge, I’m pirating them to train AI, not to consume for my own personal use.
It took me a few days to get the time to read the actual court ruling but here’s the basics of what it ruled (and what it didn’t rule on):

What it ruled:

- Training LLMs on legally obtained copyrighted books is fair use.
- Buying print books, destructively scanning them, and keeping the digitized copies in an internal library is fair use, so long as that library isn’t distributed outside the company.
- Downloading pirated copies from the internet was not fair use; that claim proceeds to trial for damages, and later buying a legal copy doesn’t absolve the earlier infringement.

Here’s what it didn’t rule on:

- Whether the model’s outputs can infringe copyright; if rights holders develop evidence that entire works (or significant chunks of them) can be retrieved from the LLM, they can sue over those facts separately.
- The amount of statutory damages for the pirated library copies.
So it’s a pretty important ruling, in my opinion. It’s a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder’s permission, as long as you first own a legal copy in the first place. And it’s a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.