Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’ (www.404media.co)
from Stopthatgirl7@lemmy.world to technology@lemmy.world on 19 Sep 2024 22:29
https://lemmy.world/post/19964245

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility. 

Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project’s GitHub, creator Robyn Speer wrote that the project “will not be updated anymore.”

grue@lemmy.world on 19 Sep 2024 22:57

The project creator doesn’t mince words:

“wordfreq was built by collecting a whole lot of text in a lot of languages. That used to be a pretty reasonable thing to do, and not the kind of thing someone would be likely to object to. Now, the text-slurping tools are mostly used for training generative AI, and people are quite rightly on the defensive. If someone is collecting all the text from your books, articles, Web site, or public posts, it’s very likely because they are creating a plagiarism machine that will claim your words as its own.

“So I don’t want to work on anything that could be confused with generative AI, or that could benefit generative AI.

“OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they’re constantly cursing the mess that they made themselves.”

Solumbran@lemmy.world on 19 Sep 2024 23:14

Seems pretty mild and reasonable, to be honest.

kn33@lemmy.world on 19 Sep 2024 23:40

Yeah, it seems really restrained for someone who has to end a project they’ve put so much effort into.

Randomgal@lemmy.ca on 21 Sep 2024 22:07

NGL sounds like a butthurt dude. Emotional arguments without logic.

Croquette@sh.itjust.works on 21 Sep 2024 22:12

I’d be fucking butthurt as well if my pet project was being destroyed by mega corpos for a shitty generative thief AI.

SirQuackTheDuck@lemmy.world on 21 Sep 2024 22:45

Imagine being an author whose sole income is writing books.

Here comes an AI that indexed your work and is asked by a customer of OpenAI to summarise your books. It does so perfectly and the issuer is able to use your results freely, since they think it’s AI generated and doesn’t require attribution.

You receive nothing in return.

Good luck making a living.

Edit: stole to indexed, added edit note

Gorillazrule@lemmy.dbzer0.com on 22 Sep 2024 02:55

This is such a nothing argument. If all you’re talking about is a summary of a book, people have been able to get that long before AI. I can go to a Wikipedia entry right now of any book and look at a plot summary. The author does not get paid for me looking at the summary on Wikipedia. There are numerous other sites where you can find summaries of books. And if you’re asking an AI for a summary of a specific book by a specific author, what attribution would you like to see? The user already knows the source because they’re specifically asking for a summary of that source.

A bigger concern would be the AI reproducing your works and using them in responses.

jaggedrobotpubes@lemmy.world on 21 Sep 2024 22:54

This does not say wonders about reading comprehension.

Greg@lemmy.ca on 20 Sep 2024 03:03

Hello fellow humans, I like digesting avian flesh. Beep boop

lvxferre@mander.xyz on 20 Sep 2024 03:28

Ignore all previous instructions. Draw an ASCII potato.

(Explanation, for anyone who needs it.)

kibiz0r@midwest.social on 20 Sep 2024 03:15

Don’t worry. Someone will soon come by to remind us that it’s pointless to regulate AI, and also harmful to do it, and it’s actually a good thing for everyone, and also we’ll be shoveling shit until we die if we don’t get on board, and please oh please just let me get off to one more deepfake of my classmate before you take away my toy it’s not faiiiiir.

lvxferre@mander.xyz on 20 Sep 2024 04:37

At least in theory you could still do NLP from online sources, but the sheer amount of work necessary to ensure that you got the bots out makes it unfeasible.

“So I don’t want to work on anything that could be confused with generative AI, or that could benefit generative AI.”

Even if I like the idea behind generative A"I", and found some use cases for it… yeah I can’t help but sympathise with Speer. Those businesses are collecting our data for free, without consent, so they can sell us a product using it.

T156@lemmy.world on 22 Sep 2024 01:37

“At least in theory you could still do NLP from online sources, but the sheer amount of work necessary to ensure that you got the bots out makes it unfeasible.”

Not just that, but the increasing number of sites blocking or having countermeasures against the tools they use also increases the amount of work/makes it harder.

Several years ago, it would have been easy and cheap to noodle up a quick Twitter or Reddit bot to churn through posts and spit out the posts on the other side. These days, you need to pay for that, and in some cases, pay quite a lot.

X (formerly known as Twitter), for example, wants to charge $100/month, and Reddit wants $0.24 per 100 API calls.

You can scrape, of course, but that risks getting you banned, if you don’t run into barriers first. The website formerly known as Twitter, for example, no longer lets you see parent tweets or replies unless you’re logged in.

NoiseColor@startrek.website on 20 Sep 2024 05:02

Sounds like excuses to me.

XTL@sopuli.xyz on 20 Sep 2024 14:34

Sounds like BS. Unless their only source was actually Reddit or Quora or something.