LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI (www.dropsitenews.com)
from geneva_convenience@lemmy.ml to technology@lemmy.ml on 08 Aug 20:06
https://lemmy.ml/post/34368861

Meta has scraped data from the most-trafficked domains on the internet, including news organizations, education platforms, niche forums, personal blogs, and even revenge porn sites, to train its artificial intelligence models, according to a leaked list obtained by Drop Site News.

By scraping data from roughly 6 million unique websites, including 100,000 of the top-ranked domains, Meta has generated millions of pages of content for its AI-training pipeline.

The sites that Meta scrapes consist of copyrighted content, pirated content, and adult videos, some of whose content is potentially illegally obtained or recorded, as well as news and original content from prominent outlets and content publishers.

They include mainstream businesses like Getty Images, Shopify, and Shutterstock, but also extreme pornographic content, including websites advertising explicit sexual content and humiliation porn that exploits teenagers.

#technology

Korkki@lemmy.ml on 08 Aug 20:41

We ban the petty scrapers and celebrate the great ones as innovators and promote them as Fortune 500 companies.

Pavidus@lemmy.world on 08 Aug 20:43

Stole.

BlueEther@no.lastname.nz on 08 Aug 21:31

% grep lemmy Meta.txt
lemmy.ca
lemmynsfw.com
lemmy.sdf.org
lemmy.ml
lemmy.world
lemmygrad.ml
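
For anyone without grep handy, here is a rough Python equivalent of the search above. It is only a sketch: it assumes the extracted list has one domain per line, which is a guess about the file's layout, and the sample data is made up for illustration. Since "lemmy" contains no regex metacharacters, a plain substring test behaves the same as the grep pattern.

```python
def matching_domains(lines, needle):
    """Return the stripped lines that contain `needle`, like `grep needle file`."""
    return [line.strip() for line in lines if needle in line]


# Hypothetical sample standing in for the extracted Meta.txt (one domain per line).
sample = ["example.com\n", "lemmy.ml\n", "lemmy.world\n", "news.example.org\n"]

for domain in matching_domains(sample, "lemmy"):
    print(domain)
```

To run it against the real file, pass an open file handle instead of the sample list, e.g. `matching_domains(open("Meta.txt", encoding="utf-8"), "lemmy")`.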

geneva_convenience@lemmy.ml on 08 Aug 22:42

Good catch. That’s worth a separate post.

Hexbear is on the list too.

marcie@lemmy.ml on 09 Aug 14:23

LLMs will start randomly shitting out hexbear emotes lol

irelephant@lemmy.dbzer0.com on 11 Aug 18:22

Where did you get the .txt file?

BlueEther@no.lastname.nz on 12 Aug 02:49

Just extracted it from the PDF

r00ty@kbin.life on 08 Aug 23:27

I blocked the entire ASN for Meta, because they were downright dirty with their scraping. No gradual crawling, fake UAs, random addresses across a large number of subnets.

They weren't the only ones either. The AI scraping heist is the new gold rush.
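
Blocking by ASN in practice means matching each client address against the prefixes that network announces, which is why spoofed UAs don't help the scraper. Below is a minimal sketch of that check using Python's standard `ipaddress` module. The prefixes are reserved documentation ranges used purely as placeholders, not Meta's actual announced prefixes, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder prefixes (reserved documentation ranges), NOT a real ASN's prefixes.
# In a real setup these would come from a lookup of the ASN's announced routes.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip, prefixes=BLOCKED_PREFIXES):
    """Return True if client_ip falls inside any blocked prefix."""
    addr = ipaddress.ip_address(client_ip)
    # Compare only against networks of the same IP version (v4 vs v6).
    return any(addr.version == net.version and addr in net for net in prefixes)


for ip in ("192.0.2.55", "203.0.113.9", "2001:db8::1"):
    print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A linear scan like this is fine for a handful of prefixes; for a large prefix list, doing the match in the firewall or reverse proxy is usually the better fit.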

mindbleach@sh.itjust.works on 09 Aug 00:29

Always amused when leftist instances treat intellectual property like it’s real.

Vendetta9076@sh.itjust.works on 10 Aug 17:29

IP debate aside, LLM scrapers absolutely annihilate system resources. I host a WordPress site, and before I set up Cloudflare labyrinth my whole server would get DDoS’d at least twice a day.

irelephant@lemmy.dbzer0.com on 11 Aug 18:24

It's not, but scraping is annoyingly resource intensive.

Jerry@feddit.online on 09 Aug 00:32

My Mastodon instance is on the list. I try hard to block them.

The problem with the list is that it's a target list, but not a list showing how much content, if any, they manage to process from any of the sites.

NutWrench@lemmy.ml on 09 Aug 12:22

One person’s “scraping” is another person’s plagiarism.