As you've probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's
from thenexusofprivacy@infosec.exchange to fediverse@lemmy.world on 11 Aug 21:52
https://infosec.exchange/users/thenexusofprivacy/statuses/115012347040350824

As you’ve probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of “the roughly 100,000 top websites and content delivery network addresses scraped to train Meta’s proprietary AI models” – including quite a few fedi sites. Meta denies everything of course, but they routinely lie through their teeth so who knows. In any case, whether or not the specific details in the report are accurate, it’s certainly a threat worth thinking about.

So I’m wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don’t want to disclose … @deadsuperhero has some good discussion on We Distribute, and it would be very interesting to hear what various instances are doing.
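
For anybody who hasn't set up the first two of those yet, here's a rough sketch of what robots.txt rules and user-agent blocking boil down to. It's only an illustration, not what any particular instance runs: the token list is my own assumption (examples of crawler names that operators have publicly documented, and certainly incomplete), and `is_ai_crawler` and `robots_txt` are hypothetical helper names. Keep in mind that robots.txt is purely advisory and that user-agent strings are trivial to spoof, so neither stops a scraper that doesn't want to be identified.

```python
# Assumption: an illustrative, incomplete denylist of crawler user-agent tokens.
# Check each operator's own documentation before relying on any of these.
AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "Bytespider",
    "meta-externalagent",
)


def is_ai_crawler(user_agent: str, tokens=AI_CRAWLER_TOKENS) -> bool:
    """Return True if the User-Agent header contains any denylisted token.

    Matching is a case-insensitive substring check, which is roughly what
    most web server or reverse proxy user-agent rules do.
    """
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


def robots_txt(tokens=AI_CRAWLER_TOKENS) -> str:
    """Build robots.txt rules asking each listed crawler to stay away entirely."""
    blocks = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Print the generated robots.txt so it can be inspected or saved.
    print(robots_txt(), end="")
```

A server or proxy would return a 403 whenever `is_ai_crawler` matches the request's `User-Agent` header. Firewall-level IP blocking and proof-of-work challenges like Anubis are different layers that this doesn't cover.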

And a couple more open-ended questions:

Here’s @FediPact’s post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.

https://cyberpunk.lol/@FediPact/114999480874284493

@fediverse @fediversenews

#MastoAdmin #Meta #FediPact

[Image: two very large question marks]

