Wikipedia is giving AI developers its data to fend off bot scrapers (enterprise.wikimedia.com)
from Tea@programming.dev to technology@lemmy.world on 17 Apr 10:42
https://programming.dev/post/28761090

#technology

ShellMonkey@lemmy.socdojo.com on 17 Apr 11:26

You can download a torrent of the whole thing; they don’t need to give it to anyone.

en.m.wikipedia.org/…/Wikipedia:Database_download
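
If you’d rather script it than torrent it, here’s a rough sketch. The enwiki-latest filename is the usual pattern on dumps.wikimedia.org, but check the dump index for the file you actually want:

```python
import requests

# Wikimedia publishes full database dumps as plain files. This URL follows
# the usual enwiki-latest naming pattern on dumps.wikimedia.org -- check the
# dump index for the current file (this one is tens of GB).
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

with requests.get(DUMP_URL, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
        # Stream in 1 MiB chunks so the whole dump never sits in memory.
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```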

eager_eagle@lemmy.world on 17 Apr 11:55

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.
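
For a rough idea of what that means in practice, here’s a toy sketch. The field names (name, abstract, sections) are illustrative guesses for the example, not the beta’s documented schema:

```python
import json

# Illustrative record only: the field names are assumptions made for this
# example, not the documented Structured Contents schema.
record = json.loads("""
{
  "name": "Alan Turing",
  "abstract": "Alan Mathison Turing was an English mathematician...",
  "sections": [
    {"title": "Early life", "text": "Turing was born in Maida Vale, London..."}
  ]
}
""")

# No wikitext parsing or HTML scraping: structured fields come out directly.
print(record["name"])
print(record["abstract"])
for section in record["sections"]:
    print(f"{section['title']}: {len(section['text'])} chars")
```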

r00ty@kbin.life on 17 Apr 12:29

The problem is that this assumes the kind of AI creators who scrape relentlessly (and there are a fair few that do) would, after taking this data source directly, also add an exception to their scrapers so they avoid Wikipedia’s site. I doubt they would bother.

Geodad@lemm.ee on 17 Apr 14:14

Is there not some way to just blacklist the AI domain or IP range?

Monument@lemmy.sdf.org on 17 Apr 14:23

No, because there isn’t a single IP range or user agent, and many developers go to great lengths to defeat anti-scraping measures, including user agent spoofing as well as VPNs and the like to mask the source of the traffic.
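
As a toy illustration of how cheap the spoofing is (the target URL and UA string here are made up for the example):

```python
import requests

# A scraper dodging user-agent filters simply lies in the header. The UA
# string mimics a stock desktop browser; values here are illustrative only.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    )
}
resp = requests.get("https://example.com/some-article", headers=headers, timeout=10)
print(resp.status_code)  # Indistinguishable from a browser at the UA level.
```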

devfuuu@lemmy.world on 17 Apr 16:12

If you read the articles from recent months about sites being hammered by AI scrapers, they all tell the same story: it’s not possible. The AI companies deliberately target these sites and work nonstop to defeat whatever blocking is in place. They rotate IPs regularly, change user agents, ignore robots.txt, spread requests across a bunch of IPs, drop to a single request per IP once they detect they’re being blocked, swap user agents the moment one gets blocked, and so on.
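
A toy sketch of why the per-IP approach falls over, assuming a naive fixed threshold (all numbers invented for illustration):

```python
from collections import Counter

LIMIT_PER_IP = 60  # max requests per IP per window (made-up threshold)

def blocked_ips(requests_per_ip: Counter) -> int:
    """Count how many IPs a naive per-IP rate limiter would block."""
    return sum(1 for n in requests_per_ip.values() if n > LIMIT_PER_IP)

# One aggressive client hammering from a single IP: trivially caught.
single = Counter({"203.0.113.7": 10_000})

# The same 10,000 requests rotated across 250 IPs, 40 requests each:
# every address stays comfortably under the threshold.
rotated = Counter({f"198.51.100.{i}": 40 for i in range(250)})

print(blocked_ips(single))   # 1 -> blocked
print(blocked_ips(rotated))  # 0 -> sails straight through
```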

baines@lemmy.cafe on 17 Apr 17:14

whitelists and the end of anonymity

HK65@sopuli.xyz on 18 Apr 07:00

Or just decent regulation. You’re offering an AI product? You can’t attest that it’s been trained in a legitimate way?

Into the shadow realm with you.

MangoPenguin@lemmy.blahaj.zone on 17 Apr 17:46

Nope, there’s no specific range of IPs that AI scrapers use.

MCasq_qsaCJ_234@lemmy.zip on 17 Apr 16:20

Part of me feels like OpenAI might accept this and leave the website alone, although it’s very unlikely they’ll actually do that.