Cloudflare plans marketplace to sell permission to scrape websites (techcrunch.com)
from Appoxo@lemmy.dbzer0.com to technology@lemmy.world on 23 Sep 2024 17:07
https://lemmy.dbzer0.com/post/28340952

#technology


umami_wasbi@lemmy.ml on 23 Sep 2024 17:11

How can I do this without Cloudflare?

rikudou@lemmings.world on 23 Sep 2024 17:18

Put a page on your website saying that scraping your website costs [insert amount] and block the bots otherwise.

gravitas_deficiency@sh.itjust.works on 23 Sep 2024 20:45

The hard part is reliably detecting the bots.

melroy@kbin.melroy.org on 23 Sep 2024 22:39

Also you don't want to block legit search engines that are not scraping your data for AI.

gravitas_deficiency@sh.itjust.works on 23 Sep 2024 22:45

Again: it's hard to tell all those different bots apart, because you have to trust that they are what they say they are, and they often are not.
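One common way to check a crawler's claimed identity, instead of trusting its user agent, is forward-confirmed reverse DNS: look up the hostname for the connecting IP, check that it belongs to a domain the search engine publishes, then resolve that hostname and confirm it maps back to the same IP. The sketch below is only an illustration. The function name, the suffix list, and the injectable lookup parameters are assumptions, not something from this thread, and the suffixes should be checked against each search engine's own documentation.

```python
import socket

# Assumed hostname suffixes for illustration; verify against each engine's docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None,
                        suffixes=TRUSTED_SUFFIXES):
    """Return True if ip reverse-resolves to a trusted host that resolves back to ip."""
    # The lookups are injectable so the logic can be exercised without a network.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex only returns IPv4 addresses.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        # Step 1: the PTR record must sit under a trusted domain.
        if not host.endswith(tuple(suffixes)):
            return False
        # Step 2: the hostname must resolve back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

This only helps with crawlers that publish verifiable hostnames. A scraper that makes no identity claim at all, or that routes through residential proxies, passes straight through this kind of check.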

melroy@kbin.melroy.org on 23 Sep 2024 23:30

Instead of blocking bots by user agent, I'm blocking full IP ranges: https://gitlab.melroy.org/-/snippets/619
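For readers who want to see what range-based blocking looks like in application code, here is a minimal sketch using Python's standard `ipaddress` module. It is not the contents of the linked snippet. The CIDR ranges are placeholders taken from the reserved documentation address blocks, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges from the documentation-only address blocks,
# NOT real crawler ranges. Substitute the ranges you actually want to block.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(ip, ranges=BLOCKED_RANGES):
    """Return True if ip falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IP address; leave the decision to other checks.
        return False
    # Membership is False when the address and network versions differ.
    return any(addr in net for net in ranges)
```

In practice this kind of rule usually lives in the firewall or web server rather than the application, so blocked requests never reach the site's code at all.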

vinnymac@lemmy.world on 24 Sep 2024 02:51

It certainly can be a cat-and-mouse game, but scraping at scale tends to stay ahead of the security teams. Some examples:

brightdata.com

oxylabs.io

Requiring an account, with strict access rules, can curb the vast majority of scraping. Then your only bad actors are the rich venture capitalists.

scarabine@lemmynsfw.com on 23 Sep 2024 18:04

I have an idea. Why don’t I put a bunch of my website stuff in one place, say a pdf, and you screw heads just buy that? We’ll call it a “book”

gaylord_fartmaster@lemmy.world on 23 Sep 2024 18:36

They’re already ignoring robots.txt, so I’m not sure why anyone would think they won’t just ignore this too. All they have to do is get a new IP and change their user agent.
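The point about robots.txt can be made concrete: the file is only a request, and it is the crawler itself that decides whether to check it. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would consult the rules, and how the answer changes as soon as the client reports a different user agent. The rules, the `GPTBot` token, and the `allowed` helper are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: refuse one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return what robots.txt says about this user agent fetching this URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The same request is refused or permitted purely on the self-reported name.
print(allowed("GPTBot", "https://example.com/page"))
print(allowed("Mozilla/5.0", "https://example.com/page"))
```

Nothing on the server enforces the result. A scraper that skips this check, or sends a browser-like user agent, is not stopped by the file.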

redditReallySucks@lemmy.dbzer0.com on 24 Sep 2024 07:32

Cloudflare is protecting a lot of sites from scraping with their proof-of-work (PoW) captchas. They could let through the people who pay.
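For anyone unfamiliar with the idea, a proof-of-work challenge makes the client burn CPU time finding an answer that the server can check almost for free. The sketch below is a generic hashcash-style example and is not Cloudflare's actual challenge, whose details are not described in this thread. The function names, the `challenge:nonce` format, and the difficulty value are all assumptions.

```python
import hashlib
import itertools


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap server-side check: the SHA-256 digest must start with N zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def solve(challenge: str, difficulty_bits: int) -> int:
    """Expensive client-side search: try nonces until one verifies."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


# Hypothetical usage: the server issues a random challenge string per visitor.
nonce = solve("example-challenge", 12)
print(verify("example-challenge", nonce, 12))
```

Each extra difficulty bit roughly doubles the expected work for the client while verification stays a single hash. That is negligible for one human page view but adds up for a bot fetching millions of pages.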

magic_smoke@links.hackliberty.org on 23 Sep 2024 23:40

As someone who uses Invidious daily, I’ve always been of the belief that if you don’t want something scraped, then maybe don’t upload it to a public web page/server.

General_Effort@lemmy.world on 24 Sep 2024 10:59

There are probably not many people here who understand the connection between Invidious and scraping.

Justas@sh.itjust.works on 26 Sep 2024 12:00

Imagine a company that sells a lot of products online. Now imagine a scraping bot arriving at peak sales hours and requesting each of that company's product listings and pages separately. Now realise that some genuine users will have a worse buying experience because of that.

magic_smoke@links.hackliberty.org on 26 Sep 2024 13:15

Yeah, there are way easier ways to combat that without trying to prevent scraping.

Maybe don’t ship 20 units to the same address.