The Open-Source Software Saving the Internet From AI Bot Scrapers (www.404media.co)
from sabreW4K3@lazysoci.al to technology@beehaw.org on 07 Jul 15:44
https://lazysoci.al/post/29843474

#technology

FundMECFSResearch@lemmy.blahaj.zone on 07 Jul 16:15

Anubis always flags me for some reason. I use Mullvad and Safari (iOS) with some ad- and tracker-blocking extensions.

simple@piefed.social on 07 Jul 16:26

Do you have JavaScript or cookies disabled? That might stop you from getting past.

FundMECFSResearch@lemmy.blahaj.zone on 07 Jul 16:26

nope

Photuris@lemmy.ml on 07 Jul 18:38

More sites in general are blocking Mullvad traffic lately (in my experience), and I’m not sure what, if anything, can be done about it.

FundMECFSResearch@lemmy.blahaj.zone on 07 Jul 18:47

I expect better from a popular FOSS tool used by privacy-aware people, though.

SweetCitrusBuzz@beehaw.org on 07 Jul 19:31

Can you open an issue, or check whether one is already open for this?

Powderhorn@beehaw.org on 07 Jul 21:52

Agreed. Luckily, they don’t seem to have the full list of Mullvad IPs, so if I really want to read something, I just try another tunnel.

Appoxo@lemmy.dbzer0.com on 08 Jul 05:14

I wonder why traffic from known VPN companies is under more scrutiny than traffic from domestic households…

theangriestbird@beehaw.org on 07 Jul 18:44

This snip at the end is so good:

Iaso said she thinks AI companies follow her work, and that if they really want to stop her and Anubis they just need to distract her.

“If you are working at an AI company, here’s how you can sabotage Anubis development as easily and quickly as possible,” she wrote on her site. “So first is quit your job, second is work for Square Enix, and third is make absolute banger stuff for Final Fantasy XIV. That’s how you can sabotage this the best.”

Geodad@beehaw.org on 07 Jul 19:43

I’d be fine with this… 🤣

leaky_shower_thought@feddit.nl on 07 Jul 19:25

i like this one better than cloudflare’s turnstile.

cf blocks me all the time for the smallest reasons and i can’t seem to find their nag email.

fuckwit_mcbumcrumble@lemmy.dbzer0.com on 07 Jul 20:30

I have no issues with Cloudflare, but Anubis always takes its sweet-ass time to verify me. Like 30+ seconds just sitting there, but then eventually I get in.

Vanilla_PuddinFudge@infosec.pub on 08 Jul 18:51

Windows XP ended support like 20 years ago if you were wondering if the Pentium 4 build you’re using was still viable.

remington@beehaw.org on 07 Jul 20:57

Would you edit your post and add the following archive link to the body, please?

archive.is/VcoE1

who@feddit.org on 07 Jul 21:04

Unfortunately, archive.is seems to have moved behind a big corporate CAPTCHA service, subjecting readers to having their reading habits (both the articles and the referring communities) tracked at a large scale.

I suggest this archive link instead:

web.archive.org/…/the-open-source-software-saving…

remington@beehaw.org on 07 Jul 21:06

Unfortunately, archive.is has moved behind Cloudflare, subjecting readers to having their reading habits (both the articles and the referring communities) tracked at a large scale.

How do you know this?

What about ghostarchive.org?

who@feddit.org on 07 Jul 21:27

Sorry; I shouldn’t have written Cloudflare specifically. Their CAPTCHA page now contains scripts from Google, not Cloudflare. I have corrected my comment.

How do you know this?

Because a couple months ago, archive.is/archive.today started showing me CAPTCHA pages instead of the archived articles when I use Firefox with scripts disabled. The current page contains scripts hosted by Google, which I won’t enable, so I can’t read the archived articles.

What about ghostarchive.org?

I haven’t used that site enough to have a consistent picture of what it’s doing. When I tried it a few minutes ago, it directed me to a CAPTCHA wall when trying to submit an article, but not when searching for an archived article. I’ll try to remember to look at it again periodically, to be able to answer this question in the future.

remington@beehaw.org on 07 Jul 21:32

Thanks. I appreciate the info and effort.

sabreW4K3@lazysoci.al on 07 Jul 21:28

To be honest with you, I refuse on moral grounds. 404 are independent and do good work. You’ve already linked a paywall bypass in the comments; if anyone would like to find it, it’s not hard to scroll.

remington@beehaw.org on 07 Jul 21:31

OK. Fair enough.

who@feddit.org on 07 Jul 21:14

She told me she’s […] also thinking about a version that doesn’t require JavaScript, which some privacy-minded disable in their browsers.

As someone who is keenly aware of the privacy and security problems that come with allowing web scripts, I hope she prioritizes this soon. It’s really disappointing to find that sites that were formerly readable without JavaScript have suddenly become inaccessible since adopting Anubis. The more sites that do this, the more people are pushed toward enabling scripts by default, exposing them to a great many trackers and web exploits that would otherwise be blocked.

exu@feditown.com on 08 Jul 15:26

There’s an option using some very new HTML tag, but it’s not the default.

anubis.techaro.lol/docs/admin/…/metarefresh

who@feddit.org on 08 Jul 18:38

Interesting. Judging by that option’s name, it seems to refer to use of the HTML <meta> tag to refresh a page.

developer.mozilla.org/en-US/docs/…/http-equiv

Neither this tag nor using it for refresh is new at all. I don’t think I’ve seen it used to detect bots, though. I wonder what Anubis is doing here.
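
For reference, the refresh form of the tag looks like <meta http-equiv="refresh" content="5; url=/next">. Below is a minimal Python sketch of how a client that honors the tag might pull out the delay and the target. The sample page, the /next path, and the helper names are made up for illustration and are not taken from Anubis.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Records the delay and target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh = None  # becomes (delay_seconds, url_or_None) once found

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute has the form "<seconds>; url=<target>".
        delay, _, rest = attr.get("content", "").partition(";")
        rest = rest.strip()
        url = rest[4:].strip() if rest.lower().startswith("url=") else None
        try:
            self.refresh = (float(delay.strip()), url or None)
        except ValueError:
            pass  # malformed delay: treat the tag as absent


def find_meta_refresh(html):
    """Return (delay_seconds, url) for the first meta refresh in html, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.refresh


# Hypothetical page used only to exercise the parser.
SAMPLE_PAGE = '<html><head><meta http-equiv="refresh" content="5; url=/next"></head></html>'
print(find_meta_refresh(SAMPLE_PAGE))
```

A browser does the equivalent of this automatically. A simple scraper that just downloads the HTML and moves on never looks at the tag at all.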

JohnEdwa@sopuli.xyz on 09 Jul 00:18

It’s simply checking if the connection is from an actual browser, as a scraper pretending to be one won’t actually refresh the page as instructed. It’s going to buy some time, but like the rest of Anubis in general, it will only work until the scrapers get modified to work around it.
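
A rough sketch of that idea in Python, with the caveat that this is not Anubis's actual code: the server hands out a page whose only job is a meta refresh pointing at a signed, timestamped token, and lets the client through only if it comes back with that token after the requested delay. The secret, the /challenge path, the delay and expiry values, and the function names are all hypothetical.

```python
import hashlib
import hmac
import time

SECRET = b"example-secret"  # hypothetical server-side key
MIN_DELAY = 3               # seconds the client is told to wait (assumed value)
MAX_AGE = 60                # seconds before a token expires (assumed value)


def _sign(client_id, issued_at):
    """HMAC over the client identifier and issue time, so tokens cannot be forged."""
    msg = f"{client_id}|{issued_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue_challenge(client_id, now=None):
    """Return (html_page, token); the page asks a browser to come back with the token."""
    issued_at = int(time.time() if now is None else now)
    token = f"{issued_at}.{_sign(client_id, issued_at)}"
    page = (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{MIN_DELAY}; url=/challenge?token={token}">'
        "</head><body>Checking your browser...</body></html>"
    )
    return page, token


def verify_challenge(client_id, token, now=None):
    """Accept only an untampered token that returns after the delay and before expiry."""
    now = time.time() if now is None else now
    issued_str, _, sig = token.partition(".")
    try:
        issued_at = int(issued_str)
    except ValueError:
        return False
    expected = _sign(client_id, issued_at)
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    return MIN_DELAY <= now - issued_at <= MAX_AGE
```

Nothing here stops a scraper that parses the tag, waits, and follows the URL, which is exactly why it only buys time.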