Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal (www.404media.co)
from shish_mish@lemmy.world to technology@lemmy.world on 24 Jul 2024 16:43
https://lemmy.world/post/17913261

#technology


BearOfaTime@lemm.ee on 24 Jul 2024 16:46 next collapse

Excellent! Now I won’t get reddit results and then have to filter them out!

MehBlah@lemmy.world on 24 Jul 2024 16:49 next collapse

Sounds great to me. With reddit gone maybe we can start to find what we are looking for without having to go sort through reddit.

tal@lemmy.today on 24 Jul 2024 18:45 collapse

Kagi has a “search lens” specifically to search the Threadiverse. Like, they track lemmy/kbin/etc sites and you can specifically include them in their own results section, and can also have “!threadiverse” or whatever you want specifically search that.

They do the same for Usenet.

I suppose, given this new robots.txt Reddit development, that they’ll probably never have a Reddit lens, though.

zutto@lemmy.fedi.zutto.fi on 24 Jul 2024 20:14 collapse

Kagi is a metasearch engine (apart from their homebrew small-web index, known as Teclis), so the Reddit lenses will continue to function as long as one of the search engines it’s querying is paying Reddit.

MyOpinion@lemm.ee on 24 Jul 2024 17:52 next collapse

These two shitty companies deserve each other.

woelkchen@lemmy.world on 24 Jul 2024 18:08 next collapse

test site:reddit.com works fine from DDG for me.

tal@lemmy.today on 24 Jul 2024 18:23 next collapse

Older results will still show up, but these search engines are no longer able to “crawl” Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward.

Robots.txt lets you ask specific user-agents not to crawl the site. My guess is that that’s how they restricted it. I don’t know how those changes are reflected in existing indexed pages – don’t know if there’s any standard there – but it’ll stop crawlers from downloading new pages.

Try searching for new posts, see how DDG/Bing compares to Google.

EDIT: Yeah. They’ve got a sitewide ban for all crawlers. That’d normally block Google’s bot too, but I bet that they have some offline agreement to have it ignore the thing, operate out-of-spec.

www.reddit.com/robots.txt

User-agent: *
Disallow: /

Here’s a snapshot on archive.org’s Wayback Machine from April 30 of this year. Very different:

web.archive.org/web/20240430000731/…/robots.txt
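
For anyone who wants to check what that file means for a generic crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rules are the ones quoted above; the crawler name `ExampleBot` and the sample URL are made up for illustration, and nothing here touches the network.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above from www.reddit.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it has no group of its own,
# so it falls under the "*" group.
url = "https://www.reddit.com/r/technology/"
blocked = not parser.can_fetch("ExampleBot", url)
print("blocked" if blocked else "allowed")
```

This only shows what a well-behaved crawler would conclude from the file. It says nothing about what Reddit's servers do with requests from a crawler that ignores it.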

squidspinachfootball@lemm.ee on 24 Jul 2024 18:26 next collapse

iirc, isn’t robots.txt more of a gentlemen’s agreement? I vaguely recall bots being able to crawl a site regardless, it’s just that most devs respect robots.txt and don’t. Could be wrong though, happy to be corrected.

tal@lemmy.today on 24 Jul 2024 18:33 collapse

Sure, you can write software that violates the spec. But I mean, that’d be true for anything that Reddit can do on their end. Even if they block responses to what they think are bots, software can always try hard to impersonate users and scrape websites. You could go through a VPN, pretend to be a browser being linked to a page.

But major search engines will follow the spec with their crawlers.

EDIT: RFC 9309, Robots Exclusion Protocol

datatracker.ietf.org/doc/html/rfc9309

If no matching group exists, crawlers MUST obey the group with a user-agent line with the “*” value, if present.

To evaluate if access to a URI is allowed, a crawler MUST match the paths in “allow” and “disallow” rules against the URI.

EDIT2: Amusingly, though, Google apparently isn’t following it in this particular case with GoogleBot, given the way that they’re signing agreements. They’ll honor it for sites that they haven’t signed agreements with, though.

EDIT3: Actually, on second thought, GoogleBot may be honoring it too. GoogleBot may not be crawling Reddit anymore. They may have some “direct pipe” that passes comments to Google that bypasses Google’s scraper. Less load on both their systems, and lets Google get real-time index updates without having to hammer the hell out of Reddit’s backend to see if things have changed. Like, think of how Twitter’s search engine is especially useful because it has full-text search through comments and immediately updates the index when someone comments.
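
To make the two RFC 9309 sentences quoted above a bit more concrete, here is a rough sketch of how a crawler might pick its group and then evaluate a path. It is a simplification, not a full implementation: it assumes one group per user-agent, does plain prefix matching, and ignores the `*` and `$` wildcards in paths. The names `select_group`, `is_allowed` and `ExampleBot` are hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)


def select_group(groups: Dict[str, List[Rule]], product_token: str) -> List[Rule]:
    """Return the crawler's own group, falling back to the "*" group."""
    wanted = product_token.lower()  # product tokens match case-insensitively
    for agent, rules in groups.items():
        if agent.lower() == wanted:
            return rules
    return groups.get("*", [])


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; on a tie allow wins; no match means allowed."""
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_for_allow:
                best_len = len(prefix)
                allowed = kind == "allow"
    return allowed


# A sitewide ban like the one quoted earlier in the thread.
sitewide_ban = {"*": [("disallow", "/")]}
rules = select_group(sitewide_ban, "ExampleBot")
print(is_allowed(rules, "/r/technology/"))
```

With only a `*` group that disallows `/`, every crawler without its own group ends up blocked from every path, which is the situation described above.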

squidspinachfootball@lemm.ee on 24 Jul 2024 20:57 collapse

That’s a good point, it’s probably way less load and overhead if Reddit and Google just sent info back and forth instead of scraping. Good way for Google to keep their spot as the favoured search engine and beat the competition too, since everything that comes up these days is articles full of SEO nonsense at best and AI-generated nonsense at worst. If nobody else can read the actual human responses, Google has a huge leg up. Also interesting to see that Google’s honouring the txt file even when nobody’s holding them to it.

I had no idea Twitter’s search updated their index immediately after a comment is posted though. That’s a lot of updates considering the amount of posts they get daily.

tal@lemmy.today on 25 Jul 2024 01:32 collapse

I had no idea Twitter’s search updated their index immediately after a comment is posted though.

While I never had a Twitter account, it’s the major reason that I used the service anonymously. In an unfolding event, like a natural disaster or something, it was absolutely unparalleled in its ability to rapidly comb through enormous amounts of information being plonked in by people around the world. I strongly prefer Reddit-style forum structure most of the time, but for issues for which there are no pre-existing communities and where the common issue is one that will only exist for a short period of time, I think that Twitter’s ad-hoc connections between retweets and hashtags work much better than Reddit’s association-of-comments-by-subreddit. I understand that Mastodon, unfortunately, doesn’t have a full-text search feature, just searching based on exact hashtags. Actually…hmm. I was just talking about Kagi’s search lens for the Threadiverse in another comment that I saw. I wonder if Kagi actually indexes Mastodon as well? That’d provide for similar functionality.

investigates

No, it looks like they only do the Reddit-alike Threadiverse (lemmy, kbin, mbin, etc), for which they use the term “Fediverse Forums”.

investigates further

It does look like they index in real time, though, or at least quickly – they probably are one of the institutions out there with an instance slurping up everything out there. I was able to find your comment on that search lens.

That’s a lot of updates considering the amount of posts they get daily.

Yeah, I’m sure that however the Twitter guys built it, they specifically designed it around permitting inexpensive index updates.

eager_eagle@lemmy.world on 24 Jul 2024 18:43 collapse

User-Agent: bender
Disallow: /my_shiny_metal_ass

itslilith@lemmy.blahaj.zone on 25 Jul 2024 10:33 collapse

set the date filter to something recent, test site:reddit.com df:w (results from last week only) gives 0 hits

Brewchin@lemmy.world on 24 Jul 2024 18:37 next collapse

Parts of the Internet are only searchable on specific sites now? What next - charging a monthly subscription to use Google?

This needs to be regulated before the Internet becomes like streaming TV.

tal@lemmy.today on 24 Jul 2024 18:40 collapse

Robots.txt has been around for a long time, and all the major search engines will honor it. Not having a full index of the Web is the norm.

That isn’t to say that the practice of signing agreements isn’t potentially a concern. Not sure that I like the idea of search engines paying sites money to degrade search results of competitors.

[deleted] on 24 Jul 2024 20:38 next collapse

.

reddig33@lemmy.world on 25 Jul 2024 04:51 collapse

What isn’t the norm is to serve one robots.txt to one company, and a different robots.txt to everyone else. Which is what Reddit is doing here.
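
Nobody in this thread knows how Reddit actually does this, so the following is purely an illustration of the idea of serving a different robots.txt depending on who asks. The function name `robots_txt_for`, the token `partnerbot` and both policy strings are hypothetical.

```python
# Two hypothetical policies: one permissive, one blocking everything.
OPEN_POLICY = "User-agent: *\nAllow: /\n"
CLOSED_POLICY = "User-agent: *\nDisallow: /\n"

# Hypothetical list of crawler tokens that receive the permissive file.
PARTNER_TOKENS = ("partnerbot",)


def robots_txt_for(user_agent_header: str) -> str:
    """Choose which robots.txt body to serve based on the User-Agent header."""
    ua = user_agent_header.lower()
    if any(token in ua for token in PARTNER_TOKENS):
        return OPEN_POLICY
    return CLOSED_POLICY


print(robots_txt_for("Mozilla/5.0 (compatible; PartnerBot/1.0)"))
print(robots_txt_for("Mozilla/5.0 (compatible; SomeOtherBot/2.0)"))
```

A User-Agent header is trivial to fake, so a real setup would presumably also check where the request comes from rather than trusting the header alone.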

maxenmajs@lemmy.world on 24 Jul 2024 18:48 next collapse

Alright then. The 3rd party app drama already pushed me here. I really won’t go back for anything if I’m not allowed to search for Reddit anymore.

Reverendender@sh.itjust.works on 24 Jul 2024 20:40 next collapse

Is Google really permitted to prevent any other search engine from looking at Reddit?

Evotech@lemmy.world on 24 Jul 2024 20:51 collapse

I guess Reddit is permitted to only let Google index it

Reverendender@sh.itjust.works on 24 Jul 2024 20:53 next collapse

Are they though?

Evotech@lemmy.world on 24 Jul 2024 20:55 collapse

I don’t know of any law that says that they can’t.

reddig33@lemmy.world on 25 Jul 2024 04:52 collapse

I also don’t know of a law that says search engines have to honor a robots.txt file. I guess we will see what happens if Bing or some other service decides to ignore it.

Evotech@lemmy.world on 25 Jul 2024 04:54 collapse

You can just require a log in to view content, or just flat out auto ban indexing robots.

helenslunch@feddit.nl on 24 Jul 2024 22:35 collapse

How can they do that, logistically?

Like I realize there’s a flag they can raise that asks not to be indexed but that’s not legally binding.

Evotech@lemmy.world on 25 Jul 2024 04:33 collapse

I guess they can make it hard to index by scraping, by rate limiting or requiring login to view content etc., and only provide Google the API to bypass the restrictions.

There’s probably a lot of ways to do it
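
As one example of the rate-limiting option, here is a minimal token-bucket sketch. It is a generic technique, not anything Reddit is known to use; the class name `TokenBucket` and the numbers are made up, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size, in requests
    refill_per_sec: float  # sustained request rate
    tokens: float          # tokens currently available
    last: float            # timestamp of the previous check, in seconds

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client (for example per IP address): a burst of 2 requests,
# then 1 request per second.
bucket = TokenBucket(capacity=2, refill_per_sec=1, tokens=2, last=0.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(decisions)
```

A scraper hammering the site runs out of tokens quickly, while a normal reader rarely notices. A partner with a separate API would simply never hit this path.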

BrianTheeBiscuiteer@lemmy.world on 24 Jul 2024 22:50 next collapse

Thank you Lemmy, for making it so much easier to walk away from that dumpster fire!

z3rOR0ne@lemmy.ml on 24 Jul 2024 23:01 next collapse

Well that’s annoying. One work around is to use a redirect extension like Libredirect and you can still search via the !reddit bang on DuckDuckGo. Thusly if I type into my search bar which has DuckDuckGo as default:

!reddit some new post or topic, it will search reddit for the search term, then when it attempts to load the reddit page, the libredirect extension will redirect and show the results.

Requires a bit of configuring and sure is annoying, but hey, no Google search necessary to get the up to date reddit threads.

upside431@lemmy.world on 25 Jul 2024 01:23 next collapse

This shouldn’t be allowed

wafflez@lemmy.world on 25 Jul 2024 06:25 collapse

Shouldn’t*? Or /s?

upside431@lemmy.world on 25 Jul 2024 12:02 collapse

Fixed

moe90@feddit.nl on 25 Jul 2024 04:48 next collapse

just begin with site:reddit.com test for brave search and it still works

itslilith@lemmy.blahaj.zone on 25 Jul 2024 10:30 collapse

did you set time limit to last week? old posts are still indexed. just tried “site:reddit.com df:w” on DDG and no hits

pipows@lemmy.today on 25 Jul 2024 20:28 collapse

I can confirm, it works just fine on brave search. I set the time limit to yesterday and it still gave me results

Petter1@lemm.ee on 25 Jul 2024 05:06 next collapse

This seems illegal to me 😮

hahattpro@lemmy.world on 27 Jul 2024 17:55 collapse

And Brave Search