What is going on behind the scenes when searching for a lemmy post in a search engine?
from Mexigore@lemmy.world to fediverse@lemmy.world on 24 Mar 09:38
https://lemmy.world/post/27325042

I was looking earlier for a collection of posts about Proton Mail and the whole controversy with the CEO, and I opened a post from the results. The Lemmy instance that was suggested was lemmy.zip, but the community and the poster were from lemmy.world, so that made me ask myself a bunch of questions. Reference link

Note: I used duckduckgo

Here are some questions I have:

I remember learning about search engines a while back, but I don’t know how relevant that information is any more: crawlers, the idea that the more a website is linked from other websites the higher it will appear in the search results, and the whole robots.txt thing.

I know that if I wanted to search for something specific in Lemmy I could just use its own search function, but what about people who ask general questions that happen to be answered in a Lemmy post? I wanted to know how exposed we are, or will be, to people who don’t yet know about Lemmy.

#fediverse

gigachad@sh.itjust.works on 24 Mar 10:32

Instances that do not want to be found in search engines are not shown. Instance admins can configure their robots.txt by adding lemmy-search. All other instances can theoretically be found. I think their priority depends on the laws of SEO (Search Engine Optimization). This probably means that a post on myownlemmy1337 that is federated with lemmy.world will be found as a post on lemmy.world.
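To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `lemmy.example` URL and the `SomeOtherBot` name are made up for illustration, and treating lemmy-search as a crawler user-agent name is an assumption based on the comment above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an instance: block one named crawler, allow the rest.
# Assumption: "lemmy-search" is the user-agent string of the crawler being blocked.
ROBOTS_TXT = """\
User-agent: lemmy-search
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://lemmy.example/post/1"  # hypothetical instance URL
print(is_allowed("lemmy-search", url))
print(is_allowed("SomeOtherBot", url))
```

Keep in mind that robots.txt is only a request: well-behaved crawlers honour it, but nothing forces a scraper to.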

So, if Lemmy were very famous, I guess it would be possible to get pages upon pages of the same result from different instances. However, search engines usually have a way to exclude “similar” results.

Voyager may be a case where they do not want to be found, though I don’t know about this. You could add site:vger.app to your search query to test it.

catloaf@lemm.ee on 24 Mar 11:03

Search engines don’t treat Lemmy specially. They index the pages just like any other site. If it’s discoverable through the crawling process, it’ll be indexed.

nucleative@lemmy.world on 24 Mar 13:31

I think most search engines are not optimized for this. I’m sure it’s changing but might take some time.

Google has historically penalized duplicate content and selected one source as canonical, usually whichever domain is the most authoritative. When it comes to Lemmy, whichever instance hosts the community should probably be the canonical source.

rimu@piefed.social on 24 Mar 16:34

Every post has a <link rel="canonical" href="https://lemmy.instance/whatever"> tag on it which links to the version of the post on the author's instance.
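If you want to check this on a page yourself, here is a small sketch that pulls the canonical URL out of some HTML using only Python's standard `html.parser`. The class and function names are made up for this example, and the sample markup just reuses the placeholder URL from the tag above.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values, so split before comparing.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in html, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


SAMPLE = (
    '<html><head><link rel="canonical" '
    'href="https://lemmy.instance/whatever"></head><body></body></html>'
)
print(find_canonical(SAMPLE))  # https://lemmy.instance/whatever
```

To try it on a real post, save the page source from your browser and pass it to `find_canonical`.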

nucleative@lemmy.world on 25 Mar 00:20

TIL, thanks for the insight. This is as it should be and Google can deal with it no problem.

muntedcrocodile@lemm.ee on 24 Mar 16:28

Federation is a weird one with search engines. Each instance is indexed by search engines directly (if the admins allow it in robots.txt): the web crawler indexes that Lemmy instance like any other site. It used to work only like this, so a scraper would come across the same content on multiple instances and also find a bunch of backlinks to those other instances. The search engine would then classify the entirety of the fediverse as an SEO hack and ignore it. This issue has since been resolved: posts now include a special HTML tag that tells the web crawler where the original content came from (I assume the instance that manages the content, so the community’s instance).

What this means is that each instance is individually competing in the search results. When a crawler discovers lemmy.world content by accessing the lemmy.zip instance, it knows the content came from lemmy.world and thus rates the lemmy.zip content, and by extension the post you were looking for, as though it was a lemmy.world page. (I assume; none of the search providers say how their algorithms work.)
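As a rough illustration of that idea (and only that; real search engines do not publish how they do it), a crawler could collapse federated copies by grouping every fetched page under the canonical URL it declares. The `.example` URLs and the `group_by_canonical` helper below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_canonical(pages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group fetched URLs under the canonical URL each page declares."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for fetched_url, canonical_url in pages:
        groups[canonical_url].append(fetched_url)
    return dict(groups)


# Hypothetical crawl results: (URL fetched, canonical URL declared on that page).
crawled = [
    ("https://copy-a.example/post/10", "https://origin.example/post/1"),
    ("https://copy-b.example/post/77", "https://origin.example/post/1"),
    ("https://origin.example/post/1", "https://origin.example/post/1"),
    ("https://origin.example/post/2", "https://origin.example/post/2"),
]

for canonical, copies in group_by_canonical(crawled).items():
    print(canonical, "<-", len(copies), "fetched copies")
```

Each group then counts as one piece of content credited to the canonical URL, rather than several near-duplicates.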

It’s a shame that it works this way, as it means that each instance has to outperform the competition individually instead of being able to work as a collective. Ideally the fediverse would have a single domain that search engines can be told is the content origin, and thus the fediverse would be able to compete as a collective.

JustAnotherKay@lemmy.world on 25 Mar 03:39

Ideally the Fediverse would have a single domain…

Educate me a bit. Would it be possible to create a unique low level domain like www for the Fediverse? Imagine fed.service.instance/data as the structure perhaps?

muntedcrocodile@lemm.ee on 25 Mar 05:43

It would have to be a common domain, so the content origin would have to be something like instance.fediverse.com. The common domain needs to be the base-level domain.

JustAnotherKay@lemmy.world on 25 Mar 06:15

Would you happen to be able to tell me why using the lower domain doesn’t work? Also am I using the right terminology?

muntedcrocodile@lemm.ee on 25 Mar 07:17

So domains are recursive tree structures. You have a node com, which has a bunch of nodes below it for different domains, which can in turn have subdomains, etc. It’s like this because that’s how DNS was designed.
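A quick sketch of that tree idea: reading a hostname right to left gives the path from the top of the tree down to the host. The helper names are made up, and instance.fediverse.com is just the illustrative name used earlier in this thread.

```python
from typing import List


def domain_path(hostname: str) -> List[str]:
    """Return the labels of a hostname ordered from the top of the tree down."""
    labels = hostname.strip(".").lower().split(".")
    return list(reversed(labels))


def ancestors(hostname: str) -> List[str]:
    """Return each enclosing domain, from the top-level domain down to the host."""
    path = domain_path(hostname)
    return [".".join(reversed(path[: i + 1])) for i in range(len(path))]


print(domain_path("instance.fediverse.com"))  # ['com', 'fediverse', 'instance']
print(ancestors("instance.fediverse.com"))    # ['com', 'fediverse.com', 'instance.fediverse.com']
```

So the leftmost label is the deepest node, and everything to its right is a parent.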

Search engines give scores to domains and pages. When you say that the content origin of your page is some other domain, the search engine will use that knowledge to adjust the rating accordingly.

Say you have 2 domains, site1.example.com and site2.example.com. The search engine will have a rank for site1, site2, and example.com, where the ratings of both site1 and site2 affect the rating for example.com.

If all federated content has the same content origin, say instance.fediverse.com, then all federated content will be rated as part of fediverse.com, and all content will be working together across all instances, boosting the fediverse as a single entity.
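Here is a small sketch of why the shared parent matters. It groups hostnames by a naive "parent domain" taken as the last two labels. That shortcut is an assumption for illustration only: it breaks on suffixes like co.uk, where a real implementation would consult the Public Suffix List, and it says nothing about how any search engine actually scores domains.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parent_domain(hostname: str) -> str:
    """Naive assumption: the shared parent is the last two labels of the hostname."""
    labels = hostname.strip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_parent(hostnames: Iterable[str]) -> Dict[str, List[str]]:
    """Group hostnames under the parent domain they share."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host in hostnames:
        groups[parent_domain(host)].append(host)
    return dict(groups)


hosts = ["site1.example.com", "site2.example.com", "lemmy.world", "lemmy.zip"]
print(group_by_parent(hosts))
```

site1 and site2 land in the same example.com group, while lemmy.world and lemmy.zip each end up alone, which is the "every instance competes individually" situation described above.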

JustAnotherKay@lemmy.world on 25 Mar 21:36

Thank you for teaching me a bit! Ya know, I bet my computer science classes taught me this at some point. I should pay more attention to my readings lol