A young computer scientist and two colleagues show that searches within data structures called hash tables can be much faster than previously deemed possible. (www.quantamagazine.org)
from Cat@ponder.cat to technology@lemmy.world on 11 Feb 2025 07:30
https://ponder.cat/post/1594369

#technology

threaded - newest

tyler@programming.dev on 11 Feb 2025 08:39 next collapse

This is incredible, but why does the article end by stating that this might not have any immediate applications? Shouldn’t this immediately result in more efficient hash tables in everyday programming languages?

source_of_truth@lemmy.world on 11 Feb 2025 08:48 next collapse

I don’t think the speed of hash tables is a problem in most applications except benchmarks.

lime@feddit.nu on 11 Feb 2025 09:13 next collapse

anything that deserializes arbitrary json will put it into a hash table, right? it would definitely speed up the web.
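
for example, in python (toy payload, just to show that what comes out of json.loads is a dict, i.e. a hash table underneath):

```python
import json

# a toy payload; json.loads gives back a dict, which CPython implements
# as a hash table under the hood
payload = json.loads('{"user": "lime", "instance": "feddit.nu", "posts": 42}')

print(type(payload))        # <class 'dict'>
print(payload["instance"])  # a hash-table lookup: feddit.nu
```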

barsoap@lemm.ee on 11 Feb 2025 10:35 next collapse

Using bencode over json would probably speed up the web more. Not to mention good ole ASN.1 (well, at least some binary schemes for ASN.1). The web is completely cooked when it comes to efficiency.

lime@feddit.nu on 11 Feb 2025 10:40 collapse

the biggest speedup would probably come from using proper schemas that can be efficiently parsed. but we’ve made our bed out of ad-hoc protocols.

barsoap@lemm.ee on 11 Feb 2025 11:01 next collapse

And yet all that pales in comparison to using react (or whatever framework) over vanilla js. Enter McMaster-Carr.

lime@feddit.nu on 11 Feb 2025 11:02 collapse

yupyup, just send HTML over the wire. it’s fine.

frezik@midwest.social on 11 Feb 2025 11:38 collapse

JSON libraries are stupidly well optimized. There are binary encoding schemes that are faster and more compact, but it's hard to beat JSON among text-based formats.

frezik@midwest.social on 11 Feb 2025 11:23 collapse

Depends on the implementation, but most will, yes. There are other forms of associative arrays, like tries or binary trees, but hash tables are the most common.
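
For contrast, here's a toy trie in Python (purely illustrative, not production code; a real implementation would handle deletion, iteration, etc.):

```python
# purely illustrative toy trie (prefix tree), one of the alternative
# associative arrays mentioned above; keys and values are made up
class TrieNode:
    def __init__(self):
        self.children = {}  # amusingly, a dict per node
        self.value = None

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

t = Trie()
t.insert("hash", "table")
print(t.get("hash"))  # table
print(t.get("heap"))  # None
```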

zkfcfbzr@lemmy.world on 11 Feb 2025 09:37 collapse

Hash tables are often used behind the scenes. dicts and sets in python both utilize hash tables internally, for example.
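
A quick way to see it (NoisyKey is a made-up class, just for illustration): anything you use as a dict key or set element goes through hash() first.

```python
# NoisyKey logs its own __hash__ calls, so you can watch dict/set
# operations hit the hash table machinery
class NoisyKey:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        print(f"hash() called on {self.name!r}")
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, NoisyKey) and self.name == other.name

d = {NoisyKey("a"): 1}                   # hash() called on 'a'
print(d[NoisyKey("a")])                  # hash() called on 'a', then prints 1
print(NoisyKey("a") in {NoisyKey("a")})  # sets hash too; prints True
```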

source_of_truth@lemmy.world on 11 Feb 2025 10:56 collapse

I've only used Java, but Java hash tables are stupid fast in my experience; everything else in my crap programs was 1000 times slower than the hash table access or storage.

Just reading the title, it’s talking about searching hash tables, which wasn’t something I was specifically doing.

0x0@lemmy.dbzer0.com on 11 Feb 2025 13:20 next collapse

Sorry to be blunt, but you don’t know what you’re talking about.

source_of_truth@lemmy.world on 11 Feb 2025 21:00 collapse

How much do you think this speeds up an average program? Enlighten us with your knowledge.

0x0@lemmy.dbzer0.com on 11 Feb 2025 21:36 collapse

I wrote my comment not to antagonize you but to point out that you’re asking the wrong questions. I failed to articulate that, and I’m sorry for being harsh.

Your prior comment indicated that you have used hash tables in Java, which were very fast. You said that your program accessed the hash tables, but did not “search” the table. These operations are the same thing, which led me to believe you’re out of your depth.

This last comment asks me how much this paper’s contribution speeds up an average program. You’re asking the wrong question, and you seem to be implying the work was useless if it doesn’t have an immediate practical impact. This is a theoretical breakthrough far over my head. I scanned the paper, but I’m unsurprised they haven’t quantified the real-world impact yet. It’s entirely possible that despite finding an asymptotic improvement, the constant factors (elided by the big O analysis) are so large as to be impractical… or maybe not! I think we need to stay tuned.

Again, sorry for being blunt. We all have to start somewhere. My advice is to be mindful of where the edge of your expertise lies and try to err on the side of not devaluing others’ work.

deegeese@sopuli.xyz on 11 Feb 2025 16:49 collapse

If you use a hash table, you search every time you retrieve an object.

If you didn’t retrieve, why would you be storing the data in the first place?

source_of_truth@lemmy.world on 11 Feb 2025 20:58 collapse

I know that, but phrased that way it sort of sounds like they’re iterating over the entire thing.

[deleted] on 11 Feb 2025 09:26 next collapse

.

rockSlayer@lemmy.world on 11 Feb 2025 10:05 next collapse

Infrastructural APIs are much slower to change, and in a lot of cases the use of those APIs is dependent on a specific version. The change will definitely occur over time as the practical limitations are discovered.

barsoap@lemm.ee on 11 Feb 2025 10:06 next collapse

After reading through the abstract, the article is pop-sci bunk: they developed a method to save additional space with constant-time overhead.

Which is certainly novel and nice and all kinds of things, but it's just a tool in the toolbox. Making things more optimal in theory says little about things being faster in practice, because the theoretical cost models never match what real-world machines are actually doing. In algorithm classes we learn to analyse sorting algorithms by the number of comparisons, and indeed the minimum necessary is O(n log n); in the real world, it's the number of cache misses that matters: CPUs can compare numbers basically instantly, and getting the stuff you want to compare from memory to the CPU is where the time is spent. It can very well be faster to make more comparisons if it means you get fewer, or more regular (so that the CPU can predict and pre-fetch), data transfers.

Consulting my crystal ball, I see this trickling down into at least the minds of the people who develop the usual KV stores, database engineers, etc. Maybe it'll help, maybe it won't; those things are already incredibly optimized. Never trust a data structure optimisation you didn't benchmark. Never trust any optimisation you didn't benchmark, actually. Do your benchmarks; you're not smarter than reality. In case it does help, it's going to trickle down into the standard implementations of the data structures languages ship with.
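
If you want to sanity-check something along those lines, even a crude timeit harness beats intuition. Sizes and structures below are arbitrary placeholders; swap in whatever you're actually comparing:

```python
import random
import timeit

# crude sketch: dict lookups vs. a linear scan over a list.
# results vary by machine and workload, which is exactly why you
# benchmark instead of trusting the cost model.
N = 100_000
keys = list(range(N))
random.shuffle(keys)
as_dict = {k: k for k in keys}
as_list = keys[:]
probe = random.sample(keys, 1_000)

t_dict = timeit.timeit(lambda: [as_dict[k] for k in probe], number=100)
t_list = timeit.timeit(lambda: [k in as_list for k in probe], number=1)

print(f"dict lookups: {t_dict:.3f}s for 100 rounds")
print(f"list scans:   {t_list:.3f}s for 1 round")
```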

EDIT: I was looking at this paper, not that one. It's actually disproving a conjecture of Yao, who has a Turing Award, so certainly a nice feather to have in your cap. It's also way more into the theoretical weeds than I'm comfortable with. This may have applications, or this may go along the lines of the Karatsuba algorithm: faster only if your data is astronomically large; for (most) real-world applications the constant overhead outweighs the asymptotic speedup.

taladar@sh.itjust.works on 11 Feb 2025 10:19 next collapse

Also never even start optimizing until you profile and are sure the bit you are trying to optimize even matters to the overall performance of your program.

tyler@programming.dev on 11 Feb 2025 21:58 collapse

the reason it confused me is that the college student was clearly using the algorithm to accomplish his task, not just designing it theoretically. So it didn't seem to be a small improvement that would only be noticeable in certain situations.

I’m not smart enough to understand the papers so that’s why I asked.

barsoap@lemm.ee on 11 Feb 2025 22:17 collapse

Oh no it’s definitely a theoretical paper. Even if the theory is fully formalised and thus executable it still wouldn’t give much insight on how it’d perform in the real world because theorem provers aren’t the most performant programming languages.

And, FWIW, CS theorists don’t really care about running programs same as theoretical physicists don’t care much about banging rocks together, in both cases making things work in the real world is up to engineers.

tyler@programming.dev on 12 Feb 19:40 next collapse

you’ve misunderstood what I’ve said, but whatever.

TechLich@lemmy.world on 13 Feb 03:52 collapse

It's really not. Just because they describe their algorithm in computer science terms in the paper doesn't mean it's theoretical. Their elastic and funnel examples are very clear and pretty simple and can be implemented in any language you like…

Here’s a simple python example implementation I found in 2 seconds of searching: github.com/sternma/optopenhash/

Here’s a rust crate version of the elastic hash: github.com/cowang4/elastic_hash_rs

It's not a lot of code to make a hash table; it's a common first-year computer science topic.

What’s interesting about this isn’t that it’s a complex theoretical thing, it’s that it’s a simple undergrad topic that everybody thought was optimised to a point where it couldn’t be improved.

barsoap@lemm.ee on 13 Feb 09:30 collapse

When you have a paper that's pretty much a succession of "Lemma:", "Proof:", "Theorem:" and "Proof:", and no benchmark chart, then yes, it's a theoretical one.

OhNoMoreLemmy@lemmy.ml on 11 Feb 2025 11:04 next collapse

Hash tables are super efficient when they’re not nearly full. So the standard trick is just to resize them when they’re too close to capacity.

The new approach is probably only going to be useful in highly memory constrained applications, where resizing isn’t an option.
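
The standard trick looks roughly like this (a bare-bones linear-probing sketch; the names and the 0.75 threshold are arbitrary, and this is the classic scheme, not the paper's elastic/funnel hashing):

```python
# bare-bones linear-probing hash table that grows once it passes a
# load-factor threshold
class SimpleHashTable:
    MAX_LOAD = 0.75

    def __init__(self, capacity=8):
        self._slots = [None] * capacity
        self._count = 0

    def _probe(self, key, slots):
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)  # linear probing: try the next slot
        return i

    def __setitem__(self, key, value):
        if (self._count + 1) / len(self._slots) > self.MAX_LOAD:
            self._resize()
        i = self._probe(key, self._slots)
        if self._slots[i] is None:
            self._count += 1
        self._slots[i] = (key, value)

    def __getitem__(self, key):
        i = self._probe(key, self._slots)
        if self._slots[i] is None:
            raise KeyError(key)
        return self._slots[i][1]

    def _resize(self):
        old = self._slots
        self._slots = [None] * (len(old) * 2)  # the expensive part: rehash everything
        for entry in old:
            if entry is not None:
                self._slots[self._probe(entry[0], self._slots)] = entry

t = SimpleHashTable()
for word in ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta"]:
    t[word] = len(word)
print(t["gamma"])  # 5
```

The fuller you let it get before resizing, the more probing every operation does; resize eagerly and you burn memory instead. That's the tradeoff being discussed here.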

Sekoia@lemmy.blahaj.zone on 11 Feb 2025 13:11 next collapse

So… databases? Especially in data centers? Still a nice boost in that case

deegeese@sopuli.xyz on 11 Feb 2025 16:47 collapse

Hash tables are used in literally everything and they always need to minimize resizing because it’s a very expensive operation.

I suspect this will silently trickle into lots of things once it gets picked up by standard Python and JavaScript platforms, but that will take years.

lemmyng@lemmy.ca on 11 Feb 2025 09:52 collapse

I haven’t read the Tiny Pointers article yet, but the OP article implies that the new hash tables may rely on them. If so, then the blocker could be the introduction (or lack thereof) of tiny pointers in programming languages.

tyler@programming.dev on 11 Feb 2025 22:06 collapse

Tiny Pointers was the paper that the student read to get the idea. The paper he co-authored was "Optimal Bounds for Open Addressing Without Reordering".

BeigeAgenda@lemmy.ca on 11 Feb 2025 09:02 next collapse

Here's a link to the paper "Tiny Pointers": arxiv.org/pdf/2111.12800. Those editors at Quanta Magazine write their summary in a strange fashion, for instance using x instead of n, which is normally used in computer science when talking about big O notation.

Lojcs@lemm.ee on 11 Feb 2025 10:11 next collapse

This is the paper the article is about: arxiv.org/pdf/2501.02305

deegeese@sopuli.xyz on 11 Feb 2025 16:53 collapse

In the article, x is not the size of the hash table; it is the inverse of the table's empty fraction: a 1000-element table that is 90% full has x = 10, N = 1000.

Since they're not discussing scaling of data sizes, it would be confusing to use O(N) notation, or people would make that assumption.

Mubelotix@jlai.lu on 11 Feb 2025 09:24 next collapse

Very sus. This article is not technically convincing at all. Probably a few students who dream of inventing the next big thing. Hash tables' complexity was mathematically proven to be generally optimal. If you think you optimized it, it's because your tests don't reflect the common real world.

Capsicones@lemmy.blahaj.zone on 11 Feb 2025 11:00 collapse

The paper was published by IEEE and with professors as co-authors. Only the second author is a student. And I wouldn't dismiss it out of hand like that because of a magazine article. Students come up with breakthroughs all the time. The paper itself says it disproves Yao's conjecture. I personally plan to implement and benchmark this because the results seem so good. It could be another Fibonacci heap situation, but maybe not. Hash tables are so widely used that it might even be worthwhile to make special hardware to use this on servers, if our current computer architecture is the only thing that holds back the performance.

Edit: author sequence

Valmond@lemmy.world on 11 Feb 2025 10:38 next collapse

The fact that you can achieve a constant average query time, regardless of the hash table’s fullness, was wholly unexpected — even to the authors themselves.

WTF?

I mean, first of all, the "article" is just crap IMO, only hinting at "amazing" things. But it also contains basic errors like the above. It's like they don't understand complexity: a constant average time on what? A specific size of hash table? Or do they mean it scales linearly with its size? Just go through the whole table and voilà, you have constant time already.

So if they did find something, it's not well explained in the article IMO. Shame, cause those kinds of things are interesting, IMO.

Edit: looks like they "cheat" (in a good way) by using smarter inserts (their 'elastic hashing') to get the benefits of reordering the table without actually reordering it. If so, then it's very smart but not groundbreaking. I'm sick and just skimmed the paper, don't kill me.

blx@lemmy.zip on 11 Feb 2025 10:48 next collapse

“Constant average query time” is not that hard to understand. It means that sometimes access time is e.g. linear, and sometimes you get your content before executing the code. With a hash table large enough and full enough, this can be used to fetch content seconds, minutes, days, potentially years before the program even exists. That’s one hell of a breakthrough.

[edit] /s, oops

ripcord@lemmy.world on 11 Feb 2025 12:05 collapse

…before the program even exists…?

RandomVideos@programming.dev on 11 Feb 2025 17:43 collapse

They managed to get the time to O(1)

deegeese@sopuli.xyz on 11 Feb 2025 18:39 collapse

The article is discussing how to reduce the constant time factor that depends on the filling fraction, which is a speed-memory tradeoff when creating the hash table.

The innovation described allows for the use of fuller tables which are resized less frequently, or faster insertion/retrieval for the existing filling fraction.

meowmeowbeanz@sh.itjust.works on 11 Feb 2025 22:24 next collapse

Hash tables. The backbone of computing, optimized to death by generations of neckbeards convinced they'd squeezed out every drop of efficiency. Then some undergrad casually strolls in, ignores four decades of academic dogma, and yeets Yao's conjecture into the sun. Turns out you can make insertion times collapse from O(x) to O((log x)^2), if you're naive enough to not know the "rules."
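
For a sense of scale, a throwaway script comparing just the two asymptotic forms quoted above, with every constant ignored and the log base picked arbitrarily:

```python
import math

# x is the inverse of the table's empty fraction (90% full -> x = 10);
# constants hidden by the big O could swamp all of this in practice
for fullness in (0.90, 0.99, 0.999, 0.9999):
    x = 1 / (1 - fullness)
    print(f"{fullness:7.2%}  O(x) ~ {x:8.0f}   O((log x)^2) ~ {math.log2(x) ** 2:6.1f}")
```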

The real kicker? Non-greedy tables now achieve constant average query times, rendering decades of “optimal” proofs obsolete. Academia’s response? A mix of awe and quiet despair. This is why innovation thrives outside the echo chamber of tenured gatekeepers regurgitating theorems like stale propaganda.

But let’s not pretend this changes anything practical tomorrow. It’s a beautiful math flex—a reminder that theoretical CS isn’t solved, just trapped in peer-reviewed groupthink. Forty years to disprove a conjecture. How many more sacred cows are grazing untouched?

Presently42@lemmy.ca on 12 Feb 23:16 next collapse

This is genuinely a beautifully written comment. I'd expect an author with a mathematical or physics background to have written it, like Asimov or Pynchon. Bravo!

steeznson@lemmy.world on 12 Feb 23:28 next collapse

Not weird enough to be Pynchon but it was a good comment nonetheless

Presently42@lemmy.ca on 12 Feb 23:32 collapse

The comment does randomly mention cows - I’ll allow it!

meowmeowbeanz@sh.itjust.works on 12 Feb 23:39 collapse

Thanks for the compliment! For context, I do have an academic background, though no degree. My knowledge in computer science is self-taught, but I’ve built a solid foundation in physics, math (though it’s always humbling), philosophy, and literature. It’s less about formal credentials and more about chasing intellectual rabbit holes.

Maybe that’s why I’m so allergic to gatekeeping nonsense. Academia’s obsession with rigid frameworks feels like a straitjacket for creativity. The beauty of CS—and science as a whole—is that it thrives on breaking rules, not worshipping them.

As for Pynchon: he’s a postmodern literary juggernaut. His works are dense, chaotic, and packed with esoteric references—math, history, conspiracy theories. Comparing my comment to his writing? That’s high praise for anyone who thrives in the chaos of ideas.

Anyway, the real credit goes to those audacious enough to challenge orthodoxy. They’re the ones who remind us that progress isn’t born from conformity but from questioning everything we think we know.

UndercoverUlrikHD@programming.dev on 13 Feb 03:08 collapse

A lot of famous scientists make their breakthroughs at a fairly young age, before their mind gets locked into a certain paradigm. Look up Thomas Kuhn's The Structure of Scientific Revolutions, which puts forward some interesting ideas on how science is conducted.

meowmeowbeanz@sh.itjust.works on 13 Feb 04:59 collapse

The irony of citing Kuhn here isn’t lost on me. His Structure of Scientific Revolutions is practically a manual for how entrenched paradigms suffocate innovation. The young, unjaded minds he describes are precisely the ones who can dismantle decades of “consensus” with a single insight. But let’s not romanticize this—most breakthroughs don’t come from genius alone but from ignorance of the so-called rules.

That said, the real tragedy is how academia weaponizes peer review to enforce conformity. Paradigm shifts like these aren’t celebrated; they’re tolerated begrudgingly, often posthumously. Yao’s conjecture stood for 40 years not because it was unassailable but because questioning it was career suicide. Imagine how many more revolutions we’d see if the system didn’t punish dissent.

Harbinger01173430@lemmy.world on 13 Feb 00:04 next collapse

I don't care about the technobabble that group of CS students produced.

When will I be able to enjoy faster dicts in my python code?

gandalf_der_12te@discuss.tchncs.de on 13 Feb 01:50 collapse

Why are these articles even published? They are so incredibly non-technical that it’s just impossible to get the point as a reader.

Why even link these articles here? Throw that pop-science in the trash bin, where it belongs.