Llama 3-V: Matching GPT4-V with a 100x smaller model and 500 dollars (aksh-garg.medium.com)
from yogthos@lemmy.ml to technology@lemmy.ml on 28 May 22:18
https://lemmy.ml/post/16195770

#technology


adespoton@lemmy.ca on 29 May 00:41

Wouldn’t their patch embeddings return different results depending on where the visual boundaries fall? They don’t appear to use overlap redundancy; that makes it significantly less resource-intensive, but surely the chance of losing significant signals in the image-to-text translation must be correspondingly high?
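
To make the trade-off concrete, here is a minimal sketch of the two patching strategies the comment contrasts. This is a generic illustration, not Llama 3-V’s actual pipeline: the patch size, stride, and toy image are assumptions for demonstration only.

```python
import numpy as np

def extract_patches(img, patch, stride):
    """Slice a (H, W) image into flattened patches.

    stride == patch -> non-overlapping patches (ViT-style)
    stride <  patch -> overlap redundancy (each region covered by
                       several patches, at extra compute cost)
    """
    h, w = img.shape
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(img[y:y + patch, x:x + patch].ravel())
    return np.stack(patches)

# Toy 8x8 image with a thin vertical edge at column 4 -- exactly
# on the boundary between 4x4 patches.
img = np.zeros((8, 8))
img[:, 4] = 1.0

# Non-overlapping: the edge lands on the left border of the
# right-hand patches; no single patch sees both sides of it.
no_overlap = extract_patches(img, patch=4, stride=4)

# Overlapping (stride 2): some patches straddle column 4 and see
# the edge with context on both sides, at roughly 2x the patches
# in each dimension.
overlap = extract_patches(img, patch=4, stride=2)

print(no_overlap.shape)  # (4, 16)
print(overlap.shape)     # (9, 16)
```

With no overlap, a feature sitting on a patch boundary is split between embeddings, which is exactly the signal-loss worry raised above; overlap recovers it but multiplies the number of tokens the model has to process.
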

yogthos@lemmy.ml on 29 May 01:05

Good question, I’m not sure how they account for that. Maybe there’s a higher-level layer responsible for dealing with the boundaries?
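
For what it’s worth, standard transformer behavior could play that role: self-attention over the patch tokens lets every patch attend to every other, so features split across a boundary can be re-associated downstream. A minimal sketch of that idea, with made-up dimensions and no claim about what Llama 3-V actually does:

```python
import torch
import torch.nn as nn

# Toy setup: 4 patch tokens with 16-dim embeddings, as if produced
# by a non-overlapping patcher like the one sketched above.
torch.manual_seed(0)
patch_tokens = torch.randn(1, 4, 16)  # (batch, patches, dim)

# One self-attention layer: each patch token attends to all the
# others, so information split across a patch boundary can be
# mixed back together without overlapping patches.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
mixed, weights = attn(patch_tokens, patch_tokens, patch_tokens)

print(mixed.shape)    # (1, 4, 16) -- tokens now carry cross-patch context
print(weights.shape)  # (1, 4, 4)  -- attention between every patch pair
```

Whether that recovers as much as true overlap redundancy is an open question; it mixes information at the embedding level rather than re-reading the pixels.
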