Not this crap again
Wtf
This again??
This time, once archive.org is back online… is it possible to get torrents of some of their most popular collections? For example, I wouldn’t imagine their catalog of books with expired copyright is very big. Would love a community way to keep the data alive if something even worse happens in the future (and their track record isn’t looking good right now).
Like this idea
Yep, that seems like the ideal decentralized solution. If all the info can be distributed via torrent, anyone with spare disk space can help back up the data and anyone with spare bandwidth can help serve it.
Most of us can’t afford the sort of disk capacity they use, but it would be really cool if there were a project to give volunteers pieces of the archive so that information was spread out. Then volunteers could specify if they want to contribute a few gigabytes to multiple terabytes of drive space towards the project and the software could send out packets any time the content changes. Hmm this description sounds familiar but I can’t think of what else might be doing something similar – anyone know of anything like that that could be applied to the archive?
Yeah, the projects I’ve heard of that have done something like this broke the data into multiple pieces.
For example, 1000GB could be broken into forty 25GB torrents and within that, you can tell the client to only download some of the files.
At scale, a webpage could show the seed/leech numbers and averages for each torrent over a time period, to give an idea of what is well mirrored and what people could shore up. You could also change which torrent is shown as the top download when people go to the contributor page and say they want to help host, ensuring a better distribution.
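Something like this, as a rough sketch - the chunk names are invented and scrape_seeders is a hypothetical stand-in for a real tracker scrape:

```python
# Sketch: split a collection into fixed-size chunk torrents and always offer
# the least-seeded chunk to the next volunteer, so replication stays balanced.
CHUNK_SIZE_GB = 25

def make_chunks(total_size_gb: int) -> list[str]:
    """One torrent per 25 GB slice of the collection (names are illustrative)."""
    count = -(-total_size_gb // CHUNK_SIZE_GB)  # ceiling division
    return [f"archive-chunk-{i:04d}.torrent" for i in range(count)]

def scrape_seeders(chunk: str) -> int:
    """Placeholder: in reality, scrape the tracker or DHT for seeder counts."""
    raise NotImplementedError

def next_chunk_for_volunteer(chunks: list[str]) -> str:
    """Offer the chunk with the fewest seeders as the 'top download'."""
    return min(chunks, key=scrape_seeders)

if __name__ == "__main__":
    chunks = make_chunks(1000)    # 1000 GB -> forty 25 GB torrents
    print(len(chunks), "chunks")  # 40
```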
Since I’m spamming this same idea right now - the description is similar to Freenet (the old one, now called Hyphanet), but you’d need some way of choosing which parts of which collections get stored in your contributed storage, whereas with Freenet it’s the whole network (unless you form a separate F2F net - there is such an option - but then there’s no way to be sure that all peers, ahem, store only IA data and not their own porn collections, for example, taking up precious storage). I’ve described one idea in my previous comment, but it’s purely an idea; I’m nowhere close to having the knowledge to build such a thing.
There’s an issue with torrents: only the most popular ones get replicated, and the process is manual/social.
Something like Freenet is needed, which automatically “spreads” data over the machines contributing storage, but Freenet is unreliable storage, basically a cache where older and unwanted stuff gets erased.
So it should be something like Freenet, but possibly with some “clusters” or “communities”, each with a central (cryptography-enabled) authority able to determine the state of some collection of data as a whole and to pick priorities. My layman’s understanding is that this would land somewhere between Freenet and Ceph, LOL. More like a cluster filesystem spread over many nodes, not like a cache.
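To make that concrete, here’s a rough sketch of what such a community “authority” could publish: a signed manifest of collections and replication priorities that member nodes verify before acting on it. The field names are made up, and it assumes the third-party cryptography package for Ed25519 keys:

```python
# Sketch: a community "authority" signs a manifest saying which collections
# its members should replicate and at what priority; nodes verify the
# signature before trusting it. Requires the 'cryptography' package.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def sign_manifest(key: Ed25519PrivateKey, collections: dict[str, int]) -> dict:
    """collections maps collection name -> priority (higher = replicate first)."""
    payload = json.dumps(collections, sort_keys=True).encode()
    return {"collections": collections, "signature": key.sign(payload).hex()}

def verify_manifest(pub: Ed25519PublicKey, manifest: dict) -> bool:
    payload = json.dumps(manifest["collections"], sort_keys=True).encode()
    try:
        pub.verify(bytes.fromhex(manifest["signature"]), payload)
        return True
    except Exception:
        return False

if __name__ == "__main__":
    authority = Ed25519PrivateKey.generate()
    manifest = sign_manifest(authority, {"public-domain-books": 10, "live-music": 5})
    print(verify_manifest(authority.public_key(), manifest))  # True
```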
You have more knowledge on this than I do. I enjoyed reading about Freenet and Ceph. I have dealt with cloud stuff, but not as much at the technical-underpinnings level. My first Freenet impression, from reading some articles, gives me 90s internet vibes based on the common use cases they listed.
I remember Ceph because I ended up building it from the AUR once on my weak little personal laptop, because it got dropped from some repository or whatever but was still flagged to stay installed. I could have saved myself an hours-long build if I had read the release notes.
That’s correct, I meant the way it works.
I’m pretty sure all their content is available by torrent, so you could mirror the data and provide the torrent files for direct download. It’ll probably be here when it’s back up: archive.org/details/public-domain-archive
Anna’s Archive does this. I think it’s a really good way to make it difficult to take them down.
Hopefully this hack starts some conversations on how they can ensure longevity for their project. Seems they’re being attacked on multiple fronts now.
I guess this is an attempt to discredit them.
After working at many, many companies, I can say security is usually very bad. This is typical. Not changing access tokens is also very common.
Discrediting someone usually has a goal of pushing customers to another source though. There is no other source of this information, so what would be the point?
Generating turmoil just prior to the USA election maybe?
Destroy a source of historical documents so that the past can be contested. Sow doubt, confusion, deniability. Hide evidence of past crimes, or inconvenient documents. Plant documents, etc.
Now we are talking.
I really hate that reddit slang but username checks out
He who controls the past controls the future, he who controls the present controls the past.
Russia banned it, Russian hackers are trying to destroy it - at least it’s consistent.
Lol, we should create a society of sorts along the lines of the original Bavarian Illuminati. Create a decentralized storage network and archive of knowledge and history. Create a list of important shit that needs to be archived, and delegate standardized chunks (let’s say 5 or 10 GB each) of data to be downloaded by people. Any time 5 or 10 people have downloaded a chunk, strike it off the priority list and move on to the next chunk. For this to work, it needs a lot of people though.
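Just as a sketch of that bookkeeping (the chunk names and the replica threshold are invented for illustration):

```python
# Sketch: a priority list of archive chunks; each volunteer is handed the most
# important chunk that still needs copies, and a chunk is struck off the list
# once enough people have confirmed downloading it.
from collections import defaultdict

REPLICAS = 5  # "any time 5 or 10 people have downloaded a chunk..."

class ChunkRegistry:
    def __init__(self, chunks: list[str]):
        self.todo = list(chunks)        # ordered by importance
        self.copies = defaultdict(int)  # chunk -> number of volunteers holding it

    def assign(self) -> str | None:
        """Hand out the most important chunk that still needs copies."""
        return self.todo[0] if self.todo else None

    def confirm_download(self, chunk: str) -> None:
        self.copies[chunk] += 1
        if self.copies[chunk] >= REPLICAS and chunk in self.todo:
            self.todo.remove(chunk)     # struck off, move on to the next one

if __name__ == "__main__":
    reg = ChunkRegistry([f"chunk-{i:03d}" for i in range(3)])
    for _ in range(REPLICAS):
        reg.confirm_download("chunk-000")
    print(reg.assign())  # chunk-001
```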
Sow doubt. As in spreading it like seeds to take root and grow. 100% in agreement with you, just being a grammar Nazi. Carry on.
Word
War of attrition is my guess
Okay, enough is enough. The Internet Archive is both essential infrastructure and an irreplaceable historical record; it cannot be allowed to fall. Rather than just hoping the Archive can defend itself, I say it’s time to hunt down and counterattack the scum perpetrating this!
Lol you’re gonna pull that thread and at the end of the sweater is gonna be the CIA or Russia.
Edit: in = is
Did I stutter?
Israel, more likely. Making an attack that’s completely useless for Palestine while calling yourself a pro-Palestine group would be exactly their kind of move - braindead, but capable.
Mossad…CIA…same dragon different head.
Where are Anonymous and the 4chan autists? They should attack these assholes. Attacking the Internet Archive is like kicking a kitten. Everyone will hate you for it.
Why are people fucking with the Internet Archive? Who benefits?
People use Archive links to avoid giving sites traffic.
This is a problem for advertisers and media corps.
Not saying they’re the ones doing this, but they’d definitely benefit.
Wouldn’t put it past them…
Someone else looked into the group claiming responsibility for this. It’s a pro-Palestinian Russian group.
Why is this a problem? How would it affect the real availability of ads, except maybe for tracking users?
Without tracking they don’t have metrics for their ads, which affects reports and pricing. They really want to know if someone looks at an ad.
It’s funny how these people feel like cockroaches.
I’ve enjoyed using Wayback Machine on journalistic articles where they try to retcon information, but the original copy had already been captured. The Ministry of Truth hates archive.org.
Maybe they’re just trolls doing it for the lulz.
Well, right wingers want to ban books, and services like IA make that harder since they provide easy access to download or digitally borrow those books. It makes it harder to deny people access to those books when they can be found online. Of course, there are other ways people can still obtain those books - IA isn’t the only one - but it’s the easiest and the most convenient.
I’ll give you my opinion, though you haven’t asked for it:
Some right wingers (mostly libertarians) don’t want to ban books; in fact, they want books to be reliably available, and having one centralized Internet Archive store all of them is not reliable.
(Or, by the same logic, they want humanity to be knowledgeable and resistant to propaganda, and treating sources’ availability as a given is harmful towards that goal - naive people can believe wrong things.)
See the Babylon 5 example of kicking the ant hive again and again towards some well-meaning goal, of the evolutionary kind.
Mind that I don’t think these people have such an intent.
It’s just that in my childhood someone gaslighted me into trying to be optimistic in such cases. Like “if someone is digging a grave for you, just wait till they’re done, you’ll get a nice pond”. Same as how a precedent is created with one intent and interpretation but works for all possible intents and interpretations, because it’s a real-world event.
So, gaslighting aside, real effects are real - including positive ones, like all of us right now realizing that a centralized IA is unacceptable. We need something like “IA@home”, with a degree of forkability that doesn’t duplicate the data, so that someone who somehow hijacked the private key (or whatever identifies the new IA’s authority) wouldn’t be able to harm existing versions, and the forks wouldn’t require much more storage.
Shit, I can’t stop thinking about that “common network and identities and metadata exchange, but data storage shared per community one joins, Freenet-like” idea, but I don’t even remotely know where to start developing it and doubt I ever will.
Four years ago (the best number I can find, considering IA’s blog pages are down), IA used about 50 petabytes, on servers that each have 250 terabytes of storage and a 2 Gbps network connection.
From this, we can conclude that 1 TB of storage needs about 8 Mbps of network speed.
Let’s just say that the average residential broadband connection has 8 Mbps of symmetrical spare bandwidth.
At 1 TB per volunteer, we would need 50,000 volunteers to cover the absolute minimum.
Probably 100k to 200k to have any sort of reliability, considering it’s all residential networking and commodity hardware.
In the last 4 years, I imagine IA has increased their storage requirements significantly.
And all of that would need to be coordinated, so some shards don’t get over-replicated while others go under-replicated.
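A quick back-of-envelope check of that estimate (all inputs are the rough figures quoted in this thread, not official IA numbers):

```python
# Back-of-envelope check of the estimate above. All inputs are the rough
# figures quoted in this thread, not official Internet Archive data.
TOTAL_PB = 50        # ~50 PB of data (figure from ~4 years ago)
SERVER_TB = 250      # storage per IA server
SERVER_GBPS = 2      # network per IA server
VOLUNTEER_TB = 1     # assume each volunteer donates 1 TB plus matching bandwidth

mbps_per_tb = SERVER_GBPS * 1000 / SERVER_TB      # = 8 Mbps per TB stored
min_volunteers = TOTAL_PB * 1000 // VOLUNTEER_TB  # = 50,000 for a single copy

print(f"each volunteer needs ~{VOLUNTEER_TB * mbps_per_tb:.0f} Mbps of spare upload")
for replicas in (2, 3, 4):  # some redundancy, since home links and disks flake out
    print(f"{replicas}x replication -> {replicas * min_volunteers:,} volunteers")
# 2x replication -> 100,000 volunteers
# 3x replication -> 150,000 volunteers
# 4x replication -> 200,000 volunteers
```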
This seems to confirm my critique of the “manual” solutions with torrents and such offered in other comments, which led to the idea briefly described in the comment you were answering.
Yes, this would require a lot of people, but some would contribute more and some less, just like with other public P2P solutions.
From my POV the biggest problem is synchronizing the indexes of such a storage (something like a superblock, maybe), and balancing replication based on them, in a decentralized way. Because it would seem that those indexes would not be small by themselves.
There should also be all the usual machinery for verifying data integrity.
I think it’s realistic to attract many volunteers if the thing in question is also the user client (socially similar to Freenet and torrents), and if contributing bigger storage lets them fetch the things they access most often faster, as a cache. But then balancing that against storing necessary but unpopular parts of the space is an open question.
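One common trick for that kind of decentralized balancing, without synchronizing a full index, is rendezvous (highest-random-weight) hashing: every node can compute locally which shards it should hold from nothing more than the shard IDs and the current peer list. A minimal sketch, with invented peer and shard names:

```python
# Sketch: rendezvous (highest-random-weight) hashing. Each shard is stored by
# the R peers with the highest hash(shard, peer) score, so every node can work
# out its own responsibilities locally, and adding or removing a peer only
# reshuffles a small fraction of the shards.
import hashlib

def score(shard_id: str, peer_id: str) -> int:
    digest = hashlib.sha256(f"{shard_id}|{peer_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def owners(shard_id: str, peers: list[str], replicas: int = 3) -> list[str]:
    """The `replicas` peers responsible for holding this shard."""
    return sorted(peers, key=lambda p: score(shard_id, p), reverse=True)[:replicas]

if __name__ == "__main__":
    peers = [f"peer-{i}" for i in range(10)]
    print(owners("ia-shard-00042", peers))  # the 3 responsible peers
    print("am I one of them?", "peer-3" in owners("ia-shard-00042", peers))
```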
I think I need to read up.
There are really good, incentivized versions of decentralized storage networks. Unfortunately discussions about them are stigmatized under the “crypto” umbrella so the mere mention typically gets you buried.
If you have an open mind, check them out!
Copyright holders compete with old content clogging up the works. They wish the library would burn.
Apparently, BlackMeta is behind the DDoS attack on the Internet Archive. They claim to be pro-Palestine hacktivists - their X account also has some Russian written in it.
(Edit) Also, the Internet Archive has been banned in China since 2012 and in Russia since 2015.
Yes, they are a “pro-Palestine” Russian-based hacker group… Nothing funny going on here, no sir.
xcancel.com/Sn_darkmeta/…/1845502888480579860#m
Buddy. I don’t care what they say. It’s plainly obvious they are lying. They are just brown hat hackers
So if white hat is ethical hackers, black hat is unethical, and red hat is Linux, then obviously brown hat is shitty!
Reading that whole page, holy shit, it’s like a twelve-year-old wrote it while trying to sound very smart and attempting to divert blame and falsify their agenda. If this ain’t a Russian psyop, nothing is.
Definitely not their genocidal neighbors terrorizing as usual. /s
We need full IA mirrors. This is too critical to leave to this one organization.
Knowing the folks at IA, I’m sure they would love a backup. They would love a community. I’m sure they don’t want to be the only ones doing this. But dang, they’ve got like 99 petabytes of data. I don’t know about you, but my NAS doesn’t have that lying around…
That is an insane amount of storage. How much does it grow every year and is it stable growth or accelerating?
I wonder if someone could come up with some kind of distributed storage that isn’t insanely slow. Kinda like a CDN, but on personal devices. I’m thinking of something like what SETI@home did with distributed compute.
Edit: this is kinda like torrents but where the contents are changing frequently.
You should look up IPFS! It’s trying to be kinda like that.
It’ll always be slower than a CDN, though, partly because CDNs pay big money to be that fast, but also anything p2p is always going to have some overhead while the swarm tries to find something. It’s just a more complicated problem that necessarily has more layers.
But that doesn’t mean it’s not possible for it to be “fast enough”
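To get a feel for “fast enough”, here’s a tiny sketch that times fetching the same content from a couple of well-known public IPFS gateways. The CID is a placeholder to fill in, and it assumes the third-party requests package:

```python
# Sketch: compare fetch latency for one piece of content across public IPFS
# gateways. Replace CID with a real content hash before running.
import time
import requests

GATEWAYS = ["https://ipfs.io", "https://dweb.link"]  # well-known public gateways
CID = "<put-a-real-cid-here>"                        # placeholder, not a real CID

def time_fetch(gateway: str, cid: str) -> float | None:
    """Seconds to fetch the content via this gateway, or None on failure."""
    start = time.monotonic()
    try:
        resp = requests.get(f"{gateway}/ipfs/{cid}", timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    return time.monotonic() - start

if __name__ == "__main__":
    for gw in GATEWAYS:
        elapsed = time_fetch(gw, CID)
        print(gw, "unreachable" if elapsed is None else f"{elapsed:.2f}s")
```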
Interesting, thanks
And there’s a promising new IPFS-like system called Iroh, which should have a lot less overhead and in general just be faster than IPFS. It’s not quite ready to just switch to right now, but an enterprising individual could probably make something useful with it without too much work (i.e. months, not years).
I’m using it for a distributed application project right now, but the intent is a bit different than the IA use-case.
Something like torrents. Split the whole thing into small 5 GB torrents.
Hope they had a backup
The majority of Reddit discourse on this is wild. The crowd there is going HARD to try and paint IA in the most negative light possible.
I know we don’t like Reddit here, but for example: reddit.com/…/internet_archive_issues_continue_thi…
It’s almost as if the “hackers” and/or copyright holders are running that conversation.
Since it’s Reddit, I would guess copyright sockpuppets are steering the narrative to help damage them further.
Quick question for those more in the know: have these events disrupted IA’s ability to archive pages? I ask because I was recently talking with a security guy about a novel malware that used a hacked webpage for command injection. One possible motive that came to mind, if archiving was disrupted, would be covering tracks for similar malware: inject code, perform the malicious activity, revert - then there’s more time before the control code is discovered.