A look at the Internet Archive, which sees ~100TB of material uploaded daily and has cataloged ~73K US government website pages that the Trump admin expunged. (text.npr.org)
from Tea@programming.dev to technology@lemmy.world on 23 Mar 20:39
https://programming.dev/post/27419361

#technology

dan@upvote.au on 23 Mar 20:57

I didn’t realise they do tours every Friday at 1pm. I’ll have to visit some time!

I really hope the lawsuits don’t kill the Internet Archive. It’s an important resource.

NeoNachtwaechter@lemmy.world on 24 Mar 04:58

The truth is stored on their hard disks. But the truth may become very illegal very soon.

They had better move the whole thing out of the USA now.

pogmommy@lemmy.ml on 24 Mar 10:47

I love the IA but they need to be infinitely more decentralized like yesterday

douglasg14b@lemmy.world on 24 Mar 18:17

And funded by who?

It’s nice to say that it should be decentralized, but who is funding the development of that? Are you donating to IA?

SoftestSapphic@lemmy.world on 24 Mar 18:35

TBH this is an important enough resource the UN should fund it.

They won’t but they should.

pogmommy@lemmy.ml on 24 Mar 19:14

I mean, yeah, like another user said, ideally it would be in the interest of groups which claim to have an interest in some form of democracy. But additionally, the ability to set up browsable partial mirrors which could be hosted by miscellaneous nonprofits and individuals both within and outside of the US would be a massive first step to preserving the information that IA stores. The fact that attacks on their servers can eradicate all access to the information they store is troubling given how many enemies they’ve made simply through the work they do.

douglasg14b@lemmy.world on 25 Mar 20:33

The actual volume of data is kind of insane for distribution. You start running into many scale problems.

At ~70PB of storage, assumed to be stored redundantly as well (two full copies), and at ~$15/TB for HDDs alone, you’re talking $2.1 million in just hard drives (70,000 TB × 2 copies × $15).

Installation, hardware, and facility costs will at least quintuple that number, if we’re being crazy conservative. That would make the cost to stand up an archive $10.5 million?


During this process I found out that their finances are public and there is more reliable information out there:

  • $2/GB for permanent storage, overall ($2,000/TB)

The cost to store the data and run the archive is a whopping $36 million/year at the moment.

Which, if you consider what they do, is incredibly cheap, and easily fundable by even a small municipality, never mind a large nation.
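
For anyone who wants to poke at the back-of-envelope hardware estimate earlier in this comment, here is a minimal sketch in Python. The inputs (~70PB, ~$15/TB, a 5x overhead multiplier) come from the comment; reading "redundant" as exactly two full copies is an assumption that happens to reproduce the $2.1 million figure, and the function names are made up for illustration.

```python
TB_PER_PB = 1000  # decimal units, the way drive vendors count


def drive_cost_usd(archive_pb, usd_per_tb, copies):
    """Raw hard-drive cost for `copies` full copies of the archive."""
    return archive_pb * TB_PER_PB * copies * usd_per_tb


def standup_cost_usd(drive_cost, overhead_multiplier):
    """Drives plus installation, hardware, and facilities as one flat multiplier."""
    return drive_cost * overhead_multiplier


# Assumption: "redundant" means two full copies of ~70PB.
drives = drive_cost_usd(archive_pb=70, usd_per_tb=15, copies=2)
total = standup_cost_usd(drives, overhead_multiplier=5)

print(f"drives only: ${drives:,}")
print(f"stand-up estimate: ${total:,}")
```

Changing `copies` or `overhead_multiplier` shows how sensitive the rough total is to those two guesses.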

turmacar@lemmy.world on 25 Mar 23:16

It would be interesting to have encrypted blobs scattered around volunteer computers/servers, like a storage version of BOINC / @HOME.

People tend to have dramatically less spare storage space than spare compute time, though, and it would need to be very redundant to be guaranteed not to lose data.

douglasg14b@lemmy.world on 26 Mar 21:36

Oh for sure, that’s quite reasonable, though at some point you just move towards re-creating BitTorrent, which will be the actual effect you want.

You could build an appliance on top of the protocol that enables the distributed storage, that might actually be pretty reasonable 🤔

Ofc you will need your own protocols to break the data up into manageable parts, chunked in a sane way, and to make it possible for data to be removed from the network, or at least made inaccessible, for DMCA claims. That kind of compliance is what keeps the Internet Archive from being too much of a target for government entities.

turmacar@lemmy.world on 26 Mar 22:57

Yea some kind of fork of the torrent protocol where you can advertise “I have X amount of space to donate” and there’s a mechanism to give you the most endangered bytes on the network maybe. Would need to be a lot more granular than torrents to account for the vast majority of nodes not wanting or being capable of getting to “100%”.

I don’t think the technical aspects are insurmountable, and there’s at least some measure of a builtin audience in that a lot of people run archiveteam warrior containers/VMs. But storage is just so many orders of magnitude more expensive than letting a little cpu/bandwidth limited process run in the background. I don’t know that enough people would be willing/able to donate enough to make it viable?

~70 000 data hoarders volunteering 1TB each to be a 1-1 backup of the current archive.org isn’t a small number of people, and that’s only to get a single parity copy. But it also isn’t an outrageously large number of people.
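
To make the "give you the most endangered bytes" idea a bit more concrete, here is a rough sketch in Python of a greedy rarest-first assignment. Everything in it is hypothetical: the `Chunk` record, the replica counts, and `pick_chunks_for_donor` are illustration only, not part of BitTorrent or any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    size_gb: int
    replicas: int  # how many nodes currently hold this chunk


def pick_chunks_for_donor(chunks, donated_gb):
    """Greedy rarest-first: hand the donor the least-replicated chunks that fit."""
    assigned = []
    remaining = donated_gb
    # Fewest replicas first; ties broken by chunk id so the result is deterministic.
    for chunk in sorted(chunks, key=lambda c: (c.replicas, c.chunk_id)):
        if chunk.size_gb <= remaining:
            assigned.append(chunk.chunk_id)
            remaining -= chunk.size_gb
    return assigned


# Hypothetical catalog of chunks and how widely each is already replicated.
catalog = [
    Chunk("a", 400, 5),
    Chunk("b", 300, 1),
    Chunk("c", 500, 2),
    Chunk("d", 200, 1),
]
picked = pick_chunks_for_donor(catalog, donated_gb=1000)
print(picked)

# Donors needed for one full copy: ~70PB at 1TB per donor.
donors_for_one_copy = (70 * 1000) // 1
print(donors_for_one_copy)
```

A real network would also need a way to verify that a node still holds what it claims to hold, which this sketch ignores.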

douglasg14b@lemmy.world on 27 Mar 02:56

You might not have to fork BitTorrent at all if you instead have your own protocol for grouping the data and breaking it into manageable chunks of a particular size, where each chunk is an actual full torrent. Then you won’t have to worry about completion levels on those torrents, and you can rely on the protocol to do its thing.

Instead of trying to modify the protocol, modify the process that you wish to use the protocol with.
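
A rough sketch of that grouping step in Python, assuming a simple greedy packer: files are packed into bundles of a target size, and each bundle would then be published as its own ordinary torrent. The function names and the bundle ID scheme are hypothetical, and the ID is just a hash of the file listing, not a real BitTorrent infohash.

```python
import hashlib


def group_into_bundles(files, target_bytes):
    """Greedily pack (name, size) pairs into bundles of at most target_bytes.

    A file larger than target_bytes ends up in a bundle by itself.
    """
    bundles, current, current_size = [], [], 0
    # Sort so every node derives the same bundles from the same file list.
    for name, size in sorted(files):
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        bundles.append(current)
    return bundles


def bundle_id(bundle):
    """Deterministic ID derived from the bundle's file listing."""
    listing = "\n".join(f"{name}:{size}" for name, size in bundle)
    return hashlib.sha256(listing.encode("utf-8")).hexdigest()[:16]


# Hypothetical file list; sizes are in arbitrary units.
files = [("a.warc", 600), ("b.warc", 500), ("c.warc", 300), ("d.warc", 1500)]
bundles = group_into_bundles(files, target_bytes=1000)
for bundle in bundles:
    print(bundle_id(bundle), bundle)
```

A volunteer then only ever seeds whole bundles, so each torrent is either fully held or not held at all, and the stock protocol handles the rest.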

General_Effort@lemmy.world on 24 Mar 11:36

Well, not to Europe. They’ve always been illegal here. I don’t know where they could even go.

rickrolled767@ttrpg.network on 24 Mar 11:49

They’re illegal in Europe? Could you elaborate a bit on that?

General_Effort@lemmy.world on 24 Mar 16:40

In practice, copyright would be the big problem. There is no Fair Use in Europe. There is no difference between what they do and Anna’s Archive or LibGen. As far as copyright people are concerned, this is just “theft” on a gigantic scale.

Then there’s the GDPR. As far as the EU is concerned, this is one huge human rights violation. The GDPR does allow for archives, but figuring out how the IA should operate would take some litigation. I doubt they would be allowed to provide the Wayback Machine.

rickrolled767@ttrpg.network on 24 Mar 18:32

Gotcha. Thanks for explaining it. I’m in the US, so I was really curious about what was different in the EU that would cause problems for them.

arafatknee@lemmy.dbzer0.com on 24 Mar 09:59

The Internet Archive is like the digital equivalent of the Svalbard Global Seed Vault.

en.m.wikipedia.org/…/Svalbard_Global_Seed_Vault

witx@lemmy.sdf.org on 24 Mar 11:18

Is there a way we can help? E.g. torrent seeding of the content?

General_Effort@lemmy.world on 24 Mar 11:38

warrior.archiveteam.org


The ArchiveTeam Warrior is a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!

The warrior is a virtual machine, so there is no risk to your computer. The warrior will only use your bandwidth and some of your disk space.

The warrior runs on Windows, OS X and Linux. You’ll need VirtualBox (recommended), VMware or a similar program to run the virtual machine.

Appoxo@lemmy.dbzer0.com on 24 Mar 11:48

That’s the whole reason it’s 100TB uploaded…

killeronthecorner@lemmy.world on 24 Mar 12:31

I wonder if I can run a resource-constrained instance of this on ESXi… something to look into this weekend, thank you.

boonhet@lemm.ee on 25 Mar 02:00

It barely uses any resources. You can have up to 6 active jobs and most of the time you’ll be waiting for an upload slot to open up so you can get one of your 6 uploaded.

You can just set it and forget it unless you have a bandwidth cap and set it on a video site.

dan@upvote.au on 24 Mar 15:32

They push the VM images, but there’s a Docker container available too.

DFX4509B_2@lemmy.org on 28 Mar 01:23

The current American regime is a good reason for them to not only move to whatever the data storage equivalent of a tax haven is, but also move fully to Tor/I2P to cover their tracks against any enemy powers, assuming Trump and Musk don’t figure out how to deanonymize Tor and I2P.