It'd be interesting to see a peak-sequential-bandwidth by cost-per-gigabyte plot. The number I keep in my head is 500 MiB/s, but you're right that there are much faster drives out there [1]. Of the public clouds: Google's "Local SSD" claims ~12,000 MiB/s but they're ephemeral and you need 12 TiB of disks to hit that bandwidth [2][4]. AWS has these io2 SSDs which claim 4,000 MiB/s [3].
On the other points of the article, even if you had a huge disk array plugged into the machine, how many cores can you also plug into that computer? I suppose there will always be a (healthy, productive) race here between the vertical scaling of GPUs + NVMe SSDs and the horizontal scaling of CPUs and blob storage.
[4] The ephemerality has two downsides. First, you have to get the data onto that local SSD from some other, probably slower, storage system (I haven't benchmarked GCS lately, but that's probably your best bet for quickly downloading a bunch of data?). Second, you need to use non-spot instances which are 3-6x the price.
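A rough sketch of the kind of plot I mean, using the bandwidth figures above; the $/GB values are pure placeholders to be swapped for real pricing:

    import matplotlib.pyplot as plt

    # Bandwidths are the figures quoted above; the $/GB-month numbers are made-up
    # placeholders just to show the shape of the plot -- substitute real pricing.
    drives = {
        "typical SATA / entry NVMe": (500, 0.05),
        "AWS io2 (peak)": (4_000, 0.10),
        "GCP Local SSD (12 TiB)": (12_000, 0.08),
    }

    for name, (mib_per_s, usd_per_gb) in drives.items():
        plt.scatter(usd_per_gb, mib_per_s)
        plt.annotate(name, (usd_per_gb, mib_per_s))

    plt.xlabel("$/GB-month (placeholder values)")
    plt.ylabel("peak sequential bandwidth (MiB/s)")
    plt.yscale("log")
    plt.show()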
With AMD Epycs having 128 PCIe 4.0 lanes, using, say, 96 of them for disks, that's 192 GB/s of aggregate bandwidth. With 16 TB U.2 SSDs, that's up to 1.5 PB of storage if you use one lane per disk.
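The arithmetic, spelled out (assuming roughly 2 GB/s of usable bandwidth per PCIe 4.0 lane):

    # Back-of-envelope for the Epyc math above; the per-lane figure is an approximation.
    lanes_for_disks = 96
    gb_per_s_per_lane = 2.0            # approx. usable PCIe 4.0 x1 bandwidth
    drive_capacity_tb = 16             # 16 TB U.2 SSD, one lane per drive

    aggregate_gb_per_s = lanes_for_disks * gb_per_s_per_lane        # 192 GB/s
    total_capacity_pb = lanes_for_disks * drive_capacity_tb / 1000  # ~1.5 PB
    print(aggregate_gb_per_s, total_capacity_pb)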
I'm doing some consulting for a client with a few terabytes in SQL Server. He keeps talking about the challenges of reprocessing data in a migration to ClickHouse.
I find it interesting that the solution to a lot of the problems is to just reprocess the data and not try to optimize anything. 10 TB is not a lot of data with NVMe.
I gathered a table for Google and Amazon's options. I do not have experience with on-prem solutions
so I don't know how to compare these prices to the cost of owning and operating hardware. I'm sure
it's cheaper over time for the hardware, but I imagine you need sufficient scale to amortize the
personnel costs.
You could build a similar table for compute but it gets complicated. FLOP seems like a reasonable
unit of compute, but there are things other than FLOPs (e.g. decoding your column-oriented
compression scheme).
I've tried to do this comparison a few times but I usually find it hard to get clear aggregate FLOP
numbers for GPUs. GPUs also require caretaker CPUs and I don't have experience using them so I'm not
certain how to spec a VM that can practically saturate the compute of the GPUs. My gut instinct is
that the big compute consumers must be able to arbitrage this to some extent by shifting some
workloads to chase the cheapest FLOP.
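For what it's worth, the shape of the comparison I have in mind is something like this; every number below is a placeholder, not a real price or a real spec:

    # Hypothetical cost-per-FLOP comparison -- plug in prices and peak FLOPS you trust.
    def usd_per_pflop(hourly_price_usd: float, peak_tflops: float) -> float:
        pflops_of_work_per_hour = peak_tflops * 1e12 * 3600 / 1e15
        return hourly_price_usd / pflops_of_work_per_hour

    print(usd_per_pflop(hourly_price_usd=3.00, peak_tflops=300))  # a made-up GPU VM
    print(usd_per_pflop(hourly_price_usd=1.00, peak_tflops=5))    # a made-up CPU VM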
EDIT(2x): Table formatting. We could really use some Markdown styling on HN.
[10] Honestly not sure. You can do 5,000 parallel "reads" per second to a single object. I'm not
sure what kind of instance you need to receive 23,842 MiB/s or if a single object can actually
deliver that much bandwidth. https://cloud.google.com/storage/quotas#objects
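One data point on the instance question: converting that quota to network terms (my arithmetic, not from the docs), 23,842 MiB/s is almost exactly a saturated 200 Gbps NIC, so you'd need whatever instance shape offers that much network bandwidth:

    mib_per_s = 23_842
    gbit_per_s = mib_per_s * 2**20 * 8 / 1e9
    print(f"{gbit_per_s:.1f} Gbit/s")   # ~200.0 Gbit/s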
This tripped me up as well. While modern NAND SSDs do slow down to about a third of the advertised speeds on non-sequential reads/writes, that is still 2-3 GB/s. The "normal" (SATA) SSDs are just limited by SATA's link speed; otherwise they could easily get to similar rates.
This very detail is the reason the performance numbers printed on USB flash drives are useless: in practice you oftentimes see 1/100th of the advertised performance.
Anything that isn't much bigger than the sector size (4k) usually performs far below those numbers.
Getting peak sequential read speed out of SSDs and USB flash drives usually requires issuing read commands that are much larger than 4kB, for two main reasons: the page size of the underlying flash memory is larger than 4kB, and SSDs and some USB flash drives stripe data access across multiple channels so you need to ask for at least a full strip of data. Queuing many commands for contiguous sequential reads can usually get close to the same performance, but has more overhead (and isn't an option for USB drives that don't support UASP).
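A quick way to see the effect yourself is to time the same sequential read at different request sizes. This is a rough sketch (the path is a placeholder, and it goes through the page cache, so use a file much larger than RAM or a dedicated benchmarking tool for careful numbers):

    import os
    import time

    PATH = "/mnt/testfile"   # placeholder: a large file on the drive under test
    TOTAL = 1 << 30          # read 1 GiB per trial

    for block_size in (4 * 1024, 128 * 1024, 1024 * 1024):
        fd = os.open(PATH, os.O_RDONLY)
        # Best-effort hint to drop this file's cached pages between trials.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.perf_counter()
        remaining = TOTAL
        while remaining > 0:
            chunk = os.read(fd, min(block_size, remaining))
            if not chunk:
                break
            remaining -= len(chunk)
        elapsed = time.perf_counter() - start
        os.close(fd)
        done = TOTAL - remaining
        print(f"{block_size // 1024:>5} KiB requests: {done / elapsed / 2**20:8.1f} MiB/s")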
Do you happen to be the writer of this technical article on AWS? [1]
If not, someone on Seth Markle's team over at Amazon wrote a pretty fascinating article on the S3 architecture. Never actually heard the "airplane at 75 MPH over a grass lawn" comparison before.
This [2] aggregate workload graph/video is pretty sweet too. Very matrixy while still actually being fairly clear about what's going on.
The imagery implies S3 (at the time of writing) had a 3.7 PB bucket with 2.3M req/sec. That works out to 143 hard drives for the storage-constrained result and 19,000 hard drives for the I/O-constrained result. Kind of an interesting counterpoint to the article: here's how systems with really large, spiky, intermittent workloads and access rates are structured.
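Those drive counts roughly reconstruct if you assume something like 26 TB drives and ~120 random IOPS per spindle (my assumptions, not figures from the talk):

    bucket_bytes = 3.7e15          # 3.7 PB
    requests_per_s = 2.3e6

    drive_capacity_bytes = 26e12   # assumed ~26 TB HDDs
    drive_iops = 120               # assumed random IOPS per spindle

    print(round(bucket_bytes / drive_capacity_bytes))   # ~142 drives, storage-constrained
    print(round(requests_per_s / drive_iops))           # ~19,167 drives, I/O-constrained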
My 20 TB WDs get pretty close, around 270 MB/s sequential read [1]. Higher density means higher sequential speed, so if not the 22 TB models then the 24 TB ones should reach that.
Current high-capacity enterprise HDDs (like an Exos 24 or a WD Gold 24), in ideal conditions, reach around 250-280 MiB/s, so quite short of 300 as a regular number.
Probably they are pulling the figure from an older article or just using a round number to make it clearer.
Edit note: As a sibling comment says, Seagate had for a while a "2x" Exos model with dual actuators that claimed ~500 MB/s of max transfer speed, but it seems discontinued or unavailable and didn't have much success (outside the datacenter at least).
1 million is 100^3 (i.e. a cube 1 metre across, contains a million 1 cm cubes)
1 billion is 1000^3 (1 m cube contains a billion 1 mm cubes)
1 trillion is 10,000^3 (I can't visualize 0.1 mm, so increase the size: the big cube is now 10 m on a side, with 1 mm parts)
So the million and billion you could have on your desk made of parts that you can see and handle; the trillion is just a bit too big for that, you'd need a big room with a high ceiling.
Edit: and, of course, you can also see a million in 2D. A sheet of grid paper 1 m across, with 1 mm grid squares. Or, more practically, try counting the pixels on a 720p monitor: 921,600 pixels on that screen. Use a checkerboard pattern.
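Sanity-checking the counts (pure arithmetic):

    print(100 ** 3)       # 1 m cube of 1 cm cubes   -> 1,000,000
    print(1000 ** 3)      # 1 m cube of 1 mm cubes   -> 1,000,000,000
    print(10_000 ** 3)    # 10 m cube of 1 mm cubes  -> 1,000,000,000,000
    print(1280 * 720)     # pixels on a 720p screen  -> 921,600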
There absolutely are when it comes to materials like pens, but I bought a set of 3 pens (0.1 mm, 0.3 mm, 0.5 mm) for less than ten GBP. Specifically, Mitsubishi pencil co. unipen fine line.
Sure you can probably buy much more expensive examples, but it's not essential.
(Side note: don't try to use a gel ink pen across varnish, or it stops working until you clean it off with solvent.)
A 0.1 mm cube is like a grain of fine sand; you can't really see the size of it. I couldn't imagine stacking them by hand with tweezers, but it's just about doable for 1 mm.
Thinking about it again, you could have sheets of paper stacked into a cube: 1 m^2 sheets, with a 0.1 mm checkerboard pattern, stacked a metre high. That's some fine detail...
> The Zettabyte Era or Zettabyte Zone[1] is a period of human and computer science history that started in the mid-2010s. The precise starting date depends on whether it is defined as when the global IP traffic first exceeded one zettabyte, which happened in 2016, or when the amount of digital data in the world first exceeded a zettabyte, which happened in 2012. A zettabyte is a multiple of the unit byte that measures digital storage, and it is equivalent to 1,000,000,000,000,000,000,000 (10^21) bytes.
> According to Cisco Systems, an American multinational technology conglomerate, the global IP traffic achieved an estimated 1.2 zettabytes (an average of 96 exabytes (EB) per month) in 2016. Global IP traffic refers to all digital data that passes over an IP network which includes, but is not limited to, the public Internet. The largest contributing factor to the growth of IP traffic comes from video traffic (including online streaming services like Netflix and YouTube).
Really wish games would give you a "cleanup" option in the settings. If I'm only going to use the Ultra textures it could delete all the others, and same for if someone is only going to use Medium, etc.
Letting me or third-party software go in and delete the files manually without pitching a fit would be an acceptable compromise. I'd even purchase such a utility.
Some studios have tinkered with the idea of games that can be played while the download completes. There has to be a way to trace all assets that are needed for the first chapter. https://patents.google.com/patent/US11123634B1/
World of Warcraft has worked that way for years. When you start, you can initially see other players fight phantoms because your game hasn't downloaded the assets for those mobs yet.
Blizzard has had this for a number of years, at least for World of Warcraft. You can start playing and it downloads the rest of the game in the background. I think the loading screens are a bit longer; my guess is that the program checks whether the files needed for the zone are there and, if not, downloads them as a priority.
I think it should also be pointed out that a lot of DBs have a lot of redundant data due to indexing. When your DB is small, an index seems negligible, but once you get into millions or even billions of rows, you see firsthand just how much space indexes can take up.
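A loose back-of-envelope for a single secondary index (the per-entry figures are illustrative guesses, not any particular engine's numbers):

    rows = 1_000_000_000
    bytes_per_entry = 8 + 8 + 16   # key + row pointer + per-entry/page overhead (a guess)
    fill_factor = 0.7              # B-tree pages are rarely packed full

    index_bytes = rows * bytes_per_entry / fill_factor
    print(f"~{index_bytes / 1e9:.0f} GB for one index on a billion rows")  # ~46 GB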
But for city-wide or even some state-wide institutions, 300 rps really is a big number.
Well, I laughed. It is common for monitoring infrastructure to poll at minute-range intervals, and to have dozens of probes per server. 300 requests per second is really few for a whole company.
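For a sense of scale, under those assumptions (minute-level polling, roughly 30 probes per server, both my own numbers), a few hundred servers already generate 300 requests per second:

    probes_per_server = 30      # "dozens of probes per server" (assumed)
    poll_interval_s = 60        # minute-range polling

    req_per_s_per_server = probes_per_server / poll_interval_s   # 0.5
    print(300 / req_per_s_per_server)                            # 600 servers for 300 req/s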
I am interested in the back of the envelope calculations you did to come to this conclusion. Would you please elaborate if possible?
I know of an early stage YC startup that has a 6TB Postgres DB.
Would it be fair to say that the DB hosting (neglecting replicas and engineering time) can be done for $150/month?
From Groq using Llama 2 70b:
To calculate the volume of space required to contain the world's population, we need to first convert the area calculated earlier to square feet.
Area (square feet) = 600 square miles × 5,280 square feet/square mile = 3,160,000,000 square feet
Next, we need to assume a height at which the population can comfortably stand. Let's assume an average height of 5 feet.
Therefore, to contain the entire world population, we would need a volume of approximately 10.7 cubic miles, assuming an average height of 5 feet and a density similar to that of a solid object.
Please note that this calculation is purely theoretical and doesn't take into account factors like personal space, comfort, and actual population density.
Please don't just paste AI-generated summaries into a low-effort comment. I don't know why your summary only picked up on the first half of the first paragraph, but estimating global population volume has nothing to do with the article.
Also, even taking the assumptions at face value, your model's calculations are wrong by several orders of magnitude. There are not 5,280 square feet in one square mile and 600*5280 is not 3,160,000,000.
If you weren't concerned with preserving life, you could take the current population of Earth, multiply by the average weight of a human (69 kg according to Wolfram Alpha's sources), assume that we have the same density as water, and find that we could all fit in a 0.5 km³ box.
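The corrected arithmetic, spelled out (assuming roughly 8 billion people):

    population = 8e9
    avg_mass_kg = 69
    water_density_kg_per_m3 = 1000

    volume_m3 = population * avg_mass_kg / water_density_kg_per_m3
    print(volume_m3 / 1e9)        # ~0.55 km^3
    print(volume_m3 ** (1 / 3))   # a cube roughly 820 m on a side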
Aren't normal SSDs now at 500-600 MB/s read?
And then there are NVMe SSDs, which can read up to 7 GB/s.