Cassettes and DNA to cope with the explosion of our digital data

Research, industry and individuals are accumulating ever more digital data, so much so that hard drives and other storage media will soon be overwhelmed. To make up for the coming shortfall, an old object keeps evolving: the magnetic cassette, a stopgap while we wait for a breakthrough technology based on DNA.

An Instagram photo, videos on a drive, e-mails… each of us accumulates a considerable amount of digital data, and that amount keeps growing with the new technologies at our disposal, such as 4K video and streaming on Netflix. All of it is stored not on a hard drive but in the “cloud”, sometimes hundreds of kilometers away. Yet this data, familiar as it is, is not what weighs heaviest in “Big Data”.

Research is a much bigger contributor. Humanity's scientific experiments are heavy, very heavy: since its creation, the European Organization for Nuclear Research (CERN), near Geneva, has accumulated more than 100 petabytes (PB) of images, raw data and other information to be preserved for future generations who will want to study it. 100 PB is equivalent to approximately 102,400 of the 1-terabyte (TB) hard drives sold to consumers.

The first image of the black hole M87* required a huge amount of data. Event Horizon Telescope (EHT)/National Science Foundation/Handou

That first image of a black hole required almost 5 PB, equivalent to about 5,000 hard drives of 1 TB each. Companies such as Twitter or EDF, and indeed any business with a minimum of digitization, are other contributors to Big Data.

Physical boundaries

Between 2010 and 2020, the amount of information contained in massive data multiplied by roughly 30, from 2 zettabytes (2 million PB) to 60 zettabytes. And the pace is accelerating. By 2025, humanity is expected to produce 175 zettabytes of data.

François Képès, a cell biologist who led a foresight working group on digital data storage between 2018 and 2021, explains: “In 2018, one millionth of the earth's land mass was occupied by data centers. At this exponential rate, by 2060, all landmasses will be covered in data centers.”
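As a rough check on these growth figures, one can ask what doubling time they imply. The sketch below assumes steady exponential growth between the two dates, which is a simplification, and the function name `doubling_time` is purely illustrative.

```python
import math

def doubling_time(start: float, end: float, years: float) -> float:
    """Years per doubling, assuming steady exponential growth from start to end."""
    return years / math.log2(end / start)

# Data volume quoted in the article: 2 ZB in 2010, 60 ZB in 2020.
data_doubling = doubling_time(2, 60, 2020 - 2010)

# Land occupied by data centers: one millionth in 2018, everything by 2060.
land_doubling = doubling_time(1e-6, 1.0, 2060 - 2018)

print(f"data volume doubles every {data_doubling:.2f} years")
print(f"data-center footprint doubles every {land_doubling:.2f} years")
```

Under these assumptions, both figures come out at roughly one doubling every two years.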

Construction of a Facebook data center on October 5, 2021 in Eagle Mountain, Utah. Getty Images via AFP – GEORGE FREY

For 70 years, researchers have kept shrinking storage systems, moving from floppy disks to hard drives, in order to increase capacity. But the working group's report, published in 2020, notes in its conclusions that Moore's Law for semiconductors also applies to electronic and magnetic storage systems. “It is not possible to miniaturize and optimize indefinitely. Capacity doubled and price halved every two years for several decades, but this optimization is slowing down. We are reaching hard physical limits, and the optimization we can still expect is relatively small,” explains François Képès.

The cassette, an emergency solution

While electronic storage systems are reaching their limits, the cassette keeps breaking records. Yes, we are talking about the cassette here, the kind you put in your old camcorder or cassette player, whose tape could spill out in all directions when a rewind went wrong. But the cassettes developed today have nothing in common with yesterday's. The latest record, set by Fujifilm and IBM, stands at 580 TB, the equivalent of 76 million audio cassettes from the 1990s (60 Mb per cassette). The previous record, set in 2017, was 330 TB.

With tape twenty times thinner than a hair and over a kilometer long, the cassette fits in the palm of your hand and still has a few good years ahead of it. Mark Lantz, magnetic tape researcher at IBM, says: “This really demonstrates the ability to continue to scale tape technology, essentially at historic rates of doubling cartridge capacity every two years, for at least the next ten years.”

The next ten years… and beyond? By stressing this time frame, Mark Lantz, like many engineers who work in storage, shows that he is well aware of the limits of electronic and magnetic storage. Both consume enormous amounts of energy and space.

Mark Lantz, a scientist at IBM, holds a tape of several hundred TB in his hand. © Photo courtesy of IBM Research

However, the magnetic cassette has the advantage of requiring less electronics: a single reader can read many cassettes, whereas each hard drive has its own read mechanism. In addition, unlike a hard drive, a tape lasts for decades and is more energy efficient.

Nevertheless, a tape, no matter how powerful, still takes up too much physical space and will not be able to hold the volume of massive data to come. We must therefore move up a gear, and that is what François Képès' working group sought to do. “We logically considered alternatives such as etching on glass or crystal, or storage on polymers such as DNA. It seemed likely to us that the only technology that could be developed in time and had sufficient room for improvement was storage on polymers,” summarizes the researcher.

Awaiting DNA

DNA? Don't panic: this is not about storing information in living beings or modifying anyone's DNA directly. Doing so in bacteria or spores was indeed considered, but that is no longer the main avenue.

DNA is a long chain of molecules that carries the instructions for the reproduction and development of living things. Here it is the concept of “instruction” that is interesting. DNA is a chain built from four monomers, the “rungs” that connect the two strands of the helix: A, C, G and T. The order of these monomers (AAGTTCCGATAT, for example) carries the information, exactly like… the binary system of 1s and 0s that underlies every computer system.

A DNA sequence consists of four different monomers: A, C, T, G. Getty Images – alan phillips

First, one must decide which sequence of monomers to line up in order to encode the digital file. Let's imagine that A is 00, C is 01, G is 11 and T is 10, and take a deliberately simplistic example. To save a photo coded as 01 11, the computer must “translate” 01 11 into CG. This is the encoding step. The sequence CG is then written “chemically” into DNA, which is stored until it is needed.

When the DNA is read, software translates the sequence of letters back into binary code and thus reconstructs the image on screen. To summarize, there are five stages: encoding, writing, storing, reading, decoding.
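The encoding and decoding stages can be sketched in a few lines of Python. This is only an illustration of the toy mapping given above (A = 00, C = 01, G = 11, T = 10); the function names are hypothetical, and real DNA storage schemes presumably add error correction and other safeguards that are not shown here.

```python
# Toy mapping from the article's example; not a real-world encoding scheme.
BITS_TO_BASE = {"00": "A", "01": "C", "11": "G", "10": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(bits: str) -> str:
    """Translate a string of 0s and 1s into a sequence of DNA letters."""
    bits = bits.replace(" ", "")  # allow spaced input such as "01 11"
    if len(bits) % 2:
        raise ValueError("bit string must have an even length")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(sequence: str) -> str:
    """Translate a sequence of DNA letters back into a string of 0s and 1s."""
    return "".join(BASE_TO_BITS[base] for base in sequence)

print(encode("01 11"))  # CG
print(decode("CG"))     # 0111
```

Writing, storing and reading are the chemical steps in between and have no software equivalent in this sketch.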

But why store our information on DNA? For its information density (the amount of information that can be encoded in it), its low energy use and its durability. Unlike data centers, DNA does not need to be cooled: it can be stored at room temperature… for up to 52,000 years if the encapsulation technique of the French company Imagene is used.

Each of its capsules can contain up to 0.8 g of DNA, or 1.4 exabytes of data. As a reminder, an exabyte represents one million 1 TB hard drives. 0.8 g of DNA would thus contain as much information as 150 tons of hard drives! To store the 175 zettabytes of Big Data expected in 2025, it would only take 175 kilos of DNA. The US agency DARPA estimates that DNA could cut the energy consumption of our data by a factor of a thousand.
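These orders of magnitude can be reproduced with a short calculation. The sketch below uses only the capsule figures quoted above and assumes decimal units (1 ZB = 1,000 EB, 1 EB = 1,000,000 TB); the variable names are illustrative.

```python
EB_PER_ZB = 1_000          # decimal units: 1 ZB = 1,000 EB
TB_PER_EB = 1_000_000      # 1 EB = 1,000,000 TB

capsule_mass_g = 0.8       # grams of DNA per capsule (article figure)
capsule_capacity_eb = 1.4  # exabytes per capsule (article figure)

# Storage density implied by one capsule, in exabytes per gram of DNA.
density_eb_per_g = capsule_capacity_eb / capsule_mass_g

# Number of 1 TB hard drives holding as much as one capsule.
drives_per_capsule = capsule_capacity_eb * TB_PER_EB

# Mass of DNA needed for the 175 ZB projected for 2025, at that density.
target_zb = 175
mass_kg = target_zb * EB_PER_ZB / density_eb_per_g / 1_000

print(f"density: {density_eb_per_g:.2f} EB per gram")
print(f"1 TB drives per capsule: {drives_per_capsule:,.0f}")
print(f"DNA needed for {target_zb} ZB: about {mass_kg:.0f} kg")
```

At the capsule density, the result is on the order of 100 kg; the 175 kilos quoted corresponds to a rounder 1 exabyte per gram. Either way, the order of magnitude is the same.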

Development potential?

The biggest advantage of DNA is that we know it very well, recalls François Képès: “Biomedical research has driven the development of DNA technologies, which are already very advanced. This means that all the methods needed to store and archive digital data already exist. That does not mean they are at a commercial level, not at all.”

However, the technology is advancing very quickly. “The cost of sequencing a human genome [that is, reading it, editor's note] has fallen extraordinarily. We were at $3 billion in 2003, we're at $500 today,” enthuses the researcher. But there are limits: at $500, reading DNA at 2022 speeds is still 1,000 times too expensive and 1,000 times too slow compared with a hard drive. Writing is even 100 million times too slow and too expensive.

“There are people who told us to come back and talk about it at the end of the century. No way! DNA-related technologies are improving by a factor of two about every six months, four times faster than electronics between 1976 and 2011. At this rate, the factor of 1,000 for reading will be made up within five years, around 2025. And the factor of 100 million for writing, around 2035!”
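The timeline in this quote follows from simple doubling arithmetic. The sketch below assumes a constant doubling every six months and, as a further assumption, takes 2020 (the year the report was published) as the starting point; `years_to_close` is a hypothetical helper name.

```python
import math

def years_to_close(gap_factor: float, doubling_years: float = 0.5) -> float:
    """Years needed to improve by gap_factor if performance doubles every doubling_years."""
    return math.log2(gap_factor) * doubling_years

START_YEAR = 2020  # assumption: the year the working group's report was published

read_years = years_to_close(1_000)         # reading: 1,000 times too slow and costly
write_years = years_to_close(100_000_000)  # writing: 100 million times

print(f"reading gap closed after {read_years:.1f} years, around {START_YEAR + read_years:.0f}")
print(f"writing gap closed after {write_years:.1f} years, around {START_YEAR + write_years:.0f}")
```

This gives about 5 years for reading and about 13 years for writing, which lands in the early to mid 2030s, the same range as the dates quoted.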

Some applications of DNA storage are already possible before 2035, because not all data needs to be read or written regularly. INA, the French organization responsible for archiving audiovisual productions, accumulates an additional 20 PB of data each year. This data does not need to be retrieved quickly, hence the interest in encoding it in DNA. Likewise, the banking sector, which must keep its customers' data, sometimes for decades, could use this new storage technology.

Proof that the efforts are enormous: the US agency DARPA has invested hundreds of millions of euros in DNA technologies. France, for its part, has started to move: thanks in particular to François Képès' working group, the government has committed 20 million euros to support research into DNA storage.
