Large language models, connected devices, and many other modern technologies require copious amounts of data. Massive scientific projects, such as those in astrophysics, generate petabytes of data (that’s millions of gigabytes or tens of thousands of iPhones). Moreover, as sequencing technologies become more commonplace, there will be even more omics data to store.
This exponential growth of data generation isn’t halting anytime soon. As Kyle Tomek, CEO of US-based startup DNAli Data Technologies, puts it, “You'd have to cover the surface of the Earth in data centers by the year 2060 to save as much data as we're going to make before then.”
Meeting the future’s unimaginable data storage needs necessitates alternative storage media. With its incredibly high information density, DNA makes for an attractive candidate. Consider this: a single gram of DNA can store hundreds of petabytes of data.
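To see where a figure of that magnitude comes from, a back-of-the-envelope calculation helps. The sketch below estimates a physical upper bound only, and it rests on two assumptions that are not from the interviews: each nucleotide carries 2 bits, and a nucleotide of single-stranded DNA weighs about 330 grams per mole on average. Practical schemes spend capacity on redundancy, indexing, and error correction, so real densities fall well below this bound.

```python
# Back-of-the-envelope upper bound on DNA storage density.
# Assumptions (illustrative, not from the article's sources):
#   - 2 bits per nucleotide (four bases, log2(4) = 2)
#   - about 330 g/mol average mass per nucleotide of single-stranded DNA
AVOGADRO = 6.022e23        # molecules per mole
GRAMS_PER_MOL_NT = 330.0   # approximate average nucleotide mass
BITS_PER_NT = 2
PETABYTE = 1e15            # bytes


def max_bytes_per_gram() -> float:
    """Return the idealized number of bytes one gram of DNA could hold."""
    nucleotides_per_gram = AVOGADRO / GRAMS_PER_MOL_NT
    return nucleotides_per_gram * BITS_PER_NT / 8


upper_bound_pb = max_bytes_per_gram() / PETABYTE
print(f"Idealized upper bound: about {upper_bound_pb:,.0f} PB per gram")
```

Under these assumptions the bound lands in the hundreds of thousands of petabytes per gram, which leaves plenty of room for the overheads that bring working systems down to hundreds of petabytes.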
It has other benefits as well. DNA replicates information (almost) faithfully across generations, and samples more than a million years old have been sequenced, which makes it particularly relevant for long-term archival storage. Moreover, the format itself is unlikely to become obsolete, eliminating the need for any data migration in the distant future.
Over the last decade, scientists have stored books, movies, and even the entire Wikipedia in DNA. But is DNA data storage ready for larger-scale applications in petabytes and exabytes (1000 times a petabyte)?
Any data storage solution involves methods to encode information, write it, maintain it, and retrieve it when required. For DNA data storage, encoding involves mapping digital information to nucleotide sequences and often depends on the synthesis method.
Preserving DNA for archival purposes means blocking the mechanisms that lead to its decay, typically by encapsulating it in a chemical or physical barrier. Under accelerated aging protocols that simulate hundreds or thousands of years of degrading conditions, encapsulated DNA stays resilient.
Writing and retrieving data correspond to synthesizing and sequencing DNA. Despite significant improvements, DNA synthesis remains a major bottleneck for DNA data storage. It is still expensive and slow to synthesize DNA, particularly stretches longer than a few hundred base pairs.
DNA synthesis methods include chemical synthesis, enzymatic synthesis, and ligation. Whereas the first two methods stitch together DNA strands base by base, the third approach combines shorter strands from a library. DNAli takes the ligation approach. “We take reusable blocks of DNA, make them in mass amounts, and then assemble them for each new file,” says Tomek. The method, however, still requires the blocks to be synthesized using more conventional methods.
Alongside DNA synthesis advances, optimizing encoding could bring down costs. Unlike DNA synthesis for biomedical purposes, “we don't need high fidelity in DNA data storage because we have error correction coding to cope with errors introduced during synthesis,” says Robert Grass, a researcher at ETH Zurich.
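The idea behind error correction can be illustrated with the crudest code there is. The sketch below assumes a toy repetition code: every base is written three times and read back by majority vote. It is an illustration only, not the scheme Grass's group or anyone else uses. Production systems rely on far more efficient codes, and this toy handles substituted bases but not inserted or deleted ones.

```python
from collections import Counter


def add_redundancy(strand: str, copies: int = 3) -> str:
    """Repeat every base `copies` times (toy repetition code)."""
    return "".join(base * copies for base in strand)


def correct(noisy: str, copies: int = 3) -> str:
    """Recover each base by majority vote over its group of copies."""
    if len(noisy) % copies:
        raise ValueError("length must be a multiple of `copies`")
    recovered = []
    for i in range(0, len(noisy), copies):
        group = noisy[i:i + copies]
        recovered.append(Counter(group).most_common(1)[0][0])
    return "".join(recovered)


original = "CAGACGGC"
protected = add_redundancy(original)

# Simulate two substitution errors introduced during low-fidelity synthesis.
damaged_bases = list(protected)
damaged_bases[1] = "T"
damaged_bases[9] = "G"
damaged = "".join(damaged_bases)

recovered = correct(damaged)
print(recovered == original)  # True
```

The trade-off is visible even in this toy: the protected strand is three times longer than the original, which is why better codes that tolerate errors with less redundancy translate directly into lower synthesis costs.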
Although DNA sequencing is significantly cheaper than synthesis, Grass stresses that “it is more expensive to sequence DNA than to read files from a memory stick or a hard drive.” This cost limits DNA’s applicability for use cases that require frequent data retrieval, such as in data centers. “What is important in a data center is how long it takes to retrieve data,” says Erfane Arwani, CEO of French startup Biomemory. “With DNA sequencing, the latency is too high.”
Data centers keep multiple copies of data to ensure redundancy against damage and provide low latency. On a hard drive, copying data is similar to rewriting it. With DNA, copying is far cheaper than writing because it relies on the well-established polymerase chain reaction (PCR). And as better encoding methods reduce the need for redundancy, costs would fall further by limiting reagent use.
Retrieval also needs to be computationally efficient: you shouldn’t have to sequence an entire DNA pool when you need to access a particular part of it. Biomemory achieves this with a data organization approach it calls a DNA drive. “Like the physical environment of the hard drive, we organize the DNA data physically to be retrieved efficiently,” says Arwani. The company stores data in DNA pools tagged with barcodes.
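In software terms, barcode-based retrieval behaves like a lookup by key. The sketch below is a simplified model rather than Biomemory's actual design: it assumes a hypothetical four-base barcode at the start of each strand and treats retrieval as picking out only the strands that carry the requested barcode, loosely mimicking selective amplification before sequencing.

```python
from typing import Dict, List

BARCODE_LEN = 4  # hypothetical barcode length, chosen for illustration


def build_pool(files: Dict[str, List[str]]) -> List[str]:
    """Prefix every payload strand with its file's barcode and mix them in one pool."""
    pool = []
    for barcode, payloads in files.items():
        if len(barcode) != BARCODE_LEN:
            raise ValueError("barcode has the wrong length")
        pool.extend(barcode + payload for payload in payloads)
    return pool


def retrieve(pool: List[str], barcode: str) -> List[str]:
    """Return only the payloads tagged with `barcode`, ignoring the rest of the pool."""
    return [s[BARCODE_LEN:] for s in pool if s.startswith(barcode)]


pool = build_pool({
    "ACGT": ["CAGACGGC", "TTGACCAA"],  # strands of one file
    "TGCA": ["GGATCCTA"],              # strands of another file
})
wanted = retrieve(pool, "TGCA")
print(wanted)  # ['GGATCCTA']
```

Only the strands of the requested file need to be read, so the cost of retrieval scales with the size of the file rather than the size of the whole pool.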
DNAli and Cache DNA, an MIT spinout, are also tackling this problem, with the latter focusing on it. “We decided to divorce the encoding or the biomolecule that stores the actual information from the indexing,” says James Banal, co-founder of Cache DNA. DNAli’s access methods provide a quick and cheap preview of a file stored in DNA. Cache DNA’s technology enables a Boolean search on a storage system, allowing quick access, akin to random access memory, for retrieval.
Biobanks are an obvious use case for DNA data storage. “With this technology, you could convert a biobank that is the size of a football field into something that can fit with everything in the palm of your hand,” says Banal. With encapsulation technologies, the DNA samples can be stored at room temperature. Compared to storing samples in freezing conditions in conventional biobanks or data centers that require extensive cooling, this has significantly lower energy consumption.
Until recently, scientific and medical applications were the sole drivers behind storing data in DNA. New research could broaden its scope to cryptography and nanotechnology. Another interesting development is the emerging intersection of DNA data storage and DNA computing. The indexing methods for DNA data retrieval mentioned earlier are an early example of that. Today, one of the most pressing commercial drivers of the technology is data centers.
As researchers and startups chip away at its limitations, DNA data storage is becoming a viable commercial solution for storing all kinds of data at scale. The DNA Data Storage Alliance, a consortium founded in 2020, counts legacy data storage giants such as Western Digital and Seagate among its members.
However, there’s a vital piece missing. “There's no end-to-end product to enable DNA data storage on-site at data centers,” says Arwani. Such a solution should also interface with the rest of the data center for seamless conversion between DNA and other data formats.
Meanwhile, DNA data storage is finding applications outside laboratories and data centers. “The advantage of having DNA as a data carrier is that you can put it into other things,” says Grass. “You can put it into a liquid or a polymer.” This is at the center of DNA of Things, a paradigm that could combine DNA data storage and consumer biotech.
Already, small sequences of synthetic DNA are in use as traceable barcodes for improving supply chain transparency. Also at smaller scales, researchers are developing creative approaches that achieve DNA data storage without the need for DNA synthesis. For example, researchers at the National University of Singapore (NUS) developed a biological camera that stores images directly in live bacteria using optogenetics and DNA barcoding.
As DNA data storage improves, its information density will approach its theoretical maximum. Will you be streaming a movie from DNA stored in a vial? Not anytime soon. But a lot of your other data could be moving to DNA-based cloud storage.