Stellus Helps Drive Actionable Insight
in Life Sciences Research
Written by Jaideep Joshi
Published on April 7, 2020
In the scientific community, problems in bioinformatics, computational biology, and structural biology are known to be very hard to solve. The core scientific difficulties are compounded by compute and data processing complexities, extremely long timelines, and accuracy concerns, not to mention high costs. However, solving these problems is critical to identifying and curing diseases through the development of new drugs and medical treatments.
Until recently, the widespread adoption of Genomic Sequencing & Analysis, or of Structural Biology workflows such as Cryo-EM, was limited to a handful of organizations.
Enhancements in speed and accuracy, coupled with reductions in the cost of front-end laboratory instruments, are quickly making these scientific endeavors mainstream in many research, pharmaceutical, and clinical organizations. Advancements in automation are also reducing the slowdowns caused by multiple rounds of human trial-and-error in these workflows.
Today, advancements in high-throughput sequencing have made it possible to sequence a whole human genome for under $600. Many organizations are now routinely sequencing hundreds of samples per day.
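A rough, back-of-the-envelope sketch helps put that sequencing throughput in data terms. The per-sample size here is an assumption (a 30x whole human genome typically yields on the order of 100 GB of raw data, depending on format and compression), and 300 samples stands in for "hundreds per day":

```python
# Hedged estimate of daily raw data volume for a high-throughput
# sequencing lab. Both inputs are assumptions, not article figures.

GB_PER_SAMPLE = 100      # assumed raw data per 30x whole genome (~100 GB)
SAMPLES_PER_DAY = 300    # "hundreds of samples per day"

daily_tb = GB_PER_SAMPLE * SAMPLES_PER_DAY / 1000
print(f"~{daily_tb:.0f} TB of raw sequencing data per day")  # → ~30 TB
```

Even with conservative inputs, a single busy lab lands in the tens of terabytes per day, consistent with the data volumes discussed below.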
Innovations in high-resolution image capture in Cryo-EM have enabled researchers to clearly identify the shapes of individual proteins thousands of times smaller than the width of a human hair.
New techniques in mass spectrometry-based proteomics are increasingly being applied to biological and biomedical research.
All of these scientific breakthroughs are being amplified by widely available ML/AI and related data science techniques, uncovering actionable insights that were previously out of reach.
The Changing Landscape
While the digital creation and capture of raw data has indeed become fast and economical, with many environments generating hundreds of terabytes (TB) to multiple petabytes (PB) of data per day, the efforts to quickly derive actionable insight from this data are pushing the boundaries of existing IT environments.
Traditional HPC is changing. Much of the computation is shifting from CPU-centric tools to GPUs, FPGAs, and ASICs. The versatility, simplicity, and cost-effectiveness of Ethernet have made it possible to deploy 25, 40, and 100 Gigabit Ethernet as alternatives to InfiniBand in many HPC environments. Ethernet is quickly becoming the de facto choice for delivering the throughput and low latency that instruments and compute clusters require to store and access the large amounts of scientific data generated daily. Software-based parallelism with modern frameworks like Spark is also making it possible to solve these large, data-intensive problems efficiently in new ways.
The net result of these changes in compute and networking is that workloads like Genomic Analysis and Cryo-EM Modeling are shifting from being compute bound to being I/O bound.
The Data Access Problem
As an example, a single new camera working in conjunction with a Cryo-EM microscope is capable of producing 5PB of data per day. This kind of equipment needs an extremely reliable high-speed data platform to sustain these high data ingest rates. In the absence of such a platform, researchers are constantly faced with massive data storage bottlenecks. Data migrations and movements result in lengthy (and expensive) microscope downtimes and wasted research cycles, all before one can even begin the data analysis phase. This is further compounded when one understands that research activities are iterative in nature. It is very normal (and imperative) for researchers to repeat their experiments and analysis multiple times.
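The sustained ingest rate implied by that 5 PB/day figure can be checked with simple arithmetic (decimal units assumed, as storage vendors typically use: 1 PB = 10^15 bytes, 1 GB = 10^9 bytes):

```python
# Sustained ingest rate implied by a camera producing 5 PB/day.

SECONDS_PER_DAY = 24 * 60 * 60           # 86,400 s
bytes_per_day = 5 * 10**15               # 5 PB/day, per the article

gb_per_s = bytes_per_day / 10**9 / SECONDS_PER_DAY
print(f"required sustained ingest: ~{gb_per_s:.1f} GB/s")  # → ~57.9 GB/s
```

In other words, the data platform must absorb close to 58 GB/s around the clock just to keep the instrument running, before any analysis begins.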
Storage systems built in the era of HDDs have delivered only incremental performance improvements with the inclusion of SSDs, mostly in faster reads. Legacy block-based storage architectures simply cannot handle the high sustained writes these workloads require.
The introduction of innovative new memory and NAND devices has made it evident that decades-old file systems and software stacks cannot truly exploit the capabilities of next-generation storage media.
Considering that most of these datasets are created in on-premises research facilities, most scientists do not have viable cloud-based options today. In the rare cases where data can be transferred to the cloud, cloud-based alternatives can be 5x to 9x as expensive as on-premises environments, mostly due to high throughput and data movement costs.
We at Stellus Technologies have taken up this data storage challenge. First-hand knowledge of the latest memory and flash devices and deep understanding of data storage and access requirements have resulted in the creation of the Stellus Data Platform (SDP). SDP is unique in its use of Key-Value over Fabric (KVoF) techniques, coupled with NVMe and RDMA to deliver unmatched sustained read-write performance at scale.
SDP uses strongly consistent, scalable, and reliable Key-Value Stores as the underlying mechanism to ingest and access exabyte-scale unstructured data. By eliminating age-old, compute-intensive data maps, data look-ups, and cache-coherence tasks, the resulting architecture consistently delivers 4x-5x the low-latency throughput (GB/s) of comparable industry offerings, in a much smaller footprint. As an example, SDP can deliver 40GB/s of sustained reads and writes in 5U.
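Taking the article's figures at face value, a quick sketch shows how many such 5U units it would take to keep pace with the 5 PB/day Cryo-EM camera described earlier (decimal units assumed throughout):

```python
import math

# How many 5U units, at 40 GB/s each (the article's figure), are needed
# to sustain a 5 PB/day ingest stream. Decimal units: 1 PB = 10**15 bytes.

INGEST_GBPS = 5 * 10**15 / 10**9 / 86400   # ~57.9 GB/s sustained
UNIT_GBPS = 40                              # 40 GB/s per 5U, per the article

units = math.ceil(INGEST_GBPS / UNIT_GBPS)
print(f"{units} units ({units * 5}U of rack space)")  # → 2 units (10U)
```

This is only an illustration of the density claim, not a sizing guide; real deployments would also account for redundancy, headroom, and read traffic from analysis jobs.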
The aforementioned performance can be scaled predictably and independently of the underlying storage capacity. The disaggregated architecture of SDP allows you to scale the platform for throughput (GB/s) or capacity (TB) as your requirements change, without artificial limitations. Designed to run on industry-standard x86 hardware, SDP is truly software defined; neither bespoke hardware nor custom client code is needed to take advantage of these performance and access capabilities. The POSIX-compliant interface ensures that all familiar platform services remain easy to use.
The scientific community is well versed in delivering, and in turn expecting, tangible results. SDP has on several occasions quickly and measurably delivered on its core value proposition: high-performance systems save precious time. Notably, SDP delivered a solution to a leading scientific organization for its microscopy needs; deploying SDP in the data analysis pipeline eliminated multiple days of instrument downtime and weeks of delays in lab activities. Please visit here to learn more.
Subsequent posts will dive into how the Stellus Data Platform provides value in Genome Analysis, Cryo-EM, and ML/AI workflows.