Case study

"Where can we put all those data?"

What if research data serve multiple purposes and need to be stored in different places, but local facilities are not sufficient? Erwin van Wieringen explains how SURF's Storage scale-out provides a suitable solution for RIVM.

14 December 2023

Key facts

Who: Erwin van Wieringen
Function: DevOps engineer and bioinformatician
Organisation: RIVM
Service: Storage scale-out
Challenge: Storing large amounts of DNA data
Solution: A module that connects relatively easily to one's own iRODS data management environment

Public health

On the outskirts of rustic Bilthoven lies the heavily secured site of the National Institute for Public Health and the Environment (RIVM). Inside the fences, some 2,100 people are committed to a healthy population and living environment. Quite a few of those employees are engaged in scientific research and collect data on topics that regularly make the news, from detecting and monitoring new viruses and running cancer screening programmes to measuring harmful substances in the environment. As an agency of the Ministry of Health, Welfare and Sport, the RIVM positions itself as the knowledge institute for public health, care, nutrition, environment and safety.

"When the COVID pandemic broke out, we suddenly had to process more data in a week than we previously did in a year"
Erwin van Wieringen, RIVM

"I have always felt at home in IT. Before this, I worked at a regional hospital for over 14 years on storage solutions, virtualisation, application management and networking and network security." Speaking is Erwin van Wieringen, DevOps engineer and bioinformatician at the RIVM. "I started here as an infrastructure architect. A few years ago, I ended up at what we then called the 'modelling platform'. This was a kind of computing environment for research institutes within the government, which we offered more broadly than to RIVM researchers alone. Then I started working at the 'Bioinformatics' programme, which, by the way, I still support."

Data from DNA sequencing

"The biologists at the RIVM are working a lot with data from DNA sequencing, a technique that can read the code of DNA. That code is the sequence of information on DNA. This usually creates very large files that require a lot of post-processing before you can do anything with them. Take, for example, a virus like COVID-19. Using the sequences, you can track how an infection spreads through the Netherlands. Researchers find the source and see how the virus moves."

Computing power and data management

"Initially, those researchers need computing power, precisely because the post-processing of the data is very computationally intensive. So we set up a computing cluster. That infrastructure is now in Amsterdam, near SURF's, if I'm not mistaken. Then data management came into the picture. A lot of data is generated, but there was little insight into what was stored where and with which software (versions) the data had been processed.

Sharing data securely

"So we started looking for a system to record all those data properties, preferably in an automated way. This is how we ended up with iRODS. With iRODS, we manage secure data sharing, metadata management and workflow automation. You can store data in different physical locations without the user being aware of it. That can be on disk, but also on tape."
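To make the idea of recording data properties concrete, here is a minimal sketch in plain Python. It assumes only that metadata are kept as attribute-value-unit triples, which is the shape iRODS uses for metadata. The attribute names, paths and helper functions are hypothetical illustrations, not RIVM's actual schema or the iRODS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Avu:
    """One attribute-value-unit triple (the unit is optional)."""
    attribute: str
    value: str
    unit: str = ""


def provenance_avus(project: str, tool: str, version: str) -> list[Avu]:
    """Build the triples that record which software processed a file.

    The attribute names are assumptions made for this example.
    """
    return [
        Avu("project", project),
        Avu("processed_with", tool),
        Avu("tool_version", version),
    ]


def find_by(catalogue: dict[str, list[Avu]], attribute: str, value: str) -> list[str]:
    """Return the sorted paths whose metadata contain the given pair."""
    return sorted(
        path
        for path, avus in catalogue.items()
        if any(a.attribute == attribute and a.value == value for a in avus)
    )
```

With such records in place, a question like "which files were processed with version 1.2 of a tool?" becomes a lookup on metadata rather than a search through directories.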

Scaling up

"After a while of hard work, we had set up quite a nice system with iRODS - if I do say so myself. It was running successfully, until the COVID pandemic broke out. From then on, the Ministry put the RIVM to work even harder and we suddenly had to 'sequence' a lot more data; suddenly we were doing in a week what we were previously processing in a year. And so we had to scale up, because where could we put all those data?"

Relatively simple

"Around that time, we contacted SURF and Storage scale-out came into the picture, a module that we could link relatively easily to our own iRODS environment. We set up 'policies', with which the data per project is automatically stored locally or archived in SURF's Data Archive. In our case, this means raw data sets of up to 300 to 400 gigabytes. That certainly does not make us the largest purchaser of storage capacity at SURF."

Pragmatic

"The collaboration runs to great satisfaction. I have noticed that the people at SURF are very pragmatic. We make contact easily and get things done quickly. We really only speak to the support department in case of minor malfunctions, for instance when a process has failed or a certificate has expired. In such cases, you create an incident and then they usually respond within an hour."

"I am happy to see a greater emphasis on data-centric working everywhere, not just at the RIVM. There is a realisation among research institutes that data is your biggest 'asset'. What data do we actually have at our disposal? How can we better organise and retrieve our data? Is that data confidential, freely available or can it, for instance, only be published in five years' time? And how is privacy optimally guaranteed? These are topics we naturally also look at with our colleagues."

Still a long way to go

"That dynamic really appeals to me. On the one hand, I am close to the researcher or the users of our systems and help colleagues do their work better. On the other hand, I am on top of the technical developments and see IT solutions evolve at a rapid pace. In any case, I am far from being out of the loop in the coming years."