"Sharing genetic data is incredibly difficult because of the strict GDPR rules."
Collaborative hub for Alzheimer's research: 'We are pioneering'
Making Dutch DNA data available to researchers from other countries, so that we can accelerate Alzheimer's research? Strict European privacy legislation makes this quite difficult. Biochemist Henne Holstege and her colleague, bioinformatician Marc Hulsman, are working with SURF on a platform that should make this possible.
During her PhD research on breast cancer, Henne Holstege initially planned to specialise in genetic research on tumours. But then in 2005 Hendrikje van Andel-Schipper, the oldest person in the world, died. "She was 115 years old and still razor sharp. I actually found that much cooler. She proved on her own that not everyone deteriorates. The human brain can remain healthy for a very long time. I wanted to know how that was possible. So I started by completely analysing Ms Van Andel-Schipper's genome. But of course you need a lot more test subjects."
That is why Holstege set up her now famous 100-plus study at the Amsterdam UMC. In this project, she does not study why people develop Alzheimer's, but rather why healthy people do not develop it. To this end, she studies the DNA of these elderly people, among other things, to see what genetic material they are born with. "In addition, we ask people over 100 who participate in our research to donate their brains after their death. Then we can see what a very old, but properly functioning brain actually looks like on the inside. We are learning a great deal from this!"
Together with SURF, your research group is now working on a platform where researchers from different countries can upload and analyse their data. Why is this necessary?
"In this project we want to know exactly which genetic variants play a role in Alzheimer's disease. It is still in the future, but the idea is that with the genetic variants found we will be able to predict who will become demented later in life and who will not. So that we will know in time whom to treat, before clinical symptoms appear. Because by then it's too late: you lose those brain cells and no medicine can bring them back. Because genetic material is so incredibly complex, you have to compare very large groups of Alzheimer's patients with very large control groups. And because it often involves rare variants, you have to have enough evidence to establish that a particular hereditary variant leads to an increased risk of getting the disease."
"This means you have to collaborate with as many researchers as possible, preferably worldwide, to get as much data as possible. This is difficult because as soon as we are allowed to look into someone's dataset, the data has to come to us and you have to comply with the General Data Protection Regulation (GDPR). Genetic material falls into the category of 'special personal data' and is therefore highly protected. Genetic research is therefore extremely difficult at the moment, because it is not easy to share this data. And yet all those Alzheimer's patients have donated their DNA with the idea: we can learn from it.
In the US, they take a different view. Everything that the federal National Institutes of Health (NIH) funds for the gathering of data must be shared publicly. So we can just download the data from America. Then we can merge this with our data and the data of other European researchers.
But if the data sits in our account on the supercomputer, only we can access this large dataset. That made me feel uncomfortable. To solve this problem, we are developing a platform that runs at SURF: the Alzheimer Genetics Hub. This is where we will store the data and make it available, so that our American and European research partners can also access it. We all benefit from that, don't we?"
How can you be sure that the personal data remains secure?
"Anyone who is granted access by us can do their computations on the platform. Therefore, there is no need to download the datasets. Moreover, there is a limit on the amount of data you can download, so you cannot download entire genomes. But you can download the final results of your analysis, because that is what you want to publish. We also keep track of what is downloaded and by whom.
The data that is uploaded by others is not ours. We cannot provide access to these datasets, the data owners have to give permission for that. To obtain access, you have to sign all kinds of legal forms and contracts. Personal data therefore remains safe on the platform. But it's also about trust. You must never violate that."
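The download rules described above (a cap on how much data can leave the platform, plus a record of what is downloaded and by whom) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the class name `DownloadGate`, the 50 MB quota, the file names and the log format are assumptions made for illustration, not details of the actual hub.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DownloadGate:
    """Hypothetical per-user download cap with an audit trail."""
    quota_bytes: int                               # assumed limit per user
    used: dict = field(default_factory=dict)       # user -> bytes downloaded so far
    audit_log: list = field(default_factory=list)  # one entry per request

    def request(self, user: str, filename: str, size_bytes: int) -> bool:
        """Allow the download only if it keeps the user within the quota."""
        allowed = self.used.get(user, 0) + size_bytes <= self.quota_bytes
        if allowed:
            self.used[user] = self.used.get(user, 0) + size_bytes
        # Assumption: refused requests are logged too, not only successful ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "file": filename,
            "bytes": size_bytes,
            "allowed": allowed,
        })
        return allowed


gate = DownloadGate(quota_bytes=50 * 1024**2)  # 50 MB, an arbitrary example value
# A small results table fits within the quota ...
results_ok = gate.request("alice", "association_results.tsv", 2 * 1024**2)
# ... whereas a file the size of an entire genome does not.
genome_ok = gate.request("alice", "whole_genome.bam", 100 * 1024**3)
```

In this sketch the analysis results can leave the platform while a genome-sized file is refused, and both requests end up in the log.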
What are your expectations for the collaborations that will take place?
"I hope that if we show that we are really going to share, many more researchers will put their data on the hub. And that the best bioinformaticians in the world will think: yes, finally I can access that data! They will start working with it, which will allow us to make a lot of new findings. We do have certain suspicions, but we need many more Alzheimer's patients and control DNA to confirm them. Technology is moving fast, we can do so much more than five years ago. I am optimistic that one day treatments will be possible."
"With this, we hope to really boost the field of Alzheimer's research"
In the initial phase, Holstege plans to offer access to the hub free of charge to people who share their data. "This is possible because we have received a large grant from NWO for computing time and storage. We are very much pioneering here. I must confess that I sometimes think: what have we got ourselves into? We cannot put all the energy we put into this into other papers, for example for the 100-plus study. That is sometimes difficult, because as a scientist you are judged on your output. But I have the feeling that we are really giving Alzheimer's research a boost here."
Spider technology: spider in the web
"When I first heard about this project, I thought: you can't do this with technology alone. Giving someone interactive access to data without being able to store it," says Coen Schrijvers, senior advisor and team lead at SURF. "So one of the biggest challenges is to develop a solution that is a seamless unit in technical and legal terms."
The cloud-based Spider technology, which SURF has only recently acquired, proved to be the ideal solution. "This allows us to design a customised environment that is fully tailored to the requirements of this project: a secure yet flexible environment for collaboration on large datasets."
The design philosophy of Spider technology is that the computer cluster lives like a spider in the web: necessary components that you can already find elsewhere do not need to be built. Schrijvers: "You just link it to existing services, such as the supercomputer Snellius, our Grid storage or Research Drive."
Firewall and various roles
The Alzheimer Genetics Hub is behind a firewall, with some specially designed services to enable logging in, selective file transfers and intensive logging. The technology offers different authorisation and authentication roles, such as a data manager who has all rights to the data, and a regular member who only gets access to certain data. "That suits this project very well," says Schrijvers. "We are also deploying the SURF Research Access Management service for this, which Holstege's group can use to verify the authenticity of users and add and remove them."
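The two roles mentioned here can be sketched as a simple access check in Python. This is only an illustration under stated assumptions: the names `Role`, `User` and `can_read`, and the dataset names, are hypothetical and do not reflect how Spider or SURF Research Access Management actually implement authorisation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    DATA_MANAGER = "data_manager"  # has all rights to the data
    MEMBER = "member"              # only gets access to certain data


@dataclass
class User:
    name: str
    role: Role
    granted: set = field(default_factory=set)  # datasets a member may read


def can_read(user: User, dataset: str) -> bool:
    """Data managers may read everything; members need an explicit grant."""
    if user.role is Role.DATA_MANAGER:
        return True
    return dataset in user.granted


# Hypothetical users and dataset names, for illustration only.
manager = User("manager", Role.DATA_MANAGER)
member = User("member", Role.MEMBER, granted={"cohort_a"})

manager_reads_b = can_read(manager, "cohort_b")
member_reads_a = can_read(member, "cohort_a")
member_reads_b = can_read(member, "cohort_b")
```

Here the data manager can read any dataset, while the regular member can read only the dataset that was explicitly granted.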
A proof of concept, scheduled for December 2021, should show whether the Alzheimer platform also works in practice. "If all goes well, we can then be in production within a few months. The project should run for several years and will hopefully become a good example for similar projects."
Off-the-shelf Spider service
The Alzheimer Genetics Hub is custom-made. SURF also provides a generic Spider Service, a dynamic and flexible platform suitable for large-scale data processing and optimised for data-centric collaboration. Projects that do not require a separate environment run on this service. As with the Alzheimer Genetics Hub, these users have access to powerful data processing and storage systems, and to network connectivity of 1.2 terabits per second to other SURF services.
"But most important is the support we provide for setting up and automating data-centric projects. That means, for example, that several people can work on one dataset, with a data manager assigning rights to project data and a software manager managing the tools for analysis. This makes it a reasonably low-threshold service, without any concessions to data processing power."