Q&A GPT-NL: the Netherlands' own open AI language model

Following the development of "GPT-NL", the Netherlands' own open language model, here is a list of frequently asked questions and answers:

Q&A

Why is the Netherlands developing its own open language model?

The Netherlands needs its own open language model to develop, strengthen and maintain digital sovereignty. It is an important step towards transparent, fair and verifiable use of AI in line with Dutch and European values and guidelines, and with respect for data ownership.

What exactly is GPT-NL?

GPT-NL will be an open language model, including a virtual facility open to partners who want to contribute data and knowledge or develop applications based on GPT-NL. It can therefore be deployed by academic institutions, researchers and governments. It allows them to explore and try out language models in general, as well as specific applications in security, health, education, services and numerous other domains.

Which parties are involved?

SURF, TNO and NFI are the parties involved in the realisation of GPT-NL. In addition, the model will be co-created with a wide range of Dutch institutions, drawing on the expertise they already have.

What is the budget?

Funding for the model comes from RVO/Ministry of Economic Affairs. The project plan "Facility for a sovereign Dutch language model" was submitted for this purpose in May 2023 and awarded at the end of October 2023. EUR 13.5 million will be made available for the project. Read the official announcement.

What is the schedule?

The project consists of two phases. In the first year, the language model will be developed. The follow-up phase covers operation and use of the model, with a connection to the supercomputer in Amsterdam for computing power. In addition, SURF is developing its own deployment platform for use in the education and research environment.

Will the model be built from scratch?

We will reuse a state-of-the-art model architecture. However, the training itself will probably be done from the ground up to avoid inheriting unknown factors from previous models. Since the training procedure of most models is opaque, using pre-trained starting points would limit the openness of our model. Moreover, training a Dutch model on top of a predominantly English base would require us to be mindful of inherited biases.

How open will the model be?

We intend to distribute the dataset and model weights as openly as possible. Choices made during data generation will be transparent.

Use of GPT-NL

GPT-NL is hosted in a virtual facility. In this way, it can be used by academic institutions, researchers and governments, as well as companies. It allows them to explore and try out language models in general, including specific applications in the fields of security, health, education, services and many other domains.

What are the benefits of GPT-NL for scientific research?

The value of the project lies both in the development of the ecosystem and expertise and in the model itself. Strengthening this core expertise will improve the responsible use of the technology and the overall position relative to commercial models. The virtual facility in this project aims to democratise the responsible use of the technology by facilitating experimentation and knowledge sharing. Moreover, the release of the training set will also benefit subsequent generations of models.

How sustainable is this language model?

We take sustainability and CO2 emissions into account. We need to be responsible in our use of resources. Together with our partners, we are building the most efficient language model we can, based on the latest research. This includes discussing both how large the model should be and how its training and deployment can be optimised given that size. See also https://www.surf.nl/en/energy-aware-computing for how SURF is working more broadly on energy-aware computing.

How is the model trained?

The model will be trained on a computer cluster hosted by SURF. The whole process will be transparent and will require contributions from relevant stakeholders on data collection, curation, model validation, etc. Furthermore, the project is in close contact with specialised legal experts to help us properly navigate the question of copyright.

Media coverage overview

The development of GPT-NL has attracted the attention of many different media outlets.

In addition to traditional media, GPT-NL also went viral on LinkedIn, with dozens of posts shared and international coverage of this upcoming model.