"This does not mean we are going to do it perfectly; no doubt we are also going to make mistakes. But we do believe we can develop a model with enormous added value. A reliable, robust and ethical model that is compliant with the law."
5 questions on GPT-NL, the open Dutch language model
1. What is GPT-NL again?
Together with partners TNO and NFI, and with funding from the Ministry of Economic Affairs, SURF is working on a Dutch large language model: an algorithm that uses generative artificial intelligence to produce text. In other words: our own large language model, as the technology behind ChatGPT is called, but based on the Dutch language and culture. In the future, the language model can be built into various applications by SURF as well as by partners TNO, NFI and the government.
2. What is GPT-NL not?
The aim of GPT-NL is not to develop a Dutch alternative to a chatbot like ChatGPT. ChatGPT is a chat application that OpenAI offers as a service to individual end users. The underlying technology of such a chatbot is a language model; in the case of OpenAI this is, for example, GPT-3.5 or GPT-4. The goal of GPT-NL is to build that underlying technology, i.e. a language model, specifically for the Dutch language. Such a model can then be built into various services or applications within government, industry or SURF services.
The functionalities of the GPT-NL model will be similar to those of other language models (e.g. summarising documents, converting texts into understandable language or retrieving information from one's own organisation), but this does not necessarily have to be done via a human-like chat, as with ChatGPT. However, it is possible to build a chat interface on top of GPT-NL.
Compared to the billion-dollar budgets of the companies behind ChatGPT and other commercial models, the €14 million budget for GPT-NL is quite small. Therefore, GPT-NL focuses only on text and not on audio and images.
3. Why our own language model?
- Existing models are mainly trained on American or Chinese data, and this shapes their (biased) results. Moreover, the models reproduce stereotypes around, for instance, gender and ethnicity. We want a model trained on the Dutch language and Dutch values.
- Existing language models are not open: we have no insight into which choices were made and on which data they were trained.
- We want to comply with legislation around copyright and GDPR.
- The concentration of expertise in a limited number of companies also poses a risk to open discussion of the risks and opportunities of the technology. The project aims to open up that discussion a little.
4. What are the ambitions and planning?
- We are currently hard at work on data collection. By early 2025 at the latest, we will start training the model.
- We are building a Dutch-English language model (datasets in Flemish, Frisian or other variants of Dutch are also welcome).
- We are building the model from scratch and thus not building on an existing language model.
- We train that language model with data we are entitled to use.
- We provide a dataset that is free of personal, confidential and/or sensitive data.
- We are transparent about the choices we make: we make our code public and we share our knowledge and experiences.
- We are considering a business model in which GPT-NL becomes available to public and private parties as well as educational and research institutions.
See also: gpt-nl.nl/commitments
5. How can I contribute to the development of GPT-NL?
GPT-NL requires cooperation and open discussion. Only together can we build GPT-NL! The developers need help with, among other things, collecting varied and rich datasets. We would also like to know which applications the SURF community finds valuable for education and research. Possible applications include developing educational resources, promoting AI literacy, grading submitted work, coaching students and research.
Want to contribute? Donate your data.
Want to know more?
Go to the project website https://gpt-nl.nl