What you need to know about... GPT-NL
The American ChatGPT made a great impact upon release. Now TNO, the Netherlands Forensic Institute and SURF are working on a Dutch language model: GPT-NL. Thomas van Os, machine learning advisor at SURF, explains why it is badly needed and how the team is approaching it.
Listen to the podcast now in your favourite app!
1. Introduction (0:00 - 0:30)
- Introduction of the SURFshort podcast by Sanne Koenen.
- Guest: Thomas van Os, Machine Learning Advisor at SURF.
- Topic: GPT-NL and its impact.
2. What is GPT-NL? (0:31 - 1:48)
- GPT-NL is a new language model, specific to Dutch society.
- The model is being developed completely from scratch, so that Dutch norms and values are better incorporated into the model.
- Goal: more control and autonomy compared to other models such as GPT-4, which are trained on US data.
3. Examples of problems with existing models (1:49 - 2:58)
- Example of ChatGPT giving incorrect answers, such as the capital of North Holland.
- The aim of GPT-NL is to better respond to Dutch values and sensitive issues such as the Zwarte Piet discussion.
4. Data and cooperation (2:59 - 4:15)
- Cooperation with Dutch data providers, such as libraries, archives, and municipalities.
- Importance of access to Dutch data to build an accurate and representative model.
5. Benefits for institutions (4:16 - 5:42)
- The model will be open source and accessible to Dutch institutions.
- Benefits: autonomy, confidentiality and compliance with ethical and legal rules such as the GDPR.
6. Regulation and the AI Act (5:43 - 6:56)
- The new AI Act, which comes on top of the GDPR, will require more transparency and documentation of training data.
- OpenAI and other companies will likely have to be more open about their training data.
7. Competition with commercial companies such as OpenAI (6:57 - 7:42)
- GPT-NL will focus more on specific, Dutch applications and standards.
- The goal is not to be better in benchmarks, but to be relevant to Dutch society.
8. Team and priorities (7:43 - 9:00)
- A diverse team is working on GPT-NL, with expertise in technology, ethics and legal issues.
- Priority is on data collection and building benchmarks relevant to the Dutch context.
9. Benefits for research and education (9:01 - 9:58)
- Research: GPT-NL offers opportunities for Dutch research and experiments.
- Education: the model can help lecturers and students learn how GPT models work.
10. Future expectations (9:59 - 11:32)
- The first model is expected to be published in early 2025.
- The model will be developed incrementally with versions more specific to different tasks, such as instruction or chat.
11. Tips for further deepening (11:33 - 12:50)
- Technical tip: Hugging Face has published a dataset called FineWeb, which is useful for data acquisition and filtering.
- Reading tip: Isaac Asimov's short story The Last Question, a science fiction classic about the future of technology.
GPT-NL: a customised language model for the Netherlands
The Netherlands is on the brink of a revolution in language modelling. In the latest episode of our SURFshort podcast, Thomas van Os, Machine Learning Advisor at SURF, talks about GPT-NL: a project to develop a language model specifically for the Netherlands. What makes GPT-NL different from existing models, and why is it so important for our society? In this podcast, you will hear what GPT-NL is, why it is needed, and what steps are being taken to make it a reality.
Why do we need our own language model?
In the podcast, Thomas van Os explains that most language models, such as GPT-4, which ChatGPT uses, are trained largely on English-language data from US sources. As a result, their output often carries a strong US perspective or bias, and does not always fit well with Dutch norms, values and cultural context. Thomas gives the example that some models struggle to answer questions about Dutch cities correctly, or cannot properly place sensitivities such as the Zwarte Piet discussion. GPT-NL aims to build a language model that better understands these nuances and takes Dutch society and culture into account.
Working together for the right data
Thomas stresses that GPT-NL is trained on specifically Dutch data, obtained through collaborations with libraries, archives and government agencies. He says these partnerships are essential to ensure that the model uses data representative of the Netherlands. The ultimate goal is a Dutch language model that, based on local data, can answer questions and provide advice relevant to our society.
GPT-NL and the GDPR: what about privacy?
During the podcast, Thomas discusses that privacy and ethics play a central role in the development of GPT-NL. He explains that the model strictly adheres to European privacy legislation (GDPR). This means that data is carefully selected and processed to protect personal information. Moreover, the development of GPT-NL takes into account the upcoming AI Act, which imposes additional requirements for transparency and documentation of AI systems.
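The podcast does not describe GPT-NL's actual data pipeline, but the idea of scrubbing personal information before training can be sketched with a minimal, hypothetical redaction pass. The patterns and placeholder tokens below are invented for illustration; a production GDPR pipeline would use far more robust detection, such as named-entity recognition.

```python
import re

# Hypothetical patterns for two common identifier types.
# Real pipelines cover many more categories (names, addresses, IDs).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+31|0)\d[\d \-]{7,10}\d\b"),  # Dutch-style numbers
}

def redact_pii(text: str) -> str:
    """Replace likely personal identifiers with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Mail j.jansen@example.nl of bel 06-12345678 voor informatie."
print(redact_pii(sample))  # Mail [EMAIL] of bel [PHONE] voor informatie.
```

Redacting before training, rather than after generation, means personal data never enters the model's weights in the first place, which is the safer direction under the GDPR's data-minimisation principle.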
What makes GPT-NL unique compared to GPT-4?
In the podcast, Thomas explains that GPT-NL is not focused on achieving high scores on international benchmarks, unlike GPT-4. Instead, GPT-NL focuses specifically on Dutch applications and context, such as legal terms or societal issues. This means that the model is better tailored to the unique needs of the Netherlands, without trying to emulate the broad functionality of commercial models.
Support for research and education
Thomas also talks about the applications of GPT-NL within research and education. He emphasises that a language model tailored to Dutch language and culture can be particularly useful for scientists and students, for example when automatically summarising Dutch-language scientific articles or explaining complex legal concepts in understandable language. Deploying GPT-NL for such tasks makes knowledge more accessible and supports the education system.
A look to the future: when will GPT-NL become available?
According to Thomas, the first GPT-NL model is expected in 2025. After that, the language model will be developed step by step. He explains that there will be different versions focusing on specific tasks, such as chat or instruction following. In this way, the model can be further refined over time to suit the needs of different sectors.
Tip for developers: Hugging Face and the FineWeb dataset
For those who want to get started with AI and language processing themselves, Thomas recommends Hugging Face as a good starting point. This platform offers a large collection of open-source tools and datasets for training and improving AI models. One such dataset is FineWeb, which helps collect and filter web data, an essential process for developing capable language models. Hugging Face also offers extensive tutorials and tools to get you started.
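To give a feel for what "filtering web data" involves, here is a minimal quality filter in the spirit of such web-corpus pipelines. The thresholds and heuristics are invented for illustration and are not the actual rules used by any real dataset:

```python
def looks_like_quality_text(doc: str,
                            min_words: int = 20,
                            max_symbol_ratio: float = 0.1) -> bool:
    """Crude heuristics: enough words, mostly ordinary characters,
    and text that ends in sentence-final punctuation."""
    words = doc.split()
    if len(words) < min_words:
        return False
    # Count characters that are neither alphanumeric, whitespace,
    # nor common punctuation; too many suggests markup or boilerplate.
    symbols = sum(1 for ch in doc
                  if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"-"))
    if symbols / max(len(doc), 1) > max_symbol_ratio:
        return False
    return doc.rstrip().endswith((".", "!", "?"))

docs = [
    "Dit is een korte tekst.",                      # too short: filtered out
    " ".join(["Een nette Nederlandse zin."] * 10),  # passes all checks
]
filtered = [d for d in docs if looks_like_quality_text(d)]
print(len(filtered))  # 1
```

Real pipelines layer many such filters (language identification, deduplication, toxicity screening) before any text reaches model training; each stage discards a large fraction of the raw crawl.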
Every month we update you in 15 minutes on technological developments in education and research with a new SURFshort.