AI tools and the large language models behind them are trained on data that is freely accessible on the internet, even when that material is illegal or copyrighted. This is especially true of non-English language models.
Recent research shows that Dutch language models are trained in part on data from an illegal pirate site. It also appears that the quality filter OpenAI (the maker of ChatGPT) uses to screen online content works poorly for Dutch-language material.
The controversial Dutch pirate site Docplayer accounts for 3.6 percent of the total training dataset. This website contains private information, such as applicant evaluations and leaked data, including complete resumes and tax returns. Although the site has been declared illegal by the Dutch Data Protection Authority and the National Cyber Security Centre, it is still up and running.
Advertisements from private sellers are also well represented in the dataset: 0.3 percent comes from ebay.nl, and Marktplaats.nl has a share of 0.2 percent. As a result, the language model also contains many telephone numbers of private individuals, taken from these advertisements.
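To illustrate where percentages like these come from: one simple way to measure a website's share of a training corpus is to count how many documents in the corpus come from each domain. This is only a sketch with made-up URLs, and it assumes a per-document count; the actual research may have weighted sources differently, for example by word or token count.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_shares(urls):
    """Return each domain's fractional share of a list of document URLs."""
    domains = Counter(urlparse(u).netloc for u in urls)
    total = sum(domains.values())
    return {domain: count / total for domain, count in domains.items()}

# Hypothetical mini-corpus of source URLs (illustrative only).
corpus_urls = [
    "https://docplayer.nl/doc-1",
    "https://docplayer.nl/doc-2",
    "https://www.ebay.nl/advert-1",
    "https://www.marktplaats.nl/advert-1",
]

shares = domain_shares(corpus_urls)
# docplayer.nl supplies 2 of the 4 documents, i.e. a share of 0.5
```

On a real corpus of millions of URLs, the same counting approach yields figures like the 3.6 percent reported for Docplayer.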
And it gets worse
Even more disturbing is that the dataset also draws heavily on websites full of misinformation. The research showed, for example, that the neo-Nazi website Stormfront, the conspiracy site Vrijspreker and the anti-Islamic, Europhobic blog E.J. Bron have been used as training material. The neo-Nazi website even ranks only one place below a mainstream news site such as RTL Nieuws in the list of sources: the model draws roughly the same amount of data from both.
Furthermore, the top two hundred most-cited websites include a striking number of quality media outlets, used without any payment. From a quality newspaper such as de Volkskrant, 162,000 unique texts were used: roughly ten years of journalistic work.
Poor quality filter for non-English websites
Non-English websites are difficult for the companies behind chatbots to vet for reliability and relevance. Language models are developed mainly in the United States, where researchers generally speak only English. That makes it hard for them to determine which websites belong in the dataset and which should be left out.
Moreover, there are relatively few Dutch-language websites on the internet. A chatbot can only be trained properly with a sufficient amount of material, and that volume cannot be reached using only high-quality websites.
All non-English language models
The problem occurs with all non-English language models. These, too, are trained on datasets full of disinformation, private data and copyrighted content. And traces of that material can surface in the answers a chatbot gives you.
The Dutch Data Protection Authority (AP) has sent OpenAI a letter asking for more clarity about ChatGPT, but has not yet received a response.
In any case, the findings underline the need to regulate the spread of disinformation and personal data through AI-generated content. The European Union's AI Act is expected to come into force before the end of 2023 and should put an end to piracy and privacy violations by language models.