High-performance large language models for Europe
The High-Performance Language Technologies (HPLT) project is developing very large-scale multilingual resources for large language models and machine translation. Massive text collections for pre-training are the ‘crude oil’ of the large language model (LLM) era. The process of ‘refining’ high-quality datasets from web data at scale presupposes computational infrastructure and technological muscle that are often characteristic of corporate environments, as evidenced, for example, by some notable generally available pre-training datasets: C4,¹ FineWeb 1 & 2,²,³ MADLAD-400,⁴ or Nemotron-CC.⁵ With a few notable exceptions, this line of work tends to capitalise on the English language. Here, we present the open-source results⁶,⁹,¹⁰ of the European R&D consortium HPLT, a project funded under the auspices of the Horizon Europe programme in 2022–2025. Alongside a myriad of additional results, HPLT has produced massive pre-training datasets of high-quality texts in close to 200 distinct language–script combinations. Its 2025 monolingual data release, HPLT 3.0, comprises some 30 trillion sub-word tokens in total, of which close to half represent languages other than English. We make this resource …









