Natural Language Processing

in a nutshell

Prof. Ivan Yamshchikov's laboratory works on aspects of artificial intelligence related to verbal cognition. This includes the training and inference of large language models (LLMs), as well as research on their applications.

The laboratory works in three areas.

Open and efficient language models
The laboratory partners with JetBrains on efficient tokenization algorithms that could improve language model performance for code generation. We also work closely with the French startup Pleias, and together we contributed to the creation of Common Corpus, the biggest dataset for LLM pretraining published under a permissive license.
AI Safety
NLP@CAIRO works on LLM safety, covering both system alignment and the evaluation of potentially harmful bias that could affect people who use LLMs for their daily needs.
AI and empathy
The laboratory's third and most recent direction focuses on the questions that arise when humans interact with language models. We try to understand how LLMs affect human behavior, both at the level of individual decisions and at the level of the social fabric.

current project(s)

ERIC — Efficient Representations for Intelligent Coding

project title	ERIC — Efficient Representations for Intelligent Coding
summary	The project develops new tokenization algorithms that could improve the efficiency of generative models for code and could also have a positive impact on a broader set of NLP tasks, especially for low-resource languages.
key words	tokenization, generative models for code
collaborator	JetBrains
funding	JetBrains
duration	3 years

AIOLIA

project title	AIOLIA
summary	AIOLIA gives a robust three-tier response to the complex challenges posed by the need to operationally interpret the EU AI Act and global AI regulation. Resolutely European, AIOLIA's vision extends beyond the EU, embracing global cooperation with leading universities and think tanks in China, South Korea, Japan, and Canada. Through the UNESCO platform, with its reach into Africa and South Asia, AIOLIA's guidelines evolve into an analytic toolbox for key international AI dialogues and processes. This global perspective ensures that AIOLIA's impact is not only significant but also sustainable, contributing to fair scientific cooperation and providing concrete, culturally informed ethics instruments to shape the next generation of AI systems.
key words	AI ethics
collaborators	French Alternative Energies and Atomic Energy Commission (CEA), Research Institute of Sweden (RISE), Karlsruher Institut für Technologie (KIT), Center for Research and Technology Hellas (CERTH), CENTRIC, Amsterdam University Medical Centers, Center for European Policy Studies (CEPS), European Network of Research Ethics Committees (EUREC), Euractiv, European Research Consortium for Informatics and Mathematics (ERCIM), AI Data Robotics Association (ADRA), Afliant, Oxipit, NIT Institute, McGill University, Chinese Academy for Science and Technology for Development (CASTED), ETICAS.AI, University of Osaka, Science and Technology Policy Institute (STEPI)
funding	European Commission Grant Agreement 101187937
duration	3 years
website	aiolia.eu

research group

Prof. Dr. Ivan Yamshchikov
(Professor for NLP)

Dr. Angelica Henestrosa
(Post-doctoral researcher)

Pavel Chizhov
(PhD student)

Bibin Babu
(PhD student)

Vishnu Prasad
(PhD student)

Svetlana Gorovaia
(PhD student)

contact for collaboration

Prof. Dr. Ivan Yamshchikov

ivan.yamshchikov[at]thws.de