Thesis: AI Agents for the Extraction of Chemical Toxicity Data from Scientific Literature
Join us for your thesis work! Gain hands-on experience, work on real projects, and develop your skills in a supportive and innovative environment!
High Level Description
Chemical pollution threatens human health and ecosystems worldwide. Chemical hazard data is critical for detecting and mitigating potential negative impacts at an early stage. Recently, computational tools such as transformer models have been used to accurately predict, for example, the environmental toxicity of chemicals. Training such models relies on large amounts of toxicity assay data that is extracted from scientific publications, reports, and safety documents, and collected in open-access databases such as ECOTOX.
This master’s thesis aims to investigate how AI (LLM) agents can be used to automatically extract chemical toxicity data from scientific literature, enabling more accurate hazard prediction and supporting effective environmental legislation.
Project Description
The project will involve designing and implementing AI agents that can process scientific publications, reports, and safety documents to extract structured chemical toxicity data. Students will develop a proof-of-concept pipeline that accepts a list of publications as input and outputs validated data, with the ECOTOX database serving as a reference for evaluation.
The research will focus on handling noisy PDFs (including figures, tables, and OCR errors), designing evaluation metrics for extraction accuracy and traceability, and integrating domain-specific validation against ECOTOX. Depending on student interest, the project may also include fine-tuning LLMs to optimize extraction performance.
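As a rough illustration of what such evaluation metrics could look like, the Python sketch below compares extracted records against a reference set and reports how many records carry a source pointer. The record fields, the exact-match rule, and all function names are assumptions made for this example; they are not part of the project specification or the ECOTOX schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToxicityRecord:
    """One extracted assay result. All field names are illustrative assumptions."""
    chemical: str     # identifier or name of the substance
    species: str      # test organism
    endpoint: str     # e.g. "LC50"
    value: float
    unit: str
    source: str = ""  # provenance, e.g. document id and page (for traceability)

    def key(self):
        # Identity used for matching; provenance is deliberately excluded.
        return (
            self.chemical.lower(),
            self.species.lower(),
            self.endpoint.upper(),
            round(self.value, 6),
            self.unit.lower(),
        )


def precision_recall(extracted, reference):
    """Exact-match precision and recall of extracted records against a reference set."""
    ext = {r.key() for r in extracted}
    ref = {r.key() for r in reference}
    hits = len(ext & ref)
    precision = hits / len(ext) if ext else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


def traceability(extracted):
    """Share of extracted records that carry a non-empty source pointer."""
    if not extracted:
        return 0.0
    return sum(1 for r in extracted if r.source.strip()) / len(extracted)
```

Exact matching is the simplest possible choice here; a real evaluation would likely need unit normalization and tolerance on numeric values, which is part of what the thesis would investigate.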
Who are we looking for?
We are looking for two students who want to write a 30-credit MSc thesis during spring 2026. You should have:
- Required: some programming experience (Python, pipelines, basic ML).
- Nice-to-haves: experience with OpenWebUI (or similar), web scraping, NLP, pipelines, databases, or MLOps tools.
Suitable candidates are most likely enrolled in a master’s program in computational science or a program that involves software development.
Purpose
The purpose of this research is to explore whether AI agents can overcome the bottleneck of manual toxicity data extraction. By creating accurate, traceable, and automated extraction pipelines, the thesis aims to enable large-scale hazard prediction and contribute to more effective environmental protection.
An Exciting Journey with Knightec Group
Semcon and Knightec have joined forces as Knightec Group. Together, we are Northern Europe’s leading strategic partner in product and digital service development. With a unique combination of cross-functional expertise and a holistic business understanding, we help our clients realize their strategies – from idea to complete solution.
Practical Information
This is a master’s thesis position, located at our office at Lindholmsallén 2, Gothenburg. Start date as agreed.
Please submit your application as soon as possible, but no later than 2025-11-30. If you have any questions, you are welcome to contact Julia Hellberg. Note that due to GDPR, we only accept applications through our careers page.
- Business unit: Thesis
- Role: Master thesis
- Locations: Göteborg
