Thesis: Out-of-Distribution Detection Techniques on Trained Chemical Transformer Models
Join us for your thesis work! Gain hands-on experience, work on real projects, and develop your skills in a supportive and innovative environment!
High Level Description
Recent advances in deep learning have made it possible to represent chemical structures as dense, high-dimensional embeddings in which the model captures subtle relationships learned from training samples. These embeddings are used to predict various chemical properties, which is crucial in fields such as drug discovery and chemical risk assessment. However, in real-world scenarios these models often encounter input samples that fall outside the scope of the model, so-called out-of-distribution (OOD) samples. Detecting OOD samples is critical for ensuring prediction accuracy and reliability.
Project Description
This thesis aims to systematically evaluate and categorize recent advances in OOD detection methods reported in the scientific literature and assess their applicability to chemical embeddings from a transformer model trained for chemical toxicity prediction [1]. The trained transformer model represents a chemical structure as a sequence of text tokens, as modern LLMs do, and updates the token embeddings iteratively over several layers. The final layer outputs a single embedding, which is used for toxicity prediction.
The central focus of the thesis is to investigate how OOD detection can be applied to the token embeddings to quantify how far a new chemical lies from the model's in-distribution data. Using energy-based or distance-based measures, such as cosine similarity, the project aims to apply OOD detection to the embedding vectors and assess how well the methods work on TRIDENT models [1,2]. The data to be used comprises ~10,000 chemicals from the ECOTOX database [3].
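As a rough illustration of the distance-based idea mentioned above, the sketch below scores a new embedding by how dissimilar it is, in cosine terms, to its nearest neighbours among in-distribution embeddings. It is only a minimal example under stated assumptions: the function names, the choice of k, the quantile threshold and the synthetic vectors are all hypothetical and are not taken from the TRIDENT code base or from [1].

```python
import numpy as np


def cosine_knn_ood_score(reference_emb, query_emb, k=5):
    """Return one OOD score per query embedding.

    Score = 1 - mean cosine similarity to the k most similar reference
    (in-distribution) embeddings. Higher scores suggest the query lies
    further from the reference data. Hypothetical helper, not TRIDENT API.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    ref = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    qry = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = qry @ ref.T                      # shape: (n_query, n_reference)
    top_k = np.sort(sims, axis=1)[:, -k:]   # k largest similarities per query
    return 1.0 - top_k.mean(axis=1)


def fit_threshold(calibration_scores, quantile=0.95):
    """Pick a cut-off so that roughly `quantile` of held-out
    in-distribution scores fall at or below it (an assumed heuristic)."""
    return float(np.quantile(calibration_scores, quantile))


# Synthetic stand-ins for model embeddings (assumption: 16 dimensions).
rng = np.random.default_rng(0)
center = np.ones(16)
reference = center + 0.1 * rng.normal(size=(200, 16))    # in-distribution
calibration = center + 0.1 * rng.normal(size=(50, 16))   # held-out in-distribution
unusual = -center + 0.1 * rng.normal(size=(5, 16))       # far from the reference data

threshold = fit_threshold(cosine_knn_ood_score(reference, calibration))
flags = cosine_knn_ood_score(reference, unusual) > threshold
print(flags)  # one boolean per unusual sample; True means flagged as OOD
```

In a thesis setting the synthetic arrays would be replaced by embeddings extracted from the trained model, and the same scoring interface could be reused to compare other distance-based or energy-based measures.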
Who are we looking for?
We are looking for students who want to write a 30-credit MSc thesis. You should have:
- Required: Programming experience (Python), basic understanding of AI/ML concepts, interest in both software and hardware integration
- Nice-to-haves: Experience with DNN architectures, PyTorch, LLMs and LLM APIs, statistics
Students should have studied computer science, AI/ML, robotics, or related fields where software and algorithms are relevant. An interest in data science is helpful but not required.
Purpose
The purpose of this research is to explore the use of OOD detection in life science by surveying the existing state of the art and applying it to a real-world scenario. By developing accurate OOD detection methods, this thesis aims to contribute to more trustworthy AI models that can be incorporated into data-driven life science.
An Exciting Journey with Knightec Group
Semcon and Knightec have joined forces as Knightec Group. Together, we are Northern Europe’s leading strategic partner in product and digital service development. With a unique combination of cross-functional expertise and a holistic business understanding, we help our clients realize their strategies – from idea to complete solution.
Practical Information
This is a master’s thesis position, located at our office in Gothenburg. The start date is January 2026. Please submit your application as soon as possible, but no later than 2025-11-30. If you have any questions, you are welcome to contact Julia Hellberg. Note that due to GDPR, we only accept applications through our careers page.
References
[1] Mikael Gustavsson et al., Transformers enable accurate prediction of acute and chronic chemical toxicity in aquatic organisms. Sci. Adv. 10, eadk6669 (2024). DOI: 10.1126/sciadv.adk6669
[2] TRIDENT prediction tool: https://trident.serve.scilifelab.se/
[3] ECOTOX database: https://cfpub.epa.gov/ecotox/index.cfm
- Business unit: Thesis
- Role: Master thesis
- Locations: Göteborg
