Text embeddings for efficient search in digital investigations
Primary Investigator: Umit Karabiyik
Vinicius Lima, Adil Koeken, Avantika Shah, Samay Nandwana, Umit Karabiyik
Abstract
With the increasing storage capacities of modern devices, digital investigators face significant challenges in efficiently locating relevant information on devices under investigation. It is not uncommon for a single smartphone to contain over 100 GB of data, making manual searches impractical. The problem becomes even more complex when the needed information is not easily discoverable through keyword searches or when investigators do not know which keywords to use. This makes digital investigations cumbersome and time-consuming, highlighting the need for more efficient search methods. For text-based data, one promising solution is to search with text embeddings (vector representations of text) instead of traditional keywords. In this study, we simulated a database of drug-related discussions in which specific drug names were not explicitly mentioned. This reflects real-world scenarios, such as conversations among drug dealers who use coded language (slang) instead of technical terms. Our research demonstrates how embeddings can be leveraged to identify drug-related content without relying on predefined drug keywords. We applied the LLM2Vec technique using different models and instruction formats, achieving strong accuracy metrics. Our findings serve as a proof of concept that text embeddings can significantly enhance search efficiency in digital investigations compared to conventional keyword-based methods.
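To illustrate the general idea of searching by embedding rather than by keyword, the following is a minimal sketch and not the study's implementation. It assumes each message has already been mapped to a vector and ranks messages by cosine similarity to a query vector; cosine similarity is a common choice assumed here. The three-dimensional vectors, the example messages, and the names `cosine_similarity`, `rank_by_similarity`, `TOY_CORPUS`, and `TOY_QUERY` are all hypothetical placeholders invented for illustration. Real embeddings, such as those produced with LLM2Vec, would have far more dimensions and would come from a model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical, hand-made 3-dimensional vectors standing in for model output.
# None of the messages contains a drug name, so a keyword search would miss them.
TOY_CORPUS = [
    ("Got the snow, meet behind the gym at nine", [0.9, 0.1, 0.2]),
    ("Can you pick up milk on the way home?", [0.1, 0.9, 0.1]),
    ("Bring two tickets for the party favors", [0.8, 0.2, 0.3]),
]

# Hypothetical vector for a query such as "conversation about selling drugs".
TOY_QUERY = [1.0, 0.0, 0.2]

if __name__ == "__main__":
    for score, text in rank_by_similarity(TOY_QUERY, TOY_CORPUS):
        print(f"{score:.3f}  {text}")
```

In this toy setup the two coded messages rank above the unrelated one only because their placeholder vectors were chosen to point in a similar direction to the query. With a real embedding model, that closeness would have to come from the model itself.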