With the introduction of the vector data type and the algorithms available in Oracle Machine Learning (OML) starting with Oracle Database 23ai [2], it is now possible to vectorize records (e.g., via PCA) to support both clustering and similarity search. However, these algorithms do not natively handle natural-language fields well. This limitation surfaces often in real-world scenarios such as CRM systems, where free-text operator notes or customer feedback coexist with structured attributes like customer profiles and product details.
In this article, we present a technique that combines numerical, categorical, and natural language fields into a single, unified vector representation of the entire record. The objective is to improve similarity search and clustering accuracy by preserving both the numerical structure of the data and the semantic meaning of its textual content, without relying on rigid, static WHERE filters that can unnecessarily restrict the results returned.
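To make the idea of a unified record vector concrete before turning to the database, here is a minimal, database-independent sketch in plain Python. It is not the OML-based implementation this article develops. The field names (`age`, `segment`, `notes`), the statistics, and every helper function are hypothetical, and the hashed bag-of-words function is only a stand-in for a real text embedding model, so it captures shared words rather than semantic meaning.

```python
import hashlib
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def standardize(value, mean, std):
    """Z-score a numeric field so its scale is comparable to the other parts."""
    return (value - mean) / std if std else 0.0


def one_hot(value, categories):
    """Encode a categorical field as a 0/1 indicator vector."""
    return [1.0 if value == c else 0.0 for c in categories]


def hashed_text_vector(text, dim=16):
    """Stand-in for a text embedding: hashed bag-of-words, unit length.

    Assumption: a real system would call an embedding model here instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return l2_normalize(vec)


def record_vector(record, stats, categories):
    """Concatenate the numeric, categorical, and text parts into one unit vector."""
    numeric = [standardize(record["age"], *stats["age"])]
    categorical = one_hot(record["segment"], categories)
    text = hashed_text_vector(record["notes"])
    return l2_normalize(numeric + categorical + text)


def cosine(a, b):
    """Cosine similarity; inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical column statistics (mean, std) and category list.
STATS = {"age": (45.0, 12.0)}
SEGMENTS = ["retail", "enterprise"]

rec_a = {"age": 34, "segment": "retail", "notes": "customer reported late delivery"}
rec_b = {"age": 36, "segment": "retail", "notes": "late delivery reported by customer"}
rec_c = {"age": 61, "segment": "enterprise", "notes": "asked about invoice format"}

vec_a = record_vector(rec_a, STATS, SEGMENTS)
vec_b = record_vector(rec_b, STATS, SEGMENTS)
vec_c = record_vector(rec_c, STATS, SEGMENTS)

print("a vs b:", cosine(vec_a, vec_b))
print("a vs c:", cosine(vec_a, vec_c))
```

In this sketch, `rec_a` ranks closer to `rec_b` than to `rec_c` because all three parts of the vector contribute to one similarity score, rather than any single field acting as a hard filter. How much weight each part should carry is a design choice that this sketch leaves at a naive equal footing.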