The article uses Oracle AI Vector for a use case to cluster and visualize vectors. In my previous article I showed how to enable Oracle AI Vector, create a table with a vector and query it with simple SQL. Today we’ll run a more colorful experiment: storing Spotify track embeddings in Oracle AI vector, pulling them into Python, and visualizing songs on a single canvas. The source code is at the end of the article.

Why vectors belong in the database

Search that understands meaning rather than keywords is reshaping every data platform. Once a piece of content—text, image, audio—has been converted into a numeric embedding vector, you can:

Find similar items even if they share no common words (semantic search).
Detect anomalies by measuring the distance from “normal” vectors.
Cluster users or products into behavior-based segments.
Feed downstream models without leaving the database.

The use case in one sentence

“Can we group thousands of songs by sonic similarity—independent of explicit genre tags—and see what that community structure looks like?”

That is exactly what a streaming service’s “Fans also like” feature does under the hood. I will reproduce a tiny slice of that pipeline using two classic machine-learning tools:

Step	Tool	Simplified purpose	Result
Dimensionality reduction	t-SNE	Compress vectors down to 2 (or 3) visually interpretable axes	A cloud-like scatter plot
Clustering	K-Means	Assign each song to the nearest of k centroids	5 colour-coded groups

Machine Learning explained: K-Means and t-SNE

My Python example below uses two machine learning algorithms: K-Means and t-SNE. The following explanations are simplified but show the principles used in the program.

K-Means Clustering: “The digital music sorter”

K-Means works like a very systematic record dealer who sorts his collection into categories:

Starting point: the algorithm randomly places 5 “sorting centers” in the music space
Assignment: Each song is assigned to the nearest center
Adjustment: The centers move to the center of their assigned songs
Repetition: This process is repeated until stable groups are formed

The result is a set of clearly defined music clusters.

t-SNE: “The digital music cartographer”

t-SNE (t-distributed Stochastic Neighbor Embedding) is like a cartographer that projects a multidimensional musical landscape onto a 2D map:

The problem: Spotify vectors have hundreds of dimensions – impossible to visualize
The solution: t-SNE “flattens” these dimensions while preserving the neighborhood relationships
The result: songs that are similar in high-dimensional space also appear close to each other on the 2D map

Unlike PCA, t-SNE does not preserve the global structure, but focuses on local similarities – perfect for discovering music clusters.

The technical implementation

My Python program combines the following techniques that could be arranged into a data pipeline:

Database query

Connect to the database and get the data from the database by an SQL query. The query retrieves vectors for different genres from the Oracle AI Vector database. Each vector represents the complex audio characteristics like danceability, energy, speechiness, loudness, etc. of a song.

Data preprocessing and clustering

Standardization ensures that all vector dimensions are treated equally. K-Means then groups the songs into 5 clusters based on their similarity in high-dimensional space.

X = np.vstack(df[“vector”].to_numpy())
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=5, random_state=42).fit(X_scaled)

Visualization with t-SNE

t-SNE projects the high-dimensional vectors onto a 2D plane, with the perplexity parameter controlling the balance between local and global structure.

u = TSNE(n_components=2, perplexity=45, random_state=42).fit_transform(X_scaled)

The results

The resulting diagram shows fascinating insights into the music landscape:

Cluster analysis:

Techno (Blue): Forms compact, clearly delineated areas spread across the entire space
Opera/Guitar (Red): Interesting combination that unites acoustic and classical elements
Rock/Disco (Green): Shows the rhythmic connections between these genres
Pop (Black): The largest cluster, reflecting the diversity of the pop genre
Div (Orange): An isolated cluster that may contain cross-genre songs

Why keep vectors inside an RDBMS like Oracle?

There are many advantages to using established technologies like RDBMS for vectors, e.g.:

One source of truth – No siloed feature store to sync.
SQL + ML synergy – Combining filtering with vector search.
Security & governance – Security features can be applied to vectors, too.
Scalability – HNSW or IVF indexes can handle millions of vectors; Python only needs a working-set slice.

Summary & Takeaways

This article demonstrates how to use K-Means for clustering and t-SNE for dimensionality reduction to visualize high-dimensional vectors. Vectors are becoming first-class citizens in modern data architecture, and RDBMS like Oracle AI Vector makes them accessible via everyday SQL.

t-SNE visualizations help domain experts trust your embeddings before deploying them to production.
K-Means remains a solid baseline for segmentation—fast, interpretable, and effective for initial results.

What’s next?
Future work could include multimodal vectors (that include lyrics), LLM-enhanced clustering, or integrating user behavior for even richer clustering. Stay tuned—and happy clustering!

Python Code

from __future__ import annotations

import os
import warnings
from textwrap import dedent

import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import oracledb
import pandas as pd
from dotenv import load_dotenv
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

matplotlib.use("Agg")

query_def = dedent(
    """
        SELECT vector,
            track_genre,
            track_name,
            artists
        FROM   spotify_track
        WHERE  track_genre IN ('techno', 'opera', 'disco', 
                               'pop', 'rock', 'guitar')
        --FETCH FIRST 10000 ROWS ONLY
    """
).strip()


def get_oracle_connection() -> oracledb.Connection:  # pragma: no cover
    host = os.getenv("DB_HOST")
    port = os.getenv("DB_PORT")
    service = os.getenv("DB_NAME")
    user = os.getenv("DB_USER")
    password = os.getenv("DB_PASSWORD")

    if not all([host, port, service, user, password]):
        missing = [
            var
            for var, val in {
                "DB_HOST": host,
                "DB_PORT": port,
                "DB_NAME": service,
                "DB_USER_": user,
                "DB_PASSWORD": password,
            }.items()
            if not val
        ]
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")

    dsn = oracledb.makedsn(host, port, service_name=service)
    return oracledb.connect(user=user, password=password, dsn=dsn)


def fetch_df() -> pd.DataFrame:
    with get_oracle_connection() as con:
        df = pd.read_sql(query_def, con)
    df.columns = df.columns.str.lower()
    return df


def cluster_and_plot(
    n_clusters: int = 5, perplexity: int = 45, random_state: int = 42
) -> None:
    df = fetch_df()

    X = np.vstack(df["vector"].to_numpy())
    X_scaled = StandardScaler().fit_transform(X)

    kmeans = KMeans(
        n_clusters=n_clusters, random_state=random_state, n_init="auto"
    ).fit(X_scaled)
    labels = kmeans.labels_

    cluster_genres = pd.DataFrame({"cluster": labels, "genre": df["track_genre"]})
    for i in range(n_clusters):
        cluster_data = cluster_genres[cluster_genres["cluster"] == i]
        print(f"Cluster {i}: {cluster_data['genre'].value_counts().head(3)}")

    u = TSNE(
        n_components=2, perplexity=perplexity, random_state=random_state
    ).fit_transform(X_scaled)

    fig = plt.figure()
    ax = fig.add_subplot(111)

    colors = labels.astype(str)
    colors[colors == "0"] = "red"
    colors[colors == "1"] = "blue"
    colors[colors == "2"] = "green"
    colors[colors == "3"] = "orange"
    colors[colors == "4"] = "black"

    ax.scatter(u[:, 0], u[:, 1], c=colors)
    ax.set_title("KMeans Cluster after t-SNE dimensionality Reduction")

    red_patch = mpatches.Patch(color="red", label="Opera/Guitar")
    blue_patch = mpatches.Patch(color="blue", label="Techno")
    green_patch = mpatches.Patch(color="green", label="Rock/Disco")
    orange_patch = mpatches.Patch(color="orange", label="Div")
    black_patch = mpatches.Patch(color="black", label="Pop")
    ax.legend(handles=[red_patch, blue_patch, green_patch, orange_patch, black_patch])

    fig.tight_layout()
    fig.savefig("kmeans_tsne.png", dpi=150)
    print("PNG written: kmeans_tsne.png")


if __name__ == "__main__":
    load_dotenv()
    warnings.filterwarnings("ignore", category=FutureWarning)
    cluster_and_plot()

SQL Code

Source of the spotify music data are datasets on Huggingface.

CREATE OR REPLACE DIRECTORY spotify_csv_dir AS '/opt/oracle/external/';
GRANT READ, WRITE ON DIRECTORY spotify_csv_dir TO ;   


DROP TABLE spotify_track_ext PURGE;
CREATE TABLE spotify_track_ext
(
  id                NUMBER,            
  track_id          VARCHAR2(50),
  artists           VARCHAR2(4000),
  album_name        VARCHAR2(400),
  track_name        VARCHAR2(400),
  popularity        NUMBER(5,3),       
  duration_ms       NUMBER,
  explicit          VARCHAR2(5),
  danceability      NUMBER,            
  energy            NUMBER,
  track_key         NUMBER,
  loudness          NUMBER,
  modality          NUMBER,
  speechiness       NUMBER,
  acousticness      NUMBER,
  instrumentalness  NUMBER,
  liveness          NUMBER,
  valence           NUMBER,
  tempo             NUMBER,
  time_signature    NUMBER,
  track_genre       VARCHAR2(100)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY spotify_csv_dir
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE CHARACTERSET AL32UTF8
    SKIP     1
    BADFILE  spotify_csv_dir:'dataset.bad'
    LOGFILE  spotify_csv_dir:'dataset.log'
    FIELDS CSV WITH EMBEDDED
  )
  LOCATION ('dataset.csv')
)
REJECT LIMIT UNLIMITED;


DROP TABLE spotify_track PURGE;

CREATE TABLE spotify_track
(
  id                NUMBER PRIMARY KEY,
  track_id          VARCHAR2(50),
  artists           VARCHAR2(4000),
  album_name        VARCHAR2(400),
  track_name        VARCHAR2(400),
  popularity        NUMBER(3),
  duration_ms       NUMBER,
  explicit          VARCHAR2(5),
  danceability      NUMBER(4,3),
  energy            NUMBER(4,3),
  track_key         NUMBER(2),       
  loudness          NUMBER(6,2),
  modality          NUMBER(1),
  speechiness       NUMBER(5,3),
  acousticness      NUMBER(5,3),
  instrumentalness  NUMBER(5,3),
  liveness          NUMBER(5,3),
  valence           NUMBER(5,3),
  tempo             NUMBER(6,2),
  time_signature    NUMBER(1),
  track_genre       VARCHAR2(100),
  vector            VECTOR(14)
);


INSERT /*+ APPEND */ INTO spotify_track
SELECT  id,
        track_id,
        artists,
        album_name,
        track_name,
        popularity,
        duration_ms,
        explicit,
        danceability,
        energy,
        track_key,
        loudness,
        modality,
        speechiness,
        acousticness,
        instrumentalness,
        liveness,
        valence,
        tempo,
        time_signature,
        track_genre,
        TO_VECTOR(
          '['||
            popularity        ||','||
            duration_ms       ||','||
            danceability      ||','||
            energy            ||','||
            track_key         ||','||
            loudness          ||','||
            modality          ||','||
            speechiness       ||','||
            acousticness      ||','||
            instrumentalness  ||','||
            liveness          ||','||
            valence           ||','||
            tempo             ||','||
            time_signature    ||']',
          14,        
          FLOAT32    
        )  AS vector
FROM    spotify_track_ext;

COMMIT;


SELECT COUNT(*) AS rows_loaded FROM spotify_track;

Trackbacks/Pingbacks

From Heart Rate to H3: Six Ways to Think About Your Running Data in Oracle - data.KISS - […] Cosine distance measures the angle between two vectors, not their length. It is appropriate when magnitude is irrelevant and…

Spotify music clustering with Oracle AI Vector: From vectors to visual insights

Why vectors belong in the database

The use case in one sentence

Machine Learning explained: K-Means and t-SNE