The article uses Oracle AI Vector for a use case to cluster and visualize vectors. In my previous article I showed how to enable Oracle AI Vector, create a table with a vector and query it with simple SQL. Today we’ll run a more colorful experiment: storing Spotify track embeddings in Oracle AI vector, pulling them into Python, and visualizing songs on a single canvas. The source code is at the end of the article.
Why vectors belong in the database
Search that understands meaning rather than keywords is reshaping every data platform. Once a piece of content—text, image, audio—has been converted into a numeric embedding vector, you can:
-
Find similar items even if they share no common words (semantic search).
-
Detect anomalies by measuring the distance from “normal” vectors.
-
Cluster users or products into behavior-based segments.
-
Feed downstream models without leaving the database.
The use case in one sentence
“Can we group thousands of songs by sonic similarity—independent of explicit genre tags—and see what that community structure looks like?”
That is exactly what a streaming service’s “Fans also like” feature does under the hood. I will reproduce a tiny slice of that pipeline using two classic machine-learning tools:
Step | Tool | Simplified purpose | Result |
---|---|---|---|
Dimensionality reduction | t-SNE | Compress vectors down to 2 (or 3) visually interpretable axes | A cloud-like scatter plot |
Clustering | K-Means | Assign each song to the nearest of k centroids | 5 colour-coded groups |
Machine Learning explained: K-Means and t-SNE
My Python example below uses two machine learning algorithms: K-Means and t-SNE. The following explanations are simplified but show the principles used in the program.
K-Means Clustering: “The digital music sorter”
K-Means works like a very systematic record dealer who sorts his collection into categories:
- Starting point: the algorithm randomly places 5 “sorting centers” in the music space
- Assignment: Each song is assigned to the nearest center
- Adjustment: The centers move to the center of their assigned songs
- Repetition: This process is repeated until stable groups are formed
The result is a set of clearly defined music clusters.
t-SNE: “The digital music cartographer”
t-SNE (t-distributed Stochastic Neighbor Embedding) is like a cartographer that projects a multidimensional musical landscape onto a 2D map:
- The problem: Spotify vectors have hundreds of dimensions – impossible to visualize
- The solution: t-SNE “flattens” these dimensions while preserving the neighborhood relationships
- The result: songs that are similar in high-dimensional space also appear close to each other on the 2D map
Unlike PCA, t-SNE does not preserve the global structure, but focuses on local similarities – perfect for discovering music clusters.
The technical implementation
My Python program combines the following techniques that could be arranged into a data pipeline:
Database query
Connect to the database and get the data from the database by an SQL query. The query retrieves vectors for different genres from the Oracle AI Vector database. Each vector represents the complex audio characteristics like danceability, energy, speechiness, loudness, etc. of a song.
Data preprocessing and clustering
Standardization ensures that all vector dimensions are treated equally. K-Means then groups the songs into 5 clusters based on their similarity in high-dimensional space.
X = np.vstack(df[“vector”].to_numpy())
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=5, random_state=42).fit(X_scaled)
Visualization with t-SNE
t-SNE projects the high-dimensional vectors onto a 2D plane, with the perplexity parameter controlling the balance between local and global structure.
u = TSNE(n_components=2, perplexity=45, random_state=42).fit_transform(X_scaled)
The results
The resulting diagram shows fascinating insights into the music landscape:
Cluster analysis:
- Techno (Blue): Forms compact, clearly delineated areas spread across the entire space
- Opera/Guitar (Red): Interesting combination that unites acoustic and classical elements
- Rock/Disco (Green): Shows the rhythmic connections between these genres
- Pop (Black): The largest cluster, reflecting the diversity of the pop genre
- Div (Orange): An isolated cluster that may contain cross-genre songs

Why keep vectors inside an RDBMS like Oracle?
There are many advantages to using established technologies like RDBMS for vectors, e.g.:
-
One source of truth – No siloed feature store to sync.
-
SQL + ML synergy – Combining filtering with vector search.
-
Security & governance – Security features can be applied to vectors, too.
-
Scalability – HNSW or IVF indexes can handle millions of vectors; Python only needs a working-set slice.
Summary & Takeaways
This article demonstrates how to use K-Means for clustering and t-SNE for dimensionality reduction to visualize high-dimensional vectors. Vectors are becoming first-class citizens in modern data architecture, and RDBMS like Oracle AI Vector makes them accessible via everyday SQL.
-
t-SNE visualizations help domain experts trust your embeddings before deploying them to production.
-
K-Means remains a solid baseline for segmentation—fast, interpretable, and effective for initial results.
What’s next?
Future work could include multimodal vectors (that include lyrics), LLM-enhanced clustering, or integrating user behavior for even richer clustering. Stay tuned—and happy clustering!
Python Code
from __future__ import annotations
import os
import warnings
from textwrap import dedent
import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import oracledb
import pandas as pd
from dotenv import load_dotenv
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
matplotlib.use("Agg")
query_def = dedent(
"""
SELECT vector,
track_genre,
track_name,
artists
FROM spotify_track
WHERE track_genre IN ('techno', 'opera', 'disco',
'pop', 'rock', 'guitar')
--FETCH FIRST 10000 ROWS ONLY
"""
).strip()
def get_oracle_connection() -> oracledb.Connection: # pragma: no cover
host = os.getenv("DB_HOST")
port = os.getenv("DB_PORT")
service = os.getenv("DB_NAME")
user = os.getenv("DB_USER")
password = os.getenv("DB_PASSWORD")
if not all([host, port, service, user, password]):
missing = [
var
for var, val in {
"DB_HOST": host,
"DB_PORT": port,
"DB_NAME": service,
"DB_USER_": user,
"DB_PASSWORD": password,
}.items()
if not val
]
raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")
dsn = oracledb.makedsn(host, port, service_name=service)
return oracledb.connect(user=user, password=password, dsn=dsn)
def fetch_df() -> pd.DataFrame:
with get_oracle_connection() as con:
df = pd.read_sql(query_def, con)
df.columns = df.columns.str.lower()
return df
def cluster_and_plot(
n_clusters: int = 5, perplexity: int = 45, random_state: int = 42
) -> None:
df = fetch_df()
X = np.vstack(df["vector"].to_numpy())
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(
n_clusters=n_clusters, random_state=random_state, n_init="auto"
).fit(X_scaled)
labels = kmeans.labels_
cluster_genres = pd.DataFrame({"cluster": labels, "genre": df["track_genre"]})
for i in range(n_clusters):
cluster_data = cluster_genres[cluster_genres["cluster"] == i]
print(f"Cluster {i}: {cluster_data['genre'].value_counts().head(3)}")
u = TSNE(
n_components=2, perplexity=perplexity, random_state=random_state
).fit_transform(X_scaled)
fig = plt.figure()
ax = fig.add_subplot(111)
colors = labels.astype(str)
colors[colors == "0"] = "red"
colors[colors == "1"] = "blue"
colors[colors == "2"] = "green"
colors[colors == "3"] = "orange"
colors[colors == "4"] = "black"
ax.scatter(u[:, 0], u[:, 1], c=colors)
ax.set_title("KMeans Cluster after t-SNE dimensionality Reduction")
red_patch = mpatches.Patch(color="red", label="Opera/Guitar")
blue_patch = mpatches.Patch(color="blue", label="Techno")
green_patch = mpatches.Patch(color="green", label="Rock/Disco")
orange_patch = mpatches.Patch(color="orange", label="Div")
black_patch = mpatches.Patch(color="black", label="Pop")
ax.legend(handles=[red_patch, blue_patch, green_patch, orange_patch, black_patch])
fig.tight_layout()
fig.savefig("kmeans_tsne.png", dpi=150)
print("PNG written: kmeans_tsne.png")
if __name__ == "__main__":
load_dotenv()
warnings.filterwarnings("ignore", category=FutureWarning)
cluster_and_plot()
SQL Code
Source of the spotify music data are datasets on Huggingface.
CREATE OR REPLACE DIRECTORY spotify_csv_dir AS '/opt/oracle/external/';
GRANT READ, WRITE ON DIRECTORY spotify_csv_dir TO ;
DROP TABLE spotify_track_ext PURGE;
CREATE TABLE spotify_track_ext
(
id NUMBER,
track_id VARCHAR2(50),
artists VARCHAR2(4000),
album_name VARCHAR2(400),
track_name VARCHAR2(400),
popularity NUMBER(5,3),
duration_ms NUMBER,
explicit VARCHAR2(5),
danceability NUMBER,
energy NUMBER,
track_key NUMBER,
loudness NUMBER,
modality NUMBER,
speechiness NUMBER,
acousticness NUMBER,
instrumentalness NUMBER,
liveness NUMBER,
valence NUMBER,
tempo NUMBER,
time_signature NUMBER,
track_genre VARCHAR2(100)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY spotify_csv_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE CHARACTERSET AL32UTF8
SKIP 1
BADFILE spotify_csv_dir:'dataset.bad'
LOGFILE spotify_csv_dir:'dataset.log'
FIELDS CSV WITH EMBEDDED
)
LOCATION ('dataset.csv')
)
REJECT LIMIT UNLIMITED;
DROP TABLE spotify_track PURGE;
CREATE TABLE spotify_track
(
id NUMBER PRIMARY KEY,
track_id VARCHAR2(50),
artists VARCHAR2(4000),
album_name VARCHAR2(400),
track_name VARCHAR2(400),
popularity NUMBER(3),
duration_ms NUMBER,
explicit VARCHAR2(5),
danceability NUMBER(4,3),
energy NUMBER(4,3),
track_key NUMBER(2),
loudness NUMBER(6,2),
modality NUMBER(1),
speechiness NUMBER(5,3),
acousticness NUMBER(5,3),
instrumentalness NUMBER(5,3),
liveness NUMBER(5,3),
valence NUMBER(5,3),
tempo NUMBER(6,2),
time_signature NUMBER(1),
track_genre VARCHAR2(100),
vector VECTOR(14)
);
INSERT /*+ APPEND */ INTO spotify_track
SELECT id,
track_id,
artists,
album_name,
track_name,
popularity,
duration_ms,
explicit,
danceability,
energy,
track_key,
loudness,
modality,
speechiness,
acousticness,
instrumentalness,
liveness,
valence,
tempo,
time_signature,
track_genre,
TO_VECTOR(
'['||
popularity ||','||
duration_ms ||','||
danceability ||','||
energy ||','||
track_key ||','||
loudness ||','||
modality ||','||
speechiness ||','||
acousticness ||','||
instrumentalness ||','||
liveness ||','||
valence ||','||
tempo ||','||
time_signature ||']',
14,
FLOAT32
) AS vector
FROM spotify_track_ext;
COMMIT;
SELECT COUNT(*) AS rows_loaded FROM spotify_track;