In today’s data-driven world, vector databases are available to handle complex, high-dimensional data. This article describes vector databases, including typical use cases and an example with the PostgreSQL extension pgvector.

What is a vector database?

A vector database is used to efficiently store and process vector data. A vector can be pictured as an arrow in space, like the two-dimensional vectors shown in the image below.

Vector

In machine learning, multidimensional vectors are called “embeddings” and are represented as arrays of numbers. For example, the word “Apple” is represented as a vector [0.234, -0.567] in the image above.

Text, audio, images, and videos can be represented as vectors. Each vector is a point in a multidimensional space. Machine learning models can be used to create embeddings, e.g.

  • Transformers for text data,
  • Residual networks for images,
  • Spectrogram-based models for audio data.

Each model generates embeddings with a different number of dimensions, e.g. OpenAI text-embedding-3-large has 3072 dimensions.

There are dedicated vector databases such as Pinecone. In addition, existing relational (e.g. PostgreSQL, Oracle) and NoSQL databases are being extended with vector functionality. The advantage of a vector database over a pure vector index is that vectors and business data can be stored in the same data store, whereas a dedicated vector index stores only the vectors and the business data has to be kept in another database. Vector databases offer specific features for handling vector data:

  • Vector data type to store vectors in columns (attributes). Vectors should never be stored in a text/string field but in columns with proper data types like vector (or array).
  • Functions for evaluating the vectors, e.g. similarity metrics to find the nearest neighbors. A variety of metrics exist for similarity search, such as Euclidean distance, cosine distance, dot product similarity, and Manhattan distance.
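To make the metrics listed above more concrete, here is a minimal sketch in plain Python. The function names are chosen for illustration only and are not part of any database API; the three vectors are the two-dimensional example values used later in this article.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of the absolute differences per dimension
    return sum(abs(x - y) for x, y in zip(a, b))

def dot_product(a, b):
    # Larger values mean the vectors point in a more similar direction
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for the same direction, 2 for opposite directions
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)

apple = [0.234, -0.567]
pear = [0.456, -0.678]
elephant = [-0.789, 0.345]

for name, other in (("pear", pear), ("elephant", elephant)):
    print(
        name,
        round(euclidean_distance(apple, other), 3),
        round(manhattan_distance(apple, other), 3),
        round(cosine_distance(apple, other), 3),
        round(dot_product(apple, other), 3),
    )
```

With each of the three distance functions, "pear" ends up closer to "apple" than "elephant" does. The dot product is a similarity rather than a distance, so there the larger value marks the closer match.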

Why – Use Cases

Typical use cases for vector databases are based on semantic/similarity search:

  • RAG (Retrieval-Augmented Generation)
    The output of an LLM can be improved if the prompt is enriched with additional, up-to-date context. This context can be retrieved from a vector database by a similarity search.
  • Recommendation systems
    Vector databases can store user preferences and product data in vector form to generate personalized recommendations. For example, a movie recommendation system can match a user’s viewing habits and preferences with a catalog of movies to make the most relevant suggestions.
  • Image and video search
    In image and video search, vector databases enable the storage of multimedia objects as vectors that are derived from the visual content using machine learning. Users can search for images or videos that are visually similar.
  • Natural Language Processing (NLP)
    Vector databases play a key role in NLP by supporting the processing and analysis of text data in the form of word or sentence vectors. Applications include semantic text search, chatbots, sentiment analysis, and automated summaries.
  • Anomaly detection
    Vector databases can be used to detect anomalies in real time, for example in IT network monitoring, preventive maintenance, and quality assurance. By analyzing operational data as vectors, unusual patterns can be identified that indicate problems or failures.
  • Chatbots
    A chatbot is a software application designed to simulate human-like conversation based on user inputs. It operates via text interfaces, offering automated responses that can range from answering FAQs to assisting in transactions. Chatbots leverage natural language processing and machine learning technologies to interpret questions and provide relevant, conversational answers, improving efficiency and user experience in digital services.
  • Voice assistants
    A voice assistant is an advanced digital helper that responds to voice commands and questions. These assistants use speech recognition, natural language understanding, and synthesis to interact in a human-like manner, often through smartphones, smart speakers, and other devices.
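Most of the use cases above share the same core step: rank stored items by similarity to a query vector and use the top hits. The following sketch illustrates this for the retrieval step of RAG. It is a simplified in-memory illustration, not a production design: the helper names `retrieve` and `build_prompt`, the sample texts, and the two-dimensional embeddings are all made up for this example, and a real system would obtain the embeddings from a model and run the search in the vector database.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, store, k=2):
    # Rank all stored chunks by similarity to the query and keep the top k
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]

def build_prompt(question, context_chunks):
    # Enrich the prompt with the retrieved context before it is sent to an LLM
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical store: texts with hand-made 2D embeddings
store = [
    {"text": "Apples are harvested in autumn.", "embedding": [0.9, 0.1]},
    {"text": "Pears ripen after picking.", "embedding": [0.8, 0.3]},
    {"text": "Elephants live in herds.", "embedding": [-0.7, 0.6]},
]

query_embedding = [0.9, 0.05]  # made-up embedding of the question below
chunks = retrieve(query_embedding, store, k=2)
prompt = build_prompt("When are apples harvested?", chunks)
print(prompt)
```

The same ranking step could serve a recommendation system by replacing the question embedding with a user-preference vector and the text chunks with product or movie vectors.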

How – Example usage

The following example shows the practical use of a vector database step by step. A PostgreSQL database with the pgvector extension is used.

First, PostgreSQL must be installed together with the pgvector extension. For a quick start, a Docker container is suitable for testing.

docker pull pgvector/pgvector:pg16

Next, the pgvector extension is enabled and a simple table is created that holds the vector embeddings together with their associated text.

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS test_embedding (
     id SERIAL PRIMARY KEY
   , embedding vector
   , text varchar(100)
   , created_at timestamptz DEFAULT now()
);

Then, data is inserted into the table. In this case, the arrays are hard-coded. Normally, those vectors would be calculated using e.g. an embedding model.

INSERT INTO test_embedding (embedding, text) VALUES (ARRAY[0.234, -0.567], 'Apple');
INSERT INTO test_embedding (embedding, text) VALUES (ARRAY[0.456, -0.678], 'Pear');
INSERT INTO test_embedding (embedding, text) VALUES (ARRAY[-0.789, 0.345], 'Elephant');
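When the vectors come from application code rather than hard-coded SQL, they have to be passed to the database as parameters. The sketch below shows one possible way to prepare them, using pgvector's text representation of a vector (for example `[0.234,-0.567]`). The helper names are hypothetical, and the `%s` placeholder style is an assumption that matches drivers such as psycopg; no database connection is opened here.

```python
def to_vector_literal(values):
    # pgvector accepts vectors in text form, e.g. "[0.234,-0.567]"
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Parameterized statement; the placeholder syntax depends on the driver (assumed: %s)
INSERT_SQL = "INSERT INTO test_embedding (embedding, text) VALUES (%s::vector, %s)"

def build_insert_params(rows):
    # rows: iterable of (embedding, text) pairs, e.g. produced by an embedding model
    return [(to_vector_literal(embedding), text) for embedding, text in rows]

rows = [
    ([0.234, -0.567], "Apple"),
    ([0.456, -0.678], "Pear"),
    ([-0.789, 0.345], "Elephant"),
]

params = build_insert_params(rows)
for p in params:
    print(INSERT_SQL, p)
```

With a driver that supports it, the statement and the parameter list could then be handed to a call such as `cursor.executemany(INSERT_SQL, params)`.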
Now let's run some queries with similarity search:

  • exact match with a query for “Apple”, using its embedding [0.234, -0.567]
  • similarity match with a query for the word “animal”, assuming [-0.999, 0.123] as its embedding

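The queries use the <-> operator, which computes the Euclidean distance. Although the result column is named similarity, it holds a distance: the smaller the value, the closer the match.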

SELECT text, embedding <-> ARRAY[0.234, -0.567]::vector AS similarity
FROM test_embedding
ORDER BY similarity
LIMIT 5;

SELECT text, embedding <-> ARRAY[-0.999, 0.123]::vector AS similarity
FROM test_embedding
ORDER BY similarity
LIMIT 5;
The output is shown below.
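The ranking produced by the two queries can also be cross-checked outside the database. The following sketch recomputes the Euclidean distances in plain Python for the three rows inserted above. It is only an illustration: the function name `nearest` is made up, and the values printed by PostgreSQL may differ slightly in the last digits because pgvector stores vector elements as single-precision floats.

```python
import math

# The three rows inserted into test_embedding
rows = {
    "Apple": [0.234, -0.567],
    "Pear": [0.456, -0.678],
    "Elephant": [-0.789, 0.345],
}

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, rows, limit=5):
    # Same idea as ORDER BY distance ... LIMIT 5: smallest distance first
    scored = [(text, euclidean_distance(vec, query)) for text, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1])[:limit]

apple_result = nearest([0.234, -0.567], rows)   # exact match for "Apple"
animal_result = nearest([-0.999, 0.123], rows)  # assumed embedding for "animal"

for label, result in (("apple", apple_result), ("animal", animal_result)):
    print(label)
    for text, distance in result:
        print(f"  {text}: {distance:.3f}")
```

For the first query, "Apple" comes first with a distance of zero, followed by "Pear" and "Elephant". For the second query, "Elephant" is the nearest neighbor, ahead of both fruits.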
An example that uses pgvector together with the language model BERT (Bidirectional Encoder Representations from Transformers) to compute embeddings can be found in my GitHub repository pgvector.

Summary

The example is deliberately simple and shows only the basic functionality of a vector database. By using vector operations, complex similarity searches can be performed efficiently. For the small data set used here, a text index would also have worked. However, vectors and similarity searches show their strength in semantic contexts, for example when a search for “animal” returns “Elephant” rather than a fruit such as “Apple”.

Relational databases such as PostgreSQL or Oracle are increasingly being extended with vector functions, so that the well-known and proven relational feature set remains available alongside the new functionality.