SAP HANA Cloud Vector Engine
SAP HANA Cloud Vector Engine is a vector store fully integrated into the SAP HANA Cloud database.
You'll need to install langchain-community with pip install -qU langchain-community to use this integration.
Setting up
Install the HANA database driver (hdbcli).
# Pip install necessary package
%pip install --upgrade --quiet hdbcli
For OpenAIEmbeddings, we use the OpenAI API key from the environment.
import os
# Use OPENAI_API_KEY env variable
# os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
Create a database connection to a HANA Cloud instance.
from dotenv import load_dotenv
from hdbcli import dbapi
load_dotenv()
# Use connection settings from the environment
connection = dbapi.connect(
    address=os.environ.get("HANA_DB_ADDRESS"),
    port=os.environ.get("HANA_DB_PORT"),
    user=os.environ.get("HANA_DB_USER"),
    password=os.environ.get("HANA_DB_PASSWORD"),
    autocommit=True,
    sslValidateCertificate=False,
)
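Because os.environ.get returns None for an unset variable, a missing setting only shows up when the connection attempt fails. The following is a minimal sketch of a pre-flight check, assuming the four variable names used above. The helper read_hana_settings is hypothetical and not part of hdbcli or LangChain, and the values passed to it are placeholders.

```python
from typing import Mapping

REQUIRED_VARS = (
    "HANA_DB_ADDRESS",
    "HANA_DB_PORT",
    "HANA_DB_USER",
    "HANA_DB_PASSWORD",
)


def read_hana_settings(env: Mapping[str, str]) -> dict:
    """Collect the connection settings and fail early if any is missing.

    Hypothetical helper for illustration only.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "address": env["HANA_DB_ADDRESS"],
        "port": int(env["HANA_DB_PORT"]),  # environment values are strings
        "user": env["HANA_DB_USER"],
        "password": env["HANA_DB_PASSWORD"],
    }


# Stand-in mapping instead of os.environ, with placeholder values
settings = read_hana_settings(
    {
        "HANA_DB_ADDRESS": "example.hana.invalid",
        "HANA_DB_PORT": "443",
        "HANA_DB_USER": "DEMO_USER",
        "HANA_DB_PASSWORD": "not-a-real-password",
    }
)
print(settings["port"])
```

In a real session you would pass os.environ and unpack the result into the connect call together with the remaining keyword arguments shown above.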
Example
Load the sample document "state_of_the_union.txt" and create chunks from it.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores.hanavector import HanaDB
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
text_documents = TextLoader("../../how_to/state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
text_chunks = text_splitter.split_documents(text_documents)
print(f"Number of document chunks: {len(text_chunks)}")
embeddings = OpenAIEmbeddings()
Number of document chunks: 88
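To give a feel for what a character-based splitter does with chunk_size and no overlap, here is a simplified sketch. It is not LangChain's implementation: it assumes a blank-line separator, ignores overlap, and keeps an oversized piece whole. The function name split_by_separator is made up for this illustration.

```python
from typing import List


def split_by_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole rather than cut.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it starts a new chunk
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_by_separator(sample, chunk_size=10)
print(f"Number of document chunks: {len(chunks)}")
```

With chunk_size=10, the first two paragraphs fit into one chunk and the last two into another, so the sample yields two chunks.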
Create a LangChain VectorStore interface for the HANA database and specify the table (collection) to use for accessing the vector embeddings.
db = HanaDB(
    embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)
Add the loaded document chunks to the table. For this example, we delete any previous content from the table which might exist from previous runs.
# Delete already existing documents from the table
db.delete(filter={})
# add the loaded document chunks
db.add_documents(text_chunks)
[]
Perform a query to get the two best-matching document chunks from the ones that were added in the previous step. By default "Cosine Similarity" is used for the search.
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
Query the same content with "Euclidean Distance". The results should be the same as with "Cosine Similarity".
from langchain_community.vectorstores.utils import DistanceStrategy
db = HanaDB(
    embedding=embeddings,
    connection=connection,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
    table_name="STATE_OF_THE_UNION",
)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
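The agreement between the two strategies is expected when the embedding vectors have unit length, because then ||a - b||^2 = 2 - 2 * cos(a, b), so a higher cosine similarity always means a smaller Euclidean distance. The sketch below checks this on small hand-made vectors. That the embedding model returns normalized vectors is an assumption here, and the vectors themselves are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative vectors, normalized to unit length
query = normalize([1.0, 2.0, 0.5])
candidates = {
    "a": normalize([1.0, 1.8, 0.4]),
    "b": normalize([-1.0, 0.5, 2.0]),
    "c": normalize([0.2, 2.0, 1.5]),
}

# Rank by descending similarity and by ascending distance
by_cosine = sorted(
    candidates, key=lambda k: cosine_similarity(query, candidates[k]), reverse=True
)
by_euclidean = sorted(candidates, key=lambda k: euclidean_distance(query, candidates[k]))
print(by_cosine)
print(by_euclidean)
```

For vectors that are not normalized, the two rankings can differ, so the equivalence should not be assumed for every embedding model.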
Maximal Marginal Relevance Search (MMR)
Maximal marginal relevance optimizes for similarity to the query and diversity among the selected documents. The first 20 (fetch_k) items are retrieved from the database. The MMR algorithm then selects the best 2 (k) matches from them.
docs = db.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
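The MMR result above differs from plain similarity search in its second hit, because a chunk that closely resembles an already selected one is penalized. The following is a minimal sketch of the selection step over pre-fetched candidate vectors. It is a generic formulation of MMR rather than the code HanaDB runs, and the weighting parameter lambda_mult with its value of 0.5 is an assumption made for this illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already selected (0 if nothing is)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.2],   # closest to the query
    [1.0, 0.21],  # near-duplicate of the first candidate
    [1.0, -0.5],  # less similar to the query, but different from the first
]
print(mmr(query, candidates, k=2))  # → [0, 2]
```

A plain top-2 similarity search would return the first two candidates. MMR skips the near-duplicate in favor of the third vector.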
Creating an HNSW Vector Index
A vector index can significantly speed up top-k nearest neighbor queries for vectors. Users can create a Hierarchical Navigable Small World (HNSW) vector index using the create_hnsw_index function.
For more information about creating an index at the database level, please refer to the official documentation.
# HanaDB instance uses cosine similarity as default:
db_cosine = HanaDB(
    embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)
# Attempting to create the HNSW index with default parameters
db_cosine.create_hnsw_index() # If no other parameters are specified, the default values will be used
# Default values: m=64, ef_construction=128, ef_search=200
# The default index name will be: STATE_OF_THE_UNION_COSINE_SIMILARITY_IDX (verify this naming pattern in HanaDB class)
# Creating a HanaDB instance with L2 distance as the similarity function and defined values
db_l2 = HanaDB(
    embedding=embeddings,
    connection=connection,
    table_name="STATE_OF_THE_UNION",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,  # Specify L2 distance
)
# This will create an index based on L2 distance strategy.
db_l2.create_hnsw_index(
    index_name="STATE_OF_THE_UNION_L2_index",
    m=100,  # Max number of neighbors per graph node (valid range: 4 to 1000)
    ef_construction=200,  # Max number of candidates during graph construction (valid range: 1 to 100000)
    ef_search=500,  # Min number of candidates during the search (valid range: 1 to 100000)
)
# Use L2 index to perform MMR
docs = db_l2.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
Key Points:
- Similarity Function: The similarity function for the index is cosine similarity by default. If you want to use a different similarity function (e.g., L2 distance), you need to specify it when initializing the HanaDB instance.
- Default Parameters: In the create_hnsw_index function, if the user does not provide custom values for parameters like m, ef_construction, or ef_search, the default values (e.g., m=64, ef_construction=128, ef_search=200) will be used automatically. These values ensure the index is created with reasonable performance without requiring user intervention.
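The valid ranges and default values mentioned above can be checked on the client side before the index is created. The helper below is a hypothetical sketch and not part of HanaDB. It only encodes the ranges and defaults stated in the comments of the example code.

```python
# Valid ranges as stated in the example comments above
HNSW_RANGES = {
    "m": (4, 1000),
    "ef_construction": (1, 100000),
    "ef_search": (1, 100000),
}


def check_hnsw_params(m: int = 64, ef_construction: int = 128, ef_search: int = 200) -> dict:
    """Return the parameters as a dict, or raise if one is out of range.

    Hypothetical helper; the defaults mirror the values quoted above.
    """
    params = {"m": m, "ef_construction": ef_construction, "ef_search": ef_search}
    for name, value in params.items():
        low, high = HNSW_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the valid range {low} to {high}")
    return params


print(check_hnsw_params())
print(check_hnsw_params(m=100, ef_construction=200, ef_search=500))
```

The returned dictionary can be unpacked into create_hnsw_index as keyword arguments, so an out-of-range value is reported before any statement reaches the database.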
Basic Vectorstore Operations
db = HanaDB(
    connection=connection, embedding=embeddings, table_name="LANGCHAIN_DEMO_BASIC"
)
# Delete already existing documents from the table
db.delete(filter={})
True
We can add simple text documents to the existing table.
docs = [Document(page_content="Some text"), Document(page_content="Other docs")]
db.add_documents(docs)
[]
Add documents with metadata.
docs = [
    Document(
        page_content="foo",
        metadata={"start": 100, "end": 150, "doc_name": "foo.txt", "quality": "bad"},
    ),
    Document(
        page_content="bar",
        metadata={"start": 200, "end": 250, "doc_name": "bar.txt", "quality": "good"},
    ),
]
db.add_documents(docs)
[]
Query documents with specific metadata.
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
# With filtering on "quality"=="bad", only one document should be returned
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
    print(doc.metadata)
--------------------------------------------------------------------------------
foo
{'start': 100, 'end': 150, 'doc_name': 'foo.txt', 'quality': 'bad'}
Delete documents with specific metadata.
db.delete(filter={"quality": "bad"})
# Now the similarity search with the same filter will return no results
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
print(len(docs))
0
Advanced filtering
In addition to the basic value-based filtering capabilities, it is possible to use more advanced filtering. The table below shows the available filter operators.
Operator | Semantics |
---|---|
$eq | Equality (==) |
$ne | Inequality (!=) |
$lt | Less than (<) |
$lte | Less than or equal (<=) |
$gt | Greater than (>) |
$gte | Greater than or equal (>=) |
$in | Contained in a set of given values (in) |
$nin | Not contained in a set of given values (not in) |
$between | Between the range of two boundary values |
$like | Text equality based on the "LIKE" semantics in SQL (using "%" as wildcard) |
$and | Logical "and", supporting 2 or more operands |
$or | Logical "or", supporting 2 or more operands |
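To make the operator semantics in the table concrete, here is a small in-memory evaluator that applies a filter to a metadata dictionary. It is only a sketch of the semantics: HanaDB evaluates filters in the database, and the function names matches and like_to_regex are hypothetical. The sketch treats $between as inclusive on both ends and handles only the "%" wildcard for $like.

```python
import re
from typing import Any, Dict


def like_to_regex(pattern: str) -> "re.Pattern":
    """Translate a LIKE pattern ("%" = any run of characters) into a regex."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.DOTALL)


OPERATORS = {
    "$eq": lambda value, arg: value == arg,
    "$ne": lambda value, arg: value != arg,
    "$lt": lambda value, arg: value < arg,
    "$lte": lambda value, arg: value <= arg,
    "$gt": lambda value, arg: value > arg,
    "$gte": lambda value, arg: value >= arg,
    "$in": lambda value, arg: value in arg,
    "$nin": lambda value, arg: value not in arg,
    "$between": lambda value, arg: arg[0] <= value <= arg[1],
    "$like": lambda value, arg: like_to_regex(arg).match(value) is not None,
}


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"id": {"$gt": 1}}
            if key not in metadata:
                return False
            for op, arg in condition.items():
                if not OPERATORS[op](metadata[key], arg):
                    return False
        elif metadata.get(key) != condition:
            # Plain value form, e.g. {"quality": "bad"}, treated as equality
            return False
    return True


rows = [
    {"name": "adam", "is_active": True, "id": 1, "height": 10.0},
    {"name": "bob", "is_active": False, "id": 2, "height": 5.7},
    {"name": "jane", "is_active": True, "id": 3, "height": 2.4},
]


def names(flt: Dict[str, Any]) -> list:
    return [row["name"] for row in rows if matches(row, flt)]


print(names({"id": {"$between": (1, 2)}}))
print(names({"name": {"$like": "%a%"}}))
print(names({"$or": [{"id": 1}, {"name": "bob"}]}))
```

An empty filter matches every row in this sketch, which mirrors how delete(filter={}) is used in this guide to clear a table.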
# Prepare some test documents
docs = [
    Document(
        page_content="First",
        metadata={"name": "adam", "is_active": True, "id": 1, "height": 10.0},
    ),
    Document(
        page_content="Second",
        metadata={"name": "bob", "is_active": False, "id": 2, "height": 5.7},
    ),
    Document(
        page_content="Third",
        metadata={"name": "jane", "is_active": True, "id": 3, "height": 2.4},
    ),
]
db = HanaDB(
    connection=connection,
    embedding=embeddings,
    table_name="LANGCHAIN_DEMO_ADVANCED_FILTER",
)
# Delete already existing documents from the table
db.delete(filter={})
db.add_documents(docs)
# Helper function for printing filter results
def print_filter_result(result):
    if len(result) == 0:
        print("<empty result>")
    for doc in result:
        print(doc.metadata)
Filtering with $ne, $gt, $gte, $lt, $lte
advanced_filter = {"id": {"$ne": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$gt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$gte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$lt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$lte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$ne': 1}}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Filter: {'id': {'$gt': 1}}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Filter: {'id': {'$gte': 1}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Filter: {'id': {'$lt': 1}}
<empty result>
Filter: {'id': {'$lte': 1}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
Filtering with $between, $in, $nin
advanced_filter = {"id": {"$between": (1, 2)}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$in": ["adam", "bob"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$nin": ["adam", "bob"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$between': (1, 2)}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$in': ['adam', 'bob']}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$nin': ['adam', 'bob']}}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Text filtering with $like
advanced_filter = {"name": {"$like": "a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$like": "%a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'name': {'$like': 'a%'}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
Filter: {'name': {'$like': '%a%'}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Combined filtering with $and, $or
advanced_filter = {"$or": [{"id": 1}, {"name": "bob"}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"$and": [{"id": 1}, {"id": 2}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"$or": [{"id": 1}, {"id": 2}, {"id": 3}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'$or': [{'id': 1}, {'name': 'bob'}]}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'$and': [{'id': 1}, {'id': 2}]}
<empty result>
Filter: {'$or': [{'id': 1}, {'id': 2}, {'id': 3}]}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}