KNN Search with OpenSearch and OpenAI Embeddings: An In-Depth Guide

In this post, we use OpenSearch and OpenAI to build a KNN search UI with ReactiveSearch

K-nearest neighbors (KNN) search, also known as semantic search, is a simple and intuitive algorithm, although the topic can seem daunting if you haven't used it before. In this blog post, we will go from no familiarity with KNN to building a fully functional backend and search UI - using OpenSearch as the search engine, OpenAI for vector embeddings, and ReactiveSearch for cloud hosting of the backend and for the search UI components. KNN search is useful for a variety of tasks, such as improving search results, building recommendation systems, and clustering data. Some examples are below:

  1. Improve search results:

    KNN search can improve traditional search results by returning more relevant and accurate matches. This is particularly useful when the intent behind a query is hard to capture with exact keywords. For example, if a user searches "Not good dog food", traditional search might show results containing "good dog food", which is the opposite of the user's intention.

  2. Recommend items:
    It can be used to recommend items to users based on their preferences or behavior. For example, a recommendation system might use similarity search to find items that are similar to those that a user has previously purchased or viewed.

  3. Categorizing data into clusters:

    KNN search can group similar items together, allowing us to identify patterns and relationships in the data. For example, an e-commerce company might use clustering to group customers based on their purchase behavior. Customers who frequently purchase items in a specific category or price range can be grouped together in a cluster, while customers who tend to make one-time purchases can be grouped in a different cluster. The company can then create targeted marketing campaigns for each cluster, such as offering discounts or promotions on items that are popular within a particular cluster.

Visualizing data as a vector space


Before we implement KNN search, we need to understand a fundamental concept on which it is built: vectors. Vectors are simply arrays. While in programming we can have arrays of strings, objects, etc., vectors contain only numeric values, like [2, 3]. Hence, vectors are numeric arrays. The cool part about vectors is that we can plot them on a graph, find the distance between them, and see how close one is to another.
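
To make this concrete, below is a minimal sketch (plain JavaScript, for illustration only) of cosine similarity, the closeness measure we will later configure OpenSearch to use for KNN search:

// Cosine similarity: 1 means the vectors point in the same direction,
// 0 means they are unrelated, -1 means they point in opposite directions.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([2, 3], [3, 2]));  // ~0.92, close neighbours
console.log(cosineSimilarity([2, 3], [-3, 2])); // 0, unrelated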

We can't plot text or audio information on a graph - or can we? What if we transform the textual data into a vector? That is essentially what a machine learning model does: it takes real-world objects like text, audio, etc. and generates a vector. A collection of vectors is called a vector space. You can see a vector space of sentences below:

[Figure: a two-dimensional vector space of sentences. Source: DeepAI]

Above is a two-dimensional graph. Adding more dimensions to the vectors lets the model capture finer distinctions, so nearby points become more semantically similar. A real-world model can map data to thousands of dimensions.

What are we building?


We are going to build a search UI for a dataset of Amazon reviews of various products. This wouldn't be a traditional keyword search; instead, it would show the user semantically similar results. For example, if a user searches "Not good dog food", traditional search might show results containing "good dog food", which is likely not the intention. KNN search would instead surface similar results like "bad dog food", because it maps the two queries as nearest neighbours in the vector space.

The Building Blocks


OpenAI Vector Embeddings API: Model for vectorizing data

We use this API to convert our textual review data into vectors. You will need an API key to access the Embeddings API; you can get one by signing up on the OpenAI website.
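
Under the hood, the pipeline stage we use later calls OpenAI's Embeddings API. For reference, a direct call to that API looks roughly like the sketch below; text-embedding-ada-002 is assumed here because it returns 1536-dimensional vectors, matching the index mapping we create in the next section.

// Illustrative sketch of calling OpenAI's Embeddings API directly.
// The ReactiveSearch pipeline stages shown later make this call for us.
async function embed(text) {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    // model choice is an assumption; ada-002 produces 1536-dimensional vectors
    body: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
  });
  const json = await response.json();
  return json.data[0].embedding; // an array of 1536 numbers
}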

OpenSearch: Search index

We need a search index capable of performing KNN search on vectors. For this, we will use an OpenSearch index, since it doesn't share the limitation of some other search engines described in the note below.

💡 Note: Elasticsearch has a vector dimension limitation: a vector field in Elasticsearch can have a maximum of 1024 dimensions, but OpenAI embeddings have 1536 dimensions. Read more about this limitation in Elasticsearch on this open issue.

Update: The issue has since been addressed; once Elasticsearch ships a release that removes this limitation, we will publish a similar tutorial using Elasticsearch.

Reactivesearch: The Catalyst

We have all the tools, but we need infrastructure to make the above pieces work together without you having to wire everything up yourself. Reactivesearch.io provides hosting for an OpenSearch index, and you can also "Bring Your Own Cluster" (BYOC). We will also use a feature called pipelines, which helps organize steps like vectorizing data, indexing it, etc. into stages. This makes developing such an application much more efficient. ReactiveSearch also has a UI library, which we will use to build the app.

Let's dive into code


We are going to use Reactivesearch.io pipelines to build this app. If you are unfamiliar with them, you can read about them in the documentation. In short, pipelines let us perform operations on data in a series of stages, and the result can then be sent back as the response to the frontend.

Indexing pipeline:

Creating and configuring index:

First, we need to create an index that will hold the vectorized data of our product reviews, which we will populate using an indexing pipeline. Let's start by creating an index named amazon_reviews. The index needs to be aware of the vectorized form of the data, which is later used for KNN search, so we specify the field type as knn_vector along with a few additional settings. Also note that the field name is vector_data, which we will reference later when building the frontend app. You can run the curl script below from the terminal, substituting the reactivesearch_cloud_url of your cluster.

curl --location --request PUT 'https://{{reactivesearch_cloud_url}}/amazon_reviews' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings": {
        "knn": true,
        "knn.algo_param.ef_search": 100
    },
    "mappings": {
        "properties": {
            "vector_data": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib"
                }
            }
        }
    }
}'
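
For reference, once documents with a vector_data field are in this index, a raw KNN query against it would look roughly like the sketch below (using OpenSearch's knn query DSL). You won't need to issue this by hand; the search pipeline we build later constructs it from the user's query.

// Illustrative only: a raw KNN query against the amazon_reviews index.
// queryVector is assumed to be the 1536-number embedding of the search text;
// cluster authentication headers are omitted for brevity.
async function knnSearch(queryVector) {
  const response = await fetch(
    "https://{{reactivesearch_cloud_url}}/amazon_reviews/_search",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        size: 5,
        query: {
          knn: {
            vector_data: { vector: queryVector, k: 5 },
          },
        },
      }),
    }
  );
  return response.json();
}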

Anatomy of pipeline:

We need a pipeline that first converts the raw product data into vector form and then indexes it as a knn_vector field in the search index. With that in place, we are ready to perform KNN search.

Here is what our pipeline looks like as a config file. If you don't know how to set up an indexing pipeline using Reactivesearch.io, you can follow the documentation. You can paste the config below into the pipeline editor and it will be converted to a JSON config. You will need to supply your OpenAI API key in the envs field to make the pipeline work.

We want to vectorize the text data fields Text and Summary, and we specify that in the fetch embeddings stage.

Notice that we set outputKey to vector_data. This is the field where the text information will be stored as vectors.

enabled: true
description: Index pipeline to store vectorized data
routes:
  - path: /amazon_reviews/_doc
    method: POST
    classify:
      category: elasticsearch
      acl: index

envs:
  openAIApiKey: ${{ OPENAI_API_KEY }}

stages:
- id: authorize user
  use: authorization
- id: fetch embeddings
  use: openAIEmbeddingsIndex
  inputs:
    apiKey: "{{openAIApiKey}}"
    inputKeys:
    - Summary
    - Text
    outputKey: vector_data
  continueOnError: false
- id: index data
  use: elasticsearchQuery
  needs:
  - fetch embeddings

Indexing data:

Once the pipeline is deployed at the above route, we can index the data. We have created a script to index the data; you can follow its steps after creating the indexing pipeline above. A simplified sketch of what the script does for each document is shown below.
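
This is a rough, illustrative sketch only (the field values are made-up examples); the pipeline's fetch embeddings stage adds the vector_data field from Summary and Text before the document is written to the index.

// Sketch: index one review through the indexing pipeline route.
// Cluster authentication is omitted; substitute your own URL and credentials.
async function indexReview(review) {
  const response = await fetch(
    "https://{{reactivesearch_cloud_url}}/amazon_reviews/_doc",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(review),
    }
  );
  return response.json();
}

// Made-up sample review with the two fields the pipeline vectorizes
indexReview({
  Summary: "Great dog food",
  Text: "My dog loves this food and finishes the bowl every time.",
});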

Search pipeline:

We just indexed our textual data (product reviews) into the index. But when the user performs a search, the query is still text, not a vector, so we can't perform a KNN search without vectorizing the query value first. We also shouldn't index the search query using the indexing pipeline: the search query isn't a product review and will be different each time. Instead, we create a new pipeline that transforms the search query into a vector using the OpenAI embeddings and then passes it to the search index to perform the KNN search.

The config of this pipeline looks like this:

enabled: true
routes:
- path: "/amazon_reviews/_reactivesearch"
  method: POST
  classify:
    category: reactivesearch

envs:
  openAIApiKey: ${{ OPENAI_API_KEY }}

stages:
- id: authorize user
  use: authorization
- id: fetch embeddings
  use: openAIEmbeddings
  inputs:
    apiKey: "{{openAIApiKey}}"
    useWithReactiveSearchQuery: true
  continueOnError: false
- use: reactivesearchQuery
  needs:
  - fetch embeddings
  continueOnError: false
- use: elasticsearchQuery
  continueOnError: false

The diagram below shows what happens in the search phase. We already have the reviews inside the index (green dots). We take the search query from the user, vectorize it, and pass it to the search index, which places it near some reviews in the vector space (reviews 1 and 2). Those are the nearest neighbours, and the results sent back to the user reflect that: reviews 1 and 2 are ranked higher and appear before the other reviews. Note that we don't index the search query vector; it is temporary and will not be present for the next search.

Building the UI:

All the code we are going to show is available in the knn-search-demo GitHub repo. We also recommend going through the quick start guide of the ReactiveSearch UI library to get some familiarity with it. The guide shows not only how to set up a boilerplate React app but also how to set up the @appbaseio/reactivesearch library. We will primarily modify the src/App.jsx file, using the ReactiveBase, SearchBox, and ReactiveList components that the quick start guide already covers.

Connecting to the search pipeline:

In order to send the search query to the pipeline and get results back, we need to establish a connection with it. We do this using the endpoint property available on ReactiveBase. You can get the credentials from dash.reactivesearch.io. Also note that you have to set reactivesearchAPIConfig.recordAnalytics to false.

import React from "react";
import {
  ReactiveBase,
} from "@appbaseio/reactivesearch";

function Main() {
  const HOST_URL = "https://{{user}}:{{password}}@{{host}}/amazon_reviews/_reactivesearch"

  return (
    <ReactiveBase
      endpoint={{
        url: HOST_URL,
        method: "POST",
      }}
      reactivesearchAPIConfig={{
        recordAnalytics: false,
        userId: "jon",
      }}
    >
      {/* Search and ReactiveList component go here */}
    </ReactiveBase>
  );
}

const App = () => <Main />;

export default App;

Adding search and result component:

In our dataset, there are two fields we are mainly interested in: Summary and Text. Summary is a short description of the whole review (Text).

We set the dataField to Summary inside the SearchBox. It is used to show suggestions matching the search query (this is not a KNN search, but a regular search). We also control the value of the SearchBox component using the value and onChange props; this comes into play later, when we transform the search request. We use the debounce prop as well, so suggestions are fetched only after the user has stopped typing for a short interval.

// src/App.jsx
// `configProps` stands for the ReactiveBase props shown earlier (endpoint,
// reactivesearchAPIConfig); SUGGESTION_DEBOUNCE_DELAY is a numeric delay in ms.
import React, { useState } from "react";
import { ReactiveBase, SearchBox } from "@appbaseio/reactivesearch";

function Main() {
  const [searchValue, setSearchValue] = useState("");

  return (
    <ReactiveBase
      {...configProps}
    >
      <SearchBox
        dataField={["Summary"]}
        componentId="SearchComponent"
        size={5}
        showClear
        value={searchValue}
        debounce={SUGGESTION_DEBOUNCE_DELAY}
        onChange={(value) => {
          setSearchValue(value);
        }}
      />
    </ReactiveBase>
  );
}

Next, we add a component to show the results: ReactiveList. Make sure to set its react property to the componentId of the SearchBox component, i.e. "SearchComponent". We can then use the render prop to customize the look and feel of the results.


// src/App.jsx (continued) - ReactiveList is imported from
// "@appbaseio/reactivesearch" alongside the components above
function Main() {
  const [searchValue, setSearchValue] = useState("");

  return (
    <ReactiveBase {...configProps}>
      <ReactiveList
        componentId="SearchResult"
        dataField="Summary"
        size={12}
        pagination
        react={{ and: "SearchComponent" }}
        render={({ data }) => {
          return (
            <div className="mx-5 my-2">
                {data.map((item) => (
                  <div key={item._id}>
                    <h1>{item["Summary"]}</h1>
                    <p>{item["Text"]}</p>
                  </div>
                ))}
            </div>
          );
        }}
      />
    </ReactiveBase>
  );
}

Transforming the request:

Before we transform the request, let's clarify which network requests are performed and when.

  • Suggestion query: When the user types into the SearchBox without hitting Enter, they see a list of suggestions. Those suggestions are fetched by a network request whose type is suggestion. This query is debounced so that it fires only after the user has stopped typing for an interval, which can be controlled by passing a numeric value to the debounce prop.

  • Search query: When the user hits Enter or selects a suggestion, a query of type search is fired. The search query is fired for both components, SearchBox and ReactiveList. We can identify which component a query belongs to by looking at its id property, which matches the componentId of that component.

Now we are all set to transform the network requests. Our search pipeline expects a query of type search with a vectorDataField pointing at the field we indexed earlier. We also need to provide a value, which is the search text that will be vectorized and used to perform the KNN search. The rough structure of the component queries is below:

{
    "query": [
        {
            "id": "search",
            "type": "search",
            "dataField": [
                "Text",
                "Summary"
            ],
            "vectorDataField": "vector_data",
            "value": "good dog food",
            "excludeFields": [
                "vector_data"
            ]
        }
    ]
}

We can transform the network request by specifying a prop on ReactiveBase called transformRequest.

We change only the queries of type search, augmenting them with vectorDataField and the value of the search query. For the result component's query, we reuse the value from the search component's query (the controlled SearchBox value).

function Main() {
  const [searchValue, setSearchValue] = useState("");

  return (
    <ReactiveBase
      {...otherConfigProps}
      transformRequest={(req) => {
        const body = JSON.parse(req.body);
        // Transform query
        body.query = body.query.map((componentQuery) => {
          if (
            componentQuery.id === "SearchComponent" &&
            componentQuery.type === "search"
          ) {
            return { ...componentQuery, vectorDataField: "vector_data" };
          }
          if (
            componentQuery.id === "SearchResult" &&
            componentQuery.type === "search"
          ) {
            const searchQuery = body.query.find(
              (q) => q.id === "SearchComponent" && q.type === "search"
            );
            const searchValue = searchQuery.value;
            delete componentQuery.react;

            return {
              ...componentQuery,
              vectorDataField: "vector_data",
              value: searchValue,
            };
          }
          return componentQuery;
        });
        body.settings = {
          recordAnalytics: true,
          backend: "opensearch",
        };

        const newReq = { ...req, body: JSON.stringify(body) };
        return newReq;
      }}
    >
    </ReactiveBase>
  );
}

Finishing Up

Putting all of the above together and styling the app, we get the complete app below. We also add a few sample queries that can be used to kick off a search; since the value of the SearchBox is controlled, this is easy to do (see the sketch below). You can browse the CodeSandbox below; all the code is primarily in src/App.jsx.
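
As a rough idea of how sample queries can be wired up (the exact markup in the demo differs), selecting one is just a state update on the controlled SearchBox:

// Sketch: sample queries as buttons; clicking one fills the controlled SearchBox.
// The query strings reuse examples from this post; the demo app has its own list.
const SAMPLE_QUERIES = ["Not good dog food", "good dog food"];

function SampleQueries({ onSelect }) {
  return (
    <div>
      {SAMPLE_QUERIES.map((query) => (
        <button key={query} onClick={() => onSelect(query)}>
          {query}
        </button>
      ))}
    </div>
  );
}

// Usage inside Main: <SampleQueries onSelect={setSearchValue} />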

Summary

In this blog post, we discuss the implementation of K-nearest neighbors (KNN) search to improve search results, recommend items, and categorize data into clusters. We explain the concept of vector spaces and how they relate to KNN search. We then dive into the implementation of a search UI for a dataset of Amazon reviews, utilizing OpenAI's Vector Embeddings API, OpenSearch as the search engine backend, and ReactiveSearch for cloud hosting of OpenSearch and to build the UI for the app. We provide code examples and explanations for each step of the process, including indexing pipelines, search pipelines, and building the UI with ReactiveSearch components. Finally, we show how to transform the network request to complete the app.