Introducing the analytical vector database

Every business or organization has unstructured data. Banks log their customer service calls. E-commerce marketplaces have millions of customer reviews. Intelligence agencies have untold amounts of satellite imagery. Even the humble New York bodega is now a generator of unstructured data in the form of customer reviews on Google and elsewhere.
But generating quantitative insight from such data has historically been a hard problem. Until recently, this data was not readable by machines, so organizations had to find workarounds to query it. A common method is filing or “tagging” pieces of unstructured data. This is partly why you are asked whether you are calling about a billing dispute or a service interruption when you dial customer service. Sure, it’s to route you to the right customer service agent, but it also enables downstream analysis of how many calls of a particular type are coming through.
For large organizations with power over the customer experience and the ability to force customers to tag their needs, this is a scalable, albeit imperfect, solution. But where there is less power over the customer experience, or where an organization is small, tagging might not be an option. In those cases, manual review becomes necessary. This is time-consuming, leads to inconsistencies across reviewers, and makes it difficult to join unstructured data with structured data to answer questions like how customer review performance correlates with revenue.
But the arrival of large AI models from companies like OpenAI (the maker of ChatGPT) and Cohere is set to change all of that. While these models have wowed us with their ability to generate text and images, one of their superpowers is to reverse that process and create machine-readable representations of these forms of unstructured data. Known as vector or neural embeddings, this new datatype makes it possible to extract rigorous quantitative insights from unstructured data. With a little help from NNext, of course!

Remind me, what are vector embeddings?

This article is meant to be accessible to all readers, so we won’t get too technical here. If you want to get into the details, you can find plenty more information on our wiki.
To understand vector embeddings, consider a simple x,y axis (or Cartesian plane, if you prefer).
Any point on the plane can be represented by two numbers: one for the x position and one for the y position. And you can imagine assigning the points meaning. Consider product reviews where customers award a score out of 10 for both the quality of the product and its value for money. A really high quality product that is well priced might be (8, 9). A decent quality product that is drastically overpriced might be (5, 2).
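To make this concrete, here is a minimal Python sketch that treats each review as a point on the plane and measures how far apart two reviews are. The third review, its scores and the function name are made up for illustration; only the points (8, 9) and (5, 2) come from the example above.

```python
import math

# Each review is a point: (quality score, value-for-money score), both out of 10.
reviews = {
    "high quality, well priced": (8, 9),
    "decent quality, overpriced": (5, 2),
    "good quality, fairly priced": (7, 8),  # hypothetical extra review
}


def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)


# Compare every review against the first one.
base = reviews["high quality, well priced"]
for label, point in reviews.items():
    print(f"{label}: {distance(base, point):.2f}")
```

Reviews that express a similar opinion sit close together on the plane, and reviews that disagree sit far apart. That is the intuition vector embeddings build on.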
Vector embeddings are just like these points, but instead of just two axes (or, in this case, “dimensions”), they have hundreds or even thousands of dimensions. OpenAI’s text-embedding-ada-002 model, for example, returns embeddings with 1,536 dimensions.
What’s valuable about having these many-dimensional coordinates is that they enable vector embeddings to represent unstructured data in a way that’s readable by machines, while preserving much of the meaning embedded in the unstructured data itself. In other words, when you pass the phrase “this was overpriced” and the phrase “too expensive given the quality” to a large AI model, the vector embeddings the model returns should be very similar to one another. That’s despite those two phrases not having any words in common. Cool!
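One common way to measure how similar two embeddings are is cosine similarity, which compares the direction of two vectors. The sketch below uses tiny, made-up 4-dimensional vectors as stand-ins for real embeddings. The numbers are invented for illustration and do not come from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real model would return hundreds of dimensions.
overpriced = [0.9, 0.1, 0.8, 0.2]     # "this was overpriced"
too_expensive = [0.8, 0.2, 0.9, 0.1]  # "too expensive given the quality"
great_fit = [0.1, 0.9, 0.2, 0.8]      # an unrelated phrase, e.g. "fits perfectly"

same_meaning = cosine_similarity(overpriced, too_expensive)
different_meaning = cosine_similarity(overpriced, great_fit)
print(f"similar phrases:   {same_meaning:.3f}")
print(f"unrelated phrases: {different_meaning:.3f}")
```

With these toy values, the two phrases about price score much higher than the unrelated pair, even though no words are compared at any point.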

Why is this useful?

This is useful because once unstructured data is converted into vector embeddings, it can be subject to what is called semantic search. This is just a technical way of saying you can search for things based on meaning.
Let’s stay with the product review example. Imagine you have a body of product reviews and you want to find all of the ones that reference bad quality, so you can understand what specifically makes a product bad quality in the eyes of a consumer. For many people in this situation, the best they can do is “ctrl + f” and search for various combinations like “bad quality” or “not good quality”. Ultimately they won’t be able to come up with all the variations, and some important insights will get missed. Not to mention, if they try the same exercise in 12 months’ time, they might search on different words and get inconsistent results.
Vector embeddings change all that by enabling you to submit a query and then search for pieces of unstructured data that are semantically equivalent or close to that query. A publicly available dataset from Kaggle, based on Amazon reviews for women’s clothing, can help us demonstrate the power of semantic search using vector embeddings.
Imagine we want to find out how big a problem it is for items to quickly become damaged after being washed or worn. Before vector embeddings, we likely would have filtered to one- or two-star reviews and then read them individually to identify the ones that highlight this issue. This would be highly manual and inaccurate.
But with vector embeddings, we can just ask our data for all the reviews that talk about items being “easily damaged when washed or worn”. From our sample dataset we get the following top five results:
1. This fell apart after wearing it once. for the price, it should hold up better. it's too delicate.
2. Bought this shirt in mid-july, washed twice (cold, delicate) and hung to dry. already holes/falling apart
3. Bought this early in the season - probably washed and worn <8 times and there are small holes that have formed in the front towards the bottom...bummer!
4. Cute top but started disintegrating after two washes. the delicate cutouts in the top tore apart. i machine washed it in the gentle cycle and didn't use the dryer. disappointed to have to discard the top after just buying it.
5. Love the color and style, but material snags easily
There are a few things worth observing in this sample. Consider the first result. It clearly refers to the issue we have identified: items becoming quickly damaged after wearing or washing. But it doesn’t use any of the keywords from the query we submitted. To reiterate, that’s because vector embeddings let us search on the meaning of unstructured data. Now look at the fifth result. It refers to material snagging easily, which at first glance suggests a false positive in the results. But if material can snag easily, it is liable to be damaged quickly when worn, which is exactly the issue we are trying to investigate.
Clearly, vector embeddings provide us with powerful tools for searching unstructured data.
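For readers who want to see the mechanics, the search described above can be sketched in a few lines of Python: score every review against the query by cosine similarity, then keep the highest-scoring ones. This is an illustrative sketch, not how NNext is implemented. The 3-dimensional vectors are invented stand-ins for real embeddings, and the last two reviews are hypothetical rather than taken from the Kaggle dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# (review text, made-up embedding). A real system would get these from a model.
documents = [
    ("This fell apart after wearing it once.", [0.9, 0.1, 0.1]),
    ("Love the color and style, but material snags easily", [0.6, 0.5, 0.1]),
    ("Runs small, order a size up.", [0.1, 0.2, 0.9]),      # hypothetical review
    ("Gorgeous dress, fits perfectly.", [0.05, 0.9, 0.3]),  # hypothetical review
]

# Stand-in embedding for the query "easily damaged when washed or worn".
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, documents, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Scoring every review like this is fine for a small sample, but it gets slow as the number of reviews grows, which is one reason dedicated vector databases exist.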

How does NNext help?

The role that NNext plays is in storing vector embeddings and enabling fast searching and querying of them. Because each embedding is a long list of numbers rather than a single value, embeddings are not well suited to legacy, tabular data structures like those present in Snowflake or Google BigQuery. So that is the first benefit of NNext: we treat vector embeddings as “first class citizens” in our data structure, which enables rapid searching and querying.
The second role we play is providing tools for searching vectors and joining them with your existing datasets to create helpful queries across all of your data.
Finally, for those customers that want them, we provide visualization tools for the embeddings themselves so that users can see the distribution of their data in 2D and 3D. We also provide dashboards for users looking to pair the new insights they can extract with existing, structured data including financial performance (e.g., revenue, COGS) and operating metrics (e.g., delivery speed).
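To illustrate what joining semantic search results with structured data can look like, here is a rough sketch in plain Python. It is not NNext’s API or query syntax. The product IDs, similarity scores, revenue figures and the 0.8 threshold are all invented for the example.

```python
from collections import defaultdict

# (product_id, similarity of one review to the query
#  "easily damaged when washed or worn"). Hypothetical scores, in the
# shape a vector search might return them.
review_matches = [
    ("P1", 0.91),
    ("P1", 0.88),
    ("P2", 0.35),
    ("P3", 0.86),
    ("P2", 0.22),
]

# Hypothetical structured data, e.g. from a finance table.
revenue_by_product = {"P1": 12000.0, "P2": 45000.0, "P3": 8000.0}

THRESHOLD = 0.8  # assumed cut-off for "this review is about damage"


def damage_report(matches, revenue, threshold):
    """Count damage-related reviews per product and attach revenue."""
    counts = defaultdict(int)
    for product_id, score in matches:
        if score >= threshold:
            counts[product_id] += 1
    return [
        {"product_id": pid, "damage_reviews": counts.get(pid, 0), "revenue": rev}
        for pid, rev in sorted(revenue.items())
    ]


report = damage_report(review_matches, revenue_by_product, THRESHOLD)
for row in report:
    print(row)
```

Once the unstructured reviews are reduced to a count per product, they sit alongside revenue like any other column and can be charted or correlated in the usual way.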

About NNext

At NNext, we are building the tools needed to enable semantic search in familiar SQL. If you’d like early access, register on our app at!