Minh Tran

Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never solved or done.

The field of Natural Language Processing (NLP) is evolving rapidly. Large-scale general language models are an exciting new capability that lets us add functionality that was previously out of reach. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.

This article introduces txtai, an all-in-one embeddings database that enables Natural Language Understanding (NLU) based search in any application.

Introducing txtai

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling, retrieval augmented generation and more.

Embeddings databases can stand on their own and/or serve as a powerful knowledge source for large language model (LLM) prompts.

The following is a summary of key features:

🔎 Vector search with SQL, object storage, topic modeling, graph analysis and multimodal indexing
📄 Create embeddings for text, documents, audio, images and video
💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarization and more
↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.
⚙️ Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.
☁️ Run local or scale out with container orchestration

txtai is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers and FastAPI. txtai is open-source under an Apache 2.0 license.

Install and run txtai

txtai can be installed via pip or Docker. The following shows how to install via pip.

bash

pip install txtai

Semantic search

txtai enables semantic search with SQL and object storage.

Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
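
To make "similar concepts produce similar vectors" concrete, here is a minimal sketch that ranks entries by cosine similarity. The three-dimensional vectors are hand-written assumptions for illustration only; real embeddings come from a transformer model and have hundreds of dimensions. The `cosine` and `search` helpers are hypothetical and not part of txtai.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional "embeddings": axes loosely mean (climate, money, animals)
vectors = {
    "ice shelf collapses": [0.9, 0.1, 0.0],
    "man wins lottery": [0.0, 0.95, 0.05],
    "bear attack warning": [0.1, 0.0, 0.9],
}

def search(query_vector, limit=1):
    """Rank stored entries by cosine similarity to the query vector."""
    scored = [(text, cosine(query_vector, vector)) for text, vector in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]

# A query about "climate change" would sit close to the climate axis
print(search([0.8, 0.2, 0.1]))
```

The query shares no words with the stored entries. The match comes purely from the direction of the vectors, which is the property a trained model provides at scale.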

The basic use case for an embeddings database is building an approximate nearest neighbor (ANN) index for semantic search. The following example indexes a small number of text entries to demonstrate the value of semantic search.

python

from txtai import Embeddings

# Works with a list, dataset or generator
data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]

# Create an embeddings
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

# Create an index for the list of text
embeddings.index(data)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change",
    "public health story", "war", "wildlife", "asia",
    "lucky", "dishonest junk"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = embeddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data[uid]))

Results from semantic search example.

The example above shows that for all of the queries, the query text isn’t in the data. This is the true power of transformer models over token-based search.

Updates and deletes

Updates and deletes are supported for embeddings. The upsert operation will insert new data and update existing data.

The following section runs a query, then updates a value changing the top result and finally deletes the updated value to revert back to the original query results.

python

# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])

# Create a copy of data to modify
udata = data.copy()

# Update data
udata[0] = "See it: baby panda born"
embeddings.upsert([(0, udata[0], None)])

uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", udata[uid])

# Remove record just added from index
embeddings.delete([0])

# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", udata[uid])

shell

Initial:  Maine man wins $1M from $25 lottery ticket
After update:  See it: baby panda born
After delete:  Maine man wins $1M from $25 lottery ticket
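
The difference between indexing, upserting and deleting can be sketched with a plain dictionary keyed by id. This is a toy model of the semantics only: the `ToyIndex` class is hypothetical and says nothing about how txtai stores vectors internally.

```python
class ToyIndex:
    """Hypothetical id -> text store that mimics index, upsert and delete semantics."""

    def __init__(self):
        self.rows = {}

    def index(self, documents):
        # Rebuild from scratch, discarding anything stored before
        self.rows = dict(documents)

    def upsert(self, documents):
        # Insert new ids and overwrite ids that already exist
        for uid, text in documents:
            self.rows[uid] = text

    def delete(self, ids):
        # Remove ids that exist and return the ones actually deleted
        return [uid for uid in ids if self.rows.pop(uid, None) is not None]

toy = ToyIndex()
toy.index([
    (0, "US tops 5 million confirmed virus cases"),
    (1, "Maine man wins $1M from $25 lottery ticket"),
])

# id 0 is overwritten, id 2 is new
toy.upsert([(0, "See it: baby panda born"), (2, "A brand new entry")])
print(toy.rows)

# id 99 does not exist, so only id 0 is reported as deleted
print(toy.delete([0, 99]))
```

Note that deleting id 0 removes the record entirely. It does not restore the original text, which is why the example above returns to the lottery story rather than to the first headline.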

Persistence

Embeddings can be saved to storage and reloaded.

python

embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])

shell

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg

Hybrid search

While dense vector indexes are by far the best option for semantic search systems, sparse keyword indexes can still add value. There may be cases where finding an exact match is important.

Hybrid search combines the results from sparse and dense vector indexes for the best of both worlds.

python

# Create an embeddings
embeddings = Embeddings(
  hybrid=True,
  path="sentence-transformers/nli-mpnet-base-v2"
)

# Create an index for the list of text
embeddings.index(data)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change",
    "public health story", "war", "wildlife", "asia",
    "lucky", "dishonest junk"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = embeddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data[uid]))

Results from hybrid search example.

The results are the same as with semantic search. Let’s run the same example with just a keyword index to view those results.

python

# Create an embeddings
embeddings = Embeddings(keyword=True)

# Create an index for the list of text
embeddings.index(data)

print(embeddings.search("feel good story"))
print(embeddings.search("lottery"))

shell

[]
[(4, 0.5234998733628726)]

Note that when the embeddings instance only uses a keyword index, it can’t find semantic matches, only keyword matches.
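
One common way to think about hybrid scoring is as a weighted blend of a keyword score and a semantic score. The sketch below is a simplified illustration under stated assumptions: `keyword_score` is a crude token-overlap stand-in for a real sparse scorer such as BM25, the dense scores are made-up numbers, and the 0.5 weight is just an example. It is not a description of txtai's exact scoring.

```python
def keyword_score(query, text):
    """Fraction of query tokens that appear in the text (a crude stand-in for BM25)."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)

def hybrid_score(dense, sparse, weight=0.5):
    """Weighted blend of a dense (semantic) score and a sparse (keyword) score."""
    return weight * dense + (1 - weight) * sparse

text = "Maine man wins $1M from $25 lottery ticket"

# Hypothetical dense scores; a real system would get these from a vector index
for query, dense in (("feel good story", 0.08), ("lottery", 0.35)):
    sparse = keyword_score(query, text)
    print(query, sparse, round(hybrid_score(dense, sparse), 3))
```

The first query has no keyword overlap, so its hybrid score depends entirely on the semantic side. The second query gets a boost from the exact match on "lottery".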

Content storage

Up to this point, all the examples reference the original data array to retrieve the input text. This works fine for a demo but what if you have millions of documents? In this case, the text needs to be retrieved from an external datastore using the id.

Content storage adds an associated database (e.g. SQLite, DuckDB) that stores associated metadata with the vector index. The document text, additional metadata and additional objects can be stored and retrieved right alongside the indexed vectors.

python

# Create embeddings with content enabled.
# The default behavior is to only store indexed vectors.
embeddings = Embeddings(
  path="sentence-transformers/nli-mpnet-base-v2",
  content=True,
  objects=True
)

# Create an index for the list of text
embeddings.index(data)

print(embeddings.search("feel good story", 1)[0]["text"])

shell

Maine man wins $1M from $25 lottery ticket

The key change above is setting the content flag to True. This enables storing text and metadata content (if provided) alongside the index. The objects flag additionally enables storage of binary objects. Note how the text is pulled right from the query result!
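
Conceptually, content storage means the vector index returns ids and a database resolves those ids to text. The following is a rough sketch of that idea using Python's built-in `sqlite3` module. The `documents` table, the `resolve` helper and the hard-coded search result are assumptions for illustration, not txtai's actual schema or API.

```python
import sqlite3

texts = [
    "US tops 5 million confirmed virus cases",
    "Maine man wins $1M from $25 lottery ticket",
]

# In-memory database standing in for the content store
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT)")
connection.executemany("INSERT INTO documents (id, text) VALUES (?, ?)", enumerate(texts))

def resolve(results):
    """Turn (id, score) pairs from a vector index into rows that include the text."""
    rows = []
    for uid, score in results:
        row = connection.execute("SELECT text FROM documents WHERE id = ?", (uid,)).fetchone()
        rows.append({"id": uid, "text": row[0], "score": score})
    return rows

# Hypothetical vector search result: id 1 is the best match
print(resolve([(1, 0.08)]))
```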

Let’s add some metadata.

Query with SQL

When content is enabled, the entire dictionary is stored and can be queried. In addition to vector queries, txtai accepts SQL queries. This enables combined queries using both a vector index and content stored in a database backend.

python

# Create an index for the list of text
embeddings.index([{"text": text, "length": len(text)} for text in data])

# Filter by score
print(embeddings.search("select text, score from txtai where similar('hiking danger') and score >= 0.15"))

# Filter by metadata field 'length'
print(embeddings.search("select text, length, score from txtai where similar('feel good story') and score >= 0.05 and length >= 40"))

# Run aggregate queries
print(embeddings.search("select count(*), min(length), max(length), sum(length) from txtai"))

shell

[{'text': 'The National Park Service warns against sacrificing slower friends in a bear attack', 'score': 0.3151373863220215}]
[{'text': 'Maine man wins $1M from $25 lottery ticket', 'length': 42, 'score': 0.08329027891159058}]
[{'count(*)': 6, 'min(length)': 39, 'max(length)': 94, 'sum(length)': 387}]

The example above adds a simple additional field, text length.

Note the second query is filtering on the metadata field length along with a similar query clause. This gives a great blend of vector search with traditional filtering to help identify the best results.
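
One way to picture how a vector clause and a metadata filter can combine is to load the similarity scores into a table and join them with the stored content. The sketch below does this with `sqlite3`. The table names and the hard-coded scores are hypothetical, and this is not a claim about how txtai executes its queries internally.

```python
import sqlite3

texts = [
    "US tops 5 million confirmed virus cases",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT, length INTEGER)")
connection.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(uid, text, len(text)) for uid, text in enumerate(texts)],
)

# Hypothetical similarity scores, as a vector index might return for one query
scores = [(0, 0.02), (1, 0.08), (2, 0.06)]
connection.execute("CREATE TEMP TABLE scores (id INTEGER PRIMARY KEY, score REAL)")
connection.executemany("INSERT INTO scores VALUES (?, ?)", scores)

# Join vector scores with stored metadata, then filter and sort in plain SQL
rows = connection.execute(
    """
    SELECT d.text, d.length, s.score
    FROM documents d JOIN scores s ON d.id = s.id
    WHERE s.score >= 0.05 AND d.length >= 40
    ORDER BY s.score DESC
    """
).fetchall()
print(rows)
```

The first entry is dropped twice over: its score is below 0.05 and its length is below 40. The remaining rows come back ordered by score.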

Object storage

In addition to metadata, binary content can also be associated with documents. The example below downloads an image and upserts it, along with associated text, into the embeddings index.

python

# urllib.request must be imported explicitly; "import urllib" alone does not load it
import urllib.request

from IPython.display import Image

# Get an image
request = urllib.request.urlopen("https://raw.githubusercontent.com/neuml/txtai/master/demo.gif")

# Upsert new record having both text and an object
embeddings.upsert([("txtai", {"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "object": request.read()}, None)])

# Query txtai for the most similar result to "machine learning" and get associated object
result = embeddings.search("select object from txtai where similar('machine learning') limit 1")[0]["object"]

# Display image
Image(result.getvalue(), width=600)

Search with object storage

Searching with txtai.

Topic modeling

txtai enables topic modeling with semantic graphs.

Topic modeling is enabled via semantic graphs. Semantic graphs, also known as knowledge graphs or semantic networks, build a graph network with semantic relationships connecting the nodes. In txtai, they can take advantage of the relationships inherently learned within an embeddings index.

python

# Create embeddings with a graph index
embeddings = Embeddings(
  path="sentence-transformers/nli-mpnet-base-v2",
  content=True,
  functions=[
    {"name": "graph", "function": "graph.attribute"},
  ],
  expressions=[
    {"name": "category", "expression": "graph(indexid, 'category')"},
    {"name": "topic", "expression": "graph(indexid, 'topic')"},
  ],
  graph={
    "topics": {
      "categories": ["health", "climate", "finance", "world politics"]
    }
  }
)

embeddings.index(data)
embeddings.search("select topic, category, text from txtai")

shell

[{'topic': 'confirmed_cases_us_5',
  'category': 'health',
  'text': 'US tops 5 million confirmed virus cases'},
 {'topic': 'collapsed_iceberg_ice_intact',
  'category': 'climate',
  'text': "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg"},
 {'topic': 'beijing_along_craft_tensions',
  'category': 'world politics',
  'text': 'Beijing mobilises invasion craft along coast as Taiwan tensions escalate'}]

When a graph index is enabled, topics are assigned to each of the entries in the embeddings instance. Topics are dynamically created using a sparse index over graph nodes grouped by community detection algorithms.

Topic categories are also derived as shown above.
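
The idea of grouping graph nodes into topics can be sketched with a much simpler technique than community detection: connect entries whose similarity passes a threshold, then take connected components as groups. To be clear, this swaps in connected components as a stand-in, and the similarity scores, the threshold and both helper functions are assumptions for illustration only.

```python
# Hypothetical pairwise similarity scores between four indexed entries
similarity = {
    (0, 1): 0.72,  # two health stories
    (2, 3): 0.65,  # two climate stories
    (0, 2): 0.10,
    (1, 3): 0.05,
}

def build_graph(similarity, threshold=0.5):
    """Create an undirected graph with an edge for each pair at or above the threshold."""
    graph = {}
    for (a, b), score in similarity.items():
        # Register both nodes, even if they end up without edges
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def components(graph):
    """Group nodes into connected components with a depth-first traversal."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(graph[node] - seen)
        groups.append(sorted(group))
    return groups

print(components(build_graph(similarity)))
```

Each group could then be labeled with its most characteristic terms, which is roughly the role topic names such as `collapsed_iceberg_ice_intact` play in the output above.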

Subindexes

Subindexes can be configured for an embeddings instance. A single embeddings instance can have multiple subindexes, each with a different configuration.

We’ll build an embeddings index having both a keyword and dense index to demonstrate.

python

# Create embeddings with subindexes
embeddings = Embeddings(
  content=True,
  defaults=False,
  indexes={
    "keyword": {
      "keyword": True
    },
    "dense": {
      "path": "sentence-transformers/nli-mpnet-base-v2"
    }
  }
)
embeddings.index(data)

python

embeddings.search("feel good story", limit=1, index="keyword")

shell

[]

python

embeddings.search("feel good story", limit=1, index="dense")

shell

[{'id': '4',
  'text': 'Maine man wins $1M from $25 lottery ticket',
  'score': 0.08329027891159058}]

Once again, this example demonstrates the difference between keyword and semantic search. The first search call uses the defined keyword index, the second uses the dense vector index.

LLM orchestration

txtai enables LLM orchestration with a pipeline that extracts knowledge from content by joining a prompt, context data store and generative model together.

txtai is an all-in-one embeddings database. It is the only vector database that also supports sparse indexes, graph networks and relational databases with inline SQL support. In addition to this, txtai has support for LLM orchestration.

The extractor pipeline is txtai’s spin on retrieval augmented generation (RAG). This pipeline extracts knowledge from content by joining a prompt, context data store and generative model together.

The following example shows how a large language model (LLM) can use an embeddings database for context.

python

import torch
from txtai.pipeline import Extractor

def prompt(question):
  return [{
    "query": question,
    "question": f"""
Answer the following question using the context below.
Question: {question}
Context:
"""
}]

# Create embeddings
embeddings = Embeddings(
  path="sentence-transformers/nli-mpnet-base-v2",
  content=True,
  autoid="uuid5"
)

# Create an index for the list of text
embeddings.index(data)

# Create and run extractor instance
extractor = Extractor(
  embeddings,
  "google/flan-t5-large",
  torch_dtype=torch.bfloat16,
  output="reference"
)
extractor(prompt("What country is having issues with climate change?"))[0]

shell

{'answer': 'Canada', 'reference': 'da633124-33ff-58d6-8ecb-14f7a44c042a'}

The logic above first builds an embeddings index. It then loads an LLM and uses the embeddings index to drive an LLM prompt.

The extractor pipeline can optionally return a reference to the id of the best matching record with the answer. That id can be used to resolve the full answer reference. Note that the embeddings above used UUID auto-generated ids (autoid="uuid5").

python

uid = extractor(prompt("What country is having issues with climate change?"))[0]["reference"]
embeddings.search(f"select id, text from txtai where id = '{uid}'")

shell

[{'id': 'da633124-33ff-58d6-8ecb-14f7a44c042a',
  'text': "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg"}]
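
The retrieval augmented generation flow boils down to two steps before the model is called: retrieve context for the question, then join the question and context into a prompt. Here is a minimal sketch of those two steps. The `retrieve` function uses naive token overlap as a stand-in for a vector search, and both helpers are hypothetical names rather than txtai APIs. The final prompt string is what would be handed to a generative model.

```python
def retrieve(query, documents, limit=1):
    """Toy retriever: rank documents by the number of shared lowercase tokens."""
    tokens = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda text: len(tokens & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:limit]

def build_prompt(question, context):
    """Join the question and the retrieved context into a single prompt string."""
    return (
        "Answer the following question using the context below.\n"
        f"Question: {question}\n"
        "Context:\n" + "\n".join(context)
    )

documents = [
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Maine man wins $1M from $25 lottery ticket",
]

question = "Where did an ice shelf collapse?"
prompt_text = build_prompt(question, retrieve(question, documents))
print(prompt_text)
```

Because the model only sees the retrieved context, the quality of the answer depends heavily on the quality of the retrieval step.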

LLM inference can also be run standalone.

python

from txtai.pipeline import LLM

llm = LLM("google/flan-t5-large", torch_dtype=torch.bfloat16)
llm("Where is one place you'd go in Washington, DC?")

shell

national museum of american history

Language model workflows

txtai enables language model workflows.

Language model workflows, also known as semantic workflows, connect language models together to build intelligent applications.

Workflows can run right alongside an embeddings instance, similar to a stored procedure in a relational database. Workflows can be written in either Python or YAML. We’ll demonstrate how to write a workflow with YAML.

yaml

# Embeddings instance
writable: true
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2
  content: true
  functions:
    - {name: translation, argcount: 2, function: translation}

# Translation pipeline
translation:

# Workflow definitions
workflow:
  search:
    tasks:
      - search
      - action: translation
        args:
          target: fr
        task: template
        template: "{text}"

The workflow above loads an embeddings index and defines a search workflow. The search workflow runs a search and then passes the results to a translation pipeline. The translation pipeline translates results to French.

python

from txtai import Application

# Build index
app = Application("embeddings.yml")
app.add(data)
app.index()

# Run workflow
list(app.workflow(
  "search",
  ["select text from txtai where similar('feel good story') limit 1"]
))

shell

['Maine homme gagne $1M à partir de $25 billet de loterie']

SQL functions, in some cases, can accomplish the same thing as a workflow. The function below runs the translation pipeline as a function.

python

app.search("select translation(text, 'fr') text from txtai where similar('feel good story') limit 1")

shell

[{'text': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]

LLM chains with templates are also possible with workflows. Workflows are self-contained and operate both with and without an associated embeddings instance. The following workflow uses an LLM to conditionally translate text to French and then detect the language of the text.

yaml

sequences:
  path: google/flan-t5-large
  torch_dtype: torch.bfloat16

workflow:
  chain:
    tasks:
      - task: template
        template: Translate '{statement}' to {language} if it's English
        action: sequences
      - task: template
        template: What language is the following text? {text}
        action: sequences

python

inputs = [
  {"statement": "Hello, how are you", "language": "French"},
  {"statement": "Hallo, wie geht's dir", "language": "French"}
]

app = Application("workflow.yml")
list(app.workflow("chain", inputs))

shell

['French', 'German']
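
At its core, a workflow is a list of tasks where the output of one task becomes the input of the next. The sketch below shows that chaining pattern in plain Python. The `run_workflow` function and the two toy tasks are hypothetical. In a real workflow the tasks would call pipelines or models rather than a dictionary lookup and `str.upper`.

```python
def run_workflow(tasks, elements):
    """Pass a batch of elements through each task in order."""
    for task in tasks:
        elements = task(elements)
    return elements

# Hypothetical tasks standing in for a search step and a transformation step
def search(queries):
    index = {"feel good story": "Maine man wins $1M from $25 lottery ticket"}
    return [index.get(query, "") for query in queries]

def shout(texts):
    return [text.upper() for text in texts]

print(run_workflow([search, shout], ["feel good story"]))
```

Keeping every task to the same "list in, list out" shape is what makes tasks easy to reorder and combine.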

Wrapping up

NLP is advancing at a rapid pace. Things not possible even a year ago are now possible. This article introduced txtai, an all-in-one embeddings database. The possibilities are limitless and we’re excited to see what can be built on top of txtai!

Visit the links below for more.
