Introducing txtai, the all-in-one embeddings database

Tuesday, March 5, 2024

Last updated on Tuesday, March 5, 2024

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Search is the foundation of many applications. Once data starts to pile up, users want to be able to find it. Search underpins the internet and remains an ever-growing challenge that is never fully solved.

The field of Natural Language Processing (NLP) is evolving rapidly. Large-scale general language models are an exciting new capability that lets us add functionality that was previously out of reach. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.

This article introduces txtai, an all-in-one embeddings database that enables Natural Language Understanding (NLU) based search in any application.

Introducing txtai

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling, retrieval augmented generation and more.

Embeddings databases can stand on their own and/or serve as a powerful knowledge source for large language model (LLM) prompts.

The following is a summary of key features:

  • 🔎 Vector search with SQL, object storage, topic modeling, graph analysis and multimodal indexing

  • 📄 Create embeddings for text, documents, audio, images and video

  • 💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarization and more

  • ↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.

  • ⚙️ Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.

  • ☁️ Run local or scale out with container orchestration

txtai is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers and FastAPI. txtai is open-source under an Apache 2.0 license.

Install and run txtai

txtai can be installed via pip or Docker. The following shows how to install via pip.

    pip install txtai

Semantic search

txtai enables semantic search with SQL and object storage.

Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
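To make the idea of "similar concepts produce similar vectors" concrete, here is a minimal sketch that ranks entries by cosine similarity. The three-dimensional vectors are hand-written, hypothetical values standing in for what a real model would produce, and the brute-force `search` helper is an illustration only, not how txtai builds its indexes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors; a real model outputs hundreds of dimensions
vectors = {
    "ice shelf collapses": [0.9, 0.1, 0.0],
    "lottery winner": [0.0, 0.2, 0.9],
    "virus cases rise": [0.1, 0.9, 0.1],
}

def search(query_vector, limit=1):
    """Brute-force nearest neighbor search: score every entry, keep the best."""
    scored = [(text, cosine(query_vector, vector)) for text, vector in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]

# A query vector that points in roughly the same direction as the first entry
best = search([0.8, 0.2, 0.1])
print(best[0][0])
```

An approximate nearest neighbor index serves the same purpose as this loop but avoids comparing the query against every stored vector.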

The basic use case for an embeddings database is building an approximate nearest neighbor (ANN) index for semantic search. The following example indexes a small number of text entries to demonstrate the value of semantic search.

    from txtai import Embeddings

    # Works with a list, dataset or generator
    data = [
        "US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"
    ]

    # Create an embeddings
    embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

    # Create an index for the list of text
    embeddings.index(data)

    print("%-20s %s" % ("Query", "Best Match"))
    print("-" * 50)

    # Run an embeddings search for each query
    for query in ("feel good story", "climate change",
                  "public health story", "war", "wildlife", "asia",
                  "lucky", "dishonest junk"):
        # Extract uid of first result
        # search result format: (uid, score)
        uid = embeddings.search(query, 1)[0][0]

        # Print text
        print("%-20s %s" % (query, data[uid]))

Figure: Results from semantic search example.

The example above shows that, for every query, the query text doesn't appear in the data. This is the real strength of transformer models over token-based search.

Updates and deletes

Updates and deletes are supported for embeddings. The upsert operation will insert new data and update existing data.

The following section runs a query, then updates a value changing the top result and finally deletes the updated value to revert back to the original query results.

    # Run initial query
    uid = embeddings.search("feel good story", 1)[0][0]
    print("Initial: ", data[uid])

    # Create a copy of data to modify
    udata = data.copy()

    # Update data
    udata[0] = "See it: baby panda born"
    embeddings.upsert([(0, udata[0], None)])

    uid = embeddings.search("feel good story", 1)[0][0]
    print("After update: ", udata[uid])

    # Remove record just added from index
    embeddings.delete([0])

    # Ensure value matches previous value
    uid = embeddings.search("feel good story", 1)[0][0]
    print("After delete: ", udata[uid])

    Initial: Maine man wins $1M from $25 lottery ticket
    After update: See it: baby panda born
    After delete: Maine man wins $1M from $25 lottery ticket
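The upsert and delete semantics can be sketched with a plain dictionary keyed by id. `TinyIndex` is a hypothetical class written for this illustration; it stores raw text rather than vectors and is not part of txtai.

```python
class TinyIndex:
    """Toy id-keyed store illustrating upsert and delete semantics."""

    def __init__(self):
        self.rows = {}

    def upsert(self, documents):
        # Insert rows with new ids and overwrite rows whose id already exists
        for uid, text in documents:
            self.rows[uid] = text

    def delete(self, ids):
        # Remove the given ids and report which ones were actually present
        return [uid for uid in ids if self.rows.pop(uid, None) is not None]

    def count(self):
        return len(self.rows)

index = TinyIndex()
index.upsert([(0, "US tops 5 million confirmed virus cases"), (1, "Maine man wins $1M")])

# Same id, new text: the row is updated rather than duplicated
index.upsert([(0, "See it: baby panda born")])

# Id 5 was never added, so only id 0 is reported as deleted
deleted = index.delete([0, 5])
```

Because rows are addressed by id, an upsert with an existing id replaces the old entry, which is why the top result changes in the example above.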

Persistence

Embeddings can be saved to storage and reloaded.

    embeddings.save("index")

    embeddings = Embeddings()
    embeddings.load("index")

    uid = embeddings.search("climate change", 1)[0][0]
    print(data[uid])

    Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
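As a rough illustration of the save and reload round trip, the sketch below writes a toy index to disk and reads it back. The JSON layout and the `save` and `load` helpers are assumptions made for this example; they do not reflect txtai's actual on-disk format.

```python
import json
import os
import tempfile

def save(index, path):
    """Write the toy index to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)

def load(path):
    """Read the toy index back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

# Hypothetical index contents: ids plus their vectors
index = {"ids": [0, 1], "vectors": [[0.9, 0.1], [0.2, 0.8]]}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "index.json")
    save(index, path)
    restored = load(path)
```

The point is simply that the reloaded index answers queries the same way as the original, without re-encoding the source data.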

Hybrid search

While dense vector indexes are by far the best option for semantic search systems, sparse keyword indexes can still add value. There may be cases where finding an exact match is important.

Hybrid search combines the results from sparse and dense vector indexes for the best of both worlds.

    # Create an embeddings
    embeddings = Embeddings(
        hybrid=True,
        path="sentence-transformers/nli-mpnet-base-v2"
    )

    # Create an index for the list of text
    embeddings.index(data)

    print("%-20s %s" % ("Query", "Best Match"))
    print("-" * 50)

    # Run an embeddings search for each query
    for query in ("feel good story", "climate change",
                  "public health story", "war", "wildlife", "asia",
                  "lucky", "dishonest junk"):
        # Extract uid of first result
        # search result format: (uid, score)
        uid = embeddings.search(query, 1)[0][0]

        # Print text
        print("%-20s %s" % (query, data[uid]))

Figure: Results from hybrid search example.

The results are the same as with semantic search. Let’s run the same example with just a keyword index to view those results.

    # Create an embeddings
    embeddings = Embeddings(keyword=True)

    # Create an index for the list of text
    embeddings.index(data)

    print(embeddings.search("feel good story"))
    print(embeddings.search("lottery"))

    []
    [(4, 0.5234998733628726)]

See that when the embeddings instance only uses a keyword index, it can’t find semantic matches, only keyword matches.
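One common way to combine sparse and dense results is a weighted average of normalized scores. The sketch below shows that approach under stated assumptions: the `normalize` and `hybrid` helpers, the 0.5 weight and the score values are all hypothetical, and this is not presented as txtai's internal scoring logic.

```python
def normalize(scores):
    """Scale scores to the 0-1 range so both indexes are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    return {uid: (score - low) / span if span else 1.0 for uid, score in scores.items()}

def hybrid(dense, sparse, weight=0.5):
    """Weighted average of normalized dense and sparse scores, best first."""
    dense, sparse = normalize(dense), normalize(sparse)
    combined = {}
    for uid in set(dense) | set(sparse):
        # An id missing from one index contributes a score of 0 for that index
        combined[uid] = weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores keyed by row id
dense = {4: 0.08, 3: 0.05, 0: 0.01}
sparse = {4: 0.52}

results = hybrid(dense, sparse)
print(results[0])
```

A row that ranks well in both indexes rises to the top, while a row found by only one index can still appear further down the list.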

Content storage

Up to this point, all the examples are referencing the original data array to retrieve the input text. This works fine for a demo but what if you have millions of documents? In this case, the text needs to be retrieved from an external datastore using the id.

Content storage adds an associated database (e.g. SQLite or DuckDB) that stores metadata alongside the vector index. The document text, additional metadata and additional objects can be stored and retrieved right alongside the indexed vectors.

    # Create embeddings with content enabled.
    # The default behavior is to only store indexed vectors.
    embeddings = Embeddings(
        path="sentence-transformers/nli-mpnet-base-v2",
        content=True,
        objects=True
    )

    # Create an index for the list of text
    embeddings.index(data)

    print(embeddings.search("feel good story", 1)[0]["text"])

    Maine man wins $1M from $25 lottery ticket

The key change above is setting the content flag to True. This enables storing text and metadata content (if provided) alongside the index. The objects flag additionally enables storage of binary objects, which a later example uses. Note how the text is pulled right from the query result!

Let’s add some metadata.

Query with SQL

When content is enabled, the entire dictionary is stored and can be queried. In addition to vector queries, txtai accepts SQL queries. This enables combined queries using both a vector index and content stored in a database backend.

    # Create an index for the list of text
    embeddings.index([{"text": text, "length": len(text)} for text in data])

    # Filter by score
    print(embeddings.search("select text, score from txtai where similar('hiking danger') and score >= 0.15"))

    # Filter by metadata field 'length'
    print(embeddings.search("select text, length, score from txtai where similar('feel good story') and score >= 0.05 and length >= 40"))

    # Run aggregate queries
    print(embeddings.search("select count(*), min(length), max(length), sum(length) from txtai"))

    [{'text': 'The National Park Service warns against sacrificing slower friends in a bear attack', 'score': 0.3151373863220215}]
    [{'text': 'Maine man wins $1M from $25 lottery ticket', 'length': 42, 'score': 0.08329027891159058}]
    [{'count(*)': 6, 'min(length)': 39, 'max(length)': 94, 'sum(length)': 387}]

The example above adds a simple additional field, text length.

Note the second query is filtering on the metadata field length along with a similar query clause. This gives a great blend of vector search with traditional filtering to help identify the best results.
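The blend of vector search and traditional filtering can be pictured as two steps: a similarity query produces candidate ids with scores, and a SQL query filters those candidates on metadata. The sketch below uses Python's built-in sqlite3 module with hard-coded, hypothetical scores in place of a real vector query; the `docs` table and `search` helper are assumptions for illustration, not txtai internals.

```python
import sqlite3

data = [
    "US tops 5 million confirmed virus cases",
    "Maine man wins $1M from $25 lottery ticket",
]

# Hypothetical similarity scores keyed by row id, standing in for a vector query
scores = {0: 0.02, 1: 0.08}

# Store text and metadata in a relational table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, length INTEGER)")
db.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(uid, text, len(text)) for uid, text in enumerate(data)],
)

def search(min_score, min_length):
    """Keep candidates above a score threshold, then filter on metadata with SQL."""
    ids = [uid for uid, score in scores.items() if score >= min_score]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT id, text, length FROM docs WHERE id IN ({marks}) AND length >= ?",
        ids + [min_length],
    ).fetchall()
    results = [{"text": text, "length": length, "score": scores[uid]} for uid, text, length in rows]
    return sorted(results, key=lambda row: row["score"], reverse=True)

print(search(min_score=0.05, min_length=40))
```

Keeping metadata in a relational table means any column can take part in the filter without changing how the vectors are indexed.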

Object storage

In addition to metadata, binary content can also be associated with documents. The example below downloads an image and upserts it, along with associated text, into the embeddings index.

    # urllib.request must be imported explicitly; "import urllib" alone does not load it
    import urllib.request

    from IPython.display import Image

    # Get an image
    request = urllib.request.urlopen("https://raw.githubusercontent.com/neuml/txtai/master/demo.gif")

    # Upsert new record having both text and an object
    embeddings.upsert([("txtai", {"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "object": request.read()}, None)])

    # Query txtai for the most similar result to "machine learning" and get associated object
    result = embeddings.search("select object from txtai where similar('machine learning') limit 1")[0]["object"]

    # Display image
    Image(result.getvalue(), width=600)

Figure: Searching with txtai.

Topic modeling

txtai enables topic modeling with semantic graphs.

Topic modeling is enabled via semantic graphs. Semantic graphs, also known as knowledge graphs or semantic networks, build a graph network with semantic relationships connecting the nodes. In txtai, they can take advantage of the relationships inherently learned within an embeddings index.

    # Create embeddings with a graph index
    embeddings = Embeddings(
        path="sentence-transformers/nli-mpnet-base-v2",
        content=True,
        functions=[
            {"name": "graph", "function": "graph.attribute"},
        ],
        expressions=[
            {"name": "category", "expression": "graph(indexid, 'category')"},
            {"name": "topic", "expression": "graph(indexid, 'topic')"},
        ],
        graph={
            "topics": {
                "categories": ["health", "climate", "finance", "world politics"]
            }
        }
    )

    embeddings.index(data)
    embeddings.search("select topic, category, text from txtai")

    [{'topic': 'confirmed_cases_us_5',
      'category': 'health',
      'text': 'US tops 5 million confirmed virus cases'},
     {'topic': 'collapsed_iceberg_ice_intact',
      'category': 'climate',
      'text': "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg"},
     {'topic': 'beijing_along_craft_tensions',
      'category': 'world politics',
      'text': 'Beijing mobilises invasion craft along coast as Taiwan tensions escalate'}]

When a graph index is enabled, topics are assigned to each of the entries in the embeddings instance. Topics are dynamically created using a sparse index over graph nodes grouped by community detection algorithms.

Topic categories can also be derived, as shown above.
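To illustrate how grouping graph nodes can yield topics, the sketch below links entries whose similarity exceeds a threshold and then collects connected components. Connected components are used here as a deliberately simple stand-in for a community detection algorithm; the similarity matrix, the threshold and the `communities` helper are hypothetical and do not describe txtai's implementation.

```python
def communities(similarity, threshold=0.5):
    """Group nodes into connected components of the thresholded similarity graph."""
    size = len(similarity)

    # Add an edge between two nodes when their similarity meets the threshold
    edges = {
        i: [j for j in range(size) if j != i and similarity[i][j] >= threshold]
        for i in range(size)
    }

    seen, groups = set(), []
    for start in range(size):
        if start in seen:
            continue

        # Depth-first traversal collects every node reachable from start
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)

        groups.append(sorted(group))

    return groups

# Hypothetical pairwise similarities for four entries
similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]

groups = communities(similarity)
print(groups)
```

Each resulting group could then be labeled, for example with its most distinctive terms, to produce a topic name.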

Subindexes

Subindexes can be configured for an embeddings instance. A single embeddings instance can have multiple subindexes, each with a different configuration.

We’ll build an embeddings index having both a keyword and dense index to demonstrate.

    # Create embeddings with subindexes
    embeddings = Embeddings(
        content=True,
        defaults=False,
        indexes={
            "keyword": {
                "keyword": True
            },
            "dense": {
                "path": "sentence-transformers/nli-mpnet-base-v2"
            }
        }
    )
    embeddings.index(data)

    embeddings.search("feel good story", limit=1, index="keyword")

    []

    embeddings.search("feel good story", limit=1, index="dense")

    [{'id': '4',
      'text': 'Maine man wins $1M from $25 lottery ticket',
      'score': 0.08329027891159058}]

Once again, this example demonstrates the difference between keyword and semantic search. The first search call uses the defined keyword index, the second uses the dense vector index.

LLM orchestration

txtai enables LLM orchestration with a pipeline that extracts knowledge from content by joining a prompt, context data store and generative model together.

txtai is an all-in-one embeddings database. It is the only vector database that also supports sparse indexes, graph networks and relational databases with inline SQL support. In addition to this, txtai has support for LLM orchestration.

The extractor pipeline is txtai’s spin on retrieval augmented generation (RAG). This pipeline extracts knowledge from content by joining a prompt, context data store and generative model together.

The following example shows how a large language model (LLM) can use an embeddings database for context.

    import torch
    from txtai.pipeline import Extractor

    def prompt(question):
        return [{
            "query": question,
            "question": f"""
    Answer the following question using the context below.
    Question: {question}
    Context:
    """
        }]

    # Create embeddings
    embeddings = Embeddings(
        path="sentence-transformers/nli-mpnet-base-v2",
        content=True,
        autoid="uuid5"
    )

    # Create an index for the list of text
    embeddings.index(data)

    # Create and run extractor instance
    extractor = Extractor(
        embeddings,
        "google/flan-t5-large",
        torch_dtype=torch.bfloat16,
        output="reference"
    )
    extractor(prompt("What country is having issues with climate change?"))[0]

    {'answer': 'Canada', 'reference': 'da633124-33ff-58d6-8ecb-14f7a44c042a'}

The logic above first builds an embeddings index. It then loads an LLM and uses the embeddings index to supply context for the LLM prompt.
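The flow of retrieve, build a prompt, then generate can be sketched without any models. In the sketch below, `retrieve` is a toy word-overlap ranker standing in for an embeddings search and `generate` is a caller-supplied stub standing in for an LLM; all three helper names are hypothetical and none of this is the Extractor pipeline's actual code.

```python
def retrieve(question, documents, limit=1):
    """Toy retriever: rank documents by the number of words shared with the question."""
    terms = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda document: len(terms & set(document.lower().split())),
        reverse=True,
    )
    return ranked[:limit]

def build_prompt(question, context):
    """Join the instruction, question and retrieved context into a single prompt."""
    lines = [
        "Answer the following question using the context below.",
        f"Question: {question}",
        "Context:",
    ]
    lines.extend(context)
    return "\n".join(lines)

def rag(question, documents, generate):
    """Retrieve context, build the prompt and pass it to the generator."""
    context = retrieve(question, documents)
    return generate(build_prompt(question, context))

documents = [
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Maine man wins $1M from $25 lottery ticket",
]

# Stub generator: returns the last prompt line, i.e. the retrieved context
answer = rag(
    "Which ice shelf collapsed?",
    documents,
    generate=lambda prompt: prompt.splitlines()[-1],
)
print(answer)
```

Swapping the stub for a real model call leaves the structure unchanged: the generator only ever sees the question plus the context selected by retrieval.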

The extractor pipeline can optionally return a reference to the id of the best matching record along with the answer. That id can be used to resolve the full answer reference. Note that the embeddings instance above generates UUID ids automatically (autoid="uuid5").

    uid = extractor(prompt("What country is having issues with climate change?"))[0]["reference"]
    embeddings.search(f"select id, text from txtai where id = '{uid}'")

    [{'id': 'da633124-33ff-58d6-8ecb-14f7a44c042a',
      'text': "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg"}]

LLM inference can also be run standalone.

    from txtai.pipeline import LLM

    llm = LLM("google/flan-t5-large", torch_dtype=torch.bfloat16)
    llm("Where is one place you'd go in Washington, DC?")

    national museum of american history

Language model workflows

txtai enables language model workflows.

Language model workflows, also known as semantic workflows, connect language models together to build intelligent applications.

Workflows can run right alongside an embeddings instance, similar to a stored procedure in a relational database. Workflows can be written in either Python or YAML. We’ll demonstrate how to write a workflow with YAML.

    # Embeddings instance
    writable: true
    embeddings:
      path: sentence-transformers/nli-mpnet-base-v2
      content: true
      functions:
        - {name: translation, argcount: 2, function: translation}

    # Translation pipeline
    translation:

    # Workflow definitions
    workflow:
      search:
        tasks:
          - search
          - action: translation
            args:
              target: fr
            task: template
            template: "{text}"

The workflow above loads an embeddings index and defines a search workflow. The search workflow runs a search and then passes the results to a translation pipeline. The translation pipeline translates results to French.

    from txtai import Application

    # Build index
    app = Application("embeddings.yml")
    app.add(data)
    app.index()

    # Run workflow
    list(app.workflow(
        "search",
        ["select text from txtai where similar('feel good story') limit 1"]
    ))

    ['Maine homme gagne $1M à partir de $25 billet de loterie']

SQL functions, in some cases, can accomplish the same thing as a workflow. The function below runs the translation pipeline as a function.

    app.search("select translation(text, 'fr') text from txtai where similar('feel good story') limit 1")

    [{'text': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]

LLM chains with templates are also possible with workflows. Workflows are self-contained and operate both with and without an associated embeddings instance. The following workflow uses an LLM to conditionally translate text to French and then detect the language of the text.

    sequences:
      path: google/flan-t5-large
      torch_dtype: torch.bfloat16

    workflow:
      chain:
        tasks:
          - task: template
            template: Translate '{statement}' to {language} if it's English
            action: sequences
          - task: template
            template: What language is the following text? {text}
            action: sequences

    inputs = [
        {"statement": "Hello, how are you", "language": "French"},
        {"statement": "Hallo, wie geht's dir", "language": "French"}
    ]

    app = Application("workflow.yml")
    list(app.workflow("chain", inputs))

    ['French', 'German']
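Conceptually, a workflow is a sequence of tasks in which the output of one task becomes the input of the next. The sketch below shows that chaining pattern with plain functions. The `workflow` helper and the dictionary-based `translate` and `detect` functions are hypothetical stand-ins for model-backed tasks, not txtai's Workflow class.

```python
def workflow(tasks):
    """Build a callable that runs each task over the batch, in order."""
    def run(elements):
        for task in tasks:
            # The output of one task is the input to the next
            elements = [task(element) for element in elements]
        return elements

    return run

# Hypothetical stand-ins for model-backed tasks
lookup = {"hello": "bonjour"}

def translate(text):
    """Translate known English words to French, pass everything else through."""
    return lookup.get(text.lower(), text)

def detect(text):
    """Label text as French when it is a known translation."""
    return "French" if text in lookup.values() else "Unknown"

chain = workflow([translate, detect])
results = chain(["Hello", "Hallo"])
print(results)
```

Because each task only depends on the previous task's output, tasks can be reordered, replaced or reused across workflows without touching the others.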

Wrapping up

NLP is advancing at a rapid pace. Things not possible even a year ago are now possible. This article introduced txtai, an all-in-one embeddings database. The possibilities are limitless and we’re excited to see what can be built on top of txtai!

Visit the links below for more.

References
