Conduct a search
Full-text, vector, sparse vector, tensor, hybrid search.
Overview
This document offers guidance on conducting a search within Infinity.
Full-text search
Work modes for full-text index
A full-text index must be built to perform a full-text search, and this index operates in two modes:
- Real-time mode - If created immediately after a table is created, a full-text index will be built on ingested data in real time, accumulating posting data within memory and flushing it to disk when a specified quota is reached.
- Offline mode - For data inserted before the full-text index is created, the index is built in offline mode using external sorting.
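The real-time mode described above can be pictured with a small sketch. This is not Infinity's implementation: the class name `RealTimeIndexer`, the `quota_postings` parameter, and the use of a Python list as a stand-in for on-disk segments are all assumptions made for illustration. It only shows the general pattern of accumulating postings in memory and flushing once a quota is reached.

```python
from collections import defaultdict


class RealTimeIndexer:
    """Toy posting accumulator (hypothetical, not Infinity's code)."""

    def __init__(self, quota_postings=4):
        # Hypothetical quota, counted in postings rather than bytes.
        self.quota_postings = quota_postings
        self.memory = defaultdict(list)  # term -> list of doc ids
        self.posting_count = 0
        self.flushed_segments = []       # stands in for segments on disk

    def add_document(self, doc_id, text):
        # One posting per distinct term in the document.
        for term in set(text.lower().split()):
            self.memory[term].append(doc_id)
            self.posting_count += 1
        if self.posting_count >= self.quota_postings:
            self.flush()

    def flush(self):
        if self.posting_count == 0:
            return
        # Write terms in sorted order, then reset the in-memory buffer.
        segment = {term: list(ids) for term, ids in sorted(self.memory.items())}
        self.flushed_segments.append(segment)
        self.memory.clear()
        self.posting_count = 0


indexer = RealTimeIndexer(quota_postings=4)
indexer.add_document(1, "bloom filter")
indexer.add_document(2, "space efficient filter")  # quota reached, flush happens
indexer.add_document(3, "bloom")                   # stays in memory
```

After these three calls, one segment has been flushed and the posting for document 3 is still held in memory, waiting for the next flush.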
Tokenizer
When creating a full-text index, you must specify a tokenizer/analyzer, which will be used for future full-text searches on the same column(s). Infinity has many built-in tokenizers. Except for the Ngram analyzer and the default standard analyzer, all analyzers require dedicated resource files. Please download the appropriate files for your chosen analyzer from this link and save them to the directory specified by resource_dir in the configuration file.
[resource]
# Directory for Infinity's resource files, including dictionaries to be used by analyzers
resource_dir = "/var/infinity/resource"
The following are Infinity's built-in analyzers/tokenizers.
Standard analyzer
The standard analyzer is the default tokenizer and works best with Latin characters. It segments text by white space and applies a stemmer before outputting tokens; English is the default stemmer. To specify a stemmer for a different language, use "standard-xxx", where xxx is the language to use.
Supported language stemmers include: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
Ngram analyzer
A definition of N-gram can be found on Wikipedia. Use "ngram-x" to select the Ngram analyzer, where x represents the value of N. For example, a common choice for full-text searches in code is "ngram-3".
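To make the idea concrete, here is a minimal sketch of character N-gram tokenization. It assumes plain overlapping character windows; Infinity's Ngram analyzer may treat whitespace, case, or multi-byte characters differently. The helper names `parse_ngram_analyzer` and `ngram_tokens` are hypothetical and not part of any Infinity API.

```python
def parse_ngram_analyzer(name):
    """Extract N from an analyzer string such as "ngram-3"."""
    prefix, _, n = name.partition("-")
    if prefix != "ngram" or not n.isdigit():
        raise ValueError(f"not an ngram analyzer name: {name!r}")
    return int(n)


def ngram_tokens(text, n):
    """Return all overlapping character n-grams of text."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


n = parse_ngram_analyzer("ngram-3")
tokens = ngram_tokens("foo_bar", n)
print(tokens)  # ['foo', 'oo_', 'o_b', '_ba', 'bar']
```

Because every three-character window becomes a token, a query for a fragment such as "o_b" can match inside an identifier, which is why this scheme suits code search.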
Simplified Chinese analyzer
Use "chinese" to select the simplified Chinese analyzer, which is a wrapper around the Jieba analyzer. Use "chinese-fine" to output fine-grained analyzer results.
Traditional Chinese analyzer
Use "traditional" to select the traditional Chinese analyzer, which essentially converts simplified Chinese into traditional Chinese based on the outputs of the Jieba analyzer.
Japanese analyzer
Use "japanese" to select the Japanese analyzer, which is a wrapper around MeCab.
Korean analyzer
Use "korean" to select the Korean tokenizer, which is also a wrapper around MeCab but uses a Korean dictionary.
RAG analyzer
The RAG analyzer is a bilingual tokenizer that supports Chinese (simplified and traditional) and English. It is a C++ adaptation of RAGFlow's tokenizer, and its tokenization of Latin characters derives from NLTK.
This analyzer offers better recall for Chinese than the Jieba analyzer, but lower tokenization throughput due to higher computational costs. Its English language processing involves an additional lemmatization step before stemming, different from that of the standard analyzer.
Use "rag" to select the RAG analyzer or "rag-fine" for fine-grained mode, which outputs tokenization results with the second highest score.
Both RAG tokenization and fine-grained RAG tokenization are used in RAGFlow to ensure high recall.
IK analyzer
The IK analyzer is a bilingual tokenizer that supports Chinese (simplified and traditional) and English. It is a C++ adaptation of the IK Analyzer, which is widely used as a tokenizer by Chinese Elasticsearch users.
Use "ik" to select this analyzer, which is equivalent to the ik_smart option in the IK Analyzer, or "ik-fine" for fine-grained mode, which is equivalent to the ik_max_word option in the IK Analyzer.
Keyword analyzer
The keyword analyzer is a "noop" analyzer used for columns containing keywords only, where traditional scoring methods like BM25 do not apply. It scores 0 or 1, depending on whether any keywords are matched.
Use "keyword" to select this analyzer.
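The 0-or-1 scoring can be sketched as follows. This is an illustration only: the function name `keyword_score` is hypothetical, and the sketch takes the stored keywords as an already-prepared list, since how Infinity derives keywords from a column value is not covered here.

```python
def keyword_score(stored_keywords, query_keywords):
    """Return 1 if any query keyword matches a stored keyword exactly, else 0."""
    # Exact, case-sensitive comparison: a noop analyzer does no normalization.
    return 1 if set(stored_keywords) & set(query_keywords) else 0


hit = keyword_score(["database", "search"], ["search", "vector"])
miss = keyword_score(["database", "search"], ["Search"])
print(hit, miss)  # 1 0
```

Unlike BM25, the score does not depend on term frequency or document length; a row either matches or it does not.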
Search and ranking syntax
Infinity supports the following syntax for full-text search expressions:
- Single term
- AND multiple terms
- OR multiple terms
- Phrase search
- CARAT operator
- Sloppy phrase search
- Field-specific search
- Escape character
Single term
Example: "blooms"
AND multiple terms
Example: "space AND efficient"
OR multiple terms
Examples: "Bloom OR filter", "Bloom filter"
OR is the default semantic in a multi-term full-text search unless explicitly specified otherwise.
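The AND and OR semantics above, including OR as the default, can be illustrated with a toy inverted index. This is a simplified sketch rather than Infinity's query parser: `build_index` and `search` are hypothetical helpers, there is no stemming or scoring, and a query may use only one kind of operator.

```python
def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Evaluate a query that uses AND, OR, or no operator (treated as OR)."""
    words = query.split()
    if "AND" in words and "OR" in words:
        raise ValueError("this sketch does not support mixed operators")
    use_and = "AND" in words
    terms = [w.lower() for w in words if w not in ("AND", "OR")]
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    if use_and:
        return set.intersection(*postings)
    return set.union(*postings)


docs = {
    1: "a Bloom filter is space efficient",
    2: "an air filter",
    3: "efficient code",
}
index = build_index(docs)
print(search(index, "space AND efficient"))  # {1}
print(search(index, "Bloom OR filter"))      # {1, 2}
print(search(index, "Bloom filter"))         # {1, 2}
```

The last two queries return the same documents, which is what OR being the default for multi-term queries means in practice.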