Conduct a search
Full-text, vector, sparse vector, tensor, hybrid search.
Overview
This document offers guidance on conducting a search within Infinity.
Full-text search
Work modes for full-text index
A full-text index must be built to perform a full-text search, and this index operates in two modes:
- Real-time mode - If the full-text index is created immediately after the table is created, it is built on ingested data in real time, accumulating posting data in memory and flushing it to disk when a specified quota is reached.
- Offline mode - For data inserted before the full-text index is created, the index is built in offline mode using external sorting.
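The difference between the two modes can be sketched with a toy in-memory model. This is only an illustration, not Infinity's implementation: the class `ToyRealtimeIndexer`, the function `build_offline`, and the meaning of `quota` (a count of buffered postings) are all hypothetical, and an ordinary in-memory sort stands in for external sorting.

```python
from collections import defaultdict


class ToyRealtimeIndexer:
    """Toy real-time mode: buffer postings in memory, flush when a quota is hit."""

    def __init__(self, quota: int) -> None:
        self.quota = quota               # hypothetical: max postings held in memory
        self.buffer = defaultdict(list)  # term -> row IDs seen since the last flush
        self.buffered = 0                # number of postings currently buffered
        self.segments = []               # stands in for segments flushed to disk

    def insert(self, row_id: int, text: str) -> None:
        # Index the row as it arrives, using plain whitespace tokenization.
        for term in text.lower().split():
            self.buffer[term].append(row_id)
            self.buffered += 1
        if self.buffered >= self.quota:
            self.flush()

    def flush(self) -> None:
        # Move the buffered postings into a new "on-disk" segment.
        if self.buffered:
            self.segments.append({t: list(ids) for t, ids in sorted(self.buffer.items())})
            self.buffer.clear()
            self.buffered = 0


def build_offline(rows):
    """Toy offline mode: sort every (term, row_id) pair, then group by term."""
    pairs = sorted((term, row_id) for row_id, text in rows for term in text.lower().split())
    index = {}
    for term, row_id in pairs:
        index.setdefault(term, []).append(row_id)
    return index
```

In the toy model, real-time indexing produces segments incrementally as rows arrive, whereas offline indexing sees all existing rows at once and can build the postings in a single sorted pass.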
Tokenizer
When creating a full-text index, you must specify a tokenizer/analyzer, which will also be used for all future full-text searches on the same column(s). Infinity has many built-in tokenizers. All analyzers except the Ngram analyzer and the default standard analyzer require dedicated resource files. Please download the appropriate files for your chosen analyzer from this link and save them to the directory specified by resource_dir in the configuration file.
[resource]
# Directory for Infinity's resource files, including dictionaries to be used by analyzers
resource_dir = "/usr/share/infinity/resource"
The following are Infinity's built-in analyzers/tokenizers.
Standard analyzer
The standard analyzer is the default tokenizer and works best with Latin characters. It segments text on white space and applies a stemmer to the tokens before outputting them; English is the default stemmer. To specify a stemmer for a different language, use "standard-xxx", where xxx is the language to use.
Supported language stemmers include: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
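As a small sketch of the "standard-xxx" naming rule, the helper below builds an analyzer name and rejects languages that are not in the list above. The function name `standard_analyzer_name` is hypothetical, and writing the language suffix in lowercase is an assumption; check the exact spelling Infinity expects before relying on it.

```python
from typing import Optional

# Stemmer languages listed in this document, lowercased (the casing is an assumption).
SUPPORTED_STEMMERS = {
    "danish", "dutch", "english", "finnish", "french", "german",
    "hungarian", "italian", "norwegian", "porter", "portuguese",
    "romanian", "russian", "spanish", "swedish", "turkish",
}


def standard_analyzer_name(language: Optional[str] = None) -> str:
    """Build a "standard" or "standard-xxx" analyzer name."""
    if language is None:
        return "standard"  # default analyzer with the English stemmer
    key = language.strip().lower()
    if key not in SUPPORTED_STEMMERS:
        raise ValueError(f"no stemmer listed for language: {language!r}")
    return f"standard-{key}"
```

For example, `standard_analyzer_name("French")` yields `"standard-french"`, while an unlisted language raises `ValueError` instead of producing a name the server might reject.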
Ngram analyzer
A definition of N-gram can be found on Wikipedia. Use "ngram-x" to select the Ngram analyzer, where x is the value of N. For example, "ngram-3" is a common choice for full-text searches in code.
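To make the idea concrete, here is a minimal sketch of character N-gram generation with a sliding window. It illustrates the general technique only; the exact tokens that Infinity's Ngram analyzer emits (for example, how it treats white space or letter case) are not specified here, and the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A window of width n slides one character at a time over the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("foo_bar", 3))  # ['foo', 'oo_', 'o_b', '_ba', 'bar']
```

Because every 3-character window becomes a token, a query such as `o_b` can match inside an identifier like `foo_bar`, which is why N-grams suit code search, where white space is not a reliable word boundary.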
Simplified Chinese analyzer
Use "chinese" to select the simplified Chinese analyzer, which is a wrapper around the Jieba analyzer. Use "chinese-fine" to output fine-grained analyzer results.
Traditional Chinese analyzer
Use "traditional" to select the traditional Chinese analyzer, which essentially converts the output of the Jieba analyzer from simplified Chinese into traditional Chinese.
Japanese analyzer
Use "japanese" to select the Japanese analyzer, which is a wrapper around MeCab.
Korean analyzer
Use "korean" to select the Korean analyzer, which is also a wrapper around MeCab but uses a Korean dictionary instead.
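The analyzer names above can be combined with the resource-file rule from the Tokenizer section in a small helper. This is a sketch under assumptions: the function name `needs_resource_files` is hypothetical, and treating "standard-xxx" variants the same as the default standard analyzer is a guess, since this document only states the exemption for the Ngram analyzer and the default standard analyzer.

```python
# Analyzer families that, per this document, work without dedicated resource files.
# Assumption: "standard-xxx" variants behave like the default "standard" analyzer.
NO_RESOURCE_FAMILIES = ("standard", "ngram")


def needs_resource_files(analyzer: str) -> bool:
    """Return True if the named analyzer is expected to need files in resource_dir."""
    family = analyzer.split("-", 1)[0]  # "ngram-3" -> "ngram", "chinese-fine" -> "chinese"
    return family not in NO_RESOURCE_FAMILIES


for name in ("standard", "ngram-3", "chinese", "chinese-fine", "traditional", "japanese", "korean"):
    print(name, needs_resource_files(name))
```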