Adding upstream version 2.5.1.
Signed-off-by: Daniel Baumann <daniel@debian.org>
Parent: c71cb8b61d
Commit: 982828099e
783 changed files with 150650 additions and 0 deletions
BIN docs/bleve.png (new file, 6.6 KiB; binary file not shown)
docs/geo.md (new file, +3)

@@ -0,0 +1,3 @@
# Geo spatial search

Redirect to [geo/README.md](https://github.com/blevesearch/bleve/blob/master/geo/README.md)
docs/scoring.md (new file, +88)

@@ -0,0 +1,88 @@
# Scoring models for document hits

* Search is performed on a collection of fields using compound queries such as conjunction/disjunction/boolean. However, the scoring itself is done independently for each field and then aggregated to get the final score for a document hit.
* The default scoring scheme for document hits involving text is `tf-idf`.
* Nearest-neighbor/vector hit scoring depends on the chosen `knn distance` metric, highlighted [here](https://github.com/blevesearch/bleve/blob/master/docs/vectors.md#supported).
* Hybrid search scoring combines `tf-idf` scores with `knn distance` numbers.
* *v2.5.0* (and after) comes with support for `bm25` scoring for exact searches.

## BM25

To score a document hit for a specific field, the BM25 scoring mechanism requires the following stats:

* fieldLen - The number of analyzed terms in the current document's field.
* avgFieldLen - The average number of analyzed terms in the field across all the documents.
* docTotal - The total number of documents in the index.
* docTerm - The total number of documents containing the query term within the index.
The BM25 scoring formula is:

```math
\sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{fieldLen}{avgFieldLen}\right)}
```

Here, $IDF(q_i)$ is the Inverse Document Frequency of the query term $q_i$: a measure of how rare (and hence how rich in information) the term is across all the documents in the index. It is calculated as

```math
\ln\left(1 + \frac{docTotal - docTerm + 0.5}{docTerm + 0.5}\right)
```

Coming back to the BM25 formula, $f(q_i, D)$ refers to the frequency of the query term $q_i$ in document $D$. The equation has two tunable multipliers:

* $k_1$ - controls the saturation of the score with respect to a query term's frequency in a document. Basically, if the query term's frequency is very high, the score value gets saturated and doesn't increase beyond a certain point.
* $b$ - controls the extent to which $fieldLen$ normalizes the term's frequency.
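
For intuition, a quick worked example with assumed stats (not from a real index): $docTotal = 1000$, $docTerm = 10$, $f(q_i, D) = 2$, $fieldLen = 50$, $avgFieldLen = 100$, and the commonly used defaults $k_1 = 1.2$, $b = 0.75$:

```math
IDF \approx \ln\left(1 + \frac{1000 - 10 + 0.5}{10 + 0.5}\right) \approx 4.56,
\qquad
score \approx 4.56 \cdot \frac{2 \cdot (1.2 + 1)}{2 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{50}{100}\right)} = 4.56 \cdot \frac{4.4}{2.75} \approx 7.3
```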

### How to enable and use BM25

Bleve v2.5.0 extends the `indexMapping` construct with the concept of a `scoringModel`. This is a global setting (applicable to all fields) that selects the scoring algorithm to apply while scoring document hits. Supported scoring models are defined [here](https://github.com/blevesearch/bleve_index_api/blob/f54d76f0a71a838837159aa44ced0404bb6ec25f/indexing_options.go#L27).

For instance, when defining the index mapping for your data model, the following snippet can be used to enable BM25:
```go
// "index" here is the bleve_index_api package:
// import index "github.com/blevesearch/bleve_index_api"
indexMapping := bleve.NewIndexMapping()
indexMapping.TypeField = "type"
indexMapping.DefaultAnalyzer = "en"
indexMapping.ScoringModel = index.BM25Scoring
```

At search time, there is no explicit change involved, unless the user wants to perform global scoring.

### Global Scoring

Say the user has a dataset which is quite large (say 3 million documents) and, to get good throughput, creates 3 shards (with the same index mapping) for the "index". Each of these shards can be a `bleve.Index`, and to search over the entire dataset a `bleve.IndexAlias` can be created over them. This parallelizes things quite well, both on the indexing path and the search path.

The concept of global scoring is applicable when the index is "sharded" (as in the situation above). Each index holds a disjoint subset of the data, so while scoring document hits on each of them the stats are incomplete at a global level, even though the search runs over the entire dataset via the `bleve.IndexAlias`. For example, the `docTotal` value used while scoring document hits would be 1 million, which is incorrect at the global level.

So, to keep the scoring roughly the same across a varying number of shards, we provide a mechanism to enable "global scoring". In this type of search, an initial roundtrip is performed to gather and aggregate the stats necessary for the scoring mechanism, and in the second phase the actual search is performed. Naturally, this comes at a cost in latency. As a reference, here's how the user can go about it:
```go
// assumes: "context", "github.com/blevesearch/bleve/v2" and
// "github.com/blevesearch/bleve/v2/search" are imported
multiPartIndex := bleve.NewIndexAlias(shard1, shard2)

// set the alias with the same index mapping which both the shards use.
err = multiPartIndex.SetIndexMapping(indexMapping)
if err != nil {
	return err
}

ctx := context.Background()
ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)

res, err := multiPartIndex.SearchInContext(ctx, searchRequest)
```

Note that this only matters if the relative order of the document hits varies quite a bit (vs. the single-shard case). That can happen when the shard count grows large, in low doc-count situations, or if there is a heavy skew in the data distribution amongst the shards for some reason. Ideally, shards are created when the data is quite large and each of them indexes a similar amount of data, in which case the scores won't fluctuate enough to affect the relative hit order, and the user can choose to avoid the global scoring mechanism altogether.
## TF-IDF

TF-IDF is the default scoring mechanism (for backward compatibility reasons) and requires no change from the user at index or search time to avail it.

The scoring formula involved is

```math
\sum_{i=1}^{n} f(q_i, D) \cdot \frac{1}{\sqrt{fieldLen}} \cdot IDF(q_i)
```

where $IDF(q_i)$ is

```math
1 + \frac{docTotal}{1 + docTerm}
```

Note: the TF-IDF formula doesn't accommodate logic for score saturation due to term frequency or fieldLen. So it's recommended to use BM25 scoring by explicitly setting it in the index mapping.
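
With the same assumed stats as the BM25 example above ($f(q_i, D) = 2$, $fieldLen = 50$, $docTotal = 1000$, $docTerm = 10$), the TF-IDF score works out to roughly $2 \cdot \frac{1}{\sqrt{50}} \cdot \left(1 + \frac{1000}{11}\right) \approx 26$. Note how, unlike BM25, doubling the term frequency would simply double this score, with no saturation.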

docs/sort_facet.md (new file, +786)

@@ -0,0 +1,786 @@
<h2>Purpose of Docvalues</h2>

<h3>Background</h3>

<p align="justify">What are docValues? In the index mapping, there is an option to enable or disable docValues for a specific field mapping. However, what does it actually mean to activate or deactivate docValues, and how does it impact the end user? This document aims to address these questions.</p>
<pre>
"default_mapping": {
  "dynamic": true,
  "enabled": true,
  "properties": {
    "loremIpsum": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "loremIpsum",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": true
        }
      ]
    }
  }
}
</pre>

<p align="justify">Enabling docValues will always result in an increase in the size of your Bleve index, leading to a corresponding increase in disk usage. But what advantages can you expect in return? This document also quantitatively assesses this trade-off with a test case.</p>

<p align="justify">In a more general sense, we recommend enabling docValues on a field mapping if you anticipate queries that involve sorting and/or facet operations on that field. It's important to note, though, that sorting and faceting will work irrespective of whether docValues are enabled or not. This may lead you to wonder if there's any real benefit to enabling docValues, since you're allocating extra disk space without an apparent return. The real advantage, however, becomes evident in enhanced query response times and reduced memory consumption during active usage. By accepting a minor increase in the disk space used by your Full-Text Search (FTS) index, you can anticipate better performance in handling search requests that involve sorting and faceting.</p>
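
For readers configuring this through the Go API rather than raw JSON, here is a minimal sketch of a field mapping equivalent to the JSON above:

```go
// assumes: "github.com/blevesearch/bleve/v2" imported as bleve
loremFieldMapping := bleve.NewTextFieldMapping()
loremFieldMapping.Store = false
loremFieldMapping.Index = true
loremFieldMapping.IncludeTermVectors = false
loremFieldMapping.IncludeInAll = false
loremFieldMapping.DocValues = true // the option this document is about

indexMapping := bleve.NewIndexMapping()
indexMapping.DefaultMapping.AddFieldMappingsAt("loremIpsum", loremFieldMapping)
```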

<h3>Usage</h3>

<p align="justify">The initial use of docValues comes into play when sorting is involved. In the search request JSON, there is a field named "sort". This optional "sort" field can have a slice of JSON objects as its value. Each JSON object must belong to one of the following types:
<ul>
<li>SortDocID</li>
<li>SortScore (which is the default if none is specified)</li>
<li>SortGeoDistance</li>
<li>SortField</li>
</ul>
</p>
<p align="justify">DocValues are relevant only when any of the JSON objects in the "sort" field are of type SortGeoDistance or SortField. This means that if you expect queries on a field F where the queries either do not specify a value for the "sort" field or provide a JSON object of type SortDocID or SortScore, enabling docValues will not improve sorting operations, and as a result query latency will remain unchanged. It's worth noting that the default sorting object, SortScore, does not require docValues to be enabled for any of the field mappings. Therefore, a search request without a sorting operation will not utilize docValues at all.</p>
<div style="overflow-x: auto;">
<table>
<tr>
<th>No Sort Objects</th>
<th>SortDocID</th>
<th>SortScore</th>
<th>SortField</th>
<th>SortGeoDistance</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "dolor"
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "sit_amet"
  },
  "sort": [
    {
      "by": "id",
      "desc": true
    }
  ],
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "sit_amet"
  },
  "sort": [
    {
      "by": "score"
    }
  ],
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "sit_amet"
  },
  "sort": [
    {
      "by": "field",
      "field": "dolor",
      "type": "auto",
      "mode": "min",
      "missing": "last"
    }
  ],
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "dolor"
  },
  "sort": [
    {
      "by": "geo_distance",
      "field": "sit_amet",
      "location": [123.223, 34.33],
      "unit": "km"
    }
  ],
  "size": 10,
  "from": 0
}
</pre>
</td>
</tr>
<tr align="center">
<td>No DocValues used</td>
<td>No DocValues used</td>
<td>No DocValues used</td>
<td>DocValues used for field "dolor". Field mapping for "dolor" may enable docValues.</td>
<td>DocValues used for field "sit_amet". Field mapping for "sit_amet" may enable docValues.</td>
</tr>
</table>
</div>
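
Equivalently via the Go API, a minimal sketch (field names reused from the table above) of attaching a field sort to a search request:

```go
// assumes: "github.com/blevesearch/bleve/v2" imported as bleve
query := bleve.NewMatchQuery("lorem ipsum")
query.SetField("sit_amet")

searchRequest := bleve.NewSearchRequest(query)
// Sort by the "dolor" field, descending ("-" prefix); this is the path
// that benefits from docValues being enabled on "dolor".
searchRequest.SortBy([]string{"-dolor"})
```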
<p align="justify">Now, let's consider faceting. The search request object also includes another field called "facets," where you can specify a collection of facet requests, with each request being associated with a unique name. Each of these facet requests can fall into one of three types:
|
||||
<ul>
|
||||
<li>Date range</li>
|
||||
<li>Numeric range</li>
|
||||
<li>Term facet</li>
|
||||
</ul>
|
||||
Enabling docValues for the fields associated with such facet requests might provide benefits in this context.</p>
|
||||
<div style="overflow-x: auto;">
|
||||
<table>
|
||||
<tr>
|
||||
<th>No Facet Request</th>
|
||||
<th>Date Range Facet</th>
|
||||
<th>Numeric Range Facet</th>
|
||||
<th>Term Facet</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="vertical-align: top; width: 20%;">
|
||||
<pre>
|
||||
{
|
||||
"explain": true,
|
||||
"fields": [
|
||||
"*"
|
||||
],
|
||||
"highlight": {},
|
||||
"query": {
|
||||
"match": "lorem ipsum",
|
||||
"field": "dolor"
|
||||
},
|
||||
"size": 10,
|
||||
"from": 0
|
||||
}
|
||||
</pre>
|
||||
</td>
|
||||
<td style="vertical-align: top; width: 20%;">
|
||||
<pre>
|
||||
{
|
||||
"explain": true,
|
||||
"fields": [
|
||||
"*"
|
||||
],
|
||||
"highlight": {},
|
||||
"query": {
|
||||
"match": "lorem ipsum",
|
||||
"field": "sit_amet"
|
||||
},
|
||||
"facet": {
|
||||
"facetA": {
|
||||
"size": 1,
|
||||
"field": "dolor",
|
||||
"date_ranges": [
|
||||
{
|
||||
"name": "lorem",
|
||||
"start": "20/August/2001",
|
||||
"end": "22/August/2002",
|
||||
"datetime_parser": "custDT"
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"size": 10,
|
||||
"from": 0
|
||||
}
|
||||
</pre>
|
||||
</td>
|
||||
<td style="vertical-align: top; width: 20%;">
|
||||
<pre>
|
||||
{
|
||||
"explain": true,
|
||||
"fields": [
|
||||
"*"
|
||||
],
|
||||
"highlight": {},
|
||||
"query": {
|
||||
"match": "lorem ipsum",
|
||||
"field": "sit_amet"
|
||||
},
|
||||
"facet": {
|
||||
"facetA": {
|
||||
"size": 1,
|
||||
"field": "dolor",
|
||||
"numeric_ranges":[
|
||||
{
|
||||
"name":"lorem",
|
||||
"min":22,
|
||||
"max":34
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"size": 10,
|
||||
"from": 0
|
||||
}
|
||||
</pre>
|
||||
</td>
|
||||
<td style="vertical-align: top; width: 20%;">
|
||||
<pre>
|
||||
{
|
||||
"explain": true,
|
||||
"fields": [
|
||||
"*"
|
||||
],
|
||||
"highlight": {},
|
||||
"query": {
|
||||
"match": "lorem ipsum",
|
||||
"field": "sit_amet"
|
||||
},
|
||||
"facet": {
|
||||
"facetA": {
|
||||
"size": 1,
|
||||
"field": "dolor"
|
||||
}
|
||||
},
|
||||
"size": 10,
|
||||
"from": 0
|
||||
}
|
||||
</pre>
|
||||
</td>
|
||||
</tr>
|
||||
<tr align="center">
|
||||
<td>No DocValues used</td>
|
||||
<td colspan="3">DocValues used for field "dolor". Field Mapping for "dolor" may enable docValues.</td>
|
||||
</tr>
|
||||
</table>
|
||||
</div>
|
||||
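
The same kind of facet requests can be built via the Go API; a minimal sketch (facet names illustrative):

```go
searchRequest := bleve.NewSearchRequest(bleve.NewMatchQuery("lorem ipsum"))

// Term facet over "dolor"; benefits from docValues on "dolor".
termFacet := bleve.NewFacetRequest("dolor", 1)
searchRequest.AddFacet("facetA", termFacet)

// Numeric range facet over "dolor".
min, max := 22.0, 34.0
numFacet := bleve.NewFacetRequest("dolor", 1)
numFacet.AddNumericRange("lorem", &min, &max)
searchRequest.AddFacet("facetB", numFacet)
```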

<p align="justify">In summary, when a search request is received by the Bleve index, it extracts all the fields from the sort objects and facet objects. To potentially benefit from docValues, you should consider enabling docValues for the fields mentioned in SortField and SortGeoDistance sort objects, as well as the fields associated with all the facet objects. By doing so, you can optimize sorting and faceting operations in your search queries.</p>
<div style="overflow-x: auto;">
<table>
<tr>
<th>Combo A</th>
<th>Combo B</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "sit_amet"
  },
  "facets": {
    "facetA": {
      "size": 1,
      "field": "dolor",
      "date_ranges": [
        {
          "name": "lorem",
          "start": "20/August/2001",
          "end": "22/August/2002",
          "datetime_parser": "custDT"
        }
      ]
    }
  },
  "sort": [
    {
      "by": "field",
      "field": "lorem",
      "type": "auto",
      "mode": "min",
      "missing": "last"
    }
  ],
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "lorem ipsum",
    "field": "sit_amet"
  },
  "facets": {
    "facetA": {
      "size": 1,
      "field": "dolor",
      "numeric_ranges": [
        {
          "name": "lorem",
          "min": 22,
          "max": 34
        }
      ]
    }
  },
  "sort": [
    {
      "by": "geo_distance",
      "field": "ipsum",
      "location": [123.223, 34.33],
      "unit": "km"
    }
  ],
  "size": 10,
  "from": 0
}
</pre>
</td>
</tr>
<tr align="center">
<td>DocValues used for fields "dolor" and "lorem". Field mappings for "dolor" and "lorem" may enable docValues.</td>
<td>DocValues used for fields "dolor" and "ipsum". Field mappings for "dolor" and "ipsum" may enable docValues.</td>
</tr>
</table>
</div>

<h3>Empirical Analysis</h3>

<p align="justify">To evaluate our hypothesis, I've set up a sample dataset on my personal computer and created two Bleve indexes: one with docValues enabled for three fields (<code>dummyDate</code>, <code>dummyNumber</code>, and <code>dummyTerm</code>), and another where docValues are disabled for the same three fields. These field mappings were incorporated into the default mapping. It's important to mention that for both indexes, docValues for dynamic fields were enabled, as the default mapping is dynamic.</p>

<p align="justify">The values for <code>dummyDate</code> and <code>dummyNumber</code> were configured to increase monotonically, with <code>dummyDate</code> representing a date value and <code>dummyNumber</code> representing a numeric value. This setup was intentional, to ensure that facet aggregation would consistently result in cache hits and misses, providing a useful testing scenario.</p>
<div style="overflow-x: auto;">
<table>
<tr>
<th>Index A</th>
<th>Index B</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
"default_mapping": {
  "dynamic": true,
  "enabled": true,
  "properties": {
    "dummyNumber": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "dummyNumber",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": true
        }
      ]
    },
    "dummyTerm": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "dummyTerm",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": true
        }
      ]
    },
    "dummyDate": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "dummyDate",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": true
        }
      ]
    }
  }
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
"default_mapping": {
  "dynamic": true,
  "enabled": true,
  "properties": {
    "dummyNumber": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "dummyNumber",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": false
        }
      ]
    },
    "dummyTerm": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "dummyTerm",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": false
        }
      ]
    },
    "dummyDate": {
      "enabled": true,
      "dynamic": false,
      "fields": [
        {
          "name": "dummyDate",
          "type": "text",
          "store": false,
          "index": true,
          "include_term_vectors": false,
          "include_in_all": false,
          "docvalues": false
        }
      ]
    }
  }
}
</pre>
</td>
</tr>
<tr align="center">
<td>DocValues enabled across all three field mappings</td>
<td>DocValues disabled across all three field mappings</td>
</tr>
</table>
</div>

Document format used for the test scenario:

<div style="overflow-x: auto;">
<table>
<tr>
<th>Document 1</th>
<th>Document 2</th>
<th>... Document i</th>
<th>Document 5000</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T00:00:00",
  "dummyNumber": 1
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T01:00:00",
  "dummyNumber": 2
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T01:00:00" + (i hours),
  "dummyNumber": i
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T01:00:00" + (5000 hours),
  "dummyNumber": 5000
}
</pre>
</td>
</tr>
</table>
</div>
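
A minimal sketch of how such documents could be generated and indexed (helper names are illustrative, not from the original test harness):

```go
import (
	"fmt"
	"time"

	"github.com/blevesearch/bleve/v2"
)

// testDoc mirrors the document format shown above.
type testDoc struct {
	DummyTerm   string `json:"dummyTerm"`
	DummyDate   string `json:"dummyDate"`
	DummyNumber int    `json:"dummyNumber"`
}

// indexTestDocs indexes n documents whose dummyDate/dummyNumber values
// increase monotonically, one hour / one integer apart.
func indexTestDocs(idx bleve.Index, n int) error {
	start := time.Date(2000, 1, 1, 0, 0, 0, 0, time.UTC)
	for i := 1; i <= n; i++ {
		doc := testDoc{
			DummyTerm:   "Term",
			DummyDate:   start.Add(time.Duration(i-1) * time.Hour).Format("2006-01-02T15:04:05"),
			DummyNumber: i,
		}
		if err := idx.Index(fmt.Sprintf("doc%d", i), doc); err != nil {
			return err
		}
	}
	return nil
}
```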

<p align="justify">I then ran the following set of search requests across both indexes, while increasing the number of documents indexed from 2000 to 4000.</p>
<div style="overflow-x: auto;">
<table>
<tr>
<th>Request 1</th>
<th>Request 2</th>
<th>... Request i</th>
<th>Request 1000</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        {
          "start": "2000-01-01T00:00:00",
          "end": "2000-01-01T01:00:00"
        }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        {
          "min": 1000,
          "max": 1001
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        {
          "start": "2000-01-01T01:00:00",
          "end": "2000-01-01T02:00:00"
        }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        {
          "min": 999,
          "max": 1000
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        {
          "start": "2000-01-01T00:00:00" + i hours,
          "end": "2000-01-01T00:00:00" + (i+1) hours
        }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        {
          "min": 1000-i,
          "max": 1000-i+1
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": ["*"],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        {
          "start": "2000-01-01T01:00:00" + 1000 hours,
          "end": "2000-01-01T02:00:00" + 1001 hours
        }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        {
          "min": 0,
          "max": 1
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
</tr>
</table>
</div>

<div style="overflow-x: auto;">
<table>
<tr>
<th>Bleve index size growth with increase in indexed documents</th>
<th>Total query time for 1000 queries with increase in number of indexed documents</th>
</tr>
<tr>
<td><img src="sort_facet_supporting_docs/indexSizeVsNumDocs.png" alt="indexSizeVsNumDocs.png"/></td>
<td><img src="sort_facet_supporting_docs/queryTimevsNumDocs.png" alt="queryTimevsNumDocs.png"/></td>
</tr>
</table>
</div>

<div style="overflow-x: auto;">
<table>
<tr>
<th style="width:50%">Average increase in index size (in bytes) by enabling docValues</th>
<th style="width:50%">Average reduction in time taken to perform 1000 queries (in milliseconds) by enabling docValues</th>
</tr>
<tr>
<td align="center"><code>7762.47</code></td>
<td align="center"><code>27.034</code></td>
</tr>
</table>
</div>

Even at this small scale, with a small document size and a very limited number of indexed documents, we still observe a noticeable trade-off: for just a slight increase in index size (an average of roughly 7.6 KB), we obtain an average reduction of roughly 27 ms in the total execution time of 1000 queries.

<h3>Technical Information</h3>

<p align="justify">When a search request involves facet or sorting operations on a field F, these operations occur after the main search query is executed. For instance, if the main query yields a result of 200 documents, the sorting and faceting processes are applied to those 200 documents. However, the main query result only provides a set of document IDs, not the actual document contents.</p>

<p align="justify">Here's where docValues become essential. If the field mapping for F has docValues enabled, the system can directly access the values for the field from the stored docValues section of the index file. This means that for each document ID returned in the search result, the field values are readily available.</p>

<p align="justify">However, if docValues are not enabled for field F, the system must take a different approach. It needs to "fetch the document" from the index file, read the value for field F, and cache this field-document pair in memory for further processing. The issue becomes apparent in this latter scenario. By not enabling docValues for field F, you essentially retrieve all the documents in the search result (in the worst case), which can be a substantial amount of data. Moreover, you have to cache this information in memory, leading to increased memory usage. As a result, query latency suffers significantly, because you're essentially fetching and processing all documents, which can be both time-consuming and resource-intensive. Enabling docValues for the relevant fields is, therefore, a crucial optimization to enhance query performance and reduce memory overhead in such situations.</p>
BIN docs/sort_facet_supporting_docs/indexSizeVsNumDocs.png (new file, 30 KiB; binary file not shown)
BIN docs/sort_facet_supporting_docs/queryTimevsNumDocs.png (new file, 31 KiB; binary file not shown)
docs/synonyms.md (new file, +180)

@@ -0,0 +1,180 @@
# Synonym search

* *v2.5.0* (and after) comes with support for **synonym definition indexing and search**.
* We've achieved this by embedding synonym indexes within our bleve (scorch) indexes.
* Usage of zap file format: [v16](https://github.com/blevesearch/zapx/blob/master/zap.md). Here we co-locate text, vector and synonym indexes as neighbors within segments, continuing to conform to the segmented architecture of *scorch*.

## Supported

* Indexing `Synonym Definitions` allows specifying equivalent terms that will be used to construct the synonym index. There are currently two types of `Synonym Definitions` supported:

1. Equivalent Mapping:

   In this type, all terms in the *synonyms* list are considered equal and can replace one another. Any of these terms can match a query or document containing any other term in the group, ensuring full synonym coverage.

   ```json
   {
     "synonyms": [
       "tranquil",
       "peaceful",
       "calm",
       "relaxed",
       "unruffled"
     ]
   }
   ```

2. Explicit Mapping:

   In this mapping, only the terms in the *input* list ("blazing") will have the terms in *synonyms* as their synonyms. The input terms are not equivalent to each other, and the synonym relationship is explicitly directional, applying only from the *input* to the *synonyms*.

   ```json
   {
     "input": [
       "blazing"
     ],
     "synonyms": [
       "intense",
       "radiant",
       "burning",
       "fiery",
       "glowing"
     ]
   }
   ```
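
Expressed with the Go API used later in this document (and assuming `bleve.SynonymDefinition` exposes `Input` and `Synonyms` fields mirroring the JSON above), the two definition types would look like:

```go
// Equivalent mapping: all terms are interchangeable.
equivalent := &bleve.SynonymDefinition{
	Synonyms: []string{"tranquil", "peaceful", "calm", "relaxed", "unruffled"},
}

// Explicit mapping: only "blazing" expands to the listed synonyms.
explicit := &bleve.SynonymDefinition{
	Input:    []string{"blazing"},
	Synonyms: []string{"intense", "radiant", "burning", "fiery", "glowing"},
}
```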

* The addition of `Synonym Sources` in the index mapping enables associating a set of `synonym definitions` (called a `synonym collection`) with a specific analyzer. This allows for preprocessing of terms in both the *input* and *synonyms* lists before the synonym index is created. By using an analyzer, you can normalize or transform terms (e.g., case folding, stemming) to improve synonym matching.

```json
{
  "analysis": {
    "synonym_sources": {
      "english": {
        "collection": "en_thesaurus",
        "analyzer": "en"
      },
      "german": {
        "collection": "de_thesaurus",
        "analyzer": "de"
      }
    }
  }
}
```

Here there are two `synonym sources`, named "english" and "german", each associated with its respective `synonym collection` and analyzer. In any text field mapping, a `synonym source` can be specified to enable synonym expansion when the field is queried. The analyzer of the synonym source must match the analyzer of the field mapping to which it is applied.

* Any text-based Bleve query (e.g., match, phrase, term, fuzzy, etc.) will use the `synonym source` (if available) for the queried field to expand the search terms using the thesaurus created from user-defined synonym definitions. The behavior for specific query types is as follows (see the sketch after this list):

1. Queries with a `fuzziness` parameter: for queries like match, phrase, and match-phrase that support the `fuzziness` parameter, the queried terms are fuzzily matched with the thesaurus's LHS terms to generate candidate terms. These terms are then combined with the results of fuzzy matching against the field dictionary, which contains the terms present in the queried field.

2. Wildcard, Regexp, and Prefix queries: these queries follow a similar approach. First, the thesaurus is used to expand terms (e.g., LHS terms that match the prefix or regex). The resulting terms are then combined with candidate terms from dictionary expansion.
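
For example, a minimal sketch of a fuzzy match query that would go through this thesaurus expansion (term and field values illustrative):

```go
// "blazin" is within edit distance 1 of the thesaurus LHS term "blazing",
// so it can expand to "intense", "radiant", etc., in addition to whatever
// fuzzy matches exist in the field dictionary itself.
query := bleve.NewMatchQuery("blazin")
query.SetField("text")
query.SetFuzziness(1)

searchRequest := bleve.NewSearchRequest(query)
```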

## Indexing

Below is an example of using the Bleve API to define synonym sources, index synonym definitions, and associate them with a text field mapping:

```go
// Define a document to be indexed.
doc := struct {
	Text string `json:"text"`
}{
	Text: "hardworking employee",
}

// Define a synonym definition where "hardworking" has equivalent terms.
synDef := &bleve.SynonymDefinition{
	Synonyms: []string{
		"hardworking",
		"industrious",
		"conscientious",
		"persistent",
	},
}

// Define the name of the `synonym collection`.
// This collection groups multiple synonym definitions.
synonymCollection := "collection1"

// Define the name of the `synonym source`.
// This source will be associated with specific field mappings.
synonymSourceName := "english"

// Define the analyzer to process terms in the synonym definitions.
// This analyzer must match the one applied to the field using the synonym source.
analyzer := "en"

// Configure the synonym source by associating it with the synonym collection and analyzer.
synonymSourceConfig := map[string]interface{}{
	"collection": synonymCollection,
	"analyzer":   analyzer,
}

// Create a new index mapping.
bleveMapping := bleve.NewIndexMapping()

// Add the synonym source configuration to the index mapping.
err := bleveMapping.AddSynonymSource(synonymSourceName, synonymSourceConfig)
if err != nil {
	panic(err)
}

// Create a text field mapping with the specified analyzer and synonym source.
textFieldMapping := bleve.NewTextFieldMapping()
textFieldMapping.Analyzer = analyzer
textFieldMapping.SynonymSource = synonymSourceName

// Associate the text field mapping with the "text" field in the default document mapping.
bleveMapping.DefaultMapping.AddFieldMappingsAt("text", textFieldMapping)

// Create a new index with the specified mapping.
index, err := bleve.New("example.bleve", bleveMapping)
if err != nil {
	panic(err)
}

// Index the document into the created index.
err = index.Index("doc1", doc)
if err != nil {
	panic(err)
}

// Check if the index supports synonym indexing and add the synonym definition.
if synIndex, ok := index.(bleve.SynonymIndex); ok {
	err = synIndex.IndexSynonym("synDoc1", synonymCollection, synDef)
	if err != nil {
		panic(err)
	}
} else {
	// If the index does not support synonym indexing, raise an error.
	panic("expected synonym index")
}
```

## Querying

```go
// Query the index created above.
// Create a match query for the term "persistent".
query := bleve.NewMatchQuery("persistent")

// Specify the field to search within, in this case, the "text" field.
query.SetField("text")

// Create a search request with the query and enable explanation to understand how results are scored.
searchRequest := bleve.NewSearchRequest(query)
searchRequest.Explain = true

// Execute the search on the index.
searchResult, err := index.Search(searchRequest)
if err != nil {
	// Handle any errors that occur during the search.
	panic(err)
}

// The search result will contain one match: "doc1". This document includes the term "hardworking",
// which is a synonym for the queried term "persistent". The synonym relationship is based on
// the user-defined thesaurus associated with the index.
// Print the search results, which will include the explanation for the match.
fmt.Println(searchResult)
```
docs/vectors.md (new file, +149)

@@ -0,0 +1,149 @@
# Nearest neighbor (vector) search

* *v2.4.0* (and after) comes with support for **vectors' indexing and search**.
* We've achieved this by embedding [FAISS](https://github.com/facebookresearch/faiss) indexes within our bleve (scorch) indexes.
* Introduction of a new zap file format: [v16](https://github.com/blevesearch/zapx/blob/master/zap.md) - which is the default going forward. Here we co-locate text and vector indexes as neighbors within segments, continuing to conform to the segmented architecture of *scorch*.

## Pre-requisite(s)

* Induction of [FAISS](https://github.com/blevesearch/faiss) into our ecosystem, which is a fork of the original [facebookresearch/faiss](https://github.com/facebookresearch/faiss).
* FAISS is a C++ library that needs to be compiled, and its shared libraries need to be situated at an accessible path for your application.
* A `vectors` Go build tag needs to be set for bleve to access all the supporting code. This tag must be set only after the FAISS shared library is made available. Failure to do either will inhibit you from using this feature.
* Please follow the [instructions](#setup-instructions) below for any assistance in this area.
* Releases of `blevesearch/bleve` work with select checkpoints of `blevesearch/faiss` owing to API changes and improvements (tracking over the `bleve` branch):

| bleve version(s) | blevesearch/faiss version |
| --- | --- |
| `v2.4.0` | [blevesearch/faiss@7b119f4](https://github.com/blevesearch/faiss/tree/7b119f4b9c408989b696b36f8cc53908e53de6db) (modified v1.7.4) |
| `v2.4.1`, `v2.4.2` | [blevesearch/faiss@d9db66a](https://github.com/blevesearch/faiss/tree/d9db66a38518d99eb334218697e1df0732f3fdf8) (modified v1.7.4) |
| `v2.4.3`, `v2.4.4` | [blevesearch/faiss@b747c55](https://github.com/blevesearch/faiss/tree/b747c55a93a9627039c34d44b081f375dca94e57) (modified v1.8.0) |
| `v2.5.0`, `v2.5.1` | [blevesearch/faiss@352484e](https://github.com/blevesearch/faiss/tree/352484e0fc9d1f8f46737841efe5f26e0f383f71) (modified v1.10.0) |

## Supported

* The `vector` field type is an array that holds float32 values only.
* The `vector_base64` field type supports base64-encoded strings using little-endian byte ordering (v2.4.1+); see the encoding sketch after the Indexing section.
* Supported similarity metrics are: [`"cosine"` (v2.4.3+), `"dot_product"`, `"l2_norm"`].
* `cosine` paths will additionally normalize vectors before indexing and search.
* Supported dimensionality is between 1 and 2048 (v2.4.0), and up to **4096** (v2.4.1+).
* Supported vector index optimizations: `latency`, `memory_efficient` (v2.4.1+), `recall`.
* Vectors from documents that do not conform to the index mapping dimensionality are simply discarded at index time.
* The dimensionality of the query vector must match the dimensionality of the indexed vectors to obtain any results.
* Pure kNN searches can be performed, but the `query` attribute within the search request must be set - to `{"match_none": {}}` in this case. The `query` attribute is made optional when `knn` is available with v2.4.1+.
* Hybrid searches are supported, where results from `query` are unioned (for now) with results from `knn`. The tf-idf scores from exact searches are simply summed with the similarity distances to determine the aggregate scores (see the hybrid sketch after this list):
```
aggregate_score = (query_boost * query_hit_score) + (knn_boost * knn_hit_distance)
```
* Multi kNN searches are supported - the `knn` object within the search request accepts an array of requests. These sub-objects are unioned by default, but this behavior can be overridden by setting `knn_operator` to `"and"`.
* Previously supported pagination settings will work as they were, with size/limit being applied over the top-K hits combined with any exact search hits.
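
As referenced above, a minimal sketch of a hybrid search combining a text query with a kNN clause (index layout as in the Indexing section below; vectors illustrative):

```go
textQuery := bleve.NewMatchQuery("united states")
textQuery.SetField("text")

searchRequest := bleve.NewSearchRequest(textQuery)
// kNN results are unioned with the text query's results; scores combine
// per the aggregate_score formula above (both boosts 1.0 here, k = 5).
searchRequest.AddKNN("vec", []float32{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}, 5, 1.0)

searchResult, err := index.Search(searchRequest)
if err != nil {
	panic(err)
}
fmt.Println(searchResult.Hits)
```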

## Indexing

```go
doc := struct {
	Id   string    `json:"id"`
	Text string    `json:"text"`
	Vec  []float32 `json:"vec"`
}{
	Id:   "example",
	Text: "hello from united states",
	Vec:  []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}

textFieldMapping := mapping.NewTextFieldMapping()
vectorFieldMapping := mapping.NewVectorFieldMapping()
vectorFieldMapping.Dims = 10
vectorFieldMapping.Similarity = "l2_norm" // euclidean distance

bleveMapping := bleve.NewIndexMapping()
bleveMapping.DefaultMapping.Dynamic = false
bleveMapping.DefaultMapping.AddFieldMappingsAt("text", textFieldMapping)
bleveMapping.DefaultMapping.AddFieldMappingsAt("vec", vectorFieldMapping)

index, err := bleve.New("example.bleve", bleveMapping)
if err != nil {
	panic(err)
}
index.Index(doc.Id, doc)
```
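
If your vectors arrive base64-encoded (the `vector_base64` field type mentioned under Supported), here is a minimal sketch of producing such a string from a `[]float32`, assuming the documented little-endian byte ordering:

```go
import (
	"bytes"
	"encoding/base64"
	"encoding/binary"
)

// encodeVectorBase64 packs the float32s as little-endian bytes and base64-encodes them.
func encodeVectorBase64(vec []float32) (string, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, vec); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}
```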

## Querying

```go
searchRequest := NewSearchRequest(query.NewMatchNoneQuery())
searchRequest.AddKNN(
	"vec", // vector field name
	[]float32{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}, // query vector (same dims)
	5, // k
	0, // boost
)
searchResult, err := index.Search(searchRequest)
if err != nil {
	panic(err)
}
fmt.Println(searchResult.Hits)
```

## Querying with filters (v2.4.3+)

```go
searchRequest := NewSearchRequest(query.NewMatchNoneQuery())
filterQuery := NewTermQuery("hello")
searchRequest.AddKNNWithFilter(
	"vec", // vector field name
	[]float32{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}, // query vector (same dims)
	5,           // k
	0,           // boost
	filterQuery, // filter query
)
searchResult, err := index.Search(searchRequest)
if err != nil {
	panic(err)
}
fmt.Println(searchResult.Hits)
```

## Setup Instructions

* Using `cmake` is the approach recommended by the FAISS authors.
* More details here - [faiss/INSTALL](https://github.com/blevesearch/faiss/blob/main/INSTALL.md).

### Linux

Also documented here - [go-faiss/README](https://github.com/blevesearch/go-faiss/blob/master/README.md).

```
git clone https://github.com/blevesearch/faiss.git
cd faiss
cmake -B build -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_C_API=ON -DBUILD_SHARED_LIBS=ON .
make -C build
sudo make -C build install
```

Building will produce the dynamic library `faiss_c`. You will need to install it in a place where your system will find it (e.g. /usr/local/lib). You can do this with:

```
sudo cp build/c_api/libfaiss_c.so /usr/local/lib
```

### OSX

While you shouldn't need to do anything different on macOS x86_64, with aarch64 some instructions need adjusting (see [facebookresearch/faiss#2111](https://github.com/facebookresearch/faiss/issues/2111)):

```
LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -B build -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_C_API=ON -DBUILD_SHARED_LIBS=ON -DFAISS_ENABLE_PYTHON=OFF .
make -C build
sudo make -C build install
sudo cp build/c_api/libfaiss_c.dylib /usr/local/lib
```

### Sanity check

Once the supporting library is built and made available, a sanity run is recommended to make sure all unit tests, especially those exercising the vectors' code, pass. Here's how:

```
export DYLD_LIBRARY_PATH=/usr/local/lib
go test -v ./... --tags=vectors
```
-or-
```
go test -ldflags "-r /usr/local/lib" ./... -tags=vectors
```