
Adding upstream version 2.5.1.

Signed-off-by: Daniel Baumann <daniel@debian.org>
Daniel Baumann 2025-05-19 00:20:02 +02:00
parent c71cb8b61d
commit 982828099e
Signed by: daniel
GPG key ID: FBB4F0E80A80222F
783 changed files with 150650 additions and 0 deletions

24
.github/workflows/tests.yml vendored Normal file

@ -0,0 +1,24 @@
on:
  push:
    branches:
      - master
  pull_request:
name: Tests
jobs:
  test:
    strategy:
      matrix:
        go-version: [1.22.x, 1.23.x, 1.24.x]
        platform: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.platform }}
    steps:
      - name: Install Go
        uses: actions/setup-go@v1
        with:
          go-version: ${{ matrix.go-version }}
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Test
        run: |
          go version
          go test -race ./...

20
.gitignore vendored Normal file

@ -0,0 +1,20 @@
#*
*.sublime-*
*~
.#*
.project
.settings
**/.idea/
**/*.iml
.DS_Store
query_string.y.go.tmp
/analysis/token_filters/cld2/cld2-read-only
/analysis/token_filters/cld2/libcld2_full.a
/cmd/bleve/bleve
vendor/**
!vendor/manifest
/y.output
/search/query/y.output
*.test
tags
go.sum

25
.travis.yml Normal file
View file

@ -0,0 +1,25 @@
sudo: false
language: go
go:
  - "1.21.x"
  - "1.22.x"
  - "1.23.x"
script:
  - go get golang.org/x/tools/cmd/cover
  - go get github.com/mattn/goveralls
  - go get github.com/kisielk/errcheck
  - go get -u github.com/FiloSottile/gvt
  - gvt restore
  - go test -race -v $(go list ./... | grep -v vendor/)
  - go vet $(go list ./... | grep -v vendor/)
  - go test ./test -v -indexType scorch
  - errcheck -ignorepkg fmt $(go list ./... | grep -v vendor/);
  - scripts/project-code-coverage.sh
  - scripts/build_children.sh
notifications:
  email:
    - fts-team@couchbase.com

16
CONTRIBUTING.md Normal file
View file

@ -0,0 +1,16 @@
# Contributing to Bleve
We look forward to your contributions, but ask that you first review these guidelines.
### Sign the CLA
As Bleve is a Couchbase project, we require that contributors accept the [Couchbase Contributor License Agreement](http://review.couchbase.org/static/individual_agreement.html). To sign this agreement, log into the Couchbase [code review tool](http://review.couchbase.org/). The Bleve project does not use this code review tool, but it is still used to track acceptance of the contributor license agreements.
### Submitting a Pull Request
All types of contributions are welcome, but please keep the following in mind:
- If you're planning a large change, discuss it in a GitHub issue or on the Google group first. This helps avoid duplicate effort and time spent on something that may not be merged.
- Existing tests should continue to pass; new tests for the contribution are nice to have.
- All code should have gone through `go fmt`
- All code should pass `go vet`

202
LICENSE Normal file
View file

@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

121
README.md Normal file
View file

@ -0,0 +1,121 @@
# ![bleve](docs/bleve.png) bleve
[![Tests](https://github.com/blevesearch/bleve/actions/workflows/tests.yml/badge.svg?branch=master&event=push)](https://github.com/blevesearch/bleve/actions/workflows/tests.yml?query=event%3Apush+branch%3Amaster)
[![Coverage Status](https://coveralls.io/repos/github/blevesearch/bleve/badge.svg?branch=master)](https://coveralls.io/github/blevesearch/bleve?branch=master)
[![Go Reference](https://pkg.go.dev/badge/github.com/blevesearch/bleve/v2.svg)](https://pkg.go.dev/github.com/blevesearch/bleve/v2)
[![Join the chat](https://badges.gitter.im/join_chat.svg)](https://app.gitter.im/#/room/#blevesearch_bleve:gitter.im)
[![codebeat](https://codebeat.co/badges/38a7cbc9-9cf5-41c0-a315-0746178230f4)](https://codebeat.co/projects/github-com-blevesearch-bleve)
[![Go Report Card](https://goreportcard.com/badge/github.com/blevesearch/bleve/v2)](https://goreportcard.com/report/github.com/blevesearch/bleve/v2)
[![Sourcegraph](https://sourcegraph.com/github.com/blevesearch/bleve/-/badge.svg)](https://sourcegraph.com/github.com/blevesearch/bleve?badge)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
A modern indexing + search library in Go
## Features
* Index any Go data structure or JSON
* Intelligent defaults backed up by powerful configuration ([scorch](https://github.com/blevesearch/bleve/blob/master/index/scorch/README.md))
* Supported field types:
    * `text`, `number`, `datetime`, `boolean`, `geopoint`, `geoshape`, `IP`, `vector`
* Supported query types:
    * `term`, `phrase`, `match`, `match_phrase`, `prefix`, `regexp`, `wildcard`, `fuzzy`
    * term range, numeric range, date range, boolean field
    * compound queries: `conjuncts`, `disjuncts`, boolean (`must`/`should`/`must_not`)
    * [query string syntax](http://www.blevesearch.com/docs/Query-String-Query/)
    * [geo spatial search](https://github.com/blevesearch/bleve/blob/master/geo/README.md)
    * approximate k-nearest neighbors via [vector search](https://github.com/blevesearch/bleve/blob/master/docs/vectors.md)
    * [synonym search](https://github.com/blevesearch/bleve/blob/master/docs/synonyms.md)
* [tf-idf](https://github.com/blevesearch/bleve/blob/master/docs/scoring.md#tf-idf) / [bm25](https://github.com/blevesearch/bleve/blob/master/docs/scoring.md#bm25) scoring models
* Hybrid search: exact + semantic
* Query time boosting
* Search result match highlighting with document fragments
* Aggregations/faceting support:
    * terms facet
    * numeric range facet
    * date range facet
## Indexing
```go
message := struct {
	Id   string
	From string
	Body string
}{
	Id:   "example",
	From: "xyz@couchbase.com",
	Body: "bleve indexing is easy",
}

mapping := bleve.NewIndexMapping()
index, err := bleve.New("example.bleve", mapping)
if err != nil {
	panic(err)
}
// Index returns an error; check it rather than discarding it.
if err := index.Index(message.Id, message); err != nil {
	panic(err)
}
```
## Querying
```go
index, _ := bleve.Open("example.bleve")
query := bleve.NewQueryStringQuery("bleve")
searchRequest := bleve.NewSearchRequest(query)
searchResult, _ := index.Search(searchRequest)
```
## Command Line Interface
To install the CLI for the latest release of bleve, run:
```bash
$ go install github.com/blevesearch/bleve/v2/cmd/bleve@latest
```
```
$ bleve --help
Bleve is a command-line tool to interact with a bleve index.

Usage:
  bleve [command]

Available Commands:
  bulk        bulk loads from newline delimited JSON files
  check       checks the contents of the index
  count       counts the number documents in the index
  create      creates a new index
  dictionary  prints the term dictionary for the specified field in the index
  dump        dumps the contents of the index
  fields      lists the fields in this index
  help        Help about any command
  index       adds the files to the index
  mapping     prints the mapping used for this index
  query       queries the index
  registry    registry lists the bleve components compiled into this executable
  scorch      command-line tool to interact with a scorch index

Flags:
  -h, --help   help for bleve

Use "bleve [command] --help" for more information about a command.
```
## Text Analysis
Bleve includes general-purpose analyzers (customizable) as well as pre-built text analyzers for the following languages:
Arabic (ar), Bulgarian (bg), Catalan (ca), Chinese-Japanese-Korean (cjk), Kurdish (ckb), Danish (da), German (de), Greek (el), English (en), Spanish - Castilian (es), Basque (eu), Persian (fa), Finnish (fi), French (fr), Gaelic (ga), Spanish - Galician (gl), Hindi (hi), Croatian (hr), Hungarian (hu), Armenian (hy), Indonesian (id, in), Italian (it), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Turkish (tr)
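
As a rough illustration of what an analyzer does, the sketch below uses only the Go standard library to mimic the shape of a simple pipeline: split text on non-letter characters, lowercase each token, then optionally drop stop words. The names `analyzeSimple` and `removeStopWords` are hypothetical, and this is not bleve's implementation; real analyzers are assembled from registered tokenizers and token filters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// analyzeSimple is a hypothetical stand-in for an analyzer pipeline:
// it splits on any non-letter rune, then lowercases each token.
func analyzeSimple(text string) []string {
	tokens := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	for i, tok := range tokens {
		tokens[i] = strings.ToLower(tok)
	}
	return tokens
}

// removeStopWords drops every token found in the caller-supplied stop list.
// The stop list here is an assumption for illustration, not bleve's list.
func removeStopWords(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if !stop[tok] {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	tokens := analyzeSimple("Bleve indexing is easy")
	fmt.Println(tokens) // [bleve indexing is easy]
	fmt.Println(removeStopWords(tokens, map[string]bool{"is": true})) // [bleve indexing easy]
}
```

Keeping tokenization and filtering as separate steps mirrors how the language analyzers listed above can differ mainly in their filters while sharing a tokenizer.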
## Text Analysis Wizard
[bleveanalysis.couchbase.com](https://bleveanalysis.couchbase.com)
## Discussion/Issues
Discuss usage/development of bleve and/or report issues here:
* [GitHub issues](https://github.com/blevesearch/bleve/issues)
* [Google group](https://groups.google.com/forum/#!forum/bleve)
## License
Apache License Version 2.0

15
SECURITY.md Normal file
View file

@ -0,0 +1,15 @@
# Security Policy
## Supported Versions
We support the latest release (for example, bleve v2.3.x).
## Reporting a Vulnerability
All security issues for this project should be reported by email to security@couchbase.com and fts-team@couchbase.com.
This mail will be delivered to the owners of this project.
- To ensure your report is NOT marked as spam, please include the word "security/vulnerability" along with the project name (blevesearch/bleve) in the subject of the email.
- Please be as descriptive as possible when explaining the issue; a test case highlighting the issue is always welcome.
Your email will be acknowledged as soon as possible.


@ -0,0 +1,148 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package custom

import (
	"fmt"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "custom"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {

	var err error
	var charFilters []analysis.CharFilter
	charFiltersValue, ok := config["char_filters"]
	if ok {
		switch charFiltersValue := charFiltersValue.(type) {
		case []string:
			charFilters, err = getCharFilters(charFiltersValue, cache)
			if err != nil {
				return nil, err
			}
		case []interface{}:
			charFiltersNames, err := convertInterfaceSliceToStringSlice(charFiltersValue, "char filter")
			if err != nil {
				return nil, err
			}
			charFilters, err = getCharFilters(charFiltersNames, cache)
			if err != nil {
				return nil, err
			}
		default:
			return nil, fmt.Errorf("unsupported type for char_filters, must be slice")
		}
	}

	var tokenizerName string
	tokenizerValue, ok := config["tokenizer"]
	if ok {
		tokenizerName, ok = tokenizerValue.(string)
		if !ok {
			return nil, fmt.Errorf("must specify tokenizer as string")
		}
	} else {
		return nil, fmt.Errorf("must specify tokenizer")
	}

	tokenizer, err := cache.TokenizerNamed(tokenizerName)
	if err != nil {
		return nil, err
	}

	var tokenFilters []analysis.TokenFilter
	tokenFiltersValue, ok := config["token_filters"]
	if ok {
		switch tokenFiltersValue := tokenFiltersValue.(type) {
		case []string:
			tokenFilters, err = getTokenFilters(tokenFiltersValue, cache)
			if err != nil {
				return nil, err
			}
		case []interface{}:
			tokenFiltersNames, err := convertInterfaceSliceToStringSlice(tokenFiltersValue, "token filter")
			if err != nil {
				return nil, err
			}
			tokenFilters, err = getTokenFilters(tokenFiltersNames, cache)
			if err != nil {
				return nil, err
			}
		default:
			return nil, fmt.Errorf("unsupported type for token_filters, must be slice")
		}
	}

	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
	}
	if charFilters != nil {
		rv.CharFilters = charFilters
	}
	if tokenFilters != nil {
		rv.TokenFilters = tokenFilters
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}

func getCharFilters(charFilterNames []string, cache *registry.Cache) ([]analysis.CharFilter, error) {
	charFilters := make([]analysis.CharFilter, len(charFilterNames))
	for i, charFilterName := range charFilterNames {
		charFilter, err := cache.CharFilterNamed(charFilterName)
		if err != nil {
			return nil, err
		}
		charFilters[i] = charFilter
	}

	return charFilters, nil
}

func getTokenFilters(tokenFilterNames []string, cache *registry.Cache) ([]analysis.TokenFilter, error) {
	tokenFilters := make([]analysis.TokenFilter, len(tokenFilterNames))
	for i, tokenFilterName := range tokenFilterNames {
		tokenFilter, err := cache.TokenFilterNamed(tokenFilterName)
		if err != nil {
			return nil, err
		}
		tokenFilters[i] = tokenFilter
	}

	return tokenFilters, nil
}

func convertInterfaceSliceToStringSlice(interfaceSlice []interface{}, objType string) ([]string, error) {
	stringSlice := make([]string, len(interfaceSlice))
	for i, interfaceObj := range interfaceSlice {
		stringObj, ok := interfaceObj.(string)
		if ok {
			stringSlice[i] = stringObj
		} else {
			// Use a constant format string so objType is never interpreted as format verbs.
			return nil, fmt.Errorf("%s name must be a string", objType)
		}
	}

	return stringSlice, nil
}
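
The constructor above accepts both `[]string` and `[]interface{}` for its filter lists. A likely reason is that a config decoded from JSON with `encoding/json` into a `map[string]interface{}` represents arrays as `[]interface{}`. The standalone sketch below shows that behaviour using only the standard library; `namesFromConfig` is a hypothetical helper that mirrors the type switch, and the names in the JSON are illustrative strings that are not resolved against any registry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// namesFromConfig is a hypothetical helper that mirrors the type switch in
// the constructor: it accepts []string directly and converts []interface{}
// element by element, rejecting anything else.
func namesFromConfig(config map[string]interface{}, key string) ([]string, error) {
	value, ok := config[key]
	if !ok {
		return nil, nil // key is optional in this sketch
	}
	switch v := value.(type) {
	case []string:
		return v, nil
	case []interface{}:
		names := make([]string, len(v))
		for i, item := range v {
			s, ok := item.(string)
			if !ok {
				return nil, fmt.Errorf("%s: entry %d is not a string", key, i)
			}
			names[i] = s
		}
		return names, nil
	default:
		return nil, fmt.Errorf("unsupported type for %s, must be slice", key)
	}
}

func main() {
	raw := []byte(`{"tokenizer": "unicode", "token_filters": ["to_lower"]}`)

	var config map[string]interface{}
	if err := json.Unmarshal(raw, &config); err != nil {
		panic(err)
	}

	// The JSON array arrives as []interface{}, not []string.
	names, err := namesFromConfig(config, "token_filters")
	fmt.Println(names, err) // [to_lower] <nil>
}
```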


@ -0,0 +1,41 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package keyword

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/single"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "keyword"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	keywordTokenizer, err := cache.TokenizerNamed(single.Name)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: keywordTokenizer,
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}


@ -0,0 +1,49 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package simple

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/letter"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "simple"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(letter.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}


@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package standard

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/lang/en"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "standard"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(unicode.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	stopEnFilter, err := cache.TokenFilterNamed(en.StopName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
			stopEnFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}


@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package web

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/lang/en"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/web"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "web"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(web.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	stopEnFilter, err := cache.TokenFilterNamed(en.StopName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
			stopEnFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}

117
analysis/benchmark_test.go Normal file
View file

@ -0,0 +1,117 @@
// Copyright (c) 2015 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis_test

import (
	index "github.com/blevesearch/bleve_index_api"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/standard"
	"github.com/blevesearch/bleve/v2/registry"
)

func BenchmarkAnalysis(b *testing.B) {
	for i := 0; i < b.N; i++ {

		cache := registry.NewCache()
		analyzer, err := cache.AnalyzerNamed(standard.Name)
		if err != nil {
			b.Fatal(err)
		}

		ts := analyzer.Analyze(bleveWikiArticle)
		freqs := analysis.TokenFrequency(ts, nil, index.IncludeTermVectors)
		if len(freqs) != 511 {
			b.Errorf("expected %d freqs, got %d", 511, len(freqs))
		}
	}
}

var bleveWikiArticle = []byte(`Boiling liquid expanding vapor explosion
From Wikipedia, the free encyclopedia
See also: Boiler explosion and Steam explosion
Flames subsequent to a flammable liquid BLEVE from a tanker. BLEVEs do not necessarily involve fire.
This article's tone or style may not reflect the encyclopedic tone used on Wikipedia. See Wikipedia's guide to writing better articles for suggestions. (July 2013)
A boiling liquid expanding vapor explosion (BLEVE, /ˈblɛviː/ blev-ee) is an explosion caused by the rupture of a vessel containing a pressurized liquid above its boiling point.[1]
Contents [hide]
1 Mechanism
1.1 Water example
1.2 BLEVEs without chemical reactions
2 Fires
3 Incidents
4 Safety measures
5 See also
6 References
7 External links
Mechanism[edit]
This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2013)
There are three characteristics of liquids which are relevant to the discussion of a BLEVE:
If a liquid in a sealed container is boiled, the pressure inside the container increases. As the liquid changes to a gas it expands - this expansion in a vented container would cause the gas and liquid to take up more space. In a sealed container the gas and liquid are not able to take up more space and so the pressure rises. Pressurized vessels containing liquids can reach an equilibrium where the liquid stops boiling and the pressure stops rising. This occurs when no more heat is being added to the system (either because it has reached ambient temperature or has had a heat source removed).
The boiling temperature of a liquid is dependent on pressure - high pressures will yield high boiling temperatures, and low pressures will yield low boiling temperatures. A common simple experiment is to place a cup of water in a vacuum chamber, and then reduce the pressure in the chamber until the water boils. By reducing the pressure the water will boil even at room temperature. This works both ways - if the pressure is increased beyond normal atmospheric pressures, the boiling of hot water could be suppressed far beyond normal temperatures. The cooling system of a modern internal combustion engine is a real-world example.
When a liquid boils it turns into a gas. The resulting gas takes up far more space than the liquid did.
Typically, a BLEVE starts with a container of liquid which is held above its normal, atmospheric-pressure boiling temperature. Many substances normally stored as liquids, such as CO2, propane, and other similar industrial gases have boiling temperatures, at atmospheric pressure, far below room temperature. In the case of water, a BLEVE could occur if a pressurized chamber of water is heated far beyond the standard 100 °C (212 °F). That container, because the boiling water pressurizes it, is capable of holding liquid water at very high temperatures.
If the pressurized vessel, containing liquid at high temperature (which may be room temperature, depending on the substance) ruptures, the pressure which prevents the liquid from boiling is lost. If the rupture is catastrophic, where the vessel is immediately incapable of holding any pressure at all, then there suddenly exists a large mass of liquid which is at very high temperature and very low pressure. This causes the entire volume of liquid to instantaneously boil, which in turn causes an extremely rapid expansion. Depending on temperatures, pressures and the substance involved, that expansion may be so rapid that it can be classified as an explosion, fully capable of inflicting severe damage on its surroundings.
Water example[edit]
Imagine, for example, a tank of pressurized liquid water held at 204.4 °C (400 °F). This tank would normally be pressurized to 1.7 MPa (250 psi) above atmospheric ("gauge") pressure. If the tank containing the water were to rupture, there would for a slight moment exist a volume of liquid water which would be
at atmospheric pressure, and
204.4 °C (400 °F).
At atmospheric pressure the boiling point of water is 100 °C (212 °F) - liquid water at atmospheric pressure cannot exist at temperatures higher than 100 °C (212 °F). At that moment, the water would boil and turn to vapour explosively, and the 204.4 °C (400 °F) liquid water turned to gas would take up a lot more volume than it did as liquid, causing a vapour explosion. Such explosions can happen when the superheated water of a steam engine escapes through a crack in a boiler, causing a boiler explosion.
BLEVEs without chemical reactions[edit]
It is important to note that a BLEVE need not be a chemical explosion, nor does there need to be a fire; however, if a flammable substance is subject to a BLEVE it may also be subject to intense heating, either from an external source of heat which may have caused the vessel to rupture in the first place or from an internal source of localized heating such as skin friction. This heating can cause a flammable substance to ignite, adding a secondary explosion caused by the primary BLEVE. While blast effects of any BLEVE can be devastating, a flammable substance such as propane can add significantly to the danger.
Bleve explosion.svg
While the term BLEVE is most often used to describe the results of a container of flammable liquid rupturing due to fire, a BLEVE can occur even with a non-flammable substance such as water,[2] liquid nitrogen,[3] liquid helium or other refrigerants or cryogens, and therefore is not usually considered a type of chemical explosion.
Fires[edit]
BLEVEs can be caused by an external fire near the storage vessel causing heating of the contents and pressure build-up. While tanks are often designed to withstand great pressure, constant heating can cause the metal to weaken and eventually fail. If the tank is being heated in an area where there is no liquid, it may rupture faster without the liquid to absorb the heat. Gas containers are usually equipped with relief valves that vent off excess pressure, but the tank can still fail if the pressure is not released quickly enough.[1] Relief valves are sized to release pressure fast enough to prevent the pressure from increasing beyond the strength of the vessel, but not so fast as to be the cause of an explosion. An appropriately sized relief valve will allow the liquid inside to boil slowly, maintaining a constant pressure in the vessel until all the liquid has boiled and the vessel empties.
If the substance involved is flammable, it is likely that the resulting cloud of the substance will ignite after the BLEVE has occurred, forming a fireball and possibly a fuel-air explosion, also termed a vapor cloud explosion (VCE). If the materials are toxic, a large area will be contaminated.[4]
Incidents[edit]
The term "BLEVE" was coined by three researchers at Factory Mutual, in the analysis of an accident there in 1957 involving a chemical reactor vessel.[5]
In August 1959 the Kansas City Fire Department suffered its largest ever loss of life in the line of duty, when a 25,000 gallon (95,000 litre) gas tank exploded during a fire on Southwest Boulevard killing five firefighters. This was the first time BLEVE was used to describe a burning fuel tank.[citation needed]
Later incidents included the Cheapside Street Whisky Bond Fire in Glasgow, Scotland in 1960; Feyzin, France in 1966; Crescent City, Illinois in 1970; Kingman, Arizona in 1973; a liquid nitrogen tank rupture[6] at Air Products and Chemicals and Mobay Chemical Company at New Martinsville, West Virginia on January 31, 1978 [1];Texas City, Texas in 1978; Murdock, Illinois in 1983; San Juan Ixhuatepec, Mexico City in 1984; and Toronto, Ontario in 2008.
Safety measures[edit]
[icon] This section requires expansion. (July 2013)
Some fire mitigation measures are listed under liquefied petroleum gas.
See also[edit]
Boiler explosion
Expansion ratio
Explosive boiling or phase explosion
Rapid phase transition
Viareggio train derailment
2008 Toronto explosions
Gas carriers
Los Alfaques Disaster
Lac-Mégantic derailment
References[edit]
^ Jump up to: a b Kletz, Trevor (March 1990). Critical Aspects of Safety and Loss Prevention. London: Butterworth–Heinemann. pp. 43–45. ISBN 0-408-04429-2.
Jump up ^ "Temperature Pressure Relief Valves on Water Heaters: test, inspect, replace, repair guide". Inspect-ny.com. Retrieved 2011-07-12.
Jump up ^ Liquid nitrogen BLEVE demo
Jump up ^ "Chemical Process Safety" (PDF). Retrieved 2011-07-12.
Jump up ^ David F. Peterson, BLEVE: Facts, Risk Factors, and Fallacies, Fire Engineering magazine (2002).
Jump up ^ "STATE EX REL. VAPOR CORP. v. NARICK". Supreme Court of Appeals of West Virginia. 1984-07-12. Retrieved 2014-03-16.
External links[edit]
Look up boiling liquid expanding vapor explosion in Wiktionary, the free dictionary.
Wikimedia Commons has media related to BLEVE.
BLEVE Demo on YouTube video of a controlled BLEVE demo
huge explosions on YouTube video of propane and isobutane BLEVEs from a train derailment at Murdock, Illinois (3 September 1983)
Propane BLEVE on YouTube video of BLEVE from the Toronto propane depot fire
Moscow Ring Road Accident on YouTube - Dozens of LPG tank BLEVEs after a road accident in Moscow
Kingman, AZ BLEVE An account of the 5 July 1973 explosion in Kingman, with photographs
Propane Tank Explosions Description of circumstances required to cause a propane tank BLEVE.
Analysis of BLEVE Events at DOE Sites - Details physics and mathematics of BLEVEs.
HID - SAFETY REPORT ASSESSMENT GUIDE: Whisky Maturation Warehouses - The liquor is aged in wooden barrels that can suffer BLEVE.
Categories: ExplosivesFirefightingFireTypes of fireGas technologiesIndustrial fires and explosions`)

File diff suppressed because it is too large

View file

@ -0,0 +1,124 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package asciifolding
import (
"fmt"
"reflect"
"testing"
)
func TestAsciiFoldingFilter(t *testing.T) {
tests := []struct {
input []byte
output []byte
}{
{
// empty input passes
input: []byte(``),
output: []byte(``),
},
{
// no modification for plain ASCII
input: []byte(`The quick brown fox jumps over the lazy dog`),
output: []byte(`The quick brown fox jumps over the lazy dog`),
},
{
// Umlauts are folded to plain ASCII
input: []byte(`The quick bröwn fox jümps over the läzy dog`),
output: []byte(`The quick brown fox jumps over the lazy dog`),
},
{
// composite unicode runes are folded to more than one ASCII rune
input: []byte(`ÆꜴ`),
output: []byte(`AEAO`),
},
{
// apples from https://issues.couchbase.com/browse/MB-33486
input: []byte(`Ápple Àpple Äpple Âpple Ãpple Åpple`),
output: []byte(`Apple Apple Apple Apple Apple Apple`),
},
{
// Fix ASCII folding of \u24A2
input: []byte(`⒢`),
output: []byte(`(g)`),
},
{
// Test folding of \u2053 (SWUNG DASH)
input: []byte(`a⁓b`),
output: []byte(`a~b`),
},
{
// Test folding of \uFF5E (FULLWIDTH TILDE)
input: []byte(`c～d`),
output: []byte(`c~d`),
},
{
// Test folding of \uFF3F (FULLWIDTH LOW LINE) - case before tilde
input: []byte(`e＿f`),
output: []byte(`e_f`),
},
{
// Test mix including tilde and default fallthrough (using a character not explicitly folded)
input: []byte(`a⁓b✅c～d`),
output: []byte(`a~b✅c~d`),
},
{
// Test start of 'A' fallthrough block
input: []byte(`ÀBC`),
output: []byte(`ABC`),
},
{
// Test end of 'A' fallthrough block
input: []byte(`DEFẶ`),
output: []byte(`DEFA`),
},
{
// Test start of 'AE' fallthrough block
input: []byte(`Æ`),
output: []byte(`AE`),
},
{
// Test end of 'AE' fallthrough block
input: []byte(`ᴁ`),
output: []byte(`AE`),
},
{
// Test 'DZ' multi-rune output
input: []byte(`DŽebra`),
output: []byte(`DZebra`),
},
{
// Test start of 'a' fallthrough block
input: []byte(`àbc`),
output: []byte(`abc`),
},
{
// Test end of 'a' fallthrough block
input: []byte(`defａ`),
output: []byte(`defa`),
},
}
for _, test := range tests {
filter := New()
t.Run(fmt.Sprintf("on %s", test.input), func(t *testing.T) {
output := filter.Filter(test.input)
if !reflect.DeepEqual(output, test.output) {
t.Errorf("\nExpected:\n`%s`\ngot:\n`%s`\n", string(test.output), string(output))
}
})
}
}

View file

@ -0,0 +1,57 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package html
import (
"bytes"
"regexp"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "html"
var htmlCharFilterRegexp = regexp.MustCompile(`</?[!\w]+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>`)
type CharFilter struct {
r *regexp.Regexp
replacement []byte
}
func New() *CharFilter {
return &CharFilter{
r: htmlCharFilterRegexp,
replacement: []byte(" "),
}
}
func (s *CharFilter) Filter(input []byte) []byte {
return s.r.ReplaceAllFunc(
input, func(in []byte) []byte {
return bytes.Repeat(s.replacement, len(in))
})
}
func CharFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.CharFilter, error) {
return New(), nil
}
func init() {
err := registry.RegisterCharFilter(Name, CharFilterConstructor)
if err != nil {
panic(err)
}
}
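
The `Filter` method above replaces each matched tag with as many copies of the replacement as the match has bytes, which keeps the output the same length as the input so that byte offsets computed later still point at the right place in the original text. A minimal standalone sketch of that idea, using only the standard library (the function name `blankTags` is an assumption made for illustration and is not part of bleve):

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// tagPattern is copied from the html char filter above.
var tagPattern = regexp.MustCompile(`</?[!\w]+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>`)

// blankTags is a hypothetical helper: it overwrites every matched tag with
// len(match) spaces, so the result has the same byte length as the input.
func blankTags(input []byte) []byte {
	return tagPattern.ReplaceAllFunc(input, func(match []byte) []byte {
		return bytes.Repeat([]byte(" "), len(match))
	})
}

func main() {
	in := []byte(`<b>bold</b> text`)
	out := blankTags(in)
	fmt.Printf("%q\n", out)
	// The word "text" starts at the same offset before and after filtering.
	fmt.Println(len(in) == len(out),
		bytes.Index(in, []byte("text")) == bytes.Index(out, []byte("text")))
}
```

By contrast, the regexp char filter in the next file substitutes a fixed replacement, so its output length can differ from the input length.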

View file

@ -0,0 +1,65 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package regexp
import (
"fmt"
"regexp"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "regexp"
type CharFilter struct {
r *regexp.Regexp
replacement []byte
}
func New(r *regexp.Regexp, replacement []byte) *CharFilter {
return &CharFilter{
r: r,
replacement: replacement,
}
}
func (s *CharFilter) Filter(input []byte) []byte {
return s.r.ReplaceAll(input, s.replacement)
}
func CharFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.CharFilter, error) {
regexpStr, ok := config["regexp"].(string)
if !ok {
return nil, fmt.Errorf("must specify regexp")
}
r, err := regexp.Compile(regexpStr)
if err != nil {
return nil, fmt.Errorf("unable to build regexp char filter: %v", err)
}
replaceBytes := []byte(" ")
replaceStr, ok := config["replace"].(string)
if ok {
replaceBytes = []byte(replaceStr)
}
return New(r, replaceBytes), nil
}
func init() {
err := registry.RegisterCharFilter(Name, CharFilterConstructor)
if err != nil {
panic(err)
}
}
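
The constructor above requires a `regexp` entry in the config and falls back to a single space when `replace` is absent. The following is a rough standalone sketch of the same flow using only the standard library; `newReplacer` is a hypothetical name, and it returns a plain function rather than bleve's `analysis.CharFilter`:

```go
package main

import (
	"fmt"
	"regexp"
)

// newReplacer is a hypothetical stand-in for CharFilterConstructor:
// "regexp" is required, "replace" is optional and defaults to one space.
func newReplacer(config map[string]interface{}) (func([]byte) []byte, error) {
	pattern, ok := config["regexp"].(string)
	if !ok {
		return nil, fmt.Errorf("must specify regexp")
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("unable to build regexp char filter: %v", err)
	}
	replacement := []byte(" ")
	if s, ok := config["replace"].(string); ok {
		replacement = []byte(s)
	}
	return func(input []byte) []byte {
		// ReplaceAll expands $1, $2, ... in the replacement text.
		return re.ReplaceAll(input, replacement)
	}, nil
}

func main() {
	filter, err := newReplacer(map[string]interface{}{
		"regexp":  `([a-z])\s+(\d)`,
		"replace": "$1-$2",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", filter([]byte("temp 1")))
}
```

Because `ReplaceAll` treats `$` specially, a replacement that needs a literal dollar sign has to write it as `$$`.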

View file

@ -0,0 +1,88 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package regexp
import (
"fmt"
"reflect"
"regexp"
"testing"
)
func TestRegexpCharFilter(t *testing.T) {
tests := []struct {
regexStr string
replace []byte
input []byte
output []byte
}{
{
regexStr: `</?[!\w]+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>`,
replace: []byte{' '},
input: []byte(`<html>test</html>`),
output: []byte(` test `),
},
{
regexStr: `\x{200C}`,
replace: []byte{' '},
input: []byte("water\u200Cunder\u200Cthe\u200Cbridge"),
output: []byte("water under the bridge"),
},
{
regexStr: `([a-z])\s+(\d)`,
replace: []byte(`$1-$2`),
input: []byte(`temp 1`),
output: []byte(`temp-1`),
},
{
regexStr: `foo.?`,
replace: []byte(`X`),
input: []byte(`seafood, fool`),
output: []byte(`seaX, X`),
},
{
regexStr: `def`,
replace: []byte(`_`),
input: []byte(`abcdefghi`),
output: []byte(`abc_ghi`),
},
{
regexStr: `456`,
replace: []byte(`000000`),
input: []byte(`123456789`),
output: []byte(`123000000789`),
},
{
regexStr: `“|”`,
replace: []byte(`"`),
input: []byte(`“hello”`),
output: []byte(`"hello"`),
},
}
for _, test := range tests {
t.Run(fmt.Sprintf("match %s replace %s", test.regexStr, string(test.replace)), func(t *testing.T) {
regex := regexp.MustCompile(test.regexStr)
filter := New(regex, test.replace)
output := filter.Filter(test.input)
if !reflect.DeepEqual(test.output, output) {
t.Errorf("Expected: `%s`, Got: `%s`\n", string(test.output), string(output))
}
})
}
}

View file

@ -0,0 +1,39 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zerowidthnonjoiner
import (
"regexp"
"github.com/blevesearch/bleve/v2/analysis"
regexpCharFilter "github.com/blevesearch/bleve/v2/analysis/char/regexp"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "zero_width_spaces"
var zeroWidthNonJoinerRegexp = regexp.MustCompile(`\x{200C}`)
func CharFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.CharFilter, error) {
replaceBytes := []byte(" ")
return regexpCharFilter.New(zeroWidthNonJoinerRegexp, replaceBytes), nil
}
func init() {
err := registry.RegisterCharFilter(Name, CharFilterConstructor)
if err != nil {
panic(err)
}
}

View file

@ -0,0 +1,67 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package flexible
import (
"fmt"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "flexiblego"
type DateTimeParser struct {
layouts []string
}
func New(layouts []string) *DateTimeParser {
return &DateTimeParser{
layouts: layouts,
}
}
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
for _, layout := range p.layouts {
rv, err := time.Parse(layout, input)
if err == nil {
return rv, layout, nil
}
}
return time.Time{}, "", analysis.ErrInvalidDateTime
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
layouts, ok := config["layouts"].([]interface{})
if !ok {
return nil, fmt.Errorf("must specify layouts")
}
var layoutStrs []string
for _, layout := range layouts {
layoutStr, ok := layout.(string)
if ok {
layoutStrs = append(layoutStrs, layoutStr)
}
}
return New(layoutStrs), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}

View file

@ -0,0 +1,100 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package flexible
import (
"reflect"
"testing"
"time"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestFlexibleDateTimeParser(t *testing.T) {
testLocation := time.FixedZone("", -8*60*60)
rfc3339NoTimezone := "2006-01-02T15:04:05"
rfc3339NoTimezoneNoT := "2006-01-02 15:04:05"
rfc3339NoTime := "2006-01-02"
dateOptionalTimeParser := New(
[]string{
time.RFC3339Nano,
time.RFC3339,
rfc3339NoTimezone,
rfc3339NoTimezoneNoT,
rfc3339NoTime,
})
tests := []struct {
input string
expectedTime time.Time
expectedLayout string
expectedError error
}{
{
input: "2014-08-03",
expectedTime: time.Date(2014, 8, 3, 0, 0, 0, 0, time.UTC),
expectedLayout: rfc3339NoTime,
expectedError: nil,
},
{
input: "2014-08-03T15:59:30",
expectedTime: time.Date(2014, 8, 3, 15, 59, 30, 0, time.UTC),
expectedLayout: rfc3339NoTimezone,
expectedError: nil,
},
{
input: "2014-08-03 15:59:30",
expectedTime: time.Date(2014, 8, 3, 15, 59, 30, 0, time.UTC),
expectedLayout: rfc3339NoTimezoneNoT,
expectedError: nil,
},
{
input: "2014-08-03T15:59:30-08:00",
expectedTime: time.Date(2014, 8, 3, 15, 59, 30, 0, testLocation),
expectedLayout: time.RFC3339Nano,
expectedError: nil,
},
{
input: "2014-08-03T15:59:30.999999999-08:00",
expectedTime: time.Date(2014, 8, 3, 15, 59, 30, 999999999, testLocation),
expectedLayout: time.RFC3339Nano,
expectedError: nil,
},
{
input: "not a date time",
expectedTime: time.Time{},
expectedLayout: "",
expectedError: analysis.ErrInvalidDateTime,
},
}
for _, test := range tests {
t.Run(test.input, func(t *testing.T) {
actualTime, actualLayout, actualErr := dateOptionalTimeParser.ParseDateTime(test.input)
if actualErr != test.expectedError {
t.Fatalf("expected error %#v, got %#v", test.expectedError, actualErr)
}
if !reflect.DeepEqual(actualTime, test.expectedTime) {
t.Errorf("expected time %v, got %v", test.expectedTime, actualTime)
}
if !reflect.DeepEqual(actualLayout, test.expectedLayout) {
t.Errorf("expected layout %v, got %v", test.expectedLayout, actualLayout)
}
})
}
}

View file

@ -0,0 +1,250 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package iso
import (
"fmt"
"strings"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "isostyle"
var textLiteralDelimiter byte = '\'' // single quote
// ISO style format strings use the pattern letters documented at
// https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html
//
// Some format specifiers have no equivalent in Go's time package, such as:
// - 'V' for timezone name, like 'Europe/Berlin' or 'America/New_York'.
// - 'Q' for quarter of year, like Q3 or 3rd Quarter.
// - 'zzzz' for full name of timezone like "Japan Standard Time" or "Eastern Standard Time".
// - 'O' for localized zone-offset, like GMT+8 or GMT+08:00.
// - '[]' for optional section of the format.
// - 'G' for era, like AD or BC.
// - 'W' for week of month.
// - 'D' for day of year.
// So date strings with these date elements cannot be parsed.
var timeElementToLayout = map[byte]map[int]string{
'M': {
4: "January", // MMMM = full month name
3: "Jan", // MMM = short month name
2: "01", // MM = month of year (2 digits) (01-12)
1: "1", // M = month of year (1 digit) (1-12)
},
'd': {
2: "02", // dd = day of month (2 digits) (01-31)
1: "2", // d = day of month (1 digit) (1-31)
},
'a': {
2: "pm", // aa = pm/am
1: "PM", // a = PM/AM
},
'H': {
2: "15", // HH = hour (24 hour clock) (2 digits)
1: "15", // H = hour (24 hour clock); Go has no unpadded 24-hour layout, but "15" also parses a single digit
},
'm': {
2: "04", // mm = minute (2 digits)
1: "4", // m = minute (1 digit)
},
's': {
2: "05", // ss = seconds (2 digits)
1: "5", // s = seconds (1 digit)
},
// timezone offsets from UTC below
'X': {
5: "Z07:00:00", // XXXXX = timezone offset (+-hh:mm:ss)
4: "Z070000", // XXXX = timezone offset (+-hhmmss)
3: "Z07:00", // XXX = timezone offset (+-hh:mm)
2: "Z0700", // XX = timezone offset (+-hhmm)
1: "Z07", // X = timezone offset (+-hh)
},
'x': {
5: "-07:00:00", // xxxxx = timezone offset (+-hh:mm:ss)
4: "-070000", // xxxx = timezone offset (+-hhmmss)
3: "-07:00", // xxx = timezone offset (+-hh:mm)
2: "-0700", // xx = timezone offset (+-hhmm)
1: "-07", // x = timezone offset (+-hh)
},
}
type DateTimeParser struct {
layouts []string
}
func New(layouts []string) *DateTimeParser {
return &DateTimeParser{
layouts: layouts,
}
}
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
for _, layout := range p.layouts {
rv, err := time.Parse(layout, input)
if err == nil {
return rv, layout, nil
}
}
return time.Time{}, "", analysis.ErrInvalidDateTime
}
func letterCounter(layout string, idx int) int {
count := 1
for idx+count < len(layout) {
if layout[idx+count] == layout[idx] {
count++
} else {
break
}
}
return count
}
func invalidFormatError(character byte, count int) error {
return fmt.Errorf("invalid format string, unknown format specifier: %s", strings.Repeat(string(character), count))
}
func parseISOString(layout string) (string, error) {
var dateTimeLayout strings.Builder
for idx := 0; idx < len(layout); {
// check if the character is a text literal delimiter (')
if layout[idx] == textLiteralDelimiter {
if idx+1 < len(layout) && layout[idx+1] == textLiteralDelimiter {
// if the next character is also a text literal delimiter, then
// copy the character as is
dateTimeLayout.WriteByte(textLiteralDelimiter)
idx += 2
continue
}
// find the next text literal delimiter
for idx++; idx < len(layout); idx++ {
if layout[idx] == textLiteralDelimiter {
break
}
dateTimeLayout.WriteByte(layout[idx])
}
// idx can either be equal to len(layout) if the text literal delimiter is not found
// after the first text literal delimiter or it will be equal to the index of the
// second text literal delimiter
if idx == len(layout) {
// text literal delimiter not found error
return "", fmt.Errorf("invalid format string, expected text literal delimiter: %s", string(textLiteralDelimiter))
}
// increment idx to skip the second text literal delimiter
idx++
continue
}
// check if character is a letter in english alphabet - a-zA-Z which are reserved
// for format specifiers
if (layout[idx] >= 'a' && layout[idx] <= 'z') || (layout[idx] >= 'A' && layout[idx] <= 'Z') {
// find the number of times the character occurs consecutively
count := letterCounter(layout, idx)
character := layout[idx]
// first check the table
if layout, ok := timeElementToLayout[character][count]; ok {
dateTimeLayout.WriteString(layout)
} else {
switch character {
case 'y', 'u', 'Y':
// year
if count == 2 {
dateTimeLayout.WriteString("06")
} else {
format := fmt.Sprintf("%%0%ds", count)
dateTimeLayout.WriteString(fmt.Sprintf(format, "2006"))
}
case 'h', 'K':
// hour of am/pm: Java's 'h' is 1-12 and 'K' is 0-11; both map to Go's 12-hour layout
switch count {
case 2:
// hh, KK -> 03
dateTimeLayout.WriteString("03")
case 1:
// h, K -> 3
dateTimeLayout.WriteString("3")
default:
// e.g., hhh
return "", invalidFormatError(character, count)
}
case 'E':
// day of week
if count == 4 {
dateTimeLayout.WriteString("Monday") // EEEE -> Monday
} else if count <= 3 {
dateTimeLayout.WriteString("Mon") // E, EE, EEE -> Mon
} else {
return "", invalidFormatError(character, count) // e.g., EEEEE
}
case 'S':
// fraction of second
// .SSS = millisecond
// .SSSSSS = microsecond
// .SSSSSSSSS = nanosecond
if count > 9 {
return "", invalidFormatError(character, count)
}
dateTimeLayout.WriteString(strings.Repeat(string('0'), count))
case 'z':
// timezone id
if count < 5 {
dateTimeLayout.WriteString("MST")
} else {
return "", invalidFormatError(character, count)
}
default:
return "", invalidFormatError(character, count)
}
}
idx += count
} else {
// copy the character as is
dateTimeLayout.WriteByte(layout[idx])
idx++
}
}
return dateTimeLayout.String(), nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
layouts, ok := config["layouts"].([]interface{})
if !ok {
return nil, fmt.Errorf("must specify layouts")
}
var layoutStrs []string
for _, layout := range layouts {
layoutStr, ok := layout.(string)
if ok {
layout, err := parseISOString(layoutStr)
if err != nil {
return nil, err
}
layoutStrs = append(layoutStrs, layout)
}
}
return New(layoutStrs), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}
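
`parseISOString` above walks the format string, groups each run of identical letters, and translates the run into the matching piece of a Go reference layout. The sketch below shows that core loop only, under simplifying assumptions: the table covers a handful of specifiers, quoted text literals are not handled, and the names `specToLayout` and `convert` are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// specToLayout is a deliberately small, hypothetical table mapping a run of
// pattern letters to the corresponding piece of a Go reference layout.
var specToLayout = map[string]string{
	"yyyy": "2006", "yy": "06",
	"MM": "01", "M": "1",
	"dd": "02", "d": "2",
	"HH": "15", "mm": "04", "ss": "05",
}

func isLetter(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// convert scans the pattern left to right. Non-letters are copied as-is;
// each run of one repeated letter is looked up in specToLayout.
func convert(pattern string) (string, error) {
	var out strings.Builder
	for i := 0; i < len(pattern); {
		c := pattern[i]
		if !isLetter(c) {
			out.WriteByte(c)
			i++
			continue
		}
		// Find the end of the run of identical letters starting at i.
		j := i
		for j < len(pattern) && pattern[j] == c {
			j++
		}
		run := pattern[i:j]
		layout, ok := specToLayout[run]
		if !ok {
			return "", fmt.Errorf("unknown format specifier: %s", run)
		}
		out.WriteString(layout)
		i = j
	}
	return out.String(), nil
}

func main() {
	for _, p := range []string{"yyyy-MM-dd", "dd/MM/yy HH:mm:ss", "yyyy-QQ"} {
		layout, err := convert(p)
		fmt.Printf("%q -> %q, err=%v\n", p, layout, err)
	}
}
```

Keying the table on the whole run (letter plus count) is one way to make `MM` and `M` resolve to different layouts; the original uses a nested map from letter to count for the same purpose.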

View file

@ -0,0 +1,182 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package iso
import (
"fmt"
"testing"
)
func TestConversionFromISOStyle(t *testing.T) {
tests := []struct {
input string
output string
err error
}{
{
input: "yyyy-MM-dd",
output: "2006-01-02",
err: nil,
},
{
input: "uuu/M''''dd'T'HH:m:ss.SSS",
output: "2006/1''02T15:4:05.000",
err: nil,
},
{
input: "YYYY-MM-dd'T'H:mm:ss zzz",
output: "2006-01-02T15:04:05 MST",
err: nil,
},
{
input: "MMMM dd yyyy', 'HH:mm:ss.SSS",
output: "January 02 2006, 15:04:05.000",
err: nil,
},
{
input: "h 'o'''' clock' a, XXX",
output: "3 o' clock PM, Z07:00",
err: nil,
},
{
input: "YYYY-MM-dd'T'HH:mm:ss'Z'",
output: "2006-01-02T15:04:05Z",
err: nil,
},
{
input: "E MMM d H:mm:ss z Y",
output: "Mon Jan 2 15:04:05 MST 2006",
err: nil,
},
{
input: "E MMM DD H:m:s z Y",
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: DD"),
},
{
input: "E MMM''''' H:m:s z Y",
output: "",
err: fmt.Errorf("invalid format string, expected text literal delimiter: '"),
},
{
input: "MMMMM dd yyyy', 'HH:mm:ss.SSS",
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: MMMMM"),
},
{
input: "yy", // year (2 digits)
output: "06",
err: nil,
},
{
input: "yyyyy", // year (5 digits, padded)
output: "02006",
err: nil,
},
{
input: "h", // hour 1-12 (1 digit)
output: "3",
err: nil,
},
{
input: "hh", // hour 1-12 (2 digits)
output: "03",
err: nil,
},
{
input: "KK", // hour 1-12 (2 digits, alt)
output: "03",
err: nil,
},
{
input: "hhh", // invalid hour count
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: hhh"),
},
{
input: "E", // Day of week (short)
output: "Mon",
err: nil,
},
{
input: "EEE", // Day of week (short)
output: "Mon",
err: nil,
},
{
input: "EEEE", // Day of week (long)
output: "Monday",
err: nil,
},
{
input: "EEEEE", // Invalid day-of-week count
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: EEEEE"),
},
{
input: "S", // Fraction of second (1 digit)
output: "0",
err: nil,
},
{
input: "SSSSSSSSS", // Fraction of second (9 digits)
output: "000000000",
err: nil,
},
{
input: "SSSSSSSSSS", // Invalid fraction of second count
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: SSSSSSSSSS"),
},
{
input: "z", // Timezone name (short)
output: "MST",
err: nil,
},
{
input: "zzz", // Timezone name (short)
output: "MST",
err: nil,
},
{
input: "zzzz", // Timezone name (long in the ISO style); still maps to the abbreviated layout
output: "MST",
err: nil,
},
{
input: "G", // Era designator (unsupported)
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: G"),
},
{
input: "W", // Week of month (unsupported)
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: W"),
},
}
for i, test := range tests {
t.Run(fmt.Sprintf("test %d: %s", i, test.input), func(t *testing.T) {
out, err := parseISOString(test.input)
// Check error matching
if (err != nil && test.err == nil) || (err == nil && test.err != nil) || (err != nil && test.err != nil && err.Error() != test.err.Error()) {
t.Fatalf("expected error %v, got error %v", test.err, err)
}
// Check output matching only if no error was expected/occurred
if err == nil && test.err == nil && out != test.output {
t.Fatalf("expected output '%v', got '%v'", test.output, out)
}
})
}
}

View file

@ -0,0 +1,50 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package optional
import (
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/datetime/flexible"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "dateTimeOptional"
const rfc3339NoTimezone = "2006-01-02T15:04:05"
const rfc3339NoTimezoneNoT = "2006-01-02 15:04:05"
const rfc3339Offset = "2006-01-02 15:04:05 -0700"
const rfc3339NoTime = "2006-01-02"
var layouts = []string{
time.RFC3339Nano,
time.RFC3339,
rfc3339NoTimezone,
rfc3339NoTimezoneNoT,
rfc3339Offset,
rfc3339NoTime,
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return flexible.New(layouts), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}

View file

@ -0,0 +1,205 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package percent
import (
"fmt"
"strings"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "percentstyle"
var formatDelimiter byte = '%'
// format specifiers as per strftime in the C standard library
// https://man7.org/linux/man-pages/man3/strftime.3.html
var formatSpecifierToLayout = map[byte]string{
formatDelimiter: string(formatDelimiter), // %% = % (literal %)
'a': "Mon", // %a = short weekday name
'A': "Monday", // %A = full weekday name
'd': "02", // %d = day of month (2 digits) (01-31)
'e': "2", // %e = day of month (1 digit) (1-31)
'b': "Jan", // %b = short month name
'B': "January", // %B = full month name
'm': "01", // %m = month of year (2 digits) (01-12)
'y': "06", // %y = year without century
'Y': "2006", // %Y = year with century
'H': "15", // %H = hour (24 hour clock) (2 digits)
'I': "03", // %I = hour (12 hour clock) (2 digits)
'l': "3", // %l = hour (12 hour clock) (1 digit)
'p': "PM", // %p = PM/AM
'P': "pm", // %P = pm/am (lowercase)
'M': "04", // %M = minute (2 digits)
'S': "05", // %S = seconds (2 digits)
'f': "999999", // .%f = fraction of seconds - up to microseconds (6 digits) - deci/milli/micro
'Z': "MST", // %Z = timezone name (GMT, JST, UTC etc)
// %z is present in timezone options
// additional specifiers that are not part of strftime, covering cases such as
// unpadded month, minute and seconds, nanosecond precision, etc
'o': "1", // %o = month of year (1 digit) (1-12)
'i': "4", // %i = minute (1 digit)
's': "5", // %s = seconds (1 digit)
'N': "999999999", // .%N = fraction of seconds - up to nanoseconds (9 digits) - milli/micro/nano
}
// some additional options for timezone
// such as allowing colon in timezone offset and specifying the seconds
// timezone offsets are from UTC
var timezoneOptions = map[string]string{
"z": "Z0700", // %z = timezone offset in +-hhmm / +-(2 digit hour)(2 digit minute) +0500, -0600 etc
"z:M": "Z07:00", // %z:M = timezone offset(+-hh:mm) / +-(2 digit hour):(2 digit minute) +05:00, -06:00 etc
"z:S": "Z07:00:00", // %z:S = timezone offset(+-hh:mm:ss) / +-(2 digit hour):(2 digit minute):(2 digit second) +05:20:00, -06:30:00 etc
"zH": "Z07", // %zH = timezone offset(+-hh) / +-(2 digit hour) +05, -06 etc
"zS": "Z070000", // %zS = timezone offset(+-hhmmss) / +-(2 digit hour)(2 digit minute)(2 digit second) +052000, -063000 etc
}
type DateTimeParser struct {
layouts []string
}
func New(layouts []string) *DateTimeParser {
return &DateTimeParser{
layouts: layouts,
}
}
func checkTZOptions(formatString string, idx int) (string, int) {
// idx points to '%'
// We know formatString[idx+1] == 'z'
nextIdx := idx + 2 // Index of the character immediately after 'z'
// Default values assume only '%z' is present
layout := timezoneOptions["z"]
finalIdx := nextIdx // Index after '%z'
if nextIdx < len(formatString) {
switch formatString[nextIdx] {
case ':':
// Check for modifier after the colon ':'
colonModifierIdx := nextIdx + 1
if colonModifierIdx < len(formatString) {
switch formatString[colonModifierIdx] {
case 'M':
// Found %z:M
layout = timezoneOptions["z:M"]
finalIdx = colonModifierIdx + 1 // Index after %z:M
case 'S':
// Found %z:S
layout = timezoneOptions["z:S"]
finalIdx = colonModifierIdx + 1 // Index after %z:S
// default: If %z: is followed by something else, or just %z: at the end.
// Keep the default layout ("z") and finalIdx (idx + 2).
// The ':' will be treated as a literal by the main loop.
}
}
// else: %z: is at the very end of the string.
// Keep the default layout ("z") and finalIdx (idx + 2).
// The ':' will be treated as a literal by the main loop.
case 'H':
// Found %zH
layout = timezoneOptions["zH"]
finalIdx = nextIdx + 1 // Index after %zH
case 'S':
// Found %zS
layout = timezoneOptions["zS"]
finalIdx = nextIdx + 1 // Index after %zS
// default: If %z is followed by something other than ':', 'H', or 'S'.
// Keep the default layout ("z") and finalIdx (idx + 2).
// The character formatString[nextIdx] will be handled by the main loop.
}
}
// else: %z is at the very end of the string.
// Keep the default layout ("z") and finalIdx (idx + 2).
return layout, finalIdx
}
func parseFormatString(formatString string) (string, error) {
var dateTimeLayout strings.Builder
// iterate over the format string and replace the format specifiers with
// the corresponding golang constants
for idx := 0; idx < len(formatString); {
// check if the character is a format delimiter (%)
if formatString[idx] == formatDelimiter {
// check if there is a character after the format delimiter (%)
if idx+1 >= len(formatString) {
return "", fmt.Errorf("invalid format string, expected character after %s", string(formatDelimiter))
}
formatSpecifier := formatString[idx+1]
if layout, ok := formatSpecifierToLayout[formatSpecifier]; ok {
dateTimeLayout.WriteString(layout)
idx += 2
} else if formatSpecifier == 'z' {
// %z is not in the specifier table because it accepts optional modifiers;
// resolve it (and any modifier that follows) via checkTZOptions
var tzLayout string
tzLayout, idx = checkTZOptions(formatString, idx)
dateTimeLayout.WriteString(tzLayout)
} else {
return "", fmt.Errorf("invalid format string, unknown format specifier: %s", string(formatSpecifier))
}
continue
}
// copy the character as is
dateTimeLayout.WriteByte(formatString[idx])
idx++
}
return dateTimeLayout.String(), nil
}
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
for _, layout := range p.layouts {
rv, err := time.Parse(layout, input)
if err == nil {
return rv, layout, nil
}
}
return time.Time{}, "", analysis.ErrInvalidDateTime
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
layouts, ok := config["layouts"].([]interface{})
if !ok {
return nil, fmt.Errorf("must specify layouts")
}
layoutStrs := make([]string, 0, len(layouts))
for _, layout := range layouts {
layoutStr, ok := layout.(string)
if ok {
layout, err := parseFormatString(layoutStr)
if err != nil {
return nil, err
}
layoutStrs = append(layoutStrs, layout)
}
}
return New(layoutStrs), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}
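The conversion performed by parseFormatString can be illustrated with a reduced, standalone sketch. It assumes only a small subset of the specifier table and omits the %z modifiers handled by checkTZOptions. The names specifiers and toGoLayout are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// specifiers is a deliberately small subset of the table in the file above.
var specifiers = map[byte]string{
	'%': "%", 'Y': "2006", 'm': "01", 'd': "02",
	'H': "15", 'M': "04", 'S': "05", 'Z': "MST",
}

// toGoLayout rewrites a percent-style format into a Go time layout.
// Characters outside a %-specifier are copied through unchanged.
func toGoLayout(format string) (string, error) {
	var b strings.Builder
	for i := 0; i < len(format); {
		if format[i] != '%' {
			b.WriteByte(format[i])
			i++
			continue
		}
		if i+1 >= len(format) {
			return "", fmt.Errorf("expected character after %%")
		}
		layout, ok := specifiers[format[i+1]]
		if !ok {
			return "", fmt.Errorf("unknown format specifier: %c", format[i+1])
		}
		b.WriteString(layout)
		i += 2 // skip the '%' and the specifier byte
	}
	return b.String(), nil
}

func main() {
	layout, err := toGoLayout("%Y-%m-%d %H:%M:%S")
	if err != nil {
		panic(err)
	}
	t, err := time.Parse(layout, "2023-10-27 14:30:00")
	fmt.Println(layout, t, err)
}
```

The converted layout is an ordinary Go layout string, so it can be passed straight to time.Parse, which is what ParseDateTime does with each stored layout.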

View file

@ -0,0 +1,474 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package percent
import (
"fmt"
"reflect"
"testing"
"time"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestConversionFromPercentStyle(t *testing.T) {
tests := []struct {
name string
input string
output string
err error
}{
{
name: "basic YMD",
input: "%Y-%m-%d",
output: "2006-01-02",
err: nil,
},
{
name: "YMD with double percent and literal T",
input: "%Y/%m%%%%%dT%H%M:%S",
output: "2006/01%%02T1504:05",
err: nil,
},
{
name: "YMD T HMS Z z",
input: "%Y-%m-%dT%H:%M:%S %Z%z",
output: "2006-01-02T15:04:05 MSTZ0700",
err: nil,
},
{
name: "Full month, unpadded day/hour/minute, am/pm, z:M",
input: "%B %e, %Y %l:%i %P %z:M",
output: "January 2, 2006 3:4 pm Z07:00",
err: nil,
},
{
name: "Long format with literals and timezone literal :S",
input: "Hour %H Minute %Mseconds %S.%N Timezone:%Z:S, Weekday %a; Day %d Month %b, Year %y",
output: "Hour 15 Minute 04seconds 05.999999999 Timezone:MST:S, Weekday Mon; Day 02 Month Jan, Year 06",
err: nil,
},
{
name: "YMD T HMS with nanoseconds",
input: "%Y-%m-%dT%H:%M:%S.%N",
output: "2006-01-02T15:04:05.999999999",
err: nil,
},
{
name: "HMS Z z",
input: "%H:%M:%S %Z %z",
output: "15:04:05 MST Z0700",
err: nil,
},
{
name: "HMS Z z literal colon",
input: "%H:%M:%S %Z %z:",
output: "15:04:05 MST Z0700:",
err: nil,
},
{
name: "HMS Z z:M",
input: "%H:%M:%S %Z %z:M",
output: "15:04:05 MST Z07:00",
err: nil,
},
{
name: "HMS Z z:S",
input: "%H:%M:%S %Z %z:S",
output: "15:04:05 MST Z07:00:00",
err: nil,
},
{
name: "HMS Z z: literal A",
input: "%H:%M:%S %Z %z:A",
output: "15:04:05 MST Z0700:A",
err: nil,
},
{
name: "HMS Z z literal M",
input: "%H:%M:%S %Z %zM",
output: "15:04:05 MST Z0700M",
err: nil,
},
{
name: "HMS Z zH",
input: "%H:%M:%S %Z %zH",
output: "15:04:05 MST Z07",
err: nil,
},
{
name: "HMS Z zS",
input: "%H:%M:%S %Z %zS",
output: "15:04:05 MST Z070000",
err: nil,
},
{
name: "Complex combination z zS z: zH",
input: "%H:%M:%S %Z %z%Z %zS%z:%zH",
output: "15:04:05 MST Z0700MST Z070000Z0700:Z07",
err: nil,
},
{
name: "z at end",
input: "%Y-%m-%d %z",
output: "2006-01-02 Z0700",
err: nil,
},
{
name: "z: at end",
input: "%Y-%m-%d %z:",
output: "2006-01-02 Z0700:",
err: nil,
},
{
name: "zH at end",
input: "%Y-%m-%d %zH",
output: "2006-01-02 Z07",
err: nil,
},
{
name: "zS at end",
input: "%Y-%m-%d %zS",
output: "2006-01-02 Z070000",
err: nil,
},
{
name: "z:M at end",
input: "%Y-%m-%d %z:M",
output: "2006-01-02 Z07:00",
err: nil,
},
{
name: "z:S at end",
input: "%Y-%m-%d %z:S",
output: "2006-01-02 Z07:00:00",
err: nil,
},
{
name: "z followed by literal X",
input: "%Y-%m-%d %zX",
output: "2006-01-02 Z0700X",
err: nil,
},
{
name: "z: followed by literal X",
input: "%Y-%m-%d %z:X",
output: "2006-01-02 Z0700:X",
err: nil,
},
{
name: "Invalid specifier T",
input: "%Y-%m-%d%T%H:%M:%S %ZM",
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: T"),
},
{
name: "Ends with %",
input: "%Y-%m-%dT%H:%M:%S %ZM%",
output: "",
err: fmt.Errorf("invalid format string, expected character after %%"),
},
{
name: "Just %",
input: "%",
output: "",
err: fmt.Errorf("invalid format string, expected character after %%"),
},
{
name: "Just %%",
input: "%%",
output: "%",
err: nil,
},
{
name: "Unknown specifier x",
input: "%x",
output: "",
err: fmt.Errorf("invalid format string, unknown format specifier: x"),
},
{
name: "Literal prefix",
input: "literal %Y",
output: "literal 2006",
err: nil,
},
{
name: "Literal suffix",
input: "%Y literal",
output: "2006 literal",
err: nil,
},
}
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
out, err := parseFormatString(test.input)
// Compare errors by message, treating a nil error as the empty string
expectedErrStr := ""
if test.err != nil {
expectedErrStr = test.err.Error()
}
actualErrStr := ""
if err != nil {
actualErrStr = err.Error()
}
if expectedErrStr != actualErrStr {
// Provide more detailed output if errors don't match as strings
t.Fatalf("error mismatch:\nExpected error: %q\nGot error : %q", expectedErrStr, actualErrStr)
}
// Presence check; normally redundant once the messages match, kept as a safeguard
if (err != nil && test.err == nil) || (err == nil && test.err != nil) {
t.Fatalf("presence mismatch: expected error %v, got error %v", test.err, err)
}
// Check output matching only if no error was expected/occurred
if err == nil && test.err == nil && out != test.output {
t.Fatalf("output mismatch: expected '%v', got '%v'", test.output, out)
}
})
}
}
func TestDateTimeParser_ParseDateTime(t *testing.T) {
// Pre-create some parsers with known Go layouts
parser1 := New([]string{"2006-01-02", "01/02/2006"}) // YYYY-MM-DD, MM/DD/YYYY
parser2 := New([]string{"15:04:05"}) // HH:MM:SS
parserEmpty := New([]string{}) // No layouts
// Define expected time values
time1, _ := time.Parse("2006-01-02", "2023-10-27")
time2, _ := time.Parse("01/02/2006", "10/27/2023")
time3, _ := time.Parse("15:04:05", "14:30:00")
tests := []struct {
name string
parser *DateTimeParser
input string
expectTime time.Time
expectLayout string
expectErr error
}{
{
name: "match first layout",
parser: parser1,
input: "2023-10-27",
expectTime: time1,
expectLayout: "2006-01-02",
expectErr: nil,
},
{
name: "match second layout",
parser: parser1,
input: "10/27/2023",
expectTime: time2,
expectLayout: "01/02/2006",
expectErr: nil,
},
{
name: "no matching layout",
parser: parser1,
input: "14:30:00", // Matches parser2's layout, not parser1's
expectTime: time.Time{},
expectLayout: "",
expectErr: analysis.ErrInvalidDateTime,
},
{
name: "match only layout",
parser: parser2,
input: "14:30:00",
expectTime: time3,
expectLayout: "15:04:05",
expectErr: nil,
},
{
name: "invalid date format for layout",
parser: parser1,
input: "27-10-2023", // Wrong separators
expectTime: time.Time{},
expectLayout: "",
expectErr: analysis.ErrInvalidDateTime, // time.Parse fails on all, returns ErrInvalidDateTime
},
{
name: "empty input",
parser: parser1,
input: "",
expectTime: time.Time{},
expectLayout: "",
expectErr: analysis.ErrInvalidDateTime,
},
{
name: "parser with no layouts",
parser: parserEmpty,
input: "2023-10-27",
expectTime: time.Time{},
expectLayout: "",
expectErr: analysis.ErrInvalidDateTime,
},
{
name: "not a date string",
parser: parser1,
input: "hello world",
expectTime: time.Time{},
expectLayout: "",
expectErr: analysis.ErrInvalidDateTime,
},
}
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
gotTime, gotLayout, gotErr := test.parser.ParseDateTime(test.input)
// Check error
if !reflect.DeepEqual(gotErr, test.expectErr) {
t.Fatalf("error mismatch:\nExpected: %v\nGot: %v", test.expectErr, gotErr)
}
// Check time only if no error expected
if test.expectErr == nil {
if !gotTime.Equal(test.expectTime) {
t.Errorf("time mismatch:\nExpected: %v\nGot: %v", test.expectTime, gotTime)
}
if gotLayout != test.expectLayout {
t.Errorf("layout mismatch:\nExpected: %q\nGot: %q", test.expectLayout, gotLayout)
}
}
})
}
}
func TestDateTimeParserConstructor(t *testing.T) {
tests := []struct {
name string
config map[string]interface{}
expectLayouts []string // Expected Go layouts after parsing
expectErr error
}{
{
name: "valid config with multiple layouts",
config: map[string]interface{}{
"layouts": []interface{}{"%Y-%m-%d", "%H:%M:%S %Z"},
},
expectLayouts: []string{"2006-01-02", "15:04:05 MST"},
expectErr: nil,
},
{
name: "valid config with single layout",
config: map[string]interface{}{
"layouts": []interface{}{"%Y/%m/%d %z:M"},
},
expectLayouts: []string{"2006/01/02 Z07:00"},
expectErr: nil,
},
{
name: "valid config with complex layout",
config: map[string]interface{}{
"layouts": []interface{}{"%a, %d %b %Y %H:%M:%S %zH"},
},
expectLayouts: []string{"Mon, 02 Jan 2006 15:04:05 Z07"},
expectErr: nil,
},
{
name: "config missing layouts key",
config: map[string]interface{}{
"other_key": "value",
},
expectLayouts: nil,
expectErr: fmt.Errorf("must specify layouts"),
},
{
name: "config layouts not a slice",
config: map[string]interface{}{
"layouts": "not-a-slice", // Value is a string
},
expectLayouts: nil,
// A non-slice value fails the []interface{} type assertion, so the same error is expected
expectErr: fmt.Errorf("must specify layouts"),
},
{
name: "config layouts contains non-string",
config: map[string]interface{}{
"layouts": []interface{}{"%Y-%m-%d", 123},
},
// Should process the valid string, ignore the int
expectLayouts: []string{"2006-01-02"},
expectErr: nil,
},
{
name: "config layouts contains invalid percent format",
config: map[string]interface{}{
"layouts": []interface{}{"%Y-%m-%d", "%x"}, // %x is invalid
},
expectLayouts: nil,
expectErr: fmt.Errorf("invalid format string, unknown format specifier: x"),
},
{
name: "config layouts contains format ending in %",
config: map[string]interface{}{
"layouts": []interface{}{"%Y-%m-%d", "%H:%M:%"},
},
expectLayouts: nil,
expectErr: fmt.Errorf("invalid format string, expected character after %%"),
},
{
name: "config with empty layouts slice",
config: map[string]interface{}{
"layouts": []interface{}{},
},
expectLayouts: []string{}, // Expect an empty slice, not nil
expectErr: nil,
},
{
name: "nil config",
config: nil,
expectLayouts: nil,
expectErr: fmt.Errorf("must specify layouts"),
},
}
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
// Cache is not used by this constructor, so nil is fine
parserIntf, err := DateTimeParserConstructor(test.config, nil)
// Check error
// Use string comparison for errors as they might be created differently
expectedErrStr := ""
if test.expectErr != nil {
expectedErrStr = test.expectErr.Error()
}
actualErrStr := ""
if err != nil {
actualErrStr = err.Error()
}
if expectedErrStr != actualErrStr {
t.Fatalf("error mismatch:\nExpected: %q\nGot: %q", expectedErrStr, actualErrStr)
}
// Check layouts only if no error expected
if test.expectErr == nil {
// Type assert to access the layouts field
parser, ok := parserIntf.(*DateTimeParser)
if !ok {
t.Fatalf("constructor did not return a *DateTimeParser")
}
if !reflect.DeepEqual(parser.layouts, test.expectLayouts) {
t.Errorf("layouts mismatch:\nExpected: %v\nGot: %v", test.expectLayouts, parser.layouts)
}
}
})
}
}

View file

@ -0,0 +1,130 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package sanitized
import (
"fmt"
"regexp"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "sanitizedgo"
var validMagicNumbers = map[string]struct{}{
"2006": {},
"06": {}, // Year
"01": {},
"1": {},
"_1": {},
"January": {},
"Jan": {}, // Month
"02": {},
"2": {},
"_2": {},
"__2": {},
"002": {},
"Monday": {},
"Mon": {}, // Day
"15": {},
"3": {},
"03": {}, // Hour
"4": {},
"04": {}, // Minute
"5": {},
"05": {}, // Second
"0700": {},
"070000": {},
"07": {},
"00": {},
"": {},
}
var layoutSplitRegex = regexp.MustCompile("[\\+\\-= :T,Z\\.<>;\\?!`~@#$%\\^&\\*|'\"\\(\\){}\\[\\]/\\\\]")
var layoutStripRegex = regexp.MustCompile(`PM|pm|\.9+|\.0+|MST`)
type DateTimeParser struct {
layouts []string
}
func New(layouts []string) *DateTimeParser {
return &DateTimeParser{
layouts: layouts,
}
}
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
for _, layout := range p.layouts {
rv, err := time.Parse(layout, input)
if err == nil {
return rv, layout, nil
}
}
return time.Time{}, "", analysis.ErrInvalidDateTime
}
// date time layouts must be a combination of constants specified in golang time package
// https://pkg.go.dev/time#pkg-constants
// this validation verifies that only these constants are used in the custom layout
// for compatibility with the golang time package
func validateLayout(layout string) bool {
// first we strip out commonly used constants
// such as "PM" which can be present in the layout
// right after a time component, e.g. 03:04PM;
// because regex split cannot separate "03:04PM" into
// "03:04" and "PM". We also strip out ".9+" and ".0+"
// which represent fractional seconds.
layout = layoutStripRegex.ReplaceAllString(layout, "")
// then we split the layout on separator characters (see layoutSplitRegex)
// and verify that each resulting piece is a valid magic number
split := layoutSplitRegex.Split(layout, -1)
for i := range split {
_, found := validMagicNumbers[split[i]]
if !found {
return false
}
}
return true
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
layouts, ok := config["layouts"].([]interface{})
if !ok {
return nil, fmt.Errorf("must specify layouts")
}
var layoutStrs []string
for _, layout := range layouts {
layoutStr, ok := layout.(string)
if ok {
if !validateLayout(layoutStr) {
return nil, fmt.Errorf("invalid datetime parser layout: %s,"+
" please refer to https://pkg.go.dev/time#pkg-constants for supported"+
" layouts", layoutStr)
}
layoutStrs = append(layoutStrs, layoutStr)
}
}
return New(layoutStrs), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}
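The strip-then-split validation in validateLayout can be sketched in isolation. The example below assumes a much smaller set of accepted layout pieces and a simplified separator class. The names allowed, stripRe, splitRe and isValidLayout are hypothetical and chosen for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// allowed is a reduced set of accepted layout pieces. The empty string is
// accepted because adjacent or trailing separators produce empty pieces.
var allowed = map[string]struct{}{
	"2006": {}, "01": {}, "02": {}, "15": {}, "04": {}, "05": {},
	"07": {}, "00": {}, "": {},
}

// stripRe removes components that are not delimited by a separator.
var stripRe = regexp.MustCompile(`PM|pm|\.9+|\.0+|MST`)

// splitRe is a simplified separator class (the real one covers more symbols).
var splitRe = regexp.MustCompile(`[-+ :TZ/]`)

// isValidLayout reports whether every piece of the layout is a known constant.
func isValidLayout(layout string) bool {
	layout = stripRe.ReplaceAllString(layout, "")
	for _, part := range splitRe.Split(layout, -1) {
		if _, ok := allowed[part]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isValidLayout("2006-01-02T15:04:05.999999999Z07:00")) // true
	fmt.Println(isValidLayout("2014-08-03"))                          // false: "2014" is not a layout constant
}
```

Rejecting anything that is not a known constant means a concrete date such as 2014-08-03 cannot be registered as a layout by mistake.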

View file

@ -0,0 +1,109 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package sanitized
import (
"reflect"
"testing"
)
func TestLayoutValidatorRegex(t *testing.T) {
splitRegexTests := []struct {
input string
output []string
}{
{
input: "2014-08-03",
output: []string{"2014", "08", "03"},
},
{
input: "2014-08-03T15:59:30",
output: []string{"2014", "08", "03", "15", "59", "30"},
},
{
input: "2014.08-03 15/59`30",
output: []string{"2014", "08", "03", "15", "59", "30"},
},
{
input: "2014/08/03T15:59:30Z08:00",
output: []string{"2014", "08", "03", "15", "59", "30", "08", "00"},
},
{
input: "2014\\08|03T15=59.30.999999999+08*00",
output: []string{"2014", "08", "03", "15", "59", "30", "999999999", "08", "00"},
},
{
input: "2006-01-02T15:04:05.999999999Z07:00",
output: []string{"2006", "01", "02", "15", "04", "05", "999999999", "07", "00"},
},
{
input: "A-B C:DTE,FZG.H<I>J;K?L!M`N~O@P#Q$R%S^U&V*W|X'Y\"A(B)C{D}E[F]G/H\\I+J=L",
output: []string{"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "U", "V", "W", "X", "Y", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L"},
},
}
regex := layoutSplitRegex
for _, test := range splitRegexTests {
t.Run(test.input, func(t *testing.T) {
actualOutput := regex.Split(test.input, -1)
if !reflect.DeepEqual(actualOutput, test.output) {
t.Fatalf("expected output %v, got %v", test.output, actualOutput)
}
})
}
stripRegexTests := []struct {
input string
output string
}{
{
input: "3PM",
output: "3",
},
{
input: "3.0PM",
output: "3",
},
{
input: "3.9AM",
output: "3AM",
},
{
input: "3.999999999pm",
output: "3",
},
{
input: "2006-01-02T15:04:05.999999999Z07:00MST",
output: "2006-01-02T15:04:05Z07:00",
},
{
input: "Jan _2 15:04:05.0000000+07:00MST",
output: "Jan _2 15:04:05+07:00",
},
{
input: "15:04:05.99PM+07:00MST",
output: "15:04:05+07:00",
},
}
regex = layoutStripRegex
for _, test := range stripRegexTests {
t.Run(test.input, func(t *testing.T) {
actualOutput := regex.ReplaceAllString(test.input, "")
if actualOutput != test.output {
t.Fatalf("expected output %v, got %v", test.output, actualOutput)
}
})
}
}

View file

@ -0,0 +1,55 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package microseconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_micro"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64 / 1000
var maxBound int64 = math.MaxInt64 / 1000
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is microseconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.UnixMicro(timestamp), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}

View file

@ -0,0 +1,55 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package milliseconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_milli"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64 / 1000000
var maxBound int64 = math.MaxInt64 / 1000000
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is milliseconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.UnixMilli(timestamp), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}

View file

@ -0,0 +1,55 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package nanoseconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_nano"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64
var maxBound int64 = math.MaxInt64
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is nanoseconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.Unix(0, timestamp), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}

View file

@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package seconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_sec"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64 / 1000000000
var maxBound int64 = math.MaxInt64 / 1000000000
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is seconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.Unix(timestamp, 0), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}

70
analysis/freq.go Normal file
View file

@ -0,0 +1,70 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis
import (
index "github.com/blevesearch/bleve_index_api"
)
func TokenFrequency(tokens TokenStream, arrayPositions []uint64, options index.FieldIndexingOptions) index.TokenFrequencies {
rv := make(map[string]*index.TokenFreq, len(tokens))
if options.IncludeTermVectors() {
tls := make([]index.TokenLocation, len(tokens))
tlNext := 0
for _, token := range tokens {
tls[tlNext] = index.TokenLocation{
ArrayPositions: arrayPositions,
Start: token.Start,
End: token.End,
Position: token.Position,
}
curr, ok := rv[string(token.Term)]
if ok {
curr.Locations = append(curr.Locations, &tls[tlNext])
} else {
curr = &index.TokenFreq{
Term: token.Term,
Locations: []*index.TokenLocation{&tls[tlNext]},
}
rv[string(token.Term)] = curr
}
if !options.SkipFreqNorm() {
curr.SetFrequency(curr.Frequency() + 1)
}
tlNext++
}
} else {
for _, token := range tokens {
curr, exists := rv[string(token.Term)]
if !exists {
curr = &index.TokenFreq{
Term: token.Term,
}
rv[string(token.Term)] = curr
}
if !options.SkipFreqNorm() {
curr.SetFrequency(curr.Frequency() + 1)
}
}
}
return rv
}
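
A minimal, standard-library-only sketch of the counting logic in TokenFrequency above. The token, location, and termFreq types and the countTerms function are hypothetical stand-ins for illustration, not part of bleve or bleve_index_api; the withLocations flag plays the role of options.IncludeTermVectors(), and the sketch always counts frequencies (it has no equivalent of SkipFreqNorm).

```go
package main

import "fmt"

// token is a simplified, hypothetical stand-in for analysis.Token.
type token struct {
	Term     string
	Position int
	Start    int
	End      int
}

// location is a hypothetical stand-in for index.TokenLocation.
type location struct {
	Position int
	Start    int
	End      int
}

// termFreq is a hypothetical stand-in for index.TokenFreq.
type termFreq struct {
	Term      string
	Frequency int
	Locations []location
}

// countTerms groups tokens by term. Each distinct term gets one entry whose
// Frequency is the number of occurrences; locations are kept only on request.
func countTerms(tokens []token, withLocations bool) map[string]*termFreq {
	rv := make(map[string]*termFreq, len(tokens))
	for _, tok := range tokens {
		curr, ok := rv[tok.Term]
		if !ok {
			curr = &termFreq{Term: tok.Term}
			rv[tok.Term] = curr
		}
		curr.Frequency++
		if withLocations {
			curr.Locations = append(curr.Locations, location{
				Position: tok.Position,
				Start:    tok.Start,
				End:      tok.End,
			})
		}
	}
	return rv
}

func main() {
	tokens := []token{
		{Term: "water", Position: 1, Start: 0, End: 5},
		{Term: "water", Position: 2, Start: 6, End: 11},
	}
	freqs := countTerms(tokens, true)
	fmt.Println(freqs["water"].Frequency, len(freqs["water"].Locations))
}
```

The real function preallocates one slice of locations for the whole stream and stores pointers into it, which avoids a separate allocation per token; the sketch trades that optimization for brevity.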

60
analysis/freq_test.go Normal file

@ -0,0 +1,60 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis
import (
index "github.com/blevesearch/bleve_index_api"
"reflect"
"testing"
)
func TestTokenFrequency(t *testing.T) {
tokens := TokenStream{
&Token{
Term: []byte("water"),
Position: 1,
Start: 0,
End: 5,
},
&Token{
Term: []byte("water"),
Position: 2,
Start: 6,
End: 11,
},
}
expectedResult := index.TokenFrequencies{
"water": &index.TokenFreq{
Term: []byte("water"),
Locations: []*index.TokenLocation{
{
Position: 1,
Start: 0,
End: 5,
},
{
Position: 2,
Start: 6,
End: 11,
},
},
},
}
expectedResult["water"].SetFrequency(2)
result := TokenFrequency(tokens, nil, index.IncludeTermVectors)
if !reflect.DeepEqual(result, expectedResult) {
t.Errorf("expected %#v, got %#v", expectedResult, result)
}
}

@ -0,0 +1,68 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/token/unicodenorm"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)
const AnalyzerName = "ar"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
tokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
normalizeFilter := unicodenorm.MustNewUnicodeNormalizeFilter(unicodenorm.NFKC)
stopArFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
normalizeArFilter, err := cache.TokenFilterNamed(NormalizeName)
if err != nil {
return nil, err
}
stemmerArFilter, err := cache.TokenFilterNamed(StemmerName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: tokenizer,
TokenFilters: []analysis.TokenFilter{
toLowerFilter,
normalizeFilter,
stopArFilter,
normalizeArFilter,
stemmerArFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,184 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestArabicAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
{
input: []byte("كبير"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("كبير"),
Position: 1,
Start: 0,
End: 8,
},
},
},
// feminine marker
{
input: []byte("كبيرة"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("كبير"),
Position: 1,
Start: 0,
End: 10,
},
},
},
{
input: []byte("مشروب"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("مشروب"),
Position: 1,
Start: 0,
End: 10,
},
},
},
// plural -at
{
input: []byte("مشروبات"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("مشروب"),
Position: 1,
Start: 0,
End: 14,
},
},
},
// plural -in
{
input: []byte("أمريكيين"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("امريك"),
Position: 1,
Start: 0,
End: 16,
},
},
},
// singular with bare alif
{
input: []byte("امريكي"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("امريك"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{
input: []byte("كتاب"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("كتاب"),
Position: 1,
Start: 0,
End: 8,
},
},
},
// definite article
{
input: []byte("الكتاب"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("كتاب"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{
input: []byte("ما ملكت أيمانكم"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ملكت"),
Position: 2,
Start: 5,
End: 13,
},
&analysis.Token{
Term: []byte("ايمانكم"),
Position: 3,
Start: 14,
End: 28,
},
},
},
// stopwords
{
input: []byte("الذين ملكت أيمانكم"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ملكت"),
Position: 2,
Start: 11,
End: 19,
},
&analysis.Token{
Term: []byte("ايمانكم"),
Position: 3,
Start: 20,
End: 34,
},
},
},
// presentation form normalization
{
input: []byte("ﺍﻟﺴﻼﻢ"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلام"),
Position: 1,
Start: 0,
End: 15,
},
},
},
}
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
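// The stop filter can drop every token, so guard the index before dumping bytes.
if len(test.output) > 0 && len(actual) > 0 {
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
}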
}
}
}

@ -0,0 +1,88 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const NormalizeName = "normalize_ar"
const (
Alef = '\u0627'
AlefMadda = '\u0622'
AlefHamzaAbove = '\u0623'
AlefHamzaBelow = '\u0625'
Yeh = '\u064A'
DotlessYeh = '\u0649'
TehMarbuta = '\u0629'
Heh = '\u0647'
Tatweel = '\u0640'
Fathatan = '\u064B'
Dammatan = '\u064C'
Kasratan = '\u064D'
Fatha = '\u064E'
Damma = '\u064F'
Kasra = '\u0650'
Shadda = '\u0651'
Sukun = '\u0652'
)
type ArabicNormalizeFilter struct {
}
func NewArabicNormalizeFilter() *ArabicNormalizeFilter {
return &ArabicNormalizeFilter{}
}
func (s *ArabicNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
term := normalize(token.Term)
token.Term = term
}
return input
}
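// normalize folds the alef variants to bare alef, dotless yeh to yeh, and
// teh marbuta to heh, and deletes tatweel and the listed diacritics.
func normalize(input []byte) []byte {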
runes := bytes.Runes(input)
for i := 0; i < len(runes); i++ {
switch runes[i] {
case AlefMadda, AlefHamzaAbove, AlefHamzaBelow:
runes[i] = Alef
case DotlessYeh:
runes[i] = Yeh
case TehMarbuta:
runes[i] = Heh
case Tatweel, Kasratan, Dammatan, Fathatan, Fatha, Damma, Kasra, Shadda, Sukun:
runes = analysis.DeleteRune(runes, i)
i--
}
}
return analysis.BuildTermFromRunes(runes)
}
func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewArabicNormalizeFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
if err != nil {
panic(err)
}
}
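
A standalone sketch of the same normalization rules, assuming only the Go standard library. The normalizeArabic name is hypothetical, and the sketch swaps the delete-in-place loop used above for strings.Map, which drops a rune when the mapping function returns a negative value. The code points are the ones declared in the const block of this file.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeArabic applies the folding rules from the filter above to a string.
func normalizeArabic(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\u0622', '\u0623', '\u0625': // alef madda, alef hamza above/below
			return '\u0627' // bare alef
		case '\u0649': // dotless yeh
			return '\u064A' // yeh
		case '\u0629': // teh marbuta
			return '\u0647' // heh
		case '\u0640', // tatweel
			'\u064B', '\u064C', '\u064D', // fathatan, dammatan, kasratan
			'\u064E', '\u064F', '\u0650', // fatha, damma, kasra
			'\u0651', '\u0652': // shadda, sukun
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// Alef with hamza above followed by three plain letters.
	in := "\u0623\u062d\u0645\u062f"
	fmt.Println(normalizeArabic(in) == "\u0627\u062d\u0645\u062f")
}
```

Because strings.Map builds a new string, there is no index bookkeeping after a deletion, which is the part of the loop above that needs the `i--` adjustment.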

@ -0,0 +1,234 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestArabicNormalizeFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
// AlifMadda
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("آجن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("اجن"),
},
},
},
// AlifHamzaAbove
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("أحمد"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("احمد"),
},
},
},
// AlifHamzaBelow
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("إعاذ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("اعاذ"),
},
},
},
// AlifMaksura
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("بنى"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("بني"),
},
},
},
// TehMarbuta
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("فاطمة"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("فاطمه"),
},
},
},
// Tatweel
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("روبرـــــت"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("روبرت"),
},
},
},
// Fatha
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("مَبنا"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("مبنا"),
},
},
},
// Kasra
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("علِي"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("علي"),
},
},
},
// Damma
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("بُوات"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("بوات"),
},
},
},
// Fathatan
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ولداً"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ولدا"),
},
},
},
// Kasratan
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ولدٍ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ولد"),
},
},
},
// Dammatan
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ولدٌ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ولد"),
},
},
},
// Sukun
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("نلْسون"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("نلسون"),
},
},
},
// Shaddah
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("هتميّ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("هتمي"),
},
},
},
// empty
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
}
arabicNormalizeFilter := NewArabicNormalizeFilter()
for _, test := range tests {
actual := arabicNormalizeFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %#v, got %#v", test.output, actual)
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,121 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StemmerName = "stemmer_ar"
// These were obtained from org.apache.lucene.analysis.ar.ArabicStemmer
var prefixes = [][]rune{
[]rune("ال"),
[]rune("وال"),
[]rune("بال"),
[]rune("كال"),
[]rune("فال"),
[]rune("لل"),
[]rune("و"),
}
var suffixes = [][]rune{
[]rune("ها"),
[]rune("ان"),
[]rune("ات"),
[]rune("ون"),
[]rune("ين"),
[]rune("يه"),
[]rune("ية"),
[]rune("ه"),
[]rune("ة"),
[]rune("ي"),
}
type ArabicStemmerFilter struct{}
func NewArabicStemmerFilter() *ArabicStemmerFilter {
return &ArabicStemmerFilter{}
}
func (s *ArabicStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
term := stem(token.Term)
token.Term = term
}
return input
}
func canStemPrefix(input, prefix []rune) bool {
// The single-letter wa- prefix is stripped only when at least 3 characters
// remain afterwards, i.e. the input has 4 or more characters.
if len(prefix) == 1 && len(input) < 4 {
return false
}
// Every prefix must leave at least 2 characters after stripping.
if len(input)-len(prefix) < 2 {
return false
}
for i := range prefix {
if prefix[i] != input[i] {
return false
}
}
return true
}
func canStemSuffix(input, suffix []rune) bool {
// All suffixes require at least 2 characters after stemming.
if len(input)-len(suffix) < 2 {
return false
}
stemEnd := len(input) - len(suffix)
for i := range suffix {
if suffix[i] != input[stemEnd+i] {
return false
}
}
return true
}
func stem(input []byte) []byte {
runes := bytes.Runes(input)
// Strip a single prefix.
for _, p := range prefixes {
if canStemPrefix(runes, p) {
runes = runes[len(p):]
break
}
}
// Strip off multiple suffixes, in their order in the suffixes array.
for _, s := range suffixes {
if canStemSuffix(runes, s) {
runes = runes[:len(runes)-len(s)]
}
}
return analysis.BuildTermFromRunes(runes)
}
func StemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewArabicStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(StemmerName, StemmerFilterConstructor)
if err != nil {
panic(err)
}
}
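
A self-contained sketch of the stemming procedure above, assuming only the standard library. The stemArabic function and the arPrefixes/arSuffixes tables are hypothetical names; the entries are the same prefixes and suffixes listed in this file, written as escapes so the code points are unambiguous. As above, at most one prefix is stripped, then every matching suffix is stripped in table order.

```go
package main

import "fmt"

// arPrefixes mirrors the prefix table above (al-, wal-, bal-, kal-, fal-, ll-, wa-).
var arPrefixes = []string{
	"\u0627\u0644",
	"\u0648\u0627\u0644",
	"\u0628\u0627\u0644",
	"\u0643\u0627\u0644",
	"\u0641\u0627\u0644",
	"\u0644\u0644",
	"\u0648",
}

// arSuffixes mirrors the suffix table above, in the same order.
var arSuffixes = []string{
	"\u0647\u0627", "\u0627\u0646", "\u0627\u062a", "\u0648\u0646", "\u064a\u0646",
	"\u064a\u0647", "\u064a\u0629", "\u0647", "\u0629", "\u064a",
}

// stemArabic strips one prefix and any number of suffixes from word.
func stemArabic(word string) string {
	runes := []rune(word)
	for _, p := range arPrefixes {
		pr := []rune(p)
		// Every prefix must leave 2 runes; the one-rune prefix must leave 3.
		minLen := len(pr) + 2
		if len(pr) == 1 {
			minLen = 4
		}
		if len(runes) >= minLen && string(runes[:len(pr)]) == p {
			runes = runes[len(pr):]
			break // only a single prefix is removed
		}
	}
	for _, s := range arSuffixes {
		sr := []rune(s)
		// Every suffix must leave at least 2 runes.
		if len(runes)-len(sr) >= 2 && string(runes[len(runes)-len(sr):]) == s {
			runes = runes[:len(runes)-len(sr)]
		}
	}
	return string(runes)
}

func main() {
	// al- followed by a three-letter word: the prefix is stripped.
	fmt.Println(stemArabic("\u0627\u0644\u062d\u0633\u0646") == "\u062d\u0633\u0646")
}
```

The length checks are what keep very short words intact: a three-letter word that happens to begin with al- is returned unchanged because stripping would leave a single letter.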

@ -0,0 +1,397 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestArabicStemmerFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
// AlPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("الحسن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("حسن"),
},
},
},
// WalPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("والحسن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("حسن"),
},
},
},
// BalPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("بالحسن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("حسن"),
},
},
},
// KalPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("كالحسن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("حسن"),
},
},
},
// FalPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("فالحسن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("حسن"),
},
},
},
// LlPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("للاخر"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("اخر"),
},
},
},
// WaPrefix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("وحسن"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("حسن"),
},
},
},
// AhSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("زوجها"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("زوج"),
},
},
},
// AnSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدان"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// AtSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدات"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// WnSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدون"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// YnSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدين"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// YhSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهديه"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// YpSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدية"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// HSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهده"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// PSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدة"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// YSuffix
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدي"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// ComboPrefSuf
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("وساهدون"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// ComboSuf
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهدهات"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ساهد"),
},
},
},
// Shouldn't Stem
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("الو"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("الو"),
},
},
},
// NonArabic
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("English"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("English"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلام"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلام"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("السلام"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلام"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلامة"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلام"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("السلامة"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("سلام"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("الوصل"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("وصل"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("والصل"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("صل"),
},
},
},
// Empty
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
}
arabicStemmerFilter := NewArabicStemmerFilter()
for _, test := range tests {
actual := arabicStemmerFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %#v, got %#v", test.output, actual)
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ar
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,152 @@
package ar
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_ar"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis
// ` was changed to ' to allow for literal string
var ArabicStopWords = []byte(`# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
# Cleaned on October 11, 2009 (not normalized, so use before normalization)
# This means that when modifying this list, you might need to add some
# redundant entries, for example containing forms with both أ and ا
من
ومن
منها
منه
في
وفي
فيها
فيه
و
ف
ثم
او
أو
ب
بها
به
ا
أ
اى
اي
أي
أى
لا
ولا
الا
ألا
إلا
لكن
ما
وما
كما
فما
عن
مع
اذا
إذا
ان
أن
إن
انها
أنها
إنها
انه
أنه
إنه
بان
بأن
فان
فأن
وان
وأن
وإن
التى
التي
الذى
الذي
الذين
الى
الي
إلى
إلي
على
عليها
عليه
اما
أما
إما
ايضا
أيضا
كل
وكل
لم
ولم
لن
ولن
هى
هي
هو
وهى
وهي
وهو
فهى
فهي
فهو
انت
أنت
لك
لها
له
هذه
هذا
تلك
ذلك
هناك
كانت
كان
يكون
تكون
وكانت
وكان
غير
بعض
قد
نحو
بين
بينما
منذ
ضمن
حيث
الان
الآن
خلال
بعد
قبل
حتى
عند
عندما
لدى
جميع
`)
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(ArabicStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}
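
A rough sketch of how a stop word list in this format might be loaded and applied, using only the standard library. The loadStopWords and removeStopWords functions are hypothetical; treating everything after `#` as a comment is an assumption about the list format, not a description of how analysis.TokenMap.LoadBytes or the stop token filter are implemented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// loadStopWords parses a newline-separated word list. Text after '#' on a
// line is treated as a comment (an assumption), and blank lines are skipped.
func loadStopWords(data string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		for _, w := range strings.Fields(line) {
			words[w] = true
		}
	}
	return words
}

// removeStopWords returns the tokens that are not in the stop set.
func removeStopWords(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	stop := loadStopWords("# sample list\nde\nla\n")
	fmt.Println(removeStopWords([]string{"la", "casa", "de", "fusta"}, stop))
}
```

Unlike this sketch, the analyzer test earlier in this commit expects surviving tokens to keep their original Position values, so a removed stop word leaves a gap in the position sequence.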

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package bg
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,220 @@
package bg
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_bg"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string
var BulgarianStopWords = []byte(`# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
а
аз
ако
ала
бе
без
беше
би
бил
била
били
било
близо
бъдат
бъде
бяха
в
вас
ваш
ваша
вероятно
вече
взема
ви
вие
винаги
все
всеки
всички
всичко
всяка
във
въпреки
върху
г
ги
главно
го
д
да
дали
до
докато
докога
дори
досега
доста
е
едва
един
ето
за
зад
заедно
заради
засега
затова
защо
защото
и
из
или
им
има
имат
иска
й
каза
как
каква
какво
както
какъв
като
кога
когато
което
които
кой
който
колко
която
къде
където
към
ли
м
ме
между
мен
ми
мнозина
мога
могат
може
моля
момента
му
н
на
над
назад
най
направи
напред
например
нас
не
него
нея
ни
ние
никой
нито
но
някои
някой
няма
обаче
около
освен
особено
от
отгоре
отново
още
пак
по
повече
повечето
под
поне
поради
после
почти
прави
пред
преди
през
при
пък
първо
с
са
само
се
сега
си
скоро
след
сме
според
сред
срещу
сте
съм
със
също
т
тази
така
такива
такъв
там
твой
те
тези
ти
тн
то
това
тогава
този
той
толкова
точно
трябва
тук
тъй
тя
тях
у
харесва
ч
че
често
чрез
ще
щом
я
`)
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(BulgarianStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,33 @@
package ca
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const ArticlesName = "articles_ca"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis
var CatalanArticles = []byte(`
d
l
m
n
s
t
`)
func ArticlesTokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(CatalanArticles)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(ArticlesName, ArticlesTokenMapConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,40 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ca
import (
"fmt"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/elision"
"github.com/blevesearch/bleve/v2/registry"
)
const ElisionName = "elision_ca"
func ElisionFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
articlesTokenMap, err := cache.TokenMapNamed(ArticlesName)
if err != nil {
return nil, fmt.Errorf("error building elision filter: %v", err)
}
return elision.NewElisionFilter(articlesTokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(ElisionName, ElisionFilterConstructor)
if err != nil {
panic(err)
}
}
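
A small sketch of what elision means for the Catalan articles registered above: a leading article followed by an apostrophe is dropped from the term. The elide function is hypothetical and only illustrates the idea; accepting both the ASCII apostrophe and U+2019, and matching articles case-sensitively, are assumptions rather than a description of elision.NewElisionFilter.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// elide removes a leading article plus apostrophe when the text before the
// first apostrophe is in the articles set; otherwise it returns term as-is.
func elide(term string, articles map[string]bool) string {
	i := strings.IndexAny(term, "'\u2019")
	if i <= 0 {
		return term // no apostrophe, or nothing before it
	}
	if !articles[term[:i]] {
		return term
	}
	// Skip the apostrophe itself, which may be more than one byte long.
	_, size := utf8.DecodeRuneInString(term[i:])
	return term[i+size:]
}

func main() {
	articles := map[string]bool{"d": true, "l": true, "m": true, "n": true, "s": true, "t": true}
	fmt.Println(elide("l'Institut", articles), elide("d'Estudis", articles))
}
```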

@ -0,0 +1,61 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ca
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestCatalanElision(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("l'Institut"),
},
&analysis.Token{
Term: []byte("d'Estudis"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Institut"),
},
&analysis.Token{
Term: []byte("Estudis"),
},
},
},
}
cache := registry.NewCache()
elisionFilter, err := cache.TokenFilterNamed(ElisionName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := elisionFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ca
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,247 @@
package ca
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_ca"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string
var CatalanStopWords = []byte(`# Catalan stopwords from http://github.com/vcl/cue.language (Apache 2 Licensed)
a
abans
ací
ah
així
això
al
als
aleshores
algun
alguna
algunes
alguns
alhora
allà
allí
allò
altra
altre
altres
amb
ambdós
ambdues
apa
aquell
aquella
aquelles
aquells
aquest
aquesta
aquestes
aquests
aquí
baix
cada
cadascú
cadascuna
cadascunes
cadascuns
com
contra
d'un
d'una
d'unes
d'uns
dalt
de
del
dels
des
després
dins
dintre
donat
doncs
durant
e
eh
el
els
em
en
encara
ens
entre
érem
eren
éreu
es
és
esta
està
estàvem
estaven
estàveu
esteu
et
etc
ets
fins
fora
gairebé
ha
han
has
havia
he
hem
heu
hi
ho
i
igual
iguals
ja
l'hi
la
les
li
li'n
llavors
m'he
ma
mal
malgrat
mateix
mateixa
mateixes
mateixos
me
mentre
més
meu
meus
meva
meves
molt
molta
moltes
molts
mon
mons
n'he
n'hi
ne
ni
no
nogensmenys
només
nosaltres
nostra
nostre
nostres
o
oh
oi
on
pas
pel
pels
per
però
perquè
poc
poca
pocs
poques
potser
propi
qual
quals
quan
quant
que
què
quelcom
qui
quin
quina
quines
quins
s'ha
s'han
sa
semblant
semblants
ses
seu
seus
seva
seva
seves
si
sobre
sobretot
sóc
solament
sols
son
són
sons
sota
sou
t'ha
t'han
t'he
ta
tal
també
tampoc
tan
tant
tanta
tantes
teu
teus
teva
teves
ton
tons
tot
tota
totes
tots
un
una
unes
uns
us
va
vaig
vam
van
vas
veu
vosaltres
vostra
vostre
vostres
`)
// TokenMapConstructor builds a token map loaded with CatalanStopWords.
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(CatalanStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,60 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cjk
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)
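// AnalyzerName is the name under which the CJK analyzer is registered.
const AnalyzerName = "cjk"
// AnalyzerConstructor builds the CJK analyzer: the unicode tokenizer
// followed by the width, lowercase, and bigram token filters, in that order.
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {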
tokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
widthFilter, err := cache.TokenFilterNamed(WidthName)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
bigramFilter, err := cache.TokenFilterNamed(BigramName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: tokenizer,
TokenFilters: []analysis.TokenFilter{
widthFilter,
toLowerFilter,
bigramFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,642 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cjk
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestCJKAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
{
input: []byte("こんにちは世界"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("こん"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("んに"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("にち"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("ちは"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("は世"),
Type: analysis.Double,
Position: 5,
Start: 12,
End: 18,
},
&analysis.Token{
Term: []byte("世界"),
Type: analysis.Double,
Position: 6,
Start: 15,
End: 21,
},
},
},
{
input: []byte("一二三四五六七八九十"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("一二"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("二三"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("三四"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("四五"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("五六"),
Type: analysis.Double,
Position: 5,
Start: 12,
End: 18,
},
&analysis.Token{
Term: []byte("六七"),
Type: analysis.Double,
Position: 6,
Start: 15,
End: 21,
},
&analysis.Token{
Term: []byte("七八"),
Type: analysis.Double,
Position: 7,
Start: 18,
End: 24,
},
&analysis.Token{
Term: []byte("八九"),
Type: analysis.Double,
Position: 8,
Start: 21,
End: 27,
},
&analysis.Token{
Term: []byte("九十"),
Type: analysis.Double,
Position: 9,
Start: 24,
End: 30,
},
},
},
{
input: []byte("一 二三四 五六七八九 十"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("一"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("二三"),
Type: analysis.Double,
Position: 2,
Start: 4,
End: 10,
},
&analysis.Token{
Term: []byte("三四"),
Type: analysis.Double,
Position: 3,
Start: 7,
End: 13,
},
&analysis.Token{
Term: []byte("五六"),
Type: analysis.Double,
Position: 4,
Start: 14,
End: 20,
},
&analysis.Token{
Term: []byte("六七"),
Type: analysis.Double,
Position: 5,
Start: 17,
End: 23,
},
&analysis.Token{
Term: []byte("七八"),
Type: analysis.Double,
Position: 6,
Start: 20,
End: 26,
},
&analysis.Token{
Term: []byte("八九"),
Type: analysis.Double,
Position: 7,
Start: 23,
End: 29,
},
&analysis.Token{
Term: []byte("十"),
Type: analysis.Single,
Position: 8,
Start: 30,
End: 33,
},
},
},
{
input: []byte("abc defgh ijklmn opqrstu vwxy z"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("abc"),
Type: analysis.AlphaNumeric,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("defgh"),
Type: analysis.AlphaNumeric,
Position: 2,
Start: 4,
End: 9,
},
&analysis.Token{
Term: []byte("ijklmn"),
Type: analysis.AlphaNumeric,
Position: 3,
Start: 10,
End: 16,
},
&analysis.Token{
Term: []byte("opqrstu"),
Type: analysis.AlphaNumeric,
Position: 4,
Start: 17,
End: 24,
},
&analysis.Token{
Term: []byte("vwxy"),
Type: analysis.AlphaNumeric,
Position: 5,
Start: 25,
End: 29,
},
&analysis.Token{
Term: []byte("z"),
Type: analysis.AlphaNumeric,
Position: 6,
Start: 30,
End: 31,
},
},
},
{
input: []byte("あい"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("あい"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
},
},
{
input: []byte("あい "),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("あい"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
},
},
{
input: []byte("test"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("test"),
Type: analysis.AlphaNumeric,
Position: 1,
Start: 0,
End: 4,
},
},
},
{
input: []byte("test "),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("test"),
Type: analysis.AlphaNumeric,
Position: 1,
Start: 0,
End: 4,
},
},
},
{
input: []byte("あいtest"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("あい"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("test"),
Type: analysis.AlphaNumeric,
Position: 2,
Start: 6,
End: 10,
},
},
},
{
input: []byte("testあい "),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("test"),
Type: analysis.AlphaNumeric,
Position: 1,
Start: 0,
End: 4,
},
&analysis.Token{
Term: []byte("あい"),
Type: analysis.Double,
Position: 2,
Start: 4,
End: 10,
},
},
},
{
input: []byte("あいうえおabcかきくけこ"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("あい"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("いう"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("うえ"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("えお"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("abc"),
Type: analysis.AlphaNumeric,
Position: 5,
Start: 15,
End: 18,
},
&analysis.Token{
Term: []byte("かき"),
Type: analysis.Double,
Position: 6,
Start: 18,
End: 24,
},
&analysis.Token{
Term: []byte("きく"),
Type: analysis.Double,
Position: 7,
Start: 21,
End: 27,
},
&analysis.Token{
Term: []byte("くけ"),
Type: analysis.Double,
Position: 8,
Start: 24,
End: 30,
},
&analysis.Token{
Term: []byte("けこ"),
Type: analysis.Double,
Position: 9,
Start: 27,
End: 33,
},
},
},
{
input: []byte("あいうえおabんcかきくけ こ"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("あい"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("いう"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("うえ"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("えお"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("ab"),
Type: analysis.AlphaNumeric,
Position: 5,
Start: 15,
End: 17,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Single,
Position: 6,
Start: 17,
End: 20,
},
&analysis.Token{
Term: []byte("c"),
Type: analysis.AlphaNumeric,
Position: 7,
Start: 20,
End: 21,
},
&analysis.Token{
Term: []byte("かき"),
Type: analysis.Double,
Position: 8,
Start: 21,
End: 27,
},
&analysis.Token{
Term: []byte("きく"),
Type: analysis.Double,
Position: 9,
Start: 24,
End: 30,
},
&analysis.Token{
Term: []byte("くけ"),
Type: analysis.Double,
Position: 10,
Start: 27,
End: 33,
},
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Single,
Position: 11,
Start: 34,
End: 37,
},
},
},
{
input: []byte("一 روبرت موير"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("一"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("روبرت"),
Type: analysis.AlphaNumeric,
Position: 2,
Start: 4,
End: 14,
},
&analysis.Token{
Term: []byte("موير"),
Type: analysis.AlphaNumeric,
Position: 3,
Start: 15,
End: 23,
},
},
},
{
input: []byte("一 رُوبرت موير"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("一"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("رُوبرت"),
Type: analysis.AlphaNumeric,
Position: 2,
Start: 4,
End: 16,
},
&analysis.Token{
Term: []byte("موير"),
Type: analysis.AlphaNumeric,
Position: 3,
Start: 17,
End: 25,
},
},
},
{
input: []byte("𩬅艱鍟䇹愯瀛"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("𩬅艱"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 7,
},
&analysis.Token{
Term: []byte("艱鍟"),
Type: analysis.Double,
Position: 2,
Start: 4,
End: 10,
},
&analysis.Token{
Term: []byte("鍟䇹"),
Type: analysis.Double,
Position: 3,
Start: 7,
End: 13,
},
&analysis.Token{
Term: []byte("䇹愯"),
Type: analysis.Double,
Position: 4,
Start: 10,
End: 16,
},
&analysis.Token{
Term: []byte("愯瀛"),
Type: analysis.Double,
Position: 5,
Start: 13,
End: 19,
},
},
},
{
input: []byte("一"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("一"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
},
},
{
input: []byte("一丁丂"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("一丁"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("丁丂"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
},
},
}
cache := registry.NewCache()
for _, test := range tests {
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
}
}
}
func BenchmarkCJKAnalyzer(b *testing.B) {
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
b.Fatal(err)
}
for i := 0; i < b.N; i++ {
analyzer.Analyze(bleveWikiArticleJapanese)
}
}
var bleveWikiArticleJapanese = []byte(`加圧容器に貯蔵されている液体物質はその時の気液平衡状態にあるが火災により容器が加熱されていると容器内の液体はその物質の大気圧のもとでの沸点より十分に高い温度まで加熱され圧力も高くなるこの状態で容器が破裂すると容器内部の圧力は瞬間的に大気圧にまで低下する
この時に容器内の平衡状態が破られ液体は突沸し気体になることで爆発現象を起こす液化石油ガスなどではさらに拡散して空気と混ざったガスが自由空間蒸気雲爆発を起こす液化石油ガスなどの常温常圧で気体になる物を高い圧力で液化して収納している容器あるいはそのような液体を輸送するためのパイプラインや配管などが火災などによって破壊されたときに起きる
ブリーブという現象が明らかになったのはフランスリヨンの郊外にあるフェザンという町のフェザン製油所ウニオンゼネラルペトロールで大規模な爆発火災事故が発生したときだと言われている
中身の液体が高温高圧の水である場合には水蒸気爆発と呼ばれる`)

@ -0,0 +1,210 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cjk
import (
"bytes"
"container/ring"
"unicode/utf8"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
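// BigramName is the name under which the CJK bigram token filter is registered.
const BigramName = "cjk_bigram"
// CJKBigramFilter forms bigrams from adjacent ideographic runes. Tokens of
// any other type pass through with only their position renumbered. When
// outputUnigram is true, each ideographic rune is also emitted as a unigram;
// otherwise a unigram is emitted only for a rune that has no adjacent
// ideographic neighbor.
type CJKBigramFilter struct {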
outputUnigram bool
}
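// NewCJKBigramFilter returns a CJKBigramFilter. Set outputUnigram to true to
// emit unigrams alongside the bigrams.
func NewCJKBigramFilter(outputUnigram bool) *CJKBigramFilter {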
return &CJKBigramFilter{
outputUnigram: outputUnigram,
}
}
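// Filter splits every Ideographic token into single runes and pairs runes
// whose byte offsets are contiguous into Double tokens. A two-slot ring
// buffers the previous and current rune. Output positions are renumbered
// consecutively starting at 1; a unigram emitted together with a bigram
// shares that bigram's position.
func (s *CJKBigramFilter) Filter(input analysis.TokenStream) analysis.TokenStream {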
r := ring.New(2)
itemsInRing := 0
pos := 1
outputPos := 1
rv := make(analysis.TokenStream, 0, len(input))
for _, tokout := range input {
if tokout.Type == analysis.Ideographic {
runes := bytes.Runes(tokout.Term)
sofar := 0
for _, run := range runes {
rlen := utf8.RuneLen(run)
token := &analysis.Token{
Term: tokout.Term[sofar : sofar+rlen],
Start: tokout.Start + sofar,
End: tokout.Start + sofar + rlen,
Position: pos,
Type: tokout.Type,
KeyWord: tokout.KeyWord,
}
pos++
sofar += rlen
if itemsInRing > 0 {
// if items already buffered
// check to see if this is aligned
curr := r.Value.(*analysis.Token)
if token.Start-curr.End != 0 {
// not aligned flush
flushToken := s.flush(r, &itemsInRing, outputPos)
if flushToken != nil {
outputPos++
rv = append(rv, flushToken)
}
}
}
// now we can add this token to the buffer
r = r.Next()
r.Value = token
if itemsInRing < 2 {
itemsInRing++
}
builtUnigram := false
if itemsInRing > 1 && s.outputUnigram {
unigram := s.buildUnigram(r, &itemsInRing, outputPos)
if unigram != nil {
builtUnigram = true
rv = append(rv, unigram)
}
}
bigramToken := s.outputBigram(r, &itemsInRing, outputPos)
if bigramToken != nil {
rv = append(rv, bigramToken)
outputPos++
}
// prev token should be removed if unigram was built
if builtUnigram {
itemsInRing--
}
}
} else {
// flush anything already buffered
flushToken := s.flush(r, &itemsInRing, outputPos)
if flushToken != nil {
rv = append(rv, flushToken)
outputPos++
}
// output this token as is
tokout.Position = outputPos
rv = append(rv, tokout)
outputPos++
}
}
// deal with possible trailing unigram
if itemsInRing == 1 || s.outputUnigram {
if itemsInRing == 2 {
r = r.Next()
}
unigram := s.buildUnigram(r, &itemsInRing, outputPos)
if unigram != nil {
rv = append(rv, unigram)
}
}
return rv
}
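// flush returns a unigram for a lone buffered token (nil if the buffer does
// not hold exactly one token), then clears the current ring slot and resets
// *itemsInRing to zero.
func (s *CJKBigramFilter) flush(r *ring.Ring, itemsInRing *int, pos int) *analysis.Token {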
var rv *analysis.Token
if *itemsInRing == 1 {
rv = s.buildUnigram(r, itemsInRing, pos)
}
r.Value = nil
*itemsInRing = 0
return rv
}
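// outputBigram returns a Double token that concatenates the previous and
// current buffered tokens, or nil unless exactly two tokens are buffered.
// r must point at the most recently added token.
func (s *CJKBigramFilter) outputBigram(r *ring.Ring, itemsInRing *int, pos int) *analysis.Token {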
if *itemsInRing == 2 {
thisShingleRing := r.Move(-1)
shingledBytes := make([]byte, 0)
// do first token
prev := thisShingleRing.Value.(*analysis.Token)
shingledBytes = append(shingledBytes, prev.Term...)
// do second token
thisShingleRing = thisShingleRing.Next()
curr := thisShingleRing.Value.(*analysis.Token)
shingledBytes = append(shingledBytes, curr.Term...)
token := analysis.Token{
Type: analysis.Double,
Term: shingledBytes,
Position: pos,
Start: prev.Start,
End: curr.End,
}
return &token
}
return nil
}
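// buildUnigram returns a Single token at position pos. With two buffered
// tokens it uses the previous one (r.Move(-1)); with one it uses the token
// at r. For any other count it returns nil. It does not modify *itemsInRing.
func (s *CJKBigramFilter) buildUnigram(r *ring.Ring, itemsInRing *int, pos int) *analysis.Token {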
switch *itemsInRing {
case 2:
thisShingleRing := r.Move(-1)
// do first token
prev := thisShingleRing.Value.(*analysis.Token)
token := analysis.Token{
Type: analysis.Single,
Term: prev.Term,
Position: pos,
Start: prev.Start,
End: prev.End,
}
return &token
case 1:
// do first token
prev := r.Value.(*analysis.Token)
token := analysis.Token{
Type: analysis.Single,
Term: prev.Term,
Position: pos,
Start: prev.Start,
End: prev.End,
}
return &token
}
return nil
}
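// CJKBigramFilterConstructor builds a CJKBigramFilter from config. The
// optional boolean key "output_unigram" enables unigram output; it defaults
// to false when missing or not a bool.
func CJKBigramFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {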
outputUnigram := false
outVal, ok := config["output_unigram"].(bool)
if ok {
outputUnigram = outVal
}
return NewCJKBigramFilter(outputUnigram), nil
}
func init() {
err := registry.RegisterTokenFilter(BigramName, CJKBigramFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,848 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cjk
import (
"container/ring"
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
// Helper function to create a token
func makeToken(term string, start, end, pos int) *analysis.Token {
return &analysis.Token{
Term: []byte(term),
Start: start,
End: end,
Position: pos, // Note: buildUnigram uses the 'pos' argument, not the token's original pos
Type: analysis.Ideographic,
}
}
func TestCJKBigramFilter_buildUnigram(t *testing.T) {
filter := NewCJKBigramFilter(false)
tests := []struct {
name string
ringSetup func() (*ring.Ring, int) // Function to set up the ring and itemsInRing
inputPos int // Position to pass to buildUnigram
expectToken *analysis.Token
}{
{
name: "itemsInRing == 2",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("一", 0, 3, 1) // Original pos 1
token2 := makeToken("二", 3, 6, 2) // Original pos 2
r.Value = token1
r = r.Next()
r.Value = token2
// r currently points to token2, r.Move(-1) points to token1
return r, 2
},
inputPos: 10, // Expected output position
expectToken: &analysis.Token{
Type: analysis.Single,
Term: []byte("一"),
Position: 10, // Should use inputPos
Start: 0,
End: 3,
},
},
{
name: "itemsInRing == 1 (ring points to the single item)",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("三", 6, 9, 3)
r.Value = token1
// r points to token1
return r, 1
},
inputPos: 11,
expectToken: &analysis.Token{
Type: analysis.Single,
Term: []byte("三"),
Position: 11, // Should use inputPos
Start: 6,
End: 9,
},
},
{
name: "itemsInRing == 1 (ring points to nil, next is the single item)",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("四", 9, 12, 4)
r = r.Next() // advance to the second slot, whose Value is still nil
r.Value = token1
// r points to token1
return r, 1
},
inputPos: 12,
expectToken: &analysis.Token{
Type: analysis.Single,
Term: []byte("四"),
Position: 12, // Should use inputPos
Start: 9,
End: 12,
},
},
{
name: "itemsInRing == 0",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
// Ring is empty
return r, 0
},
inputPos: 13,
expectToken: nil, // Expect nil when itemsInRing is not 1 or 2
},
{
name: "itemsInRing > 2 (should behave like 0)",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("五", 12, 15, 5)
token2 := makeToken("六", 15, 18, 6)
r.Value = token1
r = r.Next()
r.Value = token2
// Simulate incorrect itemsInRing count
return r, 3
},
inputPos: 14,
expectToken: nil, // Expect nil when itemsInRing is not 1 or 2
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
ringPtr, itemsInRing := tt.ringSetup()
itemsInRingCopy := itemsInRing // Pass a pointer to a copy
gotToken := filter.buildUnigram(ringPtr, &itemsInRingCopy, tt.inputPos)
if !reflect.DeepEqual(gotToken, tt.expectToken) {
t.Errorf("buildUnigram() got = %v, want %v", gotToken, tt.expectToken)
}
// Check if itemsInRing was modified (it shouldn't be by buildUnigram)
if itemsInRingCopy != itemsInRing {
t.Errorf("buildUnigram() modified itemsInRing, got = %d, want %d", itemsInRingCopy, itemsInRing)
}
})
}
}
func TestCJKBigramFilter_outputBigram(t *testing.T) {
// Create a filter instance (outputUnigram value doesn't matter for outputBigram)
filter := NewCJKBigramFilter(false)
tests := []struct {
name string
ringSetup func() (*ring.Ring, int) // Function to set up the ring and itemsInRing
inputPos int // Position to pass to outputBigram
expectToken *analysis.Token
}{
{
name: "itemsInRing == 2",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("一", 0, 3, 1) // Original pos 1
token2 := makeToken("二", 3, 6, 2) // Original pos 2
r.Value = token1
r = r.Next()
r.Value = token2
// r currently points to token2, r.Move(-1) points to token1
return r, 2
},
inputPos: 10, // Expected output position
expectToken: &analysis.Token{
Type: analysis.Double,
Term: []byte("一二"), // Combined term
Position: 10, // Should use inputPos
Start: 0, // Start of first token
End: 6, // End of second token
},
},
{
name: "itemsInRing == 2 with different terms",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("你好", 0, 6, 1)
token2 := makeToken("世界", 6, 12, 2)
r.Value = token1
r = r.Next()
r.Value = token2
return r, 2
},
inputPos: 5,
expectToken: &analysis.Token{
Type: analysis.Double,
Term: []byte("你好世界"),
Position: 5,
Start: 0,
End: 12,
},
},
{
name: "itemsInRing == 1",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("三", 6, 9, 3)
r.Value = token1
return r, 1
},
inputPos: 11,
expectToken: nil, // Expect nil when itemsInRing is not 2
},
{
name: "itemsInRing == 0",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
// Ring is empty
return r, 0
},
inputPos: 13,
expectToken: nil, // Expect nil when itemsInRing is not 2
},
{
name: "itemsInRing > 2 (should behave like 0)",
ringSetup: func() (*ring.Ring, int) {
r := ring.New(2)
token1 := makeToken("五", 12, 15, 5)
token2 := makeToken("六", 15, 18, 6)
r.Value = token1
r = r.Next()
r.Value = token2
// Simulate incorrect itemsInRing count
return r, 3
},
inputPos: 14,
expectToken: nil, // Expect nil when itemsInRing is not 2
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
ringPtr, itemsInRing := tt.ringSetup()
itemsInRingCopy := itemsInRing // Pass a pointer to a copy
gotToken := filter.outputBigram(ringPtr, &itemsInRingCopy, tt.inputPos)
if !reflect.DeepEqual(gotToken, tt.expectToken) {
t.Errorf("outputBigram() got = %v, want %v", gotToken, tt.expectToken)
}
// Check if itemsInRing was modified (it shouldn't be by outputBigram)
if itemsInRingCopy != itemsInRing {
t.Errorf("outputBigram() modified itemsInRing, got = %d, want %d", itemsInRingCopy, itemsInRing)
}
})
}
}
func TestCJKBigramFilter(t *testing.T) {
tests := []struct {
outputUnigram bool
input analysis.TokenStream
output analysis.TokenStream
}{
// first test that non-adjacent terms are not combined
{
outputUnigram: false,
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Ideographic,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Ideographic,
Position: 2,
Start: 5,
End: 8,
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Single,
Position: 2,
Start: 5,
End: 8,
},
},
},
{
outputUnigram: false,
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Ideographic,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Ideographic,
Position: 2,
Start: 3,
End: 6,
},
&analysis.Token{
Term: []byte("に"),
Type: analysis.Ideographic,
Position: 3,
Start: 6,
End: 9,
},
&analysis.Token{
Term: []byte("ち"),
Type: analysis.Ideographic,
Position: 4,
Start: 9,
End: 12,
},
&analysis.Token{
Term: []byte("は"),
Type: analysis.Ideographic,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("世"),
Type: analysis.Ideographic,
Position: 6,
Start: 15,
End: 18,
},
&analysis.Token{
Term: []byte("界"),
Type: analysis.Ideographic,
Position: 7,
Start: 18,
End: 21,
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("こん"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("んに"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("にち"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("ちは"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("は世"),
Type: analysis.Double,
Position: 5,
Start: 12,
End: 18,
},
&analysis.Token{
Term: []byte("世界"),
Type: analysis.Double,
Position: 6,
Start: 15,
End: 21,
},
},
},
{
outputUnigram: true,
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Ideographic,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Ideographic,
Position: 2,
Start: 3,
End: 6,
},
&analysis.Token{
Term: []byte("に"),
Type: analysis.Ideographic,
Position: 3,
Start: 6,
End: 9,
},
&analysis.Token{
Term: []byte("ち"),
Type: analysis.Ideographic,
Position: 4,
Start: 9,
End: 12,
},
&analysis.Token{
Term: []byte("は"),
Type: analysis.Ideographic,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("世"),
Type: analysis.Ideographic,
Position: 6,
Start: 15,
End: 18,
},
&analysis.Token{
Term: []byte("界"),
Type: analysis.Ideographic,
Position: 7,
Start: 18,
End: 21,
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("こん"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Single,
Position: 2,
Start: 3,
End: 6,
},
&analysis.Token{
Term: []byte("んに"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("に"),
Type: analysis.Single,
Position: 3,
Start: 6,
End: 9,
},
&analysis.Token{
Term: []byte("にち"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("ち"),
Type: analysis.Single,
Position: 4,
Start: 9,
End: 12,
},
&analysis.Token{
Term: []byte("ちは"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("は"),
Type: analysis.Single,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("は世"),
Type: analysis.Double,
Position: 5,
Start: 12,
End: 18,
},
&analysis.Token{
Term: []byte("世"),
Type: analysis.Single,
Position: 6,
Start: 15,
End: 18,
},
&analysis.Token{
Term: []byte("世界"),
Type: analysis.Double,
Position: 6,
Start: 15,
End: 21,
},
&analysis.Token{
Term: []byte("界"),
Type: analysis.Single,
Position: 7,
Start: 18,
End: 21,
},
},
},
{
// Assuming that `、` is removed by unicode tokenizer from `こんにちは、世界`
outputUnigram: true,
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Ideographic,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Ideographic,
Position: 2,
Start: 3,
End: 6,
},
&analysis.Token{
Term: []byte("に"),
Type: analysis.Ideographic,
Position: 3,
Start: 6,
End: 9,
},
&analysis.Token{
Term: []byte("ち"),
Type: analysis.Ideographic,
Position: 4,
Start: 9,
End: 12,
},
&analysis.Token{
Term: []byte("は"),
Type: analysis.Ideographic,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("世"),
Type: analysis.Ideographic,
Position: 7,
Start: 18,
End: 21,
},
&analysis.Token{
Term: []byte("界"),
Type: analysis.Ideographic,
Position: 8,
Start: 21,
End: 24,
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Single,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("こん"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Single,
Position: 2,
Start: 3,
End: 6,
},
&analysis.Token{
Term: []byte("んに"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("に"),
Type: analysis.Single,
Position: 3,
Start: 6,
End: 9,
},
&analysis.Token{
Term: []byte("にち"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("ち"),
Type: analysis.Single,
Position: 4,
Start: 9,
End: 12,
},
&analysis.Token{
Term: []byte("ちは"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("は"),
Type: analysis.Single,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("世"),
Type: analysis.Single,
Position: 6,
Start: 18,
End: 21,
},
&analysis.Token{
Term: []byte("世界"),
Type: analysis.Double,
Position: 6,
Start: 18,
End: 24,
},
&analysis.Token{
Term: []byte("界"),
Type: analysis.Single,
Position: 7,
Start: 21,
End: 24,
},
},
},
{
outputUnigram: false,
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("こ"),
Type: analysis.Ideographic,
Position: 1,
Start: 0,
End: 3,
},
&analysis.Token{
Term: []byte("ん"),
Type: analysis.Ideographic,
Position: 2,
Start: 3,
End: 6,
},
&analysis.Token{
Term: []byte("に"),
Type: analysis.Ideographic,
Position: 3,
Start: 6,
End: 9,
},
&analysis.Token{
Term: []byte("ち"),
Type: analysis.Ideographic,
Position: 4,
Start: 9,
End: 12,
},
&analysis.Token{
Term: []byte("は"),
Type: analysis.Ideographic,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("cat"),
Type: analysis.AlphaNumeric,
Position: 6,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("世"),
Type: analysis.Ideographic,
Position: 7,
Start: 18,
End: 21,
},
&analysis.Token{
Term: []byte("界"),
Type: analysis.Ideographic,
Position: 8,
Start: 21,
End: 24,
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("こん"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("んに"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("にち"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("ちは"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("cat"),
Type: analysis.AlphaNumeric,
Position: 5,
Start: 12,
End: 15,
},
&analysis.Token{
Term: []byte("世界"),
Type: analysis.Double,
Position: 6,
Start: 18,
End: 24,
},
},
},
{
outputUnigram: false,
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("パイプライン"),
Type: analysis.Ideographic,
Position: 1,
Start: 0,
End: 18,
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("パイ"),
Type: analysis.Double,
Position: 1,
Start: 0,
End: 6,
},
&analysis.Token{
Term: []byte("イプ"),
Type: analysis.Double,
Position: 2,
Start: 3,
End: 9,
},
&analysis.Token{
Term: []byte("プラ"),
Type: analysis.Double,
Position: 3,
Start: 6,
End: 12,
},
&analysis.Token{
Term: []byte("ライ"),
Type: analysis.Double,
Position: 4,
Start: 9,
End: 15,
},
&analysis.Token{
Term: []byte("イン"),
Type: analysis.Double,
Position: 5,
Start: 12,
End: 18,
},
},
},
}
for _, test := range tests {
cjkBigramFilter := NewCJKBigramFilter(test.outputUnigram)
actual := cjkBigramFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %s, got %s", test.output, actual)
}
}
}

@ -0,0 +1,104 @@
// Copyright (c) 2016 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cjk
import (
"bytes"
"unicode/utf8"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const WidthName = "cjk_width"
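// CJKWidthFilter folds fullwidth ASCII variants (U+FF01..U+FF5E) into their
// basic Latin equivalents and halfwidth Katakana variants (U+FF65..U+FF9F)
// into standard Katakana, combining voiced and semi-voiced sound marks with
// the preceding kana where a precomposed form exists.
type CJKWidthFilter struct{}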
func NewCJKWidthFilter() *CJKWidthFilter {
return &CJKWidthFilter{}
}
func (s *CJKWidthFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
runeCount := utf8.RuneCount(token.Term)
runes := bytes.Runes(token.Term)
for i := 0; i < runeCount; i++ {
ch := runes[i]
if ch >= 0xFF01 && ch <= 0xFF5E {
// fullwidth ASCII variants
runes[i] -= 0xFEE0
} else if ch >= 0xFF65 && ch <= 0xFF9F {
// halfwidth Katakana variants
if (ch == 0xFF9E || ch == 0xFF9F) && i > 0 && combine(runes, i, ch) {
runes = analysis.DeleteRune(runes, i)
i--
runeCount = len(runes)
} else {
runes[i] = kanaNorm[ch-0xFF65]
}
}
}
token.Term = analysis.BuildTermFromRunes(runes)
}
return input
}
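// kanaNorm maps each halfwidth Katakana variant in U+FF65..U+FF9F, indexed by
// its offset from U+FF65, to the corresponding standard form.
var kanaNorm = []rune{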
0x30fb, 0x30f2, 0x30a1, 0x30a3, 0x30a5, 0x30a7, 0x30a9, 0x30e3, 0x30e5,
0x30e7, 0x30c3, 0x30fc, 0x30a2, 0x30a4, 0x30a6, 0x30a8, 0x30aa, 0x30ab,
0x30ad, 0x30af, 0x30b1, 0x30b3, 0x30b5, 0x30b7, 0x30b9, 0x30bb, 0x30bd,
0x30bf, 0x30c1, 0x30c4, 0x30c6, 0x30c8, 0x30ca, 0x30cb, 0x30cc, 0x30cd,
0x30ce, 0x30cf, 0x30d2, 0x30d5, 0x30d8, 0x30db, 0x30de, 0x30df, 0x30e0,
0x30e1, 0x30e2, 0x30e4, 0x30e6, 0x30e8, 0x30e9, 0x30ea, 0x30eb, 0x30ec,
0x30ed, 0x30ef, 0x30f3, 0x3099, 0x309A,
}
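// kanaCombineVoiced holds, for each rune in U+30A6..U+30FD (indexed by offset
// from U+30A6), the amount to add to obtain its voiced form. A zero entry
// means no combined form exists.
var kanaCombineVoiced = []rune{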
78, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
}
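// kanaCombineHalfVoiced is the counterpart of kanaCombineVoiced for the
// semi-voiced sound mark (U+FF9F), covering the same range U+30A6..U+30FD.
var kanaCombineHalfVoiced = []rune{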
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2,
0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
}
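// combine tries to merge the halfwidth voiced (U+FF9E) or semi-voiced
// (U+FF9F) sound mark r found at text[pos] into the preceding rune. It
// reports whether text[pos-1] was changed; on success the caller is expected
// to delete the mark at pos. The caller must guarantee pos > 0.
func combine(text []rune, pos int, r rune) bool {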
prev := text[pos-1]
if prev >= 0x30A6 && prev <= 0x30FD {
if r == 0xFF9F {
text[pos-1] += kanaCombineHalfVoiced[prev-0x30A6]
} else {
text[pos-1] += kanaCombineVoiced[prev-0x30A6]
}
return text[pos-1] != prev
}
return false
}
func CJKWidthFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewCJKWidthFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(WidthName, CJKWidthFilterConstructor)
if err != nil {
panic(err)
}
}
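
The fullwidth branch of the filter above relies on a fixed offset between the fullwidth block and ASCII. The following is a minimal standalone sketch of that one step, using only the standard library; `foldFullwidth` is a hypothetical helper written for illustration and is not part of bleve.

```go
package main

import "fmt"

// foldFullwidth is a hypothetical helper (not bleve API). It maps fullwidth
// ASCII variants U+FF01..U+FF5E onto U+0021..U+007E by subtracting 0xFEE0,
// the same offset used by the filter, and leaves every other rune untouched.
func foldFullwidth(s string) string {
	runes := []rune(s)
	for i, r := range runes {
		if r >= 0xFF01 && r <= 0xFF5E {
			runes[i] = r - 0xFEE0
		}
	}
	return string(runes)
}

func main() {
	// "\uff34\uff45\uff53\uff54" is the fullwidth spelling of "Test".
	fmt.Println(foldFullwidth("\uff34\uff45\uff53\uff54"))
}
```

Halfwidth Katakana cannot be handled by a single offset, which is presumably why the filter uses the `kanaNorm` lookup table for that range instead.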

@ -0,0 +1,93 @@
// Copyright (c) 2016 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cjk
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestCJKWidthFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("Ｔｅｓｔ"),
},
&analysis.Token{
Term: []byte("１２３４"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Test"),
},
&analysis.Token{
Term: []byte("1234"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ｶﾀｶﾅ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("カタカナ"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ｳﾞｨｯﾂ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("ヴィッツ"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("ﾊﾟﾅｿﾆｯｸ"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("パナソニック"),
},
},
},
}
for _, test := range tests {
cjkWidthFilter := NewCJKWidthFilter()
actual := cjkWidthFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %s, got %s", test.output, actual)
}
}
}

@ -0,0 +1,64 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2/registry"
)
const AnalyzerName = "ckb"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
normCkbFilter, err := cache.TokenFilterNamed(NormalizeName)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
stopCkbFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
stemmerCkbFilter, err := cache.TokenFilterNamed(StemmerName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: unicodeTokenizer,
TokenFilters: []analysis.TokenFilter{
normCkbFilter,
toLowerFilter,
stopCkbFilter,
stemmerCkbFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,77 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestSoraniAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
// stop word removal
{
input: []byte("ئەم پیاوە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 2,
Start: 7,
End: 17,
},
},
},
{
input: []byte("پیاوە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 10,
},
},
},
{
input: []byte("پیاو"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 8,
},
},
},
}
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
}
}
}

@ -0,0 +1,121 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"bytes"
"unicode"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const NormalizeName = "normalize_ckb"
const (
Yeh = '\u064A'
DotlessYeh = '\u0649'
FarsiYeh = '\u06CC'
Kaf = '\u0643'
Keheh = '\u06A9'
Heh = '\u0647'
Ae = '\u06D5'
Zwnj = '\u200C'
HehDoachashmee = '\u06BE'
TehMarbuta = '\u0629'
Reh = '\u0631'
Rreh = '\u0695'
RrehAbove = '\u0692'
Tatweel = '\u0640'
Fathatan = '\u064B'
Dammatan = '\u064C'
Kasratan = '\u064D'
Fatha = '\u064E'
Damma = '\u064F'
Kasra = '\u0650'
Shadda = '\u0651'
Sukun = '\u0652'
)
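// SoraniNormalizeFilter normalizes the Unicode representation of Sorani
// (Central Kurdish) text so that orthographic variants of the same word
// produce the same term.
type SoraniNormalizeFilter struct {
}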
func NewSoraniNormalizeFilter() *SoraniNormalizeFilter {
return &SoraniNormalizeFilter{}
}
func (s *SoraniNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
term := normalize(token.Term)
token.Term = term
}
return input
}
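// normalize rewrites input rune by rune: Yeh and DotlessYeh become FarsiYeh,
// Kaf becomes Keheh, Heh followed by ZWNJ, a term-final Heh, and TehMarbuta
// become Ae, HehDoachashmee becomes Heh, a term-initial Reh and RrehAbove
// become Rreh, and Tatweel, the listed diacritics, ZWNJ, and any other
// format (Cf) characters are removed.
func normalize(input []byte) []byte {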
runes := bytes.Runes(input)
for i := 0; i < len(runes); i++ {
switch runes[i] {
case Yeh, DotlessYeh:
runes[i] = FarsiYeh
case Kaf:
runes[i] = Keheh
case Zwnj:
if i > 0 && runes[i-1] == Heh {
runes[i-1] = Ae
}
runes = analysis.DeleteRune(runes, i)
i--
case Heh:
if i == len(runes)-1 {
runes[i] = Ae
}
case TehMarbuta:
runes[i] = Ae
case HehDoachashmee:
runes[i] = Heh
case Reh:
if i == 0 {
runes[i] = Rreh
}
case RrehAbove:
runes[i] = Rreh
case Tatweel, Kasratan, Dammatan, Fathatan, Fatha, Damma, Kasra, Shadda, Sukun:
runes = analysis.DeleteRune(runes, i)
i--
default:
if unicode.In(runes[i], unicode.Cf) {
runes = analysis.DeleteRune(runes, i)
i--
}
}
}
return analysis.BuildTermFromRunes(runes)
}
func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewSoraniNormalizeFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
if err != nil {
panic(err)
}
}
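
The normalizer above deletes runes while iterating and compensates with `i--`. The sketch below isolates that pattern with the standard library only; `deleteRune` and `stripFormat` are hypothetical names, and `deleteRune` is merely assumed to behave like the `analysis.DeleteRune` helper the original calls.

```go
package main

import (
	"fmt"
	"unicode"
)

// deleteRune is a hypothetical stand-in for a rune-deletion helper. It
// removes the rune at pos by shifting the tail left and shrinking the slice.
func deleteRune(in []rune, pos int) []rune {
	if pos < 0 || pos >= len(in) {
		return in
	}
	copy(in[pos:], in[pos+1:])
	return in[:len(in)-1]
}

// stripFormat removes every rune in Unicode category Cf (format characters
// such as the zero-width non-joiner U+200C).
func stripFormat(s string) string {
	runes := []rune(s)
	for i := 0; i < len(runes); i++ {
		if unicode.In(runes[i], unicode.Cf) {
			runes = deleteRune(runes, i)
			i-- // re-examine the rune that shifted into position i
		}
	}
	return string(runes)
}

func main() {
	// Two consecutive format characters between "a" and "b".
	fmt.Printf("%q\n", stripFormat("a\u200c\u200db"))
}
```

Without the `i--`, the second of two adjacent format characters would be skipped, because it moves into the index that was just processed.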

@ -0,0 +1,323 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestSoraniNormalizeFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
// test Y
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u064A"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06CC"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0649"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06CC"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06CC"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06CC"),
},
},
},
// test K
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0643"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06A9"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06A9"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06A9"),
},
},
},
// test H
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0647\u200C"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06D5"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0647\u200C\u06A9"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06D5\u06A9"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06BE"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0647"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0629"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u06D5"),
},
},
},
// test final H
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0647\u0647\u0647"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0647\u0647\u06D5"),
},
},
},
// test RR
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0692"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0695"),
},
},
},
// test initial RR
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0631\u0631\u0631"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0695\u0631\u0631"),
},
},
},
// test remove
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0640"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u064B"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u064C"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u064D"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u064E"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u064F"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0650"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0651"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u0652"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("\u200C"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
// empty
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
}
soraniNormalizeFilter := NewSoraniNormalizeFilter()
for _, test := range tests {
actual := soraniNormalizeFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %#v, got %#v", test.output, actual)
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,151 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"bytes"
"unicode/utf8"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StemmerName = "stemmer_ckb"
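// SoraniStemmerFilter applies a light suffix-stripping stemmer for Sorani
// (Central Kurdish) to every token that is not marked as a keyword.
type SoraniStemmerFilter struct {
}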
func NewSoraniStemmerFilter() *SoraniStemmerFilter {
return &SoraniStemmerFilter{}
}
func (s *SoraniStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
// if not protected keyword, stem it
if !token.KeyWord {
stemmed := stem(token.Term)
token.Term = stemmed
}
}
return input
}
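// stem strips, in order, an optional postposition, an optional possessive
// pronoun, and then at most one further nominal suffix (ezafe, definiteness,
// number, or demonstrative). All minimum-length checks count runes, not
// bytes.
func stem(input []byte) []byte {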
inputLen := utf8.RuneCount(input)
// postposition
if inputLen > 5 && bytes.HasSuffix(input, []byte("دا")) {
input = truncateRunes(input, 2)
inputLen = utf8.RuneCount(input)
} else if inputLen > 4 && bytes.HasSuffix(input, []byte("نا")) {
input = truncateRunes(input, 1)
inputLen = utf8.RuneCount(input)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("ەوە")) {
input = truncateRunes(input, 3)
inputLen = utf8.RuneCount(input)
}
// possessive pronoun
if inputLen > 6 &&
(bytes.HasSuffix(input, []byte("مان")) ||
bytes.HasSuffix(input, []byte("یان")) ||
bytes.HasSuffix(input, []byte("تان"))) {
input = truncateRunes(input, 3)
inputLen = utf8.RuneCount(input)
}
// indefinite singular ezafe
if inputLen > 6 && bytes.HasSuffix(input, []byte("ێکی")) {
return truncateRunes(input, 3)
} else if inputLen > 7 && bytes.HasSuffix(input, []byte("یەکی")) {
return truncateRunes(input, 4)
}
if inputLen > 5 && bytes.HasSuffix(input, []byte("ێک")) {
// indefinite singular
return truncateRunes(input, 2)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("یەک")) {
// indefinite singular
return truncateRunes(input, 3)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("ەکە")) {
// definite singular
return truncateRunes(input, 3)
} else if inputLen > 5 && bytes.HasSuffix(input, []byte("کە")) {
// definite singular
return truncateRunes(input, 2)
} else if inputLen > 7 && bytes.HasSuffix(input, []byte("ەکان")) {
// definite plural
return truncateRunes(input, 4)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("کان")) {
// definite plural
return truncateRunes(input, 3)
} else if inputLen > 7 && bytes.HasSuffix(input, []byte("یانی")) {
// indefinite plural ezafe
return truncateRunes(input, 4)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("انی")) {
// indefinite plural ezafe
return truncateRunes(input, 3)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("یان")) {
// indefinite plural
return truncateRunes(input, 3)
} else if inputLen > 5 && bytes.HasSuffix(input, []byte("ان")) {
// indefinite plural
return truncateRunes(input, 2)
} else if inputLen > 7 && bytes.HasSuffix(input, []byte("یانە")) {
// demonstrative plural
return truncateRunes(input, 4)
} else if inputLen > 6 && bytes.HasSuffix(input, []byte("انە")) {
// demonstrative plural
return truncateRunes(input, 3)
} else if inputLen > 5 && (bytes.HasSuffix(input, []byte("ایە")) || bytes.HasSuffix(input, []byte("ەیە"))) {
// demonstrative singular
return truncateRunes(input, 2)
} else if inputLen > 4 && bytes.HasSuffix(input, []byte("ە")) {
// demonstrative singular
return truncateRunes(input, 1)
} else if inputLen > 4 && bytes.HasSuffix(input, []byte("ی")) {
// absolute singular ezafe
return truncateRunes(input, 1)
}
return input
}
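// truncateRunes returns a copy of input with its last num runes removed.
// The caller must ensure input contains at least num runes.
func truncateRunes(input []byte, num int) []byte {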
runes := bytes.Runes(input)
runes = runes[:len(runes)-num]
out := buildTermFromRunes(runes)
return out
}
func buildTermFromRunes(runes []rune) []byte {
rv := make([]byte, 0, len(runes)*4)
for _, r := range runes {
runeBytes := make([]byte, utf8.RuneLen(r))
utf8.EncodeRune(runeBytes, r)
rv = append(rv, runeBytes...)
}
return rv
}
func StemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewSoraniStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(StemmerName, StemmerFilterConstructor)
if err != nil {
panic(err)
}
}
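
Each rule in the stemmer above pairs a suffix with a minimum length measured in runes. The following standalone sketch shows a single such rule, assuming a hypothetical helper `stripSuffix` that is not part of bleve; the strings are written as escapes so the code points are unambiguous.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stripSuffix is a hypothetical helper (not bleve API). It removes suffix
// from word only when word has more than minRunes runes, mirroring the
// "inputLen > N && HasSuffix" shape of the rules above.
func stripSuffix(word, suffix string, minRunes int) string {
	if utf8.RuneCountInString(word) > minRunes && strings.HasSuffix(word, suffix) {
		return strings.TrimSuffix(word, suffix)
	}
	return word
}

func main() {
	word := "\u067e\u06cc\u0627\u0648\u0627\u0646" // 6 runes, 12 bytes
	plural := "\u0627\u0646"                       // alef + noon

	stemmed := stripSuffix(word, plural, 5)
	fmt.Println(utf8.RuneCountInString(word), len(word))
	fmt.Println(utf8.RuneCountInString(stemmed), len(stemmed))
}
```

Counting bytes instead of runes would double every threshold for this script, since each of these letters occupies two bytes in UTF-8.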

@ -0,0 +1,299 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/single"
)
func TestSoraniStemmerFilter(t *testing.T) {
// in order to match the lucene tests
// we will test with an analyzer, not just the stemmer
analyzer := analysis.DefaultAnalyzer{
Tokenizer: single.NewSingleTokenTokenizer(),
TokenFilters: []analysis.TokenFilter{
NewSoraniNormalizeFilter(),
NewSoraniStemmerFilter(),
},
}
tests := []struct {
input []byte
output analysis.TokenStream
}{
{ // -ek
input: []byte("پیاوێک"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{ // -yek
input: []byte("دەرگایەک"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دەرگا"),
Position: 1,
Start: 0,
End: 16,
},
},
},
{ // -aka
input: []byte("پیاوەكە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -ka
input: []byte("دەرگاكە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دەرگا"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -a
input: []byte("کتاویە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("کتاوی"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{ // -ya
input: []byte("دەرگایە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دەرگا"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -An
input: []byte("پیاوان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{ // -yAn
input: []byte("دەرگایان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دەرگا"),
Position: 1,
Start: 0,
End: 16,
},
},
},
{ // -akAn
input: []byte("پیاوەکان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 16,
},
},
},
{ // -kAn
input: []byte("دەرگاکان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دەرگا"),
Position: 1,
Start: 0,
End: 16,
},
},
},
{ // -Ana
input: []byte("پیاوانە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پیاو"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -yAna
input: []byte("دەرگایانە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دەرگا"),
Position: 1,
Start: 0,
End: 18,
},
},
},
{ // Ezafe singular
input: []byte("هۆتیلی"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("هۆتیل"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{ // Ezafe indefinite
input: []byte("هۆتیلێکی"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("هۆتیل"),
Position: 1,
Start: 0,
End: 16,
},
},
},
{ // Ezafe plural
input: []byte("هۆتیلانی"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("هۆتیل"),
Position: 1,
Start: 0,
End: 16,
},
},
},
{ // -awa
input: []byte("دوورەوە"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("دوور"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -dA
input: []byte("نیوەشەودا"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("نیوەشەو"),
Position: 1,
Start: 0,
End: 18,
},
},
},
{ // -A
input: []byte("سۆرانا"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("سۆران"),
Position: 1,
Start: 0,
End: 12,
},
},
},
{ // -mAn
input: []byte("پارەمان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پارە"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -tAn
input: []byte("پارەتان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پارە"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // -yAn
input: []byte("پارەیان"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("پارە"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{ // empty
input: []byte(""),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
Position: 1,
Start: 0,
End: 0,
},
},
},
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("for input %s(% x)", test.input, test.input)
t.Errorf("\texpected:")
for _, token := range test.output {
t.Errorf("\t\t%v %s(% x)", token, token.Term, token.Term)
}
t.Errorf("\tactual:")
for _, token := range actual {
t.Errorf("\t\t%v %s(% x)", token, token.Term, token.Term)
}
}
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ckb
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,163 @@
package ckb
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_ckb"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string
var SoraniStopWords = []byte(`# set of kurdish stopwords
# note these have been normalized with our scheme (e represented with U+06D5, etc)
# constructed from:
# * Fig 5 of "Building A Test Collection For Sorani Kurdish" (Esmaili et al)
# * "Sorani Kurdish: A Reference Grammar with selected readings" (Thackston)
# * Corpus-based analysis of 77M word Sorani collection: wikipedia, news, blogs, etc
# and
و
# which
کە
# of
ی
# made/did
کرد
# that/which
ئەوەی
# on/head
سەر
# two
دوو
# also
هەروەها
# from/that
لەو
# makes/does
دەکات
# some
چەند
# every
هەر
# demonstratives
# that
ئەو
# this
ئەم
# personal pronouns
# I
من
# we
ئێمە
# you
تۆ
# you
ئێوە
# he/she/it
ئەو
# they
ئەوان
# prepositions
# to/with/by
بە
پێ
# without
بەبێ
# along with/while/during
بەدەم
# in the opinion of
بەلای
# according to
بەپێی
# before
بەرلە
# in the direction of
بەرەوی
# in front of/toward
بەرەوە
# before/in the face of
بەردەم
# without
بێ
# except for
بێجگە
# for
بۆ
# on/in
دە
تێ
# with
دەگەڵ
# after
دوای
# except for/aside from
جگە
# in/from
لە
لێ
# in front of/before/because of
لەبەر
# between/among
لەبەینی
# concerning/about
لەبابەت
# concerning
لەبارەی
# instead of
لەباتی
# beside
لەبن
# instead of
لەبرێتی
# behind
لەدەم
# with/together with
لەگەڵ
# by
لەلایەن
# within
لەناو
# between/among
لەنێو
# for the sake of
لەپێناوی
# with respect to
لەرەوی
# by means of/for
لەرێ
# for the sake of
لەرێگا
# on/on top of/according to
لەسەر
# under
لەژێر
# between/among
ناو
# between/among
نێوان
# after
پاش
# before
پێش
# like
وەک
`)
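// TokenMapConstructor builds the Sorani stop word token map from the
// SoraniStopWords list embedded above.
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {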
rv := analysis.NewTokenMap()
err := rv.LoadBytes(SoraniStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}
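
The stop word list above is laid out as one word per line, with `#` introducing explanatory comments. The sketch below parses that layout with the standard library only. `parseStopWords` is a hypothetical function for illustration; it is an assumption about the file format as it appears here, not a description of how bleve's `LoadBytes` is implemented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseStopWords is a hypothetical parser (not bleve API). It treats
// everything from '#' to the end of a line as a comment, trims whitespace,
// and records each remaining non-empty line as one stop word.
func parseStopWords(data string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if line != "" {
			words[line] = true
		}
	}
	return words
}

func main() {
	words := parseStopWords("# first\nfoo\n\n# second\nbar\nfoo\n")
	fmt.Println(len(words), words["foo"], words["baz"])
}
```

Storing the words in a map makes repeated entries harmless, which matters for lists like this one where the same word can appear under more than one gloss.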

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package cs
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,199 @@
package cs
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_cs"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string
var CzechStopWords = []byte(`a
s
k
o
i
u
v
z
dnes
cz
tímto
budeš
budem
byli
jseš
můj
svým
ta
tomto
tohle
tuto
tyto
jej
zda
proč
máte
tato
kam
tohoto
kdo
kteří
mi
nám
tom
tomuto
mít
nic
proto
kterou
byla
toho
protože
asi
ho
naši
napište
re
což
tím
takže
svých
její
svými
jste
aj
tu
tedy
teto
bylo
kde
ke
pravé
ji
nad
nejsou
či
pod
téma
mezi
přes
ty
pak
vám
ani
když
však
neg
jsem
tento
článku
články
aby
jsme
před
pta
jejich
byl
ještě
bez
také
pouze
první
vaše
která
nás
nový
tipy
pokud
může
strana
jeho
své
jiné
zprávy
nové
není
vás
jen
podle
zde
být
více
bude
již
než
který
by
které
co
nebo
ten
tak
při
od
po
jsou
jak
další
ale
si
se
ve
to
jako
za
zpět
ze
do
pro
je
na
atd
atp
jakmile
přičemž
on
ona
ono
oni
ony
my
vy
ji
mne
jemu
tomu
těm
těmu
němu
němuž
jehož
jíž
jelikož
jež
jakož
načež
`)
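// TokenMapConstructor builds the Czech stop word token map from the
// CzechStopWords list embedded above.
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {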
rv := analysis.NewTokenMap()
err := rv.LoadBytes(CzechStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,59 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package da
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2/registry"
)
const AnalyzerName = "da"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
stopDaFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
stemmerDaFilter, err := cache.TokenFilterNamed(SnowballStemmerName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: unicodeTokenizer,
TokenFilters: []analysis.TokenFilter{
toLowerFilter,
stopDaFilter,
stemmerDaFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,71 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package da
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestDanishAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
// stemming
{
input: []byte("undersøg"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("undersøg"),
Position: 1,
Start: 0,
End: 9,
},
},
},
{
input: []byte("undersøgelse"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("undersøg"),
Position: 1,
Start: 0,
End: 13,
},
},
},
// stop word
{
input: []byte("på"),
output: analysis.TokenStream{},
},
}
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
}
}
}
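
The `End` values in the test cases above are byte offsets into the UTF-8 input, not character counts, which is why the eight-letter input "undersøg" ends at 9. A small standalone sketch of the difference (`byteAndRuneLen` is a hypothetical helper, not part of the package):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen returns the length of s in bytes and in runes.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"undersøg", "undersøgelse"} {
		// 'ø' occupies two bytes in UTF-8, so the byte length is one
		// greater than the rune count for both inputs.
		b, r := byteAndRuneLen(s)
		fmt.Println(s, b, r)
	}
}
```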

@ -0,0 +1,52 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package da
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/snowballstem"
"github.com/blevesearch/snowballstem/danish"
)
const SnowballStemmerName = "stemmer_da_snowball"
type DanishStemmerFilter struct {
}
func NewDanishStemmerFilter() *DanishStemmerFilter {
return &DanishStemmerFilter{}
}
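// Filter replaces each token's term with its Danish Snowball stem,
// modifying the tokens in place.
func (s *DanishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {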
for _, token := range input {
env := snowballstem.NewEnv(string(token.Term))
danish.Stem(env)
token.Term = []byte(env.Current())
}
return input
}
func DanishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewDanishStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(SnowballStemmerName, DanishStemmerFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package da
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,137 @@
package da
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_da"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
// Backticks (`) were replaced with apostrophes (') so the list can be
// embedded in a raw string literal.
var DanishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| A Danish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| This is a ranked list (commonest to rarest) of stopwords derived from
| a large text sample.
og | and
i | in
jeg | I
det | that (dem. pronoun)/it (pers. pronoun)
at | that (in front of a sentence)/to (with infinitive)
en | a/an
den | it (pers. pronoun)/that (dem. pronoun)
til | to/at/for/until/against/by/of/into, more
er | present tense of "to be"
som | who, as
på | on/upon/in/on/at/to/after/of/with/for, on
de | they
med | with/by/in, along
han | he
af | of/by/from/off/for/in/with/on, off
for | at/for/to/from/by/of/ago, in front/before, because
ikke | not
der | who/which, there/those
var | past tense of "to be"
mig | me/myself
sig | oneself/himself/herself/itself/themselves
men | but
et | a/an/one, one (number), someone/somebody/one
har | present tense of "to have"
om | round/about/for/in/a, about/around/down, if
vi | we
min | my
havde | past tense of "to have"
ham | him
hun | she
nu | now
over | over/above/across/by/beyond/past/on/about, over/past
da | then, when/as/since
fra | from/off/since, off, since
du | you
ud | out
sin | his/her/its/one's
dem | them
os | us/ourselves
op | up
man | you/one
hans | his
hvor | where
eller | or
hvad | what
skal | must/shall etc.
selv | myself/yourself/herself/ourselves etc., even
her | here
alle | all/everyone/everybody etc.
vil | will (verb)
blev | past tense of "to stay/to remain/to get/to become"
kunne | could
ind | in
når | when
være | present tense of "to be"
dog | however/yet/after all
noget | something
ville | would
jo | you know/you see (adv), yes
deres | their/theirs
efter | after/behind/according to/for/by/from, later/afterwards
ned | down
skulle | should
denne | this
end | than
dette | this
mit | my/mine
også | also
under | under/beneath/below/during, below/underneath
have | have
dig | you
anden | other
hende | her
mine | my
alt | everything
meget | much/very, plenty of
sit | his, her, its, one's
sine | his, her, its, one's
vor | our
mod | against
disse | these
hvis | if
din | your/yours
nogle | some
hos | by/at
blive | be/become
mange | many
ad | by/through
bliver | present tense of "to be/to become"
hendes | her/hers
været | be
thi | for (conj)
jer | you
sådan | such, like this/like that
`)
// TokenMapConstructor loads DanishStopWords into a new token map.
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(DanishStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}
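
The stop list above uses the Snowball layout: a vertical bar starts a comment, and the stop word sits at the start of the line. The following is a rough sketch of how such a list can be parsed. `parseStopWords` is a hypothetical helper written for illustration, and treating `#` as a second comment marker is an assumption based on the `#`-prefixed Greek list later in this commit; it is not presented as the implementation behind `LoadBytes`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseStopWords collects the words of a Snowball-style stop list,
// ignoring everything from the first '|' or '#' to the end of each line.
func parseStopWords(list string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(list))
	for scanner.Scan() {
		line := scanner.Text()
		// Cut the line at the first comment marker, if there is one.
		if i := strings.IndexAny(line, "#|"); i >= 0 {
			line = line[:i]
		}
		// Whatever remains is whitespace-separated stop words.
		for _, w := range strings.Fields(line) {
			words[w] = true
		}
	}
	return words
}

func main() {
	list := " | A Danish stop word list.\nog | and\npå | on/upon\n\nikke | not\n"
	words := parseStopWords(list)
	fmt.Println(len(words), words["på"], words["and"])
}
```

Because comment text is discarded before splitting, the English glosses never end up in the stop set.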

@ -0,0 +1,64 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2/registry"
)
const AnalyzerName = "de"
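// AnalyzerConstructor builds the German analyzer: the unicode tokenizer
// followed by lowercase, stop-word, German normalization, and light
// stemmer filters.
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {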
unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
stopDeFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
normalizeDeFilter, err := cache.TokenFilterNamed(NormalizeName)
if err != nil {
return nil, err
}
lightStemmerDeFilter, err := cache.TokenFilterNamed(LightStemmerName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: unicodeTokenizer,
TokenFilters: []analysis.TokenFilter{
toLowerFilter,
stopDeFilter,
normalizeDeFilter,
lightStemmerDeFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,155 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestGermanAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
{
input: []byte("Tisch"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("tisch"),
Position: 1,
Start: 0,
End: 5,
},
},
},
{
input: []byte("Tische"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("tisch"),
Position: 1,
Start: 0,
End: 6,
},
},
},
{
input: []byte("Tischen"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("tisch"),
Position: 1,
Start: 0,
End: 7,
},
},
},
// german specials
{
input: []byte("Schaltflächen"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("schaltflach"),
Position: 1,
Start: 0,
End: 14,
},
},
},
{
input: []byte("Schaltflaechen"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("schaltflach"),
Position: 1,
Start: 0,
End: 14,
},
},
},
// tests added by marty to increase coverage
{
input: []byte("Blechern"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("blech"),
Position: 1,
Start: 0,
End: 8,
},
},
},
{
input: []byte("Klecks"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("kleck"),
Position: 1,
Start: 0,
End: 6,
},
},
},
{
input: []byte("Mindestens"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("mindest"),
Position: 1,
Start: 0,
End: 10,
},
},
},
{
input: []byte("Kugelfest"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("kugelf"),
Position: 1,
Start: 0,
End: 9,
},
},
},
{
input: []byte("Baldigst"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("baldig"),
Position: 1,
Start: 0,
End: 8,
},
},
},
}
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
}
}
}

@ -0,0 +1,98 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const NormalizeName = "normalize_de"
// States of the scanner used by normalize.
const (
N = 0 /* ordinary state */
V = 1 /* stops 'u' from entering umlaut state */
U = 2 /* umlaut state, allows e-deletion */
)
type GermanNormalizeFilter struct {
}
func NewGermanNormalizeFilter() *GermanNormalizeFilter {
return &GermanNormalizeFilter{}
}
func (s *GermanNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
term := normalize(token.Term)
token.Term = term
}
return input
}
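// normalize folds German umlauts and their ASCII transliterations to a
// common form: ä/ö/ü become a/o/u, "ae" and "oe" lose the 'e', "ue" loses
// the 'e' only when the 'u' does not follow a vowel or 'q', and ß becomes
// "ss".
func normalize(input []byte) []byte {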
state := N
runes := bytes.Runes(input)
for i := 0; i < len(runes); i++ {
switch runes[i] {
case 'a', 'o':
state = U
case 'u':
if state == N {
state = U
} else {
state = V
}
case 'e':
if state == U {
runes = analysis.DeleteRune(runes, i)
i--
}
state = V
case 'i', 'q', 'y':
state = V
case 'ä':
runes[i] = 'a'
state = V
case 'ö':
runes[i] = 'o'
state = V
case 'ü':
runes[i] = 'u'
state = V
case 'ß':
runes[i] = 's'
i++
runes = analysis.InsertRune(runes, i, 's')
state = N
default:
state = N
}
}
return analysis.BuildTermFromRunes(runes)
}
func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewGermanNormalizeFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
if err != nil {
panic(err)
}
}
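
The state machine in `normalize` can also be written as a single pass that appends to an output slice instead of deleting and inserting runes in place. The version below is a standalone sketch under that assumption: `normalizeGerman` and the `state*` names are hypothetical, and it is meant to mirror the transitions above rather than replace the filter.

```go
package main

import "fmt"

const (
	stateN = iota // ordinary state
	stateV        // after a vowel or 'q': a following 'u' cannot start an umlaut
	stateU        // umlaut state: a following 'e' is dropped
)

// normalizeGerman folds umlauts, their "ae"/"oe"/"ue" spellings, and ß.
func normalizeGerman(s string) string {
	out := make([]rune, 0, len(s)+1)
	state := stateN
	for _, r := range s {
		switch r {
		case 'a', 'o':
			state = stateU
		case 'u':
			// 'u' only opens an umlaut when it follows an ordinary rune.
			if state == stateN {
				state = stateU
			} else {
				state = stateV
			}
		case 'e':
			if state == stateU {
				state = stateV
				continue // drop the 'e' of "ae", "oe", "ue"
			}
			state = stateV
		case 'i', 'q', 'y':
			state = stateV
		case 'ä':
			r, state = 'a', stateV
		case 'ö':
			r, state = 'o', stateV
		case 'ü':
			r, state = 'u', stateV
		case 'ß':
			out = append(out, 's', 's')
			state = stateN
			continue
		default:
			state = stateN
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	for _, w := range []string{"Schaltflächen", "Schaltflaechen", "dauer", "weißbier"} {
		fmt.Println(w, "->", normalizeGerman(w))
	}
}
```

Appending to a fresh slice avoids the index adjustments (`i--`, `i++`) that the in-place version needs after a deletion or insertion.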

@ -0,0 +1,103 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestGermanNormalizeFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
// Tests that a/o/u + e is equivalent to the umlaut form
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("Schaltflächen"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Schaltflachen"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("Schaltflaechen"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Schaltflachen"),
},
},
},
// Tests the specific heuristic that ue is not folded after a vowel or q.
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("dauer"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("dauer"),
},
},
},
// Tests German-specific folding of sharp s (ß becomes ss)
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("weißbier"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("weissbier"),
},
},
},
// empty
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
}
germanNormalizeFilter := NewGermanNormalizeFilter()
for _, test := range tests {
actual := germanNormalizeFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %#v, got %#v", test.output, actual)
t.Errorf("expected %s(% x), got %s(% x)", test.output[0].Term, test.output[0].Term, actual[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,119 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const LightStemmerName = "stemmer_de_light"
type GermanLightStemmerFilter struct {
}
func NewGermanLightStemmerFilter() *GermanLightStemmerFilter {
return &GermanLightStemmerFilter{}
}
func (s *GermanLightStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
runes := bytes.Runes(token.Term)
runes = stem(runes)
token.Term = analysis.BuildTermFromRunes(runes)
}
return input
}
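// stem folds accented vowels to their base letters and then strips common
// inflectional suffixes in two passes (step1, then step2).
func stem(input []rune) []rune {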
for i, r := range input {
switch r {
case 'ä', 'à', 'á', 'â':
input[i] = 'a'
case 'ö', 'ò', 'ó', 'ô':
input[i] = 'o'
case 'ï', 'ì', 'í', 'î':
input[i] = 'i'
case 'ü', 'ù', 'ú', 'û':
input[i] = 'u'
}
}
input = step1(input)
return step2(input)
}
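// stEnding reports whether ch is a consonant after which a trailing "s" or
// "st" may be stripped.
func stEnding(ch rune) bool {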
switch ch {
case 'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't':
return true
}
return false
}
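// step1 strips at most one suffix, trying in order: "ern" (length > 5),
// "em"/"en"/"er"/"es" (length > 4), "e" (length > 3), and "s" after an
// stEnding consonant (length > 3).
func step1(s []rune) []rune {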
l := len(s)
if l > 5 && s[l-3] == 'e' && s[l-2] == 'r' && s[l-1] == 'n' {
return s[:l-3]
}
if l > 4 && s[l-2] == 'e' {
switch s[l-1] {
case 'm', 'n', 'r', 's':
return s[:l-2]
}
}
if l > 3 && s[l-1] == 'e' {
return s[:l-1]
}
if l > 3 && s[l-1] == 's' && stEnding(s[l-2]) {
return s[:l-1]
}
return s
}
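// step2 strips at most one further suffix, trying in order: "est"
// (length > 5), "er"/"en" (length > 4), and "st" after an stEnding
// consonant (length > 4).
func step2(s []rune) []rune {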
l := len(s)
if l > 5 && s[l-3] == 'e' && s[l-2] == 's' && s[l-1] == 't' {
return s[:l-3]
}
if l > 4 && s[l-2] == 'e' && (s[l-1] == 'r' || s[l-1] == 'n') {
return s[:l-2]
}
if l > 4 && s[l-2] == 's' && s[l-1] == 't' && stEnding(s[l-3]) {
return s[:l-2]
}
return s
}
func GermanLightStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewGermanLightStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(LightStemmerName, GermanLightStemmerFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,52 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/snowballstem"
"github.com/blevesearch/snowballstem/german"
)
const SnowballStemmerName = "stemmer_de_snowball"
type GermanStemmerFilter struct {
}
func NewGermanStemmerFilter() *GermanStemmerFilter {
return &GermanStemmerFilter{}
}
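// Filter replaces each token's term with its German Snowball stem,
// modifying the tokens in place.
func (s *GermanStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {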
for _, token := range input {
env := snowballstem.NewEnv(string(token.Term))
german.Stem(env)
token.Term = []byte(env.Current())
}
return input
}
func GermanStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewGermanStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(SnowballStemmerName, GermanStemmerFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,91 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestSnowballGermanStemmer(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("abzuschrecken"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("abzuschreck"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("abzuwarten"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("abzuwart"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("zwirnfabrik"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("zwirnfabr"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("zyniker"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("zynik"),
},
},
},
}
cache := registry.NewCache()
filter, err := cache.TokenFilterNamed(SnowballStemmerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := filter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package de
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,321 @@
package de
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_de"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
// Backticks (`) were replaced with apostrophes (') so the list can be
// embedded in a raw string literal.
var GermanStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| A German stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| The number of forms in this list is reduced significantly by passing it
| through the German stemmer.
aber | but
alle | all
allem
allen
aller
alles
als | than, as
also | so
am | an + dem
an | at
ander | other
andere
anderem
anderen
anderer
anderes
anderm
andern
anderr
anders
auch | also
auf | on
aus | out of
bei | by
bin | am
bis | until
bist | art
da | there
damit | with it
dann | then
der | the
den
des
dem
die
das
daß | that
derselbe | the same
derselben
denselben
desselben
demselben
dieselbe
dieselben
dasselbe
dazu | to that
dein | thy
deine
deinem
deinen
deiner
deines
denn | because
derer | of those
dessen | of him
dich | thee
dir | to thee
du | thou
dies | this
diese
diesem
diesen
dieser
dieses
doch | (several meanings)
dort | (over) there
durch | through
ein | a
eine
einem
einen
einer
eines
einig | some
einige
einigem
einigen
einiger
einiges
einmal | once
er | he
ihn | him
ihm | to him
es | it
etwas | something
euer | your
eure
eurem
euren
eurer
eures
für | for
gegen | towards
gewesen | p.p. of sein
hab | have
habe | have
haben | have
hat | has
hatte | had
hatten | had
hier | here
hin | there
hinter | behind
ich | I
mich | me
mir | to me
ihr | you, to her
ihre
ihrem
ihren
ihrer
ihres
euch | to you
im | in + dem
in | in
indem | while
ins | in + das
ist | is
jede | each, every
jedem
jeden
jeder
jedes
jene | that
jenem
jenen
jener
jenes
jetzt | now
kann | can
kein | no
keine
keinem
keinen
keiner
keines
können | can
könnte | could
machen | do
man | one
manche | some, many a
manchem
manchen
mancher
manches
mein | my
meine
meinem
meinen
meiner
meines
mit | with
muss | must
musste | had to
nach | to(wards)
nicht | not
nichts | nothing
noch | still, yet
nun | now
nur | only
ob | whether
oder | or
ohne | without
sehr | very
sein | his
seine
seinem
seinen
seiner
seines
selbst | self
sich | herself
sie | they, she
ihnen | to them
sind | are
so | so
solche | such
solchem
solchen
solcher
solches
soll | shall
sollte | should
sondern | but
sonst | else
über | over
um | about, around
und | and
uns | us
unse
unsem
unsen
unser
unses
unter | under
viel | much
vom | von + dem
von | from
vor | before
während | while
war | was
waren | were
warst | wast
was | what
weg | away, off
weil | because
weiter | further
welche | which
welchem
welchen
welcher
welches
wenn | when
werde | will
werden | will
wie | how
wieder | again
will | want
wir | we
wird | will
wirst | willst
wo | where
wollen | want
wollte | wanted
würde | would
würden | would
zu | to
zum | zu + dem
zur | zu + der
zwar | indeed
zwischen | between
`)
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(GermanStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package el
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,105 @@
package el
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_el"
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// Backticks (`) were replaced with apostrophes (') so the list can be
// embedded in a raw string literal.
var GreekStopWords = []byte(`# Lucene Greek Stopwords list
# Note: by default this file is used after GreekLowerCaseFilter,
# so when modifying this file use 'σ' instead of 'ς'
ο
η
το
οι
τα
του
τησ
των
τον
την
και
κι
κ
ειμαι
εισαι
ειναι
ειμαστε
ειστε
στο
στον
στη
στην
μα
αλλα
απο
για
προσ
με
σε
ωσ
παρα
αντι
κατα
μετα
θα
να
δε
δεν
μη
μην
επι
ενω
εαν
αν
τοτε
που
πωσ
ποιοσ
ποια
ποιο
ποιοι
ποιεσ
ποιων
ποιουσ
αυτοσ
αυτη
αυτο
αυτοι
αυτων
αυτουσ
αυτεσ
αυτα
εκεινοσ
εκεινη
εκεινο
εκεινοι
εκεινεσ
εκεινα
εκεινων
εκεινουσ
οπωσ
ομωσ
ισωσ
οσο
οτι
`)
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(GreekStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,73 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package en implements an analyzer with reasonable defaults for processing
// English text.
//
// It strips possessive suffixes ('s), transforms tokens to lower case,
// removes stopwords from a built-in list, and applies porter stemming.
//
// The built-in stopwords list is defined in EnglishStopWords.
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/token/porter"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)
const AnalyzerName = "en"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
tokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
possEnFilter, err := cache.TokenFilterNamed(PossessiveName)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
stopEnFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
stemmerEnFilter, err := cache.TokenFilterNamed(porter.Name)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: tokenizer,
TokenFilters: []analysis.TokenFilter{
possEnFilter,
toLowerFilter,
stopEnFilter,
stemmerEnFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}
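
The filter order in the English analyzer matters: possessives are stripped first, and lowercasing has to run before stop-word removal because the stop list is lowercase. A standalone sketch of that chain, using only the standard library, is shown below. The `filter` type, the helper names, and the one-word stop list are hypothetical; the set of apostrophes is taken from the cases exercised in the analyzer test, and Porter stemming is left out to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
)

// filter is a hypothetical stand-in for a token filter: it maps a token
// slice to a (possibly shorter) token slice.
type filter func([]string) []string

// stripPossessive removes a trailing 's written with an ASCII apostrophe,
// a right single quotation mark, or a fullwidth apostrophe.
func stripPossessive(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		r := []rune(t)
		if n := len(r); n >= 2 && (r[n-1] == 's' || r[n-1] == 'S') {
			switch r[n-2] {
			case '\'', '\u2019', '\uFF07':
				r = r[:n-2]
			}
		}
		out = append(out, string(r))
	}
	return out
}

// lowercase converts every token to lower case.
func lowercase(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		out = append(out, strings.ToLower(t))
	}
	return out
}

// removeStopWords returns a filter that drops every token found in stop.
func removeStopWords(stop map[string]bool) filter {
	return func(tokens []string) []string {
		out := make([]string, 0, len(tokens))
		for _, t := range tokens {
			if !stop[t] {
				out = append(out, t)
			}
		}
		return out
	}
}

// analyze applies the filters in the order given.
func analyze(tokens []string, filters ...filter) []string {
	for _, f := range filters {
		tokens = f(tokens)
	}
	return tokens
}

func main() {
	stop := map[string]bool{"the": true}
	tokens := []string{"The", "Steven's", "books"}
	fmt.Println(analyze(tokens, stripPossessive, lowercase, removeStopWords(stop)))
}
```

Swapping `lowercase` and `removeStopWords` in the call would let "The" through, since only the lowercase form is in the stop set.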

@ -0,0 +1,105 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestEnglishAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
// stemming
{
input: []byte("books"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("book"),
Position: 1,
Start: 0,
End: 5,
},
},
},
{
input: []byte("book"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("book"),
Position: 1,
Start: 0,
End: 4,
},
},
},
// stop word removal
{
input: []byte("the"),
output: analysis.TokenStream{},
},
// possessive removal
{
input: []byte("steven's"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("steven"),
Position: 1,
Start: 0,
End: 8,
},
},
},
{
input: []byte("steven\u2019s"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("steven"),
Position: 1,
Start: 0,
End: 10,
},
},
},
{
input: []byte("steven\uFF07s"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("steven"),
Position: 1,
Start: 0,
End: 10,
},
},
},
}
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
}
}
}

@ -0,0 +1,177 @@
/*
This code was ported from the Open Search Project
https://github.com/opensearch-project/OpenSearch/blob/main/modules/analysis-common/src/main/java/org/opensearch/analysis/common/EnglishPluralStemFilter.java
The algorithm itself was created by Mark Harwood
https://github.com/markharwood
*/
/*
* SPDX-License-Identifier: Apache-2.0
*
* The OpenSearch Contributors require contributions made to
* this file be licensed under the Apache-2.0 license or a
* compatible open source license.
*/
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package en
import (
"strings"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const PluralStemmerName = "stemmer_en_plural"
type EnglishPluralStemmerFilter struct {
}
func NewEnglishPluralStemmerFilter() *EnglishPluralStemmerFilter {
return &EnglishPluralStemmerFilter{}
}
func (s *EnglishPluralStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
token.Term = []byte(stem(string(token.Term)))
}
return input
}
func EnglishPluralStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewEnglishPluralStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(PluralStemmerName, EnglishPluralStemmerFilterConstructor)
if err != nil {
panic(err)
}
}
// ----------------------------------------------------------------------------
// Words ending in oes that retain the e when stemmed
var oesExceptions = []string{"shoes", "canoes", "oboes"}
// Words ending in ches that retain the e when stemmed
var chesExceptions = []string{
"cliches",
"avalanches",
"mustaches",
"moustaches",
"quiches",
"headaches",
"heartaches",
"porsches",
"tranches",
"caches",
}
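// stem lowercases word and removes an English plural suffix (s, es, ies).
// Words shorter than three runes, words not ending in 's', and words ending
// in "us" or "ss" are returned lowercased but otherwise unchanged.
func stem(word string) string {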
runes := []rune(strings.ToLower(word))
if len(runes) < 3 || runes[len(runes)-1] != 's' {
return string(runes)
}
switch runes[len(runes)-2] {
case 'u':
fallthrough
case 's':
return string(runes)
case 'e':
// Modified ies->y logic from original s-stemmer - only work on strings > 4
// so spies -> spy still but pies->pie.
// The original code also special-cased aies and eies for no good reason as far as I can tell.
// ( no words of consequence - eg http://www.thefreedictionary.com/words-that-end-in-aies )
if len(runes) > 4 && runes[len(runes)-3] == 'i' {
runes[len(runes)-3] = 'y'
return string(runes[0 : len(runes)-2])
}
// Suffix rules to remove any dangling "e"
if len(runes) > 3 {
// xes (but >1 prefix so we can stem "boxes->box" but keep "axes->axe")
if len(runes) > 4 && runes[len(runes)-3] == 'x' {
return string(runes[0 : len(runes)-2])
}
// oes
if len(runes) > 3 && runes[len(runes)-3] == 'o' {
if isException(runes, oesExceptions) {
// Only remove the S
return string(runes[0 : len(runes)-1])
}
// Remove the es
return string(runes[0 : len(runes)-2])
}
if len(runes) > 4 {
// shes/sses
if runes[len(runes)-4] == 's' && (runes[len(runes)-3] == 'h' || runes[len(runes)-3] == 's') {
return string(runes[0 : len(runes)-2])
}
// ches
if len(runes) > 4 {
if runes[len(runes)-4] == 'c' && runes[len(runes)-3] == 'h' {
if isException(runes, chesExceptions) {
// Only remove the S
return string(runes[0 : len(runes)-1])
}
// Remove the es
return string(runes[0 : len(runes)-2])
}
}
}
}
fallthrough
default:
return string(runes[0 : len(runes)-1])
}
}
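// isException reports whether word and any entry of exceptions agree on
// their shared tail, comparing rune by rune from the end. The loop stops as
// soon as either side runs out, so a word that is shorter than an exception
// also matches when it is a suffix of that exception.
func isException(word []rune, exceptions []string) bool {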
for _, exception := range exceptions {
exceptionRunes := []rune(exception)
exceptionPos := len(exceptionRunes) - 1
wordPos := len(word) - 1
matched := true
for exceptionPos >= 0 && wordPos >= 0 {
if exceptionRunes[exceptionPos] != word[wordPos] {
matched = false
break
}
exceptionPos--
wordPos--
}
if matched {
return true
}
}
return false
}
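
isException walks the word and each exception backwards from their last rune and stops when either one is exhausted, so it reports a match whenever one is a suffix of the other. A minimal sketch of that comparison on strings follows; the helper name `suffixOverlap` is hypothetical and not part of the package.

```go
package main

import "fmt"

// suffixOverlap reports whether a and b agree on every rune of their
// shared tail, i.e. whether one of them is a suffix of the other.
func suffixOverlap(a, b string) bool {
	ar, br := []rune(a), []rune(b)
	for i, j := len(ar)-1, len(br)-1; i >= 0 && j >= 0; i, j = i-1, j-1 {
		if ar[i] != br[j] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(suffixOverlap("snowshoes", "shoes")) // word ends with the exception
	fmt.Println(suffixOverlap("hoes", "shoes"))      // word is a suffix of the exception
	fmt.Println(suffixOverlap("foes", "shoes"))      // mismatch at 'f' vs 'h'
}
```

One consequence, assuming the rules in stem above: "hoes" counts as an exception through "shoes" and keeps its "e", while "foes" does not.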

@ -0,0 +1,46 @@
package en
import "testing"
func TestEnglishPluralStemmer(t *testing.T) {
data := []struct {
In, Out string
}{
{"dresses", "dress"},
{"dress", "dress"},
{"axes", "axe"},
{"ad", "ad"},
{"ads", "ad"},
{"gas", "ga"},
{"sass", "sass"},
{"berries", "berry"},
{"dresses", "dress"},
{"spies", "spy"},
{"shoes", "shoe"},
{"headaches", "headache"},
{"computer", "computer"},
{"dressing", "dressing"},
{"clothes", "clothe"},
{"DRESSES", "dress"},
{"frog", "frog"},
{"dress", "dress"},
{"runs", "run"},
{"pies", "pie"},
{"foxes", "fox"},
{"axes", "axe"},
{"foes", "fo"},
{"dishes", "dish"},
{"snitches", "snitch"},
{"cliches", "cliche"},
{"forests", "forest"},
{"yes", "ye"},
}
for _, datum := range data {
stemmed := stem(datum.In)
if stemmed != datum.Out {
t.Errorf("expected %v but got %v", datum.Out, stemmed)
}
}
}

@ -0,0 +1,70 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"unicode/utf8"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
// PossessiveName is the name PossessiveFilter is registered as
// in the bleve registry.
const PossessiveName = "possessive_en"
const rightSingleQuotationMark = '\u2019'
const apostrophe = '\''
const fullWidthApostrophe = '\uFF07'
const apostropheChars = rightSingleQuotationMark + apostrophe + fullWidthApostrophe
// PossessiveFilter implements a TokenFilter which
// strips the English possessive suffix ('s) from tokens.
// It handles a variety of apostrophe types, is case-insensitive
// and doesn't distinguish between possessive and contraction.
// (i.e. "She's So Rad" becomes "She So Rad")
type PossessiveFilter struct {
}
func NewPossessiveFilter() *PossessiveFilter {
return &PossessiveFilter{}
}
func (s *PossessiveFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
lastRune, lastRuneSize := utf8.DecodeLastRune(token.Term)
if lastRune == 's' || lastRune == 'S' {
nextLastRune, nextLastRuneSize := utf8.DecodeLastRune(token.Term[:len(token.Term)-lastRuneSize])
if nextLastRune == rightSingleQuotationMark ||
nextLastRune == apostrophe ||
nextLastRune == fullWidthApostrophe {
token.Term = token.Term[:len(token.Term)-lastRuneSize-nextLastRuneSize]
}
}
}
return input
}
func PossessiveFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewPossessiveFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(PossessiveName, PossessiveFilterConstructor)
if err != nil {
panic(err)
}
}
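
The filter works on the raw UTF-8 bytes of each term, decoding one rune at a time from the end so that multi-byte apostrophes are removed whole. The sketch below isolates that per-token step in a hypothetical standalone function, `stripPossessive`, which is not part of the package; it assumes the same three apostrophe code points (U+0027, U+2019, U+FF07).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stripPossessive removes a trailing apostrophe + s/S from term.
// It returns a sub-slice of term and does not copy.
func stripPossessive(term []byte) []byte {
	last, lastSize := utf8.DecodeLastRune(term)
	if last != 's' && last != 'S' {
		return term
	}
	// Decode the rune just before the trailing s; on an empty prefix this
	// yields utf8.RuneError with size 0, which matches no apostrophe.
	prev, prevSize := utf8.DecodeLastRune(term[:len(term)-lastSize])
	switch prev {
	case '\'', '\u2019', '\uFF07':
		return term[:len(term)-lastSize-prevSize]
	}
	return term
}

func main() {
	for _, in := range []string{"steven's", "steven\u2019s", "steven\uFF07s", "stevens"} {
		fmt.Printf("%d bytes in, %q out\n", len(in), stripPossessive([]byte(in)))
	}
}
```

U+2019 and U+FF07 each occupy three bytes in UTF-8, which is why subtracting the decoded rune size, rather than a fixed 1, is needed.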

@ -0,0 +1,142 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestEnglishPossessiveFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("marty's"),
},
&analysis.Token{
Term: []byte("MARTY'S"),
},
&analysis.Token{
Term: []byte("marty\u2019s"),
},
&analysis.Token{
Term: []byte("MARTY\u2019S"),
},
&analysis.Token{
Term: []byte("marty\uFF07s"),
},
&analysis.Token{
Term: []byte("MARTY\uFF07S"),
},
&analysis.Token{
Term: []byte("m"),
},
&analysis.Token{
Term: []byte("s"),
},
&analysis.Token{
Term: []byte("'s"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("marty"),
},
&analysis.Token{
Term: []byte("MARTY"),
},
&analysis.Token{
Term: []byte("marty"),
},
&analysis.Token{
Term: []byte("MARTY"),
},
&analysis.Token{
Term: []byte("marty"),
},
&analysis.Token{
Term: []byte("MARTY"),
},
&analysis.Token{
Term: []byte("m"),
},
&analysis.Token{
Term: []byte("s"),
},
&analysis.Token{
Term: []byte(""),
},
},
},
}
cache := registry.NewCache()
stemmerFilter, err := cache.TokenFilterNamed(PossessiveName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := stemmerFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %s, got %s", test.output, actual)
}
}
}
func BenchmarkEnglishPossessiveFilter(b *testing.B) {
input := analysis.TokenStream{
&analysis.Token{
Term: []byte("marty's"),
},
&analysis.Token{
Term: []byte("MARTY'S"),
},
&analysis.Token{
Term: []byte("marty\u2019s"),
},
&analysis.Token{
Term: []byte("MARTY\u2019S"),
},
&analysis.Token{
Term: []byte("marty\uFF07s"),
},
&analysis.Token{
Term: []byte("MARTY\uFF07S"),
},
&analysis.Token{
Term: []byte("m"),
},
}
cache := registry.NewCache()
stemmerFilter, err := cache.TokenFilterNamed(PossessiveName)
if err != nil {
b.Fatal(err)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
stemmerFilter.Filter(input)
}
}

@ -0,0 +1,52 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/snowballstem"
"github.com/blevesearch/snowballstem/english"
)
const SnowballStemmerName = "stemmer_en_snowball"
type EnglishStemmerFilter struct {
}
func NewEnglishStemmerFilter() *EnglishStemmerFilter {
return &EnglishStemmerFilter{}
}
func (s *EnglishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
env := snowballstem.NewEnv(string(token.Term))
english.Stem(env)
token.Term = []byte(env.Current())
}
return input
}
func EnglishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewEnglishStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(SnowballStemmerName, EnglishStemmerFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,79 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestSnowballEnglishStemmer(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("enjoy"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("enjoy"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("enjoyed"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("enjoy"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("enjoyable"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("enjoy"),
},
},
},
}
cache := registry.NewCache()
filter, err := cache.TokenFilterNamed(SnowballStemmerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := filter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,347 @@
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_en"
// EnglishStopWords is the built-in list of stopwords used by the "stop_en" TokenFilter.
//
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
// ` was changed to ' to allow for literal string
var EnglishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| An English stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| Many of the forms below are quite rare (e.g. "yourselves") but included for
| completeness.
| PRONOUNS FORMS
| 1st person sing
i | subject, always in upper case of course
me | object
my | possessive adjective
| the possessive pronoun 'mine' is best suppressed, because of the
| sense of coal-mine etc.
myself | reflexive
| 1st person plural
we | subject
| us | object
| care is required here because US = United States. It is usually
| safe to remove it if it is in lower case.
our | possessive adjective
ours | possessive pronoun
ourselves | reflexive
| second person (archaic 'thou' forms not included)
you | subject and object
your | possessive adjective
yours | possessive pronoun
yourself | reflexive (singular)
yourselves | reflexive (plural)
| third person singular
he | subject
him | object
his | possessive adjective and pronoun
himself | reflexive
she | subject
her | object and possessive adjective
hers | possessive pronoun
herself | reflexive
it | subject and object
its | possessive adjective
itself | reflexive
| third person plural
they | subject
them | object
their | possessive adjective
theirs | possessive pronoun
themselves | reflexive
| other forms (demonstratives, interrogatives)
what
which
who
whom
this
that
these
those
| VERB FORMS (using F.R. Palmer's nomenclature)
| BE
am | 1st person, present
is | -s form (3rd person, present)
are | present
was | 1st person, past
were | past
be | infinitive
been | past participle
being | -ing form
| HAVE
have | simple
has | -s form
had | past
having | -ing form
| DO
do | simple
does | -s form
did | past
doing | -ing form
| The forms below are, I believe, best omitted, because of the significant
| homonym forms:
| He made a WILL
| old tin CAN
| merry month of MAY
| a smell of MUST
| fight the good fight with all thy MIGHT
| would, could, should, ought might however be included
| | AUXILIARIES
| | WILL
|will
would
| | SHALL
|shall
should
| | CAN
|can
could
| | MAY
|may
|might
| | MUST
|must
| | OUGHT
ought
| COMPOUND FORMS, increasingly encountered nowadays in 'formal' writing
| pronoun + verb
i'm
you're
he's
she's
it's
we're
they're
i've
you've
we've
they've
i'd
you'd
he'd
she'd
we'd
they'd
i'll
you'll
he'll
she'll
we'll
they'll
| verb + negation
isn't
aren't
wasn't
weren't
hasn't
haven't
hadn't
doesn't
don't
didn't
| auxiliary + negation
won't
wouldn't
shan't
shouldn't
can't
cannot
couldn't
mustn't
| miscellaneous forms
let's
that's
who's
what's
here's
there's
when's
where's
why's
how's
| rarer forms
| daren't needn't
| doubtful forms
| oughtn't mightn't
| ARTICLES
a
an
the
| THE REST (Overlap among prepositions, conjunctions, adverbs etc is so
| high, that classification is pointless.)
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
| Just for the record, the following words are among the commonest in English
| one
| every
| least
| less
| many
| now
| ever
| never
| say
| says
| said
| also
| get
| go
| goes
| just
| made
| make
| put
| see
| seen
| whether
| like
| well
| back
| even
| still
| way
| take
| since
| another
| however
| two
| three
| four
| five
| first
| second
| new
| old
| high
| long
`)
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(EnglishStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}
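
The header of the list states its format: a vertical bar begins a comment, and each stop word sits at the start of its line. The sketch below parses text under just those two rules. `parseStopWords` is a hypothetical helper written for illustration; it is not the implementation behind `LoadBytes`, whose exact behavior is not shown here.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// parseStopWords returns the first word of every line after cutting
// the line at the first '|' (the comment marker).
func parseStopWords(data []byte) []string {
	var words []string
	sc := bufio.NewScanner(bytes.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '|'); i >= 0 {
			line = line[:i] // drop the comment part
		}
		if fields := strings.Fields(line); len(fields) > 0 {
			words = append(words, fields[0])
		}
	}
	return words
}

func main() {
	sample := []byte("| 1st person sing\ni | subject\nme | object\n| us | object\nwould\n|will\n")
	fmt.Println(parseStopWords(sample))
}
```

Under these rules the commented-out entries in the list (for example `| us` and `|will`) contribute no stop words, while `would` and `should` do.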

@ -0,0 +1,66 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)
const AnalyzerName = "es"
func AnalyzerConstructor(config map[string]interface{},
cache *registry.Cache) (analysis.Analyzer, error) {
unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
normalizeEsFilter, err := cache.TokenFilterNamed(NormalizeName)
if err != nil {
return nil, err
}
stopEsFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
lightStemmerEsFilter, err := cache.TokenFilterNamed(LightStemmerName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: unicodeTokenizer,
TokenFilters: []analysis.TokenFilter{
toLowerFilter,
stopEsFilter,
normalizeEsFilter,
lightStemmerEsFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,122 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
func TestSpanishAnalyzer(t *testing.T) {
tests := []struct {
input []byte
output analysis.TokenStream
}{
// stemming
{
input: []byte("chicana"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("chican"),
Position: 1,
Start: 0,
End: 7,
},
},
},
{
input: []byte("chicano"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("chican"),
Position: 1,
Start: 0,
End: 7,
},
},
},
// added by marty for better coverage
{
input: []byte("yeses"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("yes"),
Position: 1,
Start: 0,
End: 5,
},
},
},
{
input: []byte("jaeces"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("jaez"),
Position: 1,
Start: 0,
End: 6,
},
},
},
{
input: []byte("arcos"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("arc"),
Position: 1,
Start: 0,
End: 5,
},
},
},
{
input: []byte("caos"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("caos"),
Position: 1,
Start: 0,
End: 4,
},
},
},
{
input: []byte("parecer"),
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("parecer"),
Position: 1,
Start: 0,
End: 7,
},
},
},
}
cache := registry.NewCache()
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
if err != nil {
t.Fatal(err)
}
for _, test := range tests {
actual := analyzer.Analyze(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %v, got %v", test.output, actual)
}
}
}

@ -0,0 +1,78 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const LightStemmerName = "stemmer_es_light"
type SpanishLightStemmerFilter struct {
}
func NewSpanishLightStemmerFilter() *SpanishLightStemmerFilter {
return &SpanishLightStemmerFilter{}
}
func (s *SpanishLightStemmerFilter) Filter(
input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
runes := bytes.Runes(token.Term)
runes = stem(runes)
token.Term = analysis.BuildTermFromRunes(runes)
}
return input
}
func stem(input []rune) []rune {
l := len(input)
if l < 5 {
return input
}
switch input[l-1] {
case 'o', 'a', 'e':
return input[:l-1]
case 's':
if input[l-2] == 'e' && input[l-3] == 's' && input[l-4] == 'e' {
return input[:l-2]
}
if input[l-2] == 'e' && input[l-3] == 'c' {
input[l-3] = 'z'
return input[:l-2]
}
if input[l-2] == 'o' || input[l-2] == 'a' || input[l-2] == 'e' {
return input[:l-2]
}
}
return input
}
func SpanishLightStemmerFilterConstructor(config map[string]interface{},
cache *registry.Cache) (analysis.TokenFilter, error) {
return NewSpanishLightStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(LightStemmerName, SpanishLightStemmerFilterConstructor)
if err != nil {
panic(err)
}
}
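
The light stemmer's rules can be tried in isolation. The sketch below restates the same suffix rules on strings in a hypothetical function `lightStem` (not part of the package), so that the minimum-length guard and the "ces" to "z" rewrite can be checked against the words used in the analyzer test.

```go
package main

import "fmt"

// lightStem applies the light Spanish suffix rules to a single word.
// Words shorter than five runes are returned unchanged.
func lightStem(word string) string {
	r := []rune(word)
	l := len(r)
	if l < 5 {
		return word
	}
	switch r[l-1] {
	case 'o', 'a', 'e':
		// Drop a final vowel: chicana -> chican.
		return string(r[:l-1])
	case 's':
		if r[l-2] == 'e' && r[l-3] == 's' && r[l-4] == 'e' {
			// "eses" -> "es": yeses -> yes.
			return string(r[:l-2])
		}
		if r[l-2] == 'e' && r[l-3] == 'c' {
			// "ces" -> "z": jaeces -> jaez.
			r[l-3] = 'z'
			return string(r[:l-2])
		}
		if r[l-2] == 'o' || r[l-2] == 'a' || r[l-2] == 'e' {
			// "os", "as", "es": arcos -> arc.
			return string(r[:l-2])
		}
	}
	return word
}

func main() {
	for _, w := range []string{"chicana", "yeses", "jaeces", "arcos", "caos", "parecer"} {
		fmt.Printf("%s -> %s\n", w, lightStem(w))
	}
}
```

"caos" is left alone only because of the length guard; without it the "os" rule would reduce it to "ca".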

@ -0,0 +1,70 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const NormalizeName = "normalize_es"
type SpanishNormalizeFilter struct {
}
func NewSpanishNormalizeFilter() *SpanishNormalizeFilter {
return &SpanishNormalizeFilter{}
}
func (s *SpanishNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
term := normalize(token.Term)
token.Term = term
}
return input
}
func normalize(input []byte) []byte {
runes := bytes.Runes(input)
for i := 0; i < len(runes); i++ {
switch runes[i] {
case 'à', 'á', 'â', 'ä':
runes[i] = 'a'
case 'ò', 'ó', 'ô', 'ö':
runes[i] = 'o'
case 'è', 'é', 'ê', 'ë':
runes[i] = 'e'
case 'ù', 'ú', 'û', 'ü':
runes[i] = 'u'
case 'ì', 'í', 'î', 'ï':
runes[i] = 'i'
}
}
return analysis.BuildTermFromRunes(runes)
}
func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewSpanishNormalizeFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
if err != nil {
panic(err)
}
}
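
normalize folds only the lowercase accented vowels listed in its switch; other letters, such as ñ or uppercase accented vowels, pass through unchanged. A table-driven sketch of the same folding is shown below, assuming the same twenty code points. The names `folds` and `foldAccents` are hypothetical, and `strings.Map` is used here in place of the explicit rune loop.

```go
package main

import (
	"fmt"
	"strings"
)

// folds maps each lowercase accented vowel to its base vowel.
var folds = map[rune]rune{
	'à': 'a', 'á': 'a', 'â': 'a', 'ä': 'a',
	'ò': 'o', 'ó': 'o', 'ô': 'o', 'ö': 'o',
	'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
	'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
	'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
}

// foldAccents replaces every rune found in folds and keeps the rest.
func foldAccents(s string) string {
	return strings.Map(func(r rune) rune {
		if base, ok := folds[r]; ok {
			return base
		}
		return r
	}, s)
}

func main() {
	for _, w := range []string{"Guía", "agüero", "niño", "Ángel"} {
		fmt.Printf("%s -> %s\n", w, foldAccents(w))
	}
}
```

In the "es" analyzer the lowercase filter runs before normalization, so uppercase accented vowels would already have been lowercased by the time this step sees them.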

@ -0,0 +1,112 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es
import (
"reflect"
"testing"
"github.com/blevesearch/bleve/v2/analysis"
)
func TestSpanishNormalizeFilter(t *testing.T) {
tests := []struct {
input analysis.TokenStream
output analysis.TokenStream
}{
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("Guía"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Guia"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("Belcebú"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Belcebu"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("Limón"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("Limon"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("agüero"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("aguero"),
},
},
},
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte("laúd"),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte("laud"),
},
},
},
// empty
{
input: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
output: analysis.TokenStream{
&analysis.Token{
Term: []byte(""),
},
},
},
}
spanishNormalizeFilter := NewSpanishNormalizeFilter()
for _, test := range tests {
actual := spanishNormalizeFilter.Filter(test.input)
if !reflect.DeepEqual(actual, test.output) {
t.Errorf("expected %#v, got %#v", test.output, actual)
t.Errorf("expected %s(% x), got %s(% x)", test.output[0].Term, test.output[0].Term, actual[0].Term, actual[0].Term)
}
}
}

@ -0,0 +1,52 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/snowballstem"
"github.com/blevesearch/snowballstem/spanish"
)
const SnowballStemmerName = "stemmer_es_snowball"
type SpanishStemmerFilter struct {
}
func NewSpanishStemmerFilter() *SpanishStemmerFilter {
return &SpanishStemmerFilter{}
}
func (s *SpanishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
env := snowballstem.NewEnv(string(token.Term))
spanish.Stem(env)
token.Term = []byte(env.Current())
}
return input
}
func SpanishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewSpanishStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(SnowballStemmerName, SpanishStemmerFilterConstructor)
if err != nil {
panic(err)
}
}

@ -0,0 +1,79 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

func TestSnowballSpanishStemmer(t *testing.T) {
	tests := []struct {
		input  analysis.TokenStream
		output analysis.TokenStream
	}{
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agresivos"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agres"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agresivamente"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agres"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agresividad"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agres"),
				},
			},
		},
	}

	cache := registry.NewCache()
	filter, err := cache.TokenFilterNamed(SnowballStemmerName)
	if err != nil {
		t.Fatal(err)
	}
	for _, test := range tests {
		actual := filter.Filter(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
		}
	}
}

@ -0,0 +1,36 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/registry"
)

func StopTokenFilterConstructor(config map[string]interface{},
	cache *registry.Cache) (analysis.TokenFilter, error) {
	tokenMap, err := cache.TokenMapNamed(StopName)
	if err != nil {
		return nil, err
	}
	return stop.NewStopTokensFilter(tokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
	if err != nil {
		panic(err)
	}
}
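
The constructor above looks up the token map registered as `StopName` and wraps it in a stop-token filter. As a rough illustration of what such a filter has to do, here is a minimal, self-contained sketch using only the standard library. The `token` and `tokenMap` types and the `filterStopTokens` function are hypothetical stand-ins for this example, not bleve's own types or implementation.

```go
package main

import "fmt"

// token is a hypothetical stand-in for a token carrying a term.
type token struct {
	Term []byte
}

// tokenMap is a hypothetical stand-in for a set of stop words.
type tokenMap map[string]bool

// filterStopTokens returns the tokens whose term is not in stopWords,
// preserving their original order.
func filterStopTokens(input []*token, stopWords tokenMap) []*token {
	rv := make([]*token, 0, len(input))
	for _, tok := range input {
		if !stopWords[string(tok.Term)] {
			rv = append(rv, tok)
		}
	}
	return rv
}

func main() {
	stopWords := tokenMap{"de": true, "la": true, "que": true}
	input := []*token{
		{Term: []byte("la")},
		{Term: []byte("casa")},
		{Term: []byte("de")},
		{Term: []byte("papel")},
	}
	// Prints "casa" and "papel", one per line.
	for _, tok := range filterStopTokens(input, stopWords) {
		fmt.Println(string(tok.Term))
	}
}
```

The sketch builds a new slice rather than editing the input in place, which keeps the example easy to follow; a real filter may make a different choice.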

@ -0,0 +1,383 @@
package es

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const StopName = "stop_es"

// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
// ` was changed to ' so that the list can be embedded in a raw string literal
var SpanishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| A Spanish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| The following is a ranked list (commonest to rarest) of stopwords
| deriving from a large sample of text.
| Extra words have been added at the end.
de | from, of
la | the, her
que | who, that
el | the
en | in
y | and
a | to
los | the, them
del | de + el
se | himself, from him etc
las | the, them
por | for, by, etc
un | a
para | for
con | with
no | no
una | a
su | his, her
al | a + el
| es from SER
lo | him
como | how
más | more
pero | pero
sus | su plural
le | to him, her
ya | already
o | or
| fue from SER
este | this
| ha from HABER
| himself etc
porque | because
esta | this
| son from SER
entre | between
| está from ESTAR
cuando | when
muy | very
sin | without
sobre | on
| ser from SER
| tiene from TENER
también | also
me | me
hasta | until
hay | there is/are
donde | where
| han from HABER
quien | whom, that
| están from ESTAR
| estado from ESTAR
desde | from
todo | all
nos | us
durante | during
| estados from ESTAR
todos | all
uno | a
les | to them
ni | nor
contra | against
otros | other
| fueron from SER
ese | that
eso | that
| había from HABER
ante | before
ellos | they
e | and (variant of y)
esto | this
| me
antes | before
algunos | some
qué | what?
unos | a
yo | I
otro | other
otras | other
otra | other
él | he
tanto | so much, many
esa | that
estos | these
mucho | much, many
quienes | who
nada | nothing
muchos | many
cual | who
| sea from SER
poco | few
ella | she
estar | to be
| haber from HABER
estas | these
| estaba from ESTAR
| estamos from ESTAR
algunas | some
algo | something
nosotros | we
| other forms
mi | me
mis | mi plural
| thou
te | thee
ti | thee
tu | thy
tus | tu plural
ellas | they
nosotras | we
vosotros | you
vosotras | you
os | you
mío | mine
mía |
míos |
mías |
tuyo | thine
tuya |
tuyos |
tuyas |
suyo | his, hers, theirs
suya |
suyos |
suyas |
nuestro | ours
nuestra |
nuestros |
nuestras |
vuestro | yours
vuestra |
vuestros |
vuestras |
esos | those
esas | those
| forms of estar, to be (not including the infinitive):
estoy
estás
está
estamos
estáis
están
esté
estés
estemos
estéis
estén
estaré
estarás
estará
estaremos
estaréis
estarán
estaría
estarías
estaríamos
estaríais
estarían
estaba
estabas
estábamos
estabais
estaban
estuve
estuviste
estuvo
estuvimos
estuvisteis
estuvieron
estuviera
estuvieras
estuviéramos
estuvierais
estuvieran
estuviese
estuvieses
estuviésemos
estuvieseis
estuviesen
estando
estado
estada
estados
estadas
estad
| forms of haber, to have (not including the infinitive):
he
has
ha
hemos
habéis
han
haya
hayas
hayamos
hayáis
hayan
habré
habrás
habrá
habremos
habréis
habrán
habría
habrías
habríamos
habríais
habrían
había
habías
habíamos
habíais
habían
hube
hubiste
hubo
hubimos
hubisteis
hubieron
hubiera
hubieras
hubiéramos
hubierais
hubieran
hubiese
hubieses
hubiésemos
hubieseis
hubiesen
habiendo
habido
habida
habidos
habidas
| forms of ser, to be (not including the infinitive):
soy
eres
es
somos
sois
son
sea
seas
seamos
seáis
sean
seré
serás
será
seremos
seréis
serán
sería
serías
seríamos
seríais
serían
era
eras
éramos
erais
eran
fui
fuiste
fue
fuimos
fuisteis
fueron
fuera
fueras
fuéramos
fuerais
fueran
fuese
fueses
fuésemos
fueseis
fuesen
siendo
sido
| sed also means 'thirst'
| forms of tener, to have (not including the infinitive):
tengo
tienes
tiene
tenemos
tenéis
tienen
tenga
tengas
tengamos
tengáis
tengan
tendré
tendrás
tendrá
tendremos
tendréis
tendrán
tendría
tendrías
tendríamos
tendríais
tendrían
tenía
tenías
teníamos
teníais
tenían
tuve
tuviste
tuvo
tuvimos
tuvisteis
tuvieron
tuviera
tuvieras
tuviéramos
tuvierais
tuvieran
tuviese
tuvieses
tuviésemos
tuvieseis
tuviesen
teniendo
tenido
tenida
tenidos
tenidas
tened
`)

// TokenMapConstructor builds a token map from the embedded Spanish stop word list.
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
	rv := analysis.NewTokenMap()
	err := rv.LoadBytes(SpanishStopWords)
	return rv, err
}

func init() {
	err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
	if err != nil {
		panic(err)
	}
}
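
The header of the embedded list describes its format: comments begin with a vertical bar, and each stop word sits at the start of a line. The following is a minimal sketch of a loader for that format, using only the standard library. The `parseSnowballStopWords` helper is hypothetical and written for this illustration; it is not bleve's `LoadBytes`, whose details may differ.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// parseSnowballStopWords is a hypothetical helper, not part of bleve.
// It returns the set of stop words in data, treating everything from a
// '|' to the end of the line as a comment.
func parseSnowballStopWords(data []byte) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(bytes.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		// Drop the trailing comment, if any.
		if i := strings.IndexByte(line, '|'); i >= 0 {
			line = line[:i]
		}
		// What remains holds zero or more whitespace-separated words.
		for _, w := range strings.Fields(line) {
			words[w] = true
		}
	}
	return words
}

func main() {
	sample := []byte(" | A comment line\nde | from, of\n | es from SER\nestoy\nestás\n")
	words := parseSnowballStopWords(sample)
	// Prints: 3 true false true
	fmt.Println(len(words), words["de"], words["es"], words["estás"])
}
```

Note that a line such as `| es from SER` contributes no word under this reading, because the word itself sits behind the comment marker. In the list above, `es` still ends up as a stop word because it reappears uncommented among the forms of ser.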
