Adding upstream version 2.5.1.
Signed-off-by: Daniel Baumann <daniel@debian.org>
parent c71cb8b61d
commit 982828099e
783 changed files with 150650 additions and 0 deletions
24
.github/workflows/tests.yml
vendored
Normal file
@@ -0,0 +1,24 @@
on:
  push:
    branches:
      - master
  pull_request:
name: Tests
jobs:
  test:
    strategy:
      matrix:
        go-version: [1.22.x, 1.23.x, 1.24.x]
        platform: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.platform }}
    steps:
    - name: Install Go
      uses: actions/setup-go@v1
      with:
        go-version: ${{ matrix.go-version }}
    - name: Checkout code
      uses: actions/checkout@v2
    - name: Test
      run: |
        go version
        go test -race ./...
20
.gitignore
vendored
Normal file
@@ -0,0 +1,20 @@
#*
*.sublime-*
*~
.#*
.project
.settings
**/.idea/
**/*.iml
.DS_Store
query_string.y.go.tmp
/analysis/token_filters/cld2/cld2-read-only
/analysis/token_filters/cld2/libcld2_full.a
/cmd/bleve/bleve
vendor/**
!vendor/manifest
/y.output
/search/query/y.output
*.test
tags
go.sum
25
.travis.yml
Normal file
@@ -0,0 +1,25 @@
sudo: false

language: go

go:
  - "1.21.x"
  - "1.22.x"
  - "1.23.x"

script:
  - go get golang.org/x/tools/cmd/cover
  - go get github.com/mattn/goveralls
  - go get github.com/kisielk/errcheck
  - go get -u github.com/FiloSottile/gvt
  - gvt restore
  - go test -race -v $(go list ./... | grep -v vendor/)
  - go vet $(go list ./... | grep -v vendor/)
  - go test ./test -v -indexType scorch
  - errcheck -ignorepkg fmt $(go list ./... | grep -v vendor/);
  - scripts/project-code-coverage.sh
  - scripts/build_children.sh

notifications:
  email:
    - fts-team@couchbase.com
16
CONTRIBUTING.md
Normal file
@@ -0,0 +1,16 @@
# Contributing to Bleve

We look forward to your contributions, but ask that you first review these guidelines.

### Sign the CLA

As Bleve is a Couchbase project, we require contributors to accept the [Couchbase Contributor License Agreement](http://review.couchbase.org/static/individual_agreement.html). To sign this agreement, log into the Couchbase [code review tool](http://review.couchbase.org/). The Bleve project does not use this code review tool, but it is still used to track acceptance of the contributor license agreements.

### Submitting a Pull Request

All types of contributions are welcome, but please keep the following in mind:

- If you're planning a large change, please discuss it in a GitHub issue or on the Google group first. This helps avoid duplicate effort and spending time on something that may not be merged.
- Existing tests should continue to pass; new tests for the contribution are nice to have.
- All code should be formatted with `go fmt`.
- All code should pass `go vet`.
202
LICENSE
Normal file
@@ -0,0 +1,202 @@

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
121
README.md
Normal file
@@ -0,0 +1,121 @@
# bleve

[](https://github.com/blevesearch/bleve/actions/workflows/tests.yml?query=event%3Apush+branch%3Amaster)
[](https://coveralls.io/github/blevesearch/bleve?branch=master)
[](https://pkg.go.dev/github.com/blevesearch/bleve/v2)
[](https://app.gitter.im/#/room/#blevesearch_bleve:gitter.im)
[](https://codebeat.co/projects/github-com-blevesearch-bleve)
[](https://goreportcard.com/report/github.com/blevesearch/bleve/v2)
[](https://sourcegraph.com/github.com/blevesearch/bleve?badge)
[](https://opensource.org/licenses/Apache-2.0)

A modern indexing + search library in Go

## Features

* Index any Go data structure or JSON
* Intelligent defaults backed up by powerful configuration ([scorch](https://github.com/blevesearch/bleve/blob/master/index/scorch/README.md))
* Supported field types:
    * `text`, `number`, `datetime`, `boolean`, `geopoint`, `geoshape`, `IP`, `vector`
* Supported query types:
    * `term`, `phrase`, `match`, `match_phrase`, `prefix`, `regexp`, `wildcard`, `fuzzy`
    * term range, numeric range, date range, boolean field
    * compound queries: `conjuncts`, `disjuncts`, boolean (`must`/`should`/`must_not`)
    * [query string syntax](http://www.blevesearch.com/docs/Query-String-Query/)
    * [geo spatial search](https://github.com/blevesearch/bleve/blob/master/geo/README.md)
    * approximate k-nearest neighbors via [vector search](https://github.com/blevesearch/bleve/blob/master/docs/vectors.md)
    * [synonym search](https://github.com/blevesearch/bleve/blob/master/docs/synonyms.md)
* [tf-idf](https://github.com/blevesearch/bleve/blob/master/docs/scoring.md#tf-idf) / [bm25](https://github.com/blevesearch/bleve/blob/master/docs/scoring.md#bm25) scoring models
* Hybrid search: exact + semantic
* Query time boosting
* Search result match highlighting with document fragments
* Aggregations/faceting support:
    * terms facet
    * numeric range facet
    * date range facet

## Indexing

```go
message := struct{
	Id   string
	From string
	Body string
}{
	Id:   "example",
	From: "xyz@couchbase.com",
	Body: "bleve indexing is easy",
}

mapping := bleve.NewIndexMapping()
index, err := bleve.New("example.bleve", mapping)
if err != nil {
	panic(err)
}
index.Index(message.Id, message)
```
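
The feature list mentions indexing JSON as well as Go data structures. The sketch below shows one way the same message could start out as raw JSON: it is decoded with the standard library into a generic map first. It assumes that `index.Index` accepts such a decoded map in the same way it accepts the struct above, which should be checked against the bleve documentation; `decodeDocument` is a hypothetical helper, not part of bleve.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeDocument is a hypothetical helper: it turns raw JSON into a
// generic map so that the fields are visible as Go values.
func decodeDocument(raw []byte) (map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	return doc, nil
}

func main() {
	raw := []byte(`{"Id":"example","From":"xyz@couchbase.com","Body":"bleve indexing is easy"}`)

	doc, err := decodeDocument(raw)
	if err != nil {
		panic(err)
	}

	// Assumption: in a real program the decoded map would now be handed
	// to the index, for example index.Index("example", doc).
	fmt.Println(doc["Id"], doc["Body"])
}
```

Decoding up front also surfaces malformed JSON as an error before anything reaches the index.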

## Querying

```go
index, _ := bleve.Open("example.bleve")
query := bleve.NewQueryStringQuery("bleve")
searchRequest := bleve.NewSearchRequest(query)
searchResult, _ := index.Search(searchRequest)
```

## Command Line Interface

To install the CLI for the latest release of bleve, run:

```bash
$ go install github.com/blevesearch/bleve/v2/cmd/bleve@latest
```

```
$ bleve --help
Bleve is a command-line tool to interact with a bleve index.

Usage:
  bleve [command]

Available Commands:
  bulk        bulk loads from newline delimited JSON files
  check       checks the contents of the index
  count       counts the number documents in the index
  create      creates a new index
  dictionary  prints the term dictionary for the specified field in the index
  dump        dumps the contents of the index
  fields      lists the fields in this index
  help        Help about any command
  index       adds the files to the index
  mapping     prints the mapping used for this index
  query       queries the index
  registry    registry lists the bleve components compiled into this executable
  scorch      command-line tool to interact with a scorch index

Flags:
  -h, --help   help for bleve

Use "bleve [command] --help" for more information about a command.
```

## Text Analysis

Bleve includes general-purpose analyzers (customizable) as well as pre-built text analyzers for the following languages:

Arabic (ar), Bulgarian (bg), Catalan (ca), Chinese-Japanese-Korean (cjk), Kurdish (ckb), Danish (da), German (de), Greek (el), English (en), Spanish - Castilian (es), Basque (eu), Persian (fa), Finnish (fi), French (fr), Gaelic (ga), Spanish - Galician (gl), Hindi (hi), Croatian (hr), Hungarian (hu), Armenian (hy), Indonesian (id, in), Italian (it), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Turkish (tr)
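
To make the idea of an analyzer more concrete: the `standard` analyzer added in this commit chains a tokenizer with a lowercase filter and an English stop-word filter. The following is a conceptual, standard-library-only sketch of that kind of pipeline, not bleve's implementation. The function name `analyze`, the splitting rule (any non-letter, non-digit rune) and the tiny stop list are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative stop list, not bleve's English list.
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "is": true, "of": true, "the": true,
}

// analyze splits text on every rune that is neither a letter nor a digit,
// lowercases each token, and drops tokens found in stopWords.
func analyze(text string) []string {
	raw := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := make([]string, 0, len(raw))
	for _, t := range raw {
		t = strings.ToLower(t)
		if stopWords[t] {
			continue
		}
		tokens = append(tokens, t)
	}
	return tokens
}

func main() {
	fmt.Println(analyze("Bleve indexing is easy")) // prints [bleve indexing easy]
}
```

The order of the stages matters in this sketch: lowercasing runs before the stop-word check, so "The" and "the" are both removed.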

## Text Analysis Wizard

[bleveanalysis.couchbase.com](https://bleveanalysis.couchbase.com)

## Discussion/Issues

Discuss usage/development of bleve and/or report issues here:
* [GitHub issues](https://github.com/blevesearch/bleve/issues)
* [Google group](https://groups.google.com/forum/#!forum/bleve)

## License

Apache License Version 2.0
15
SECURITY.md
Normal file
@@ -0,0 +1,15 @@
# Security Policy

## Supported Versions

We support the latest release (for example, bleve v2.3.x).

## Reporting a Vulnerability

All security issues for this project should be reported by email to security@couchbase.com and fts-team@couchbase.com.
This mail will be delivered to the owners of this project.

- To ensure your report is NOT marked as spam, please include the word "security/vulnerability" along with the project name (blevesearch/bleve) in the subject of the email.
- Please be as descriptive as possible when explaining the issue; a test case highlighting the issue is always welcome.

Your email will be acknowledged as soon as possible.
148
analysis/analyzer/custom/custom.go
Normal file
@@ -0,0 +1,148 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package custom

import (
	"fmt"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "custom"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {

	var err error
	var charFilters []analysis.CharFilter
	charFiltersValue, ok := config["char_filters"]
	if ok {
		switch charFiltersValue := charFiltersValue.(type) {
		case []string:
			charFilters, err = getCharFilters(charFiltersValue, cache)
			if err != nil {
				return nil, err
			}
		case []interface{}:
			charFiltersNames, err := convertInterfaceSliceToStringSlice(charFiltersValue, "char filter")
			if err != nil {
				return nil, err
			}
			charFilters, err = getCharFilters(charFiltersNames, cache)
			if err != nil {
				return nil, err
			}
		default:
			return nil, fmt.Errorf("unsupported type for char_filters, must be slice")
		}
	}

	var tokenizerName string
	tokenizerValue, ok := config["tokenizer"]
	if ok {
		tokenizerName, ok = tokenizerValue.(string)
		if !ok {
			return nil, fmt.Errorf("must specify tokenizer as string")
		}
	} else {
		return nil, fmt.Errorf("must specify tokenizer")
	}

	tokenizer, err := cache.TokenizerNamed(tokenizerName)
	if err != nil {
		return nil, err
	}

	var tokenFilters []analysis.TokenFilter
	tokenFiltersValue, ok := config["token_filters"]
	if ok {
		switch tokenFiltersValue := tokenFiltersValue.(type) {
		case []string:
			tokenFilters, err = getTokenFilters(tokenFiltersValue, cache)
			if err != nil {
				return nil, err
			}
		case []interface{}:
			tokenFiltersNames, err := convertInterfaceSliceToStringSlice(tokenFiltersValue, "token filter")
			if err != nil {
				return nil, err
			}
			tokenFilters, err = getTokenFilters(tokenFiltersNames, cache)
			if err != nil {
				return nil, err
			}
		default:
			return nil, fmt.Errorf("unsupported type for token_filters, must be slice")
		}
	}

	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
	}
	if charFilters != nil {
		rv.CharFilters = charFilters
	}
	if tokenFilters != nil {
		rv.TokenFilters = tokenFilters
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}

func getCharFilters(charFilterNames []string, cache *registry.Cache) ([]analysis.CharFilter, error) {
	charFilters := make([]analysis.CharFilter, len(charFilterNames))
	for i, charFilterName := range charFilterNames {
		charFilter, err := cache.CharFilterNamed(charFilterName)
		if err != nil {
			return nil, err
		}
		charFilters[i] = charFilter
	}

	return charFilters, nil
}

func getTokenFilters(tokenFilterNames []string, cache *registry.Cache) ([]analysis.TokenFilter, error) {
	tokenFilters := make([]analysis.TokenFilter, len(tokenFilterNames))
	for i, tokenFilterName := range tokenFilterNames {
		tokenFilter, err := cache.TokenFilterNamed(tokenFilterName)
		if err != nil {
			return nil, err
		}
		tokenFilters[i] = tokenFilter
	}

	return tokenFilters, nil
}

func convertInterfaceSliceToStringSlice(interfaceSlice []interface{}, objType string) ([]string, error) {
	stringSlice := make([]string, len(interfaceSlice))
	for i, interfaceObj := range interfaceSlice {
		stringObj, ok := interfaceObj.(string)
		if ok {
			stringSlice[i] = stringObj
		} else {
			return nil, fmt.Errorf("%s name must be a string", objType)
		}
	}

	return stringSlice, nil
}
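
A note on why the constructor in `custom.go` accepts both `[]string` and `[]interface{}` for `char_filters` and `token_filters`: when a configuration is decoded from JSON into a `map[string]interface{}`, `encoding/json` produces `[]interface{}` for arrays, never `[]string`. The standalone sketch below illustrates this with the standard library only. The helper `toStringSlice` is a hypothetical stand-in for the conversion step, and the configuration values (`unicode`, `to_lower`) are example names assumed for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toStringSlice is a hypothetical stand-in for the conversion step:
// it copies each element into a []string and rejects non-string values.
func toStringSlice(values []interface{}, kind string) ([]string, error) {
	out := make([]string, len(values))
	for i, v := range values {
		s, ok := v.(string)
		if !ok {
			return nil, fmt.Errorf("%s name must be a string", kind)
		}
		out[i] = s
	}
	return out, nil
}

func main() {
	// Example configuration; the names used here are assumptions.
	raw := []byte(`{"tokenizer":"unicode","token_filters":["to_lower"]}`)

	var config map[string]interface{}
	if err := json.Unmarshal(raw, &config); err != nil {
		panic(err)
	}

	// encoding/json decodes a JSON array into []interface{}, so a type
	// assertion to []string would fail here.
	filters, ok := config["token_filters"].([]interface{})
	if !ok {
		panic("token_filters is not a slice")
	}

	names, err := toStringSlice(filters, "token filter")
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // prints [to_lower]
}
```

A configuration built directly in Go code can pass `[]string`, which is why both cases appear in the type switch.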
41
analysis/analyzer/keyword/keyword.go
Normal file
@@ -0,0 +1,41 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package keyword

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/single"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "keyword"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	keywordTokenizer, err := cache.TokenizerNamed(single.Name)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: keywordTokenizer,
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
49
analysis/analyzer/simple/simple.go
Normal file
@@ -0,0 +1,49 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package simple

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/letter"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "simple"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(letter.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
55
analysis/analyzer/standard/standard.go
Normal file
@@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package standard

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/lang/en"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "standard"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(unicode.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	stopEnFilter, err := cache.TokenFilterNamed(en.StopName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
			stopEnFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
55
analysis/analyzer/web/web.go
Normal file
@@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package web

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/lang/en"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/web"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "web"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(web.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	stopEnFilter, err := cache.TokenFilterNamed(en.StopName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
			stopEnFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
117
analysis/benchmark_test.go
Normal file
@@ -0,0 +1,117 @@
// Copyright (c) 2015 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package analysis_test

import (
	index "github.com/blevesearch/bleve_index_api"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/standard"
	"github.com/blevesearch/bleve/v2/registry"
)

func BenchmarkAnalysis(b *testing.B) {
	for i := 0; i < b.N; i++ {

		cache := registry.NewCache()
		analyzer, err := cache.AnalyzerNamed(standard.Name)
		if err != nil {
			b.Fatal(err)
		}

		ts := analyzer.Analyze(bleveWikiArticle)
		freqs := analysis.TokenFrequency(ts, nil, index.IncludeTermVectors)
		if len(freqs) != 511 {
			b.Errorf("expected %d freqs, got %d", 511, len(freqs))
		}
	}
}

var bleveWikiArticle = []byte(`Boiling liquid expanding vapor explosion
|
||||
From Wikipedia, the free encyclopedia
|
||||
See also: Boiler explosion and Steam explosion
|
||||
|
||||
Flames subsequent to a flammable liquid BLEVE from a tanker. BLEVEs do not necessarily involve fire.
|
||||
|
||||
This article's tone or style may not reflect the encyclopedic tone used on Wikipedia. See Wikipedia's guide to writing better articles for suggestions. (July 2013)
|
||||
A boiling liquid expanding vapor explosion (BLEVE, /ˈblɛviː/ blev-ee) is an explosion caused by the rupture of a vessel containing a pressurized liquid above its boiling point.[1]
|
||||
Contents [hide]
|
||||
1 Mechanism
|
||||
1.1 Water example
|
||||
1.2 BLEVEs without chemical reactions
|
||||
2 Fires
|
||||
3 Incidents
|
||||
4 Safety measures
|
||||
5 See also
|
||||
6 References
|
||||
7 External links
|
||||
Mechanism[edit]
|
||||
|
||||
This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2013)
|
||||
There are three characteristics of liquids which are relevant to the discussion of a BLEVE:
|
||||
If a liquid in a sealed container is boiled, the pressure inside the container increases. As the liquid changes to a gas it expands - this expansion in a vented container would cause the gas and liquid to take up more space. In a sealed container the gas and liquid are not able to take up more space and so the pressure rises. Pressurized vessels containing liquids can reach an equilibrium where the liquid stops boiling and the pressure stops rising. This occurs when no more heat is being added to the system (either because it has reached ambient temperature or has had a heat source removed).
|
||||
The boiling temperature of a liquid is dependent on pressure - high pressures will yield high boiling temperatures, and low pressures will yield low boiling temperatures. A common simple experiment is to place a cup of water in a vacuum chamber, and then reduce the pressure in the chamber until the water boils. By reducing the pressure the water will boil even at room temperature. This works both ways - if the pressure is increased beyond normal atmospheric pressures, the boiling of hot water could be suppressed far beyond normal temperatures. The cooling system of a modern internal combustion engine is a real-world example.
|
||||
When a liquid boils it turns into a gas. The resulting gas takes up far more space than the liquid did.
|
||||
Typically, a BLEVE starts with a container of liquid which is held above its normal, atmospheric-pressure boiling temperature. Many substances normally stored as liquids, such as CO2, propane, and other similar industrial gases have boiling temperatures, at atmospheric pressure, far below room temperature. In the case of water, a BLEVE could occur if a pressurized chamber of water is heated far beyond the standard 100 °C (212 °F). That container, because the boiling water pressurizes it, is capable of holding liquid water at very high temperatures.
|
||||
If the pressurized vessel, containing liquid at high temperature (which may be room temperature, depending on the substance) ruptures, the pressure which prevents the liquid from boiling is lost. If the rupture is catastrophic, where the vessel is immediately incapable of holding any pressure at all, then there suddenly exists a large mass of liquid which is at very high temperature and very low pressure. This causes the entire volume of liquid to instantaneously boil, which in turn causes an extremely rapid expansion. Depending on temperatures, pressures and the substance involved, that expansion may be so rapid that it can be classified as an explosion, fully capable of inflicting severe damage on its surroundings.
|
||||
Water example[edit]
|
||||
Imagine, for example, a tank of pressurized liquid water held at 204.4 °C (400 °F). This tank would normally be pressurized to 1.7 MPa (250 psi) above atmospheric ("gauge") pressure. If the tank containing the water were to rupture, there would for a slight moment exist a volume of liquid water which would be
|
||||
at atmospheric pressure, and
|
||||
204.4 °C (400 °F).
|
||||
At atmospheric pressure the boiling point of water is 100 °C (212 °F) - liquid water at atmospheric pressure cannot exist at temperatures higher than 100 °C (212 °F). At that moment, the water would boil and turn to vapour explosively, and the 204.4 °C (400 °F) liquid water turned to gas would take up a lot more volume than it did as liquid, causing a vapour explosion. Such explosions can happen when the superheated water of a steam engine escapes through a crack in a boiler, causing a boiler explosion.
|
||||
BLEVEs without chemical reactions[edit]
|
||||
It is important to note that a BLEVE need not be a chemical explosion—nor does there need to be a fire—however if a flammable substance is subject to a BLEVE it may also be subject to intense heating, either from an external source of heat which may have caused the vessel to rupture in the first place or from an internal source of localized heating such as skin friction. This heating can cause a flammable substance to ignite, adding a secondary explosion caused by the primary BLEVE. While blast effects of any BLEVE can be devastating, a flammable substance such as propane can add significantly to the danger.
|
||||
Bleve explosion.svg
|
||||
While the term BLEVE is most often used to describe the results of a container of flammable liquid rupturing due to fire, a BLEVE can occur even with a non-flammable substance such as water,[2] liquid nitrogen,[3] liquid helium or other refrigerants or cryogens, and therefore is not usually considered a type of chemical explosion.
|
||||
Fires[edit]
|
||||
BLEVEs can be caused by an external fire near the storage vessel causing heating of the contents and pressure build-up. While tanks are often designed to withstand great pressure, constant heating can cause the metal to weaken and eventually fail. If the tank is being heated in an area where there is no liquid, it may rupture faster without the liquid to absorb the heat. Gas containers are usually equipped with relief valves that vent off excess pressure, but the tank can still fail if the pressure is not released quickly enough.[1] Relief valves are sized to release pressure fast enough to prevent the pressure from increasing beyond the strength of the vessel, but not so fast as to be the cause of an explosion. An appropriately sized relief valve will allow the liquid inside to boil slowly, maintaining a constant pressure in the vessel until all the liquid has boiled and the vessel empties.
|
||||
If the substance involved is flammable, it is likely that the resulting cloud of the substance will ignite after the BLEVE has occurred, forming a fireball and possibly a fuel-air explosion, also termed a vapor cloud explosion (VCE). If the materials are toxic, a large area will be contaminated.[4]
|
||||
Incidents[edit]
|
||||
The term "BLEVE" was coined by three researchers at Factory Mutual, in the analysis of an accident there in 1957 involving a chemical reactor vessel.[5]
|
||||
In August 1959 the Kansas City Fire Department suffered its largest ever loss of life in the line of duty, when a 25,000 gallon (95,000 litre) gas tank exploded during a fire on Southwest Boulevard killing five firefighters. This was the first time BLEVE was used to describe a burning fuel tank.[citation needed]
|
||||
Later incidents included the Cheapside Street Whisky Bond Fire in Glasgow, Scotland in 1960; Feyzin, France in 1966; Crescent City, Illinois in 1970; Kingman, Arizona in 1973; a liquid nitrogen tank rupture[6] at Air Products and Chemicals and Mobay Chemical Company at New Martinsville, West Virginia on January 31, 1978 [1];Texas City, Texas in 1978; Murdock, Illinois in 1983; San Juan Ixhuatepec, Mexico City in 1984; and Toronto, Ontario in 2008.
|
||||
Safety measures[edit]
|
||||
[icon] This section requires expansion. (July 2013)
|
||||
Some fire mitigation measures are listed under liquefied petroleum gas.
|
||||
See also[edit]
|
||||
Boiler explosion
|
||||
Expansion ratio
|
||||
Explosive boiling or phase explosion
|
||||
Rapid phase transition
|
||||
Viareggio train derailment
|
||||
2008 Toronto explosions
|
||||
Gas carriers
|
||||
Los Alfaques Disaster
|
||||
Lac-Mégantic derailment
|
||||
References[edit]
|
||||
^ Jump up to: a b Kletz, Trevor (March 1990). Critical Aspects of Safety and Loss Prevention. London: Butterworth–Heinemann. pp. 43–45. ISBN 0-408-04429-2.
|
||||
Jump up ^ "Temperature Pressure Relief Valves on Water Heaters: test, inspect, replace, repair guide". Inspect-ny.com. Retrieved 2011-07-12.
|
||||
Jump up ^ Liquid nitrogen BLEVE demo
|
||||
Jump up ^ "Chemical Process Safety" (PDF). Retrieved 2011-07-12.
|
||||
Jump up ^ David F. Peterson, BLEVE: Facts, Risk Factors, and Fallacies, Fire Engineering magazine (2002).
|
||||
Jump up ^ "STATE EX REL. VAPOR CORP. v. NARICK". Supreme Court of Appeals of West Virginia. 1984-07-12. Retrieved 2014-03-16.
|
||||
External links[edit]
|
||||
Look up boiling liquid expanding vapor explosion in Wiktionary, the free dictionary.
|
||||
Wikimedia Commons has media related to BLEVE.
|
||||
BLEVE Demo on YouTube — video of a controlled BLEVE demo
|
||||
huge explosions on YouTube — video of propane and isobutane BLEVEs from a train derailment at Murdock, Illinois (3 September 1983)
|
||||
Propane BLEVE on YouTube — video of BLEVE from the Toronto propane depot fire
|
||||
Moscow Ring Road Accident on YouTube - Dozens of LPG tank BLEVEs after a road accident in Moscow
|
||||
Kingman, AZ BLEVE — An account of the 5 July 1973 explosion in Kingman, with photographs
|
||||
Propane Tank Explosions — Description of circumstances required to cause a propane tank BLEVE.
|
||||
Analysis of BLEVE Events at DOE Sites - Details physics and mathematics of BLEVEs.
|
||||
HID - SAFETY REPORT ASSESSMENT GUIDE: Whisky Maturation Warehouses - The liquor is aged in wooden barrels that can suffer BLEVE.
|
||||
Categories: ExplosivesFirefightingFireTypes of fireGas technologiesIndustrial fires and explosions`)
|
3572
analysis/char/asciifolding/asciifolding.go
Normal file
File diff suppressed because it is too large
124
analysis/char/asciifolding/asciifolding_test.go
Normal file
@@ -0,0 +1,124 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package asciifolding

import (
	"fmt"
	"reflect"
	"testing"
)

func TestAsciiFoldingFilter(t *testing.T) {
	tests := []struct {
		input  []byte
		output []byte
	}{
		{
			// empty input passes
			input:  []byte(``),
			output: []byte(``),
		},
		{
			// no modification for plain ASCII
			input:  []byte(`The quick brown fox jumps over the lazy dog`),
			output: []byte(`The quick brown fox jumps over the lazy dog`),
		},
		{
			// Umlauts are folded to plain ASCII
			input:  []byte(`The quick bröwn fox jümps over the läzy dog`),
			output: []byte(`The quick brown fox jumps over the lazy dog`),
		},
		{
			// composite unicode runes are folded to more than one ASCII rune
			input:  []byte(`ÆꜴ`),
			output: []byte(`AEAO`),
		},
		{
			// apples from https://issues.couchbase.com/browse/MB-33486
			input:  []byte(`Ápple Àpple Äpple Âpple Ãpple Åpple`),
			output: []byte(`Apple Apple Apple Apple Apple Apple`),
		},
		{
			// Fix ASCII folding of \u24A2
			input:  []byte(`⒢`),
			output: []byte(`(g)`),
		},
		{
			// Test folding of \u2053 (SWUNG DASH)
			input:  []byte(`a⁓b`),
			output: []byte(`a~b`),
		},
		{
			// Test folding of \uFF5E (FULLWIDTH TILDE)
			input:  []byte(`c～d`),
			output: []byte(`c~d`),
		},
		{
			// Test folding of \uFF3F (FULLWIDTH LOW LINE) - case before tilde
			input:  []byte(`e＿f`),
			output: []byte(`e_f`),
		},
		{
			// Test mix including tilde and default fallthrough (using a character not explicitly folded)
			input:  []byte(`a⁓b✅c~d`),
			output: []byte(`a~b✅c~d`),
		},
		{
			// Test start of 'A' fallthrough block
			input:  []byte(`ÀBC`),
			output: []byte(`ABC`),
		},
		{
			// Test end of 'A' fallthrough block
			input:  []byte(`DEFẶ`),
			output: []byte(`DEFA`),
		},
		{
			// Test start of 'AE' fallthrough block
			input:  []byte(`Æ`),
			output: []byte(`AE`),
		},
		{
			// Test end of 'AE' fallthrough block
			input:  []byte(`ᴁ`),
			output: []byte(`AE`),
		},
		{
			// Test 'DZ' multi-rune output
			input:  []byte(`DŽebra`),
			output: []byte(`DZebra`),
		},
		{
			// Test start of 'a' fallthrough block
			input:  []byte(`àbc`),
			output: []byte(`abc`),
		},
		{
			// Test end of 'a' fallthrough block
			input:  []byte(`defa`),
			output: []byte(`defa`),
		},
	}

	for _, test := range tests {
		filter := New()
		t.Run(fmt.Sprintf("on %s", test.input), func(t *testing.T) {
			output := filter.Filter(test.input)
			if !reflect.DeepEqual(output, test.output) {
				t.Errorf("\nExpected:\n`%s`\ngot:\n`%s`\n", string(test.output), string(output))
			}
		})
	}
}
57
analysis/char/html/html.go
Normal file
@@ -0,0 +1,57 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package html

import (
	"bytes"
	"regexp"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "html"

var htmlCharFilterRegexp = regexp.MustCompile(`</?[!\w]+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>`)

type CharFilter struct {
	r           *regexp.Regexp
	replacement []byte
}

func New() *CharFilter {
	return &CharFilter{
		r:           htmlCharFilterRegexp,
		replacement: []byte(" "),
	}
}

func (s *CharFilter) Filter(input []byte) []byte {
	return s.r.ReplaceAllFunc(
		input, func(in []byte) []byte {
			return bytes.Repeat(s.replacement, len(in))
		})
}

func CharFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.CharFilter, error) {
	return New(), nil
}

func init() {
	err := registry.RegisterCharFilter(Name, CharFilterConstructor)
	if err != nil {
		panic(err)
	}
}
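Unlike a plain `ReplaceAll`, the `Filter` method above pads each matched tag with one space per byte, so the output keeps the input's length. Presumably this keeps the byte offsets of the surviving text aligned with the original document. A minimal standard-library sketch of the same technique follows; `tagPattern` copies the expression from html.go, while `blankTags` is a hypothetical helper name and not part of bleve.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// tagPattern is the same expression that html.go compiles for its char filter.
var tagPattern = regexp.MustCompile(`</?[!\w]+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>`)

// blankTags replaces every byte of each matched tag with a space, so the
// output has exactly the same length as the input.
func blankTags(input []byte) []byte {
	return tagPattern.ReplaceAllFunc(input, func(match []byte) []byte {
		return bytes.Repeat([]byte(" "), len(match))
	})
}

func main() {
	in := []byte("<b>bold</b> text")
	out := blankTags(in)
	fmt.Printf("%q\n", out)
	// The lengths match, so "bold" starts at the same offset in both slices.
	fmt.Println(len(in) == len(out), bytes.Index(in, []byte("bold")) == bytes.Index(out, []byte("bold")))
}
```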
65
analysis/char/regexp/regexp.go
Normal file
@@ -0,0 +1,65 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package regexp

import (
	"fmt"
	"regexp"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "regexp"

type CharFilter struct {
	r           *regexp.Regexp
	replacement []byte
}

func New(r *regexp.Regexp, replacement []byte) *CharFilter {
	return &CharFilter{
		r:           r,
		replacement: replacement,
	}
}

func (s *CharFilter) Filter(input []byte) []byte {
	return s.r.ReplaceAll(input, s.replacement)
}

func CharFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.CharFilter, error) {
	regexpStr, ok := config["regexp"].(string)
	if !ok {
		return nil, fmt.Errorf("must specify regexp")
	}
	r, err := regexp.Compile(regexpStr)
	if err != nil {
		return nil, fmt.Errorf("unable to build regexp char filter: %v", err)
	}
	replaceBytes := []byte(" ")
	replaceStr, ok := config["replace"].(string)
	if ok {
		replaceBytes = []byte(replaceStr)
	}
	return New(r, replaceBytes), nil
}

func init() {
	err := registry.RegisterCharFilter(Name, CharFilterConstructor)
	if err != nil {
		panic(err)
	}
}
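The constructor above reads two config keys: `regexp` is required, and `replace` is optional and falls back to a single space. The sketch below mirrors that handling with the standard library only, so it runs without bleve. `newRegexpFilter` is a hypothetical name, and returning a closure instead of a `*CharFilter` is a simplification made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// newRegexpFilter mirrors the config handling of CharFilterConstructor:
// "regexp" must be a string, "replace" is optional and defaults to " ".
func newRegexpFilter(config map[string]interface{}) (func([]byte) []byte, error) {
	pattern, ok := config["regexp"].(string)
	if !ok {
		return nil, fmt.Errorf("must specify regexp")
	}
	r, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("unable to build regexp char filter: %v", err)
	}
	replacement := []byte(" ")
	if s, ok := config["replace"].(string); ok {
		replacement = []byte(s)
	}
	return func(input []byte) []byte {
		// ReplaceAll treats the replacement as a template, so $1 and $2
		// expand to the corresponding capture groups.
		return r.ReplaceAll(input, replacement)
	}, nil
}

func main() {
	filter, err := newRegexpFilter(map[string]interface{}{
		"regexp":  `([a-z])\s+(\d)`,
		"replace": "$1-$2",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", filter([]byte("temp 1")))
}
```

Because the replacement is a template, a literal dollar sign in `replace` has to be written as `$$`.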
88
analysis/char/regexp/regexp_test.go
Normal file
@@ -0,0 +1,88 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package regexp

import (
	"fmt"
	"reflect"
	"regexp"
	"testing"
)

func TestRegexpCharFilter(t *testing.T) {

	tests := []struct {
		regexStr string
		replace  []byte
		input    []byte
		output   []byte
	}{
		{
			regexStr: `</?[!\w]+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>`,
			replace:  []byte{' '},
			input:    []byte(`<html>test</html>`),
			output:   []byte(` test `),
		},
		{
			regexStr: `\x{200C}`,
			replace:  []byte{' '},
			input:    []byte("water\u200Cunder\u200Cthe\u200Cbridge"),
			output:   []byte("water under the bridge"),
		},
		{
			regexStr: `([a-z])\s+(\d)`,
			replace:  []byte(`$1-$2`),
			input:    []byte(`temp 1`),
			output:   []byte(`temp-1`),
		},
		{
			regexStr: `foo.?`,
			replace:  []byte(`X`),
			input:    []byte(`seafood, fool`),
			output:   []byte(`seaX, X`),
		},
		{
			regexStr: `def`,
			replace:  []byte(`_`),
			input:    []byte(`abcdefghi`),
			output:   []byte(`abc_ghi`),
		},
		{
			regexStr: `456`,
			replace:  []byte(`000000`),
			input:    []byte(`123456789`),
			output:   []byte(`123000000789`),
		},
		{
			regexStr: `“|”`,
			replace:  []byte(`"`),
			input:    []byte(`“hello”`),
			output:   []byte(`"hello"`),
		},
	}

	for _, test := range tests {
		t.Run(fmt.Sprintf("match %s replace %s", test.regexStr, string(test.replace)), func(t *testing.T) {
			regex := regexp.MustCompile(test.regexStr)
			filter := New(regex, test.replace)

			output := filter.Filter(test.input)
			if !reflect.DeepEqual(test.output, output) {
				t.Errorf("Expected: `%s`, Got: `%s`\n", string(test.output), string(output))
			}
		})

	}
}
39
analysis/char/zerowidthnonjoiner/zerowidthnonjoiner.go
Normal file
@@ -0,0 +1,39 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package zerowidthnonjoiner

import (
	"regexp"

	"github.com/blevesearch/bleve/v2/analysis"
	regexpCharFilter "github.com/blevesearch/bleve/v2/analysis/char/regexp"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "zero_width_spaces"

var zeroWidthNonJoinerRegexp = regexp.MustCompile(`\x{200C}`)

func CharFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.CharFilter, error) {
	replaceBytes := []byte(" ")
	return regexpCharFilter.New(zeroWidthNonJoinerRegexp, replaceBytes), nil
}

func init() {
	err := registry.RegisterCharFilter(Name, CharFilterConstructor)
	if err != nil {
		panic(err)
	}
}
67
analysis/datetime/flexible/flexible.go
Normal file
@@ -0,0 +1,67 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package flexible

import (
	"fmt"
	"time"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const Name = "flexiblego"

type DateTimeParser struct {
	layouts []string
}

func New(layouts []string) *DateTimeParser {
	return &DateTimeParser{
		layouts: layouts,
	}
}

func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
	for _, layout := range p.layouts {
		rv, err := time.Parse(layout, input)
		if err == nil {
			return rv, layout, nil
		}
	}
	return time.Time{}, "", analysis.ErrInvalidDateTime
}

func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
	layouts, ok := config["layouts"].([]interface{})
	if !ok {
		return nil, fmt.Errorf("must specify layouts")
	}
	var layoutStrs []string
	for _, layout := range layouts {
		layoutStr, ok := layout.(string)
		if ok {
			layoutStrs = append(layoutStrs, layoutStr)
		}
	}
	return New(layoutStrs), nil
}

func init() {
	err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
	if err != nil {
		panic(err)
	}
}
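`ParseDateTime` above tries each layout in order and returns the first one that parses, along with the layout that matched. The standard-library sketch below shows that first-match behavior in isolation. `parseFirst`, the `layouts` variable, and `errInvalidDateTime` (a stand-in for `analysis.ErrInvalidDateTime` with a made-up message) are assumptions introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errInvalidDateTime is a hypothetical stand-in for analysis.ErrInvalidDateTime.
var errInvalidDateTime = errors.New("no layout matched the input")

// layouts is tried from top to bottom, most specific first.
var layouts = []string{
	time.RFC3339Nano,
	time.RFC3339,
	"2006-01-02T15:04:05",
	"2006-01-02 15:04:05",
	"2006-01-02",
}

// parseFirst returns the parsed time and the first layout that accepts input.
func parseFirst(layouts []string, input string) (time.Time, string, error) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, input); err == nil {
			return t, layout, nil
		}
	}
	return time.Time{}, "", errInvalidDateTime
}

func main() {
	for _, in := range []string{"2014-08-03", "2014-08-03T15:59:30-08:00", "not a date time"} {
		t, layout, err := parseFirst(layouts, in)
		fmt.Printf("input=%q layout=%q time=%v err=%v\n", in, layout, t, err)
	}
}
```

Since the first match wins, the order of the list decides which layout is reported when more than one could accept the same input.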
100
analysis/datetime/flexible/flexible_test.go
Normal file
@@ -0,0 +1,100 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package flexible

import (
	"reflect"
	"testing"
	"time"

	"github.com/blevesearch/bleve/v2/analysis"
)

func TestFlexibleDateTimeParser(t *testing.T) {
	testLocation := time.FixedZone("", -8*60*60)

	rfc3339NoTimezone := "2006-01-02T15:04:05"
	rfc3339NoTimezoneNoT := "2006-01-02 15:04:05"
	rfc3339NoTime := "2006-01-02"

	dateOptionalTimeParser := New(
		[]string{
			time.RFC3339Nano,
			time.RFC3339,
			rfc3339NoTimezone,
			rfc3339NoTimezoneNoT,
			rfc3339NoTime,
		})

	tests := []struct {
		input          string
		expectedTime   time.Time
		expectedLayout string
		expectedError  error
	}{
		{
			input:          "2014-08-03",
			expectedTime:   time.Date(2014, 8, 3, 0, 0, 0, 0, time.UTC),
			expectedLayout: rfc3339NoTime,
			expectedError:  nil,
		},
		{
			input:          "2014-08-03T15:59:30",
			expectedTime:   time.Date(2014, 8, 3, 15, 59, 30, 0, time.UTC),
			expectedLayout: rfc3339NoTimezone,
			expectedError:  nil,
		},
		{
			input:          "2014-08-03 15:59:30",
			expectedTime:   time.Date(2014, 8, 3, 15, 59, 30, 0, time.UTC),
			expectedLayout: rfc3339NoTimezoneNoT,
			expectedError:  nil,
		},
		{
			input:          "2014-08-03T15:59:30-08:00",
			expectedTime:   time.Date(2014, 8, 3, 15, 59, 30, 0, testLocation),
			expectedLayout: time.RFC3339Nano,
			expectedError:  nil,
		},
		{

			input:          "2014-08-03T15:59:30.999999999-08:00",
			expectedTime:   time.Date(2014, 8, 3, 15, 59, 30, 999999999, testLocation),
			expectedLayout: time.RFC3339Nano,
			expectedError:  nil,
		},
		{
			input:          "not a date time",
			expectedTime:   time.Time{},
			expectedLayout: "",
			expectedError:  analysis.ErrInvalidDateTime,
		},
	}

	for _, test := range tests {
		t.Run(test.input, func(t *testing.T) {
			actualTime, actualLayout, actualErr := dateOptionalTimeParser.ParseDateTime(test.input)
			if actualErr != test.expectedError {
				t.Fatalf("expected error %#v, got %#v", test.expectedError, actualErr)
			}
			if !reflect.DeepEqual(actualTime, test.expectedTime) {
				t.Errorf("expected time %v, got %v", test.expectedTime, actualTime)
			}
			if !reflect.DeepEqual(actualLayout, test.expectedLayout) {
				t.Errorf("expected layout %v, got %v", test.expectedLayout, actualLayout)
			}
		})
	}
}
250
analysis/datetime/iso/iso.go
Normal file
|
@ -0,0 +1,250 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package iso
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "isostyle"
|
||||
|
||||
var textLiteralDelimiter byte = '\'' // single quote
|
||||
|
||||
// ISO-style format strings use the pattern letters documented at
|
||||
// https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html
|
||||
//
|
||||
// Some format specifiers have no equivalent in Go's time package, such as:
|
||||
// - 'V' for timezone name, like 'Europe/Berlin' or 'America/New_York'.
|
||||
// - 'Q' for quarter of year, like Q3 or 3rd Quarter.
|
||||
// - 'zzzz' for full name of timezone like "Japan Standard Time" or "Eastern Standard Time".
|
||||
// - 'O' for localized zone-offset, like GMT+8 or GMT+08:00.
|
||||
// - '[]' for optional section of the format.
|
||||
// - 'G' for era, like AD or BC.
|
||||
// - 'W' for week of month.
|
||||
// - 'D' for day of year.
|
||||
// So date strings with these date elements cannot be parsed.
|
||||
var timeElementToLayout = map[byte]map[int]string{
|
||||
'M': {
|
||||
4: "January", // MMMM = full month name
|
||||
3: "Jan", // MMM = short month name
|
||||
2: "01", // MM = month of year (2 digits) (01-12)
|
||||
1: "1", // M = month of year (1 digit) (1-12)
|
||||
},
|
||||
'd': {
|
||||
2: "02", // dd = day of month (2 digits) (01-31)
|
||||
1: "2", // d = day of month (1 digit) (1-31)
|
||||
},
|
||||
'a': {
|
||||
2: "pm", // aa = pm/am
|
||||
1: "PM", // a = PM/AM
|
||||
},
|
||||
'H': {
|
||||
2: "15", // HH = hour (24 hour clock) (2 digits)
|
||||
1: "15", // H = hour (24 hour clock); Go has no separate unpadded 24-hour element, so this matches HH
|
||||
},
|
||||
'm': {
|
||||
2: "04", // mm = minute (2 digits)
|
||||
1: "4", // m = minute (1 digit)
|
||||
},
|
||||
's': {
|
||||
2: "05", // ss = seconds (2 digits)
|
||||
1: "5", // s = seconds (1 digit)
|
||||
},
|
||||
|
||||
// timezone offsets from UTC below
|
||||
'X': {
|
||||
5: "Z07:00:00", // XXXXX = timezone offset (+-hh:mm:ss)
|
||||
4: "Z070000", // XXXX = timezone offset (+-hhmmss)
|
||||
3: "Z07:00", // XXX = timezone offset (+-hh:mm)
|
||||
2: "Z0700", // XX = timezone offset (+-hhmm)
|
||||
1: "Z07", // X = timezone offset (+-hh)
|
||||
},
|
||||
'x': {
|
||||
5: "-07:00:00", // xxxxx = timezone offset (+-hh:mm:ss)
|
||||
4: "-070000", // xxxx = timezone offset (+-hhmmss)
|
||||
3: "-07:00", // xxx = timezone offset (+-hh:mm)
|
||||
2: "-0700", // xx = timezone offset (+-hhmm)
|
||||
1: "-07", // x = timezone offset (+-hh)
|
||||
},
|
||||
}
|
||||
|
||||
type DateTimeParser struct {
|
||||
layouts []string
|
||||
}
|
||||
|
||||
func New(layouts []string) *DateTimeParser {
|
||||
return &DateTimeParser{
|
||||
layouts: layouts,
|
||||
}
|
||||
}
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
for _, layout := range p.layouts {
|
||||
rv, err := time.Parse(layout, input)
|
||||
if err == nil {
|
||||
return rv, layout, nil
|
||||
}
|
||||
}
|
||||
return time.Time{}, "", analysis.ErrInvalidDateTime
|
||||
}
|
||||
|
||||
func letterCounter(layout string, idx int) int {
|
||||
count := 1
|
||||
for idx+count < len(layout) {
|
||||
if layout[idx+count] == layout[idx] {
|
||||
count++
|
||||
} else {
|
||||
break
|
||||
}
|
||||
}
|
||||
return count
|
||||
}
|
||||
|
||||
func invalidFormatError(character byte, count int) error {
|
||||
return fmt.Errorf("invalid format string, unknown format specifier: %s", strings.Repeat(string(character), count))
|
||||
}
|
||||
|
||||
func parseISOString(layout string) (string, error) {
|
||||
var dateTimeLayout strings.Builder
|
||||
|
||||
for idx := 0; idx < len(layout); {
|
||||
// check if the character is a text literal delimiter (')
|
||||
if layout[idx] == textLiteralDelimiter {
|
||||
if idx+1 < len(layout) && layout[idx+1] == textLiteralDelimiter {
|
||||
// if the next character is also a text literal delimiter, then
|
||||
// copy the character as is
|
||||
dateTimeLayout.WriteByte(textLiteralDelimiter)
|
||||
idx += 2
|
||||
continue
|
||||
}
|
||||
// find the next text literal delimiter
|
||||
for idx++; idx < len(layout); idx++ {
|
||||
if layout[idx] == textLiteralDelimiter {
|
||||
break
|
||||
}
|
||||
dateTimeLayout.WriteByte(layout[idx])
|
||||
}
|
||||
// idx can either be equal to len(layout) if the text literal delimiter is not found
|
||||
// after the first text literal delimiter or it will be equal to the index of the
|
||||
// second text literal delimiter
|
||||
if idx == len(layout) {
|
||||
// text literal delimiter not found error
|
||||
return "", fmt.Errorf("invalid format string, expected text literal delimiter: %s", string(textLiteralDelimiter))
|
||||
}
|
||||
// increment idx to skip the second text literal delimiter
|
||||
idx++
|
||||
continue
|
||||
}
|
||||
// check if character is a letter in english alphabet - a-zA-Z which are reserved
|
||||
// for format specifiers
|
||||
if (layout[idx] >= 'a' && layout[idx] <= 'z') || (layout[idx] >= 'A' && layout[idx] <= 'Z') {
|
||||
// find the number of times the character occurs consecutively
|
||||
count := letterCounter(layout, idx)
|
||||
character := layout[idx]
|
||||
// first check the table
|
||||
if goLayout, ok := timeElementToLayout[character][count]; ok {
|
||||
dateTimeLayout.WriteString(goLayout)
|
||||
} else {
|
||||
switch character {
|
||||
case 'y', 'u', 'Y':
|
||||
// year
|
||||
if count == 2 {
|
||||
dateTimeLayout.WriteString("06")
|
||||
} else {
|
||||
format := fmt.Sprintf("%%0%ds", count)
|
||||
dateTimeLayout.WriteString(fmt.Sprintf(format, "2006"))
|
||||
}
|
||||
case 'h', 'K':
|
||||
// hour (1-12)
|
||||
switch count {
|
||||
case 2:
|
||||
// hh, KK -> 03
|
||||
dateTimeLayout.WriteString("03")
|
||||
case 1:
|
||||
// h, K -> 3
|
||||
dateTimeLayout.WriteString("3")
|
||||
default:
|
||||
// e.g., hhh
|
||||
return "", invalidFormatError(character, count)
|
||||
}
|
||||
case 'E':
|
||||
// day of week
|
||||
if count == 4 {
|
||||
dateTimeLayout.WriteString("Monday") // EEEE -> Monday
|
||||
} else if count <= 3 {
|
||||
dateTimeLayout.WriteString("Mon") // E, EE, EEE -> Mon
|
||||
} else {
|
||||
return "", invalidFormatError(character, count) // e.g., EEEEE
|
||||
}
|
||||
case 'S':
|
||||
// fraction of second
|
||||
// .SSS = millisecond
|
||||
// .SSSSSS = microsecond
|
||||
// .SSSSSSSSS = nanosecond
|
||||
if count > 9 {
|
||||
return "", invalidFormatError(character, count)
|
||||
}
|
||||
dateTimeLayout.WriteString(strings.Repeat(string('0'), count))
|
||||
case 'z':
|
||||
// timezone id
|
||||
if count < 5 {
|
||||
dateTimeLayout.WriteString("MST")
|
||||
} else {
|
||||
return "", invalidFormatError(character, count)
|
||||
}
|
||||
default:
|
||||
return "", invalidFormatError(character, count)
|
||||
}
|
||||
}
|
||||
idx += count
|
||||
} else {
|
||||
// copy the character as is
|
||||
dateTimeLayout.WriteByte(layout[idx])
|
||||
idx++
|
||||
}
|
||||
}
|
||||
return dateTimeLayout.String(), nil
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
layouts, ok := config["layouts"].([]interface{})
|
||||
if !ok {
|
||||
return nil, fmt.Errorf("must specify layouts")
|
||||
}
|
||||
var layoutStrs []string
|
||||
for _, layout := range layouts {
|
||||
layoutStr, ok := layout.(string)
|
||||
if ok {
|
||||
layout, err := parseISOString(layoutStr)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
layoutStrs = append(layoutStrs, layout)
|
||||
}
|
||||
}
|
||||
return New(layoutStrs), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
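The conversion in iso.go can be illustrated with a much smaller, standalone sketch. Everything here is an assumption made for illustration: `toGoLayout` and the `subset` table are hypothetical names, only six specifiers are covered, and the doubled-quote escape (`''`) handled by `parseISOString` is omitted. It shows the general idea of measuring a run of identical letters, looking it up, and copying quoted text verbatim, then feeding the result to `time.Parse`.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// subset is a hypothetical, reduced lookup table; the real table in iso.go
// is keyed by letter and repeat count and covers many more specifiers.
var subset = map[string]string{
	"yyyy": "2006",
	"MM":   "01",
	"dd":   "02",
	"HH":   "15",
	"mm":   "04",
	"ss":   "05",
}

// toGoLayout is a hypothetical helper that converts a small subset of
// ISO-style patterns into a Go reference layout.
func toGoLayout(pattern string) (string, error) {
	var out strings.Builder
	for i := 0; i < len(pattern); {
		c := pattern[i]
		switch {
		case c == '\'':
			// Copy quoted text verbatim up to the closing quote.
			end := strings.IndexByte(pattern[i+1:], '\'')
			if end < 0 {
				return "", fmt.Errorf("unterminated text literal in %q", pattern)
			}
			out.WriteString(pattern[i+1 : i+1+end])
			i += end + 2
		case (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'):
			// Measure the run of identical letters and look it up.
			j := i
			for j < len(pattern) && pattern[j] == c {
				j++
			}
			layout, ok := subset[pattern[i:j]]
			if !ok {
				return "", fmt.Errorf("unsupported specifier %q", pattern[i:j])
			}
			out.WriteString(layout)
			i = j
		default:
			// Punctuation and digits are copied as is.
			out.WriteByte(c)
			i++
		}
	}
	return out.String(), nil
}

func main() {
	layout, err := toGoLayout("yyyy-MM-dd'T'HH:mm:ss")
	if err != nil {
		panic(err)
	}
	ts, err := time.Parse(layout, "2014-08-03T15:59:30")
	if err != nil {
		panic(err)
	}
	fmt.Println(layout)                  // 2006-01-02T15:04:05
	fmt.Println(ts.Format(time.RFC3339)) // 2014-08-03T15:59:30Z
}
```

Keying the sketch's table by the whole run keeps it short; the table in iso.go is keyed by letter and count instead, which lets the fallback switch handle open-ended cases such as year padding and fractional seconds.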
182
analysis/datetime/iso/iso_test.go
Normal file
|
@ -0,0 +1,182 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package iso
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestConversionFromISOStyle(t *testing.T) {
|
||||
tests := []struct {
|
||||
input string
|
||||
output string
|
||||
err error
|
||||
}{
|
||||
{
|
||||
input: "yyyy-MM-dd",
|
||||
output: "2006-01-02",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "uuu/M''''dd'T'HH:m:ss.SSS",
|
||||
output: "2006/1''02T15:4:05.000",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "YYYY-MM-dd'T'H:mm:ss zzz",
|
||||
output: "2006-01-02T15:04:05 MST",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "MMMM dd yyyy', 'HH:mm:ss.SSS",
|
||||
output: "January 02 2006, 15:04:05.000",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "h 'o'''' clock' a, XXX",
|
||||
output: "3 o' clock PM, Z07:00",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "YYYY-MM-dd'T'HH:mm:ss'Z'",
|
||||
output: "2006-01-02T15:04:05Z",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "E MMM d H:mm:ss z Y",
|
||||
output: "Mon Jan 2 15:04:05 MST 2006",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "E MMM DD H:m:s z Y",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: DD"),
|
||||
},
|
||||
{
|
||||
input: "E MMM''''' H:m:s z Y",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, expected text literal delimiter: '"),
|
||||
},
|
||||
{
|
||||
input: "MMMMM dd yyyy', 'HH:mm:ss.SSS",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: MMMMM"),
|
||||
},
|
||||
{
|
||||
input: "yy", // year (2 digits)
|
||||
output: "06",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "yyyyy", // year (5 digits, padded)
|
||||
output: "02006",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "h", // hour 1-12 (1 digit)
|
||||
output: "3",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "hh", // hour 1-12 (2 digits)
|
||||
output: "03",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "KK", // hour 1-12 (2 digits, alt)
|
||||
output: "03",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "hhh", // invalid hour count
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: hhh"),
|
||||
},
|
||||
{
|
||||
input: "E", // Day of week (short)
|
||||
output: "Mon",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "EEE", // Day of week (short)
|
||||
output: "Mon",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "EEEE", // Day of week (long)
|
||||
output: "Monday",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "EEEEE", // Invalid day-of-week count
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: EEEEE"),
|
||||
},
|
||||
{
|
||||
input: "S", // Fraction of second (1 digit)
|
||||
output: "0",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "SSSSSSSSS", // Fraction of second (9 digits)
|
||||
output: "000000000",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "SSSSSSSSSS", // Invalid fraction of second count
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: SSSSSSSSSS"),
|
||||
},
|
||||
{
|
||||
input: "z", // Timezone name (short)
|
||||
output: "MST",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
input: "zzz", // Timezone name (short)
|
||||
output: "MST", // Should output MST
|
||||
err: nil, // Should not produce an error
|
||||
},
|
||||
{
|
||||
input: "zzzz", // Timezone name (long form is accepted but maps to the short layout)
|
||||
output: "MST", // Should output MST
|
||||
err: nil, // Should not produce an error
|
||||
},
|
||||
{
|
||||
input: "G", // Era designator (unsupported)
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: G"),
|
||||
},
|
||||
{
|
||||
input: "W", // Week of month (unsupported)
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: W"),
|
||||
},
|
||||
}
|
||||
for i, test := range tests {
|
||||
t.Run(fmt.Sprintf("test %d: %s", i, test.input), func(t *testing.T) {
|
||||
out, err := parseISOString(test.input)
|
||||
// Check error matching
|
||||
if (err != nil && test.err == nil) || (err == nil && test.err != nil) || (err != nil && test.err != nil && err.Error() != test.err.Error()) {
|
||||
t.Fatalf("expected error %v, got error %v", test.err, err)
|
||||
}
|
||||
// Check output matching only if no error was expected/occurred
|
||||
if err == nil && test.err == nil && out != test.output {
|
||||
t.Fatalf("expected output '%v', got '%v'", test.output, out)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
50
analysis/datetime/optional/optional.go
Normal file
|
@ -0,0 +1,50 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package optional
|
||||
|
||||
import (
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/datetime/flexible"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "dateTimeOptional"
|
||||
|
||||
const rfc3339NoTimezone = "2006-01-02T15:04:05"
|
||||
const rfc3339NoTimezoneNoT = "2006-01-02 15:04:05"
|
||||
const rfc3339Offset = "2006-01-02 15:04:05 -0700"
|
||||
const rfc3339NoTime = "2006-01-02"
|
||||
|
||||
var layouts = []string{
|
||||
time.RFC3339Nano,
|
||||
time.RFC3339,
|
||||
rfc3339NoTimezone,
|
||||
rfc3339NoTimezoneNoT,
|
||||
rfc3339Offset,
|
||||
rfc3339NoTime,
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
return flexible.New(layouts), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
205
analysis/datetime/percent/percent.go
Normal file
|
@ -0,0 +1,205 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package percent
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "percentstyle"
|
||||
|
||||
var formatDelimiter byte = '%'
|
||||
|
||||
// format specifiers as per strftime in the C standard library
|
||||
// https://man7.org/linux/man-pages/man3/strftime.3.html
|
||||
var formatSpecifierToLayout = map[byte]string{
|
||||
formatDelimiter: string(formatDelimiter), // %% = % (literal %)
|
||||
'a': "Mon", // %a = short weekday name
|
||||
'A': "Monday", // %A = full weekday name
|
||||
'd': "02", // %d = day of month (2 digits) (01-31)
|
||||
'e': "2", // %e = day of month (1 digit) (1-31)
|
||||
'b': "Jan", // %b = short month name
|
||||
'B': "January", // %B = full month name
|
||||
'm': "01", // %m = month of year (2 digits) (01-12)
|
||||
'y': "06", // %y = year without century
|
||||
'Y': "2006", // %Y = year with century
|
||||
'H': "15", // %H = hour (24 hour clock) (2 digits)
|
||||
'I': "03", // %I = hour (12 hour clock) (2 digits)
|
||||
'l': "3", // %l = hour (12 hour clock) (1 digit)
|
||||
'p': "PM", // %p = PM/AM
|
||||
'P': "pm", // %P = pm/am (lowercase)
|
||||
'M': "04", // %M = minute (2 digits)
|
||||
'S': "05", // %S = seconds (2 digits)
|
||||
'f': "999999", // .%f = fraction of seconds - up to microseconds (6 digits) - deci/milli/micro
|
||||
'Z': "MST", // %Z = timezone name (GMT, JST, UTC etc)
|
||||
// %z is present in timezone options
|
||||
|
||||
// extensions that are not part of this package's strftime mapping, covering
|
||||
// cases such as unpadded month, minute and seconds, and nanosecond precision
|
||||
'o': "1", // %o = month of year (1 digit) (1-12)
|
||||
'i': "4", // %i = minute (1 digit)
|
||||
's': "5", // %s = seconds (1 digit)
|
||||
'N': "999999999", // .%N = fraction of seconds - up to nanoseconds (9 digits) - milli/micro/nano
|
||||
}
|
||||
|
||||
// some additional options for timezone
|
||||
// such as allowing colon in timezone offset and specifying the seconds
|
||||
// timezone offsets are from UTC
|
||||
var timezoneOptions = map[string]string{
|
||||
"z": "Z0700", // %z = timezone offset in +-hhmm / +-(2 digit hour)(2 digit minute) +0500, -0600 etc
|
||||
"z:M": "Z07:00", // %z:M = timezone offset(+-hh:mm) / +-(2 digit hour):(2 digit minute) +05:00, -06:00 etc
|
||||
"z:S": "Z07:00:00", // %z:S = timezone offset(+-hh:mm:ss) / +-(2 digit hour):(2 digit minute):(2 digit second) +05:20:00, -06:30:00 etc
|
||||
"zH": "Z07", // %zH = timezone offset(+-hh) / +-(2 digit hour) +05, -06 etc
|
||||
"zS": "Z070000", // %zS = timezone offset(+-hhmmss) / +-(2 digit hour)(2 digit minute)(2 digit second) +052000, -063000 etc
|
||||
}
|
||||
|
||||
type DateTimeParser struct {
|
||||
layouts []string
|
||||
}
|
||||
|
||||
func New(layouts []string) *DateTimeParser {
|
||||
return &DateTimeParser{
|
||||
layouts: layouts,
|
||||
}
|
||||
}
|
||||
|
||||
func checkTZOptions(formatString string, idx int) (string, int) {
|
||||
// idx points to '%'
|
||||
// We know formatString[idx+1] == 'z'
|
||||
nextIdx := idx + 2 // Index of the character immediately after 'z'
|
||||
|
||||
// Default values assume only '%z' is present
|
||||
layout := timezoneOptions["z"]
|
||||
finalIdx := nextIdx // Index after '%z'
|
||||
|
||||
if nextIdx < len(formatString) {
|
||||
switch formatString[nextIdx] {
|
||||
case ':':
|
||||
// Check for modifier after the colon ':'
|
||||
colonModifierIdx := nextIdx + 1
|
||||
if colonModifierIdx < len(formatString) {
|
||||
switch formatString[colonModifierIdx] {
|
||||
case 'M':
|
||||
// Found %z:M
|
||||
layout = timezoneOptions["z:M"]
|
||||
finalIdx = colonModifierIdx + 1 // Index after %z:M
|
||||
case 'S':
|
||||
// Found %z:S
|
||||
layout = timezoneOptions["z:S"]
|
||||
finalIdx = colonModifierIdx + 1 // Index after %z:S
|
||||
// default: If %z: is followed by something else, or just %z: at the end.
|
||||
// Keep the default layout ("z") and finalIdx (idx + 2).
|
||||
// The ':' will be treated as a literal by the main loop.
|
||||
}
|
||||
}
|
||||
// else: %z: is at the very end of the string.
|
||||
// Keep the default layout ("z") and finalIdx (idx + 2).
|
||||
// The ':' will be treated as a literal by the main loop.
|
||||
|
||||
case 'H':
|
||||
// Found %zH
|
||||
layout = timezoneOptions["zH"]
|
||||
finalIdx = nextIdx + 1 // Index after %zH
|
||||
case 'S':
|
||||
// Found %zS
|
||||
layout = timezoneOptions["zS"]
|
||||
finalIdx = nextIdx + 1 // Index after %zS
|
||||
|
||||
// default: If %z is followed by something other than ':', 'H', or 'S'.
|
||||
// Keep the default layout ("z") and finalIdx (idx + 2).
|
||||
// The character formatString[nextIdx] will be handled by the main loop.
|
||||
}
|
||||
}
|
||||
// else: %z is at the very end of the string.
|
||||
// Keep the default layout ("z") and finalIdx (idx + 2).
|
||||
|
||||
return layout, finalIdx
|
||||
}
|
||||
|
||||
func parseFormatString(formatString string) (string, error) {
|
||||
var dateTimeLayout strings.Builder
|
||||
// iterate over the format string and replace the format specifiers with
|
||||
// the corresponding golang constants
|
||||
for idx := 0; idx < len(formatString); {
|
||||
// check if the character is a format delimiter (%)
|
||||
if formatString[idx] == formatDelimiter {
|
||||
// check if there is a character after the format delimiter (%)
|
||||
if idx+1 >= len(formatString) {
|
||||
return "", fmt.Errorf("invalid format string, expected character after %s", string(formatDelimiter))
|
||||
}
|
||||
formatSpecifier := formatString[idx+1]
|
||||
if layout, ok := formatSpecifierToLayout[formatSpecifier]; ok {
|
||||
dateTimeLayout.WriteString(layout)
|
||||
idx += 2
|
||||
} else if formatSpecifier == 'z' {
|
||||
// did not find a valid specifier
|
||||
// check if it is for timezone
|
||||
var tzLayout string
|
||||
tzLayout, idx = checkTZOptions(formatString, idx)
|
||||
dateTimeLayout.WriteString(tzLayout)
|
||||
} else {
|
||||
return "", fmt.Errorf("invalid format string, unknown format specifier: %s", string(formatSpecifier))
|
||||
}
|
||||
continue
|
||||
}
|
||||
// copy the character as is
|
||||
dateTimeLayout.WriteByte(formatString[idx])
|
||||
idx++
|
||||
}
|
||||
return dateTimeLayout.String(), nil
|
||||
}
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
for _, layout := range p.layouts {
|
||||
rv, err := time.Parse(layout, input)
|
||||
if err == nil {
|
||||
return rv, layout, nil
|
||||
}
|
||||
}
|
||||
return time.Time{}, "", analysis.ErrInvalidDateTime
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
layouts, ok := config["layouts"].([]interface{})
|
||||
if !ok {
|
||||
return nil, fmt.Errorf("must specify layouts")
|
||||
}
|
||||
|
||||
layoutStrs := make([]string, 0, len(layouts))
|
||||
for _, layout := range layouts {
|
||||
layoutStr, ok := layout.(string)
|
||||
if ok {
|
||||
layout, err := parseFormatString(layoutStr)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
layoutStrs = append(layoutStrs, layout)
|
||||
}
|
||||
}
|
||||
|
||||
return New(layoutStrs), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
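The timezone variants in `timezoneOptions` map to Go layout elements that `time.Parse` interprets directly. The following standalone sketch shows how a few of those layouts behave on sample inputs; `offsetSeconds` is a hypothetical helper introduced only for this illustration, and the sample dates are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// offsetSeconds is a hypothetical helper: it parses value with layout and
// returns the UTC offset of the resulting time, in seconds.
func offsetSeconds(layout, value string) (int, error) {
	t, err := time.Parse(layout, value)
	if err != nil {
		return 0, err
	}
	_, offset := t.Zone()
	return offset, nil
}

func main() {
	cases := []struct{ layout, value string }{
		{"2006-01-02 Z0700", "2023-10-27 +0530"},   // from "%Y-%m-%d %z"
		{"2006-01-02 Z07:00", "2023-10-27 -06:00"}, // from "%Y-%m-%d %z:M"
		{"2006-01-02 Z07", "2023-10-27 +05"},       // from "%Y-%m-%d %zH"
		{"2006-01-02 Z0700", "2023-10-27 Z"},       // a literal Z means UTC
	}
	for _, c := range cases {
		off, err := offsetSeconds(c.layout, c.value)
		fmt.Printf("%-20s %-20s offset=%d err=%v\n", c.layout, c.value, off, err)
	}
}
```

The `Z`-prefixed elements accept either a literal `Z` or a signed numeric offset, so a layout produced from `%z` can parse both UTC timestamps and timestamps with explicit offsets.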
474
analysis/datetime/percent/percent_test.go
Normal file
|
@ -0,0 +1,474 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package percent
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"reflect"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
)
|
||||
|
||||
func TestConversionFromPercentStyle(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string // subtest name
|
||||
input string
|
||||
output string
|
||||
err error
|
||||
}{
|
||||
{
|
||||
name: "basic YMD",
|
||||
input: "%Y-%m-%d",
|
||||
output: "2006-01-02",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "YMD with double percent and literal T",
|
||||
input: "%Y/%m%%%%%dT%H%M:%S",
|
||||
output: "2006/01%%02T1504:05",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "YMD T HMS Z z",
|
||||
input: "%Y-%m-%dT%H:%M:%S %Z%z",
|
||||
output: "2006-01-02T15:04:05 MSTZ0700",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "Full month, unpadded day/hour/minute, am/pm, z:M",
|
||||
input: "%B %e, %Y %l:%i %P %z:M",
|
||||
output: "January 2, 2006 3:4 pm Z07:00",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "Long format with literals and timezone literal :S",
|
||||
input: "Hour %H Minute %Mseconds %S.%N Timezone:%Z:S, Weekday %a; Day %d Month %b, Year %y",
|
||||
output: "Hour 15 Minute 04seconds 05.999999999 Timezone:MST:S, Weekday Mon; Day 02 Month Jan, Year 06",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "YMD T HMS with nanoseconds",
|
||||
input: "%Y-%m-%dT%H:%M:%S.%N",
|
||||
output: "2006-01-02T15:04:05.999999999",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z z",
|
||||
input: "%H:%M:%S %Z %z",
|
||||
output: "15:04:05 MST Z0700",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z z literal colon",
|
||||
input: "%H:%M:%S %Z %z:",
|
||||
output: "15:04:05 MST Z0700:",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z z:M",
|
||||
input: "%H:%M:%S %Z %z:M",
|
||||
output: "15:04:05 MST Z07:00",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z z:S",
|
||||
input: "%H:%M:%S %Z %z:S",
|
||||
output: "15:04:05 MST Z07:00:00",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z z: literal A",
|
||||
input: "%H:%M:%S %Z %z:A",
|
||||
output: "15:04:05 MST Z0700:A",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z z literal M",
|
||||
input: "%H:%M:%S %Z %zM",
|
||||
output: "15:04:05 MST Z0700M",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z zH",
|
||||
input: "%H:%M:%S %Z %zH",
|
||||
output: "15:04:05 MST Z07",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "HMS Z zS",
|
||||
input: "%H:%M:%S %Z %zS",
|
||||
output: "15:04:05 MST Z070000",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "Complex combination z zS z: zH",
|
||||
input: "%H:%M:%S %Z %z%Z %zS%z:%zH",
|
||||
output: "15:04:05 MST Z0700MST Z070000Z0700:Z07",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "z at end",
|
||||
input: "%Y-%m-%d %z",
|
||||
output: "2006-01-02 Z0700",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "z: at end",
|
||||
input: "%Y-%m-%d %z:",
|
||||
output: "2006-01-02 Z0700:",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "zH at end",
|
||||
input: "%Y-%m-%d %zH",
|
||||
output: "2006-01-02 Z07",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "zS at end",
|
||||
input: "%Y-%m-%d %zS",
|
||||
output: "2006-01-02 Z070000",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "z:M at end",
|
||||
input: "%Y-%m-%d %z:M",
|
||||
output: "2006-01-02 Z07:00",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "z:S at end",
|
||||
input: "%Y-%m-%d %z:S",
|
||||
output: "2006-01-02 Z07:00:00",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "z followed by literal X",
|
||||
input: "%Y-%m-%d %zX",
|
||||
output: "2006-01-02 Z0700X",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "z: followed by literal X",
|
||||
input: "%Y-%m-%d %z:X",
|
||||
output: "2006-01-02 Z0700:X",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "Invalid specifier T",
|
||||
input: "%Y-%m-%d%T%H:%M:%S %ZM",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: T"),
|
||||
},
|
||||
{
|
||||
name: "Ends with %",
|
||||
input: "%Y-%m-%dT%H:%M:%S %ZM%",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, expected character after %%"),
|
||||
},
|
||||
{
|
||||
name: "Just %",
|
||||
input: "%",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, expected character after %%"),
|
||||
},
|
||||
{
|
||||
name: "Just %%",
|
||||
input: "%%",
|
||||
output: "%",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "Unknown specifier x",
|
||||
input: "%x",
|
||||
output: "",
|
||||
err: fmt.Errorf("invalid format string, unknown format specifier: x"),
|
||||
},
|
||||
{
|
||||
name: "Literal prefix",
|
||||
input: "literal %Y",
|
||||
output: "literal 2006",
|
||||
err: nil,
|
||||
},
|
||||
{
|
||||
name: "Literal suffix",
|
||||
input: "%Y literal",
|
||||
output: "2006 literal",
|
||||
err: nil,
|
||||
},
|
||||
}
|
||||
for _, test := range tests {
|
||||
t.Run(test.name, func(t *testing.T) {
|
||||
out, err := parseFormatString(test.input)
|
||||
|
||||
// Enhanced Error Check:
|
||||
expectedErrStr := ""
|
||||
if test.err != nil {
|
||||
expectedErrStr = test.err.Error()
|
||||
}
|
||||
actualErrStr := ""
|
||||
if err != nil {
|
||||
actualErrStr = err.Error()
|
||||
}
|
||||
|
||||
if expectedErrStr != actualErrStr {
|
||||
// Provide more detailed output if errors don't match as strings
|
||||
t.Fatalf("error mismatch:\nExpected error: %q\nGot error : %q", expectedErrStr, actualErrStr)
|
||||
}
|
||||
|
||||
// Original error presence check (redundant if string check passes, but safe to keep)
|
||||
if (err != nil && test.err == nil) || (err == nil && test.err != nil) {
|
||||
t.Fatalf("presence mismatch: expected error %v, got error %v", test.err, err)
|
||||
}
|
||||
|
||||
// Check output matching only if no error was expected/occurred
|
||||
if err == nil && test.err == nil && out != test.output {
|
||||
t.Fatalf("output mismatch: expected '%v', got '%v'", test.output, out)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestDateTimeParser_ParseDateTime(t *testing.T) {
|
||||
// Pre-create some parsers with known Go layouts
|
||||
parser1 := New([]string{"2006-01-02", "01/02/2006"}) // YYYY-MM-DD, MM/DD/YYYY
|
||||
parser2 := New([]string{"15:04:05"}) // HH:MM:SS
|
||||
parserEmpty := New([]string{}) // No layouts
|
||||
|
||||
// Define expected time values
|
||||
time1, _ := time.Parse("2006-01-02", "2023-10-27")
|
||||
time2, _ := time.Parse("01/02/2006", "10/27/2023")
|
||||
time3, _ := time.Parse("15:04:05", "14:30:00")
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
parser *DateTimeParser
|
||||
input string
|
||||
expectTime time.Time
|
||||
expectLayout string
|
||||
expectErr error
|
||||
}{
|
||||
{
|
||||
name: "match first layout",
|
||||
parser: parser1,
|
||||
input: "2023-10-27",
|
||||
expectTime: time1,
|
||||
expectLayout: "2006-01-02",
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "match second layout",
|
||||
parser: parser1,
|
||||
input: "10/27/2023",
|
||||
expectTime: time2,
|
||||
expectLayout: "01/02/2006",
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "no matching layout",
|
||||
parser: parser1,
|
||||
input: "14:30:00", // Matches parser2's layout, not parser1's
|
||||
expectTime: time.Time{},
|
||||
expectLayout: "",
|
||||
expectErr: analysis.ErrInvalidDateTime,
|
||||
},
|
||||
{
|
||||
name: "match only layout",
|
||||
parser: parser2,
|
||||
input: "14:30:00",
|
||||
expectTime: time3,
|
||||
expectLayout: "15:04:05",
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "invalid date format for layout",
|
||||
parser: parser1,
|
||||
input: "27-10-2023", // Wrong separators
|
||||
expectTime: time.Time{},
|
||||
expectLayout: "",
|
||||
expectErr: analysis.ErrInvalidDateTime, // time.Parse fails on all, returns ErrInvalidDateTime
|
||||
},
|
||||
{
|
||||
name: "empty input",
|
||||
parser: parser1,
|
||||
input: "",
|
||||
expectTime: time.Time{},
|
||||
expectLayout: "",
|
||||
expectErr: analysis.ErrInvalidDateTime,
|
||||
},
|
||||
{
|
||||
name: "parser with no layouts",
|
||||
parser: parserEmpty,
|
||||
input: "2023-10-27",
|
||||
expectTime: time.Time{},
|
||||
expectLayout: "",
|
||||
expectErr: analysis.ErrInvalidDateTime,
|
||||
},
|
||||
{
|
||||
name: "not a date string",
|
||||
parser: parser1,
|
||||
input: "hello world",
|
||||
expectTime: time.Time{},
|
||||
expectLayout: "",
|
||||
expectErr: analysis.ErrInvalidDateTime,
|
||||
},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
t.Run(test.name, func(t *testing.T) {
|
||||
gotTime, gotLayout, gotErr := test.parser.ParseDateTime(test.input)
|
||||
|
||||
// Check error
|
||||
if !reflect.DeepEqual(gotErr, test.expectErr) {
|
||||
t.Fatalf("error mismatch:\nExpected: %v\nGot: %v", test.expectErr, gotErr)
|
||||
}
|
||||
|
||||
// Check time only if no error expected
|
||||
if test.expectErr == nil {
|
||||
if !gotTime.Equal(test.expectTime) {
|
||||
t.Errorf("time mismatch:\nExpected: %v\nGot: %v", test.expectTime, gotTime)
|
||||
}
|
||||
if gotLayout != test.expectLayout {
|
||||
t.Errorf("layout mismatch:\nExpected: %q\nGot: %q", test.expectLayout, gotLayout)
|
||||
}
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestDateTimeParserConstructor(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
config map[string]interface{}
|
||||
expectLayouts []string // Expected Go layouts after parsing
|
||||
expectErr error
|
||||
}{
|
||||
{
|
||||
name: "valid config with multiple layouts",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{"%Y-%m-%d", "%H:%M:%S %Z"},
|
||||
},
|
||||
expectLayouts: []string{"2006-01-02", "15:04:05 MST"},
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "valid config with single layout",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{"%Y/%m/%d %z:M"},
|
||||
},
|
||||
expectLayouts: []string{"2006/01/02 Z07:00"},
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "valid config with complex layout",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{"%a, %d %b %Y %H:%M:%S %zH"},
|
||||
},
|
||||
expectLayouts: []string{"Mon, 02 Jan 2006 15:04:05 Z07"},
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "config missing layouts key",
|
||||
config: map[string]interface{}{
|
||||
"other_key": "value",
|
||||
},
|
||||
expectLayouts: nil,
|
||||
expectErr: fmt.Errorf("must specify layouts"),
|
||||
},
|
||||
{
|
||||
name: "config layouts not a slice",
|
||||
config: map[string]interface{}{
|
||||
"layouts": "not-a-slice", // Value is a string
|
||||
},
|
||||
expectLayouts: nil,
|
||||
// Same error as a missing key: the value is not a slice of layouts
|
||||
expectErr: fmt.Errorf("must specify layouts"),
|
||||
},
|
||||
{
|
||||
name: "config layouts contains non-string",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{"%Y-%m-%d", 123},
|
||||
},
|
||||
// Should process the valid string, ignore the int
|
||||
expectLayouts: []string{"2006-01-02"},
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "config layouts contains invalid percent format",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{"%Y-%m-%d", "%x"}, // %x is invalid
|
||||
},
|
||||
expectLayouts: nil,
|
||||
expectErr: fmt.Errorf("invalid format string, unknown format specifier: x"),
|
||||
},
|
||||
{
|
||||
name: "config layouts contains format ending in %",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{"%Y-%m-%d", "%H:%M:%"},
|
||||
},
|
||||
expectLayouts: nil,
|
||||
expectErr: fmt.Errorf("invalid format string, expected character after %%"),
|
||||
},
|
||||
{
|
||||
name: "config with empty layouts slice",
|
||||
config: map[string]interface{}{
|
||||
"layouts": []interface{}{},
|
||||
},
|
||||
expectLayouts: []string{}, // Expect an empty slice, not nil
|
||||
expectErr: nil,
|
||||
},
|
||||
{
|
||||
name: "nil config",
|
||||
config: nil,
|
||||
expectLayouts: nil,
|
||||
expectErr: fmt.Errorf("must specify layouts"),
|
||||
},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
t.Run(test.name, func(t *testing.T) {
|
||||
// Cache is not used by this constructor, so nil is fine
|
||||
parserIntf, err := DateTimeParserConstructor(test.config, nil)
|
||||
|
||||
// Check error
|
||||
// Use string comparison for errors as they might be created differently
|
||||
expectedErrStr := ""
|
||||
if test.expectErr != nil {
|
||||
expectedErrStr = test.expectErr.Error()
|
||||
}
|
||||
actualErrStr := ""
|
||||
if err != nil {
|
||||
actualErrStr = err.Error()
|
||||
}
|
||||
if expectedErrStr != actualErrStr {
|
||||
t.Fatalf("error mismatch:\nExpected: %q\nGot: %q", expectedErrStr, actualErrStr)
|
||||
}
|
||||
|
||||
// Check layouts only if no error expected
|
||||
if test.expectErr == nil {
|
||||
// Type assert to access the layouts field
|
||||
parser, ok := parserIntf.(*DateTimeParser)
|
||||
if !ok {
|
||||
t.Fatalf("constructor did not return a *DateTimeParser")
|
||||
}
|
||||
if !reflect.DeepEqual(parser.layouts, test.expectLayouts) {
|
||||
t.Errorf("layouts mismatch:\nExpected: %v\nGot: %v", test.expectLayouts, parser.layouts)
|
||||
}
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
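The table-driven test above exercises a "try each layout in order" parser. A minimal sketch of that idea, assuming only the standard `time` package, is shown below. The names `parseWithLayouts` and `errNoLayout` are hypothetical stand-ins for illustration and are not part of bleve.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoLayout is a hypothetical sentinel error standing in for
// analysis.ErrInvalidDateTime in this sketch.
var errNoLayout = errors.New("no layout matched")

// parseWithLayouts tries each Go layout in order and returns the first
// successful parse together with the layout that matched.
func parseWithLayouts(layouts []string, input string) (time.Time, string, error) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, input); err == nil {
			return t, layout, nil
		}
	}
	// Reached when the layout list is empty or every layout failed.
	return time.Time{}, "", errNoLayout
}

func main() {
	layouts := []string{"2006-01-02", "01/02/2006"}
	t, layout, err := parseWithLayouts(layouts, "10/27/2023")
	fmt.Println(t.Format("2006-01-02"), layout, err)
}
```

Because the first matching layout wins, the order of the layout list matters when an input could satisfy more than one layout.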
130
analysis/datetime/sanitized/sanitized.go
Normal file
|
@ -0,0 +1,130 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package sanitized
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"regexp"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "sanitizedgo"
|
||||
|
||||
var validMagicNumbers = map[string]struct{}{
|
||||
"2006": {},
|
||||
"06": {}, // Year
|
||||
"01": {},
|
||||
"1": {},
|
||||
"_1": {},
|
||||
"January": {},
|
||||
"Jan": {}, // Month
|
||||
"02": {},
|
||||
"2": {},
|
||||
"_2": {},
|
||||
"__2": {},
|
||||
"002": {},
|
||||
"Monday": {},
|
||||
"Mon": {}, // Day
|
||||
"15": {},
|
||||
"3": {},
|
||||
"03": {}, // Hour
|
||||
"4": {},
|
||||
"04": {}, // Minute
|
||||
"5": {},
|
||||
"05": {}, // Second
|
||||
"0700": {},
|
||||
"070000": {},
|
||||
"07": {},
|
||||
"00": {},
|
||||
"": {},
|
||||
}
|
||||
|
||||
var layoutSplitRegex = regexp.MustCompile("[\\+\\-= :T,Z\\.<>;\\?!`~@#$%\\^&\\*|'\"\\(\\){}\\[\\]/\\\\]")
|
||||
|
||||
var layoutStripRegex = regexp.MustCompile(`PM|pm|\.9+|\.0+|MST`)
|
||||
|
||||
type DateTimeParser struct {
|
||||
layouts []string
|
||||
}
|
||||
|
||||
func New(layouts []string) *DateTimeParser {
|
||||
return &DateTimeParser{
|
||||
layouts: layouts,
|
||||
}
|
||||
}
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
for _, layout := range p.layouts {
|
||||
rv, err := time.Parse(layout, input)
|
||||
if err == nil {
|
||||
return rv, layout, nil
|
||||
}
|
||||
}
|
||||
return time.Time{}, "", analysis.ErrInvalidDateTime
|
||||
}
|
||||
|
||||
// validateLayout reports whether layout is built only from the reference-time
// constants defined by Go's time package (https://pkg.go.dev/time#pkg-constants).
// Rejecting anything else keeps custom layouts compatible with that package.
|
||||
func validateLayout(layout string) bool {
|
||||
// first we strip out commonly used constants
|
||||
// such as "PM" which can be present in the layout
|
||||
// right after a time component, e.g. 03:04PM;
|
||||
// because regex split cannot separate "03:04PM" into
|
||||
// "03:04" and "PM". We also strip out ".9+" and ".0+"
|
||||
// which represent fractional seconds.
|
||||
layout = layoutStripRegex.ReplaceAllString(layout, "")
|
||||
// then we split the layout on separator characters (layoutSplitRegex)
// and verify that each resulting piece is a valid magic number
|
||||
split := layoutSplitRegex.Split(layout, -1)
|
||||
for i := range split {
|
||||
_, found := validMagicNumbers[split[i]]
|
||||
if !found {
|
||||
return false
|
||||
}
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
layouts, ok := config["layouts"].([]interface{})
|
||||
if !ok {
|
||||
return nil, fmt.Errorf("must specify layouts")
|
||||
}
|
||||
var layoutStrs []string
|
||||
for _, layout := range layouts {
|
||||
layoutStr, ok := layout.(string)
|
||||
if ok {
|
||||
if !validateLayout(layoutStr) {
|
||||
return nil, fmt.Errorf("invalid datetime parser layout: %s,"+
|
||||
" please refer to https://pkg.go.dev/time#pkg-constants for supported"+
|
||||
" layouts", layoutStr)
|
||||
}
|
||||
layoutStrs = append(layoutStrs, layoutStr)
|
||||
}
|
||||
}
|
||||
return New(layoutStrs), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
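The strip-then-split validation in sanitized.go can be tried in isolation. The sketch below is a reduced version under stated assumptions: the `magic` set and the separator class are deliberately smaller than the originals, and `looksLikeGoLayout` is a hypothetical name, so it accepts fewer layouts than the real validator would.

```go
package main

import (
	"fmt"
	"regexp"
)

// magic is a reduced, illustrative subset of the reference-time components;
// the real validMagicNumbers map is larger.
var magic = map[string]struct{}{
	"2006": {}, "01": {}, "02": {}, "15": {}, "04": {}, "05": {},
	"Jan": {}, "Mon": {}, "07": {}, "00": {}, "": {},
}

// splitRe matches a reduced set of separator characters.
var splitRe = regexp.MustCompile(`[-+= :T,Z./]`)

// stripRe removes pieces that can sit directly next to a time component,
// such as "PM" or fractional seconds.
var stripRe = regexp.MustCompile(`PM|pm|\.9+|\.0+|MST`)

// looksLikeGoLayout strips the adjacent constants, splits on separators,
// and checks every remaining piece against the magic set.
func looksLikeGoLayout(layout string) bool {
	layout = stripRe.ReplaceAllString(layout, "")
	for _, part := range splitRe.Split(layout, -1) {
		if _, ok := magic[part]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(looksLikeGoLayout("2006-01-02T15:04:05.999Z07:00"))
	fmt.Println(looksLikeGoLayout("YYYY-MM-DD"))
}
```

The empty string is in the set because adjacent separators (for example ", " in "Mon, 02") produce empty pieces when splitting.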
109
analysis/datetime/sanitized/sanitized_test.go
Normal file
|
@ -0,0 +1,109 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package sanitized
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestLayoutValidatorRegex(t *testing.T) {
|
||||
splitRegexTests := []struct {
|
||||
input string
|
||||
output []string
|
||||
}{
|
||||
{
|
||||
input: "2014-08-03",
|
||||
output: []string{"2014", "08", "03"},
|
||||
},
|
||||
{
|
||||
input: "2014-08-03T15:59:30",
|
||||
output: []string{"2014", "08", "03", "15", "59", "30"},
|
||||
},
|
||||
{
|
||||
input: "2014.08-03 15/59`30",
|
||||
output: []string{"2014", "08", "03", "15", "59", "30"},
|
||||
},
|
||||
{
|
||||
input: "2014/08/03T15:59:30Z08:00",
|
||||
output: []string{"2014", "08", "03", "15", "59", "30", "08", "00"},
|
||||
},
|
||||
{
|
||||
input: "2014\\08|03T15=59.30.999999999+08*00",
|
||||
output: []string{"2014", "08", "03", "15", "59", "30", "999999999", "08", "00"},
|
||||
},
|
||||
{
|
||||
input: "2006-01-02T15:04:05.999999999Z07:00",
|
||||
output: []string{"2006", "01", "02", "15", "04", "05", "999999999", "07", "00"},
|
||||
},
|
||||
{
|
||||
input: "A-B C:DTE,FZG.H<I>J;K?L!M`N~O@P#Q$R%S^U&V*W|X'Y\"A(B)C{D}E[F]G/H\\I+J=L",
|
||||
output: []string{"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
|
||||
"Q", "R", "S", "U", "V", "W", "X", "Y", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L"},
|
||||
},
|
||||
}
|
||||
regex := layoutSplitRegex
|
||||
for _, test := range splitRegexTests {
|
||||
t.Run(test.input, func(t *testing.T) {
|
||||
actualOutput := regex.Split(test.input, -1)
|
||||
if !reflect.DeepEqual(actualOutput, test.output) {
|
||||
t.Fatalf("expected output %v, got %v", test.output, actualOutput)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
stripRegexTests := []struct {
|
||||
input string
|
||||
output string
|
||||
}{
|
||||
{
|
||||
input: "3PM",
|
||||
output: "3",
|
||||
},
|
||||
{
|
||||
input: "3.0PM",
|
||||
output: "3",
|
||||
},
|
||||
{
|
||||
input: "3.9AM",
|
||||
output: "3AM",
|
||||
},
|
||||
{
|
||||
input: "3.999999999pm",
|
||||
output: "3",
|
||||
},
|
||||
{
|
||||
input: "2006-01-02T15:04:05.999999999Z07:00MST",
|
||||
output: "2006-01-02T15:04:05Z07:00",
|
||||
},
|
||||
{
|
||||
input: "Jan _2 15:04:05.0000000+07:00MST",
|
||||
output: "Jan _2 15:04:05+07:00",
|
||||
},
|
||||
{
|
||||
input: "15:04:05.99PM+07:00MST",
|
||||
output: "15:04:05+07:00",
|
||||
},
|
||||
}
|
||||
regex = layoutStripRegex
|
||||
for _, test := range stripRegexTests {
|
||||
t.Run(test.input, func(t *testing.T) {
|
||||
actualOutput := regex.ReplaceAllString(test.input, "")
|
||||
if !reflect.DeepEqual(actualOutput, test.output) {
|
||||
t.Fatalf("expected output %v, got %v", test.output, actualOutput)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
55
analysis/datetime/timestamp/microseconds/microseconds.go
Normal file
|
@ -0,0 +1,55 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package microseconds
|
||||
|
||||
import (
|
||||
"math"
|
||||
"strconv"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "unix_micro"
|
||||
|
||||
type DateTimeParser struct {
|
||||
}
|
||||
|
||||
var minBound int64 = math.MinInt64 / 1000
|
||||
var maxBound int64 = math.MaxInt64 / 1000
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
// unix timestamp is microseconds since UNIX epoch
|
||||
timestamp, err := strconv.ParseInt(input, 10, 64)
|
||||
if err != nil {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampString
|
||||
}
|
||||
if timestamp < minBound || timestamp > maxBound {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampRange
|
||||
}
|
||||
return time.UnixMicro(timestamp), Name, nil
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
return &DateTimeParser{}, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
55
analysis/datetime/timestamp/milliseconds/milliseconds.go
Normal file
|
@ -0,0 +1,55 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package milliseconds
|
||||
|
||||
import (
|
||||
"math"
|
||||
"strconv"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "unix_milli"
|
||||
|
||||
type DateTimeParser struct {
|
||||
}
|
||||
|
||||
var minBound int64 = math.MinInt64 / 1000000
|
||||
var maxBound int64 = math.MaxInt64 / 1000000
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
// unix timestamp is milliseconds since UNIX epoch
|
||||
timestamp, err := strconv.ParseInt(input, 10, 64)
|
||||
if err != nil {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampString
|
||||
}
|
||||
if timestamp < minBound || timestamp > maxBound {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampRange
|
||||
}
|
||||
return time.UnixMilli(timestamp), Name, nil
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
return &DateTimeParser{}, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
55
analysis/datetime/timestamp/nanoseconds/nanoseconds.go
Normal file
|
@ -0,0 +1,55 @@
|
|||
// Copyright (c) 2023 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package nanoseconds
|
||||
|
||||
import (
|
||||
"math"
|
||||
"strconv"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "unix_nano"
|
||||
|
||||
type DateTimeParser struct {
|
||||
}
|
||||
|
||||
var minBound int64 = math.MinInt64
|
||||
var maxBound int64 = math.MaxInt64
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
// unix timestamp is nanoseconds since UNIX epoch
|
||||
timestamp, err := strconv.ParseInt(input, 10, 64)
|
||||
if err != nil {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampString
|
||||
}
|
||||
if timestamp < minBound || timestamp > maxBound {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampRange
|
||||
}
|
||||
return time.Unix(0, timestamp), Name, nil
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
return &DateTimeParser{}, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
55
analysis/datetime/timestamp/seconds/seconds.go
Normal file
|
@ -0,0 +1,55 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package seconds
|
||||
|
||||
import (
|
||||
"math"
|
||||
"strconv"
|
||||
"time"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const Name = "unix_sec"
|
||||
|
||||
type DateTimeParser struct {
|
||||
}
|
||||
|
||||
var minBound int64 = math.MinInt64 / 1000000000
|
||||
var maxBound int64 = math.MaxInt64 / 1000000000
|
||||
|
||||
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
|
||||
// unix timestamp is seconds since UNIX epoch
|
||||
timestamp, err := strconv.ParseInt(input, 10, 64)
|
||||
if err != nil {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampString
|
||||
}
|
||||
if timestamp < minBound || timestamp > maxBound {
|
||||
return time.Time{}, "", analysis.ErrInvalidTimestampRange
|
||||
}
|
||||
return time.Unix(timestamp, 0), Name, nil
|
||||
}
|
||||
|
||||
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
|
||||
return &DateTimeParser{}, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
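The four timestamp parsers differ only in their unit and in the divisor applied to the int64 limits. One plausible reading, which is an assumption here since the sources do not state the reason, is that the bounds keep the value representable as int64 nanoseconds. The sketch below shows that pattern; `nanosPerUnit` and `inRange` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// nanosPerUnit maps a timestamp unit to the number of nanoseconds it contains.
var nanosPerUnit = map[string]int64{
	"sec":   1_000_000_000,
	"milli": 1_000_000,
	"micro": 1_000,
	"nano":  1,
}

// inRange reports whether ts, expressed in the given unit, could be converted
// to int64 nanoseconds without overflow. Unknown units are rejected.
func inRange(unit string, ts int64) bool {
	n, ok := nanosPerUnit[unit]
	if !ok {
		return false
	}
	return ts >= math.MinInt64/n && ts <= math.MaxInt64/n
}

func main() {
	fmt.Println(inRange("sec", 1698400000))
	fmt.Println(inRange("sec", math.MaxInt64))
	// The same instant expressed in two units yields equal time values.
	fmt.Println(time.UnixMilli(1698400000000).Equal(time.Unix(1698400000, 0)))
}
```

For the nanosecond unit the divisor is 1, so every int64 value is in range and the check can never fail.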
70
analysis/freq.go
Normal file
|
@ -0,0 +1,70 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package analysis
|
||||
|
||||
import (
|
||||
index "github.com/blevesearch/bleve_index_api"
|
||||
)
|
||||
|
||||
func TokenFrequency(tokens TokenStream, arrayPositions []uint64, options index.FieldIndexingOptions) index.TokenFrequencies {
|
||||
rv := make(map[string]*index.TokenFreq, len(tokens))
|
||||
|
||||
if options.IncludeTermVectors() {
|
||||
tls := make([]index.TokenLocation, len(tokens))
|
||||
tlNext := 0
|
||||
|
||||
for _, token := range tokens {
|
||||
tls[tlNext] = index.TokenLocation{
|
||||
ArrayPositions: arrayPositions,
|
||||
Start: token.Start,
|
||||
End: token.End,
|
||||
Position: token.Position,
|
||||
}
|
||||
|
||||
curr, ok := rv[string(token.Term)]
|
||||
if ok {
|
||||
curr.Locations = append(curr.Locations, &tls[tlNext])
|
||||
} else {
|
||||
curr = &index.TokenFreq{
|
||||
Term: token.Term,
|
||||
Locations: []*index.TokenLocation{&tls[tlNext]},
|
||||
}
|
||||
rv[string(token.Term)] = curr
|
||||
}
|
||||
|
||||
if !options.SkipFreqNorm() {
|
||||
curr.SetFrequency(curr.Frequency() + 1)
|
||||
}
|
||||
|
||||
tlNext++
|
||||
}
|
||||
} else {
|
||||
for _, token := range tokens {
|
||||
curr, exists := rv[string(token.Term)]
|
||||
if !exists {
|
||||
curr = &index.TokenFreq{
|
||||
Term: token.Term,
|
||||
}
|
||||
rv[string(token.Term)] = curr
|
||||
}
|
||||
|
||||
if !options.SkipFreqNorm() {
|
||||
curr.SetFrequency(curr.Frequency() + 1)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return rv
|
||||
}
|
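The counting logic in TokenFrequency can be illustrated without the bleve index types. The following is a simplified sketch: `token`, `termFreq`, and `countTerms` are hypothetical stand-ins, and only positions are recorded rather than full token locations.

```go
package main

import "fmt"

// token is a simplified stand-in for an analysis token.
type token struct {
	Term     string
	Position int
}

// termFreq accumulates how often a term occurred and, optionally, where.
type termFreq struct {
	Count     int
	Positions []int
}

// countTerms groups tokens by term. Positions are recorded only when
// withPositions is true, loosely mirroring the term-vector option.
func countTerms(tokens []token, withPositions bool) map[string]*termFreq {
	out := make(map[string]*termFreq, len(tokens))
	for _, tok := range tokens {
		tf, ok := out[tok.Term]
		if !ok {
			tf = &termFreq{}
			out[tok.Term] = tf
		}
		tf.Count++
		if withPositions {
			tf.Positions = append(tf.Positions, tok.Position)
		}
	}
	return out
}

func main() {
	freqs := countTerms([]token{{"water", 1}, {"water", 2}, {"ice", 3}}, true)
	fmt.Println(freqs["water"].Count, freqs["water"].Positions)
}
```

Storing pointers in the map lets each entry be updated in place without a second map write per token.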
60
analysis/freq_test.go
Normal file
|
@ -0,0 +1,60 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package analysis
|
||||
|
||||
import (
|
||||
"reflect"
"testing"

index "github.com/blevesearch/bleve_index_api"
|
||||
)
|
||||
|
||||
func TestTokenFrequency(t *testing.T) {
|
||||
tokens := TokenStream{
|
||||
&Token{
|
||||
Term: []byte("water"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 5,
|
||||
},
|
||||
&Token{
|
||||
Term: []byte("water"),
|
||||
Position: 2,
|
||||
Start: 6,
|
||||
End: 11,
|
||||
},
|
||||
}
|
||||
expectedResult := index.TokenFrequencies{
|
||||
"water": &index.TokenFreq{
|
||||
Term: []byte("water"),
|
||||
Locations: []*index.TokenLocation{
|
||||
{
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 5,
|
||||
},
|
||||
{
|
||||
Position: 2,
|
||||
Start: 6,
|
||||
End: 11,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
expectedResult["water"].SetFrequency(2)
|
||||
result := TokenFrequency(tokens, nil, index.IncludeTermVectors)
|
||||
if !reflect.DeepEqual(result, expectedResult) {
|
||||
t.Errorf("expected %#v, got %#v", expectedResult, result)
|
||||
}
|
||||
}
|
68
analysis/lang/ar/analyzer_ar.go
Normal file
|
@ -0,0 +1,68 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/unicodenorm"
|
||||
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
|
||||
)
|
||||
|
||||
const AnalyzerName = "ar"
|
||||
|
||||
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
|
||||
tokenizer, err := cache.TokenizerNamed(unicode.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
normalizeFilter := unicodenorm.MustNewUnicodeNormalizeFilter(unicodenorm.NFKC)
|
||||
stopArFilter, err := cache.TokenFilterNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
normalizeArFilter, err := cache.TokenFilterNamed(NormalizeName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
stemmerArFilter, err := cache.TokenFilterNamed(StemmerName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
rv := analysis.DefaultAnalyzer{
|
||||
Tokenizer: tokenizer,
|
||||
TokenFilters: []analysis.TokenFilter{
|
||||
toLowerFilter,
|
||||
normalizeFilter,
|
||||
stopArFilter,
|
||||
normalizeArFilter,
|
||||
stemmerArFilter,
|
||||
},
|
||||
}
|
||||
return &rv, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
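The analyzer constructor above wires a tokenizer to an ordered list of token filters. As a rough illustration of why that order matters, here is a small sketch using plain strings. The `filter` type and the `lower`, `dropStops`, and `analyze` functions are hypothetical and much simpler than bleve's interfaces.

```go
package main

import (
	"fmt"
	"strings"
)

// filter transforms a token list into a new token list.
type filter func([]string) []string

// lower lowercases every token.
func lower(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		out = append(out, strings.ToLower(t))
	}
	return out
}

// dropStops returns a filter that removes tokens found in stops.
func dropStops(stops map[string]bool) filter {
	return func(tokens []string) []string {
		out := make([]string, 0, len(tokens))
		for _, t := range tokens {
			if !stops[t] {
				out = append(out, t)
			}
		}
		return out
	}
}

// analyze splits input on whitespace and applies the filters in order.
func analyze(input string, filters ...filter) []string {
	tokens := strings.Fields(input)
	for _, f := range filters {
		tokens = f(tokens)
	}
	return tokens
}

func main() {
	stops := map[string]bool{"the": true}
	fmt.Println(analyze("The Water", lower, dropStops(stops)))
}
```

Running the stop filter before lowercasing would leave "The" in place, since the stop list here holds only the lowercase form.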
184
analysis/lang/ar/analyzer_ar_test.go
Normal file
|
@ -0,0 +1,184 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestArabicAnalyzer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
{
|
||||
input: []byte("كبير"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("كبير"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
// feminine marker
|
||||
{
|
||||
input: []byte("كبيرة"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("كبير"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("مشروب"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("مشروب"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
// plural -at
|
||||
{
|
||||
input: []byte("مشروبات"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("مشروب"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 14,
|
||||
},
|
||||
},
|
||||
},
|
||||
// plural -in
|
||||
{
|
||||
input: []byte("أمريكيين"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("امريك"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 16,
|
||||
},
|
||||
},
|
||||
},
|
||||
// singular with bare alif
|
||||
{
|
||||
input: []byte("امريكي"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("امريك"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 12,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("كتاب"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("كتاب"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
// definite article
|
||||
{
|
||||
input: []byte("الكتاب"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("كتاب"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 12,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("ما ملكت أيمانكم"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ملكت"),
|
||||
Position: 2,
|
||||
Start: 5,
|
||||
End: 13,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ايمانكم"),
|
||||
Position: 3,
|
||||
Start: 14,
|
||||
End: 28,
|
||||
},
|
||||
},
|
||||
},
|
||||
// stopwords
|
||||
{
|
||||
input: []byte("الذين ملكت أيمانكم"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ملكت"),
|
||||
Position: 2,
|
||||
Start: 11,
|
||||
End: 19,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ايمانكم"),
|
||||
Position: 3,
|
||||
Start: 20,
|
||||
End: 34,
|
||||
},
|
||||
},
|
||||
},
|
||||
// presentation form normalization
|
||||
{
|
||||
input: []byte("ﺍﻟﺴﻼﻢ"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلام"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 15,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := analyzer.Analyze(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %v, got %v", test.output, actual)
|
||||
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
|
||||
}
|
||||
}
|
||||
}
|
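The Start and End values in the expected tokens above are consistent with byte offsets into the UTF-8 input rather than character counts: letters from the basic Arabic block take two bytes each, and the presentation forms in the last case take three. The following is a minimal sketch, using only the standard library, that reproduces such offsets for whitespace-separated words. The names `span` and `fieldSpans` are hypothetical and are not part of bleve.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// span holds the byte offsets of one whitespace-delimited word.
// Hypothetical type for illustration only.
type span struct{ Start, End int }

// fieldSpans returns the [Start, End) byte offsets of each word in s.
func fieldSpans(s string) []span {
	var spans []span
	start := -1
	for i, r := range s { // i is a byte index, not a rune index
		if unicode.IsSpace(r) {
			if start >= 0 {
				spans = append(spans, span{start, i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		spans = append(spans, span{start, len(s)})
	}
	return spans
}

func main() {
	input := "ما ملكت أيمانكم"
	fmt.Println("bytes:", len(input), "runes:", utf8.RuneCountInString(input))
	for _, sp := range fieldSpans(input) {
		fmt.Printf("%s [%d,%d)\n", input[sp.Start:sp.End], sp.Start, sp.End)
	}
}
```

Note that the analyzer under test also drops stop words while keeping the original positions and offsets of the surviving tokens; this sketch only illustrates the offset arithmetic.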
88
analysis/lang/ar/arabic_normalize.go
Normal file
|
@@ -0,0 +1,88 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const NormalizeName = "normalize_ar"
|
||||
|
||||
const (
|
||||
Alef = '\u0627'
|
||||
AlefMadda = '\u0622'
|
||||
AlefHamzaAbove = '\u0623'
|
||||
AlefHamzaBelow = '\u0625'
|
||||
Yeh = '\u064A'
|
||||
DotlessYeh = '\u0649'
|
||||
TehMarbuta = '\u0629'
|
||||
Heh = '\u0647'
|
||||
Tatweel = '\u0640'
|
||||
Fathatan = '\u064B'
|
||||
Dammatan = '\u064C'
|
||||
Kasratan = '\u064D'
|
||||
Fatha = '\u064E'
|
||||
Damma = '\u064F'
|
||||
Kasra = '\u0650'
|
||||
Shadda = '\u0651'
|
||||
Sukun = '\u0652'
|
||||
)
|
||||
|
||||
type ArabicNormalizeFilter struct {
|
||||
}
|
||||
|
||||
func NewArabicNormalizeFilter() *ArabicNormalizeFilter {
|
||||
return &ArabicNormalizeFilter{}
|
||||
}
|
||||
|
||||
func (s *ArabicNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
term := normalize(token.Term)
|
||||
token.Term = term
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func normalize(input []byte) []byte {
|
||||
runes := bytes.Runes(input)
|
||||
for i := 0; i < len(runes); i++ {
|
||||
switch runes[i] {
|
||||
case AlefMadda, AlefHamzaAbove, AlefHamzaBelow:
|
||||
runes[i] = Alef
|
||||
case DotlessYeh:
|
||||
runes[i] = Yeh
|
||||
case TehMarbuta:
|
||||
runes[i] = Heh
|
||||
case Tatweel, Kasratan, Dammatan, Fathatan, Fatha, Damma, Kasra, Shadda, Sukun:
|
||||
runes = analysis.DeleteRune(runes, i)
|
||||
i--
|
||||
}
|
||||
}
|
||||
return analysis.BuildTermFromRunes(runes)
|
||||
}
|
||||
|
||||
func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewArabicNormalizeFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
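The normalization rules in the file above can be tried without bleve. The following is a minimal standalone sketch that applies the same mapping to a Go string using only the standard library. The function name `normalizeArabic` is hypothetical, and building a new string instead of deleting runes in place is a choice made here for brevity, not what the filter itself does.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeArabic is a hypothetical helper that mirrors the rules of the
// normalize function above: alef variants fold to bare alef, dotless yeh
// folds to yeh, teh marbuta folds to heh, and tatweel plus the short-vowel
// and doubling marks are dropped.
func normalizeArabic(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch r {
		case '\u0622', '\u0623', '\u0625': // alef madda, hamza above, hamza below
			b.WriteRune('\u0627') // bare alef
		case '\u0649': // dotless yeh (alef maksura)
			b.WriteRune('\u064A') // yeh
		case '\u0629': // teh marbuta
			b.WriteRune('\u0647') // heh
		case '\u0640', // tatweel
			'\u064B', '\u064C', '\u064D', // fathatan, dammatan, kasratan
			'\u064E', '\u064F', '\u0650', // fatha, damma, kasra
			'\u0651', '\u0652': // shadda, sukun
			// Dropped entirely.
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	for _, w := range []string{"أحمد", "فاطمة", "روبرـــــت"} {
		fmt.Printf("%s -> %s\n", w, normalizeArabic(w))
	}
}
```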
234
analysis/lang/ar/arabic_normalize_test.go
Normal file
|
@@ -0,0 +1,234 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
)
|
||||
|
||||
func TestArabicNormalizeFilter(t *testing.T) {
|
||||
tests := []struct {
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// AlifMadda
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("آجن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("اجن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// AlifHamzaAbove
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("أحمد"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("احمد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// AlifHamzaBelow
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("إعاذ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("اعاذ"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// AlifMaksura
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("بنى"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("بني"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// TehMarbuta
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("فاطمة"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("فاطمه"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Tatweel
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("روبرـــــت"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("روبرت"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Fatha
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("مَبنا"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("مبنا"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Kasra
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("علِي"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("علي"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Damma
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("بُوات"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("بوات"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Fathatan
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ولداً"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ولدا"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Kasratan
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ولدٍ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ولد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Dammatan
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ولدٌ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ولد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Sukun
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("نلْسون"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("نلسون"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Shaddah
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("هتميّ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("هتمي"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// empty
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
arabicNormalizeFilter := NewArabicNormalizeFilter()
|
||||
for _, test := range tests {
|
||||
actual := arabicNormalizeFilter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %#v, got %#v", test.output, actual)
|
||||
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
|
||||
}
|
||||
}
|
||||
}
|
121
analysis/lang/ar/stemmer_ar.go
Normal file
|
@@ -0,0 +1,121 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StemmerName = "stemmer_ar"
|
||||
|
||||
// These were obtained from org.apache.lucene.analysis.ar.ArabicStemmer
|
||||
var prefixes = [][]rune{
|
||||
[]rune("ال"),
|
||||
[]rune("وال"),
|
||||
[]rune("بال"),
|
||||
[]rune("كال"),
|
||||
[]rune("فال"),
|
||||
[]rune("لل"),
|
||||
[]rune("و"),
|
||||
}
|
||||
var suffixes = [][]rune{
|
||||
[]rune("ها"),
|
||||
[]rune("ان"),
|
||||
[]rune("ات"),
|
||||
[]rune("ون"),
|
||||
[]rune("ين"),
|
||||
[]rune("يه"),
|
||||
[]rune("ية"),
|
||||
[]rune("ه"),
|
||||
[]rune("ة"),
|
||||
[]rune("ي"),
|
||||
}
|
||||
|
||||
type ArabicStemmerFilter struct{}
|
||||
|
||||
func NewArabicStemmerFilter() *ArabicStemmerFilter {
|
||||
return &ArabicStemmerFilter{}
|
||||
}
|
||||
|
||||
func (s *ArabicStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
term := stem(token.Term)
|
||||
token.Term = term
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func canStemPrefix(input, prefix []rune) bool {
|
||||
// The single-letter wa- prefix is stripped only from inputs of at least 4 runes, so at least 3 remain.
|
||||
if len(prefix) == 1 && len(input) < 4 {
|
||||
return false
|
||||
}
|
||||
// Other prefixes only need to leave a stem of at least 2 runes.
|
||||
if len(input)-len(prefix) < 2 {
|
||||
return false
|
||||
}
|
||||
for i := range prefix {
|
||||
if prefix[i] != input[i] {
|
||||
return false
|
||||
}
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
func canStemSuffix(input, suffix []rune) bool {
|
||||
// All suffixes require at least 2 characters after stemming.
|
||||
if len(input)-len(suffix) < 2 {
|
||||
return false
|
||||
}
|
||||
stemEnd := len(input) - len(suffix)
|
||||
for i := range suffix {
|
||||
if suffix[i] != input[stemEnd+i] {
|
||||
return false
|
||||
}
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
func stem(input []byte) []byte {
|
||||
runes := bytes.Runes(input)
|
||||
// Strip a single prefix.
|
||||
for _, p := range prefixes {
|
||||
if canStemPrefix(runes, p) {
|
||||
runes = runes[len(p):]
|
||||
break
|
||||
}
|
||||
}
|
||||
// Strip off multiple suffixes, in their order in the suffixes array.
|
||||
for _, s := range suffixes {
|
||||
if canStemSuffix(runes, s) {
|
||||
runes = runes[:len(runes)-len(s)]
|
||||
}
|
||||
}
|
||||
return analysis.BuildTermFromRunes(runes)
|
||||
}
|
||||
|
||||
func StemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewArabicStemmerFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(StemmerName, StemmerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
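The stemming procedure in the file above strips at most one prefix, then strips every matching suffix in list order, subject to minimum-length guards. The following is a minimal standalone sketch of that procedure on plain strings, using only the standard library. The names `sketchPrefixes`, `sketchSuffixes`, `canStripPrefix`, `canStripSuffix` and `stemSketch` are hypothetical; the affix lists are copied from the file above.

```go
package main

import "fmt"

// Affix lists copied from the stemmer source above.
var sketchPrefixes = []string{"ال", "وال", "بال", "كال", "فال", "لل", "و"}
var sketchSuffixes = []string{"ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"}

// canStripPrefix reports whether p may be removed from the front of in.
func canStripPrefix(in, p []rune) bool {
	if len(p) == 1 && len(in) < 4 { // single-letter prefix needs a longer word
		return false
	}
	if len(in)-len(p) < 2 { // at least 2 runes must remain
		return false
	}
	for i := range p {
		if in[i] != p[i] {
			return false
		}
	}
	return true
}

// canStripSuffix reports whether s may be removed from the end of in.
func canStripSuffix(in, s []rune) bool {
	if len(in)-len(s) < 2 { // at least 2 runes must remain
		return false
	}
	off := len(in) - len(s)
	for i := range s {
		if in[off+i] != s[i] {
			return false
		}
	}
	return true
}

// stemSketch strips one prefix, then any number of suffixes in list order.
func stemSketch(word string) string {
	r := []rune(word)
	for _, p := range sketchPrefixes {
		if pr := []rune(p); canStripPrefix(r, pr) {
			r = r[len(pr):]
			break // only the first matching prefix is removed
		}
	}
	for _, s := range sketchSuffixes {
		if sr := []rune(s); canStripSuffix(r, sr) {
			r = r[:len(r)-len(sr)] // keep going: later suffixes may also match
		}
	}
	return string(r)
}

func main() {
	for _, w := range []string{"الحسن", "وساهدون", "ساهدهات", "الو", "English"} {
		fmt.Printf("%s -> %s\n", w, stemSketch(w))
	}
}
```

Because each suffix is tried once and in a fixed order, a word such as the ComboSuf test case loses two suffixes in one pass, while a short word such as the "Shouldn't Stem" case is left untouched by the length guards.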
397
analysis/lang/ar/stemmer_ar_test.go
Normal file
|
@@ -0,0 +1,397 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
)
|
||||
|
||||
func TestArabicStemmerFilter(t *testing.T) {
|
||||
tests := []struct {
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// AlPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("الحسن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("حسن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// WalPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("والحسن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("حسن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// BalPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("بالحسن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("حسن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// KalPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("كالحسن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("حسن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// FalPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("فالحسن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("حسن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// LlPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("للاخر"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("اخر"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// WaPrefix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("وحسن"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("حسن"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// AhSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("زوجها"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("زوج"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// AnSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدان"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// AtSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدات"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// WnSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدون"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// YnSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدين"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// YhSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهديه"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// YpSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدية"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// HSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهده"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// PSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدة"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// YSuffix
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدي"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// ComboPrefSuf
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("وساهدون"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// ComboSuf
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهدهات"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ساهد"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Shouldn't Stem
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("الو"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("الو"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// NonArabic
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("English"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("English"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلام"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلام"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("السلام"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلام"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلامة"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلام"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("السلامة"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("سلام"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("الوصل"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("وصل"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("والصل"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("صل"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Empty
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
arabicStemmerFilter := NewArabicStemmerFilter()
|
||||
for _, test := range tests {
|
||||
actual := arabicStemmerFilter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %#v, got %#v", test.output, actual)
|
||||
t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
|
||||
}
|
||||
}
|
||||
}
|
36
analysis/lang/ar/stop_filter_ar.go
Normal file
|
@@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ar
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/stop"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
tokenMap, err := cache.TokenMapNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return stop.NewStopTokensFilter(tokenMap), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
152
analysis/lang/ar/stop_words_ar.go
Normal file
|
@@ -0,0 +1,152 @@
|
|||
package ar
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_ar"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis
|
||||
// ` was changed to ' so the list can be embedded in a Go raw string literal
|
||||
|
||||
var ArabicStopWords = []byte(`# This file was created by Jacques Savoy and is distributed under the BSD license.
|
||||
# See http://members.unine.ch/jacques.savoy/clef/index.html.
|
||||
# Also see http://www.opensource.org/licenses/bsd-license.html
|
||||
# Cleaned on October 11, 2009 (not normalized, so use before normalization)
|
||||
# This means that when modifying this list, you might need to add some
|
||||
# redundant entries, for example containing forms with both أ and ا
|
||||
من
|
||||
ومن
|
||||
منها
|
||||
منه
|
||||
في
|
||||
وفي
|
||||
فيها
|
||||
فيه
|
||||
و
|
||||
ف
|
||||
ثم
|
||||
او
|
||||
أو
|
||||
ب
|
||||
بها
|
||||
به
|
||||
ا
|
||||
أ
|
||||
اى
|
||||
اي
|
||||
أي
|
||||
أى
|
||||
لا
|
||||
ولا
|
||||
الا
|
||||
ألا
|
||||
إلا
|
||||
لكن
|
||||
ما
|
||||
وما
|
||||
كما
|
||||
فما
|
||||
عن
|
||||
مع
|
||||
اذا
|
||||
إذا
|
||||
ان
|
||||
أن
|
||||
إن
|
||||
انها
|
||||
أنها
|
||||
إنها
|
||||
انه
|
||||
أنه
|
||||
إنه
|
||||
بان
|
||||
بأن
|
||||
فان
|
||||
فأن
|
||||
وان
|
||||
وأن
|
||||
وإن
|
||||
التى
|
||||
التي
|
||||
الذى
|
||||
الذي
|
||||
الذين
|
||||
الى
|
||||
الي
|
||||
إلى
|
||||
إلي
|
||||
على
|
||||
عليها
|
||||
عليه
|
||||
اما
|
||||
أما
|
||||
إما
|
||||
ايضا
|
||||
أيضا
|
||||
كل
|
||||
وكل
|
||||
لم
|
||||
ولم
|
||||
لن
|
||||
ولن
|
||||
هى
|
||||
هي
|
||||
هو
|
||||
وهى
|
||||
وهي
|
||||
وهو
|
||||
فهى
|
||||
فهي
|
||||
فهو
|
||||
انت
|
||||
أنت
|
||||
لك
|
||||
لها
|
||||
له
|
||||
هذه
|
||||
هذا
|
||||
تلك
|
||||
ذلك
|
||||
هناك
|
||||
كانت
|
||||
كان
|
||||
يكون
|
||||
تكون
|
||||
وكانت
|
||||
وكان
|
||||
غير
|
||||
بعض
|
||||
قد
|
||||
نحو
|
||||
بين
|
||||
بينما
|
||||
منذ
|
||||
ضمن
|
||||
حيث
|
||||
الان
|
||||
الآن
|
||||
خلال
|
||||
بعد
|
||||
قبل
|
||||
حتى
|
||||
عند
|
||||
عندما
|
||||
لدى
|
||||
جميع
|
||||
`)
|
||||
|
||||
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(ArabicStopWords)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
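The stop-word list above is a plain-text resource with `#` comment lines and one word per line, and its header notes that it is not normalized, so it should be applied before normalization. The following is a minimal sketch of how such a list could be parsed into a set and applied to tokens, using only the standard library. The function names `loadStopWords` and `removeStopWords` are hypothetical, and the comment-stripping rule is an assumption about the format rather than a description of `LoadBytes`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// loadStopWords parses a stop-word list: text after '#' is treated as a
// comment (assumed rule), and every remaining whitespace-separated word
// is added to the set.
func loadStopWords(data string) map[string]bool {
	set := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		for _, w := range strings.Fields(line) {
			set[w] = true
		}
	}
	return set
}

// removeStopWords returns the tokens that are not in the stop set.
func removeStopWords(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	list := "# sample list\nمن\nفي\nالذين\n"
	stop := loadStopWords(list)
	tokens := strings.Fields("الذين ملكت أيمانكم")
	fmt.Println(removeStopWords(tokens, stop))
}
```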
36
analysis/lang/bg/stop_filter_bg.go
Normal file
|
@@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package bg
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/stop"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
tokenMap, err := cache.TokenMapNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return stop.NewStopTokensFilter(tokenMap), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
220
analysis/lang/bg/stop_words_bg.go
Normal file
|
@@ -0,0 +1,220 @@
|
|||
package bg
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_bg"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
|
||||
// ` was changed to ' so the list can be embedded in a Go raw string literal
|
||||
|
||||
var BulgarianStopWords = []byte(`# This file was created by Jacques Savoy and is distributed under the BSD license.
|
||||
# See http://members.unine.ch/jacques.savoy/clef/index.html.
|
||||
# Also see http://www.opensource.org/licenses/bsd-license.html
|
||||
а
|
||||
аз
|
||||
ако
|
||||
ала
|
||||
бе
|
||||
без
|
||||
беше
|
||||
би
|
||||
бил
|
||||
била
|
||||
били
|
||||
било
|
||||
близо
|
||||
бъдат
|
||||
бъде
|
||||
бяха
|
||||
в
|
||||
вас
|
||||
ваш
|
||||
ваша
|
||||
вероятно
|
||||
вече
|
||||
взема
|
||||
ви
|
||||
вие
|
||||
винаги
|
||||
все
|
||||
всеки
|
||||
всички
|
||||
всичко
|
||||
всяка
|
||||
във
|
||||
въпреки
|
||||
върху
|
||||
г
|
||||
ги
|
||||
главно
|
||||
го
|
||||
д
|
||||
да
|
||||
дали
|
||||
до
|
||||
докато
|
||||
докога
|
||||
дори
|
||||
досега
|
||||
доста
|
||||
е
|
||||
едва
|
||||
един
|
||||
ето
|
||||
за
|
||||
зад
|
||||
заедно
|
||||
заради
|
||||
засега
|
||||
затова
|
||||
защо
|
||||
защото
|
||||
и
|
||||
из
|
||||
или
|
||||
им
|
||||
има
|
||||
имат
|
||||
иска
|
||||
й
|
||||
каза
|
||||
как
|
||||
каква
|
||||
какво
|
||||
както
|
||||
какъв
|
||||
като
|
||||
кога
|
||||
когато
|
||||
което
|
||||
които
|
||||
кой
|
||||
който
|
||||
колко
|
||||
която
|
||||
къде
|
||||
където
|
||||
към
|
||||
ли
|
||||
м
|
||||
ме
|
||||
между
|
||||
мен
|
||||
ми
|
||||
мнозина
|
||||
мога
|
||||
могат
|
||||
може
|
||||
моля
|
||||
момента
|
||||
му
|
||||
н
|
||||
на
|
||||
над
|
||||
назад
|
||||
най
|
||||
направи
|
||||
напред
|
||||
например
|
||||
нас
|
||||
не
|
||||
него
|
||||
нея
|
||||
ни
|
||||
ние
|
||||
никой
|
||||
нито
|
||||
но
|
||||
някои
|
||||
някой
|
||||
няма
|
||||
обаче
|
||||
около
|
||||
освен
|
||||
особено
|
||||
от
|
||||
отгоре
|
||||
отново
|
||||
още
|
||||
пак
|
||||
по
|
||||
повече
|
||||
повечето
|
||||
под
|
||||
поне
|
||||
поради
|
||||
после
|
||||
почти
|
||||
прави
|
||||
пред
|
||||
преди
|
||||
през
|
||||
при
|
||||
пък
|
||||
първо
|
||||
с
|
||||
са
|
||||
само
|
||||
се
|
||||
сега
|
||||
си
|
||||
скоро
|
||||
след
|
||||
сме
|
||||
според
|
||||
сред
|
||||
срещу
|
||||
сте
|
||||
съм
|
||||
със
|
||||
също
|
||||
т
|
||||
тази
|
||||
така
|
||||
такива
|
||||
такъв
|
||||
там
|
||||
твой
|
||||
те
|
||||
тези
|
||||
ти
|
||||
тн
|
||||
то
|
||||
това
|
||||
тогава
|
||||
този
|
||||
той
|
||||
толкова
|
||||
точно
|
||||
трябва
|
||||
тук
|
||||
тъй
|
||||
тя
|
||||
тях
|
||||
у
|
||||
харесва
|
||||
ч
|
||||
че
|
||||
често
|
||||
чрез
|
||||
ще
|
||||
щом
|
||||
я
|
||||
`)
|
||||
|
||||
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(BulgarianStopWords)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
33
analysis/lang/ca/articles_ca.go
Normal file
|
@@ -0,0 +1,33 @@
|
|||
package ca
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const ArticlesName = "articles_ca"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis
|
||||
|
||||
var CatalanArticles = []byte(`
|
||||
d
|
||||
l
|
||||
m
|
||||
n
|
||||
s
|
||||
t
|
||||
`)
|
||||
|
||||
func ArticlesTokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(CatalanArticles)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(ArticlesName, ArticlesTokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
40
analysis/lang/ca/elision_ca.go
Normal file
|
@@ -0,0 +1,40 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ca

import (
	"fmt"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/elision"
	"github.com/blevesearch/bleve/v2/registry"
)

const ElisionName = "elision_ca"

func ElisionFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	articlesTokenMap, err := cache.TokenMapNamed(ArticlesName)
	if err != nil {
		return nil, fmt.Errorf("error building elision filter: %v", err)
	}
	return elision.NewElisionFilter(articlesTokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(ElisionName, ElisionFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
61
analysis/lang/ca/elision_ca_test.go
Normal file
|
@ -0,0 +1,61 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ca

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

func TestCatalanElision(t *testing.T) {
	tests := []struct {
		input  analysis.TokenStream
		output analysis.TokenStream
	}{
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("l'Institut"),
				},
				&analysis.Token{
					Term: []byte("d'Estudis"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Institut"),
				},
				&analysis.Token{
					Term: []byte("Estudis"),
				},
			},
		},
	}

	cache := registry.NewCache()
	elisionFilter, err := cache.TokenFilterNamed(ElisionName)
	if err != nil {
		t.Fatal(err)
	}
	for _, test := range tests {
		actual := elisionFilter.Filter(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
		}
	}
}
|
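The elision test above expects "l'Institut" and "d'Estudis" to become "Institut" and "Estudis". The following is a minimal, stdlib-only sketch of that idea, not the implementation in the `elision` package: `stripElision` and its `articles` map are hypothetical names introduced here for illustration, and the sketch assumes only the first apostrophe in a term is examined.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stripElision is a hypothetical helper (not part of bleve). It looks at the
// first apostrophe (ASCII ' or U+2019) in term; if the text before it is one
// of the given articles, the article and the apostrophe are dropped.
func stripElision(term string, articles map[string]bool) string {
	for i, r := range term {
		if r == '\'' || r == '\u2019' {
			if articles[term[:i]] {
				// skip the apostrophe itself, which may be multi-byte
				return term[i+utf8.RuneLen(r):]
			}
			return term
		}
	}
	return term
}

func main() {
	// the same entries as the CatalanArticles token map
	articles := map[string]bool{"d": true, "l": true, "m": true, "n": true, "s": true, "t": true}
	for _, term := range []string{"l'Institut", "d'Estudis", "Barcelona"} {
		fmt.Println(stripElision(term, articles))
	}
}
```

A set keyed by the article keeps the prefix check to a single map lookup, which mirrors how the filter is handed a token map of articles rather than a hard-coded list.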
36
analysis/lang/ca/stop_filter_ca.go
Normal file
|
@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ca

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/registry"
)

func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	tokenMap, err := cache.TokenMapNamed(StopName)
	if err != nil {
		return nil, err
	}
	return stop.NewStopTokensFilter(tokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
247
analysis/lang/ca/stop_words_ca.go
Normal file
|
@ -0,0 +1,247 @@
package ca

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const StopName = "stop_ca"

// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string

var CatalanStopWords = []byte(`# Catalan stopwords from http://github.com/vcl/cue.language (Apache 2 Licensed)
a
abans
ací
ah
així
això
al
als
aleshores
algun
alguna
algunes
alguns
alhora
allà
allí
allò
altra
altre
altres
amb
ambdós
ambdues
apa
aquell
aquella
aquelles
aquells
aquest
aquesta
aquestes
aquests
aquí
baix
cada
cadascú
cadascuna
cadascunes
cadascuns
com
contra
d'un
d'una
d'unes
d'uns
dalt
de
del
dels
des
després
dins
dintre
donat
doncs
durant
e
eh
el
els
em
en
encara
ens
entre
érem
eren
éreu
es
és
esta
està
estàvem
estaven
estàveu
esteu
et
etc
ets
fins
fora
gairebé
ha
han
has
havia
he
hem
heu
hi
ho
i
igual
iguals
ja
l'hi
la
les
li
li'n
llavors
m'he
ma
mal
malgrat
mateix
mateixa
mateixes
mateixos
me
mentre
més
meu
meus
meva
meves
molt
molta
moltes
molts
mon
mons
n'he
n'hi
ne
ni
no
nogensmenys
només
nosaltres
nostra
nostre
nostres
o
oh
oi
on
pas
pel
pels
per
però
perquè
poc
poca
pocs
poques
potser
propi
qual
quals
quan
quant
que
què
quelcom
qui
quin
quina
quines
quins
s'ha
s'han
sa
semblant
semblants
ses
seu
seus
seva
seva
seves
si
sobre
sobretot
sóc
solament
sols
son
són
sons
sota
sou
t'ha
t'han
t'he
ta
tal
també
tampoc
tan
tant
tanta
tantes
teu
teus
teva
teves
ton
tons
tot
tota
totes
tots
un
una
unes
uns
us
va
vaig
vam
van
vas
veu
vosaltres
vostra
vostre
vostres
`)

func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
	rv := analysis.NewTokenMap()
	err := rv.LoadBytes(CatalanStopWords)
	return rv, err
}

func init() {
	err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
	if err != nil {
		panic(err)
	}
}
|
60
analysis/lang/cjk/analyzer_cjk.go
Normal file
|
@ -0,0 +1,60 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cjk

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"

	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)

const AnalyzerName = "cjk"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	tokenizer, err := cache.TokenizerNamed(unicode.Name)
	if err != nil {
		return nil, err
	}
	widthFilter, err := cache.TokenFilterNamed(WidthName)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	bigramFilter, err := cache.TokenFilterNamed(BigramName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: tokenizer,
		TokenFilters: []analysis.TokenFilter{
			widthFilter,
			toLowerFilter,
			bigramFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
|
642
analysis/lang/cjk/analyzer_cjk_test.go
Normal file
|
@ -0,0 +1,642 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package cjk
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestCJKAnalyzer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
{
|
||||
input: []byte("こんにちは世界"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こん"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("んに"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("にち"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ちは"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は世"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世界"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 21,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("一二三四五六七八九十"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("一二"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("二三"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("三四"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("四五"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("五六"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("六七"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 21,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("七八"),
|
||||
Type: analysis.Double,
|
||||
Position: 7,
|
||||
Start: 18,
|
||||
End: 24,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("八九"),
|
||||
Type: analysis.Double,
|
||||
Position: 8,
|
||||
Start: 21,
|
||||
End: 27,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("九十"),
|
||||
Type: analysis.Double,
|
||||
Position: 9,
|
||||
Start: 24,
|
||||
End: 30,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("一 二三四 五六七八九 十"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("一"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("二三"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 4,
|
||||
End: 10,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("三四"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 7,
|
||||
End: 13,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("五六"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 14,
|
||||
End: 20,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("六七"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 17,
|
||||
End: 23,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("七八"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 20,
|
||||
End: 26,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("八九"),
|
||||
Type: analysis.Double,
|
||||
Position: 7,
|
||||
Start: 23,
|
||||
End: 29,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("十"),
|
||||
Type: analysis.Single,
|
||||
Position: 8,
|
||||
Start: 30,
|
||||
End: 33,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("abc defgh ijklmn opqrstu vwxy z"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("abc"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("defgh"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 2,
|
||||
Start: 4,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ijklmn"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 3,
|
||||
Start: 10,
|
||||
End: 16,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("opqrstu"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 4,
|
||||
Start: 17,
|
||||
End: 24,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("vwxy"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 5,
|
||||
Start: 25,
|
||||
End: 29,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("z"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 6,
|
||||
Start: 30,
|
||||
End: 31,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("あい"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("あい"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("あい "),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("あい"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("test"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("test"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 4,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("test "),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("test"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 4,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("あいtest"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("あい"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("test"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 2,
|
||||
Start: 6,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("testあい "),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("test"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 4,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("あい"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 4,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("あいうえおabcかきくけこ"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("あい"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("いう"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("うえ"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("えお"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("abc"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 5,
|
||||
Start: 15,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("かき"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 18,
|
||||
End: 24,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("きく"),
|
||||
Type: analysis.Double,
|
||||
Position: 7,
|
||||
Start: 21,
|
||||
End: 27,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("くけ"),
|
||||
Type: analysis.Double,
|
||||
Position: 8,
|
||||
Start: 24,
|
||||
End: 30,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("けこ"),
|
||||
Type: analysis.Double,
|
||||
Position: 9,
|
||||
Start: 27,
|
||||
End: 33,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("あいうえおabんcかきくけ こ"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("あい"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("いう"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("うえ"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("えお"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ab"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 5,
|
||||
Start: 15,
|
||||
End: 17,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Single,
|
||||
Position: 6,
|
||||
Start: 17,
|
||||
End: 20,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("c"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 7,
|
||||
Start: 20,
|
||||
End: 21,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("かき"),
|
||||
Type: analysis.Double,
|
||||
Position: 8,
|
||||
Start: 21,
|
||||
End: 27,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("きく"),
|
||||
Type: analysis.Double,
|
||||
Position: 9,
|
||||
Start: 24,
|
||||
End: 30,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("くけ"),
|
||||
Type: analysis.Double,
|
||||
Position: 10,
|
||||
Start: 27,
|
||||
End: 33,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Single,
|
||||
Position: 11,
|
||||
Start: 34,
|
||||
End: 37,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("一 روبرت موير"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("一"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("روبرت"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 2,
|
||||
Start: 4,
|
||||
End: 14,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("موير"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 3,
|
||||
Start: 15,
|
||||
End: 23,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("一 رُوبرت موير"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("一"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("رُوبرت"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 2,
|
||||
Start: 4,
|
||||
End: 16,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("موير"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 3,
|
||||
Start: 17,
|
||||
End: 25,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("𩬅艱鍟䇹愯瀛"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("𩬅艱"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 7,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("艱鍟"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 4,
|
||||
End: 10,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("鍟䇹"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 7,
|
||||
End: 13,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("䇹愯"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 10,
|
||||
End: 16,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("愯瀛"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 13,
|
||||
End: 19,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("一"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("一"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("一丁丂"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("一丁"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("丁丂"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
for _, test := range tests {
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
actual := analyzer.Analyze(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %v, got %v", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func BenchmarkCJKAnalyzer(b *testing.B) {
|
||||
cache := registry.NewCache()
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
b.Fatal(err)
|
||||
}
|
||||
|
||||
for i := 0; i < b.N; i++ {
|
||||
analyzer.Analyze(bleveWikiArticleJapanese)
|
||||
}
|
||||
}
|
||||
|
||||
var bleveWikiArticleJapanese = []byte(`加圧容器に貯蔵されている液体物質は、その時の気液平衡状態にあるが、火災により容器が加熱されていると容器内の液体は、その物質の大気圧のもとでの沸点より十分に高い温度まで加熱され、圧力も高くなる。この状態で容器が破裂すると容器内部の圧力は瞬間的に大気圧にまで低下する。
|
||||
この時に容器内の平衡状態が破られ、液体は突沸し、気体になることで爆発現象を起こす。液化石油ガスなどでは、さらに拡散して空気と混ざったガスが自由空間蒸気雲爆発を起こす。液化石油ガスなどの常温常圧で気体になる物を高い圧力で液化して収納している容器、あるいは、そのような液体を輸送するためのパイプラインや配管などが火災などによって破壊されたときに起きる。
|
||||
ブリーブという現象が明らかになったのは、フランス・リヨンの郊外にあるフェザンという町のフェザン製油所(ウニオン・ド・ゼネラル・ド・ペトロール)で大規模な爆発火災事故が発生したときだと言われている。
|
||||
中身の液体が高温高圧の水である場合には「水蒸気爆発」と呼ばれる。`)
|
210
analysis/lang/cjk/cjk_bigram.go
Normal file
|
@ -0,0 +1,210 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cjk

import (
	"bytes"
	"container/ring"
	"unicode/utf8"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const BigramName = "cjk_bigram"

type CJKBigramFilter struct {
	outputUnigram bool
}

func NewCJKBigramFilter(outputUnigram bool) *CJKBigramFilter {
	return &CJKBigramFilter{
		outputUnigram: outputUnigram,
	}
}

func (s *CJKBigramFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	r := ring.New(2)
	itemsInRing := 0
	pos := 1
	outputPos := 1

	rv := make(analysis.TokenStream, 0, len(input))

	for _, tokout := range input {
		if tokout.Type == analysis.Ideographic {
			runes := bytes.Runes(tokout.Term)
			sofar := 0
			for _, run := range runes {
				rlen := utf8.RuneLen(run)
				token := &analysis.Token{
					Term:     tokout.Term[sofar : sofar+rlen],
					Start:    tokout.Start + sofar,
					End:      tokout.Start + sofar + rlen,
					Position: pos,
					Type:     tokout.Type,
					KeyWord:  tokout.KeyWord,
				}
				pos++
				sofar += rlen
				if itemsInRing > 0 {
					// if items already buffered
					// check to see if this is aligned
					curr := r.Value.(*analysis.Token)
					if token.Start-curr.End != 0 {
						// not aligned flush
						flushToken := s.flush(r, &itemsInRing, outputPos)
						if flushToken != nil {
							outputPos++
							rv = append(rv, flushToken)
						}
					}
				}
				// now we can add this token to the buffer
				r = r.Next()
				r.Value = token
				if itemsInRing < 2 {
					itemsInRing++
				}
				builtUnigram := false
				if itemsInRing > 1 && s.outputUnigram {
					unigram := s.buildUnigram(r, &itemsInRing, outputPos)
					if unigram != nil {
						builtUnigram = true
						rv = append(rv, unigram)
					}
				}
				bigramToken := s.outputBigram(r, &itemsInRing, outputPos)
				if bigramToken != nil {
					rv = append(rv, bigramToken)
					outputPos++
				}

				// prev token should be removed if unigram was built
				if builtUnigram {
					itemsInRing--
				}
			}

		} else {
			// flush anything already buffered
			flushToken := s.flush(r, &itemsInRing, outputPos)
			if flushToken != nil {
				rv = append(rv, flushToken)
				outputPos++
			}
			// output this token as is
			tokout.Position = outputPos
			rv = append(rv, tokout)
			outputPos++
		}
	}

	// deal with possible trailing unigram
	if itemsInRing == 1 || s.outputUnigram {
		if itemsInRing == 2 {
			r = r.Next()
		}
		unigram := s.buildUnigram(r, &itemsInRing, outputPos)
		if unigram != nil {
			rv = append(rv, unigram)
		}
	}
	return rv
}

func (s *CJKBigramFilter) flush(r *ring.Ring, itemsInRing *int, pos int) *analysis.Token {
	var rv *analysis.Token
	if *itemsInRing == 1 {
		rv = s.buildUnigram(r, itemsInRing, pos)
	}
	r.Value = nil
	*itemsInRing = 0

	return rv
}

func (s *CJKBigramFilter) outputBigram(r *ring.Ring, itemsInRing *int, pos int) *analysis.Token {
	if *itemsInRing == 2 {
		thisShingleRing := r.Move(-1)
		shingledBytes := make([]byte, 0)

		// do first token
		prev := thisShingleRing.Value.(*analysis.Token)
		shingledBytes = append(shingledBytes, prev.Term...)

		// do second token
		thisShingleRing = thisShingleRing.Next()
		curr := thisShingleRing.Value.(*analysis.Token)
		shingledBytes = append(shingledBytes, curr.Term...)

		token := analysis.Token{
			Type:     analysis.Double,
			Term:     shingledBytes,
			Position: pos,
			Start:    prev.Start,
			End:      curr.End,
		}
		return &token
	}

	return nil
}

func (s *CJKBigramFilter) buildUnigram(r *ring.Ring, itemsInRing *int, pos int) *analysis.Token {
	switch *itemsInRing {
	case 2:
		thisShingleRing := r.Move(-1)
		// do first token
		prev := thisShingleRing.Value.(*analysis.Token)
		token := analysis.Token{
			Type:     analysis.Single,
			Term:     prev.Term,
			Position: pos,
			Start:    prev.Start,
			End:      prev.End,
		}
		return &token
	case 1:
		// do first token
		prev := r.Value.(*analysis.Token)
		token := analysis.Token{
			Type:     analysis.Single,
			Term:     prev.Term,
			Position: pos,
			Start:    prev.Start,
			End:      prev.End,
		}
		return &token
	}

	return nil
}

func CJKBigramFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	outputUnigram := false
	outVal, ok := config["output_unigram"].(bool)
	if ok {
		outputUnigram = outVal
	}
	return NewCJKBigramFilter(outputUnigram), nil
}

func init() {
	err := registry.RegisterTokenFilter(BigramName, CJKBigramFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
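The filter above pairs adjacent ideographic runes into overlapping two-rune terms and falls back to a single-rune term when a run has only one rune. The following stdlib-only sketch shows just that windowing on a plain string. It is an illustration under simplifying assumptions, not the filter itself: `bigrams` is a hypothetical helper, the whole input is treated as one contiguous ideographic run, and positions, byte offsets and the `output_unigram` option are ignored.

```go
package main

import "fmt"

// bigrams is a hypothetical helper (not part of bleve). It returns the
// overlapping two-rune windows of s. A single-rune input comes back as a
// one-element slice (a unigram), and an empty input yields nil.
func bigrams(s string) []string {
	runes := []rune(s)
	if len(runes) == 0 {
		return nil
	}
	if len(runes) == 1 {
		return []string{string(runes)}
	}
	out := make([]string, 0, len(runes)-1)
	for i := 0; i+1 < len(runes); i++ {
		// each window starts one rune after the previous one
		out = append(out, string(runes[i:i+2]))
	}
	return out
}

func main() {
	fmt.Println(bigrams("一丁丂")) // [一丁 丁丂]
	fmt.Println(bigrams("一"))   // [一]
}
```

These two inputs mirror the last two cases in the analyzer test table, where "一丁丂" yields the terms "一丁" and "丁丂" and a lone "一" is emitted as a single term.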
848
analysis/lang/cjk/cjk_bigram_test.go
Normal file
|
@ -0,0 +1,848 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package cjk
|
||||
|
||||
import (
|
||||
"container/ring"
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
)
|
||||
|
||||
// Helper function to create a token
|
||||
func makeToken(term string, start, end, pos int) *analysis.Token {
|
||||
return &analysis.Token{
|
||||
Term: []byte(term),
|
||||
Start: start,
|
||||
End: end,
|
||||
Position: pos, // Note: buildUnigram uses the 'pos' argument, not the token's original pos
|
||||
Type: analysis.Ideographic,
|
||||
}
|
||||
}
|
||||
|
||||
func TestCJKBigramFilter_buildUnigram(t *testing.T) {
|
||||
filter := NewCJKBigramFilter(false)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
ringSetup func() (*ring.Ring, int) // Function to set up the ring and itemsInRing
|
||||
inputPos int // Position to pass to buildUnigram
|
||||
expectToken *analysis.Token
|
||||
}{
|
||||
{
|
||||
name: "itemsInRing == 2",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("一", 0, 3, 1) // Original pos 1
|
||||
token2 := makeToken("二", 3, 6, 2) // Original pos 2
|
||||
r.Value = token1
|
||||
r = r.Next()
|
||||
r.Value = token2
|
||||
// r currently points to token2, r.Move(-1) points to token1
|
||||
return r, 2
|
||||
},
|
||||
inputPos: 10, // Expected output position
|
||||
expectToken: &analysis.Token{
|
||||
Type: analysis.Single,
|
||||
Term: []byte("一"),
|
||||
Position: 10, // Should use inputPos
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "itemsInRing == 1 (ring points to the single item)",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("三", 6, 9, 3)
|
||||
r.Value = token1
|
||||
// r points to token1
|
||||
return r, 1
|
||||
},
|
||||
inputPos: 11,
|
||||
expectToken: &analysis.Token{
|
||||
Type: analysis.Single,
|
||||
Term: []byte("三"),
|
||||
Position: 11, // Should use inputPos
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "itemsInRing == 1 (ring points to nil, next is the single item)",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("四", 9, 12, 4)
|
||||
r = r.Next() // r points to nil initially
|
||||
r.Value = token1
|
||||
// r points to token1
|
||||
return r, 1
|
||||
},
|
||||
inputPos: 12,
|
||||
expectToken: &analysis.Token{
|
||||
Type: analysis.Single,
|
||||
Term: []byte("四"),
|
||||
Position: 12, // Should use inputPos
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "itemsInRing == 0",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
// Ring is empty
|
||||
return r, 0
|
||||
},
|
||||
inputPos: 13,
|
||||
expectToken: nil, // Expect nil when itemsInRing is not 1 or 2
|
||||
},
|
||||
{
|
||||
name: "itemsInRing > 2 (should behave like 0)",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("五", 12, 15, 5)
|
||||
token2 := makeToken("六", 15, 18, 6)
|
||||
r.Value = token1
|
||||
r = r.Next()
|
||||
r.Value = token2
|
||||
// Simulate incorrect itemsInRing count
|
||||
return r, 3
|
||||
},
|
||||
inputPos: 14,
|
||||
expectToken: nil, // Expect nil when itemsInRing is not 1 or 2
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
ringPtr, itemsInRing := tt.ringSetup()
|
||||
itemsInRingCopy := itemsInRing // Pass a pointer to a copy
|
||||
|
||||
gotToken := filter.buildUnigram(ringPtr, &itemsInRingCopy, tt.inputPos)
|
||||
|
||||
if !reflect.DeepEqual(gotToken, tt.expectToken) {
|
||||
t.Errorf("buildUnigram() got = %v, want %v", gotToken, tt.expectToken)
|
||||
}
|
||||
|
||||
// Check if itemsInRing was modified (it shouldn't be by buildUnigram)
|
||||
if itemsInRingCopy != itemsInRing {
|
||||
t.Errorf("buildUnigram() modified itemsInRing, got = %d, want %d", itemsInRingCopy, itemsInRing)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestCJKBigramFilter_outputBigram(t *testing.T) {
|
||||
// Create a filter instance (outputUnigram value doesn't matter for outputBigram)
|
||||
filter := NewCJKBigramFilter(false)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
ringSetup func() (*ring.Ring, int) // Function to set up the ring and itemsInRing
|
||||
inputPos int // Position to pass to outputBigram
|
||||
expectToken *analysis.Token
|
||||
}{
|
||||
{
|
||||
name: "itemsInRing == 2",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("一", 0, 3, 1) // Original pos 1
|
||||
token2 := makeToken("二", 3, 6, 2) // Original pos 2
|
||||
r.Value = token1
|
||||
r = r.Next()
|
||||
r.Value = token2
|
||||
// r currently points to token2, r.Move(-1) points to token1
|
||||
return r, 2
|
||||
},
|
||||
inputPos: 10, // Expected output position
|
||||
expectToken: &analysis.Token{
|
||||
Type: analysis.Double,
|
||||
Term: []byte("一二"), // Combined term
|
||||
Position: 10, // Should use inputPos
|
||||
Start: 0, // Start of first token
|
||||
End: 6, // End of second token
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "itemsInRing == 2 with different terms",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("你好", 0, 6, 1)
|
||||
token2 := makeToken("世界", 6, 12, 2)
|
||||
r.Value = token1
|
||||
r = r.Next()
|
||||
r.Value = token2
|
||||
return r, 2
|
||||
},
|
||||
inputPos: 5,
|
||||
expectToken: &analysis.Token{
|
||||
Type: analysis.Double,
|
||||
Term: []byte("你好世界"),
|
||||
Position: 5,
|
||||
Start: 0,
|
||||
End: 12,
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "itemsInRing == 1",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("三", 6, 9, 3)
|
||||
r.Value = token1
|
||||
return r, 1
|
||||
},
|
||||
inputPos: 11,
|
||||
expectToken: nil, // Expect nil when itemsInRing is not 2
|
||||
},
|
||||
{
|
||||
name: "itemsInRing == 0",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
// Ring is empty
|
||||
return r, 0
|
||||
},
|
||||
inputPos: 13,
|
||||
expectToken: nil, // Expect nil when itemsInRing is not 2
|
||||
},
|
||||
{
|
||||
name: "itemsInRing > 2 (should behave like 0)",
|
||||
ringSetup: func() (*ring.Ring, int) {
|
||||
r := ring.New(2)
|
||||
token1 := makeToken("五", 12, 15, 5)
|
||||
token2 := makeToken("六", 15, 18, 6)
|
||||
r.Value = token1
|
||||
r = r.Next()
|
||||
r.Value = token2
|
||||
// Simulate incorrect itemsInRing count
|
||||
return r, 3
|
||||
},
|
||||
inputPos: 14,
|
||||
expectToken: nil, // Expect nil when itemsInRing is not 2
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
ringPtr, itemsInRing := tt.ringSetup()
|
||||
itemsInRingCopy := itemsInRing // Pass a pointer to a copy
|
||||
|
||||
gotToken := filter.outputBigram(ringPtr, &itemsInRingCopy, tt.inputPos)
|
||||
|
||||
if !reflect.DeepEqual(gotToken, tt.expectToken) {
|
||||
t.Errorf("outputBigram() got = %v, want %v", gotToken, tt.expectToken)
|
||||
}
|
||||
|
||||
// Check if itemsInRing was modified (it shouldn't be by outputBigram)
|
||||
if itemsInRingCopy != itemsInRing {
|
||||
t.Errorf("outputBigram() modified itemsInRing, got = %d, want %d", itemsInRingCopy, itemsInRing)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestCJKBigramFilter(t *testing.T) {
|
||||
tests := []struct {
|
||||
outputUnigram bool
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// first test that non-adjacent terms are not combined
|
||||
{
|
||||
outputUnigram: false,
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 2,
|
||||
Start: 5,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Single,
|
||||
Position: 2,
|
||||
Start: 5,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
outputUnigram: false,
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("に"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ち"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("界"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 7,
|
||||
Start: 18,
|
||||
End: 21,
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こん"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("んに"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("にち"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ちは"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は世"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世界"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 21,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
outputUnigram: true,
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("に"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ち"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("界"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 7,
|
||||
Start: 18,
|
||||
End: 21,
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("こん"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Single,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("んに"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("に"),
|
||||
Type: analysis.Single,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("にち"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ち"),
|
||||
Type: analysis.Single,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ちは"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は"),
|
||||
Type: analysis.Single,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は世"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世"),
|
||||
Type: analysis.Single,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 18,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世界"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 15,
|
||||
End: 21,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("界"),
|
||||
Type: analysis.Single,
|
||||
Position: 7,
|
||||
Start: 18,
|
||||
End: 21,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
// Assuming that `、` is removed by unicode tokenizer from `こんにちは、世界`
|
||||
outputUnigram: true,
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("に"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ち"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 7,
|
||||
Start: 18,
|
||||
End: 21,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("界"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 8,
|
||||
Start: 21,
|
||||
End: 24,
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Single,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("こん"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Single,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("んに"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("に"),
|
||||
Type: analysis.Single,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("にち"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ち"),
|
||||
Type: analysis.Single,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ちは"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は"),
|
||||
Type: analysis.Single,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世"),
|
||||
Type: analysis.Single,
|
||||
Position: 6,
|
||||
Start: 18,
|
||||
End: 21,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世界"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 18,
|
||||
End: 24,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("界"),
|
||||
Type: analysis.Single,
|
||||
Position: 7,
|
||||
Start: 21,
|
||||
End: 24,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
outputUnigram: false,
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こ"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 3,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ん"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("に"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ち"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("は"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("cat"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 6,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 7,
|
||||
Start: 18,
|
||||
End: 21,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("界"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 8,
|
||||
Start: 21,
|
||||
End: 24,
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("こん"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("んに"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("にち"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ちは"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("cat"),
|
||||
Type: analysis.AlphaNumeric,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("世界"),
|
||||
Type: analysis.Double,
|
||||
Position: 6,
|
||||
Start: 18,
|
||||
End: 24,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
outputUnigram: false,
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("パイプライン"),
|
||||
Type: analysis.Ideographic,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 18,
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("パイ"),
|
||||
Type: analysis.Double,
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("イプ"),
|
||||
Type: analysis.Double,
|
||||
Position: 2,
|
||||
Start: 3,
|
||||
End: 9,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("プラ"),
|
||||
Type: analysis.Double,
|
||||
Position: 3,
|
||||
Start: 6,
|
||||
End: 12,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("ライ"),
|
||||
Type: analysis.Double,
|
||||
Position: 4,
|
||||
Start: 9,
|
||||
End: 15,
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("イン"),
|
||||
Type: analysis.Double,
|
||||
Position: 5,
|
||||
Start: 12,
|
||||
End: 18,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
cjkBigramFilter := NewCJKBigramFilter(test.outputUnigram)
|
||||
actual := cjkBigramFilter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %s, got %s", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
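The table-driven cases above pin down two behaviours: tokens are only combined when they are adjacent (the previous token's `End` equals the next token's `Start`), and output positions are renumbered from 1. The sketch below reproduces just those two rules with the standard library. It assumes `outputUnigram` is false and that every token is a single ideographic character; `tok` and `bigrams` are hypothetical names, and this is not the ring-based implementation used by the filter.

```go
package main

import "fmt"

// tok is a hypothetical, stripped-down stand-in for analysis.Token.
type tok struct {
	Term       string
	Start, End int
	Pos        int
}

// bigrams pairs each token with its successor when the two are adjacent
// (prev.End == next.Start). A token with no adjacent neighbour is emitted
// on its own. Output positions are renumbered starting at 1.
func bigrams(in []tok) []tok {
	var out []tok
	pos := 0
	emit := func(term string, start, end int) {
		pos++
		out = append(out, tok{Term: term, Start: start, End: end, Pos: pos})
	}
	for i := 0; i < len(in); {
		// Find the last index j of the run of adjacent tokens starting at i.
		j := i
		for j+1 < len(in) && in[j].End == in[j+1].Start {
			j++
		}
		if j == i {
			// Isolated token: emit it unchanged as a unigram.
			emit(in[i].Term, in[i].Start, in[i].End)
		}
		// A run of n tokens yields n-1 overlapping bigrams.
		for k := i; k < j; k++ {
			emit(in[k].Term+in[k+1].Term, in[k].Start, in[k+1].End)
		}
		i = j + 1
	}
	return out
}

func main() {
	adjacent := []tok{{"こ", 0, 3, 1}, {"ん", 3, 6, 2}, {"に", 6, 9, 3}}
	for _, t := range bigrams(adjacent) {
		fmt.Println(t.Term, t.Start, t.End, t.Pos)
	}
}
```

Comparing byte offsets rather than positions is what keeps the first test case (offsets 0-3 and 5-8) from being merged even though the positions 1 and 2 are consecutive.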
|
104
analysis/lang/cjk/cjk_width.go
Normal file
|
@ -0,0 +1,104 @@
|
|||
// Copyright (c) 2016 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package cjk
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"unicode/utf8"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const WidthName = "cjk_width"
|
||||
|
||||
type CJKWidthFilter struct{}
|
||||
|
||||
func NewCJKWidthFilter() *CJKWidthFilter {
|
||||
return &CJKWidthFilter{}
|
||||
}
|
||||
|
||||
func (s *CJKWidthFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
runeCount := utf8.RuneCount(token.Term)
|
||||
runes := bytes.Runes(token.Term)
|
||||
for i := 0; i < runeCount; i++ {
|
||||
ch := runes[i]
|
||||
if ch >= 0xFF01 && ch <= 0xFF5E {
|
||||
// fullwidth ASCII variants
|
||||
runes[i] -= 0xFEE0
|
||||
} else if ch >= 0xFF65 && ch <= 0xFF9F {
|
||||
// halfwidth Katakana variants
|
||||
if (ch == 0xFF9E || ch == 0xFF9F) && i > 0 && combine(runes, i, ch) {
|
||||
runes = analysis.DeleteRune(runes, i)
|
||||
i--
|
||||
runeCount = len(runes)
|
||||
} else {
|
||||
runes[i] = kanaNorm[ch-0xFF65]
|
||||
}
|
||||
}
|
||||
}
|
||||
token.Term = analysis.BuildTermFromRunes(runes)
|
||||
}
|
||||
|
||||
return input
|
||||
}
|
||||
|
||||
var kanaNorm = []rune{
|
||||
0x30fb, 0x30f2, 0x30a1, 0x30a3, 0x30a5, 0x30a7, 0x30a9, 0x30e3, 0x30e5,
|
||||
0x30e7, 0x30c3, 0x30fc, 0x30a2, 0x30a4, 0x30a6, 0x30a8, 0x30aa, 0x30ab,
|
||||
0x30ad, 0x30af, 0x30b1, 0x30b3, 0x30b5, 0x30b7, 0x30b9, 0x30bb, 0x30bd,
|
||||
0x30bf, 0x30c1, 0x30c4, 0x30c6, 0x30c8, 0x30ca, 0x30cb, 0x30cc, 0x30cd,
|
||||
0x30ce, 0x30cf, 0x30d2, 0x30d5, 0x30d8, 0x30db, 0x30de, 0x30df, 0x30e0,
|
||||
0x30e1, 0x30e2, 0x30e4, 0x30e6, 0x30e8, 0x30e9, 0x30ea, 0x30eb, 0x30ec,
|
||||
0x30ed, 0x30ef, 0x30f3, 0x3099, 0x309A,
|
||||
}
|
||||
|
||||
var kanaCombineVoiced = []rune{
|
||||
78, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
||||
0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
|
||||
0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
||||
0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
|
||||
}
|
||||
var kanaCombineHalfVoiced = []rune{
|
||||
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
||||
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2,
|
||||
0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
||||
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
||||
}
|
||||
|
||||
func combine(text []rune, pos int, r rune) bool {
|
||||
prev := text[pos-1]
|
||||
if prev >= 0x30A6 && prev <= 0x30FD {
|
||||
if r == 0xFF9F {
|
||||
text[pos-1] += kanaCombineHalfVoiced[prev-0x30A6]
|
||||
} else {
|
||||
text[pos-1] += kanaCombineVoiced[prev-0x30A6]
|
||||
}
|
||||
return text[pos-1] != prev
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func CJKWidthFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewCJKWidthFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(WidthName, CJKWidthFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
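The fullwidth branch of the filter relies on a fixed offset: the fullwidth ASCII variants U+FF01..U+FF5E sit exactly 0xFEE0 above their ASCII counterparts U+0021..U+007E. The following stdlib-only sketch isolates that one step; `foldFullwidthASCII` is a hypothetical name, and halfwidth Katakana handling (the `kanaNorm` and `combine` tables) is deliberately left out.

```go
package main

import "fmt"

// foldFullwidthASCII maps U+FF01..U+FF5E onto U+0021..U+007E by
// subtracting 0xFEE0, the same offset the filter uses. All other runes
// are left unchanged. Halfwidth Katakana is not handled in this sketch.
func foldFullwidthASCII(s string) string {
	runes := []rune(s)
	for i, r := range runes {
		if r >= 0xFF01 && r <= 0xFF5E {
			runes[i] = r - 0xFEE0
		}
	}
	return string(runes)
}

func main() {
	// U+FF34 U+FF45 U+FF53 U+FF54 are the fullwidth forms of "Test".
	fmt.Println(foldFullwidthASCII("\uFF34\uFF45\uFF53\uFF54"))
	// U+FF11..U+FF14 are the fullwidth digits 1 to 4.
	fmt.Println(foldFullwidthASCII("\uFF11\uFF12\uFF13\uFF14"))
}
```

The voiced-mark path works on the same principle with per-character offsets: for example, halfwidth U+FF73 normalizes to U+30A6 through `kanaNorm`, and a following U+FF9E adds the first `kanaCombineVoiced` entry (78) to give U+30F4.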
|
93
analysis/lang/cjk/cjk_width_test.go
Normal file
|
@ -0,0 +1,93 @@
|
|||
// Copyright (c) 2016 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package cjk
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
)
|
||||
|
||||
func TestCJKWidthFilter(t *testing.T) {
|
||||
|
||||
tests := []struct {
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("Ｔｅｓｔ"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("１２３４"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("Test"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("1234"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ｶﾀｶﾅ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("カタカナ"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ｳﾞｨｯﾂ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ヴィッツ"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("ﾊﾟﾅｿﾆｯｸ"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("パナソニック"),
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
cjkWidthFilter := NewCJKWidthFilter()
|
||||
actual := cjkWidthFilter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %s, got %s", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
64
analysis/lang/ckb/analyzer_ckb.go
Normal file
|
@ -0,0 +1,64 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ckb
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
|
||||
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const AnalyzerName = "ckb"
|
||||
|
||||
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
|
||||
unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
normCkbFilter, err := cache.TokenFilterNamed(NormalizeName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
stopCkbFilter, err := cache.TokenFilterNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
stemmerCkbFilter, err := cache.TokenFilterNamed(StemmerName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
rv := analysis.DefaultAnalyzer{
|
||||
Tokenizer: unicodeTokenizer,
|
||||
TokenFilters: []analysis.TokenFilter{
|
||||
normCkbFilter,
|
||||
toLowerFilter,
|
||||
stopCkbFilter,
|
||||
stemmerCkbFilter,
|
||||
},
|
||||
}
|
||||
return &rv, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
77
analysis/lang/ckb/analyzer_ckb_test.go
Normal file
|
@ -0,0 +1,77 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package ckb
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestSoraniAnalyzer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// stop word removal
|
||||
{
|
||||
input: []byte("ئەم پیاوە"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("پیاو"),
|
||||
Position: 2,
|
||||
Start: 7,
|
||||
End: 17,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("پیاوە"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("پیاو"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("پیاو"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("پیاو"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := analyzer.Analyze(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %v, got %v", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
121
analysis/lang/ckb/sorani_normalize.go
Normal file
|
@@ -0,0 +1,121 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ckb

import (
	"bytes"
	"unicode"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const NormalizeName = "normalize_ckb"

const (
	Yeh        = '\u064A'
	DotlessYeh = '\u0649'
	FarsiYeh   = '\u06CC'

	Kaf   = '\u0643'
	Keheh = '\u06A9'

	Heh            = '\u0647'
	Ae             = '\u06D5'
	Zwnj           = '\u200C'
	HehDoachashmee = '\u06BE'
	TehMarbuta     = '\u0629'

	Reh       = '\u0631'
	Rreh      = '\u0695'
	RrehAbove = '\u0692'

	Tatweel  = '\u0640'
	Fathatan = '\u064B'
	Dammatan = '\u064C'
	Kasratan = '\u064D'
	Fatha    = '\u064E'
	Damma    = '\u064F'
	Kasra    = '\u0650'
	Shadda   = '\u0651'
	Sukun    = '\u0652'
)

type SoraniNormalizeFilter struct {
}

func NewSoraniNormalizeFilter() *SoraniNormalizeFilter {
	return &SoraniNormalizeFilter{}
}

func (s *SoraniNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		term := normalize(token.Term)
		token.Term = term
	}
	return input
}

func normalize(input []byte) []byte {
	runes := bytes.Runes(input)
	for i := 0; i < len(runes); i++ {
		switch runes[i] {
		case Yeh, DotlessYeh:
			runes[i] = FarsiYeh
		case Kaf:
			runes[i] = Keheh
		case Zwnj:
			if i > 0 && runes[i-1] == Heh {
				runes[i-1] = Ae
			}
			runes = analysis.DeleteRune(runes, i)
			i--
		case Heh:
			if i == len(runes)-1 {
				runes[i] = Ae
			}
		case TehMarbuta:
			runes[i] = Ae
		case HehDoachashmee:
			runes[i] = Heh
		case Reh:
			if i == 0 {
				runes[i] = Rreh
			}
		case RrehAbove:
			runes[i] = Rreh
		case Tatweel, Kasratan, Dammatan, Fathatan, Fatha, Damma, Kasra, Shadda, Sukun:
			runes = analysis.DeleteRune(runes, i)
			i--
		default:
			if unicode.In(runes[i], unicode.Cf) {
				runes = analysis.DeleteRune(runes, i)
				i--
			}
		}
	}
	return analysis.BuildTermFromRunes(runes)
}

func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	return NewSoraniNormalizeFilter(), nil
}

func init() {
	err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
	if err != nil {
		panic(err)
	}
}
323
analysis/lang/ckb/sorani_normalize_test.go
Normal file
@@ -0,0 +1,323 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ckb

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
)

func TestSoraniNormalizeFilter(t *testing.T) {
	tests := []struct {
		input  analysis.TokenStream
		output analysis.TokenStream
	}{
		// test Y
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u064A"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06CC"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0649"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06CC"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06CC"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06CC"),
				},
			},
		},
		// test K
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0643"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06A9"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06A9"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06A9"),
				},
			},
		},
		// test H
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0647\u200C"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06D5"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0647\u200C\u06A9"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06D5\u06A9"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06BE"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0647"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0629"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u06D5"),
				},
			},
		},
		// test final H
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0647\u0647\u0647"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0647\u0647\u06D5"),
				},
			},
		},
		// test RR
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0692"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0695"),
				},
			},
		},
		// test initial RR
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0631\u0631\u0631"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0695\u0631\u0631"),
				},
			},
		},
		// test remove
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0640"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u064B"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u064C"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u064D"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u064E"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u064F"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0650"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0651"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u0652"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("\u200C"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
		// empty
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
	}

	soraniNormalizeFilter := NewSoraniNormalizeFilter()
	for _, test := range tests {
		actual := soraniNormalizeFilter.Filter(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %#v, got %#v", test.output, actual)
			t.Errorf("expected % x, got % x", test.output[0].Term, actual[0].Term)
		}
	}
}
151
analysis/lang/ckb/sorani_stemmer_filter.go
Normal file
@@ -0,0 +1,151 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ckb

import (
	"bytes"
	"unicode/utf8"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const StemmerName = "stemmer_ckb"

type SoraniStemmerFilter struct {
}

func NewSoraniStemmerFilter() *SoraniStemmerFilter {
	return &SoraniStemmerFilter{}
}

func (s *SoraniStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		// if not protected keyword, stem it
		if !token.KeyWord {
			stemmed := stem(token.Term)
			token.Term = stemmed
		}
	}
	return input
}

func stem(input []byte) []byte {
	inputLen := utf8.RuneCount(input)

	// postposition
	if inputLen > 5 && bytes.HasSuffix(input, []byte("دا")) {
		input = truncateRunes(input, 2)
		inputLen = utf8.RuneCount(input)
	} else if inputLen > 4 && bytes.HasSuffix(input, []byte("نا")) {
		input = truncateRunes(input, 1)
		inputLen = utf8.RuneCount(input)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("ەوە")) {
		input = truncateRunes(input, 3)
		inputLen = utf8.RuneCount(input)
	}

	// possessive pronoun
	if inputLen > 6 &&
		(bytes.HasSuffix(input, []byte("مان")) ||
			bytes.HasSuffix(input, []byte("یان")) ||
			bytes.HasSuffix(input, []byte("تان"))) {
		input = truncateRunes(input, 3)
		inputLen = utf8.RuneCount(input)
	}

	// indefinite singular ezafe
	if inputLen > 6 && bytes.HasSuffix(input, []byte("ێکی")) {
		return truncateRunes(input, 3)
	} else if inputLen > 7 && bytes.HasSuffix(input, []byte("یەکی")) {
		return truncateRunes(input, 4)
	}

	if inputLen > 5 && bytes.HasSuffix(input, []byte("ێک")) {
		// indefinite singular
		return truncateRunes(input, 2)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("یەک")) {
		// indefinite singular
		return truncateRunes(input, 3)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("ەکە")) {
		// definite singular
		return truncateRunes(input, 3)
	} else if inputLen > 5 && bytes.HasSuffix(input, []byte("کە")) {
		// definite singular
		return truncateRunes(input, 2)
	} else if inputLen > 7 && bytes.HasSuffix(input, []byte("ەکان")) {
		// definite plural
		return truncateRunes(input, 4)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("کان")) {
		// definite plural
		return truncateRunes(input, 3)
	} else if inputLen > 7 && bytes.HasSuffix(input, []byte("یانی")) {
		// indefinite plural ezafe
		return truncateRunes(input, 4)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("انی")) {
		// indefinite plural ezafe
		return truncateRunes(input, 3)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("یان")) {
		// indefinite plural
		return truncateRunes(input, 3)
	} else if inputLen > 5 && bytes.HasSuffix(input, []byte("ان")) {
		// indefinite plural
		return truncateRunes(input, 2)
	} else if inputLen > 7 && bytes.HasSuffix(input, []byte("یانە")) {
		// demonstrative plural
		return truncateRunes(input, 4)
	} else if inputLen > 6 && bytes.HasSuffix(input, []byte("انە")) {
		// demonstrative plural
		return truncateRunes(input, 3)
	} else if inputLen > 5 && (bytes.HasSuffix(input, []byte("ایە")) || bytes.HasSuffix(input, []byte("ەیە"))) {
		// demonstrative singular
		return truncateRunes(input, 2)
	} else if inputLen > 4 && bytes.HasSuffix(input, []byte("ە")) {
		// demonstrative singular
		return truncateRunes(input, 1)
	} else if inputLen > 4 && bytes.HasSuffix(input, []byte("ی")) {
		// absolute singular ezafe
		return truncateRunes(input, 1)
	}
	return input
}

func truncateRunes(input []byte, num int) []byte {
	runes := bytes.Runes(input)
	runes = runes[:len(runes)-num]
	out := buildTermFromRunes(runes)
	return out
}

func buildTermFromRunes(runes []rune) []byte {
	rv := make([]byte, 0, len(runes)*4)
	for _, r := range runes {
		runeBytes := make([]byte, utf8.RuneLen(r))
		utf8.EncodeRune(runeBytes, r)
		rv = append(rv, runeBytes...)
	}
	return rv
}

func StemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	return NewSoraniStemmerFilter(), nil
}

func init() {
	err := registry.RegisterTokenFilter(StemmerName, StemmerFilterConstructor)
	if err != nil {
		panic(err)
	}
}
299
analysis/lang/ckb/sorani_stemmer_filter_test.go
Normal file
@@ -0,0 +1,299 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ckb

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/single"
)

func TestSoraniStemmerFilter(t *testing.T) {

	// in order to match the lucene tests
	// we will test with an analyzer, not just the stemmer
	analyzer := analysis.DefaultAnalyzer{
		Tokenizer: single.NewSingleTokenTokenizer(),
		TokenFilters: []analysis.TokenFilter{
			NewSoraniNormalizeFilter(),
			NewSoraniStemmerFilter(),
		},
	}

	tests := []struct {
		input  []byte
		output analysis.TokenStream
	}{
		{ // -ek
			input: []byte("پیاوێک"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پیاو"),
					Position: 1,
					Start:    0,
					End:      12,
				},
			},
		},
		{ // -yek
			input: []byte("دەرگایەک"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دەرگا"),
					Position: 1,
					Start:    0,
					End:      16,
				},
			},
		},
		{ // -aka
			input: []byte("پیاوەكە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پیاو"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -ka
			input: []byte("دەرگاكە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دەرگا"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -a
			input: []byte("کتاویە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("کتاوی"),
					Position: 1,
					Start:    0,
					End:      12,
				},
			},
		},
		{ // -ya
			input: []byte("دەرگایە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دەرگا"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -An
			input: []byte("پیاوان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پیاو"),
					Position: 1,
					Start:    0,
					End:      12,
				},
			},
		},
		{ // -yAn
			input: []byte("دەرگایان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دەرگا"),
					Position: 1,
					Start:    0,
					End:      16,
				},
			},
		},
		{ // -akAn
			input: []byte("پیاوەکان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پیاو"),
					Position: 1,
					Start:    0,
					End:      16,
				},
			},
		},
		{ // -kAn
			input: []byte("دەرگاکان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دەرگا"),
					Position: 1,
					Start:    0,
					End:      16,
				},
			},
		},
		{ // -Ana
			input: []byte("پیاوانە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پیاو"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -yAna
			input: []byte("دەرگایانە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دەرگا"),
					Position: 1,
					Start:    0,
					End:      18,
				},
			},
		},
		{ // Ezafe singular
			input: []byte("هۆتیلی"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("هۆتیل"),
					Position: 1,
					Start:    0,
					End:      12,
				},
			},
		},
		{ // Ezafe indefinite
			input: []byte("هۆتیلێکی"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("هۆتیل"),
					Position: 1,
					Start:    0,
					End:      16,
				},
			},
		},
		{ // Ezafe plural
			input: []byte("هۆتیلانی"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("هۆتیل"),
					Position: 1,
					Start:    0,
					End:      16,
				},
			},
		},
		{ // -awa
			input: []byte("دوورەوە"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("دوور"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -dA
			input: []byte("نیوەشەودا"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("نیوەشەو"),
					Position: 1,
					Start:    0,
					End:      18,
				},
			},
		},
		{ // -A
			input: []byte("سۆرانا"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("سۆران"),
					Position: 1,
					Start:    0,
					End:      12,
				},
			},
		},
		{ // -mAn
			input: []byte("پارەمان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پارە"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -tAn
			input: []byte("پارەتان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پارە"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // -yAn
			input: []byte("پارەیان"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("پارە"),
					Position: 1,
					Start:    0,
					End:      14,
				},
			},
		},
		{ // empty
			input: []byte(""),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte(""),
					Position: 1,
					Start:    0,
					End:      0,
				},
			},
		},
	}

	for _, test := range tests {
		actual := analyzer.Analyze(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("for input %s(% x)", test.input, test.input)
			t.Errorf("\texpected:")
			for _, token := range test.output {
				t.Errorf("\t\t%v %s(% x)", token, token.Term, token.Term)
			}
			t.Errorf("\tactual:")
			for _, token := range actual {
				t.Errorf("\t\t%v %s(% x)", token, token.Term, token.Term)
			}
		}
	}
}
36
analysis/lang/ckb/stop_filter_ckb.go
Normal file
@@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package ckb

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/registry"
)

func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	tokenMap, err := cache.TokenMapNamed(StopName)
	if err != nil {
		return nil, err
	}
	return stop.NewStopTokensFilter(tokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
	if err != nil {
		panic(err)
	}
}
163
analysis/lang/ckb/stop_words_ckb.go
Normal file
@@ -0,0 +1,163 @@
package ckb

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const StopName = "stop_ckb"

// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string

var SoraniStopWords = []byte(`# set of kurdish stopwords
# note these have been normalized with our scheme (e represented with U+06D5, etc)
# constructed from:
# * Fig 5 of "Building A Test Collection For Sorani Kurdish" (Esmaili et al)
# * "Sorani Kurdish: A Reference Grammar with selected readings" (Thackston)
# * Corpus-based analysis of 77M word Sorani collection: wikipedia, news, blogs, etc

# and
و
# which
کە
# of
ی
# made/did
کرد
# that/which
ئەوەی
# on/head
سەر
# two
دوو
# also
هەروەها
# from/that
لەو
# makes/does
دەکات
# some
چەند
# every
هەر

# demonstratives
# that
ئەو
# this
ئەم

# personal pronouns
# I
من
# we
ئێمە
# you
تۆ
# you
ئێوە
# he/she/it
ئەو
# they
ئەوان

# prepositions
# to/with/by
بە
پێ
# without
بەبێ
# along with/while/during
بەدەم
# in the opinion of
بەلای
# according to
بەپێی
# before
بەرلە
# in the direction of
بەرەوی
# in front of/toward
بەرەوە
# before/in the face of
بەردەم
# without
بێ
# except for
بێجگە
# for
بۆ
# on/in
دە
تێ
# with
دەگەڵ
# after
دوای
# except for/aside from
جگە
# in/from
لە
لێ
# in front of/before/because of
لەبەر
# between/among
لەبەینی
# concerning/about
لەبابەت
# concerning
لەبارەی
# instead of
لەباتی
# beside
لەبن
# instead of
لەبرێتی
# behind
لەدەم
# with/together with
لەگەڵ
# by
لەلایەن
# within
لەناو
# between/among
لەنێو
# for the sake of
لەپێناوی
# with respect to
لەرەوی
# by means of/for
لەرێ
# for the sake of
لەرێگا
# on/on top of/according to
لەسەر
# under
لەژێر
# between/among
ناو
# between/among
نێوان
# after
پاش
# before
پێش
# like
وەک
`)

func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
	rv := analysis.NewTokenMap()
	err := rv.LoadBytes(SoraniStopWords)
	return rv, err
}

func init() {
	err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
	if err != nil {
		panic(err)
	}
}
36
analysis/lang/cs/stop_filter_cs.go
Normal file
@@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cs

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/registry"
)

func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	tokenMap, err := cache.TokenMapNamed(StopName)
	if err != nil {
		return nil, err
	}
	return stop.NewStopTokensFilter(tokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
	if err != nil {
		panic(err)
	}
}
199
analysis/lang/cs/stop_words_cs.go
Normal file
@@ -0,0 +1,199 @@
package cs

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const StopName = "stop_cs"

// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
// ` was changed to ' to allow for literal string

var CzechStopWords = []byte(`a
s
k
o
i
u
v
z
dnes
cz
tímto
budeš
budem
byli
jseš
můj
svým
ta
tomto
tohle
tuto
tyto
jej
zda
proč
máte
tato
kam
tohoto
kdo
kteří
mi
nám
tom
tomuto
mít
nic
proto
kterou
byla
toho
protože
asi
ho
naši
napište
re
což
tím
takže
svých
její
svými
jste
aj
tu
tedy
teto
bylo
kde
ke
pravé
ji
nad
nejsou
či
pod
téma
mezi
přes
ty
pak
vám
ani
když
však
neg
jsem
tento
článku
články
aby
jsme
před
pta
jejich
byl
ještě
až
bez
také
pouze
první
vaše
která
nás
nový
tipy
pokud
může
strana
jeho
své
jiné
zprávy
nové
není
vás
jen
podle
zde
už
být
více
bude
již
než
který
by
které
co
nebo
ten
tak
má
při
od
po
jsou
jak
další
ale
si
se
ve
to
jako
za
zpět
ze
do
pro
je
na
atd
atp
jakmile
přičemž
já
on
ona
ono
oni
ony
my
vy
jí
ji
mě
mne
jemu
tomu
těm
těmu
němu
němuž
jehož
jíž
jelikož
jež
jakož
načež
`)

func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
	rv := analysis.NewTokenMap()
	err := rv.LoadBytes(CzechStopWords)
	return rv, err
}

func init() {
	err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
	if err != nil {
		panic(err)
	}
}
59
analysis/lang/da/analyzer_da.go
Normal file
@@ -0,0 +1,59 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package da

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
	"github.com/blevesearch/bleve/v2/registry"
)

const AnalyzerName = "da"

func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
	unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	stopDaFilter, err := cache.TokenFilterNamed(StopName)
	if err != nil {
		return nil, err
	}
	stemmerDaFilter, err := cache.TokenFilterNamed(SnowballStemmerName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: unicodeTokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
			stopDaFilter,
			stemmerDaFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
71
analysis/lang/da/analyzer_da_test.go
Normal file
|
@ -0,0 +1,71 @@
|
|||
// Copyright (c) 2018 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package da
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestDanishAnalyzer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// stemming
|
||||
{
|
||||
input: []byte("undersøg"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("undersøg"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 9,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("undersøgelse"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("undersøg"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 13,
|
||||
},
|
||||
},
|
||||
},
|
||||
// stop word
|
||||
{
|
||||
input: []byte("på"),
|
||||
output: analysis.TokenStream{},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := analyzer.Analyze(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %v, got %v", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
52
analysis/lang/da/stemmer_da.go
Normal file
|
@ -0,0 +1,52 @@
|
|||
// Copyright (c) 2018 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package da
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
|
||||
"github.com/blevesearch/snowballstem"
|
||||
"github.com/blevesearch/snowballstem/danish"
|
||||
)
|
||||
|
||||
const SnowballStemmerName = "stemmer_da_snowball"
|
||||
|
||||
type DanishStemmerFilter struct {
|
||||
}
|
||||
|
||||
func NewDanishStemmerFilter() *DanishStemmerFilter {
|
||||
return &DanishStemmerFilter{}
|
||||
}
|
||||
|
||||
func (s *DanishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
env := snowballstem.NewEnv(string(token.Term))
|
||||
danish.Stem(env)
|
||||
token.Term = []byte(env.Current())
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func DanishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewDanishStemmerFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(SnowballStemmerName, DanishStemmerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
36
analysis/lang/da/stop_filter_da.go
Normal file
|
@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2018 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package da
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/stop"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
tokenMap, err := cache.TokenMapNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return stop.NewStopTokensFilter(tokenMap), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
137
analysis/lang/da/stop_words_da.go
Normal file
|
@ -0,0 +1,137 @@
|
|||
package da
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_da"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
|
||||
// ` was changed to ' to allow for literal string
|
||||
|
||||
var DanishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt
|
||||
| This file is distributed under the BSD License.
|
||||
| See http://snowball.tartarus.org/license.php
|
||||
| Also see http://www.opensource.org/licenses/bsd-license.html
|
||||
| - Encoding was converted to UTF-8.
|
||||
| - This notice was added.
|
||||
|
|
||||
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
|
||||
|
||||
| A Danish stop word list. Comments begin with vertical bar. Each stop
|
||||
| word is at the start of a line.
|
||||
|
||||
| This is a ranked list (commonest to rarest) of stopwords derived from
|
||||
| a large text sample.
|
||||
|
||||
|
||||
og | and
|
||||
i | in
|
||||
jeg | I
|
||||
det | that (dem. pronoun)/it (pers. pronoun)
|
||||
at | that (in front of a sentence)/to (with infinitive)
|
||||
en | a/an
|
||||
den | it (pers. pronoun)/that (dem. pronoun)
|
||||
til | to/at/for/until/against/by/of/into, more
|
||||
er | present tense of "to be"
|
||||
som | who, as
|
||||
på | on/upon/in/on/at/to/after/of/with/for, on
|
||||
de | they
|
||||
med | with/by/in, along
|
||||
han | he
|
||||
af | of/by/from/off/for/in/with/on, off
|
||||
for | at/for/to/from/by/of/ago, in front/before, because
|
||||
ikke | not
|
||||
der | who/which, there/those
|
||||
var | past tense of "to be"
|
||||
mig | me/myself
|
||||
sig | oneself/himself/herself/itself/themselves
|
||||
men | but
|
||||
et | a/an/one, one (number), someone/somebody/one
|
||||
har | present tense of "to have"
|
||||
om | round/about/for/in/a, about/around/down, if
|
||||
vi | we
|
||||
min | my
|
||||
havde | past tense of "to have"
|
||||
ham | him
|
||||
hun | she
|
||||
nu | now
|
||||
over | over/above/across/by/beyond/past/on/about, over/past
|
||||
da | then, when/as/since
|
||||
fra | from/off/since, off, since
|
||||
du | you
|
||||
ud | out
|
||||
sin | his/her/its/one's
|
||||
dem | them
|
||||
os | us/ourselves
|
||||
op | up
|
||||
man | you/one
|
||||
hans | his
|
||||
hvor | where
|
||||
eller | or
|
||||
hvad | what
|
||||
skal | must/shall etc.
|
||||
selv | myself/yourself/herself/ourselves etc., even
|
||||
her | here
|
||||
alle | all/everyone/everybody etc.
|
||||
vil | will (verb)
|
||||
blev | past tense of "to stay/to remain/to get/to become"
|
||||
kunne | could
|
||||
ind | in
|
||||
når | when
|
||||
være | present tense of "to be"
|
||||
dog | however/yet/after all
|
||||
noget | something
|
||||
ville | would
|
||||
jo | you know/you see (adv), yes
|
||||
deres | their/theirs
|
||||
efter | after/behind/according to/for/by/from, later/afterwards
|
||||
ned | down
|
||||
skulle | should
|
||||
denne | this
|
||||
end | than
|
||||
dette | this
|
||||
mit | my/mine
|
||||
også | also
|
||||
under | under/beneath/below/during, below/underneath
|
||||
have | have
|
||||
dig | you
|
||||
anden | other
|
||||
hende | her
|
||||
mine | my
|
||||
alt | everything
|
||||
meget | much/very, plenty of
|
||||
sit | his, her, its, one's
|
||||
sine | his, her, its, one's
|
||||
vor | our
|
||||
mod | against
|
||||
disse | these
|
||||
hvis | if
|
||||
din | your/yours
|
||||
nogle | some
|
||||
hos | by/at
|
||||
blive | be/become
|
||||
mange | many
|
||||
ad | by/through
|
||||
bliver | present tense of "to be/to become"
|
||||
hendes | her/hers
|
||||
været | be
|
||||
thi | for (conj)
|
||||
jer | you
|
||||
sådan | such, like this/like that
|
||||
`)
|
||||
|
||||
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(DanishStopWords)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
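The stop word list above uses the Snowball format described in its own header: each stop word sits at the start of a line, and everything after a vertical bar is a comment. The sketch below shows one way such a list could be parsed with the standard library only. `parseSnowballStopList` is a hypothetical function written for illustration; it is an assumption about the format, not the implementation behind `rv.LoadBytes`, which may handle further cases.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSnowballStopList is a hypothetical parser for the Snowball stop list
// format: text from the first '|' to the end of the line is a comment, and
// any whitespace-separated words before it are stop words.
func parseSnowballStopList(data string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		// Drop the comment part, if the line has one.
		if i := strings.IndexByte(line, '|'); i >= 0 {
			line = line[:i]
		}
		// Blank and comment-only lines yield no fields.
		for _, w := range strings.Fields(line) {
			words[w] = true
		}
	}
	return words
}

func main() {
	sample := ` | A Danish stop word list.

og           | and
i            | in
på           | on/upon/in/on/at/to/after/of/with/for, on
`
	words := parseSnowballStopList(sample)
	fmt.Println(len(words), words["på"], words["and"])
}
```

Because the English glosses live entirely in the comment part, they never end up in the resulting set.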
64
analysis/lang/de/analyzer_de.go
Normal file
|
@ -0,0 +1,64 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
|
||||
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const AnalyzerName = "de"
|
||||
|
||||
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
|
||||
unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
stopDeFilter, err := cache.TokenFilterNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
normalizeDeFilter, err := cache.TokenFilterNamed(NormalizeName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
lightStemmerDeFilter, err := cache.TokenFilterNamed(LightStemmerName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
rv := analysis.DefaultAnalyzer{
|
||||
Tokenizer: unicodeTokenizer,
|
||||
TokenFilters: []analysis.TokenFilter{
|
||||
toLowerFilter,
|
||||
stopDeFilter,
|
||||
normalizeDeFilter,
|
||||
lightStemmerDeFilter,
|
||||
},
|
||||
}
|
||||
return &rv, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
155
analysis/lang/de/analyzer_de_test.go
Normal file
|
@ -0,0 +1,155 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestGermanAnalyzer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
{
|
||||
input: []byte("Tisch"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("tisch"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 5,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Tische"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("tisch"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Tischen"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("tisch"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 7,
|
||||
},
|
||||
},
|
||||
},
|
||||
// german specials
|
||||
{
|
||||
input: []byte("Schaltflächen"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("schaltflach"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 14,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Schaltflaechen"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("schaltflach"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 14,
|
||||
},
|
||||
},
|
||||
},
|
||||
// tests added by marty to increase coverage
|
||||
{
|
||||
input: []byte("Blechern"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("blech"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Klecks"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("kleck"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 6,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Mindestens"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("mindest"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Kugelfest"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("kugelf"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 9,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("Baldigst"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("baldig"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := analyzer.Analyze(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %v, got %v", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
98
analysis/lang/de/german_normalize.go
Normal file
|
@ -0,0 +1,98 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const NormalizeName = "normalize_de"
|
||||
|
||||
const (
|
||||
N = 0 /* ordinary state */
|
||||
V = 1 /* stops 'u' from entering umlaut state */
|
||||
U = 2 /* umlaut state, allows e-deletion */
|
||||
)
|
||||
|
||||
type GermanNormalizeFilter struct {
|
||||
}
|
||||
|
||||
func NewGermanNormalizeFilter() *GermanNormalizeFilter {
|
||||
return &GermanNormalizeFilter{}
|
||||
}
|
||||
|
||||
func (s *GermanNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
term := normalize(token.Term)
|
||||
token.Term = term
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func normalize(input []byte) []byte {
|
||||
state := N
|
||||
runes := bytes.Runes(input)
|
||||
for i := 0; i < len(runes); i++ {
|
||||
switch runes[i] {
|
||||
case 'a', 'o':
|
||||
state = U
|
||||
case 'u':
|
||||
if state == N {
|
||||
state = U
|
||||
} else {
|
||||
state = V
|
||||
}
|
||||
case 'e':
|
||||
if state == U {
|
||||
runes = analysis.DeleteRune(runes, i)
|
||||
i--
|
||||
}
|
||||
state = V
|
||||
case 'i', 'q', 'y':
|
||||
state = V
|
||||
case 'ä':
|
||||
runes[i] = 'a'
|
||||
state = V
|
||||
case 'ö':
|
||||
runes[i] = 'o'
|
||||
state = V
|
||||
case 'ü':
|
||||
runes[i] = 'u'
|
||||
state = V
|
||||
case 'ß':
|
||||
runes[i] = 's'
|
||||
i++
|
||||
runes = analysis.InsertRune(runes, i, 's')
|
||||
state = N
|
||||
default:
|
||||
state = N
|
||||
}
|
||||
}
|
||||
return analysis.BuildTermFromRunes(runes)
|
||||
}
|
||||
|
||||
func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewGermanNormalizeFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
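The `normalize` function above is a three-state machine: `a`, `o`, and (from the ordinary state) `u` enter the umlaut state, in which a following `e` is deleted, while vowels, `q`, and already-folded umlauts enter a state that keeps `ue` intact. The sketch below re-creates that logic with the standard library only so it can be run in isolation. `normalizeSketch` and the `state*` names are hypothetical, and plain slice operations stand in for `analysis.DeleteRune` and `analysis.InsertRune`, on the assumption that those helpers delete and insert a single rune at the given index.

```go
package main

import "fmt"

const (
	stateN = iota // ordinary state
	stateV        // stops 'u' from entering the umlaut state
	stateU        // umlaut state, allows e-deletion
)

// normalizeSketch is a hypothetical stand-alone version of the filter logic.
func normalizeSketch(s string) string {
	runes := []rune(s)
	state := stateN
	for i := 0; i < len(runes); i++ {
		switch runes[i] {
		case 'a', 'o':
			state = stateU
		case 'u':
			if state == stateN {
				state = stateU
			} else {
				state = stateV
			}
		case 'e':
			if state == stateU {
				// Delete the 'e' of ae/oe/ue and revisit the same index.
				runes = append(runes[:i], runes[i+1:]...)
				i--
			}
			state = stateV
		case 'i', 'q', 'y':
			state = stateV
		case 'ä':
			runes[i] = 'a'
			state = stateV
		case 'ö':
			runes[i] = 'o'
			state = stateV
		case 'ü':
			runes[i] = 'u'
			state = stateV
		case 'ß':
			// Expand sharp s to "ss": overwrite, then insert a second 's'.
			runes[i] = 's'
			i++
			runes = append(runes[:i], append([]rune{'s'}, runes[i:]...)...)
			state = stateN
		default:
			state = stateN
		}
	}
	return string(runes)
}

func main() {
	for _, w := range []string{"Schaltflächen", "Schaltflaechen", "dauer", "weißbier"} {
		fmt.Printf("%s -> %s\n", w, normalizeSketch(w))
	}
}
```

The inputs in `main` are the ones used by the test file that follows, so the printed results can be compared against its expected terms.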
103
analysis/lang/de/german_normalize_test.go
Normal file
|
@ -0,0 +1,103 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
)
|
||||
|
||||
func TestGermanNormalizeFilter(t *testing.T) {
|
||||
tests := []struct {
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// Tests that a/o/u + e is equivalent to the umlaut form
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("Schaltflächen"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("Schaltflachen"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("Schaltflaechen"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("Schaltflachen"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Tests the specific heuristic that ue is not folded after a vowel or q.
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("dauer"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("dauer"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// Tests german specific folding of sharp-s
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("weißbier"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("weissbier"),
|
||||
},
|
||||
},
|
||||
},
|
||||
// empty
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
germanNormalizeFilter := NewGermanNormalizeFilter()
|
||||
for _, test := range tests {
|
||||
actual := germanNormalizeFilter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %#v, got %#v", test.output, actual)
|
||||
t.Errorf("expected %s(% x), got %s(% x)", test.output[0].Term, test.output[0].Term, actual[0].Term, actual[0].Term)
|
||||
}
|
||||
}
|
||||
}
|
119
analysis/lang/de/light_stemmer_de.go
Normal file
|
@ -0,0 +1,119 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const LightStemmerName = "stemmer_de_light"
|
||||
|
||||
type GermanLightStemmerFilter struct {
|
||||
}
|
||||
|
||||
func NewGermanLightStemmerFilter() *GermanLightStemmerFilter {
|
||||
return &GermanLightStemmerFilter{}
|
||||
}
|
||||
|
||||
func (s *GermanLightStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
runes := bytes.Runes(token.Term)
|
||||
runes = stem(runes)
|
||||
token.Term = analysis.BuildTermFromRunes(runes)
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func stem(input []rune) []rune {
|
||||
|
||||
for i, r := range input {
|
||||
switch r {
|
||||
case 'ä', 'à', 'á', 'â':
|
||||
input[i] = 'a'
|
||||
case 'ö', 'ò', 'ó', 'ô':
|
||||
input[i] = 'o'
|
||||
case 'ï', 'ì', 'í', 'î':
|
||||
input[i] = 'i'
|
||||
case 'ü', 'ù', 'ú', 'û':
|
||||
input[i] = 'u'
|
||||
}
|
||||
}
|
||||
|
||||
input = step1(input)
|
||||
return step2(input)
|
||||
}
|
||||
|
||||
func stEnding(ch rune) bool {
|
||||
switch ch {
|
||||
case 'b', 'd', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 't':
|
||||
return true
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func step1(s []rune) []rune {
|
||||
l := len(s)
|
||||
if l > 5 && s[l-3] == 'e' && s[l-2] == 'r' && s[l-1] == 'n' {
|
||||
return s[:l-3]
|
||||
}
|
||||
|
||||
if l > 4 && s[l-2] == 'e' {
|
||||
switch s[l-1] {
|
||||
case 'm', 'n', 'r', 's':
|
||||
return s[:l-2]
|
||||
}
|
||||
}
|
||||
|
||||
if l > 3 && s[l-1] == 'e' {
|
||||
return s[:l-1]
|
||||
}
|
||||
|
||||
if l > 3 && s[l-1] == 's' && stEnding(s[l-2]) {
|
||||
return s[:l-1]
|
||||
}
|
||||
|
||||
return s
|
||||
}
|
||||
|
||||
func step2(s []rune) []rune {
|
||||
l := len(s)
|
||||
if l > 5 && s[l-3] == 'e' && s[l-2] == 's' && s[l-1] == 't' {
|
||||
return s[:l-3]
|
||||
}
|
||||
|
||||
if l > 4 && s[l-2] == 'e' && (s[l-1] == 'r' || s[l-1] == 'n') {
|
||||
return s[:l-2]
|
||||
}
|
||||
|
||||
if l > 4 && s[l-2] == 's' && s[l-1] == 't' && stEnding(s[l-3]) {
|
||||
return s[:l-2]
|
||||
}
|
||||
|
||||
return s
|
||||
}
|
||||
|
||||
func GermanLightStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewGermanLightStemmerFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(LightStemmerName, GermanLightStemmerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
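The light stemmer above strips at most one suffix in `step1` and one in `step2`, with minimum-length guards on each rule. The sketch below restates the two steps with a suffix helper so the rule order is easier to follow. All names (`hasSuffix`, `isStEnding`, `step1Sketch`, `step2Sketch`, `lightStemSketch`) are hypothetical, and the accent-folding loop from `stem` is left out, so the sketch assumes lowercase input that has already been normalized.

```go
package main

import (
	"fmt"
	"strings"
)

// isStEnding reports whether r may precede a removable "s" or "st".
func isStEnding(r rune) bool {
	return strings.ContainsRune("bdfghklmnt", r)
}

// hasSuffix reports whether the rune slice s ends with suffix.
func hasSuffix(s []rune, suffix string) bool {
	suf := []rune(suffix)
	if len(s) < len(suf) {
		return false
	}
	for i, r := range suf {
		if s[len(s)-len(suf)+i] != r {
			return false
		}
	}
	return true
}

// step1Sketch applies the first matching rule: -ern, -em/-en/-er/-es, -e, -s.
func step1Sketch(s []rune) []rune {
	l := len(s)
	switch {
	case l > 5 && hasSuffix(s, "ern"):
		return s[:l-3]
	case l > 4 && (hasSuffix(s, "em") || hasSuffix(s, "en") ||
		hasSuffix(s, "er") || hasSuffix(s, "es")):
		return s[:l-2]
	case l > 3 && hasSuffix(s, "e"):
		return s[:l-1]
	case l > 3 && hasSuffix(s, "s") && isStEnding(s[l-2]):
		return s[:l-1]
	}
	return s
}

// step2Sketch applies the first matching rule: -est, -er/-en, -st.
func step2Sketch(s []rune) []rune {
	l := len(s)
	switch {
	case l > 5 && hasSuffix(s, "est"):
		return s[:l-3]
	case l > 4 && (hasSuffix(s, "er") || hasSuffix(s, "en")):
		return s[:l-2]
	case l > 4 && hasSuffix(s, "st") && isStEnding(s[l-3]):
		return s[:l-2]
	}
	return s
}

// lightStemSketch runs both steps on an already lowercased, normalized word.
func lightStemSketch(word string) string {
	return string(step2Sketch(step1Sketch([]rune(word))))
}

func main() {
	for _, w := range []string{"tischen", "mindestens", "kugelfest", "baldigst"} {
		fmt.Printf("%s -> %s\n", w, lightStemSketch(w))
	}
}
```

A word such as "mindestens" shows why there are two steps: the first removes the trailing "s", and only then does the second see the "en" ending.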
52
analysis/lang/de/stemmer_de_snowball.go
Normal file
|
@ -0,0 +1,52 @@
|
|||
// Copyright (c) 2020 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
|
||||
"github.com/blevesearch/snowballstem"
|
||||
"github.com/blevesearch/snowballstem/german"
|
||||
)
|
||||
|
||||
const SnowballStemmerName = "stemmer_de_snowball"
|
||||
|
||||
type GermanStemmerFilter struct {
|
||||
}
|
||||
|
||||
func NewGermanStemmerFilter() *GermanStemmerFilter {
|
||||
return &GermanStemmerFilter{}
|
||||
}
|
||||
|
||||
func (s *GermanStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
env := snowballstem.NewEnv(string(token.Term))
|
||||
german.Stem(env)
|
||||
token.Term = []byte(env.Current())
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func GermanStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewGermanStemmerFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(SnowballStemmerName, GermanStemmerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
91
analysis/lang/de/stemmer_de_test.go
Normal file
|
@ -0,0 +1,91 @@
|
|||
// Copyright (c) 2020 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestSnowballGermanStemmer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("abzuschrecken"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("abzuschreck"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("abzuwarten"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("abzuwart"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("zwirnfabrik"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("zwirnfabr"),
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("zyniker"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("zynik"),
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
filter, err := cache.TokenFilterNamed(SnowballStemmerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := filter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
|
||||
}
|
||||
}
|
||||
}
|
36
analysis/lang/de/stop_filter_de.go
Normal file
|
@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package de
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/stop"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
tokenMap, err := cache.TokenMapNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return stop.NewStopTokensFilter(tokenMap), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
321
analysis/lang/de/stop_words_de.go
Normal file
|
@ -0,0 +1,321 @@
|
|||
package de
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_de"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
|
||||
// ` was changed to ' to allow for literal string
|
||||
|
||||
var GermanStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt
|
||||
| This file is distributed under the BSD License.
|
||||
| See http://snowball.tartarus.org/license.php
|
||||
| Also see http://www.opensource.org/licenses/bsd-license.html
|
||||
| - Encoding was converted to UTF-8.
|
||||
| - This notice was added.
|
||||
|
|
||||
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
|
||||
|
||||
| A German stop word list. Comments begin with vertical bar. Each stop
|
||||
| word is at the start of a line.
|
||||
|
||||
| The number of forms in this list is reduced significantly by passing it
|
||||
| through the German stemmer.
|
||||
|
||||
|
||||
aber | but
|
||||
|
||||
alle | all
|
||||
allem
|
||||
allen
|
||||
aller
|
||||
alles
|
||||
|
||||
als | than, as
|
||||
also | so
|
||||
am | an + dem
|
||||
an | at
|
||||
|
||||
ander | other
|
||||
andere
|
||||
anderem
|
||||
anderen
|
||||
anderer
|
||||
anderes
|
||||
anderm
|
||||
andern
|
||||
anderr
|
||||
anders
|
||||
|
||||
auch | also
|
||||
auf | on
|
||||
aus | out of
|
||||
bei | by
|
||||
bin | am
|
||||
bis | until
|
||||
bist | art
|
||||
da | there
|
||||
damit | with it
|
||||
dann | then
|
||||
|
||||
der | the
|
||||
den
|
||||
des
|
||||
dem
|
||||
die
|
||||
das
|
||||
|
||||
daß | that
|
||||
|
||||
derselbe | the same
|
||||
derselben
|
||||
denselben
|
||||
desselben
|
||||
demselben
|
||||
dieselbe
|
||||
dieselben
|
||||
dasselbe
|
||||
|
||||
dazu | to that
|
||||
|
||||
dein | thy
|
||||
deine
|
||||
deinem
|
||||
deinen
|
||||
deiner
|
||||
deines
|
||||
|
||||
denn | because
|
||||
|
||||
derer | of those
|
||||
dessen | of him
|
||||
|
||||
dich | thee
|
||||
dir | to thee
|
||||
du | thou
|
||||
|
||||
dies | this
|
||||
diese
|
||||
diesem
|
||||
diesen
|
||||
dieser
|
||||
dieses
|
||||
|
||||
|
||||
doch | (several meanings)
|
||||
dort | (over) there
|
||||
|
||||
|
||||
durch | through
|
||||
|
||||
ein | a
|
||||
eine
|
||||
einem
|
||||
einen
|
||||
einer
|
||||
eines
|
||||
|
||||
einig | some
|
||||
einige
|
||||
einigem
|
||||
einigen
|
||||
einiger
|
||||
einiges
|
||||
|
||||
einmal | once
|
||||
|
||||
er | he
|
||||
ihn | him
|
||||
ihm | to him
|
||||
|
||||
es | it
|
||||
etwas | something
|
||||
|
||||
euer | your
|
||||
eure
|
||||
eurem
|
||||
euren
|
||||
eurer
|
||||
eures
|
||||
|
||||
für | for
|
||||
gegen | towards
|
||||
gewesen | p.p. of sein
|
||||
hab | have
|
||||
habe | have
|
||||
haben | have
|
||||
hat | has
|
||||
hatte | had
|
||||
hatten | had
|
||||
hier | here
|
||||
hin | there
|
||||
hinter | behind
|
||||
|
||||
ich | I
|
||||
mich | me
|
||||
mir | to me
|
||||
|
||||
|
||||
ihr | you, to her
|
||||
ihre
|
||||
ihrem
|
||||
ihren
|
||||
ihrer
|
||||
ihres
|
||||
euch | to you
|
||||
|
||||
im | in + dem
|
||||
in | in
|
||||
indem | while
|
||||
ins | in + das
|
||||
ist | is
|
||||
|
||||
jede | each, every
|
||||
jedem
|
||||
jeden
|
||||
jeder
|
||||
jedes
|
||||
|
||||
jene | that
|
||||
jenem
|
||||
jenen
|
||||
jener
|
||||
jenes
|
||||
|
||||
jetzt | now
|
||||
kann | can
|
||||
|
||||
kein | no
|
||||
keine
|
||||
keinem
|
||||
keinen
|
||||
keiner
|
||||
keines
|
||||
|
||||
können | can
|
||||
könnte | could
|
||||
machen | do
|
||||
man | one
|
||||
|
||||
manche | some, many a
|
||||
manchem
|
||||
manchen
|
||||
mancher
|
||||
manches
|
||||
|
||||
mein | my
|
||||
meine
|
||||
meinem
|
||||
meinen
|
||||
meiner
|
||||
meines
|
||||
|
||||
mit | with
|
||||
muss | must
|
||||
musste | had to
|
||||
nach | to(wards)
|
||||
nicht | not
|
||||
nichts | nothing
|
||||
noch | still, yet
|
||||
nun | now
|
||||
nur | only
|
||||
ob | whether
|
||||
oder | or
|
||||
ohne | without
|
||||
sehr | very
|
||||
|
||||
sein | his
|
||||
seine
|
||||
seinem
|
||||
seinen
|
||||
seiner
|
||||
seines
|
||||
|
||||
selbst | self
|
||||
sich | herself
|
||||
|
||||
sie | they, she
|
||||
ihnen | to them
|
||||
|
||||
sind | are
|
||||
so | so
|
||||
|
||||
solche | such
|
||||
solchem
|
||||
solchen
|
||||
solcher
|
||||
solches
|
||||
|
||||
soll | shall
|
||||
sollte | should
|
||||
sondern | but
|
||||
sonst | else
|
||||
über | over
|
||||
um | about, around
|
||||
und | and
|
||||
|
||||
uns | us
|
||||
unse
|
||||
unsem
|
||||
unsen
|
||||
unser
|
||||
unses
|
||||
|
||||
unter | under
|
||||
viel | much
|
||||
vom | von + dem
|
||||
von | from
|
||||
vor | before
|
||||
während | while
|
||||
war | was
|
||||
waren | were
|
||||
warst | wast
|
||||
was | what
|
||||
weg | away, off
|
||||
weil | because
|
||||
weiter | further
|
||||
|
||||
welche | which
|
||||
welchem
|
||||
welchen
|
||||
welcher
|
||||
welches
|
||||
|
||||
wenn | when
|
||||
werde | will
|
||||
werden | will
|
||||
wie | how
|
||||
wieder | again
|
||||
will | want
|
||||
wir | we
|
||||
wird | will
|
||||
wirst | willst
|
||||
wo | where
|
||||
wollen | want
|
||||
wollte | wanted
|
||||
würde | would
|
||||
würden | would
|
||||
zu | to
|
||||
zum | zu + dem
|
||||
zur | zu + der
|
||||
zwar | indeed
|
||||
zwischen | between
|
||||
|
||||
`)
|
||||
|
||||
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(GermanStopWords)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
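The header of the stop list above describes its format: comments begin with a vertical bar and each stop word sits at the start of a line. A minimal standalone sketch of reading that format follows. The helper name parseSnowballStopList is hypothetical, and this is not claimed to be how analysis.TokenMap.LoadBytes is implemented, since that code is not part of this file.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSnowballStopList is a hypothetical helper, not part of bleve.
// It keeps the first whitespace-delimited word of each line and
// ignores everything from the first '|' onwards.
func parseSnowballStopList(data string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		// Drop the trailing comment, if there is one.
		if i := strings.Index(line, "|"); i >= 0 {
			line = line[:i]
		}
		// The stop word, if any, is the first field left on the line.
		if fields := strings.Fields(line); len(fields) > 0 {
			words[fields[0]] = true
		}
	}
	return words
}

func main() {
	sample := " | A German stop word list.\n\naber           |  but\nalle           |  all\nallem\n"
	stops := parseSnowballStopList(sample)
	// Prints: 3 true true false
	fmt.Println(len(stops), stops["aber"], stops["allem"], stops["but"])
}
```

Note that the English glosses after the bar ("but", "all") never enter the map, which is why the comment delimiter matters for this list.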
36
analysis/lang/el/stop_filter_el.go
Normal file
|
@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package el
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/stop"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
tokenMap, err := cache.TokenMapNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return stop.NewStopTokensFilter(tokenMap), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
105
analysis/lang/el/stop_words_el.go
Normal file
|
@ -0,0 +1,105 @@
|
|||
package el
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_el"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/
|
||||
// ` was changed to ' to allow for literal string
|
||||
|
||||
var GreekStopWords = []byte(`# Lucene Greek Stopwords list
|
||||
# Note: by default this file is used after GreekLowerCaseFilter,
|
||||
# so when modifying this file use 'σ' instead of 'ς'
|
||||
ο
|
||||
η
|
||||
το
|
||||
οι
|
||||
τα
|
||||
του
|
||||
τησ
|
||||
των
|
||||
τον
|
||||
την
|
||||
και
|
||||
κι
|
||||
κ
|
||||
ειμαι
|
||||
εισαι
|
||||
ειναι
|
||||
ειμαστε
|
||||
ειστε
|
||||
στο
|
||||
στον
|
||||
στη
|
||||
στην
|
||||
μα
|
||||
αλλα
|
||||
απο
|
||||
για
|
||||
προσ
|
||||
με
|
||||
σε
|
||||
ωσ
|
||||
παρα
|
||||
αντι
|
||||
κατα
|
||||
μετα
|
||||
θα
|
||||
να
|
||||
δε
|
||||
δεν
|
||||
μη
|
||||
μην
|
||||
επι
|
||||
ενω
|
||||
εαν
|
||||
αν
|
||||
τοτε
|
||||
που
|
||||
πωσ
|
||||
ποιοσ
|
||||
ποια
|
||||
ποιο
|
||||
ποιοι
|
||||
ποιεσ
|
||||
ποιων
|
||||
ποιουσ
|
||||
αυτοσ
|
||||
αυτη
|
||||
αυτο
|
||||
αυτοι
|
||||
αυτων
|
||||
αυτουσ
|
||||
αυτεσ
|
||||
αυτα
|
||||
εκεινοσ
|
||||
εκεινη
|
||||
εκεινο
|
||||
εκεινοι
|
||||
εκεινεσ
|
||||
εκεινα
|
||||
εκεινων
|
||||
εκεινουσ
|
||||
οπωσ
|
||||
ομωσ
|
||||
ισωσ
|
||||
οσο
|
||||
οτι
|
||||
`)
|
||||
|
||||
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(GreekStopWords)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
73
analysis/lang/en/analyzer_en.go
Normal file
|
@ -0,0 +1,73 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
// Package en implements an analyzer with reasonable defaults for processing
|
||||
// English text.
|
||||
//
|
||||
// It strips possessive suffixes ('s), transforms tokens to lower case,
|
||||
// removes stopwords from a built-in list, and applies porter stemming.
|
||||
//
|
||||
// The built-in stopwords list is defined in EnglishStopWords.
|
||||
package en
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
|
||||
"github.com/blevesearch/bleve/v2/analysis/token/porter"
|
||||
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
|
||||
)
|
||||
|
||||
const AnalyzerName = "en"
|
||||
|
||||
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
|
||||
tokenizer, err := cache.TokenizerNamed(unicode.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
possEnFilter, err := cache.TokenFilterNamed(PossessiveName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
stopEnFilter, err := cache.TokenFilterNamed(StopName)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
stemmerEnFilter, err := cache.TokenFilterNamed(porter.Name)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
rv := analysis.DefaultAnalyzer{
|
||||
Tokenizer: tokenizer,
|
||||
TokenFilters: []analysis.TokenFilter{
|
||||
possEnFilter,
|
||||
toLowerFilter,
|
||||
stopEnFilter,
|
||||
stemmerEnFilter,
|
||||
},
|
||||
}
|
||||
return &rv, nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
105
analysis/lang/en/analyzer_en_test.go
Normal file
|
@ -0,0 +1,105 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package en
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestEnglishAnalyzer(t *testing.T) {
|
||||
tests := []struct {
|
||||
input []byte
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
// stemming
|
||||
{
|
||||
input: []byte("books"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("book"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 5,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("book"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("book"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 4,
|
||||
},
|
||||
},
|
||||
},
|
||||
// stop word removal
|
||||
{
|
||||
input: []byte("the"),
|
||||
output: analysis.TokenStream{},
|
||||
},
|
||||
// possessive removal
|
||||
{
|
||||
input: []byte("steven's"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("steven"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 8,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("steven\u2019s"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("steven"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
input: []byte("steven\uFF07s"),
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("steven"),
|
||||
Position: 1,
|
||||
Start: 0,
|
||||
End: 10,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
analyzer, err := cache.AnalyzerNamed(AnalyzerName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := analyzer.Analyze(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %v, got %v", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
177
analysis/lang/en/plural_stemmer.go
Normal file
|
@ -0,0 +1,177 @@
|
|||
/*
|
||||
This code was ported from the OpenSearch project
|
||||
https://github.com/opensearch-project/OpenSearch/blob/main/modules/analysis-common/src/main/java/org/opensearch/analysis/common/EnglishPluralStemFilter.java
|
||||
The algorithm itself was created by Mark Harwood
|
||||
https://github.com/markharwood
|
||||
*/
|
||||
|
||||
/*
|
||||
* SPDX-License-Identifier: Apache-2.0
|
||||
*
|
||||
* The OpenSearch Contributors require contributions made to
|
||||
* this file be licensed under the Apache-2.0 license or a
|
||||
* compatible open source license.
|
||||
*/
|
||||
|
||||
/*
|
||||
* Licensed to Elasticsearch under one or more contributor
|
||||
* license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright
|
||||
* ownership. Elasticsearch licenses this file to you under
|
||||
* the Apache License, Version 2.0 (the "License"); you may
|
||||
* not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing,
|
||||
* software distributed under the License is distributed on an
|
||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
* KIND, either express or implied. See the License for the
|
||||
* specific language governing permissions and limitations
|
||||
* under the License.
|
||||
*/
|
||||
|
||||
package en
|
||||
|
||||
import (
|
||||
"strings"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const PluralStemmerName = "stemmer_en_plural"
|
||||
|
||||
type EnglishPluralStemmerFilter struct {
|
||||
}
|
||||
|
||||
func NewEnglishPluralStemmerFilter() *EnglishPluralStemmerFilter {
|
||||
return &EnglishPluralStemmerFilter{}
|
||||
}
|
||||
|
||||
func (s *EnglishPluralStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
token.Term = []byte(stem(string(token.Term)))
|
||||
}
|
||||
|
||||
return input
|
||||
}
|
||||
|
||||
func EnglishPluralStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewEnglishPluralStemmerFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(PluralStemmerName, EnglishPluralStemmerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
||||
|
||||
// ----------------------------------------------------------------------------
|
||||
|
||||
// Words ending in oes that retain the e when stemmed
|
||||
var oesExceptions = []string{"shoes", "canoes", "oboes"}
|
||||
|
||||
// Words ending in ches that retain the e when stemmed
|
||||
var chesExceptions = []string{
|
||||
"cliches",
|
||||
"avalanches",
|
||||
"mustaches",
|
||||
"moustaches",
|
||||
"quiches",
|
||||
"headaches",
|
||||
"heartaches",
|
||||
"porsches",
|
||||
"tranches",
|
||||
"caches",
|
||||
}
|
||||
|
||||
func stem(word string) string {
|
||||
runes := []rune(strings.ToLower(word))
|
||||
|
||||
if len(runes) < 3 || runes[len(runes)-1] != 's' {
|
||||
return string(runes)
|
||||
}
|
||||
|
||||
switch runes[len(runes)-2] {
|
||||
case 'u':
|
||||
fallthrough
|
||||
case 's':
|
||||
return string(runes)
|
||||
case 'e':
|
||||
// Modified ies->y logic from original s-stemmer - only work on strings > 4
|
||||
// so spies -> spy still but pies->pie.
|
||||
// The original code also special-cased aies and eies for no good reason as far as I can tell.
|
||||
// ( no words of consequence - eg http://www.thefreedictionary.com/words-that-end-in-aies )
|
||||
if len(runes) > 4 && runes[len(runes)-3] == 'i' {
|
||||
runes[len(runes)-3] = 'y'
|
||||
return string(runes[0 : len(runes)-2])
|
||||
}
|
||||
|
||||
// Suffix rules to remove any dangling "e"
|
||||
if len(runes) > 3 {
|
||||
// xes (but >1 prefix so we can stem "boxes->box" but keep "axes->axe")
|
||||
if len(runes) > 4 && runes[len(runes)-3] == 'x' {
|
||||
return string(runes[0 : len(runes)-2])
|
||||
}
|
||||
|
||||
// oes
|
||||
if len(runes) > 3 && runes[len(runes)-3] == 'o' {
|
||||
if isException(runes, oesExceptions) {
|
||||
// Only remove the S
|
||||
return string(runes[0 : len(runes)-1])
|
||||
}
|
||||
// Remove the es
|
||||
return string(runes[0 : len(runes)-2])
|
||||
}
|
||||
|
||||
if len(runes) > 4 {
|
||||
// shes/sses
|
||||
if runes[len(runes)-4] == 's' && (runes[len(runes)-3] == 'h' || runes[len(runes)-3] == 's') {
|
||||
return string(runes[0 : len(runes)-2])
|
||||
}
|
||||
|
||||
// ches
|
||||
if len(runes) > 4 {
|
||||
if runes[len(runes)-4] == 'c' && runes[len(runes)-3] == 'h' {
|
||||
if isException(runes, chesExceptions) {
|
||||
// Only remove the S
|
||||
return string(runes[0 : len(runes)-1])
|
||||
}
|
||||
// Remove the es
|
||||
return string(runes[0 : len(runes)-2])
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
fallthrough
|
||||
default:
|
||||
return string(runes[0 : len(runes)-1])
|
||||
}
|
||||
}
|
||||
|
||||
func isException(word []rune, exceptions []string) bool {
|
||||
for _, exception := range exceptions {
|
||||
|
||||
exceptionRunes := []rune(exception)
|
||||
|
||||
exceptionPos := len(exceptionRunes) - 1
|
||||
wordPos := len(word) - 1
|
||||
|
||||
matched := true
|
||||
for exceptionPos >= 0 && wordPos >= 0 {
|
||||
if exceptionRunes[exceptionPos] != word[wordPos] {
|
||||
matched = false
|
||||
break
|
||||
}
|
||||
exceptionPos--
|
||||
wordPos--
|
||||
}
|
||||
if matched {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
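The isException loop above compares the word and each exception rune by rune from the end. Read as written, it reports a match when an exception is a suffix of the word, and also when the word runs out first while every compared rune matched. A simplified sketch of the first case only, using strings.HasSuffix, is shown here. The helper name hasExceptionSuffix is hypothetical, the exception list is shortened, and the second case is deliberately not reproduced, so this is not a drop-in replacement.

```go
package main

import (
	"fmt"
	"strings"
)

// Shortened copy of the "ches" exception list, for illustration only.
var chesExceptions = []string{"cliches", "headaches", "caches"}

// hasExceptionSuffix is a hypothetical helper. It reports whether word
// ends with any of the listed exceptions.
func hasExceptionSuffix(word string, exceptions []string) bool {
	for _, exception := range exceptions {
		if strings.HasSuffix(word, exception) {
			return true
		}
	}
	return false
}

func main() {
	for _, w := range []string{"cliches", "headaches", "snitches"} {
		// A word matching an exception keeps its "e" (cliches -> cliche);
		// any other "ches" word loses "es" (snitches -> snitch).
		fmt.Println(w, hasExceptionSuffix(w, chesExceptions))
	}
}
```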
46
analysis/lang/en/plural_stemmer_test.go
Normal file
|
@ -0,0 +1,46 @@
|
|||
package en
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestEnglishPluralStemmer(t *testing.T) {
|
||||
data := []struct {
|
||||
In, Out string
|
||||
}{
|
||||
{"dresses", "dress"},
|
||||
{"dress", "dress"},
|
||||
{"axes", "axe"},
|
||||
{"ad", "ad"},
|
||||
{"ads", "ad"},
|
||||
{"gas", "ga"},
|
||||
{"sass", "sass"},
|
||||
{"berries", "berry"},
|
||||
{"dresses", "dress"},
|
||||
{"spies", "spy"},
|
||||
{"shoes", "shoe"},
|
||||
{"headaches", "headache"},
|
||||
{"computer", "computer"},
|
||||
{"dressing", "dressing"},
|
||||
{"clothes", "clothe"},
|
||||
{"DRESSES", "dress"},
|
||||
{"frog", "frog"},
|
||||
{"dress", "dress"},
|
||||
{"runs", "run"},
|
||||
{"pies", "pie"},
|
||||
{"foxes", "fox"},
|
||||
{"axes", "axe"},
|
||||
{"foes", "fo"},
|
||||
{"dishes", "dish"},
|
||||
{"snitches", "snitch"},
|
||||
{"cliches", "cliche"},
|
||||
{"forests", "forest"},
|
||||
{"yes", "ye"},
|
||||
}
|
||||
|
||||
for _, datum := range data {
|
||||
stemmed := stem(datum.In)
|
||||
|
||||
if stemmed != datum.Out {
|
||||
t.Errorf("expected %v but got %v", datum.Out, stemmed)
|
||||
}
|
||||
}
|
||||
}
|
70
analysis/lang/en/possessive_filter_en.go
Normal file
|
@ -0,0 +1,70 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package en
|
||||
|
||||
import (
|
||||
"unicode/utf8"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
// PossessiveName is the name PossessiveFilter is registered as
|
||||
// in the bleve registry.
|
||||
const PossessiveName = "possessive_en"
|
||||
|
||||
const rightSingleQuotationMark = '’'
|
||||
const apostrophe = '\''
|
||||
const fullWidthApostrophe = '＇'
|
||||
|
||||
const apostropheChars = rightSingleQuotationMark + apostrophe + fullWidthApostrophe
|
||||
|
||||
// PossessiveFilter implements a TokenFilter which
|
||||
// strips the English possessive suffix ('s) from tokens.
|
||||
// It handles a variety of apostrophe types, is case-insensitive
|
||||
// and doesn't distinguish between possessive and contraction.
|
||||
// (i.e. "She's So Rad" becomes "She So Rad")
|
||||
type PossessiveFilter struct {
|
||||
}
|
||||
|
||||
func NewPossessiveFilter() *PossessiveFilter {
|
||||
return &PossessiveFilter{}
|
||||
}
|
||||
|
||||
func (s *PossessiveFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
lastRune, lastRuneSize := utf8.DecodeLastRune(token.Term)
|
||||
if lastRune == 's' || lastRune == 'S' {
|
||||
nextLastRune, nextLastRuneSize := utf8.DecodeLastRune(token.Term[:len(token.Term)-lastRuneSize])
|
||||
if nextLastRune == rightSingleQuotationMark ||
|
||||
nextLastRune == apostrophe ||
|
||||
nextLastRune == fullWidthApostrophe {
|
||||
token.Term = token.Term[:len(token.Term)-lastRuneSize-nextLastRuneSize]
|
||||
}
|
||||
}
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func PossessiveFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewPossessiveFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(PossessiveName, PossessiveFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
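The suffix check in PossessiveFilter.Filter above works on bytes but has to step back over multi-byte apostrophes, which is why it uses utf8.DecodeLastRune twice. The following standalone sketch applies the same steps to a single term. The function names stripPossessive and isApostrophe are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isApostrophe reports whether r is one of the three apostrophe forms
// the filter recognizes: ASCII, right single quotation mark, fullwidth.
func isApostrophe(r rune) bool {
	return r == '\'' || r == '\u2019' || r == '\uFF07'
}

// stripPossessive is a hypothetical single-term version of the filter
// logic. It removes a trailing apostrophe plus s/S and returns other
// terms unchanged.
func stripPossessive(term []byte) []byte {
	last, lastSize := utf8.DecodeLastRune(term)
	if last != 's' && last != 'S' {
		return term
	}
	// Decode the rune just before the final s; it may be 1 or 3 bytes wide.
	prev, prevSize := utf8.DecodeLastRune(term[:len(term)-lastSize])
	if isApostrophe(prev) {
		return term[:len(term)-lastSize-prevSize]
	}
	return term
}

func main() {
	for _, in := range []string{"marty's", "MARTY\u2019S", "marty\uFF07s", "s", "'s"} {
		fmt.Printf("%q -> %q\n", in, stripPossessive([]byte(in)))
	}
}
```

Slicing by the decoded rune sizes rather than by a fixed two bytes is what keeps the remaining term valid UTF-8 for the non-ASCII apostrophes.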
142
analysis/lang/en/possessive_filter_en_test.go
Normal file
|
@ -0,0 +1,142 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package en
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
func TestEnglishPossessiveFilter(t *testing.T) {
|
||||
tests := []struct {
|
||||
input analysis.TokenStream
|
||||
output analysis.TokenStream
|
||||
}{
|
||||
{
|
||||
input: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("marty's"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY'S"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("marty’s"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY’S"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("marty＇s"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY＇S"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("m"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("s"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("'s"),
|
||||
},
|
||||
},
|
||||
output: analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("marty"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("marty"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("marty"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("m"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("s"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte(""),
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
stemmerFilter, err := cache.TokenFilterNamed(PossessiveName)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
for _, test := range tests {
|
||||
actual := stemmerFilter.Filter(test.input)
|
||||
if !reflect.DeepEqual(actual, test.output) {
|
||||
t.Errorf("expected %s, got %s", test.output, actual)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func BenchmarkEnglishPossessiveFilter(b *testing.B) {
|
||||
|
||||
input := analysis.TokenStream{
|
||||
&analysis.Token{
|
||||
Term: []byte("marty's"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY'S"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("marty’s"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY’S"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("marty＇s"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("MARTY＇S"),
|
||||
},
|
||||
&analysis.Token{
|
||||
Term: []byte("m"),
|
||||
},
|
||||
}
|
||||
|
||||
cache := registry.NewCache()
|
||||
stemmerFilter, err := cache.TokenFilterNamed(PossessiveName)
|
||||
if err != nil {
|
||||
b.Fatal(err)
|
||||
}
|
||||
b.ResetTimer()
|
||||
|
||||
for i := 0; i < b.N; i++ {
|
||||
stemmerFilter.Filter(input)
|
||||
}
|
||||
|
||||
}
|
52
analysis/lang/en/stemmer_en_snowball.go
Normal file
|
@ -0,0 +1,52 @@
|
|||
// Copyright (c) 2020 Couchbase, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
package en
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
|
||||
"github.com/blevesearch/snowballstem"
|
||||
"github.com/blevesearch/snowballstem/english"
|
||||
)
|
||||
|
||||
const SnowballStemmerName = "stemmer_en_snowball"
|
||||
|
||||
type EnglishStemmerFilter struct {
|
||||
}
|
||||
|
||||
func NewEnglishStemmerFilter() *EnglishStemmerFilter {
|
||||
return &EnglishStemmerFilter{}
|
||||
}
|
||||
|
||||
func (s *EnglishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
|
||||
for _, token := range input {
|
||||
env := snowballstem.NewEnv(string(token.Term))
|
||||
english.Stem(env)
|
||||
token.Term = []byte(env.Current())
|
||||
}
|
||||
return input
|
||||
}
|
||||
|
||||
func EnglishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
|
||||
return NewEnglishStemmerFilter(), nil
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenFilter(SnowballStemmerName, EnglishStemmerFilterConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
79
analysis/lang/en/stemmer_en_test.go
Normal file
|
@ -0,0 +1,79 @@
|
|||
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package en

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

func TestSnowballEnglishStemmer(t *testing.T) {
	tests := []struct {
		input  analysis.TokenStream
		output analysis.TokenStream
	}{
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("enjoy"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("enjoy"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("enjoyed"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("enjoy"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("enjoyable"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("enjoy"),
				},
			},
		},
	}

	cache := registry.NewCache()
	filter, err := cache.TokenFilterNamed(SnowballStemmerName)
	if err != nil {
		t.Fatal(err)
	}
	for _, test := range tests {
		actual := filter.Filter(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
		}
	}
}
|
36
analysis/lang/en/stop_filter_en.go
Normal file
|
@@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package en

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/registry"
)

func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	tokenMap, err := cache.TokenMapNamed(StopName)
	if err != nil {
		return nil, err
	}
	return stop.NewStopTokensFilter(tokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
347
analysis/lang/en/stop_words_en.go
Normal file
|
@@ -0,0 +1,347 @@
|
|||
package en
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_en"
|
||||
|
||||
// EnglishStopWords is the built-in list of stopwords used by the "stop_en" TokenFilter.
|
||||
//
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
|
||||
// ` was changed to ' to allow for literal string
|
||||
var EnglishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt
|
||||
| This file is distributed under the BSD License.
|
||||
| See http://snowball.tartarus.org/license.php
|
||||
| Also see http://www.opensource.org/licenses/bsd-license.html
|
||||
| - Encoding was converted to UTF-8.
|
||||
| - This notice was added.
|
||||
|
|
||||
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
|
||||
|
||||
| An English stop word list. Comments begin with vertical bar. Each stop
|
||||
| word is at the start of a line.
|
||||
|
||||
| Many of the forms below are quite rare (e.g. "yourselves") but included for
|
||||
| completeness.
|
||||
|
||||
| PRONOUNS FORMS
|
||||
| 1st person sing
|
||||
|
||||
i | subject, always in upper case of course
|
||||
|
||||
me | object
|
||||
my | possessive adjective
|
||||
| the possessive pronoun 'mine' is best suppressed, because of the
|
||||
| sense of coal-mine etc.
|
||||
myself | reflexive
|
||||
| 1st person plural
|
||||
we | subject
|
||||
|
||||
| us | object
|
||||
| care is required here because US = United States. It is usually
|
||||
| safe to remove it if it is in lower case.
|
||||
our | possessive adjective
|
||||
ours | possessive pronoun
|
||||
ourselves | reflexive
|
||||
| second person (archaic 'thou' forms not included)
|
||||
you | subject and object
|
||||
your | possessive adjective
|
||||
yours | possessive pronoun
|
||||
yourself | reflexive (singular)
|
||||
yourselves | reflexive (plural)
|
||||
| third person singular
|
||||
he | subject
|
||||
him | object
|
||||
his | possessive adjective and pronoun
|
||||
himself | reflexive
|
||||
|
||||
she | subject
|
||||
her | object and possessive adjective
|
||||
hers | possessive pronoun
|
||||
herself | reflexive
|
||||
|
||||
it | subject and object
|
||||
its | possessive adjective
|
||||
itself | reflexive
|
||||
| third person plural
|
||||
they | subject
|
||||
them | object
|
||||
their | possessive adjective
|
||||
theirs | possessive pronoun
|
||||
themselves | reflexive
|
||||
| other forms (demonstratives, interrogatives)
|
||||
what
|
||||
which
|
||||
who
|
||||
whom
|
||||
this
|
||||
that
|
||||
these
|
||||
those
|
||||
|
||||
| VERB FORMS (using F.R. Palmer's nomenclature)
|
||||
| BE
|
||||
am | 1st person, present
|
||||
is | -s form (3rd person, present)
|
||||
are | present
|
||||
was | 1st person, past
|
||||
were | past
|
||||
be | infinitive
|
||||
been | past participle
|
||||
being | -ing form
|
||||
| HAVE
|
||||
have | simple
|
||||
has | -s form
|
||||
had | past
|
||||
having | -ing form
|
||||
| DO
|
||||
do | simple
|
||||
does | -s form
|
||||
did | past
|
||||
doing | -ing form
|
||||
|
||||
| The forms below are, I believe, best omitted, because of the significant
|
||||
| homonym forms:
|
||||
|
||||
| He made a WILL
|
||||
| old tin CAN
|
||||
| merry month of MAY
|
||||
| a smell of MUST
|
||||
| fight the good fight with all thy MIGHT
|
||||
|
||||
| would, could, should, ought might however be included
|
||||
|
||||
| | AUXILIARIES
|
||||
| | WILL
|
||||
|will
|
||||
|
||||
would
|
||||
|
||||
| | SHALL
|
||||
|shall
|
||||
|
||||
should
|
||||
|
||||
| | CAN
|
||||
|can
|
||||
|
||||
could
|
||||
|
||||
| | MAY
|
||||
|may
|
||||
|might
|
||||
| | MUST
|
||||
|must
|
||||
| | OUGHT
|
||||
|
||||
ought
|
||||
|
||||
| COMPOUND FORMS, increasingly encountered nowadays in 'formal' writing
|
||||
| pronoun + verb
|
||||
|
||||
i'm
|
||||
you're
|
||||
he's
|
||||
she's
|
||||
it's
|
||||
we're
|
||||
they're
|
||||
i've
|
||||
you've
|
||||
we've
|
||||
they've
|
||||
i'd
|
||||
you'd
|
||||
he'd
|
||||
she'd
|
||||
we'd
|
||||
they'd
|
||||
i'll
|
||||
you'll
|
||||
he'll
|
||||
she'll
|
||||
we'll
|
||||
they'll
|
||||
|
||||
| verb + negation
|
||||
|
||||
isn't
|
||||
aren't
|
||||
wasn't
|
||||
weren't
|
||||
hasn't
|
||||
haven't
|
||||
hadn't
|
||||
doesn't
|
||||
don't
|
||||
didn't
|
||||
|
||||
| auxiliary + negation
|
||||
|
||||
won't
|
||||
wouldn't
|
||||
shan't
|
||||
shouldn't
|
||||
can't
|
||||
cannot
|
||||
couldn't
|
||||
mustn't
|
||||
|
||||
| miscellaneous forms
|
||||
|
||||
let's
|
||||
that's
|
||||
who's
|
||||
what's
|
||||
here's
|
||||
there's
|
||||
when's
|
||||
where's
|
||||
why's
|
||||
how's
|
||||
|
||||
| rarer forms
|
||||
|
||||
| daren't needn't
|
||||
|
||||
| doubtful forms
|
||||
|
||||
| oughtn't mightn't
|
||||
|
||||
| ARTICLES
|
||||
a
|
||||
an
|
||||
the
|
||||
|
||||
| THE REST (Overlap among prepositions, conjunctions, adverbs etc is so
|
||||
| high, that classification is pointless.)
|
||||
and
|
||||
but
|
||||
if
|
||||
or
|
||||
because
|
||||
as
|
||||
until
|
||||
while
|
||||
|
||||
of
|
||||
at
|
||||
by
|
||||
for
|
||||
with
|
||||
about
|
||||
against
|
||||
between
|
||||
into
|
||||
through
|
||||
during
|
||||
before
|
||||
after
|
||||
above
|
||||
below
|
||||
to
|
||||
from
|
||||
up
|
||||
down
|
||||
in
|
||||
out
|
||||
on
|
||||
off
|
||||
over
|
||||
under
|
||||
|
||||
again
|
||||
further
|
||||
then
|
||||
once
|
||||
|
||||
here
|
||||
there
|
||||
when
|
||||
where
|
||||
why
|
||||
how
|
||||
|
||||
all
|
||||
any
|
||||
both
|
||||
each
|
||||
few
|
||||
more
|
||||
most
|
||||
other
|
||||
some
|
||||
such
|
||||
|
||||
no
|
||||
nor
|
||||
not
|
||||
only
|
||||
own
|
||||
same
|
||||
so
|
||||
than
|
||||
too
|
||||
very
|
||||
|
||||
| Just for the record, the following words are among the commonest in English
|
||||
|
||||
| one
|
||||
| every
|
||||
| least
|
||||
| less
|
||||
| many
|
||||
| now
|
||||
| ever
|
||||
| never
|
||||
| say
|
||||
| says
|
||||
| said
|
||||
| also
|
||||
| get
|
||||
| go
|
||||
| goes
|
||||
| just
|
||||
| made
|
||||
| make
|
||||
| put
|
||||
| see
|
||||
| seen
|
||||
| whether
|
||||
| like
|
||||
| well
|
||||
| back
|
||||
| even
|
||||
| still
|
||||
| way
|
||||
| take
|
||||
| since
|
||||
| another
|
||||
| however
|
||||
| two
|
||||
| three
|
||||
| four
|
||||
| five
|
||||
| first
|
||||
| second
|
||||
| new
|
||||
| old
|
||||
| high
|
||||
| long
|
||||
`)
|
||||
|
||||
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
|
||||
rv := analysis.NewTokenMap()
|
||||
err := rv.LoadBytes(EnglishStopWords)
|
||||
return rv, err
|
||||
}
|
||||
|
||||
func init() {
|
||||
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
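The header of the list above describes the Snowball stop-word format: comments begin with a vertical bar, and each stop word sits at the start of a line. A minimal standalone sketch of a parser for that format is shown here. `parseSnowballStopWords` is a hypothetical helper written for illustration only; it is not bleve's `TokenMap.LoadBytes`, whose actual behavior may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSnowballStopWords is a hypothetical helper (not part of bleve).
// For each line it discards everything from the first '|' onward and
// keeps the first whitespace-separated word that remains, if any.
func parseSnowballStopWords(data string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.IndexByte(line, '|'); i >= 0 {
			line = line[:i] // strip the trailing comment
		}
		if fields := strings.Fields(line); len(fields) > 0 {
			words[fields[0]] = true
		}
	}
	return words
}

func main() {
	// A few lines in the same shape as the list above.
	sample := " | An English stop word list.\n" +
		"i              | subject\n" +
		"me             | object\n" +
		"       | us    | object\n" +
		"would\n"
	words := parseSnowballStopWords(sample)
	fmt.Println(len(words), words["me"], words["us"]) // prints "3 true false"
}
```

Entries that the list comments out, such as `us`, `will`, and `can`, start with a vertical bar, so under this reading they never become stop words.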
66
analysis/lang/es/analyzer_es.go
Normal file
|
@@ -0,0 +1,66 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"

	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)

const AnalyzerName = "es"

func AnalyzerConstructor(config map[string]interface{},
	cache *registry.Cache) (analysis.Analyzer, error) {
	unicodeTokenizer, err := cache.TokenizerNamed(unicode.Name)
	if err != nil {
		return nil, err
	}
	toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
	if err != nil {
		return nil, err
	}
	normalizeEsFilter, err := cache.TokenFilterNamed(NormalizeName)
	if err != nil {
		return nil, err
	}
	stopEsFilter, err := cache.TokenFilterNamed(StopName)
	if err != nil {
		return nil, err
	}
	lightStemmerEsFilter, err := cache.TokenFilterNamed(LightStemmerName)
	if err != nil {
		return nil, err
	}
	rv := analysis.DefaultAnalyzer{
		Tokenizer: unicodeTokenizer,
		TokenFilters: []analysis.TokenFilter{
			toLowerFilter,
			stopEsFilter,
			normalizeEsFilter,
			lightStemmerEsFilter,
		},
	}
	return &rv, nil
}

func init() {
	err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
	if err != nil {
		panic(err)
	}
}
|
122
analysis/lang/es/analyzer_es_test.go
Normal file
|
@@ -0,0 +1,122 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

func TestSpanishAnalyzer(t *testing.T) {
	tests := []struct {
		input  []byte
		output analysis.TokenStream
	}{
		// stemming
		{
			input: []byte("chicana"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("chican"),
					Position: 1,
					Start:    0,
					End:      7,
				},
			},
		},
		{
			input: []byte("chicano"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("chican"),
					Position: 1,
					Start:    0,
					End:      7,
				},
			},
		},
		// added by marty for better coverage
		{
			input: []byte("yeses"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("yes"),
					Position: 1,
					Start:    0,
					End:      5,
				},
			},
		},
		{
			input: []byte("jaeces"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("jaez"),
					Position: 1,
					Start:    0,
					End:      6,
				},
			},
		},
		{
			input: []byte("arcos"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("arc"),
					Position: 1,
					Start:    0,
					End:      5,
				},
			},
		},
		{
			input: []byte("caos"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("caos"),
					Position: 1,
					Start:    0,
					End:      4,
				},
			},
		},
		{
			input: []byte("parecer"),
			output: analysis.TokenStream{
				&analysis.Token{
					Term:     []byte("parecer"),
					Position: 1,
					Start:    0,
					End:      7,
				},
			},
		},
	}

	cache := registry.NewCache()
	analyzer, err := cache.AnalyzerNamed(AnalyzerName)
	if err != nil {
		t.Fatal(err)
	}
	for _, test := range tests {
		actual := analyzer.Analyze(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %v, got %v", test.output, actual)
		}
	}
}
|
78
analysis/lang/es/light_stemmer_es.go
Normal file
|
@@ -0,0 +1,78 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"bytes"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const LightStemmerName = "stemmer_es_light"

type SpanishLightStemmerFilter struct {
}

func NewSpanishLightStemmerFilter() *SpanishLightStemmerFilter {
	return &SpanishLightStemmerFilter{}
}

func (s *SpanishLightStemmerFilter) Filter(
	input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		runes := bytes.Runes(token.Term)
		runes = stem(runes)
		token.Term = analysis.BuildTermFromRunes(runes)
	}
	return input
}

func stem(input []rune) []rune {
	l := len(input)
	if l < 5 {
		return input
	}

	switch input[l-1] {
	case 'o', 'a', 'e':
		return input[:l-1]
	case 's':
		if input[l-2] == 'e' && input[l-3] == 's' && input[l-4] == 'e' {
			return input[:l-2]
		}
		if input[l-2] == 'e' && input[l-3] == 'c' {
			input[l-3] = 'z'
			return input[:l-2]
		}
		if input[l-2] == 'o' || input[l-2] == 'a' || input[l-2] == 'e' {
			return input[:l-2]
		}
	}

	return input
}

func SpanishLightStemmerFilterConstructor(config map[string]interface{},
	cache *registry.Cache) (analysis.TokenFilter, error) {
	return NewSpanishLightStemmerFilter(), nil
}

func init() {
	err := registry.RegisterTokenFilter(LightStemmerName, SpanishLightStemmerFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
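The suffix rules in `stem` above can be tried without bleve. The sketch below copies that logic into a standalone function, here given the illustrative name `lightStem`, and runs it on the words used in `analyzer_es_test.go` from this commit. It shows the rules in isolation and does not replace the registered filter.

```go
package main

import "fmt"

// lightStem mirrors the suffix rules of stem in light_stemmer_es.go.
// The name is illustrative; the logic is copied from the diff above.
func lightStem(input []rune) []rune {
	l := len(input)
	if l < 5 {
		return input // words shorter than five runes are left alone
	}

	switch input[l-1] {
	case 'o', 'a', 'e':
		return input[:l-1] // drop a final vowel
	case 's':
		if input[l-2] == 'e' && input[l-3] == 's' && input[l-4] == 'e' {
			return input[:l-2] // "-eses" loses its final "es"
		}
		if input[l-2] == 'e' && input[l-3] == 'c' {
			input[l-3] = 'z' // "-ces" becomes "-z"
			return input[:l-2]
		}
		if input[l-2] == 'o' || input[l-2] == 'a' || input[l-2] == 'e' {
			return input[:l-2] // drop a plural "-os", "-as" or "-es"
		}
	}

	return input
}

func main() {
	for _, w := range []string{"chicana", "yeses", "jaeces", "arcos", "caos", "parecer"} {
		fmt.Printf("%s -> %s\n", w, string(lightStem([]rune(w))))
	}
}
```

The `-ces` rule rewrites a rune in place, so callers should pass a slice they own. The filter above does this by converting each term with `bytes.Runes` first.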
70
analysis/lang/es/spanish_normalize.go
Normal file
|
@@ -0,0 +1,70 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"bytes"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

const NormalizeName = "normalize_es"

type SpanishNormalizeFilter struct {
}

func NewSpanishNormalizeFilter() *SpanishNormalizeFilter {
	return &SpanishNormalizeFilter{}
}

func (s *SpanishNormalizeFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		term := normalize(token.Term)
		token.Term = term
	}
	return input
}

func normalize(input []byte) []byte {
	runes := bytes.Runes(input)
	for i := 0; i < len(runes); i++ {
		switch runes[i] {
		case 'à', 'á', 'â', 'ä':
			runes[i] = 'a'
		case 'ò', 'ó', 'ô', 'ö':
			runes[i] = 'o'
		case 'è', 'é', 'ê', 'ë':
			runes[i] = 'e'
		case 'ù', 'ú', 'û', 'ü':
			runes[i] = 'u'
		case 'ì', 'í', 'î', 'ï':
			runes[i] = 'i'
		}
	}

	return analysis.BuildTermFromRunes(runes)
}

func NormalizerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	return NewSpanishNormalizeFilter(), nil
}

func init() {
	err := registry.RegisterTokenFilter(NormalizeName, NormalizerFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
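The `normalize` function above folds lowercase accented vowels to their unaccented forms and leaves every other rune alone. A standalone sketch of the same mapping, using only the standard library, is shown here. `foldAccents` is an illustrative name, and it works on strings rather than on bleve's byte-slice terms.

```go
package main

import "fmt"

// foldAccents applies the same rune mapping as normalize in
// spanish_normalize.go, on a string instead of a []byte term.
func foldAccents(s string) string {
	runes := []rune(s)
	for i, r := range runes {
		switch r {
		case 'à', 'á', 'â', 'ä':
			runes[i] = 'a'
		case 'ò', 'ó', 'ô', 'ö':
			runes[i] = 'o'
		case 'è', 'é', 'ê', 'ë':
			runes[i] = 'e'
		case 'ù', 'ú', 'û', 'ü':
			runes[i] = 'u'
		case 'ì', 'í', 'î', 'ï':
			runes[i] = 'i'
		}
	}
	return string(runes)
}

func main() {
	// The first five words come from spanish_normalize_test.go.
	for _, w := range []string{"Guía", "Belcebú", "Limón", "agüero", "laúd", "año", "Ángel"} {
		fmt.Printf("%s -> %s\n", w, foldAccents(w))
	}
}
```

Because only lowercase vowels are listed, `ñ` and uppercase accented letters such as `Á` pass through unchanged. In the analyzer this filter runs after the lowercase filter, so uppercase input should not normally reach it.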
112
analysis/lang/es/spanish_normalize_test.go
Normal file
|
@@ -0,0 +1,112 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
)

func TestSpanishNormalizeFilter(t *testing.T) {
	tests := []struct {
		input  analysis.TokenStream
		output analysis.TokenStream
	}{
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Guía"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Guia"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Belcebú"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Belcebu"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Limón"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("Limon"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agüero"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("aguero"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("laúd"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("laud"),
				},
			},
		},
		// empty
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte(""),
				},
			},
		},
	}

	spanishNormalizeFilter := NewSpanishNormalizeFilter()
	for _, test := range tests {
		actual := spanishNormalizeFilter.Filter(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %#v, got %#v", test.output, actual)
			t.Errorf("expected %s(% x), got %s(% x)", test.output[0].Term, test.output[0].Term, actual[0].Term, actual[0].Term)
		}
	}
}
|
52
analysis/lang/es/stemmer_es_snowball.go
Normal file
|
@@ -0,0 +1,52 @@
|
|||
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"

	"github.com/blevesearch/snowballstem"
	"github.com/blevesearch/snowballstem/spanish"
)

const SnowballStemmerName = "stemmer_es_snowball"

type SpanishStemmerFilter struct {
}

func NewSpanishStemmerFilter() *SpanishStemmerFilter {
	return &SpanishStemmerFilter{}
}

func (s *SpanishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		env := snowballstem.NewEnv(string(token.Term))
		spanish.Stem(env)
		token.Term = []byte(env.Current())
	}
	return input
}

func SpanishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
	return NewSpanishStemmerFilter(), nil
}

func init() {
	err := registry.RegisterTokenFilter(SnowballStemmerName, SpanishStemmerFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
79
analysis/lang/es/stemmer_es_snowball_test.go
Normal file
|
@@ -0,0 +1,79 @@
|
|||
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package es

import (
	"reflect"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/registry"
)

func TestSnowballSpanishStemmer(t *testing.T) {
	tests := []struct {
		input  analysis.TokenStream
		output analysis.TokenStream
	}{
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agresivos"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agres"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agresivamente"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agres"),
				},
			},
		},
		{
			input: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agresividad"),
				},
			},
			output: analysis.TokenStream{
				&analysis.Token{
					Term: []byte("agres"),
				},
			},
		},
	}

	cache := registry.NewCache()
	filter, err := cache.TokenFilterNamed(SnowballStemmerName)
	if err != nil {
		t.Fatal(err)
	}
	for _, test := range tests {
		actual := filter.Filter(test.input)
		if !reflect.DeepEqual(actual, test.output) {
			t.Errorf("expected %s, got %s", test.output[0].Term, actual[0].Term)
		}
	}
}
|
36
analysis/lang/es/stop_filter_es.go
Normal file
|
@@ -0,0 +1,36 @@
|
|||
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package es

import (
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/registry"
)

func StopTokenFilterConstructor(config map[string]interface{},
	cache *registry.Cache) (analysis.TokenFilter, error) {
	tokenMap, err := cache.TokenMapNamed(StopName)
	if err != nil {
		return nil, err
	}
	return stop.NewStopTokensFilter(tokenMap), nil
}

func init() {
	err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
	if err != nil {
		panic(err)
	}
}
|
383
analysis/lang/es/stop_words_es.go
Normal file
|
@@ -0,0 +1,383 @@
|
|||
package es
|
||||
|
||||
import (
|
||||
"github.com/blevesearch/bleve/v2/analysis"
|
||||
"github.com/blevesearch/bleve/v2/registry"
|
||||
)
|
||||
|
||||
const StopName = "stop_es"
|
||||
|
||||
// this content was obtained from:
|
||||
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
|
||||
// ` was changed to ' to allow for literal string
|
||||
|
||||
var SpanishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt
|
||||
| This file is distributed under the BSD License.
|
||||
| See http://snowball.tartarus.org/license.php
|
||||
| Also see http://www.opensource.org/licenses/bsd-license.html
|
||||
| - Encoding was converted to UTF-8.
|
||||
| - This notice was added.
|
||||
|
|
||||
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
|
||||
|
||||
| A Spanish stop word list. Comments begin with vertical bar. Each stop
|
||||
| word is at the start of a line.
|
||||
|
||||
|
||||
| The following is a ranked list (commonest to rarest) of stopwords
|
||||
| deriving from a large sample of text.
|
||||
|
||||
| Extra words have been added at the end.
|
||||
|
||||
de | from, of
|
||||
la | the, her
|
||||
que | who, that
|
||||
el | the
|
||||
en | in
|
||||
y | and
|
||||
a | to
|
||||
los | the, them
|
||||
del | de + el
|
||||
se | himself, from him etc
|
||||
las | the, them
|
||||
por | for, by, etc
|
||||
un | a
|
||||
para | for
|
||||
con | with
|
||||
no | no
|
||||
una | a
|
||||
su | his, her
|
||||
al | a + el
|
||||
| es from SER
|
||||
lo | him
|
||||
como | how
|
||||
más | more
|
||||
pero | but
sus | su plural
le | to him, her
ya | already
o | or
| fue from SER
este | this
| ha from HABER
sí | himself etc
porque | because
esta | this
| son from SER
entre | between
| está from ESTAR
cuando | when
muy | very
sin | without
sobre | on
| ser from SER
| tiene from TENER
también | also
me | me
hasta | until
hay | there is/are
donde | where
| han from HABER
quien | whom, that
| están from ESTAR
| estado from ESTAR
desde | from
todo | all
nos | us
durante | during
| estados from ESTAR
todos | all
uno | a
les | to them
ni | nor
contra | against
otros | other
| fueron from SER
ese | that
eso | that
| había from HABER
ante | before
ellos | they
e | and (variant of y)
esto | this
mí | me
antes | before
algunos | some
qué | what?
unos | a
yo | I
otro | other
otras | other
otra | other
él | he
tanto | so much, many
esa | that
estos | these
mucho | much, many
quienes | who
nada | nothing
muchos | many
cual | who
| sea from SER
poco | few
ella | she
estar | to be
| haber from HABER
estas | these
| estaba from ESTAR
| estamos from ESTAR
algunas | some
algo | something
nosotros | we

| other forms

mi | me
mis | mi plural
tú | thou
te | thee
ti | thee
tu | thy
tus | tu plural
ellas | they
nosotras | we
vosotros | you
vosotras | you
os | you
mío | mine
mía |
míos |
mías |
tuyo | thine
tuya |
tuyos |
tuyas |
suyo | his, hers, theirs
suya |
suyos |
suyas |
nuestro | ours
nuestra |
nuestros |
nuestras |
vuestro | yours
vuestra |
vuestros |
vuestras |
esos | those
esas | those

| forms of estar, to be (not including the infinitive):
estoy
estás
está
estamos
estáis
están
esté
estés
estemos
estéis
estén
estaré
estarás
estará
estaremos
estaréis
estarán
estaría
estarías
estaríamos
estaríais
estarían
estaba
estabas
estábamos
estabais
estaban
estuve
estuviste
estuvo
estuvimos
estuvisteis
estuvieron
estuviera
estuvieras
estuviéramos
estuvierais
estuvieran
estuviese
estuvieses
estuviésemos
estuvieseis
estuviesen
estando
estado
estada
estados
estadas
estad

| forms of haber, to have (not including the infinitive):
he
has
ha
hemos
habéis
han
haya
hayas
hayamos
hayáis
hayan
habré
habrás
habrá
habremos
habréis
habrán
habría
habrías
habríamos
habríais
habrían
había
habías
habíamos
habíais
habían
hube
hubiste
hubo
hubimos
hubisteis
hubieron
hubiera
hubieras
hubiéramos
hubierais
hubieran
hubiese
hubieses
hubiésemos
hubieseis
hubiesen
habiendo
habido
habida
habidos
habidas

| forms of ser, to be (not including the infinitive):
soy
eres
es
somos
sois
son
sea
seas
seamos
seáis
sean
seré
serás
será
seremos
seréis
serán
sería
serías
seríamos
seríais
serían
era
eras
éramos
erais
eran
fui
fuiste
fue
fuimos
fuisteis
fueron
fuera
fueras
fuéramos
fuerais
fueran
fuese
fueses
fuésemos
fueseis
fuesen
siendo
sido
| sed also means 'thirst'

| forms of tener, to have (not including the infinitive):
tengo
tienes
tiene
tenemos
tenéis
tienen
tenga
tengas
tengamos
tengáis
tengan
tendré
tendrás
tendrá
tendremos
tendréis
tendrán
tendría
tendrías
tendríamos
tendríais
tendrían
tenía
tenías
teníamos
teníais
tenían
tuve
tuviste
tuvo
tuvimos
tuvisteis
tuvieron
tuviera
tuvieras
tuviéramos
tuvierais
tuvieran
tuviese
tuvieses
tuviésemos
tuvieseis
tuviesen
teniendo
tenido
tenida
tenidos
tenidas
tened

`)

func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
	rv := analysis.NewTokenMap()
	err := rv.LoadBytes(SpanishStopWords)
	return rv, err
}

func init() {
	err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
	if err != nil {
		panic(err)
	}
}
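The list header above says that comments begin with a vertical bar and that each stop word sits at the start of a line. The following is a minimal, standalone sketch of how a list in that format could be parsed and applied. It is not the `LoadBytes` implementation used above; `parseStopWords` and `removeStopWords` are hypothetical helper names introduced only for illustration, and the sample data is a short excerpt in the same format.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseStopWords is a hypothetical helper (not part of the package shown
// above). It reads a Snowball-format list: everything from the first '|'
// to the end of the line is a comment, and any remaining
// whitespace-separated words are stop words.
func parseStopWords(data string) map[string]bool {
	words := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		// Drop the comment part, if the line has one.
		if i := strings.IndexByte(line, '|'); i >= 0 {
			line = line[:i]
		}
		for _, w := range strings.Fields(line) {
			words[w] = true
		}
	}
	return words
}

// removeStopWords returns the tokens that are not in the stop set,
// preserving their original order.
func removeStopWords(tokens []string, stop map[string]bool) []string {
	var kept []string
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// A short excerpt in the same format as the list above. The "es" line
	// is entirely a comment, so "es" is not loaded as a stop word here.
	sample := `
 | a small excerpt
de | from, of
la | the, her
 | es from SER
más | more
`
	stop := parseStopWords(sample)
	tokens := strings.Fields("la casa de papel es más grande")
	fmt.Println(removeStopWords(tokens, stop)) // prints [casa papel es grande]
}
```

In the full list, entries such as `| es from SER` are commented out in the ranked section because the same word appears later among the verb forms, so it is still loaded once.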
Some files were not shown because too many files have changed in this diff.