fix(Index): stop word filtering was not working correctly.
See #10
"Can't index or search for "one" (or: maybe filter stopwords before stemming?)"

Bumped `indexVersion` to `1.1.0` as filter and transform changes could introduce surprising behaviour.

Index now applies Initial Transforms, then Filters (stop word filters), then Transforms.
The only Initial Transform currently used is `TokenProcessors.trimmer`, which removes non-word characters from the start and end of each token.
This initial transform helps tokens match the stop word filters. The word "one" will now index properly and be found, even though its transformed form matches the default stop word "on".
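
The effect of the new ordering can be illustrated with a small standalone sketch. It does not use the library's code: `stopWords`, `trim`, `stem`, `processToken` and `oldProcessToken` are hypothetical stand-ins, the toy `stem` only drops a trailing "e", and tokens are assumed to be lower case already.

```elm
module PipelineSketch exposing (oldProcessToken, processToken)

import Char
import Set exposing (Set)


-- Hypothetical stop word list; the real default list is much longer.
stopWords : Set String
stopWords =
    Set.fromList [ "on", "the", "and" ]


isWordChar : Char -> Bool
isWordChar c =
    Char.isLower c || Char.isUpper c || Char.isDigit c


dropWhile : (a -> Bool) -> List a -> List a
dropWhile predicate list =
    case list of
        [] ->
            []

        first :: rest ->
            if predicate first then
                dropWhile predicate rest
            else
                list


-- Stand-in for the initial transform: strip non-word characters
-- from both ends of the token.
trim : String -> String
trim token =
    String.toList token
        |> dropWhile (not << isWordChar)
        |> List.reverse
        |> dropWhile (not << isWordChar)
        |> List.reverse
        |> String.fromList


-- Toy stand-in for a stemmer (assumption): drops one trailing "e".
stem : String -> String
stem token =
    if String.endsWith "e" token then
        String.dropRight 1 token
    else
        token


-- New order: initial transform, then stop word filter, then transform.
processToken : String -> Maybe String
processToken raw =
    let
        trimmed =
            trim raw
    in
    if String.isEmpty trimmed || Set.member trimmed stopWords then
        Nothing
    else
        Just (stem trimmed)


-- Old order: all transforms first, then the stop word filter.
oldProcessToken : String -> Maybe String
oldProcessToken raw =
    let
        transformed =
            stem (trim raw)
    in
    if Set.member transformed stopWords then
        Nothing
    else
        Just transformed
```

In this sketch `processToken "one,"` evaluates to `Just "on"`, while `oldProcessToken "one,"` evaluates to `Nothing` because the stemmed token collides with the stop word "on".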

In addition, some type changes were introduced so that loading, and possibly saving, of older index versions can be implemented in future.
rluiten committed Apr 30, 2017
1 parent e47e72a commit a4ac4be
Showing 19 changed files with 280 additions and 174 deletions.
31 changes: 14 additions & 17 deletions README.md
@@ -5,27 +5,24 @@ Copyright (c) 2016-2017 Robin Luiten
This is a full text indexing engine inspired by lunr.js and written in Elm language.
See http://lunrjs.com/ for lunr.js

While ElmTextSearch has a good selection of tests this library is not battle tested and may contain some performance issues that need addressing.
While ElmTextSearch has a good selection of tests this library is not battle tested and may contain some performance issues.

### Upgrading from 2.1.2 to 3.0.0.
I am happy to hear about users of this package.

Add `listFields = []` to your Configuration and that should be all thats required.
I am happy to receive contributions, be they bug reports, pull requests, documentation updates or examples.

* New field in Config is listFields for doc fields of type `List String`
### v4.0.0 will not load indexes saved with older versions.

If you do not use `storeToValue` `storeToString` `fromString` `fromValue` in ElmTextSearch this update is not likely to introduce issues.

The way that filters and transforms are applied to the content of documents has changed.
This properly fixes a reported bug, see https://github.com/rluiten/elm-text-search/issues/10, where stop word filters were not correctly applied. This means saved indexes from previous versions of ElmTextSearch will not load in this version.

* `Defaults.indexVersion` has changed value.

The reason this is a Major version bump is that some generalisation was done to enable future support
for loading and saving of older versions and types of default index configurations.

Example of updated config passed to create Index.
```elm
createNewIndexExample : ElmTextSearch.Index ExampleDocType
createNewIndexExample =
    ElmTextSearch.new
        { ref = .cid
        , fields =
            [ ( .title, 5.0 )
            , ( .body, 1.0 )
            ]
        , listFields = []
        }
```

### Packages

2 changes: 1 addition & 1 deletion elm-package.json
@@ -1,5 +1,5 @@
{
"version": "3.1.1",
"version": "4.0.0",
"summary": "Full text index engine in Elm language inspired by lunr.js.",
"repository": "https://github.com/rluiten/elm-text-search.git",
"license": "BSD3",
1 change: 1 addition & 0 deletions examples/IndexNewWithAddSearch.elm
@@ -59,6 +59,7 @@ createNewWithIndexExample =
, ( .body, 1.0 )
]
, listFields = []
, initialTransformFactories = Index.Defaults.defaultInitialTransformFactories
, transformFactories = Index.Defaults.defaultTransformFactories
, filterFactories = [ createMyStopWordFilter ]
}
2 changes: 1 addition & 1 deletion examples/MultipleAddSearch.elm
@@ -2,7 +2,7 @@ module Main exposing (..)

{-| Create an index and add multiple documents.
Copyright (c) 2016 Robin Luiten
Copyright (c) 2016-2017 Robin Luiten
-}

29 changes: 11 additions & 18 deletions src/ElmTextSearch.elm
@@ -95,12 +95,12 @@ type alias Index doc =

{-| A SimpleConfig is the least amount of configuration data
required to create an Index.
See [`ElmTextSearch.new`](ElmTextSearch#new) for fields.
-}
type alias SimpleConfig doc =
{ ref : doc -> String
, fields : List ( doc -> String, Float )
, listFields : List ( doc -> List String, Float )
}
Model.IndexSimpleConfig doc


{-| A Config is required to create an Index.
@@ -109,17 +109,6 @@ type alias Config doc =
Model.Config doc


{-| convert ElmTextSearch.SimpleConfig to Index.Model.SimpleConfig
-}
getIndexSimpleConfig : SimpleConfig doc -> Model.SimpleConfig doc
getIndexSimpleConfig { ref, fields, listFields } =
{ indexType = Defaults.elmTextSearchIndexType
, ref = ref
, fields = fields
, listFields = listFields
}


{-| Create new index.
Example
@@ -155,6 +144,7 @@ The `SimpleConfig` parameter to new is
- The unique document reference will be extracted from each
document using `.cid`.
- fields
- Define which fields contain strings to be indexed.
- The following fields will be indexed from each document
- `.title`
- `.body`
@@ -163,11 +153,13 @@ The `SimpleConfig` parameter to new is
more than if found in the `.body` field (boost value 1.0).
- The document match score determines the order of the list
of matching documents returned.
- listFields
- Define which fields contain lists of strings to be indexed.
-}
new : SimpleConfig doc -> Index doc
new simpleConfig =
Index.new (getIndexSimpleConfig simpleConfig)
Index.new (Defaults.getIndexSimpleConfig simpleConfig)


{-| Create new index with additional configuration.
@@ -199,6 +191,7 @@ Example.
, ( .body, 1.0 )
]
, listFields = []
, initialTransformFactories = Index.Defaults.defaultInitialTransformFactories
, transformFactories = Index.Defaults.defaultTransformFactories
, filterFactories = [ createMyStopWordFilter ]
}
@@ -367,7 +360,7 @@ See [`ElmTextSearch.fromStringWith`](ElmTextSearch#fromStringWith) for possible
fromString : SimpleConfig doc -> String -> Result String (Index doc)
fromString simpleConfig inputString =
Index.Load.loadIndex
(getIndexSimpleConfig simpleConfig)
(Defaults.getIndexSimpleConfig simpleConfig)
inputString


@@ -377,7 +370,7 @@ See [`ElmTextSearch.fromStringWith`](ElmTextSearch#fromStringWith) for possible
fromValue : SimpleConfig doc -> Decode.Value -> Result String (Index doc)
fromValue simpleConfig inputValue =
Index.Load.loadIndexValue
(getIndexSimpleConfig simpleConfig)
(Defaults.getIndexSimpleConfig simpleConfig)
inputValue


12 changes: 5 additions & 7 deletions src/Index.elm
@@ -63,7 +63,7 @@ type alias Config doc =


type alias SimpleConfig doc =
Model.SimpleConfig doc
Model.ModelSimpleConfig doc


{-| Create new index.
@@ -77,15 +77,17 @@ new simpleConfig =
{-| Create new index with control of transformers and filters.
-}
newWith : Config doc -> Index doc
newWith { indexType, ref, fields, listFields, transformFactories, filterFactories } =
newWith { indexType, ref, fields, listFields, initialTransformFactories, transformFactories, filterFactories } =
Index
{ indexVersion = Defaults.indexVersion
, indexType = indexType
, ref = ref
, fields = fields
, listFields = listFields
, initialTransformFactories = initialTransformFactories
, transformFactories = transformFactories
, filterFactories = filterFactories
, initialTransforms = Nothing
, transforms = Nothing
, filters = Nothing
, corpusTokens = Set.empty
@@ -117,21 +119,17 @@ add doc ((Index irec) as index) =
( index, [] )
(List.map Tuple.first irec.fields)

-- _ = Debug.log "fieldsWordList" fieldsWordList
( u2index, u2fieldsWordList ) =
List.foldr
(getWordsForFieldList doc)
( u1index, fieldsWordList )
(List.map Tuple.first irec.listFields)

-- _ = Debug.log "u2fieldsWordList" u2fieldsWordList
fieldsTokens =
List.map Set.fromList u2fieldsWordList

docTokens =
List.foldr Set.union Set.empty fieldsTokens

-- _ = Debug.log("add docTokens") (docTokens)
in
if Set.isEmpty docTokens then
Err "Error after tokenisation there are no terms to index."
@@ -173,7 +171,7 @@ addDocsCore docsI docs ((Index irec) as index) errors =
addDocsCore (docsI + 1) tailDocs index (errors ++ [ ( docsI, msg ) ])


{-| reducer to extract tokens from each field Strin from doc
{-| reducer to extract tokens from each field String from doc
-}
getWordsForField :
doc
40 changes: 34 additions & 6 deletions src/Index/Defaults.elm
@@ -8,6 +8,8 @@ module Index.Defaults
, defaultStemmerFuncCreator
, defaultStopWordFilterFuncCreator
, getDefaultIndexConfig
, getIndexSimpleConfig
, defaultInitialTransformFactories
)

{-| Defaults for indexes and configurations.
@@ -26,18 +28,25 @@ module Index.Defaults
@docs defaultTokenTrimmerFuncCreator
@docs defaultStemmerFuncCreator
@docs defaultStopWordFilterFuncCreator
@docs defaultInitialTransformFactories
## Config type adapters
@docs getDefaultIndexConfig
@docs getIndexSimpleConfig
Copyright (c) 2016 Robin Luiten
Copyright (c) 2016-2017 Robin Luiten
-}

import Stemmer
import Index.Model as Model exposing (TransformFactory, FilterFactory)
import Index.Model as Model
exposing
( TransformFactory
, FilterFactory
, IndexSimpleConfig
)
import Index.Utils
import StopWordFilter
import TokenProcessors
@@ -54,7 +63,7 @@ well.
-}
indexVersion : String
indexVersion =
"1.0.0"
"1.1.0"


{-| The type of index defaults to using.
@@ -69,8 +78,15 @@ elmTextSearchIndexType =
-}
defaultTransformFactories : List (TransformFactory doc)
defaultTransformFactories =
[ defaultStemmerFuncCreator
]


{-| Index default transform factories that apply before filters.
-}
defaultInitialTransformFactories : List (TransformFactory doc)
defaultInitialTransformFactories =
[ defaultTokenTrimmerFuncCreator
, defaultStemmerFuncCreator
]


@@ -104,16 +120,28 @@ defaultStopWordFilterFuncCreator =
StopWordFilter.createDefaultFilterFunc


{-| Convert Index.Model.SimpleConfig to Index.Model.Config
{-| Convert Index.Model.ModelSimpleConfig to Index.Model.Config
Filling in default values for fields not in SimpleConfig
This is the definition of the default index configuration.
-}
getDefaultIndexConfig : Model.SimpleConfig doc -> Model.Config doc
getDefaultIndexConfig : Model.ModelSimpleConfig doc -> Model.Config doc
getDefaultIndexConfig { indexType, ref, fields, listFields } =
{ indexType = indexType
, ref = ref
, fields = fields
, listFields = listFields
, initialTransformFactories = defaultInitialTransformFactories
, transformFactories = defaultTransformFactories
, filterFactories = defaultFilterFactories
}


{-| convert ElmTextSearch.IndexSimpleConfig to Index.Model.ModelSimpleConfig
-}
getIndexSimpleConfig : IndexSimpleConfig doc -> Model.ModelSimpleConfig doc
getIndexSimpleConfig { ref, fields, listFields } =
{ indexType = elmTextSearchIndexType
, ref = ref
, fields = fields
, listFields = listFields
}
8 changes: 5 additions & 3 deletions src/Index/Load.elm
@@ -2,7 +2,7 @@ module Index.Load exposing (..)

{-| Load an index from Value or String
Copyright (c) 2016 Robin Luiten
Copyright (c) 2016-2017 Robin Luiten
-}

@@ -94,27 +94,29 @@ loadIndexFull ( config, decodedIndex ) =
, ref = config.ref
, fields = config.fields
, listFields = config.listFields
, initialTransformFactories = config.initialTransformFactories
, transformFactories = config.transformFactories
, filterFactories = config.filterFactories
, documentStore = decodedIndex.documentStore
, corpusTokens = decodedIndex.corpusTokens
, tokenStore = decodedIndex.tokenStore
, corpusTokensIndex =
(Index.Utils.buildOrderIndex decodedIndex.corpusTokens)
, initialTransforms = Nothing
, transforms = Nothing
, filters = Nothing
, idfCache = Dict.empty
}


loadIndex : SimpleConfig doc -> String -> Result String (Index doc)
loadIndex : ModelSimpleConfig doc -> String -> Result String (Index doc)
loadIndex simpleConfig inputString =
loadIndexWith
[ Defaults.getDefaultIndexConfig simpleConfig ]
inputString


loadIndexValue : SimpleConfig doc -> Decode.Value -> Result String (Index doc)
loadIndexValue : ModelSimpleConfig doc -> Decode.Value -> Result String (Index doc)
loadIndexValue simpleConfig inputValue =
loadIndexValueWith
[ Defaults.getDefaultIndexConfig simpleConfig ]