Thursday, December 15, 2011

Full Text Search Basics

Sql server full text search is a feature used to scan or search the large document stored in a table like documents, html, varbinary etc..we can do a like '%'  search, but these are much more efficient & optimised .
It can also rank the results based on the proximity of words you define & the parts of content you search for, inflectional words . We can tweak the index performance , avoid stop list words from search etc..

By default databases are enabled for full text search unless databases are created using ssms & we can enable the full search text by right clicking database s -> select properties -> check full text indexing







To create full text index --> catalog has to be created to store one or more full text indexes & full text indexes have to be created on them.

Adventureworks database is used for demo.

1) Creating Catalog :
Create FullText Catalog MyCatalog
This creates a folder in the specified directory during installation to store the full text indexes on the hard drive.
Catalog stores alll index files & configuration files. full text indexes store each word from each document indexed and maps them back to the table row for reference.

2) There should be primary key on the table which is unique & non nullable . create primary key if it doesn't exist.

3) create full text index on the table

create fulltext index on production.productreview (comments)
key index PK_ProductReview_ProductReviewID on MyCatalog
with change_tracking auto

comments is the column on which we want to full text index.

PK_ProductReview_ProductReviewID  is the unique primary key on the table

MyCatalog is the catalog on which we want the full text index to be on
change_tracking -> created full text index needs to be populated with data & it has different population modes.
1) Full Population is done for the first time after full text index is created
2) Incremental population can be done either based on change tracking or incremental time stamp based.
 
Full text index is very IO oriented  & intensive operation. so it must scheduled & maintain during down time.


Creating one more advanced full text index on table  production.document.

create fulltext index on production.document
(
document
type column FileExtension
language 0x0
)
key index PK_Document_DocumentID on MyCatalog
with change_tracking auto

Above query says,  create full text index on column document in table production.document & the document columns holds data of type specified in Fileextension column . File extension is used to load the appropriate filter .  lanaguage  0x0 is neutral language but we can specify any language needs to be used for parsing the text . so that it can use the appropriate language dictionary to parse the text from document.






 sys.fulltext_document_types  gives all the list of document types full text index can support.

























FreeText() is a predicate , query engine splits the sentence in to words (word breaking )
generates inflectional forms of the words ( stemming)
identifies a list of matching words based on the words from thesaurus.

Now querying :
Basic query :













We can use different search conditions for contains :
1) like wild cards  where contains(  comments, '"mount*"')  matches to mountain, mount , mounting etc.

2) can use binary operations like AND, OR, NOT.
   ex : contains ( comments, ' "mountain"  AND   "flag"'  NOT   "flagging"  ')








3) can use near operations which lists all the documents which have both the words & ranks higher the one which are more closer to each other.












4) we can use isabout   to give  the different weights to different words & they will be ranked on the weight given. the words with more weight show up first & then the rest follows in the order of weights given.























the above query contains the containstable  which creates a temporary table  taking  the bay as wild card with different weights given to different words. since street has more weight than view ,  bay wild card with street shows up first ,then comes the rest . we joined the temporary table with the actual table on address id (pk) & then order by rank desc,

5) we can use the inflectional words which give similar meaning  like foot returning documents with foot, feet etc.

















Any thing related to log for Full Text Index can be found   @
C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Log

with file name starting with SQLFT...


















Full text catalogs in a database can be found using the system table sys.fulltext_catalogs












The list of lanaguages that full text can support can be found using sys.fulltext_languages
the lcid can be declared when we create a full text index to search on the language  desired.



























Stop list is the words list which will not be used by full text search to search on. they are most generally used words in day to day life. by using stop list we can narrow down the words to search for.
we can add a stop list to full text index by :


CREATE FULLTEXT STOPLIST stList
FROM SYSTEM STOPLIST;

we can query the stop list  words as below :

















to  add or remove words from stop list :


ALTER FULLTEXT STOPLIST stList
ADD / DROP 'pen' LANGUAGE 1033;

language code 1033 is for english ( which is nothing but lcid discussed above )

Misc : Only one free text index is allowed per table & it can have 1024 columns


No comments:

Post a Comment