Microsoft Technologies: Full Text Search Internals

Overview :

Full-text search provides the ability to search character-based data stored in the following kinds of columns:
1) character data
2) varbinary data --> when HTML documents are indexed, the MS.LOCALE meta tag defines the language used for the full-text search
3) xml data --> the xml:lang attribute defines the language used for the search
4) FILESTREAM data

Full-text search functionality includes:
1) simple search using FREETEXT
2) prefix search using CONTAINS
3) inflectional forms of the same word (like run, ran, etc.) using CONTAINS with FORMSOF(INFLECTIONAL, ...)
4) words near other words using CONTAINS with NEAR
5) ranked and weighted results using CONTAINSTABLE with ISABOUT
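
The five kinds of search listed above could be written roughly as follows. This is only a sketch: dbo.Documents, DocId, and Body are hypothetical names, and the table is assumed to already have a full-text index on Body.

```sql
-- Assumption: hypothetical table dbo.Documents(DocId int, Body nvarchar(max))
-- with a full-text index on Body and DocId as the full-text key.

-- 1) Simple search with FREETEXT
SELECT DocId FROM dbo.Documents
WHERE FREETEXT(Body, N'running shoes');

-- 2) Prefix search with CONTAINS (the quoted term ends in *)
SELECT DocId FROM dbo.Documents
WHERE CONTAINS(Body, N'"run*"');

-- 3) Inflectional forms (run, ran, running, ...)
SELECT DocId FROM dbo.Documents
WHERE CONTAINS(Body, N'FORMSOF(INFLECTIONAL, run)');

-- 4) Words near other words
SELECT DocId FROM dbo.Documents
WHERE CONTAINS(Body, N'tree NEAR road');

-- 5) Ranked, weighted results with CONTAINSTABLE and ISABOUT
SELECT d.DocId, k.[RANK]
FROM CONTAINSTABLE(dbo.Documents, Body,
         N'ISABOUT(tree WEIGHT(0.8), road WEIGHT(0.2))') AS k
JOIN dbo.Documents AS d ON d.DocId = k.[KEY]
ORDER BY k.[RANK] DESC;
```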

Components:

1) Close to 50 languages are supported. The list can be queried with:
   SELECT lcid, name FROM sys.fulltext_languages
2) Each language has a word breaker and a stemmer.
   The word breaker splits the sentences of a document into words and decides what constitutes a word, whereas the stemmer finds the inflectional forms of a word. Third-party word breakers can be purchased and are supported.
   Stemmers are not invoked when the full-text index is populated; they are invoked when a full-text query runs.
3) Per-instance thesaurus files. These are XML files on disk, located at
   C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData

   They define expansion sets and replacement sets (for example, if a search term is misspelled, the thesaurus looks up the correct word in the XML file, substitutes it, and the search returns results for the correct word).
4) Stoplists, formerly called noise words. They were moved into internal tables in SQL Server 2008, and a stoplist can be assigned per full-text index.
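
As an illustration of the per-index stoplist mentioned in item 4, a custom stoplist could be created and attached along these lines. The names MyStopList and dbo.Documents are hypothetical, and the example assumes English (LCID 1033) and an existing full-text index on the table.

```sql
-- Assumption: MyStopList and dbo.Documents are illustrative names only.

-- Start from a copy of the system stoplist.
CREATE FULLTEXT STOPLIST MyStopList FROM SYSTEM STOPLIST;

-- Add an extra stop word for English (LCID 1033).
ALTER FULLTEXT STOPLIST MyStopList ADD N'widget' LANGUAGE 1033;

-- Attach the stoplist to the full-text index of one table.
ALTER FULLTEXT INDEX ON dbo.Documents SET STOPLIST MyStopList;

-- Inspect the stop words now stored in the internal tables.
SELECT sw.stopword, sw.language
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'MyStopList';
```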

IFilters:

A filter is provided for each document type: one for .doc, one for .docx, one for .pdf, and so on. Full-text search reads the registry to find and load newly added filters. Filters implement the IFilter interface, and third-party filters are supported.
IFilters are used to parse documents while the full-text index is being populated, not while data is being queried. We can specify the document type for each row in the table, and the relevant IFilter is loaded based on that type, so the same table can hold multiple document types.
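
The per-row document type described above is declared with a type column when the full-text index is created. A minimal sketch, assuming a hypothetical table dbo.DocumentStore with a varbinary(max) column Content, a character column FileExtension holding values such as '.docx' or '.pdf', and a unique key index named PK_DocumentStore:

```sql
-- Assumption: dbo.DocumentStore, its columns, PK_DocumentStore, and
-- DocCatalog are hypothetical names used only for illustration.
CREATE FULLTEXT CATALOG DocCatalog;

CREATE FULLTEXT INDEX ON dbo.DocumentStore
(
    -- FileExtension tells the engine which IFilter to use for each row.
    Content TYPE COLUMN FileExtension LANGUAGE 1033
)
KEY INDEX PK_DocumentStore
ON DocCatalog
WITH CHANGE_TRACKING AUTO;
```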

New or third-party IFilters can be loaded with the command:

EXEC sp_fulltext_service 'load_os_resources', 1
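
To check which filters the instance currently knows about, the catalog view sys.fulltext_document_types can be queried. The sketch below assumes that a newly installed filter shows up here once operating system resources are loaded; an instance restart may be needed first.

```sql
-- Allow the instance to load filters and word breakers registered
-- with the operating system.
EXEC sp_fulltext_service 'load_os_resources', 1;

-- List the document types (file extensions) that have a filter available.
SELECT document_type, path, manufacturer
FROM sys.fulltext_document_types
ORDER BY document_type;
```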

Service :

SQL Server 2005 used an external Windows service for word breaking and stemming. In SQL Server 2008 that functionality was brought into SQL Server. The FDHost launcher service must be up and running to support full-text search. The search itself runs within the SQL Server engine.
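
A quick way to confirm from T-SQL that full-text search is installed and that filter daemon hosts are present might look like the sketch below. It assumes SQL Server 2008 or later, where the sys.dm_fts_fdhosts view is available.

```sql
-- Returns 1 when the full-text component is installed on this instance.
SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS IsFullTextInstalled;

-- Lists the filter daemon host (fdhost.exe) processes known to the engine.
SELECT fdhost_id, fdhost_name, fdhost_process_id, fdhost_type
FROM sys.dm_fts_fdhosts;
```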

Programming :

Full-text search provides 4 programming constructs:
1) FREETEXT and 2) CONTAINS are predicates.
3) FREETEXTTABLE and 4) CONTAINSTABLE are table-valued functions.

FREETEXT searches for the meaning of a sentence: it automatically applies word breaking, stemming (inflectional forms), the thesaurus, and stop words to find documents.
CONTAINS does none of the above automatically; thesaurus expansion, inflectional forms, etc. must be requested explicitly, depending on what we are looking for.
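
The difference between the automatic behaviour of FREETEXT and the explicit style of CONTAINS can be sketched as follows. The table dbo.Documents and its columns are hypothetical, and the two WHERE clauses are not claimed to return identical rows; they only show where each feature has to be requested.

```sql
-- Assumption: hypothetical table dbo.Documents(DocId int, Body nvarchar(max))
-- with a full-text index on Body and DocId as the full-text key.

-- FREETEXT: stemming and thesaurus handling are applied automatically.
SELECT DocId FROM dbo.Documents
WHERE FREETEXT(Body, N'car ran');

-- CONTAINS: each expansion must be asked for explicitly.
SELECT DocId FROM dbo.Documents
WHERE CONTAINS(Body,
      N'FORMSOF(THESAURUS, car) AND FORMSOF(INFLECTIONAL, run)');

-- FREETEXTTABLE: the table-valued form returns KEY and RANK columns,
-- which are joined back to the base table.
SELECT TOP (10) d.DocId, ft.[RANK]
FROM FREETEXTTABLE(dbo.Documents, Body, N'car ran') AS ft
JOIN dbo.Documents AS d ON d.DocId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```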

---------

We can list the words indexed by full-text search for a given table with the command:
SELECT * FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('HumanResources.Employee'))

The full-text parser (sys.dm_fts_parser) shows how a search string is broken down: for each term it reports whether it is an exact match, a stop word, or a number.
Numbers appear with nn in the term.
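
For example, the parser output for a sample phrase can be inspected like this. The phrase is made up; 1033 is the English LCID, the third argument 0 selects the system stoplist, and the last argument 0 turns accent sensitivity off.

```sql
-- Arguments: query string, LCID, stoplist id (0 = system stoplist),
-- accent sensitivity (0 = insensitive).
SELECT keyword, occurrence, special_term, display_term
FROM sys.dm_fts_parser(N'"The quick fox ran 12 miles"', 1033, 0, 0);
```

The special_term column carries the classification of each term, and display_term is the readable form of the keyword.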

The sys.internal_tables catalog view gives more information on the internal tables used by full-text search.
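
One way to narrow sys.internal_tables down to the full-text related entries is to filter on internal_type_desc. The sketch assumes the full-text internal table types all carry a FULLTEXT prefix in that column.

```sql
-- Assumption: full-text internal tables have an internal_type_desc
-- beginning with FULLTEXT.
SELECT it.name, it.internal_type_desc, it.parent_id
FROM sys.internal_tables AS it
WHERE it.internal_type_desc LIKE 'FULLTEXT%'
ORDER BY it.internal_type_desc, it.name;
```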

DocId is the unique id that refers to a row in the table. When the table has a unique int primary key, that key acts as the DocId.

Thesaurus files can be modified and then loaded again.
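
After a thesaurus XML file has been edited on disk, it can be reloaded for one language without restarting the instance. A minimal sketch, assuming SQL Server 2008 or later and the English thesaurus (LCID 1033); dbo.Documents and Body are hypothetical names.

```sql
-- Reload the thesaurus file for English (LCID 1033).
EXEC sys.sp_fulltext_load_thesaurus_file 1033;

-- Thesaurus expansion must be requested explicitly in CONTAINS.
-- Assumption: dbo.Documents(DocId, Body) is a hypothetical indexed table.
SELECT DocId FROM dbo.Documents
WHERE CONTAINS(Body, N'FORMSOF(THESAURUS, car)');
```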

SQL Server Denali Improvements :

1) Indexing is multi-threaded; in 2008 it is single-threaded.
2) Predicate performance is improved. For example, a query with several predicates such as CONTAINS(tree) AND CONTAINS(road) is converted into a single predicate for (tree AND road).
3) Faster response times through streamed table-valued functions: we don't have to wait until all rows are available, because rows are streamed as they are found.
4) More granular locks that are held for a shorter time.
5) Property-based search for documents, such as document author name, date created, etc.
   Properties are defined per index, and whenever the properties are updated the full-text index must be repopulated.
   IFilters must support the extraction of properties.
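
Property-based search from item 5 might be set up roughly as shown below. The list name DocumentProperties and the table dbo.DocumentStore with its Content column are hypothetical. The GUID and integer id are the ones commonly documented for the Author property, and should be verified against the filter in use.

```sql
-- Assumption: DocumentProperties, dbo.DocumentStore, DocId, and Content
-- are hypothetical names; the table already has a full-text index.
CREATE SEARCH PROPERTY LIST DocumentProperties;

-- Register the Author property.
ALTER SEARCH PROPERTY LIST DocumentProperties
    ADD 'Author'
    WITH (PROPERTY_SET_GUID = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9',
          PROPERTY_INT_ID = 4,
          PROPERTY_DESCRIPTION = 'Author of the document');

-- Attaching the list triggers a repopulation of the index.
ALTER FULLTEXT INDEX ON dbo.DocumentStore
    SET SEARCH PROPERTY LIST DocumentProperties;

-- Search only within the Author property of each document.
SELECT DocId FROM dbo.DocumentStore
WHERE CONTAINS(PROPERTY(Content, 'Author'), N'Smith');
```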

-----------

Sunday, December 18, 2011

Full Text Search Internals - part 3
