Over time, Sphinx has accumulated a large pile of matching modes and ranking modes. Questions come up regularly about all sorts of things (from "how do I push this document to first place" to "how do I draw 1 to 5 stars depending on how well a document matches"), and at heart they are questions about the internals of these modes. In this post I will explain everything I can remember: how matching and ranking modes are organized, which ranking factors exist, and exactly how the factors and the final weight are computed. And, of course, about the stars!
About matching modes and ranking modes
First, let us sort out what these modes do in general. The API exposes two different methods, SetMatchMode() and SetRankingMode(), so they look like two different things. Internally, however, they are the same machinery. Up to version 0.9.8 only matching modes existed, that is, SetMatchMode(). Each mode was implemented by a separate code branch, and each branch did both the document matching and the ranking itself, each in its own way, of course. Version 0.9.8 started the development of a new, unified matching engine. To avoid breaking compatibility, that engine was only available in SPH_MATCH_EXTENDED2 mode. By 0.9.9 it had become clear that the new engine was already quite stable and fast (and, just in case we overlooked something, there is always version 0.9.8). That let us remove a lot of old code, and starting with 0.9.9 all matching modes are handled by the new engine. For compatibility, when a legacy matching mode is used, a simplified query parser is invoked (full-text query language operators are ignored) and the proper ranker (the one corresponding to that matching mode) is selected automatically, but that is where all the differences end.
So in practice the document weight depends only on the ranking mode (the ranker). The following two queries therefore produce identical weights:
// 1
$cl->SetMatchMode ( SPH_MATCH_ALL );
$cl->Query ( "hello world" );

// 2
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
$cl->SetRankingMode ( SPH_RANK_PROXIMITY );
$cl->Query ( "hello world" );
In the second variant you can write @title hello (see the query language); in the first you cannot (see compatibility).
A ranker computes the document weight from several internal factors available to it (and only to it). You could say that different rankers simply assemble those factors into the final weight in different ways. The two most important factors are 1) BM25, the classic statistical factor used by most (if not all) search engines since the 80s, and 2) the Sphinx-specific phrase weight factor.
About BM25
BM25 is a real number in the range 0 to 1 that depends only on keyword frequencies: in the current document on the one hand, and in the whole document set (the collection) on the other. The current BM25 implementation in Sphinx is computed from the total keyword frequency within the document, not just the frequency of the occurrences that actually matched the query. For example, for the query @title hello (which matches a single occurrence of the word hello in the title), the BM25 factor is computed exactly as it would be for the same keyword queried against @(title,content). The simplification is deliberate: in the standard ranking mode BM25 is a secondary factor, and in testing the difference in ranking was negligible while the difference in speed was quite significant. The exact algorithm is as follows.
BM25 = 0
foreach ( keyword in matching_keywords )
{
n = total_matching_documents ( keyword )
N = total_documents_in_collection
k1 = 1.2
TF = current_document_occurrence_count ( keyword )
IDF = log((N-n+1)/n) / log(1+N)
BM25 = BM25 + TF*IDF/(TF+k1)
}
BM25 = 0.5 + BM25 / ( 2*num_keywords ( query ) )
TF stands for Term Frequency (the keyword's frequency in the document being ranked). IDF stands for Inverse Document Frequency. Keywords that are common in the collection get a small IDF, and rare ones get a large IDF. At the extremes, IDF = 1 for a word that occurs in exactly one document, and IDF ~= -1 for a word that occurs in every document. The TF term, TF/(TF+k1), theoretically ranges from 0 to 1 depending on k1; with the chosen k1 = 1.2 it actually varies from 0.4545... to 1.
BM25 rises when the keywords are rare and occur many times in the document, and falls when the keywords are frequent. The maximum BM25 value is reached when all keywords match the document, the words are extremely rare (found in this one document only, out of the whole collection), and on top of that occur in it many times. The minimum, correspondingly, is reached when the document matches just one super-frequent word that occurs in every document and... also occurs in the document many times.
Note that overly frequent words (those found in more than half of the documents) decrease BM25. Indeed, if a word occurs in every document but one, that one document without the word is still special and deserves a higher weight.
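The pseudocode above translates almost line for line into Python. The following is a minimal sketch, not Sphinx's actual implementation: the function name `bm25` and the dictionary arguments `doc_tf` and `doc_freq` are hypothetical stand-ins for the per-document and per-collection statistics the engine keeps internally.

```python
import math

K1 = 1.2  # the k1 constant from the pseudocode


def bm25(doc_tf, doc_freq, total_docs, num_query_keywords):
    """Sketch of the simplified BM25 factor described in the text.

    doc_tf:   {keyword: occurrences in this document}, matched keywords only
    doc_freq: {keyword: number of documents that contain the keyword}
    Both dictionaries are illustrative assumptions, not Sphinx structures.
    """
    acc = 0.0
    for keyword, tf in doc_tf.items():
        n = doc_freq[keyword]
        idf = math.log((total_docs - n + 1) / n) / math.log(1 + total_docs)
        acc += tf * idf / (tf + K1)
    return 0.5 + acc / (2 * num_query_keywords)


N = 1_000_000
# A word found in exactly one document, repeated 50 times: close to the maximum.
rare = bm25({"xyzzy": 50}, {"xyzzy": 1}, N, 1)
# A word found in every document, repeated 50 times: close to the minimum.
common = bm25({"the": 50}, {"the": N}, N, 1)
# A word found in more than half of the documents pulls the value below 0.5.
frequent = bm25({"and": 1}, {"and": 600_000}, N, 1)
```

With these inputs `rare` lands just below 1, `common` just above 0, and `frequent` below the neutral 0.5, which matches the behaviour described in the two paragraphs above.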
About phrase weight
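Phrase weight (also known as query proximity) is computed in a completely different way. This factor ignores frequencies entirely and instead looks at the relative positions of keywords in the query and in the document. To compute it, Sphinx analyzes the keyword positions in each field of the document, finds the longest contiguous match with the query, and counts its length in keywords. Formally, it finds the longest common subsequence (LCS) of keywords between the query and the field being processed, and sets the phrase weight of that field to the LCS length. As the examples below show, only contiguous runs count, so strictly speaking this is a longest common substring of keywords. In plain words, the (sub)phrase weight is the number of keywords that occurred in the field in exactly the same order as in the query. A few examples: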
1) query = one two three, field = one and two three
field_phrase_weight = 2 ( the subphrase "two three" matched, 2 keywords long )
2) query = one two three, field = one and two and three
field_phrase_weight = 1 ( individual keywords matched, but no subphrase longer than 1 )
3) query = one two three, field = nothing matches at all
field_phrase_weight = 0
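The three examples above can be reproduced with a small dynamic-programming sketch. It assumes plain whitespace tokenization and exact keyword equality, which is far simpler than what Sphinx does internally; the function name `field_phrase_weight` is borrowed from the examples, not from any Sphinx API.

```python
def field_phrase_weight(query, field):
    """Length, in keywords, of the longest contiguous run shared by query and field."""
    q = query.split()
    f = field.split()
    best = 0
    # prev[j] holds the length of the common run ending at q[i-2] and f[j-1]
    prev = [0] * (len(f) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(f) + 1)
        for j in range(1, len(f) + 1):
            if q[i - 1] == f[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run that ended one step back
                best = max(best, cur[j])
        prev = cur
    return best


w1 = field_phrase_weight("one two three", "one and two three")
w2 = field_phrase_weight("one two three", "one and two and three")
w3 = field_phrase_weight("one two three", "nothing matches at all")
```

Here `w1`, `w2` and `w3` come out as 2, 1 and 0, the same values as in examples 1) to 3).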
To get the final phrase weight of the document, the phrase weight of each field is multiplied by the user-specified field weight (see the SetFieldWeights() method) and the products are summed. (By the way, field weights default to 1 and cannot be set below 1.) In pseudocode:
doc_phrase_weight = 0
foreach ( field in matching_fields )
{
field_phrase_weight = max_common_subsequence_length ( query, field )
doc_phrase_weight += user_weight ( field ) * field_phrase_weight
}
Example:
doc_title = hello world
doc_body = the world is a wonderful place
query = hello world
query_title_weight = 5
query_body_weight = 3
title_phrase_weight = 2
body_phrase_weight = 1
doc_phrase_weight = 2*5+3*1 = 13
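The worked example can be checked with a short sketch. The helper names `longest_common_run` and `doc_phrase_weight`, the dictionary layout of the document, and the whitespace tokenization are all assumptions made for illustration.

```python
def longest_common_run(q, f):
    """Brute-force length of the longest contiguous keyword run shared by q and f."""
    best = 0
    for i in range(len(q)):
        for j in range(len(f)):
            k = 0
            while i + k < len(q) and j + k < len(f) and q[i + k] == f[j + k]:
                k += 1
            best = max(best, k)
    return best


def doc_phrase_weight(query, fields, field_weights):
    """Sum of per-field phrase weights, each scaled by its user field weight."""
    total = 0
    for name, text in fields.items():
        run = longest_common_run(query.split(), text.split())
        # Fields without any match contribute nothing; the default weight is 1.
        total += field_weights.get(name, 1) * run
    return total


doc = {"title": "hello world", "body": "the world is a wonderful place"}
weight = doc_phrase_weight("hello world", doc, {"title": 5, "body": 3})
```

The title contributes 2*5 and the body 1*3, so `weight` equals 13, as in the example above.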
About the bright future
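The two factors described today are the main ones, but in general they are not the only ones possible. It is technically feasible to take other textual factors into account: for example, to compute a "proper" BM25 based on the actual matches; to compute subphrase weights in a smarter way that accounts for the frequencies of the words involved; to additionally account for the number of matched words in a field; and so on. It is also possible to take all sorts of non-textual factors into account at the level of the ranker itself, that is, to use certain attributes in the process of computing the weight rather than in addition to the computed value.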
About rankers
Finally, about ranking modes, or rankers for short. They are what assembles the final weight from all the various factors. The weight that a ranker outputs is an integer.
The default ranker (in extended/extended2 modes) is called SPH_RANK_PROXIMITY_BM25 and combines phrase weight with BM25. The BM25 factor occupies the lowest 3 decimal digits, and the phrase weight starts from the 4th digit up. There are two related rankers, SPH_RANK_PROXIMITY and SPH_RANK_BM25. The first simply returns the phrase weight itself as the weight. The second honestly computes BM25 only; instead of the lengthy computation, the phrase weight of every matching field is simply taken to be 1.
// SPH_RANK_PROXIMITY_BM25 (the default ranker)
rank_proximity_bm25 = doc_phrase_weight*1000 + doc_bm25*999
// SPH_RANK_PROXIMITY (emulates SPH_MATCH_ALL)
rank_proximity = doc_phrase_weight
// SPH_RANK_BM25 (faster; phrase weight is not computed)
rank_bm25 = sum ( matching_field_weight )*1000 + doc_bm25*999
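To make the "lowest 3 digits" layout concrete, here is a small sketch of the three formulas and of how a combined weight can be split back into its parts. Truncation to an integer is an assumption (the text only says that the output is an integer), and all function names are illustrative.

```python
def rank_proximity_bm25(doc_phrase_weight, doc_bm25):
    # Phrase weight goes into the thousands, BM25 (0..1) into the last 3 digits.
    return int(doc_phrase_weight * 1000 + doc_bm25 * 999)


def rank_proximity(doc_phrase_weight):
    return doc_phrase_weight


def rank_bm25(matching_field_weights, doc_bm25):
    # Each matching field counts as phrase weight 1, scaled by its field weight.
    return int(sum(matching_field_weights) * 1000 + doc_bm25 * 999)


def split_weight(weight):
    """Return (phrase part, BM25 part scaled to 0..999) of a combined weight."""
    return divmod(weight, 1000)


combined = rank_proximity_bm25(13, 0.75)   # 13000 + 749.25, truncated
parts = split_weight(combined)
bm25_only = rank_bm25([5, 3], 0.75)        # (5 + 3) * 1000 + 749.25, truncated
```

With a phrase weight of 13 and BM25 of 0.75, `combined` is 13749 and `parts` is `(13, 749)`, so the phrase part always dominates the ordering and BM25 only breaks ties within the same phrase weight.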
SPH_RANK_PROXIMITY_BM25 is the default because, in our opinion, it gives the most relevant results of the rankers available. The default may change in the future; there are plenty of plans for smarter and generally better rankers. The SPH_RANK_PROXIMITY ranker emulates SPH_MATCH_ALL mode (which, by the way, is the very first mode, the one it all started with back in 2001). SPH_RANK_BM25 is the one to use when phrase weight does not matter for some reason, or when you simply want faster queries. By the way, the choice of ranker significantly affects query speed! The most expensive part of the computation is usually the analysis of keyword positions within the document. So SPH_RANK_PROXIMITY_BM25 will always be slower than SPH_RANK_BM25, which in turn will always be slower than SPH_RANK_NONE (which computes nothing at all).
Another ranker, used to handle a legacy matching mode, is SPH_RANK_MATCHANY. Like the two proximity rankers, it also computes subphrase weights. In addition, it counts the number of keywords matched in each field and combines that count with the subphrase weight so that a) a longer subphrase in any field ranks the whole document higher, and b) given the same subphrase length, the document with more matched keywords ranks higher. Put simply: if document A contains a more exact (longer) subphrase of the query than document B, A must be ranked higher; otherwise (the subphrases are of equal length) we just look at the number of matched words. The algorithm is as follows.
k = 0
foreach ( field in all_fields )
k += user_weight ( field ) * num_keywords ( query )
rank = 0
foreach ( field in matching_fields )
{
field_phrase_weight = max_common_subsequence_length ( query, field )
field_rank = ( field_phrase_weight * k + num_matching_keywords ( field ) )
rank += user_weight ( field ) * field_rank
}
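A runnable sketch of the same algorithm follows. It assumes whitespace tokenization, treats `num_matching_keywords` as the number of distinct query keywords found in the field, and considers a field "matching" when that number is nonzero; those readings of the pseudocode, and all names, are assumptions.

```python
def longest_common_run(q, f):
    """Brute-force length of the longest contiguous keyword run shared by q and f."""
    best = 0
    for i in range(len(q)):
        for j in range(len(f)):
            k = 0
            while i + k < len(q) and j + k < len(f) and q[i + k] == f[j + k]:
                k += 1
            best = max(best, k)
    return best


def rank_matchany(query, fields, field_weights):
    q = query.split()
    # k is large enough that one extra keyword of subphrase length always
    # outweighs any possible difference in matched keyword counts.
    k = sum(field_weights.get(name, 1) for name in fields) * len(q)
    rank = 0
    for name, text in fields.items():
        words = text.split()
        matched = len(set(q) & set(words))
        if matched == 0:
            continue  # not a matching field
        field_rank = longest_common_run(q, words) * k + matched
        rank += field_weights.get(name, 1) * field_rank
    return rank


weights = {"title": 5, "body": 3}
doc_a = {"title": "hello world", "body": "the world is a wonderful place"}
doc_b = {"title": "world hello", "body": "hello again world"}
rank_a = rank_matchany("hello world", doc_a, weights)
rank_b = rank_matchany("hello world", doc_b, weights)
```

Here `k` is 16, so document A scores 5*(2*16+2) + 3*(1*16+1) = 221, while document B, which matches more keywords but only as separate words, scores 5*(16+2) + 3*(16+2) = 144.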
The SPH_RANK_WORDCOUNT ranker dumbly sums up the number of matched keyword occurrences in each field, multiplied by the field weight. The only thing simpler is SPH_RANK_NONE, which computes nothing at all.
rank = 0
foreach ( field in matching_fields )
rank += user_weight ( field ) * num_matching_occurrences ( field )
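As a quick illustration, the word-count ranker fits in a few lines. The sketch assumes whitespace tokenization and exact keyword matches; the function name and the dictionary layout are hypothetical.

```python
def rank_wordcount(query, fields, field_weights):
    keywords = set(query.split())
    rank = 0
    for name, text in fields.items():
        # Count every occurrence of any query keyword in this field.
        occurrences = sum(1 for word in text.split() if word in keywords)
        rank += field_weights.get(name, 1) * occurrences
    return rank


doc = {"title": "hello hello world", "body": "world of wonders"}
rank = rank_wordcount("hello world", doc, {"title": 5, "body": 3})
```

The title has 3 matching occurrences and the body has 1, so `rank` is 3*5 + 1*3 = 18. Word order and frequency statistics play no role.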
Finally, SPH_RANK_FIELDMASK returns a bitmask of the fields that matched the query. (Not a very complicated ranker either...)
rank = 0
foreach ( field in matching_fields )
rank |= ( 1<< index_of ( field ) )
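On the client side, such a weight is decoded by testing individual bits. The sketch below builds the mask the same way as the pseudocode and then maps it back to field names; field order, names and tokenization are illustrative assumptions.

```python
def rank_fieldmask(query, field_texts):
    """field_texts is a list in index order, so position i corresponds to bit i."""
    keywords = set(query.split())
    mask = 0
    for index, text in enumerate(field_texts):
        if keywords & set(text.split()):
            mask |= 1 << index
    return mask


def matched_fields(mask, field_names):
    """Translate a field bitmask back into the names of the matched fields."""
    return [name for i, name in enumerate(field_names) if mask & (1 << i)]


names = ["title", "body", "tags"]
mask = rank_fieldmask("hello world", ["hello world", "nothing here", "world peace"])
hits = matched_fields(mask, names)
```

Fields 0 and 2 match, so `mask` is 5 (binary 101) and `hits` is `["title", "tags"]`.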
About stars
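A question that comes up regularly is what the maximum possible weight is, and how to correctly convert weights into stars (usually 5 of them, though variations are possible), percentages, or points on an intuitive scale from 1 to 17.
As should be clear from the above, the weight depends heavily on the ranker, and on the query as well! For example, the absolute maximum weight that SPH_RANK_PROXIMITY_BM25 can output depends on the number of keywords and on the field weights: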
max_proximity_bm25 = num_keywords * sum ( field_weights ) * 1000 + 999
However, this absolute maximum is reached only when, first, every field contains an exact copy of the query and, second, the query searches all fields. But what if the query is restricted to one specific field (for example, @title hello world)?
For that particular kind of query the absolute maximum is unreachable in principle; the highest weight actually possible is:
max_title_query_weight = num_keywords * title_field_weight * 1000 + 999
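So computing the "right" maximum weight analytically is rather hard. If you cannot live without stars, you can either compute the "absolute" maximum (which will almost never be reached), which is relatively easy, or take the maximum weight within each particular result set and normalize everything against it. With the multi-query mechanism this comes almost for free, but that is a topic for a separate article.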
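Both options can be sketched in a few lines. The first function is the formula for the absolute maximum given above; the second normalizes against the best weight in the current result set. Rounding up so that every match gets at least one star is a design choice of this sketch, not something prescribed by Sphinx.

```python
def max_proximity_bm25(num_keywords, field_weights):
    """Absolute maximum for SPH_RANK_PROXIMITY_BM25, per the formula in the text."""
    return num_keywords * sum(field_weights) * 1000 + 999


def stars(weights, max_stars=5):
    """Map the weights of one result set onto 1..max_stars, relative to its best match."""
    if not weights:
        return []
    top = max(weights)  # assumed positive, as ranker weights of real matches are
    # Integer ceiling division: the best match always gets max_stars.
    return [(max_stars * w + top - 1) // top for w in weights]


absolute_max = max_proximity_bm25(2, [5, 3])
ratings = stars([13749, 8749, 2500])
```

For a two-keyword query with field weights 5 and 3, `absolute_max` is 16999, while the per-result-set normalization yields `[5, 4, 1]` for the three sample weights.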
About exact matches (update)
A good question came up in the comments: why is a full match between the query and a field not ranked at the top, and how can that be achieved?
The answer lies in how the rankers work. When the query occurs in a field in full, the default proximity and proximity_bm25 rankers assign that field the maximum phrase weight (which is the main ranking factor). But neither the length of the field itself nor the fact that the field matched the query in its entirety is taken into account. That is simply how the behavior evolved historically. Apparently, for some reason, queries that fully match one field or another used to be uncommon.
Work on this is already under way in the current trunk (version 0.9.10): an experimental ranker, SPH_RANK_SPH04, has been added, which does exactly that and ranks full-field matches higher. The technical possibility appeared, I believe, in 0.9.9; in 0.9.8 the index format does not provide the necessary data (funnily enough, the stored positions lack a "this is the end of the field" flag).
Some things can be done even without new rankers.
For example, the weight can be boosted manually on an exact match. To do that, strip all punctuation and uppercase by hand from both the field itself (at indexing time) and the query, compute crc32 of the field, and store it in an index attribute. Then, at search time, compute the expression weight + if(fieldcrc==querycrc,1000,0) and sort the results by it. Rather crooked, but it can help in some cases.
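The preprocessing half of this trick can be sketched as follows. The exact normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are an assumption; whatever rules are chosen must be identical at indexing time and at query time. `boosted_weight` only mirrors the search-time expression for illustration; in practice that expression is evaluated by the search engine.

```python
import string
import zlib

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalized_crc32(text):
    """crc32 of the text with case, ASCII punctuation and extra whitespace removed."""
    cleaned = text.lower().translate(_PUNCTUATION)
    cleaned = " ".join(cleaned.split())
    return zlib.crc32(cleaned.encode("utf-8"))


def boosted_weight(weight, field_crc, query_crc, boost=1000):
    """Mirror of the expression weight + if(fieldcrc==querycrc,1000,0)."""
    return weight + (boost if field_crc == query_crc else 0)


field_crc = normalized_crc32("Hello, World!")   # stored as an attribute at indexing time
query_crc = normalized_crc32("hello   world")   # computed from the user's query
exact = boosted_weight(2749, field_crc, query_crc)
```

Because both strings normalize to `hello world`, the two checksums are equal and `exact` is 3749. A field with any extra word produces a different normalized string, and in practice a different checksum, so it gets no boost.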
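Sometimes the task can be changed slightly so that the length of the document (or of a particular field) is taken into account instead of the fact of an exact match. To do that, store LENGTH(myfield) in an attribute at indexing time, and at search time rank by an expression of the following form (this is just an example!):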
weight + ln(max(len,1))*1000
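For a feel of how such an expression behaves, here is the same formula in Python. It is only a numeric illustration of the example expression; the function name is hypothetical, and the coefficient and even the sign of the length term are things to tune for a given collection.

```python
import math


def length_adjusted_weight(weight, length):
    # max(length, 1) guards against ln(0) for empty fields.
    return weight + math.log(max(length, 1)) * 1000


short = length_adjusted_weight(2749, 1)     # ln(1) = 0, weight unchanged
empty = length_adjusted_weight(2749, 0)     # clamped to length 1
longer = length_adjusted_weight(2749, 100)  # adds ln(100) * 1000
```

Because the logarithm grows slowly, a field 100 times longer shifts the weight by about 4605, roughly the equivalent of a few extra keywords of phrase weight under the default ranker.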
In some cases (such as the song-title indexing example from the comments) it can make sense to index the fields glued together rather than separately, so that a match of the phrase "band - song" gets a higher phrase weight in the "glued" field. Otherwise all fields are treated separately, and a match "across" a field boundary will not be ranked higher.
About room for filing things to fit
So is there any way to bend all this ranking to your own needs? There is. Once you understand how the existing rankers are built, which factors they compute and how they combine them, you can immediately and consciously tweak the weight on the fly (more precisely, specify a new arithmetic expression that involves the weight and sort the output by it). More interestingly, it is technically possible to add new specialized rankers (a minute of coming clean: the SPH_RANK_WORDCOUNT and SPH_RANK_FIELDMASK rankers were not my invention; commercial users asked for them). From the ranker's C++ code you have immediate access to the query, the keywords, the document being ranked, and (most importantly) the list of all keyword occurrences that matched the query, along with their positions in the document. The main thing is to apply the file skillfully.