Automated incremental (delta) index configuration for coreseek
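A common situation is that the full data set is so large that rebuilding the index frequently is impractical, while the number of records added each time is relatively small. A typical example: a forum has 1,000,000 archived posts but receives only 1,000 new posts per day.
In this case, near-real-time index updates can be achieved with the so-called "main + delta" scheme. The basic idea is to set up two data sources and two indexes: a main index for data that rarely or never changes, and a delta index for newly added documents. In the example above, the 1,000,000 archived posts go into the main index and the 1,000 new posts per day go into the delta index. The delta index can be rebuilt very frequently, so documents become searchable within a few minutes of appearing.
Deciding which index a given document belongs to can be automated. One option is to create a counter table that records the document ID splitting the document set into two parts, and to update this table every time the index is built.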
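The counter-table idea can be tried out in isolation. The following is a minimal sketch, not the production setup: it assumes an in-memory SQLite database as a stand-in for MySQL, and a toy subject table with ten rows. The table and column names (subject, sph_counter, counter_id, max_doc_id) and the counter_id value 2 follow the configuration in this article; everything else is made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE sph_counter ("
    "counter_id INTEGER PRIMARY KEY NOT NULL, max_doc_id INTEGER NOT NULL)"
)

# Seven "archived" documents exist when the main index is built.
conn.executemany(
    "INSERT INTO subject (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 8)],
)

# Record the dividing document ID under counter_id = 2.
conn.execute("REPLACE INTO sph_counter SELECT 2, MAX(id) FROM subject")

# Three new documents arrive after the main build.
conn.executemany(
    "INSERT INTO subject (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(8, 11)],
)

BOUNDARY = "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 2)"

# Main source: everything up to and including the recorded ID.
main_ids = [row[0] for row in conn.execute(
    f"SELECT id FROM subject WHERE id <= {BOUNDARY} ORDER BY id")]

# Delta source: everything above the recorded ID.
delta_ids = [row[0] for row in conn.execute(
    f"SELECT id FROM subject WHERE id > {BOUNDARY} ORDER BY id")]

print(main_ids)   # [1, 2, 3, 4, 5, 6, 7]
print(delta_ids)  # [8, 9, 10]
```

Because both queries compare against the same stored ID, every document falls into exactly one of the two sets.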
Delta index configuration
# Data source for the subject main index
source pingo_subject_main
{
type = mysql
sql_host = 127.0.0.1
sql_user = username
sql_pass = password
sql_db = dbname
sql_port = 3306
sql_query_pre = SET NAMES utf8
sql_query_pre = SET SESSION query_cache_type = OFF
sql_query_pre = CREATE TABLE IF NOT EXISTS sph_counter(counter_id INTEGER PRIMARY KEY NOT NULL, max_doc_id INTEGER NOT NULL) DEFAULT CHARSET=utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 2, MAX(id) FROM subject
sql_query_range = SELECT MIN(id), MAX(id) FROM subject WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 2)
sql_range_step = 1000
sql_ranged_throttle = 0
sql_query = SELECT id, id AS subjectid, title, content, imageUrl, imageUrl2, posterUrl, topicCnt, readCnt, userCnt, orderVal, isActivity, activityTitle, activityUrl, UNIX_TIMESTAMP(addTime) AS addTime, UNIX_TIMESTAMP(updateTime) AS updateTime, UNIX_TIMESTAMP(onlineTime) AS onlineTime, isOfficial, isHot, state FROM subject \
WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 2) \
AND id >= $start AND id <= $end
sql_attr_uint = subjectid
sql_field_string = title
sql_attr_string = content
sql_attr_string = imageUrl
sql_attr_string = imageUrl2
sql_attr_string = posterUrl
sql_attr_uint = topicCnt
sql_attr_uint = readCnt
sql_attr_uint = userCnt
sql_attr_uint = orderVal
sql_attr_uint = isActivity
sql_attr_string = activityTitle
sql_attr_string = activityUrl
sql_attr_timestamp = addTime
sql_attr_timestamp = updateTime
sql_attr_timestamp = onlineTime
sql_attr_uint = isOfficial
sql_attr_uint = isHot
sql_attr_uint = state
sql_query_info = SELECT id, title, content, state FROM subject WHERE id = $id
}
# Data source for the subject delta index
source pingo_subject_delta : pingo_subject_main
{
sql_query_pre = SET NAMES utf8
sql_query_post_index = UPDATE sph_counter SET max_doc_id = $maxid WHERE counter_id = 2 AND $maxid > 0
sql_query_range = SELECT MIN(id), MAX(id) FROM subject WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 2)
sql_query = SELECT id, id AS subjectid, title, content, imageUrl, imageUrl2, posterUrl, topicCnt, readCnt, userCnt, orderVal, isActivity, activityTitle, activityUrl, UNIX_TIMESTAMP(addTime) AS addTime, UNIX_TIMESTAMP(updateTime) AS updateTime, UNIX_TIMESTAMP(onlineTime) AS onlineTime, isOfficial, isHot, state FROM subject \
WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 2) \
AND id >= $start AND id <= $end
}
# subject main index
index subject
{
source = pingo_subject_main
path = /usr/local/coreseek/var/data/subject
docinfo = extern
charset_dictpath = /usr/local/mmseg3/etc
charset_type = zh_cn.utf-8
ngram_len = 0
}
# subject delta index
index subject_delta
{
source = pingo_subject_delta
path = /usr/local/coreseek/var/data/subject_delta
docinfo = extern
charset_dictpath = /usr/local/mmseg3/etc
charset_type = zh_cn.utf-8
ngram_len = 0
}
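Note that in the example above we explicitly set the sql_query_pre option in the pingo_subject_delta source, overriding the values inherited from pingo_subject_main. This override is required: otherwise the REPLACE query would also run when the delta is indexed, and the delta source would then select no rows. Simply setting the delta's sql_query_pre to an empty value does not work either, because the first time this directive is used in an inheriting source, all inherited values for it are cleared, so the encoding setup would be lost as well. The encoding query therefore has to be issued again explicitly.
With this configuration, every index update stores the highest document id indexed so far in the sph_counter table (the main build through the REPLACE query, the delta build through sql_query_post_index). The next update starts from that position and indexes only the rows added since.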
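The way the counter advances across successive delta builds can be sketched as follows. This is a simulation only, again assuming an in-memory SQLite database in place of MySQL. The helper names build_delta and counter_value are hypothetical, and the sketch assumes that $maxid is 0 when a delta run fetches no rows, which is what the "$maxid > 0" guard in the configuration suggests.

```python
import sqlite3

BOUNDARY = "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 2)"


def build_delta(conn):
    """Mimic one delta run: select rows above the counter, then advance it."""
    ids = [row[0] for row in conn.execute(
        f"SELECT id FROM subject WHERE id > {BOUNDARY} ORDER BY id")]
    # Stand-in for $maxid; assumed to be 0 when nothing was fetched.
    max_id = max(ids, default=0)
    # Mirrors sql_query_post_index, including the "$maxid > 0" guard.
    conn.execute(
        "UPDATE sph_counter SET max_doc_id = ? "
        "WHERE counter_id = 2 AND ? > 0",
        (max_id, max_id),
    )
    return ids


def counter_value(conn):
    """Return the dividing document ID currently stored in sph_counter."""
    return conn.execute(
        "SELECT max_doc_id FROM sph_counter WHERE counter_id = 2"
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE sph_counter ("
    "counter_id INTEGER PRIMARY KEY NOT NULL, max_doc_id INTEGER NOT NULL)"
)

# Main build: five existing rows, counter set to MAX(id) = 5.
conn.executemany("INSERT INTO subject VALUES (?, ?)",
                 [(i, "old") for i in range(1, 6)])
conn.execute("REPLACE INTO sph_counter SELECT 2, MAX(id) FROM subject")

# Two new rows arrive, then the first delta run picks them up.
conn.executemany("INSERT INTO subject VALUES (?, ?)", [(6, "new"), (7, "new")])
first = build_delta(conn)
after_first = counter_value(conn)

# One more row arrives; the second run sees only that row.
conn.execute("INSERT INTO subject VALUES (8, 'newer')")
second = build_delta(conn)

# A run with no new rows selects nothing and leaves the counter alone.
empty = build_delta(conn)
final = counter_value(conn)

print(first, after_first)    # [6, 7] 7
print(second, empty, final)  # [8] [] 8
```

Note that the second run no longer contains ids 6 and 7. In this simulation, rows from an earlier delta run drop out of the next delta, which is why each delta has to be merged into the main index before the following delta build.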
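Building the main index
The main index is built when the system is first put into use. Afterwards it is rebuilt by a scheduled job, which can be timed for periods of low system load.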
/usr/local/coreseek/bin/indexer --rotate --quiet --config /usr/local/coreseek/etc/csft.conf subject
Building the delta index
The delta index can be rebuilt more frequently; how often depends on how quickly new data is written.
/usr/local/coreseek/bin/indexer --rotate --quiet --config /usr/local/coreseek/etc/csft.conf subject_delta
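Merging the indexes
After the delta index has been built, it needs to be merged into the main index so that queries against the subject index return the new documents. The merge can be run immediately after the delta build.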
/usr/local/coreseek/bin/indexer --rotate --quiet --config /usr/local/coreseek/etc/csft.conf --merge subject subject_delta
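For a scheduled job, the delta build and the merge can be chained so that the merge only runs when the delta build succeeded. The following is a minimal sketch under stated assumptions: the function names (delta_command, merge_command, refresh) are hypothetical, the paths and arguments are copied from the commands above, and indexer is assumed to return a nonzero exit code on failure.

```python
import subprocess

INDEXER = "/usr/local/coreseek/bin/indexer"
CONFIG = "/usr/local/coreseek/etc/csft.conf"


def delta_command():
    """Argument list for rebuilding the subject_delta index."""
    return [INDEXER, "--rotate", "--quiet", "--config", CONFIG,
            "subject_delta"]


def merge_command():
    """Argument list for merging subject_delta into subject."""
    return [INDEXER, "--rotate", "--quiet", "--config", CONFIG,
            "--merge", "subject", "subject_delta"]


def refresh(run=None):
    """Build the delta index, then merge it into the main index.

    `run` takes an argument list and returns an exit code. It defaults to
    executing the command with subprocess, and can be replaced for testing.
    Returns 0 on success, or the first nonzero exit code encountered.
    """
    if run is None:
        run = lambda argv: subprocess.run(argv).returncode
    for argv in (delta_command(), merge_command()):
        code = run(argv)
        if code != 0:
            # Skip the merge if the delta build failed.
            return code
    return 0


# Typical use from a scheduled job (not executed here):
#     raise SystemExit(refresh())
```

Passing the runner in as a parameter keeps the sketch testable without a coreseek installation, for example by supplying a function that only records the argument lists.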