2019-08-07

Deleting Elasticsearch indices with special characters in their names

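Long story short: while checking our production Elasticsearch (version 5.6), I found a number of indices with non-ASCII characters in their names, such as these: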

red zhangxin-xxx-༠༠༣༡.༠༣.༢༢
red zhangxin-xxx-༠༠༣༡.༠༣.༢༣
red zhangxin-xxx-༠༠༣༡.༠༣.༢༤
red zhangxin-xxx-༠༠༣༡.༠༤.༡༢
red zhangxin-xxx-༠༠༣༡.༠༤.༡༧
red zh날炷gxꆀ鍀ᒶ⒐ጆ䬯ꀳ20₨炠.021

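These come from a system that, for historical reasons, builds index names from user-supplied data. Since I can't get that system changed for now, the indices have to be deleted periodically by a script.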

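If you are not interested in the process, skip ahead to Summary 2 for the method.
Aside: the characters above look like Tibetan. Bing and Sogou Translate detected a Nordic language (and rendered it as "Oh my god"), which neither matches nor looks plausible, whereas Google Translate rendered it as a date (0031.03.22), which seems closer.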

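However, running curl -XDELETE '10.135.20.38:9200/zhangxin-xxx-༠༠༣༡.༠༣.༢༢' directly reports that the index does not exist, because the non-ASCII characters in the name have to be URL-encoded first.
Elasticsearch also has no POST-based way to delete an index, so the commands have to be changed to: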

curl -XDELETE 10.135.20.38:9200/zhangxin-xxx-%e0%bc%a0%e0%bc%a0%e0%bc%a3%e0%bc%a1.%e0%bc%a0%e0%bc%a3.%e0%bc%a2%e0%bc%a2
curl -XDELETE 10.135.20.38:9200/zhangxin-xxx-%e0%bc%a0%e0%bc%a0%e0%bc%a3%e0%bc%a1.%e0%bc%a0%e0%bc%a3.%e0%bc%a2%e0%bc%a3
curl -XDELETE 10.135.20.38:9200/zhangxin-xxx-%e0%bc%a0%e0%bc%a0%e0%bc%a3%e0%bc%a1.%e0%bc%a0%e0%bc%a3.%e0%bc%a2%e0%bc%a4
curl -XDELETE 10.135.20.38:9200/zhangxin-xxx-%e0%bc%a0%e0%bc%a0%e0%bc%a3%e0%bc%a1.%e0%bc%a0%e0%bc%a4.%e0%bc%a1%e0%bc%a2
curl -XDELETE 10.135.20.38:9200/zhangxin-xxx-%e0%bc%a0%e0%bc%a0%e0%bc%a3%e0%bc%a1.%e0%bc%a0%e0%bc%a4.%e0%bc%a1%e0%bc%a7
curl -XDELETE 10.135.20.38:9200/zhangxin-xxx-%e0%bc%a0%e0%bc%a0%e0%bc%a3%e0%bc%a1.%e0%bc%a0%e0%bc%a4.%e0%bc%a1%e0%bc%a8
curl -XDELETE '10.135.20.38:9200/zh%EB%82%A0%E7%82%B7gx%EA%86%80%00%E9%8D%80%E1%92%B6%E2%92%90%E1%8C%86%01%00%E4%AC%AF%EA%80%B3%32%30%E2%82%A8%E7%82%A0%2E%30%1A%00%32%31'

Several index names can also be joined with commas in a single request, but to keep the script simple I issue one request per line.
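The percent-encoded names in the commands above are simply the UTF-8 bytes of each index name written as %XX. As a minimal sketch of that mapping (the function name urlencode_bytes is made up for this illustration, and it assumes od, tr and awk are available), unreserved ASCII characters are kept as-is and every other byte is escaped:

```shell
# Hypothetical helper: percent-encode a string byte by byte, leaving
# RFC 3986 unreserved characters (A-Z a-z 0-9 - . _ ~) untouched.
urlencode_bytes() {
    # od prints each byte as a lowercase hex pair; tr puts one pair per line.
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | awk '
        BEGIN {
            # Map the hex code of every unreserved character back to itself.
            for (i = 45; i < 127; i++) {
                c = sprintf("%c", i)
                if (c ~ /^[A-Za-z0-9._~-]$/) keep[sprintf("%02x", i)] = c
            }
        }
        NF { out = out (($1 in keep) ? keep[$1] : "%" $1) }
        END { print out }'
}

# Encode the first index name from the list at the top.
urlencode_bytes 'zhangxin-xxx-༠༠༣༡.༠༣.༢༢'
```

On a UTF-8 system the last line should print the same path as the first curl command above, which is a quick way to check an encoded name before deleting anything.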

So how do we locate the indices that contain these abnormal characters?

curl -s '10.135.20.38:9200/_cat/indices?v' | grep -P '[\xB0\xA1-\xF7\xFE]+'
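The byte ranges in that pattern are fairly ad hoc. As a hedged alternative (my assumption, not what was run in production), forcing the C locale makes GNU grep match raw bytes, so any byte with the high bit set flags a non-ASCII name. The helper name non_ascii_lines and the sample lines are hypothetical:

```shell
# Keep only the lines that contain at least one non-ASCII byte (0x80-0xff).
# LC_ALL=C: match bytes rather than decoded characters; -a: treat input as text.
non_ascii_lines() {
    LC_ALL=C grep -a -P '[\x80-\xFF]'
}

# Made-up sample in the shape of _cat/indices output (health, status, index).
printf '%s\n' \
    'green open zhangxin-xxx-2019.07.01' \
    'red   open zhangxin-xxx-༠༠༣༡.༠༣.༢༢' |
    non_ascii_lines | awk '{print $3}'
```

Against the real cluster the function would sit in the same place as the grep in the command above, reading the curl output from stdin.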

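The indices listed at the top were found with that grep command. But after deleting them as described, I found that the index 'zh날炷gxꆀ鍀ᒶ⒐ጆ䬯ꀳ20₨炠.021' was still there.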

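This left me puzzled, until I saved the grep output and looked at it in hex mode. It turned out that when I copied the index name from the server by hand, the bytes that cannot be rendered as printable characters were lost in the copy.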

curl -s '10.135.20.38:9200/_cat/indices?v' | grep -P '[\xB0\xA1-\xF7\xFE]+'
green  open   zh날炷gxꆀ鍀ᒶ⒐ጆ䬯ꀳ20₨炠.021          tfpRU2KeRCG6yBWhYq5J2w   5   1          1            0      9.2kb          4.6kb

# Open the output above in vi's hex mode; part of it is shown below
0000000: 6772 6565 6e20 206f 7065 6e20 2020 7a68  green  open   zh
0000010: eb82 a0e7 82b7 6778 ea86 8000 e98d 80e1  ......gx........
0000020: 92b6 e292 90e1 8c86 0100 e4ac afea 80b3  ................
0000030: 3230 e282 a8e7 82a0 2e30 1a00 3231       20.......0..21

As the dump shows, bytes get lost when the name is copied as a string, so it is better to do the deletion properly with a script.
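To make such hidden bytes visible without relying on a terminal copy, the raw output can be inspected byte by byte. A small sketch, assuming od, tr and wc are available; count_control_bytes and sample are names invented here, and sample holds only the first few bytes of the index name as they appear in the hex dump above:

```shell
# Count the control bytes (0x00-0x1f, newline included) arriving on stdin.
count_control_bytes() {
    LC_ALL=C tr -cd '\000-\037' | wc -c | tr -d ' '
}

# Hypothetical sample: "zh", three multi-byte characters around "gx", then a
# NUL byte and one more character, written as octal escapes for portability.
sample() {
    printf 'zh\353\202\240\347\202\267gx\352\206\200\000\351\215\200'
}

sample | od -An -v -tx1        # raw bytes, one hex pair per byte
sample | count_control_bytes   # prints 1: the NUL byte that a copy would drop
```

A count above zero on an index name is a hint that copying it by hand will not reproduce the real name.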

Summary 1

Here is the complete implementation:

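# First, a function that URL-encodes a string using curl
function urlencode() {
    local data
    if [[ $# != 1 ]]; then
        echo "Usage: urlencode str" >&2
        return 1
    fi
    # The URL is empty on purpose: we only read back %{url_effective},
    # which comes out as "/?<encoded string>"
    data="$(curl -s -o /dev/null -w '%{url_effective}' --get --data-urlencode "$1" "")"
    # Strip the leading "/?" and print the encoded string
    echo "${data##/?}"
    return 0
}
# Delete a single index, URL-encoding its name first
function callDel() {
    local indx
    indx=$(urlencode "$1")
    curl -s -XDELETE "10.135.20.38:9200/$indx"
}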

Putting it all together, the complete script is:

curl -s '10.135.20.38:9200/_cat/indices?v' | grep -P '[\xB0\xA1-\xF7\xFE]+' | \
    awk '{print $3}' | xargs -I@ -P4 bash -c "$(declare -f urlencode; declare -f callDel); callDel @; echo @"
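Because these index names are derived from user data, substituting them into a bash -c string via xargs -I@ lets the shell re-interpret whatever they contain. One possible way around that, sketched under assumptions: read the names in a loop and pass them only as quoted arguments. The names encode_all and delete_indices and the DRY_RUN switch are hypothetical, encode_all escapes every byte (valid, just verbose), and names containing NUL bytes are not handled because a shell variable cannot hold them.

```shell
# Hypothetical variant of the cleanup loop; not the original script.
ES_HOST=${ES_HOST:-10.135.20.38:9200}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands instead of running them

# Percent-encode every byte of the argument.
encode_all() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g'
}

# Read index names from stdin, one per line, and delete each one.
delete_indices() {
    local name url
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        url="$ES_HOST/$(encode_all "$name")"
        if [ "$DRY_RUN" = 1 ]; then
            printf 'curl -s -XDELETE %s\n' "$url"
        else
            curl -s -XDELETE "$url"
        fi
    done
}

# Dry run with a made-up name; nothing is sent to the cluster.
printf '%s\n' 'zh-༠༡' | delete_indices
```

With DRY_RUN left at 1 the loop only prints the curl commands it would run, so the list can be reviewed before setting DRY_RUN=0.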

Aside: at first I suspected that this urlencode function was wrong, but python -c "import urllib;print urllib.quote(raw_input())" <<< "zhangxin-xxx-༢༥༦༢.༠༤.༠༡" (Python 2) produces the same encoding.

Summary 2

The method above works, but it feels a bit cumbersome and not very Elasticsearch-like.

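I happened to read through Elasticsearch's support for index name matching. It clearly does not support regular expressions, but it does support wildcards, include and exclude. For the code, see Elasticsearch's IndexNameExpressionResolver implementation, where innerResolve checks what is supported. In particular, an exclude pattern is supported.
That means the deletion can also be done like this: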

curl -XDELETE '10.135.20.38:9200/zhangxin-xxx-*,-zhangxin-xxx-2019.07.*'

This deletes every index whose name starts with zhangxin-xxx-, except those whose names start with zhangxin-xxx-2019.07.
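Since a wildcard delete is hard to undo, it can help to preview which names an include/exclude pair would select. The sketch below only imitates that rule locally with shell glob patterns; it is not Elasticsearch's resolver, and preview_match and the sample names are hypothetical:

```shell
# Print the names from stdin that match the include pattern but not the
# exclude pattern, i.e. the ones the wildcard delete would remove.
preview_match() {
    local include=$1 exclude=$2 name
    while IFS= read -r name; do
        # Unquoted variables in case patterns are treated as globs.
        case $name in
            $exclude) ;;                        # excluded: the index is kept
            $include) printf '%s\n' "$name" ;;  # included: would be deleted
        esac
    done
}

printf '%s\n' \
    zhangxin-xxx-2019.07.01 \
    zhangxin-xxx-2019.06.30 \
    other-2019.07.01 |
    preview_match 'zhangxin-xxx-*' 'zhangxin-xxx-2019.07.*'
# prints: zhangxin-xxx-2019.06.30
```

Feeding it the real index list first shows that anything under the prefix outside 2019.07 is selected, including older well-formed indices, so the exclude pattern has to cover everything worth keeping.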

This approach is less general than the one above, but it is very simple and easy to understand.