Morphological Analyzer for Japanese | 01 Dictionary Tree
About
This follows the book 自然语言处理入门 (Introduction to Natural Language Processing) by 何晗, the author of HanLP.
The main goal is to build a toy word segmenter; practical usability is not the point.
Preparing the Dictionary
The book provides a Chinese dictionary to work with, but my goal is Japanese segmentation, so I need a Japanese dictionary instead. Here I use the EDICT Dictionary File as a simple Japanese dictionary, and I only use its headword information, ignoring all the other fields.
Obtaining and Processing the Data
Download the edict.gz archive from the website and decompress it to get an EUC-JP encoded file, then convert it to UTF-8 with iconv (see iconv (Wikipedia); on macOS it can be installed via Homebrew):
Download the dictionary (downloading it directly from the website also works):
wget http://ftp.edrdg.org/pub/Nihongo/edict.gz
Decompress the .gz file:
gunzip edict.gz
Then convert the character encoding:
iconv -f EUC-JP -t UTF-8 edict > edict-utf8
You can preview part of the dictionary with head edict-utf8 (or tail):
tail edict-utf8 # head is also ok
凜々しい [りりしい] /(adj-i) gallant/manly/brave/imposing/awe-inspiring/chivalrous/dignified/
凜乎 [りんこ] /(adj-t,adv-to) commanding/awe-inspiring/
凜然 [りんぜん] /(adj-t,adv-to) (1) commanding/awe-inspiring/(adj-t,adv-to) (2) bitter (cold)/piercing/
凜烈 [りんれつ] /(adj-na,adj-t,adv-to) (form) biting (cold)/severe/intense/piercing/
凜冽 [りんれつ] /(adj-na,adj-t,adv-to) (form) biting (cold)/severe/intense/piercing/
凜凜 [りんりん] /(adj-t,adv-to) (1) severe/intense/awe-inspiring/commanding/(adj-t,adv-to) (2) biting/bitter (cold)/piercing/frigid/
凜凜しい [りりしい] /(adj-i) gallant/manly/brave/imposing/awe-inspiring/chivalrous/dignified/
熙々 [きき] /(adj-t,adv-to) (1) (rare) relaxed and enjoyable/(adj-t,adv-to) (2) (rare) spacious/vast/(adj-t,adv-to) (3) (rare) heavily trafficked/bustling/busy/
熙春茶 [ひちゅんちゃ] /(n) hyson (Chinese green tea)/
熙熙 [きき] /(adj-t,adv-to) (1) (rare) relaxed and enjoyable/(adj-t,adv-to) (2) (rare) spacious/vast/(adj-t,adv-to) (3) (rare) heavily trafficked/bustling/busy/
As you can see, each entry roughly follows this format:
headword [reading] /(part of speech) definitions.../
Extract the headwords with Lisp:
(defparameter *dictionary*
  (with-open-file (stream "edict-utf8")
    ;; the headword is everything before the first space on each line
    (let ((scanner (ppcre:create-scanner "[^ ]+")))
      (loop for line = (read-line stream nil)
            while line
            collect (ppcre:scan-to-strings scanner line))))
  "EDICT dictionary headwords for Japanese.")
Later I may replace this with other Japanese dictionaries.
Dictionary-Based Segmentation
Full Segmentation
A naive full-segmentation method looks like this:
(defun simple-full-segment (text &optional (dic *dictionary*))
  "Fully segment TEXT by naively matching every substring against DIC."
  (let ((word-list nil))
    (loop for i below (length text) do
      (loop for j from (1+ i) upto (length text)
            for word = (subseq text i j)
            if (find word dic :test #'string=)
              do (push word word-list)))
    ;; output shall be in sequence
    (nreverse word-list)))
A quick look at what it produces:
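Assuming the input was the sentence 「大学で日本語を勉強します」 (which is consistent with the output shown below), the call would be:
(simple-full-segment "大学で日本語を勉強します")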
("大" "大学" "学" "で" "日" "日本" "日本語" "本" "語" "を" "勉強" "強" "し" "しま" "ま" "ます" "す")
Obvious things that could be improved:
- The exhaustive traversal really begs to be rewritten.
- The string matching also needs work; with an automaton it could probably be faster (a small stopgap sketch follows below).
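As a quick, assumed stopgap (not from the book), the list lookup with find can be replaced by a hash-table word set, which already speeds up the inner loop considerably; *dictionary-set* and simple-full-segment/hash below are hypothetical names for this variant:
(defparameter *dictionary-set*
  ;; a hash-table word set for O(1) membership tests
  (let ((set (make-hash-table :test 'equal)))
    (dolist (word *dictionary* set)
      (setf (gethash word set) t)))
  "Hash-table version of *dictionary* for fast lookup.")

(defun simple-full-segment/hash (text &optional (dic *dictionary-set*))
  "Same full segmentation as SIMPLE-FULL-SEGMENT, but with hash lookups."
  (let ((word-list nil))
    (loop for i below (length text) do
      (loop for j from (1+ i) upto (length text)
            for word = (subseq text i j)
            if (gethash word dic)
              do (push word word-list)))
    (nreverse word-list)))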
Longest Matching
The goal is to split a sentence into a sequence of words, and full segmentation is not very useful for that.
Forward Longest Matching
A simple first version is as follows:
(defun forward-segment (text &optional (dict *dictionary*))
  "Forward longest-matching segmentation."
  (if (zerop (length text))
      nil
      (multiple-value-bind (longest-word rest-text)
          (longest-prefix-segment text dict)
        (cons longest-word (forward-segment rest-text dict)))))

(defun longest-prefix-segment (text &optional (dict *dictionary*))
  "Find the longest word in DICT that is a prefix of TEXT.
Return the word and the rest of the string as multiple values."
  (loop with longest-word = ""
        with longest-len = 0
        with longest-idx = 0
        for i from 1 upto (length text)
        for word = (subseq text 0 i)
        for word-len = (length word)
        if (and (find word dict :test #'string=)
                (> word-len longest-len))
          do (setf longest-word word
                   longest-len word-len
                   longest-idx i)
        finally (return
                  (if (zerop longest-len)
                      (values :fail "")
                      (values longest-word (subseq text longest-idx))))))
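For reference, calling it on the same (assumed) example sentence as before looks like this; note what happens when no dictionary word starts at the current position:
(forward-segment "大学で日本語を勉強します")
;; returns the greedily matched words in order; if at some position no
;; prefix is found in the dictionary, a :FAIL shows up in the result and
;; the rest of the text is silently dropped (see the note below)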
A few notes
This assumes that the longest match is always the correct one. But what if it is not, and, say, the second-longest match is actually the right one? The algorithm above has no fallback for that, so here is a small modification.
(defun forward-segment* (text &optional (dict *dictionary*))
  "Forward longest-matching segmentation, trying matching prefixes longest first."
  (if (zerop (length text))
      nil
      (let ((patterns (longest-prefix-segments text dict)))
        (loop for (word . shift-idx) in patterns
              if (not (eq word :fail))
                return (cons word
                             (forward-segment* (subseq text shift-idx) dict))))))

(defun longest-prefix-segments (text &optional (dict *dictionary*))
  "Find all words in DICT that are prefixes of TEXT.
Return a list of (word . end-index) pairs, longest first."
  (sort (loop for i from 1 upto (length text)
              for word = (subseq text 0 i)
              if (find word dict :test #'string=)
                collect (cons word i))
        ;; for prefixes of the same string, STRING> puts longer ones first
        #'string> :key #'car))
This essentially turns it into a depth-first search.
Backward Matching
Doing it in the other direction also works; this gives backward longest matching:
(defun backward-segment* (text &optional (dict *dictionary*))
  "Backward longest-matching segmentation."
  (labels
      ((backward-segment (text)
         (if (zerop (length text))
             nil
             (loop for (word . shift-idx) in (longest-postfix-segments text dict)
                   if (not (eq word :fail))
                     return (cons word
                                  (backward-segment (subseq text 0 shift-idx)))))))
    (reverse (backward-segment text))))

(defun longest-postfix-segments (text &optional (dict *dictionary*))
  "Find all words in DICT that are suffixes of TEXT.
Return a list of (word . start-index) pairs, longest first."
  (let ((len (length text)))
    (sort (loop for i from (1- len) downto 0
                for word = (subseq text i len)
                if (find word dict :test #'string=)
                  collect (cons word i))
          #'> :key (lambda (pattern) (length (car pattern))))))
Note: the results feel mediocre, though; many things fail to match, and it is also slow... One reason may be that the dictionary is not large enough; another is probably that this string-matching process is simply too slow.
Bidirectional Matching
Compare the forward and backward segmentations and pick the better one, judged by the following criteria:
- prefer the result with fewer words;
- when the word counts are equal, prefer the one with fewer single-character words.
(defun bidirectional-segment (text &optional (dict *dictionary*))
  "Bidirectional matching: pick the better of the forward and backward results."
  (labels ((single-word-count (seq)
             ;; number of single-character words in the segmentation
             (loop with count = 0
                   for word in seq
                   if (= (length word) 1)
                     do (incf count)
                   finally (return count))))
    (let* ((forward (forward-segment* text dict))
           (backward (backward-segment* text dict))
           (forward-len (length forward))
           (backward-len (length backward)))
      (if (/= forward-len backward-len)
          ;; fewer words wins
          (if (< forward-len backward-len) forward backward)
          ;; otherwise fewer single-character words wins
          (let ((forward-count (single-word-count forward))
                (backward-count (single-word-count backward)))
            (if (< forward-count backward-count) forward backward))))))
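Usage is the same as before, e.g. on the same assumed test sentence:
(bidirectional-segment "大学で日本語を勉強します")
;; runs both forward-segment* and backward-segment* and keeps the better result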
The drawback is that it is even slower (sigh).
Dictionary Tree (Trie)
A trie can be implemented in Lisp roughly like this:
(defclass dictionary-tree-node ()
  ((chr :accessor chr
        :initarg :chr)
   (terminal :accessor dictionary-terminal-p
             :initform nil :initarg :terminal)
   (children :accessor dictionary-children
             :initform (make-hash-table :test 'equal))
   (parent :accessor dictionary-parent
           :initform nil))
  (:documentation "A node of the dictionary tree (trie)."))
(defun make-dictionary-tree-node (char &key terminal)
  "Create a dictionary-tree node."
  (make-instance 'dictionary-tree-node :chr char :terminal terminal))
(defmethod insert-word ((tree dictionary-tree-node) word)
  (if (zerop (length word))
      ;; end of word -> terminal
      (setf (dictionary-terminal-p tree) t)
      (let* ((char (subseq word 0 1))
             (node (gethash char (dictionary-children tree) nil)))
        (unless node
          ;; append tree if not exists
          (let ((new-node (make-dictionary-tree-node char)))
            (setf (gethash char (dictionary-children tree)) new-node
                  (dictionary-parent new-node) tree
                  node new-node)))
        (insert-word node (subseq word 1)))))
(defun make-dictionary-tree (dict)
  "Build a dictionary tree (trie) from the word list DICT."
  (let ((tree (make-dictionary-tree-node "")))
    (loop for word in dict do
      (insert-word tree word))
    tree))
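A quick sanity check on a tiny, made-up word list; the accessor chain below just walks 日, then 本, then 語 and reads the terminal flag:
(let ((tree (make-dictionary-tree '("日本" "日本語" "語"))))
  (dictionary-terminal-p
   (gethash "語" (dictionary-children
                  (gethash "本" (dictionary-children
                                 (gethash "日" (dictionary-children tree))))))))
;; => T, since "日本語" was inserted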
Let's see how it performs; the time needed to build a dictionary tree:
(time
 (defparameter *dictionary-tree*
   (make-dictionary-tree *dictionary*)
   "A simple Japanese dictionary tree."))
jp-nlp> (time (dotimes (i 100) (make-dictionary-tree *dictionary*)))
Evaluation took:
  63.475 seconds of real time
  63.065967 seconds of total run time (58.132774 user, 4.933193 system)
  [ Real times consist of 38.394 seconds GC time, and 25.081 seconds non-GC time. ]
  [ Run times consist of 37.901 seconds GC time, and 25.165 seconds non-GC time. ]
  99.36% CPU
  35,816,351,920 bytes consed
Note: a lot of the time seems to be spent in GC; presumably something in the program is not written very well...
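One likely source of the consing is that insert-word calls subseq on the remaining word at every level, allocating a fresh suffix string per character. A hypothetical index-based variant (my assumption, not from the book) avoids that while keeping the same node class:
(defun insert-word-by-index (tree word &optional (start 0))
  "Like INSERT-WORD, but walks WORD by index instead of consing
a new suffix string with SUBSEQ at every level."
  (if (= start (length word))
      (setf (dictionary-terminal-p tree) t)
      (let* ((char (subseq word start (1+ start)))
             (node (or (gethash char (dictionary-children tree))
                       (let ((new-node (make-dictionary-tree-node char)))
                         (setf (gethash char (dictionary-children tree)) new-node
                               (dictionary-parent new-node) tree)
                         new-node))))
        (insert-word-by-index node word (1+ start)))))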
Some Useless Visualization Work
(defgeneric graphviz-render (tree &optional headers)
  (:documentation "Render the tree into graphviz code."))

(defmethod graphviz-render ((tree dictionary-tree-node) &optional (headers ""))
  (labels
      ((renderer (node name)
         (let* ((point (chr node))
                (type (cond ((string= point "") "point")
                            ((dictionary-terminal-p node) "doublecircle")
                            (t "circle")))
                (nexts (alexandria:hash-table-alist (dictionary-children node))))
           (if (null nexts)
               (format nil "\"~a\" [shape=~a,label=\"~a\"];" name type point)
               (loop for (char . next) in nexts
                     for next-node-name = (gensym char)
                     collect next-node-name into next-node-names
                     collect (renderer next next-node-name) into next-nodes
                     finally (return (format nil "\"~a\" [shape=~a,label=\"~a\"];~&\"~a\" -> { ~{\"~a\"~^, ~} };~&~{~&~a~}"
                                             name type point name next-node-names
                                             next-nodes)))))))
    (format nil "digraph {~&~a~&~a~&}" headers (renderer tree "root"))))
(defun graphviz->svg (tree output &key (headers "") debug)
  "Render TREE with Graphviz and write the SVG to OUTPUT."
  (with-input-from-string (input (if debug
                                     (print (graphviz-render tree headers))
                                     (graphviz-render tree headers)))
    (uiop:run-program "dot -Tsvg" :input input :output (pathname output))
    output))
Some simple experiments:
(let ((pivot (random (- (length *dictionary*) 20))))
  ;; `output' is assumed to be bound elsewhere to an output path, e.g. "tree.svg"
  (graphviz->svg (make-dictionary-tree (subseq *dictionary* pivot (+ 20 pivot)))
                 output :headers "rankdir=LR;"))
Note: I had a slightly insane idea... but my machine cannot handle rendering an output that large, or maybe my algorithm just is not optimized enough, so I can only render a small graph as a demo:
(graphviz->svg (make-dictionary-tree (ryo:random-samples *dictionary* 100))
               output :headers "layout=fdp;")
With the dictionary tree, let's see whether the original forward longest-matching algorithm can be reworked to run a bit faster:
(defgeneric forward-segment (tree text)
  (:documentation "Segment TEXT using the dictionary tree TREE."))

(defgeneric longest-prefix-segments (tree text)
  (:documentation "Match all words in the dictionary tree TREE that are prefixes of TEXT.
Return a list of (word . offset-from-start) pairs."))
(defmethod longest-prefix-segments ((tree dictionary-tree-node) text)
  (loop with node = tree
        with word-list = nil
        for shift from 1 upto (length text)
        do (setf node (gethash (subseq text (1- shift) shift)
                               (dictionary-children node) nil))
        while node
        if (dictionary-terminal-p node)
          do (push (cons (subseq text 0 shift) shift) word-list)
        finally (return word-list)))
(defmethod forward-segment ((tree dictionary-tree-node) text)
  (if (zerop (length text))
      nil
      (let ((patterns (longest-prefix-segments tree text)))
        (if patterns
            (let ((res (loop for (word . shift) in patterns
                             for next = (forward-segment tree (subseq text shift))
                             if (not (eq next :fail))
                               return (cons word next))))
              (if res res :fail))
            :fail))))
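For reference, calls look like the following; the exact results depend on what the EDICT-derived word list contains, so the value shown is only what one would expect with this dictionary:
(longest-prefix-segments *dictionary-tree* "日本語を勉強します")
;; => something like (("日本語" . 3) ("日本" . 2) ("日" . 1)), longest first

(forward-segment *dictionary-tree* "大学で日本語を勉強します")
;; => a list of words, or :FAIL if some prefix cannot be matched at all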
Well, this is much faster now. But the results still feel mediocre; I am clearly not very good at this yet...
Pausing for Now
I plan to go work on something else and come back to this later; it is just for fun anyway.