又是一个新坑

我不管, 好玩玩它就是了.

又, 这本书里面现在看起来故事挺多的, 感觉挺有意思的.

Finite-State Techniques

Finite-state automata (FSAs) are among the simplest computing machines that can be envisaged. They are well understood mathematically, easy to implement and efficient at doing what they do. If an NLP problem can be conveniently solved with a finite-state automaton, then it is probably a good idea to solve it that way. However. FSAs are subject to certain formal limitations that render them ill suited to certain computational linguistic tasks. This chapter provides a fairly comprehensive introduction to these machines and their implementation, indicates possible areas of application and gives some concrete examples of their use, and examines their limitations.

Finite-state transition networks

如何判断输入的文本的语言类型?

词频分析:
比如对于不同的语言文本, 其字母出现的频率不同, 所以可以通过这样的方式来判断.
暴力枚举:
假设所有的语言都存在对应的处理程序, 对所有的处理程序, 都进行一次匹配, 如果不成功的话就使用下一个语言, 这样的坏处是比较慢.

简单的自动机的例子

一个开怀大笑的例子:

/_img/reading/lisp-nlp/laughing-machine.svg

上图即为一个简单的 (只有四个状态的) 描述 FSA (有限自动机) 的 FSTN (有限状态转移图). 其中在第三个状态 3 处, 在两条道路 ”h” 和 ”!” 之间会有一定的机率发生转换. 这个转换的依据可以是外部的一个随机数生成器这样之类的东西.

使用上述自动机, 还能够用来处理输入, 比如将输入的符号扔到里面进行匹配. (那么说起来, 这个的感觉有点像是一种正则表达式进行匹配的感觉. )

当然, 描述一个自动机可以有多种可能:

/_img/reading/lisp-nlp/laughing-machine-reading.svg

不过和前一种自动机不同的是, 这个自动机是 non-deterministic (非确定的) 的自动机. 从 2 状态的两个 a 转换的规则即可看出.

除了上面的未定分支, 还可以出现跳转分支 (jump arc, 或者说, 无标签的跳转 unlabelled arcs):

/_img/reading/lisp-nlp/laughing-machine-with-jump-arc.svg

注: 这一段是想要说明 FSTN 和 FSA 的一些细微的区别, 即对于 FSTN 来说, 上面的网络都是可行的, 但是对于 FSA 来说, 这样的网络并不都是可行的. 但是一般为了方便, 所以会在不出现歧义的情况下将二者等同.

于是上面就给出了一个如何处理语言 (或者说, 如何识别特定语言) 的思路: 即对一种语言, 构建出关于其的自动机网络 FSTN, 然后对其识别即可.

A notation for networks

为了达到使用 FSTN 对语言进行识别的目的, 首先需要想一种用来描述 FSTN 的形式标记. 而对于一个 FSTN, 其应该由如下组成: (显然, 这就是一个有向图)

网络需要有一个名称
尽管在描述 FSTN 的时候, 一个名字仅仅只是助记用途 (mnemonic), 但是在 RTN 种, the name of network plays an absolutely crucial role in the definition of RTNs.

在这里用:
```
Name TO-LAUGH:
```
来标记之前的 laughing 机.
网络的节点的集合
对于节点的集合, 其必须包含一个初始节点 Initial, 以及一个终止节点 Final. 于是一个图的节点集合可以记为:
```
Initial 1
Final 4
```
对于这样的一个集合, 一个更加方便的方法则是对集合进行划分 (引入子集和缩写 abbreviates 的方式来进行描述):
```
V abbreviates:
  a, e, i, o, u.
C abbreviates:
  b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z.
```
对于这样的缩写, 在绘图的时候通过用缩写名字来代替各种线的方式来简化图的结构.
连接网络中节点的有向线段
对于有向线段的描述, 通过 From <node> to <node> by <label> 的方式进行定义. 其中在用 # 来表述 <label> 的时候, 认为其为无 label 的跳转.

于是一个完整的描述如下: (为上面的最后一个自动机的图)
```
Name TO-LAUGH:
  Initial 1
  Final 4
  From 1 to 2 by h
  From 2 to 3 by a
  From 3 to 1 by #
  From 3 to 4 by !.
```
其中, <label> 可以是多个字符的子集.

并且还能够通过这样的方式来对其进行描述: (通过一个转移矩阵来描述和储存)

h a !

1 2 0 0

2 0 3 0

3 2 0 4

4 0 0 0

其中用 0 表示没有线段, 用其他的数字来表示到达的目的地.

英语的一些例子

英语语句

一个英语的例子 (代码以及解释)

对于一个英语的语句结构, 可以用如下的 FSTN 来进行表述:

Name ENGLISH-1:
  Initial 1
  Final 9
  From 1 to 3 by NP
  From 1 to 2 by DET
  From 2 to 3 by N
  From 3 to 4 by BV
  From 4 to 5 by ADV
  From 4 to 5 by #
  From 5 to 6 by DET
  From 5 to 7 by DET
  From 5 to 8 by #
  From 6 to 6 by MOD
  From 6 to 7 by ADJ
  From 7 to 9 by N
  From 8 to 8 by MOD
  From 8 to 9 by ADJ
  From 9 to 4 by CNJ
  From 9 to 1 by CNJ

  NP abbreviates:
    kim, sandy, lee.
  DET abbreviates:
    a, the, her.
  N abbreviates:
    consumer, man, woman.
  BV abbreviates:
    is, was.
  CNJ abbreviates:
    and, or.
  ADJ abbreviates:
    happy, stupid.
  MOD abbreviates:
    very.
  ADV abbreviates:
    often, always, sometimes.

其中的缩写分别对应:

N: noun 名词
NP: noun phrase 名词性短语
DET: determiners 定冠词 (words that can come before (common) nouns)
ADJ: adjectives 形容词
MOD: modify adjecitves 形容词修饰短语, 如 “stupid” 和 “very stupid”
V: verb 动词
VP: verb phrase 动词短语
BV: be 动词
CNJ: conjuction 连接词

说明:

尽管这样的 FSTN 看起来并不是很容易理解, 但是却很容易实现.

感觉可以用类似于正则表达式或者是 ENBF 的方式来表述上面的语句:

<sentence> ::= <object> <BV> <description>
<object> ::= <DET> <N> | <NP>
<description> ::= (<adj_pharse> | <noun_pharse>) <more>*
<adj_pharse> ::= <ADV> <MOD>* <ADJ>
<noun_pharse> ::= (<ADV> | " ") (<DET> (<MOD>* <ADJ> | " ") <N> | <MOD>* <ADJ>)
<more> ::= <CNJ> (<description> | <sentence>)

(差不多这样的感觉, 写得不是很干净… )

不过上面的 FSTN 仅仅实现了 A is B 这样的语句. 因为语句比较简单, 所以可以通过一些非常直接的方式来进行构造:
1. 主体肯定是 <A> <BV> <B> 这样的构造
2. 对于 <A> 的部分, 需要构造名词性短语, 即 <DET> <N> 或者 <NP> 短语
3. 对于 <B> 的部分, 需要构造的是一个形容词性短语或者是名词性短语.
  对于形容词性短语, 简单的方式就是 <ADV> <MOD>* <ADJ> 这样构造
  
  对于名词性短语, 通过形容词性短语修饰的名词即可得到
并且通过拓展语料库的方式, 应该就能够实现更多复杂的语句. 不过这个目前并没有上下文一致性的判断.

/_img/reading/lisp-nlp/english-1-fstn.svg

一些其他的例子

英语的单音节词

其中对于英语的单音节的单词:

/_img/reading/lisp-nlp/eng-monosyl-fstn.svg

代码以及解释

Name ENG-MONOSYL:
  Initial 1, 2
  Final 3, 4, 5
  From 1 to 2 by C0
  From 2 to 3 by V
  From 3 to 4 by C8
  From 4 to 5 by s
  From 1 to 7 by C3
  From 7 to 2 by w
  From 1 to 6 by C2
  From 6 to 2 by l
  From 6 to 5 by #
  From 1 to 5 by C1
  From 5 to 2 by r
  From 1 to 8 by s
  From 8 to 5 by C4
  From 8 to 2 by C5
  From 3 to 9 by l
  From 3 to 10 by s
  From 3 to 11 by C7
  From 9 to 4 by C6
  From 10 to 4 by C4
  From 11 to 4 by th.

  V abbreviates:
    a, ae, ai, au, e, ea, ee, ei, eu, i, ia, ie, o, oa, oe, oi, oo, ou, ue, ui.
  CO abbreviates:
    b, c, ch, d, f, g, h, j, k, I, m, n, p, qu, r, s, sh, t, th, v, w, x.
  C1 abbreviates:
    d, sh, th.
  C2 abbreviates:
    b, c, f, g, k.
  C3 abbreviates:
    d, g, h, t, th.
  C4 abbreviates:
    c, k, p, t.
  C5 abbreviates:
    c, k, l, m, n, p, pl, qu, t, w.
  C6 abbreviates:
    b, t, m.
  C7 abbreviates:
    d, f, i, n, x.
  C8 abbreviates:
    b, c, ch, ck, d, f, g, h, k, l, m, mp, mph, n, ng, p, que, r, s, sh, th, v, w, x, y, z.

题外话

随机的英文单词

突然很好奇一个想法, 如果我有一个足够大的英语单词库 (比如这个), 然后对其进行统计, 比如说将 26 个字母每个字母都做一遍统计, 统计其从前一个字母, 或者说前几个字母向下一个字母变化的概率.

比如说用这样的一个程序来统计: (请忽略我丑陋的四不像代码)

从这里下载一个 10000 词的词典:

require 'open-uri'
words = URI.open(URI("https://www.mit.edu/~ecprice/wordlist.10000"))\
	    .read.split(/\s+/)\
	    .map { |word| word.downcase.split("").map { |c| c.ord - 97 } }

words.length # => 10000

其中统计的方法是这样的:

对于一个 arc, 用 From A to B by P 的形式来记录
通过使用一个矩阵 trans 来记录数据. 规定第 A 行为 From, 第 B 列为 to, 即第 A 行的第 B 列的元素记录了 From A to B 的 arc 次数.
```
trans = 27.times.map { 27.times.map { 0 } }

puts "You got a 27x27 matrix with default value of 0."
```

count = ->(word) {
  pre = word[0] # input as C style word
  (1...word.length).each do |i| # start from second character
    trans[pre][word[i]] += 1
    pre = word[i]
  end
  trans[word[-1]][26] += 1 # end of word
}

words.each { |word| count.call(word) }

puts "Count each arcs to the matrix."

归一化前

6	132	323	237	8	40	147	21	218	3	59	653	215	727	6	170	10	678	311	781	83	79	34	17	101	20	299
160	15	8	4	170	0	1	2	128	8	1	150	11	3	128	3	0	116	51	7	101	2	3	0	15	1	53
362	2	63	9	388	2	2	314	192	1	153	125	3	6	591	3	5	131	41	303	120	2	1	0	35	2	169
148	6	9	36	489	4	26	5	362	9	1	31	15	9	115	5	1	71	117	9	85	28	11	1	34	0	880
377	48	322	685	176	94	94	18	44	3	16	370	224	844	38	126	36	1114	941	276	28	120	73	175	45	4	1310
101	1	4	2	121	76	3	1	192	0	0	61	1	0	126	1	0	61	7	30	62	0	2	1	11	0	63
124	6	3	3	274	2	23	104	103	0	1	33	8	50	63	5	0	133	62	15	67	0	0	0	24	1	613
248	4	3	6	294	0	0	1	188	0	1	16	10	16	231	3	2	44	14	69	54	1	5	0	30	3	186
258	87	456	174	282	93	158	1	10	6	25	270	160	1325	603	110	8	170	441	455	19	184	1	22	1	50	92
33	0	2	1	40	0	0	0	9	1	0	0	1	0	43	3	0	1	1	0	37	1	0	0	0	0	10
27	6	0	2	156	4	3	2	96	2	1	11	3	21	12	2	0	4	68	2	6	0	1	0	10	0	153
372	14	14	84	572	21	9	1	469	0	19	302	15	7	290	16	0	3	106	88	97	24	3	0	217	0	488
338	64	6	2	430	5	2	2	254	1	0	6	68	7	193	185	0	3	61	4	47	1	3	1	21	0	208
276	9	257	358	414	46	706	11	274	10	48	25	13	98	135	4	5	8	454	661	60	53	6	2	46	5	838
62	63	124	116	21	43	90	10	39	6	56	282	284	1067	154	149	0	613	175	186	275	100	131	20	34	8	144
249	3	7	9	308	2	8	95	124	1	2	178	11	5	232	100	0	307	66	90	74	1	0	0	12	1	142
1	0	1	0	0	0	0	0	1	0	0	3	0	0	0	0	0	0	1	2	103	0	0	0	0	0	11
593	34	86	130	999	25	77	7	559	1	61	53	109	100	410	36	1	114	320	231	98	51	12	1	152	1	599
140	12	143	10	473	12	7	204	360	0	37	53	38	13	184	161	7	7	262	646	208	2	28	0	35	0	2043
379	9	40	5	798	6	4	256	995	0	2	67	28	13	306	10	0	353	278	122	166	6	18	1	137	1	760
101	72	93	58	87	16	65	1	86	2	10	142	113	232	7	70	2	317	217	184	1	3	2	5	8	8	37
116	2	5	3	415	0	2	1	205	0	0	0	0	0	59	2	0	1	4	2	2	0	0	1	4	0	25
129	4	3	3	104	3	0	31	104	2	1	9	2	31	61	3	0	18	32	6	1	1	3	1	5	0	75
18	1	24	0	24	1	0	6	25	0	0	1	2	1	2	53	0	0	0	32	8	0	0	4	4	0	58
18	10	14	8	48	0	2	1	24	0	0	16	23	16	21	15	0	11	49	10	5	0	7	0	0	2	727
22	1	0	2	44	0	0	0	14	0	0	2	0	0	16	0	0	1	1	0	5	0	0	0	4	7	17
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

归一化代码: (保留百分号后两位小数)

26.times do |alphabet|
  total = trans[alphabet].sum
  trans[alphabet].map! { |count| (count * 100.0 / total).round(2) }
end

归一化后 (单位 %)

0.11	2.45	6.01	4.41	0.15	0.74	2.73	0.39	4.05	0.06	1.1	12.14	4.0	13.52	0.11	3.16	0.19	12.61	5.78	14.52	1.54	1.47	0.63	0.32	1.88	0.37	5.56
14.02	1.31	0.7	0.35	14.9	0.0	0.09	0.18	11.22	0.7	0.09	13.15	0.96	0.26	11.22	0.26	0.0	10.17	4.47	0.61	8.85	0.18	0.26	0.0	1.31	0.09	4.65
11.97	0.07	2.08	0.3	12.83	0.07	0.07	10.38	6.35	0.03	5.06	4.13	0.1	0.2	19.54	0.1	0.17	4.33	1.36	10.02	3.97	0.07	0.03	0.0	1.16	0.07	5.59
5.9	0.24	0.36	1.44	19.51	0.16	1.04	0.2	14.44	0.36	0.04	1.24	0.6	0.36	4.59	0.2	0.04	2.83	4.67	0.36	3.39	1.12	0.44	0.04	1.36	0.0	35.1
4.96	0.63	4.24	9.01	2.32	1.24	1.24	0.24	0.58	0.04	0.21	4.87	2.95	11.1	0.5	1.66	0.47	14.66	12.38	3.63	0.37	1.58	0.96	2.3	0.59	0.05	17.23
10.9	0.11	0.43	0.22	13.05	8.2	0.32	0.11	20.71	0.0	0.0	6.58	0.11	0.0	13.59	0.11	0.0	6.58	0.76	3.24	6.69	0.0	0.22	0.11	1.19	0.0	6.8
7.22	0.35	0.17	0.17	15.96	0.12	1.34	6.06	6.0	0.0	0.06	1.92	0.47	2.91	3.67	0.29	0.0	7.75	3.61	0.87	3.9	0.0	0.0	0.0	1.4	0.06	35.7
17.35	0.28	0.21	0.42	20.57	0.0	0.0	0.07	13.16	0.0	0.07	1.12	0.7	1.12	16.17	0.21	0.14	3.08	0.98	4.83	3.78	0.07	0.35	0.0	2.1	0.21	13.02
4.72	1.59	8.35	3.19	5.16	1.7	2.89	0.02	0.18	0.11	0.46	4.94	2.93	24.26	11.04	2.01	0.15	3.11	8.08	8.33	0.35	3.37	0.02	0.4	0.02	0.92	1.68
18.03	0.0	1.09	0.55	21.86	0.0	0.0	0.0	4.92	0.55	0.0	0.0	0.55	0.0	23.5	1.64	0.0	0.55	0.55	0.0	20.22	0.55	0.0	0.0	0.0	0.0	5.46
4.56	1.01	0.0	0.34	26.35	0.68	0.51	0.34	16.22	0.34	0.17	1.86	0.51	3.55	2.03	0.34	0.0	0.68	11.49	0.34	1.01	0.0	0.17	0.0	1.69	0.0	25.84
11.51	0.43	0.43	2.6	17.7	0.65	0.28	0.03	14.52	0.0	0.59	9.35	0.46	0.22	8.98	0.5	0.0	0.09	3.28	2.72	3.0	0.74	0.09	0.0	6.72	0.0	15.1
17.68	3.35	0.31	0.1	22.49	0.26	0.1	0.1	13.28	0.05	0.0	0.31	3.56	0.37	10.09	9.68	0.0	0.16	3.19	0.21	2.46	0.05	0.16	0.05	1.1	0.0	10.88
5.72	0.19	5.33	7.42	8.59	0.95	14.64	0.23	5.68	0.21	1.0	0.52	0.27	2.03	2.8	0.08	0.1	0.17	9.42	13.71	1.24	1.1	0.12	0.04	0.95	0.1	17.38
1.46	1.48	2.92	2.73	0.49	1.01	2.12	0.24	0.92	0.14	1.32	6.63	6.68	25.09	3.62	3.5	0.0	14.42	4.12	4.37	6.47	2.35	3.08	0.47	0.8	0.19	3.39
12.28	0.15	0.35	0.44	15.19	0.1	0.39	4.69	6.12	0.05	0.1	8.78	0.54	0.25	11.45	4.93	0.0	15.15	3.26	4.44	3.65	0.05	0.0	0.0	0.59	0.05	7.01
0.81	0.0	0.81	0.0	0.0	0.0	0.0	0.0	0.81	0.0	0.0	2.44	0.0	0.0	0.0	0.0	0.0	0.0	0.81	1.63	83.74	0.0	0.0	0.0	0.0	0.0	8.94
12.2	0.7	1.77	2.67	20.56	0.51	1.58	0.14	11.5	0.02	1.26	1.09	2.24	2.06	8.44	0.74	0.02	2.35	6.58	4.75	2.02	1.05	0.25	0.02	3.13	0.02	12.33
2.75	0.24	2.81	0.2	9.3	0.24	0.14	4.01	7.08	0.0	0.73	1.04	0.75	0.26	3.62	3.17	0.14	0.14	5.15	12.7	4.09	0.04	0.55	0.0	0.69	0.0	40.18
7.96	0.19	0.84	0.11	16.76	0.13	0.08	5.38	20.9	0.0	0.04	1.41	0.59	0.27	6.43	0.21	0.0	7.42	5.84	2.56	3.49	0.13	0.38	0.02	2.88	0.02	15.97
5.21	3.71	4.8	2.99	4.49	0.83	3.35	0.05	4.44	0.1	0.52	7.32	5.83	11.96	0.36	3.61	0.1	16.35	11.19	9.49	0.05	0.15	0.1	0.26	0.41	0.41	1.91
13.66	0.24	0.59	0.35	48.88	0.0	0.24	0.12	24.15	0.0	0.0	0.0	0.0	0.0	6.95	0.24	0.0	0.12	0.47	0.24	0.24	0.0	0.0	0.12	0.47	0.0	2.94
20.41	0.63	0.47	0.47	16.46	0.47	0.0	4.91	16.46	0.32	0.16	1.42	0.32	4.91	9.65	0.47	0.0	2.85	5.06	0.95	0.16	0.16	0.47	0.16	0.79	0.0	11.87
6.82	0.38	9.09	0.0	9.09	0.38	0.0	2.27	9.47	0.0	0.0	0.38	0.76	0.38	0.76	20.08	0.0	0.0	0.0	12.12	3.03	0.0	0.0	1.52	1.52	0.0	21.97
1.75	0.97	1.36	0.78	4.67	0.0	0.19	0.1	2.34	0.0	0.0	1.56	2.24	1.56	2.04	1.46	0.0	1.07	4.77	0.97	0.49	0.0	0.68	0.0	0.0	0.19	70.79
16.18	0.74	0.0	1.47	32.35	0.0	0.0	0.0	10.29	0.0	0.0	1.47	0.0	0.0	11.76	0.0	0.0	0.74	0.74	0.0	3.68	0.0	0.0	0.0	2.94	5.15	12.5
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

其中关于绘图的代码: 在输出的时候为了美观起见, 最多只输出前 3 (n = 3) 个可能的 arc. 且并不输出休止符.

n = 3
res = ""

26.times do |alphabet|
  index = -1
  res << (trans[alphabet][0...-1]
            .map { |percent| [percent, (index += 1)] }
            .sort_by { |x| x[0] }
            .reverse)[0...n]
           .map { |item| "#{(alphabet + 97).chr} -> #{(item[1] + 97).chr} [label = \"#{item[0]}\"];\n" }
           .join("")
end

res

最终得到的图的结果:

/_img/reading/lisp-nlp/cout-of-english-words.svg

没什么想法美化了, 只能说真的丑啊… 看来之后还要想想办法钻研一下如何出漂亮的图.

造词部分则通过如下的方式来实现

准备的代码

require 'pickup' # https://github.com/fl00r/pickup

hashed_trans = {}

trans.each_with_index do |tos, from_index|
  hashed_tos = {}

  tos.each_with_index do |possibility, to_index|
    hashed_tos[(to_index + 97).chr] = possibility
  end

  hashed_trans[(from_index + 97).chr] = Pickup.new(hashed_tos)
end

word = -> (iter = 0, char = '') {
  c = (rand(26) + 97).chr
  if iter == 0
    c + word.call(1, c)
  elsif iter > 10
    c
  else
    c = hashed_trans[char].pick
    c == '{' ? "" : c + word.call(iter + 1, c)
  end
}

注: 其中使用了一个叫做 pickup 的 gem, 原因是我懒得写随机生成的算法了.

来输出一段话吧:

10.times.map { word.call }.join(" ")

thice baddicana b ran gug on f ronstintotag veacadonamaw erarivore

感觉有点样子了.

不过这样的方法还是太粗鲁了一些. (有一种概率论里面的无记忆性掷色子的感觉. ) 一些可能可以改进的方向:

增加对前一字母的判断
细化对字母的一些处理, 比如上面的计数部分就做得不是很好
增加对开始标记的处理 (我真的不想再重新写了… )

和上面的题外话类似的, 也能够根据上面的方式来进行处理生成.

[2023-1-25]: 先暂停一下, 去搞一个

(乐: 这个就是一个例题里面的一个东西. )

[2023-2-21]: 回来继续

FSTN in LISP

使用代码来储存 FSTN 的结构

注: 如果有时间的话, 可以考虑直接用 defmarco 来实现类似的操作. 并且为了防止占用 #, 在 clisp 里面用 |#| 的形式来描述.

(defvar haha-fstn "FSTN for haha! ")
(setq haha-fstn
      '((Initial (1))
        (Final (3))
        (From 1 to 2 by h)
        (From 2 to 3 by a)
        (From 3 to 1 by !)))

(defvar swahili-1 "FSTN for Swahili. ")
(setq swahili-1
      '((Initial (1))
        (Final (5))
        (From 1 to 2 by subj)
        (From 2 to 3 by tense)
        (From 3 to 4 by obj)
        (From 4 to 5 by stem)))

(defvar abbreviations)
(setq abbreviations
      '((subj ni u a tu wa)
        (obj ni ku m tu wa)
        (tense ta na me li)
        (stem penda piga sumbua lipa)))

一些数据结构的其他代码

;;; The Network Rules
(defun initial-nodes (network)
  "Find Initial nodes in NETWORK. "
  (nth 1 (assoc 'Initial network)))

(defun final-nodes (network)
  "Find Final nodes in NETWORK. "
  (nth 1 (assoc 'Final network)))

(defun transitions (network)
  "Find Transitions for NETWORK. "
  (cddr network))

;;; The Transistion Rules
(defun trans-node (transition)
  "Get From node from the TRANSITION. "
  (getf transition 'From))

(defun trans-newnode (transition)
  "Get To node from the TRANSISTION. "
  (getf transition 'To))

(defun trans-label (transition)
  "Get By node from the TRANSITION. "
  (getf transition 'by))

注: 这里有一个小小的吐槽. 这样的话就相当于把一个 FSTN 的形式给写死了. (指 transitions 的代码, 相当于是认为 FSTN 的前两行就是 Initial 和 Final, 然后后面的全都是转移关系.

根据 FSTN 的历遍机器 - Recognizer [低端版本]

先定义算法的过程:

根据可能的 Initial nodes 建立一组可能的状态 (initial pool)
执行如下操作:
1. 选择其中的一个可能的状态 (将其从 pool 中移除)
2. 如果它处于 Final 状态, 则停止并说明识别成功, 否则执行下面的操作:
  1. 计算它下一步可能的状态
  2. 将那些可能的状态放到 initial pool 里面
如果可选的 pool 被清空了, 那么就说明识别不成功

那么一个可能的状态长什么样呢? 即希望通过: (state , remaining input) 的形式来记录. 这个计算下一步可能的状态的话, 分为三种情况:

若 label 和 remaining input 的符合
若 abbreviate 和 remaining input 符合
若 label 是 #

于是代码如下:

(defun recognize (network tape)
  "Test if the TAPE matches NETWORK, return t if true, nil otherwise. "
  (catch 'stop
    (dolist (initial (initial-nodes network))
      ;; Iterating the inital pool
      (recognize-next initial tape network))
    ;; return nil if fails
    nil))

(defun recognize-next (state tape network)
  "Iterate the tape by shorten the TAPE, move to next state by NETWORK. 
If TAPE is empties and in final STATE, meaning case matched, it will throw 'stop."
  (if (and (member state (final-nodes network))
           (null tape))                 ; final state and empty tape
      (throw 'stop t)                   ; success and abort
      (dolist (transition (transitions network))
        (if (equal state (trans-node transition)) ; select start
            ;; for every possible next-nodes
            (dolist (newtape (recognize-move (trans-label transition) tape))
              (recognize-next (trans-newnode transition) newtape network))))))

(defun recognize-move (label tape)
  "Return a list of possible new tape by label. "
  (if (equal (car tape) label)          ; first of tape matches label 
      (list (cdr tape))
      (if (member (car tape) (assoc label abbreviations))
          (list (cdr tape))             ; if matches abbreviations
          (if (equal label '|#|)        ; jump if #
              (list tape)
              '()))))                   ; return nil if all fails

注: 关于这个代码的一些注释

在计科导里面我们实现了一个图灵机, 这里的状态转移基本上就非常像图灵机了, 只是规则已经固定且形式上是一个变短的形式.

不过现在还没有搞定如何处理位置和无限长度纸带的一个写法, 之后有时间再试试看用 Common Lisp 来写一个图灵机.

不过这个我觉得实在有点不太合理, 因为如果要按照上面的算法来进行的话, 我们输入的文本就有很多的要求, 比如我们对于一个 Swahili 的语言, 就需要提前分词: '(wa me tu lipa) 否则就无法处理. 这显然是和一开始的想法是违背的吧, 毕竟一般可能会希望通过 'wametulip 这样直接的形式来.

一个可能的想法就是将 Abbreviations 通过展开的方式, 来替代提前分割的一个做法.

并且还有一个问题就是这个问题里面的算法, 如果对于很多分支的话, 就会因为搜索空间过大而搜索无力而死机了. 这样就非常痛苦. 现实世界中这样的单词空间可以说是成百上千的量级了, 如果不能够解决的话, 就会有一个搜索爆炸的问题.

关于这个的话, 我觉得 Steven Wolfram 的一个视频应该讲得挺好的, 大致的一个思路就是通过概率的方式来减少历遍所有的可能域. 而通过谈话模型来缩小可能的空间或者改变倾向. 有点像是蚁群算法, 或者和我之前整的那个完全随机的那个单词生成的感觉很像, 但是需要更加精细的操作.

而生成的代码就变得更加简单一点了:

(defun generate (network)
  "Print all possible tapes by NETWORK. "
  (dolist (initial (initial-nodes network))
    (generate-next initial '() network)))

(defun generate-next (node tape network)
  "Append the TAPE and change NODE by NETWORK. "
  (if (member node (final-nodes network))
      (print tape)
      (dolist (transition (transitions network))
        (if (equal node (trans-node transition))
            (generate-move transition tape network))
        '())))

(defun generate-move (transition tape network)
  (let ((by (trans-label transition))
        (to (trans-newnode transition)))
    (cond ((equal by '|#|)
           (generate-next to tape network))
          ((assoc by abbreviations)
           (dolist (pattern (rest (assoc by abbreviations)))
             (generate-next to (generate-append tape pattern) network)))
          (t
           (generate-next to (generate-append tape by) network)))))

(defun generate-append (list value)
  (append list (list value)))

注: 这样的话如果有那种死循环的 FSTN. 那么你运行的时候就会非常痛苦.

(setq deadly-fstn
      '((Initial 1)
        (Final 3)
        (From 1 to 2 by h)
        (From 2 to 1 by a)
        (From 2 to 3 by -)))

一个可能的做法就是加入栈深度检查, 只要过深就退出, 或者别的什么之类的. 并且这个程序也不能够让 Final node 上的节点去选择其他可能的点.

对上面网络的一个简单解释 (无 # 分支版本):

若当前状态不是 Final State:
- 若纸带为空, 则退出 (nil)
- 若纸带非空, 则:
  - 若满足纸带匹配缩写规则或者缩写规则, 那么就按照规则缩短纸带
  - 若不满足任何的以上两种规则, 就退出 (nil)
若当前状态是 Final State
- 若纸带为空, 则停止查找 (stop)
- 若纸带非空, 则退出 (nil)
因为纸带长度有限, 且因为不存在 #, 即纸带长度总会减少, 所以纸带总会为空, 若纸带为空: 若当前状态是 Final State, 则停止 (stop); 若当前状态不是 Final State, 则退出 (nil).
综上, 是成立的.

Finite-state Transducer

An FST is a more interesting kind of FSA that allows a string of output symbols to be produced as an input string is recognized.

就是什么呢, 如果看过上面的一些注释或者看过代码的话, 肯定会对这个代码感到十分不爽, 因为只能认证, 或者说识别, 没法做更多的事情.

于是就有了 Finite-State Transducer 的一个需求, 即通过 FSTN 的规则来将将输入转换为输出. (有点像是带有写功能的图灵机? 很像啊. )

于是有一个简单的记法:

Name ENG_FRE:
  Initial 1
  Final 5
  From 1 to 2 by WHERE
  From 2 to 3 by BV
  From 3 to 4 by DET-FEMN
  From 4 to 5 by N-FEMN
  From 3 to 6 by DET-MASC
  From 6 to 5 by N-MASC.

WHERE abbreviates:
  where_ou.
BV abbreviates:
  is_est.
DET-FEMN abbreviates:
  the_la.
DET-MASC abbreviates:
  the_le.
N-FEMN abbreviates:
  exit_sortie, shop_boutique, toliet_toilette.
N-MASC abbreviates:
  policeman_gendarme.

上面的例子实际上就是一个简单的 English-French 的翻译 FSTN. 其中通过 from_to 的形式来记录 arc, 表示将读到的 from 转换为 to.

于是在 Lisp 表述里面, 将 from_to 用 (from to) 的 List 形式来记录.

大概类似这样吧

(defvar eng-fre "English-French FSTN. ")
(setq eng-fre
      '((Initial (1))
        (Final (5))
        (From 1 to 2 by WHERE)
        (From 2 to 3 by BV)
        (From 3 to 4 by DET-FEMN)
        (From 4 to 5 by DET-MASC)
        (From 3 to 6 by DET-MASC)
        (From 6 to 5 by N-MASC)))

(setq abbreviations
      '((WHERE (where ou))
        (BE (is est))
        (DET-FEMN (the la))
        (DET-MASC (the le))
        (N-FEMN (exit sortie) (shop boutique) (toliet toilette))
        (N-MASC (policeman gendarme))))

那么前面的 generate 和 recognize 的函数实际上现在就是被重新合在一起了. 通过 recognize 来读取, 再通过 generate 来转换并生成新的结果.

[2023/02/24]: 先滚去学一点 Lisp, 不然写代码太难受了.

附录

Common Lisp 补注

可以参考的一个在线文档.

Common Lisp 补注

简单的数据结构和类型

嘿嘿, 想不到吧, 在 Lisp (Common Lisp) 里面, 只要用 List 就能够干翻一切乐.

是否厌倦了在 List 里面使用 car, cdr? 试试 nth?

(nth 1 '(0 1 2 3 4))

比如要一个 Hash 表?

(assoc 'key
       '((key value list)
         (other-key value list and so on)))

Debugging and Testing

平时写代码的时候我都忽略了一点, 就是如何调试和测试. (尤其是后者)

在 Lisp 里面有各种调试的方法, 其中我认为比较好用的 (~~我会的~~):

trace 以及 untrace, 可以记录 (或取消) 一个方法的调用的参数

黑话表

ATN	augmented transition network
CF-PSG	context-free phrase structure grammar
CFL	context-free language
DAG	directed acyclic graph
DBQ	database query (language)
DCG	definite clause grammar
FSA	finite-state automaton
FST	finite-state transducer
FSTN	finite-state transition network
MRL	meaning representation language
MT	machine translation
NATR	network and transducer representation
PA	pushdown automation
PT	pushdown transducer
RTN	recursive transition network
WFC	word form clause
WFST	well-formed substring table

一些无关的代码

简单的自动机绘图代码

nodes = /(Initial|Final)\s+((((\w+),\s*)|\w+)+)/
arcs = /From\s+(\w+)\sto\s+(\w+)\s+by\s+(\w+|#)/

res = "  rankdir = LR;\n" + \
      "  node [shape = point]; qi qa;\n" + \
      "  node [shape = circle];\n"

raw.each_line do |line|
  if line.match(nodes)
    m = Regexp.last_match
    type, node_names = m[1], m[2]

    if type == "Initial"
      node_names.split(/,\s*/).each do |node_name|
        res << "  qi -> #{node_name};\n" unless node_name.empty?
      end
    else # type == "Final"
      node_names.split(/,\s*/).each do |node_name|
        res << "  #{node_name} -> qa;\n" unless node_name.empty?
      end
    end
  elsif line.match(arcs)
    m = Regexp.last_match
    from, to = m[1], m[2]

    if m[3] == '#'
      label = ";\n"
    else
      label = " [label = \"#{m[3]}\"];\n"
    end
    res << "  \"#{from}\" -> \"#{to}\"#{label}"
  end
end

return res

FSTN Parser

一个简单的 FSTN Parser 在 Ruby and EBNF 里面介绍了.