freemt committed on
Commit
7fd4e54
1 Parent(s): d6448a5

Update slow-track for more lang pairs

data/xiyouji-ch1-de.txt CHANGED
@@ -2,125 +2,10 @@ Wu Ch’êng-ên
2
 
3
  Monkeys Pilgerfahrt
4
 
5
- Hugendubel
6
-
7
-
8
-
9
 
10
 
11
  Nach der englischen Übersetzung von Arthur Waley übertragen von Georgette Boner und Maria Nils.
12
 
13
- 1980 © der deutschen Ausgabe Heinrich Hugendubel Verlag, München, Titel der Originalausgabe MONKEY © George Allen & Unwin Ltd. London
14
-
15
- Alle Rechte vorbehalten
16
-
17
- Umschlaggestaltung: Dieter Bonhorst, mit einer Illustration von Maja Weber
18
-
19
- Druck und Bindung: May & Co., Darmstadt
20
-
21
- ISBN 3 88034 9
22
-
23
- Printed in Germany
24
-
25
-
26
-
27
- * * *
28
-
29
-
30
-
31
- Die Rechtschreibung und Interpunktion der Originalausgabe sind unverändert. Offensichtliche Fehler wurden stillschweigend korrigiert.
32
-
33
-
34
-
35
-
36
-
37
- Inhalt
38
-
39
-
40
- Vorwort zur englischen Ausgabe von Arthur Waley
41
-
42
- 1. Kapitel: Die Geburt des magischen Affen Monkey
43
-
44
- 2. Kapitel: Monkey’s Lehrjahre beim Patriarchen
45
-
46
- 3. Kapitel: Die Waffen des Drachenkönigs; Monkey streicht seinen Namen aus der Liste Yamas, des Königs der Toten und erregt den Zorn des Jade-Kaisers
47
-
48
- 4. Kapitel: Monkey erhält den Posten eines Pferdeknechts im Himmel und kehrt wegen dieser Beleidigung schnellstens auf die Erde zurück
49
-
50
- 5. Kapitel: ›Der Große Weise Himmelsebenbürtige‹
51
-
52
- 6. Kapitel: Der Zauberer Erh-lang und Lao-tsu nehmen Monkey gefangen
53
-
54
- 7. Kapitel: Monkey verliert eine Wette gegen Buddha
55
-
56
- 8. Kapitel: Ein Bote für die Heiligen Schriften
57
-
58
- 9. Kapitel: Die Gesetze des Karma
59
-
60
- 10. Kapitel: Ein gebrochenes Versprechen
61
-
62
- 11. Kapitel: Der Kaiser vor dem Totengericht
63
-
64
- 12. Kapitel: Tripitaka erhält den Auftrag, die Heiligen Schriften aus Indien zu holen
65
-
66
- 13. Kapitel: Der Tod von Tripitakas Reisegefährten
67
-
68
- 14. Kapitel: Tripitaka hebt den Bann von Monkey auf und macht ihn zu seinem Reisegefährten
69
-
70
- 15. Kapitel: Monkeys Kampf mit dem verwunschenen Drachen
71
-
72
- 16. Kapitel: Monkey vertreibt einen ›Unhold‹
73
-
74
- 17. Kapitel: Der ›Unhold‹ Pigsy beschließt, Tripitaka und Monkey zu begleiten
75
-
76
- 18. Kapitel: ›Das Ungeheuer vom Strom‹ schließt sich der Pilgerfahrt an
77
-
78
- 19. Kapitel: Der Geist des toten Königs bittet Monkey um seine Hilfe
79
-
80
- 20. Kapitel: Die durch bösen Zauber verwunschene Stadt Kräh-Hahn
81
-
82
- 21. Kapitel: Lao-tsu’s Elexier erweckt den toten König wieder zum Leben; der falsche Zauberer wird in seine ursprüngliche Gestalt, einen Löwen, zurückverwandelt
83
-
84
- 22. Kapitel: 500 Buddhisten werden von Monkey aus der Sklaverei befreit
85
-
86
- 23. Kapitel: Monkey verulkt Taoisten, die einen Gottesdienst feiern
87
-
88
- 24. Kapitel: Eine Wette mit tödlichem Ausgang
89
-
90
- 25. Kapitel: Menschenopfer
91
-
92
- 26. Kapitel: Der Flußkönig stellt Tripitaka eine Falle
93
-
94
- 27. Kapitel: Göttliche Intervention und Rettung Tripitakas
95
-
96
- 28. Kapitel: Tripitaka erhält die Heiligen Schriften
97
-
98
- 29. Kapitel: Die Heimreise
99
-
100
- 30. Kapitel: Willkommensfest in Ch’ang-an
101
-
102
- Arthur Waley zur deutschen Ausgabe
103
-
104
-
105
-
106
-
107
-
108
- Vorwort zur englischen Ausgabe von Arthur Waley
109
-
110
-
111
- Die vorliegende Erzählung wurde von Wu Ch’êng-ên aus Huai-an in Kiangsu niedergeschrieben. Seine genauen Daten sind nicht bekannt. Doch scheint er zwischen 1505 und 1580 n. Chr. gelebt und sich als Dichter eines gewissen Ruhmes erfreut zu haben. Einige seiner eher unbedeutenden Verse sind in einer Anthologie der Ming-Dichtung überliefert.
112
-
113
- Tripitaka, dessen Pilgerfahrt nach Indien das Thema der Erzählung bildet, ist eine wirkliche Person, in der Geschichte besser bekannt als Hsüan Tsang. Er lebte im siebten Jahrhundert n. Chr. Über seine Reise gibt es eingehende zeitgenössische Berichte. Bereits im zehnten Jahrhundert, und vermutlich schon früher, war Tripitakas Pilgerfahrt Gegenstand eines ganzen Zyklus phantastischer Legenden. Seit dem dreizehnten Jahrhundert sind diese Legenden ständig auf der chinesischen Bühne dargestellt worden. Wu Ch’êng-ên standen daher für seine lange Märchenerzählung eine Menge Bausteine zur Verfügung. Das ursprüngliche Buch ist von unendlichem Umfang und wird gewöhnlich in gekürzten Fassungen gelesen. Bei diesen Bearbeitungen blieb die ursprüngliche Anzahl der einzelnen Episoden bestehen; ihre Länge jedoch wurde, besonders durch Streichen von Dialogen, erheblich gekürzt. — Ich habe meist das entgegengesetzte Prinzip angewandt, indem ich zahlreiche Episoden ausließ, die beibehaltenen jedoch nahezu ungekürzt übersetzte, mit Ausnahme der meisten eingestreuten, für eine Übertragung ins Englische ungeeigneten Verse.
114
-
115
- Monkey ist ein wahrhaft einzigartiges Werk in seiner Verbindung von Schönheit mit Ungereimtheit, von Tiefe mit Unsinn. Folklore, Allegorie, Religion, Geschichte, antibürokratische Satire und reine Poesie — dies sind die außerordentlich verschiedenen Elemente, aus denen das Buch sich zusammenfügt. Die Bürokraten der Erzählung sind Heilige im Himmel, und man könnte auf die Vermutung kommen, daß die Satire sich noch eher gegen die Religion als gegen die Bürokratie wandte. Dem ist aber nicht so. Es ist nämlich eine in China geläufige Anschauung, daß die Hierarchie im Himmel ein Spiegelbild der Regierungsform auf Erden sei. Hier wie so oft lassen die Chinesen die Katze aus dem Sack, wo andere Völker uns Rätsel aufgeben. Es ist häufig als Theorie geltend gemacht worden, daß eines Volkes Götter die Spiegelung seiner irdischen Regenten darstellen. In den meisten Fällen bleibt die Ableitung im Dunkeln. Im Volksglauben der Chinesen jedoch gibt es keinerlei Doppelsinn. Der Himmel ist einfach das gesamte bürokratische System, leibhaftig ins Empyreum versetzt.
116
-
117
- Was die Allegorie anbelangt, so versinnbildlicht Tripitaka unverkennbar den ängstlich und beflissen durch die Schwierigkeiten des Lebens tappenden Menschen, während Monkey die ewige Unruhe des Genies personifiziert. Pigsy wiederum symbolisiert offensichtlich die physischen Begierden, primitive Kraft und eine Art schwerfälliger Geduld. Sandy ist rätselhafter. Die Kommentatoren sagen, er stelle ch’êng dar, was gewöhnlich mit ›Redlichkeit‹ übersetzt wird, allein noch eher etwas im Sinne von ›Integrität des Herzens‹ bedeutet. Er kam nicht als nachträglicher Einfall in die Erzählung, erscheint er doch bereits in einigen der frühesten Fassungen der Legende. Aber es muß zugegeben werden, daß sein Bild, obgleich für die Erzählung in unerklärlicher Weise nötig, dennoch in den Umrissen seltsam undeutlich und farblos bleibt.
118
-
119
- Auszüge des vorliegenden Buches sind erschienen in Giles’ History of Chinese Literature und in Timothy Richard’s Mission to Heaven, zu einer Zeit, als nur die gekürzten Fassungen bekannt waren. Eine zugängliche, doch recht ungenaue Beschreibung des Werkes gibt Helen Hayes in A Buddhist Pilgrim’s Progress (Wisdom of the East Series). Ferner existiert eine recht freie japanische Paraphrase von verschiedenen Händen, mit einer 1806 datierten Einleitung des bekannten Novellisten Bakin und Illustrationen, deren einige von Hokusai stammen. Einer der Übersetzer, Hokusais Schüler Gakutei, gesteht, daß er keine Kenntnis von der Chinesischen Umgangssprache hatte, als er die Arbeit unternahm.
120
-
121
- Der meiner Übersetzung zugrundeliegende Text erschien 1921 in der Oriental Press, Shanghai, mit einer ausführlichen und gelehrten Einleitung von Dr. Hu Shih, derzeitigem chinesischen Botschafter in Washington.
122
-
123
-
124
 
125
 
126
 
 
2
 
3
  Monkeys Pilgerfahrt
4
 
 
 
 
 
5
 
6
 
7
  Nach der englischen Übersetzung von Arthur Waley übertragen von Georgette Boner und Maria Nils.
8
 
9
 
10
 
11
 
docs/build/doctrees/environment.pickle CHANGED
Binary files a/docs/build/doctrees/environment.pickle and b/docs/build/doctrees/environment.pickle differ
 
docs/build/doctrees/intro.doctree CHANGED
Binary files a/docs/build/doctrees/intro.doctree and b/docs/build/doctrees/intro.doctree differ
 
docs/build/doctrees/userguide-zh.doctree CHANGED
Binary files a/docs/build/doctrees/userguide-zh.doctree and b/docs/build/doctrees/userguide-zh.doctree differ
 
docs/build/html/_sources/intro.rst.txt CHANGED
@@ -3,19 +3,19 @@ Introduction
3
 
4
  ``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
5
 
6
- The aim here was to provide an interface to align two texts.
7
 
8
  The current implementation has been developed in Python 3 and ``gradio``.
9
 
10
  Motivation
11
  **********
12
 
13
- Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
14
 
15
  Limitations
16
  ***********
17
 
18
- Currently, only zh-en/en-zh pairs are supported for fast-track alignment although further pairs will be added if and when time permits.
19
  If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
20
 
21
- An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other laugnages pairs.
 
3
 
4
  ``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
5
 
6
+ The aim is to provide an interface to align two texts.
7
 
8
  The current implementation has been developed in Python 3 and ``gradio``.
9
 
10
  Motivation
11
  **********
12
 
13
+ Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find many applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
14
 
15
  Limitations
16
  ***********
17
 
18
+ Currently, only zh-en/en-zh pairs are supported for fast-track mode although further pairs will be added if and when time permits.
19
  If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
20
 
21
+ An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introduced for other language pairs.
docs/build/html/_sources/userguide-zh.rst.txt CHANGED
@@ -3,7 +3,7 @@
3
 
4
  - ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
5
 
6
- - ``radiobee`` 目前仅支持中英、英中对齐。
7
  - ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
8
  - ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
9
  - 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
 
3
 
4
  - ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
5
 
6
+ - ``radiobee`` 快对模式目前仅支持中英、英中对齐。
7
  - ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
8
  - ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
9
  - 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
docs/build/html/intro.html CHANGED
@@ -77,17 +77,17 @@
77
  <section id="introduction">
78
  <h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1>
79
  <p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> (or <code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> in full) is a powerful dualtext aligner.</p>
80
- <p>The aim here was to provide an interface to align two texts.</p>
81
  <p>The current implementation has been developed in Python 3 and <code class="docutils literal notranslate"><span class="pre">gradio</span></code>.</p>
82
  <section id="motivation">
83
  <h2>Motivation<a class="headerlink" href="#motivation" title="Permalink to this headline"></a></h2>
84
- <p>Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.</p>
85
  </section>
86
  <section id="limitations">
87
  <h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
88
- <p>Currently, only zh-en/en-zh pairs are supported for fast-track alignment although further pairs will be added if and when time permits.
89
  If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.</p>
90
- <p>An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other laugnages pairs.</p>
91
  </section>
92
  </section>
93
 
 
77
  <section id="introduction">
78
  <h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1>
79
  <p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> (or <code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> in full) is a powerful dualtext aligner.</p>
80
+ <p>The aim is to provide an interface to align two texts.</p>
81
  <p>The current implementation has been developed in Python 3 and <code class="docutils literal notranslate"><span class="pre">gradio</span></code>.</p>
82
  <section id="motivation">
83
  <h2>Motivation<a class="headerlink" href="#motivation" title="Permalink to this headline"></a></h2>
84
+ <p>Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find many applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.</p>
85
  </section>
86
  <section id="limitations">
87
  <h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
88
+ <p>Currently, only zh-en/en-zh pairs are supported for fast-track mode although further pairs will be added if and when time permits.
89
  If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.</p>
90
+ <p>An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introduced for other language pairs.</p>
91
  </section>
92
  </section>
93
 
docs/build/html/searchindex.js CHANGED
@@ -1 +1 @@
1
- Search.setIndex({docnames:["examples","index","intro","modules","radiobee","userguide","userguide-zh"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["examples.rst","index.rst","intro.rst","modules.rst","radiobee.rst","userguide.rst","userguide-zh.rst"],objects:{},objnames:{},objtypes:{},terms:{"1":[5,6],"12":[5,6],"2":[5,6],"200":[5,6],"2000":[5,6],"3":2,"316287378":[5,6],"4":[5,6],"5":2,"500":[2,6],"8":[5,6],"\u4e00\u822c\u65e0\u9700\u7406\u4f1a\u8fd9\u4e9b\u53c2\u6570":6,"\u4e2d\u82f1\u975e\u7a7a\u884c\u9650\u5236\u5728":6,"\u4e3a\u4e2d\u82f1\u6587\u6df7\u5408\u6587\u672c\u53ca\u8bd5\u7740\u5206\u79bb\u4e2d\u82f1\u6587":6,"\u4e3a\u7a7a\u767d\u65f6":6,"\u4e86\u89e3\u8fd9\u4e9b\u5bf9\u9f50\u5de5\u5177":6,"\u4ee5\u5185":6,"\u4ee5\u540e\u53ef\u80fd\u4f1a\u652f\u6301":6,"\u4f18\u8d28\u5bf9":6,"\u4f7f\u7528\u8bf4\u660e":1,"\u5176\u4ed6\u8bed\u8a00\u5bf9\u7684\u5bf9\u9f50":6,"\u5219\u4f1a\u89c6":6,"\u5219\u9650\u5236\u5728":6,"\u53e6\u4e00\u65b9\u9762":6,"\u53ef\u4ee5\u53f3\u51fb\u62f7\u51fa\u56fe\u7684\u94fe\u63a5\u7528\u6d4f\u89c8\u5668\u72ec\u7acb\u8bbf\u95ee\u62f7\u51fa\u6765\u7684\u94fe\u63a5\u6216\u53f3\u51fb\u5b58\u76d8\u518d\u7528\u770b\u56fe\u7a0b\u5e8f\u6253\u5f00\u5b58\u76d8\u7684\u56fe\u6587\u4ef6":6,"\u548c":6,"\u5acc\u56fe\u592a\u5c0f\u7684\u8bdd":6,"\u5b58\u4e0b\u6709\u5173\u53c2\u6570\u67e5\u770b\u6216\u901a\u77e5\u5f00\u53d1\u8005":6,"\u5bf9\u7ea6\u97005\u5206\u949f":6,"\u662f":6,"\u6700\u5c0f":6,"\u7136\u540e\u8fdb\u884c\u5bf9\u9f50":6,"\u7684\u5b6a\u751f\u5144\u5f1f":6,"\u7684\u5efa\u8bae\u503c":6,"\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":6,"\u76ee\u524d\u4ec5\u652f\u6301\u7eaf\u6587\u672c\u6587\u4ef6\u4e0a\u8f7d":6,"\u7b2c\u4e8c\u6b21\u4e0a\u8f7d\u6587\u4ef6\u524d\u8bf7\u70b9\u51fb":6,"\u7b49":6,"\u7b49\u683c\u
5f0f":6,"\u82f1\u4e2d":6,"\u82f1\u4e2d\u5bf9\u9f50":6,"\u8bbe\u5927\u4e9b\u5219\u4f1a\u5f97\u5230\u5c11\u4e00\u4e9b\u5bf9\u9f50\u5bf9\u56e0\u4e3a\u53ef\u80fd\u9519\u5931\u4e86\u4e00\u4e9b":6,"\u8bbe\u5927\u4e9b\u6216":6,"\u8bbe\u5c0f\u4e9b\u53ef\u4ee5\u5f97\u5230\u66f4\u591a\u7684\u5bf9\u9f50\u5bf9\u4f46\u4e5f\u4f1a\u6709\u66f4\u591a":6,"\u8bbe\u5c0f\u4e9b\u6216":6,"\u8bef\u62a5\u5bf9":6,"\u8bf7\u52a0\u5165qq\u7fa4":6,"\u8fd0\u884c\u51fa\u9519\u65f6\u53ef\u4ee5\u70b9\u51fb":6,"\u9519\u8bef\u5224\u65ad\u4e3a\u5bf9\u9f50\u7684\u5bf9":6,"do":5,"new":5,As:0,For:0,If:[2,5],On:5,The:[2,5],To:5,about:5,ad:2,address:5,aim:2,align:[0,2,5,6],align_s:[1,3],align_text:[1,3],also:5,although:2,amend_avec:[1,3],an:2,app:[1,3],applic:2,approxim:2,ar:[2,5],attempt:5,been:[0,2],befor:5,better:5,blank:5,browser:5,built:0,bumblebe:[5,6],can:5,candid:5,cannot:0,cat:2,chines:5,clear:[5,6],click:[0,5],cmat2tset:[1,3],co:0,contact:2,content:3,copi:5,csv:[5,6],current:2,de:2,develop:[2,5],dl_type:[5,6],docterm_scor:[1,3],docx:[5,6],download:0,dual:2,dualtext:2,e:2,ebook:2,educ:2,en2zh:[1,3],en2zh_token:[1,3],en:[2,5],english:5,epsilon:[5,6],esp:[5,6],etc:[2,5],exampl:[1,2,5],experiment:2,fals:5,fast:2,file2text:[1,3],file:[5,6],files2df:[1,3],find:2,first:5,flag:[5,6],format:5,full:2,further:2,g:2,gen_aset:[1,3],gen_eps_minsampl:[1,3],gen_model:[1,3],gen_pset:[1,3],gen_row_align:[1,3],go:5,good:5,gradio:2,group:5,ha:[0,2],hand:5,have:5,help:2,here:2,how:1,html:[5,6],http:0,huggingfac:0,identifi:5,idf_typ:[5,6],imag:5,implement:2,index:1,inform:5,insert_spac:[1,3],instal:1,interfac:2,interpolate_pset:[1,3],introduct:1,introduec:2,ja:2,join:5,just:0,know:5,languag:2,languang:5,larger:5,later:5,laugnag:2,learn:2,left:5,limit:[1,5],line:5,lists2cmat:[1,3],loadtext:[1,3],look:5,machin:2,mai:5,md:[5,6],mdx_e2c:[1,3],method:0,mikee:0,min_sampl:[5,6],minimum:5,minut:2,miss:5,mix:5,mode:2,modul:[1,3],more:5,motiv:1,need:5,non:5,norm:[5,6],normal:5,now:0,number:5,one:0,onli:2,onlin:0,open:5,other:[2,
5],output:5,packag:[0,1,3],page:1,pair:[2,5],paragraph:2,particular:2,pdf:[5,6],per:2,permit:2,pip:0,pleas:5,plot_cmat:[1,3],plot_df:[1,3],posit:5,power:2,proced:5,process_upload:[1,3],properli:2,provid:2,publish:0,pure:5,pypi:0,python:2,qq:5,radiobe:[0,2,5,6],result:5,right:5,row:0,ru:2,save:5,search:1,seg_text:[1,3],select:5,sentenc:2,separ:5,should:5,shuffle_s:[1,3],sibl:5,slow:2,smaller:5,smatrix:[1,3],someth:5,space:0,srt:[5,6],submit:[0,5],submodul:[1,3],subsequ:5,suggest:[0,5],support:[2,5],tab:5,tabl:0,tend:5,term:2,testrun:0,text:[2,5],tf_type:[5,6],them:5,time:2,tmx:2,touch:5,track:2,translat:2,treat:5,trim_df:[1,3],two:2,txt:[5,6],unless:5,upload:5,us:[0,1],usag:1,valu:5,version:0,wa:2,welcom:2,what:5,when:[2,5],willing:2,wrong:5,yet:0,you:[2,5],zh:[2,5],zip:0},titles:["Examples","Welcome to radiobee\u2019s documentation!","Introduction","radiobee","radiobee package","How to use","\u4f7f\u7528\u8bf4\u660e"],titleterms:{"\u4f7f\u7528\u8bf4\u660e":6,align_s:4,align_text:4,amend_avec:4,app:4,cmat2tset:4,content:[1,4],docterm_scor:4,document:1,en2zh:4,en2zh_token:4,exampl:0,file2text:4,files2df:4,gen_aset:4,gen_eps_minsampl:4,gen_model:4,gen_pset:4,gen_row_align:4,how:5,indic:1,insert_spac:4,instal:0,interpolate_pset:4,introduct:2,limit:2,lists2cmat:4,loadtext:4,mdx_e2c:4,modul:4,motiv:2,packag:4,plot_cmat:4,plot_df:4,process_upload:4,radiobe:[1,3,4],s:1,seg_text:4,shuffle_s:4,smatrix:4,submodul:4,tabl:1,trim_df:4,us:5,usag:0,welcom:1}})
 
1
+ Search.setIndex({docnames:["examples","index","intro","modules","radiobee","userguide","userguide-zh"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["examples.rst","index.rst","intro.rst","modules.rst","radiobee.rst","userguide.rst","userguide-zh.rst"],objects:{},objnames:{},objtypes:{},terms:{"1":[5,6],"12":[5,6],"2":[5,6],"200":[5,6],"2000":[5,6],"3":2,"316287378":[5,6],"4":[5,6],"5":2,"500":[2,6],"8":[5,6],"\u4e00\u822c\u65e0\u9700\u7406\u4f1a\u8fd9\u4e9b\u53c2\u6570":6,"\u4e2d\u82f1\u975e\u7a7a\u884c\u9650\u5236\u5728":6,"\u4e3a\u4e2d\u82f1\u6587\u6df7\u5408\u6587\u672c\u53ca\u8bd5\u7740\u5206\u79bb\u4e2d\u82f1\u6587":6,"\u4e3a\u7a7a\u767d\u65f6":6,"\u4e86\u89e3\u8fd9\u4e9b\u5bf9\u9f50\u5de5\u5177":6,"\u4ee5\u5185":6,"\u4ee5\u540e\u53ef\u80fd\u4f1a\u652f\u6301":6,"\u4f18\u8d28\u5bf9":6,"\u4f7f\u7528\u8bf4\u660e":1,"\u5176\u4ed6\u8bed\u8a00\u5bf9\u7684\u5bf9\u9f50":6,"\u5219\u4f1a\u89c6":6,"\u5219\u9650\u5236\u5728":6,"\u53e6\u4e00\u65b9\u9762":6,"\u53ef\u4ee5\u53f3\u51fb\u62f7\u51fa\u56fe\u7684\u94fe\u63a5\u7528\u6d4f\u89c8\u5668\u72ec\u7acb\u8bbf\u95ee\u62f7\u51fa\u6765\u7684\u94fe\u63a5\u6216\u53f3\u51fb\u5b58\u76d8\u518d\u7528\u770b\u56fe\u7a0b\u5e8f\u6253\u5f00\u5b58\u76d8\u7684\u56fe\u6587\u4ef6":6,"\u548c":6,"\u5acc\u56fe\u592a\u5c0f\u7684\u8bdd":6,"\u5b58\u4e0b\u6709\u5173\u53c2\u6570\u67e5\u770b\u6216\u901a\u77e5\u5f00\u53d1\u8005":6,"\u5bf9\u7ea6\u97005\u5206\u949f":6,"\u5feb\u5bf9\u6a21\u5f0f\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":6,"\u662f":6,"\u6700\u5c0f":6,"\u7136\u540e\u8fdb\u884c\u5bf9\u9f50":6,"\u7684\u5b6a\u751f\u5144\u5f1f":6,"\u7684\u5efa\u8bae\u503c":6,"\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":[],"\u76ee\u524d\u4ec5\u652f\u6301\u7eaf\u6587\u672c\u6587\u4ef6\u4e0a\u8f7d":6,"\u7b2c\u4e8c\u6b21\u4e0a
\u8f7d\u6587\u4ef6\u524d\u8bf7\u70b9\u51fb":6,"\u7b49":6,"\u7b49\u683c\u5f0f":6,"\u82f1\u4e2d":6,"\u82f1\u4e2d\u5bf9\u9f50":6,"\u8bbe\u5927\u4e9b\u5219\u4f1a\u5f97\u5230\u5c11\u4e00\u4e9b\u5bf9\u9f50\u5bf9\u56e0\u4e3a\u53ef\u80fd\u9519\u5931\u4e86\u4e00\u4e9b":6,"\u8bbe\u5927\u4e9b\u6216":6,"\u8bbe\u5c0f\u4e9b\u53ef\u4ee5\u5f97\u5230\u66f4\u591a\u7684\u5bf9\u9f50\u5bf9\u4f46\u4e5f\u4f1a\u6709\u66f4\u591a":6,"\u8bbe\u5c0f\u4e9b\u6216":6,"\u8bef\u62a5\u5bf9":6,"\u8bf7\u52a0\u5165qq\u7fa4":6,"\u8fd0\u884c\u51fa\u9519\u65f6\u53ef\u4ee5\u70b9\u51fb":6,"\u9519\u8bef\u5224\u65ad\u4e3a\u5bf9\u9f50\u7684\u5bf9":6,"do":5,"new":5,As:0,For:0,If:[2,5],On:5,The:[2,5],To:5,about:5,ad:2,address:5,aim:2,align:[0,2,5,6],align_s:[1,3],align_text:[1,3],also:5,although:2,amend_avec:[1,3],an:2,app:[1,3],applic:2,approxim:2,ar:[2,5],attempt:5,been:[0,2],befor:5,better:5,blank:5,browser:5,built:0,bumblebe:[5,6],can:5,candid:5,cannot:0,cat:2,chines:5,clear:[5,6],click:[0,5],cmat2tset:[1,3],co:0,contact:2,content:3,copi:5,csv:[5,6],current:2,de:2,develop:[2,5],dl_type:[5,6],docterm_scor:[1,3],docx:[5,6],download:0,dual:2,dualtext:2,e:2,ebook:2,educ:2,en2zh:[1,3],en2zh_token:[1,3],en:[2,5],english:5,epsilon:[5,6],esp:[5,6],etc:[2,5],exampl:[1,2,5],experiment:2,fals:5,fast:2,file2text:[1,3],file:[5,6],files2df:[1,3],find:2,first:5,flag:[5,6],format:5,full:2,further:2,g:2,gen_aset:[1,3],gen_eps_minsampl:[1,3],gen_model:[1,3],gen_pset:[1,3],gen_row_align:[1,3],go:5,good:5,gradio:2,group:5,ha:[0,2],hand:5,have:5,help:2,here:[],how:1,html:[5,6],http:0,huggingfac:0,identifi:5,idf_typ:[5,6],imag:5,implement:2,index:1,inform:5,insert_spac:[1,3],instal:1,interfac:2,interpolate_pset:[1,3],introduct:1,introduec:2,ja:2,join:5,just:0,know:5,languag:2,languang:5,larger:5,later:5,laugnag:2,learn:2,left:5,limit:[1,5],line:5,lists2cmat:[1,3],loadtext:[1,3],look:5,machin:2,mai:5,mani:2,md:[5,6],mdx_e2c:[1,3],method:0,mikee:0,min_sampl:[5,6],minimum:5,minut:2,miss:5,mix:5,mode:2,modul:[1,3],more:5,motiv:1,need:
5,non:5,norm:[5,6],normal:5,now:0,number:5,one:0,onli:2,onlin:0,open:5,other:[2,5],output:5,packag:[0,1,3],page:1,pair:[2,5],paragraph:2,particular:2,pdf:[5,6],per:2,permit:2,pip:0,pleas:5,plot_cmat:[1,3],plot_df:[1,3],posit:5,power:2,proced:5,process_upload:[1,3],properli:2,provid:2,publish:0,pure:5,pypi:0,python:2,qq:5,radiobe:[0,2,5,6],result:5,right:5,row:0,ru:2,save:5,search:1,seg_text:[1,3],select:5,sentenc:2,separ:5,should:5,shuffle_s:[1,3],sibl:5,slow:2,smaller:5,smatrix:[1,3],someth:5,space:0,srt:[5,6],submit:[0,5],submodul:[1,3],subsequ:5,suggest:[0,5],support:[2,5],tab:5,tabl:0,tend:5,term:2,testrun:0,text:[2,5],tf_type:[5,6],them:5,time:2,tmx:2,touch:5,track:2,translat:2,treat:5,trim_df:[1,3],two:2,txt:[5,6],unless:5,upload:5,us:[0,1],usag:1,valu:5,version:0,wa:[],welcom:2,what:5,when:[2,5],willing:2,wrong:5,yet:0,you:[2,5],zh:[2,5],zip:0},titles:["Examples","Welcome to radiobee\u2019s documentation!","Introduction","radiobee","radiobee package","How to use","\u4f7f\u7528\u8bf4\u660e"],titleterms:{"\u4f7f\u7528\u8bf4\u660e":6,align_s:4,align_text:4,amend_avec:4,app:4,cmat2tset:4,content:[1,4],docterm_scor:4,document:1,en2zh:4,en2zh_token:4,exampl:0,file2text:4,files2df:4,gen_aset:4,gen_eps_minsampl:4,gen_model:4,gen_pset:4,gen_row_align:4,how:5,indic:1,insert_spac:4,instal:0,interpolate_pset:4,introduct:2,limit:2,lists2cmat:4,loadtext:4,mdx_e2c:4,modul:4,motiv:2,packag:4,plot_cmat:4,plot_df:4,process_upload:4,radiobe:[1,3,4],s:1,seg_text:4,shuffle_s:4,smatrix:4,submodul:4,tabl:1,trim_df:4,us:5,usag:0,welcom:1}})
docs/build/html/userguide-zh.html CHANGED
@@ -74,7 +74,7 @@
74
  <h1>使用说明<a class="headerlink" href="#id1" title="Permalink to this headline"></a></h1>
75
  <ul class="simple">
76
  <li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> 是 <code class="docutils literal notranslate"><span class="pre">bumblebee</span> <span class="pre">aligner</span></code> 的孪生兄弟。请加入qq群 <code class="docutils literal notranslate"><span class="pre">316287378</span></code> 了解这些对齐工具。</p></li>
77
- <li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 目前仅支持中英、英中对齐。</p></li>
78
  <li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 <code class="docutils literal notranslate"><span class="pre">docx</span></code>, <code class="docutils literal notranslate"><span class="pre">pdf</span></code>, <code class="docutils literal notranslate"><span class="pre">srt</span></code>, <code class="docutils literal notranslate"><span class="pre">html</span></code> 等格式。</p></li>
79
  <li><p><code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">2</span></code> 为空白时,<code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 则会视 <code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">1</span></code> 为中英文混合文本及试着分离中英文,然后进行对齐。</p></li>
80
  <li><p>英中、中英非空行限制在 <code class="docutils literal notranslate"><span class="pre">2000</span></code> 以内,其他语言对的对齐(<code class="docutils literal notranslate"><span class="pre">500</span></code> 对约需5分钟)则限制在 <code class="docutils literal notranslate"><span class="pre">200</span></code> 以内。</p></li>
 
74
  <h1>使用说明<a class="headerlink" href="#id1" title="Permalink to this headline"></a></h1>
75
  <ul class="simple">
76
  <li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> 是 <code class="docutils literal notranslate"><span class="pre">bumblebee</span> <span class="pre">aligner</span></code> 的孪生兄弟。请加入qq群 <code class="docutils literal notranslate"><span class="pre">316287378</span></code> 了解这些对齐工具。</p></li>
77
+ <li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 快对模式目前仅支持中英、英中对齐。</p></li>
78
  <li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 <code class="docutils literal notranslate"><span class="pre">docx</span></code>, <code class="docutils literal notranslate"><span class="pre">pdf</span></code>, <code class="docutils literal notranslate"><span class="pre">srt</span></code>, <code class="docutils literal notranslate"><span class="pre">html</span></code> 等格式。</p></li>
79
  <li><p><code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">2</span></code> 为空白时,<code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 则会视 <code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">1</span></code> 为中英文混合文本及试着分离中英文,然后进行对齐。</p></li>
80
  <li><p>英中、中英非空行限制在 <code class="docutils literal notranslate"><span class="pre">2000</span></code> 以内,其他语言对的对齐(<code class="docutils literal notranslate"><span class="pre">500</span></code> 对约需5分钟)则限制在 <code class="docutils literal notranslate"><span class="pre">200</span></code> 以内。</p></li>
docs/source/intro.rst CHANGED
@@ -3,19 +3,19 @@ Introduction
3
 
4
  ``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
5
 
6
- The aim here was to provide an interface to align two texts.
7
 
8
  The current implementation has been developed in Python 3 and ``gradio``.
9
 
10
  Motivation
11
  **********
12
 
13
- Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
14
 
15
  Limitations
16
  ***********
17
 
18
- Currently, only zh-en/en-zh pairs are supported for fast-track alignment although further pairs will be added if and when time permits.
19
  If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
20
 
21
- An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other laugnages pairs.
 
3
 
4
  ``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
5
 
6
+ The aim is to provide an interface to align two texts.
7
 
8
  The current implementation has been developed in Python 3 and ``gradio``.
9
 
10
  Motivation
11
  **********
12
 
13
+ Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find many applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
14
 
15
  Limitations
16
  ***********
17
 
18
+ Currently, only zh-en/en-zh pairs are supported for fast-track mode although further pairs will be added if and when time permits.
19
  If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
20
 
21
+ An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introduced for other language pairs.
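
Review note: the Limitations text above describes two modes, fast-track for zh-en/en-zh and an experimental slow-track for every other pair at roughly 500 pairs per 5 minutes. A minimal sketch of that rule follows. The names `FAST_TRACK_PAIRS`, `pick_track` and `estimate_slow_track_minutes` are hypothetical and do not come from the `radiobee` code base, and the linear time estimate is an assumption extrapolated from the single figure quoted in the docs.

```python
# Hypothetical helper names; not part of radiobee itself.
FAST_TRACK_PAIRS = {("en", "zh"), ("zh", "en")}

# "approximately 500 pairs per 5 minutes" -> about 100 pairs per minute
# (assumed to scale linearly, which the docs do not state).
SLOW_TRACK_PAIRS_PER_MINUTE = 500 / 5


def pick_track(lang1: str, lang2: str) -> str:
    """Return "fast" for zh-en/en-zh and "slow" for any other pair."""
    return "fast" if (lang1, lang2) in FAST_TRACK_PAIRS else "slow"


def estimate_slow_track_minutes(n_pairs: int) -> float:
    """Rough slow-track duration in minutes for n_pairs aligned pairs."""
    return n_pairs / SLOW_TRACK_PAIRS_PER_MINUTE
```

Under these assumptions a de-zh job would be routed to slow-track, and the estimate is only as good as the "approximately" in the documentation.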
docs/source/userguide-zh.rst CHANGED
@@ -3,7 +3,7 @@
3
 
4
  - ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
5
 
6
- - ``radiobee`` 目前仅支持中英、英中对齐。
7
  - ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
8
  - ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
9
  - 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
 
3
 
4
  - ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
5
 
6
+ - ``radiobee`` 快对模式目前仅支持中英、英中对齐。
7
  - ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
8
  - ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
9
  - 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
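
Review note: the last bullet above says that en-zh/zh-en input is limited to 2000 non-empty lines, while other language pairs are limited to 200. The sketch below shows one way such a check could look. `FAST_TRACK_LIMIT`, `SLOW_TRACK_LIMIT`, `count_nonempty_lines` and `within_limit` are illustrative names, not functions from `radiobee`, and treating whitespace-only lines as empty is an assumption.

```python
# Limits quoted in the user guide; the constant names are hypothetical.
FAST_TRACK_LIMIT = 2000  # en-zh / zh-en
SLOW_TRACK_LIMIT = 200   # all other language pairs


def count_nonempty_lines(text: str) -> int:
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in text.splitlines() if line.strip())


def within_limit(text: str, fast_track: bool) -> bool:
    """Check a text against the limit of the selected mode."""
    limit = FAST_TRACK_LIMIT if fast_track else SLOW_TRACK_LIMIT
    return count_nonempty_lines(text) <= limit
```

A caller would presumably run this check on each uploaded file before starting the alignment, so that oversized slow-track jobs are rejected early rather than after several minutes of work.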
gradio_queue.db CHANGED
Binary files a/gradio_queue.db and b/gradio_queue.db differ
 
img/plt.png CHANGED
radiobee/__main__.py CHANGED
@@ -309,7 +309,7 @@ if __name__ == "__main__":
309
  else:
310
  raise SystemExit(f"Tried {numb} times to no avail, giving up...")
311
 
312
- description = "WIP showcasing a blazing fast dualtext aligner, currrently supported language pairs: en-zh/zh-en"
313
 
314
  # moved to userguide.rst in docs
315
  article = dedent(
 
309
  else:
310
  raise SystemExit(f"Tried {numb} times to no avail, giving up...")
311
 
312
+ description = "WIP showcasing a blazing fast dualtext aligner, currently supported language pairs: en-zh/zh-en for fast-track; other language pairs are handled by slow-track"
313
 
314
  # moved to userguide.rst in docs
315
  article = dedent(
radiobee/detect.py CHANGED
@@ -27,12 +27,23 @@ def with_func_attrs(**attrs: Any) -> Callable:
27
  # @with_func_attrs(set_languages=None)
28
  # def detect(text: str) -> str:
29
  def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
30
- """Detect language via polyglot and fastlid."""
 
 
 
 
 
31
  # if not text.strip(): return "en"
 
 
 
 
 
 
32
  try:
33
- _ = [(elm.code[:2], elm.confidence) for elm in Detector(text).languages]
34
- detect.lang_conf = _
35
- lang, conf = _[0]
36
  except UnknownLanguage:
37
  if set_languages is None:
38
  def_lang = "en"
@@ -40,26 +51,31 @@ def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
40
  # def_lang = set_languages[-1]
41
  def_lang = set_languages[0]
42
  logger.warning(" UnknownLanguage exception: probably snippet too short, setting to %s", def_lang)
43
- lang, conf = def_lang, 0
44
  except Exception as exc:
45
  logger.error(exc)
46
- lang, conf = "en", 0
47
 
48
  del conf
49
 
50
- # set_languages = detect.set_languages
51
  if set_languages is None:
52
- return lang
 
 
 
 
 
 
 
 
 
 
 
 
53
 
54
  # set_languages is set
55
  if not isinstance(set_languages, (list, tuple)):
56
  logger.warning("set_languages (%s) ought to be a list/tuple", set_languages)
57
 
58
- if lang in set_languages:
59
- return lang
60
-
61
- # lang not in set_languages, use fastlid
62
- fastlid.set_languages = set_languages
63
- lang, _ = fastlid(text)
64
-
65
- return lang
 
27
  # @with_func_attrs(set_languages=None)
28
  # def detect(text: str) -> str:
29
  def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
30
+ """Detect language via polyglot and fastlid.
31
+
32
+ check first with fastlid; if conf < 0.3, check with polyglot
33
+
34
+ Alternative in detect_alt.py
35
+ """
36
  # if not text.strip(): return "en"
37
+ fastlid.set_languages = set_languages
38
+ lang, conf = fastlid(text)
39
+ detect.lang_conf = lang, conf
40
+ if conf >= 0.3 or lang in ["zh"]:
41
+ return lang
42
+
43
  try:
44
+ langs = [(elm.code[:2], elm.confidence) for elm in Detector(text).languages]
45
+ detect.lang_conf = langs
46
+ # lang, conf = _[0]
47
  except UnknownLanguage:
48
  if set_languages is None:
49
  def_lang = "en"
 
51
  # def_lang = set_languages[-1]
52
  def_lang = set_languages[0]
53
  logger.warning(" UnknownLanguage exception: probably snippet too short, setting to %s", def_lang)
54
+ langs = [(def_lang, 0)]
55
  except Exception as exc:
56
  logger.error(exc)
57
+ langs = [("en", 0)]
58
 
59
  del conf
60
 
61
+ # return first entry's lang
62
  if set_languages is None:
63
+ def_lang = langs[0][0]
64
+ else:
65
+ def_lang = "en"
66
+
67
+ # pick the first in Detector(text).languages
68
+
69
+ # just to silence pyright
70
+ # set_languages_: List[str] = [""] if set_languages is None else set_languages
71
+
72
+ for elm in langs:
73
+ if elm[0] in set_languages: # type: ignore
74
+ def_lang = elm[0]
75
+ break
76
 
77
  # set_languages is set
78
  if not isinstance(set_languages, (list, tuple)):
79
  logger.warning("set_languages (%s) ought to be a list/tuple", set_languages)
80
 
81
+ return def_lang
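
The two-stage detection in the new ``detect`` (a primary guess, then a fallback list when confidence is below 0.3) can be sketched without the actual detectors. This is a hedged illustration only: ``pick_language`` is a hypothetical helper, the ``primary`` and ``fallback`` arguments stand in for the results of ``fastlid(text)`` and ``Detector(text).languages``, and the explicit ``None`` guard before the membership test is an addition of this sketch rather than something the diff shows.

```python
from typing import List, Optional, Sequence, Tuple


def pick_language(
    primary: Tuple[str, float],
    fallback: Sequence[Tuple[str, float]],
    set_languages: Optional[List[str]] = None,
    threshold: float = 0.3,
) -> str:
    """Choose a language code from a primary guess and a fallback list.

    primary:  (code, confidence) from the first detector (stand-in for fastlid).
    fallback: [(code, confidence), ...] from the second detector
              (stand-in for polyglot's Detector), best guess first.
    """
    lang, conf = primary
    # Trust the primary guess when it is confident enough, or when it says "zh".
    if conf >= threshold or lang == "zh":
        return lang

    # Nothing usable from the fallback detector: use a default.
    if not fallback:
        return set_languages[0] if set_languages else "en"

    # No restriction given: take the fallback detector's first entry.
    if set_languages is None:
        return fallback[0][0]

    # Restriction given: first fallback entry that is allowed, else "en".
    for code, _ in fallback:
        if code in set_languages:
            return code
    return "en"
```

Passing the detector results in as plain tuples keeps the decision logic testable without installing either detection library.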
 
 
 
 
 
 
 
radiobee/detect_alt.py ADDED
@@ -0,0 +1,66 @@
1
+ """Detect language via polyglot and fastlid."""
2
+ # pylint: disable=
3
+
4
+ from typing import Any, Callable, List, Optional
5
+
6
+ from polyglot.text import Detector
7
+ import polyglot.detect.base
8
+ from polyglot.detect.base import UnknownLanguage
9
+ from fastlid import fastlid
10
+
11
+ from logzero import logger
12
+
13
+ polyglot.detect.base.logger.setLevel("ERROR")
14
+
15
+
16
+ def with_func_attrs(**attrs: Any) -> Callable:
17
+ """Define func_attrs."""
18
+
19
+ def with_attrs(fct: Callable) -> Callable:
20
+ for key, val in attrs.items():
21
+ setattr(fct, key, val)
22
+ return fct
23
+
24
+ return with_attrs
25
+
26
+
27
+ # @with_func_attrs(set_languages=None)
28
+ # def detect(text: str) -> str:
29
+ def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
30
+ """Detect language via polyglot and fastlid."""
31
+ # if not text.strip(): return "en"
32
+ try:
33
+ _ = [(elm.code[:2], elm.confidence) for elm in Detector(text).languages]
34
+ detect.lang_conf = _
35
+ lang, conf = _[0]
36
+ except UnknownLanguage:
37
+ if set_languages is None:
38
+ def_lang = "en"
39
+ else:
40
+ # def_lang = set_languages[-1]
41
+ def_lang = set_languages[0]
42
+ logger.warning(" UnknownLanguage exception: probably snippet too short, setting to %s", def_lang)
43
+ lang, conf = def_lang, 0
44
+ except Exception as exc:
45
+ logger.error(exc)
46
+ lang, conf = "en", 0
47
+
48
+ del conf
49
+
50
+ # if set_languages is None,
51
+ # trust polyglot.text.Detector
52
+ if set_languages is None:
53
+ return lang
54
+
55
+ # set_languages is set
56
+ if not isinstance(set_languages, (list, tuple)):
57
+ logger.warning("set_languages (%s) ought to be a list/tuple", set_languages)
58
+
59
+ if lang in set_languages:
60
+ return lang
61
+
62
+ # lang not in set_languages, use fastlid
63
+ fastlid.set_languages = set_languages
64
+ lang, _ = fastlid(text)
65
+
66
+ return lang
radiobee/gradiobee.py CHANGED
@@ -2,6 +2,7 @@
2
  # pylint: disable=invalid-name
3
  from pathlib import Path
4
  import platform
 
5
  from itertools import zip_longest
6
 
7
  # import tempfile
@@ -32,7 +33,7 @@ uname = platform.uname()
32
  HFSPACES = False
33
  if "amzn2" in uname.release: # on hf spaces
34
  HFSPACES = True
35
- import SentenceTransformer
36
  model_s = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
37
  sns.set()
38
  sns.set_style("darkgrid")
@@ -102,7 +103,7 @@ def gradiobee(
102
  # process file1/text1: split text1 to text1 text2 to zh-en
103
 
104
  len_max = 2000
105
- if not text2.strip():
106
  _ = [elm.strip() for elm in text1.splitlines() if elm.strip()]
107
  if not _: # essentially empty file1
108
  return error_msg("Nothing worthy of processing in file 1")
@@ -151,7 +152,9 @@ def gradiobee(
151
  # return df_trimmed, output_plot, file_dl, file_dl_xlsx, df_aligned
152
 
153
  # end if single file
 
154
  else: # file1 file 2: proceed
 
155
  lang1, _ = fastlid(text1)
156
  lang2, _ = fastlid(text2)
157
 
@@ -175,13 +178,14 @@ def gradiobee(
175
  df_trimmed = trim_df(df1)
176
  # --- end else single
177
 
 
 
178
  logger.debug("lang1: %s, lang2: %s", lang1, lang2)
179
  if debug:
180
- print("gradiobee ln 179 lang1: %s, lang2: %s" % (lang1, lang2))
181
  print("fast track? ", lang1 in lang_en_zh and lang2 in lang_en_zh)
182
 
183
  # fast track
184
- lang_en_zh = ["en", "zh"]
185
  if lang1 in lang_en_zh and lang2 in lang_en_zh:
186
  try:
187
  cmat = lists2cmat(
@@ -208,10 +212,11 @@ def gradiobee(
208
  try:
209
  vec1 = model_s.encode(list1)
210
  vec2 = model_s.encode(list2)
211
- cmat = vec1.dot(vec2.T)
 
212
  except Exception as exc:
213
  logger.error(exc)
214
- return error_msg(exc)
215
 
216
  tset = pd.DataFrame(cmat2tset(cmat))
217
  tset.columns = ["x", "y", "cos"]
 
2
  # pylint: disable=invalid-name
3
  from pathlib import Path
4
  import platform
5
+ import inspect
6
  from itertools import zip_longest
7
 
8
  # import tempfile
 
33
  HFSPACES = False
34
  if "amzn2" in uname.release: # on hf spaces
35
  HFSPACES = True
36
+ from sentence_transformers import SentenceTransformer
37
  model_s = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
38
  sns.set()
39
  sns.set_style("darkgrid")
 
103
  # process file1/text1: split text1 to text1 text2 to zh-en
104
 
105
  len_max = 2000
106
+ if not text2.strip(): # empty file2
107
  _ = [elm.strip() for elm in text1.splitlines() if elm.strip()]
108
  if not _: # essentially empty file1
109
  return error_msg("Nothing worthy of processing in file 1")
 
152
  # return df_trimmed, output_plot, file_dl, file_dl_xlsx, df_aligned
153
 
154
  # end if single file
155
+ # not single file
156
  else: # file1 file 2: proceed
157
+ fastlid.set_languages = None
158
  lang1, _ = fastlid(text1)
159
  lang2, _ = fastlid(text2)
160
 
 
178
  df_trimmed = trim_df(df1)
179
  # --- end else single
180
 
181
+ lang_en_zh = ["en", "zh"]
182
+
183
  logger.debug("lang1: %s, lang2: %s", lang1, lang2)
184
  if debug:
185
+ print("gradiobee.py ln 82 lang1: %s, lang2: %s" % (lang1, lang2))
186
  print("fast track? ", lang1 in lang_en_zh and lang2 in lang_en_zh)
187
 
188
  # fast track
 
189
  if lang1 in lang_en_zh and lang2 in lang_en_zh:
190
  try:
191
  cmat = lists2cmat(
 
212
  try:
213
  vec1 = model_s.encode(list1)
214
  vec2 = model_s.encode(list2)
215
+ # cmat = vec1.dot(vec2.T)
216
+ cmat = vec2.dot(vec1.T)
217
  except Exception as exc:
218
  logger.error(exc)
219
+ return error_msg(f"{exc}, {__file__} {inspect.currentframe().f_lineno}, period")
220
 
221
  tset = pd.DataFrame(cmat2tset(cmat))
222
  tset.columns = ["x", "y", "cos"]
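
The change from ``vec1.dot(vec2.T)`` to ``vec2.dot(vec1.T)`` swaps the orientation of the resulting matrix. The sketch below uses plain Python lists instead of the encoder output to show the shapes involved; ``dot_matrix`` is a hypothetical stand-in for the array ``.dot`` call, and the toy vectors are made up for illustration. The entries are dot products, which equal cosine similarities only when the vectors are unit-normalized.

```python
from typing import List, Sequence


def dot_matrix(
    rows: Sequence[Sequence[float]],
    cols: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return M with M[i][j] = rows[i] . cols[j] (like rows.dot(cols.T))."""
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


# Toy "embeddings": three items on side 1, two items on side 2.
vec1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec2 = [[1.0, 0.0], [0.0, 2.0]]

old_cmat = dot_matrix(vec1, vec2)  # like vec1.dot(vec2.T): len(vec1) x len(vec2)
new_cmat = dot_matrix(vec2, vec1)  # like vec2.dot(vec1.T): len(vec2) x len(vec1)

if __name__ == "__main__":
    print(len(old_cmat), len(old_cmat[0]))
    print(len(new_cmat), len(new_cmat[0]))
```

The two results are transposes of each other, so whichever one is used has to match the row/column convention that the downstream ``cmat2tset`` step expects.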
radiobee/text2lists.py CHANGED
@@ -7,6 +7,7 @@ from typing import Iterable, List, Optional, Tuple, Union # noqa
7
  import numpy as np
8
 
9
  # from fastlid import fastlid
 
10
  from logzero import logger
11
 
12
  from radiobee.lists2cmat import lists2cmat
@@ -21,9 +22,8 @@ def text2lists(
21
 
22
  Args:
23
  text: mixed text
24
- set_languages: default to ["en", "zh"];
25
- if set_languages is None:
26
- set_languages = ["en", "zh"]
27
 
28
  Attributes:
29
  cmat: correlation matrix (len(list_l) x len(list_r))
@@ -42,7 +42,19 @@ def text2lists(
42
 
43
  # set_languages default to ["en", "zh"]
44
  if set_languages is None:
45
- set_languages = ["en", "zh"]
 
 
 
 
 
 
 
 
 
 
 
 
46
 
47
  # fastlid.set_languages = set_languages
48
 
@@ -51,6 +63,7 @@ def text2lists(
51
 
52
  # lang0, _ = fastlid(text[:15000])
53
  lang0 = detect(text, set_languages)
 
54
  res = []
55
  left = True # start with left list1
56
 
 
7
  import numpy as np
8
 
9
  # from fastlid import fastlid
10
+ from polyglot.text import Detector
11
  from logzero import logger
12
 
13
  from radiobee.lists2cmat import lists2cmat
 
22
 
23
  Args:
24
  text: mixed text
25
+ set_languages: no default (open-end)
26
+ use polyglot.text.Detector to pick two languages
 
27
 
28
  Attributes:
29
  cmat: correlation matrix (len(list_l) x len(list_r))
 
42
 
43
  # set_languages default to ["en", "zh"]
44
  if set_languages is None:
45
+ lang12 = [elm.code for elm in Detector(text).languages]
46
+
47
+ # set_languages = ["en", "zh"]
48
+
49
+ # set 'un' to 'en'
50
+ # set_languages = ['en' if elm in ['un'] else elm for elm in lang12[:2]]
51
+ set_languages = []
52
+ for elm in lang12[:2]:
53
+ if elm in ["un"]:
54
+ logger.warning(" Unknown language, set to en")
55
+ set_languages.append("en")
56
+ else:
57
+ set_languages.append(elm)
58
 
59
  # fastlid.set_languages = set_languages
60
 
 
63
 
64
  # lang0, _ = fastlid(text[:15000])
65
  lang0 = detect(text, set_languages)
66
+
67
  res = []
68
  left = True # start with left list1
69
 
tests/test_detect.py CHANGED
@@ -21,6 +21,20 @@ def test_detect(test_input, expected):
21
 
22
  def test_detect_de():
23
  """Test detect de."""
24
- text = "4\u3000In der Beschränkung zeigt sich erst der Meister, / Und das Gesetz nur kann uns Freiheit geben. 参见http://www.business-it.nl/files/7d413a5dca62fc735a072b16fbf050b1-27.php." # noqa
25
- assert detect(text) == "de"
26
- assert detect(text, ["en", "zh"]) == "zh"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
21
 
22
  def test_detect_de():
23
  """Test detect de."""
24
+ text_de = "4\u3000In der Beschränkung zeigt sich erst der Meister, / Und das Gesetz nur kann uns Freiheit geben. 参见http://www.business-it.nl/files/7d413a5dca62fc735a072b16fbf050b1-27.php." # noqa
25
+ assert detect(text_de) == "de"
26
+ assert detect(text_de, ["en", "zh"]) == "zh"
27
+
28
+
29
+ def test_elm1():
30
+ """Test ——撰文:Thomas Gibbons-Neff和Fahim Abed,摄影:Jim Huylebroek=."""
31
+ elm1 = "——撰文:Thomas Gibbons-Neff和Fahim Abed,摄影:Jim Huylebroek"
32
+ assert detect(elm1) == "ja"
33
+ assert detect(elm1, ["en", "zh"]) == "zh"
34
+
35
+
36
+ def test_elm2():
37
+ """Test 在卢旺达基加利的一家牛奶吧。 JACQUES NKINZINGABO FOR THE NEW YORK TIMES."""
38
+ elm2 = "在卢旺达基加利的一家牛奶吧。 JACQUES NKINZINGABO FOR THE NEW YORK TIMES"
39
+ assert detect(elm2) == "zh"
40
+ assert detect(elm2, ["en", "zh"]) == "zh"
tests/test_text2lists.py CHANGED
@@ -4,18 +4,19 @@ from radiobee.loadtext import loadtext
4
  from radiobee.text2lists import text2lists
5
 
6
 
7
- def test_text2lists():
8
  """Test text2lists data\test-dual.txt."""
9
  filename = r"data\test-dual.txt"
10
  text = loadtext(filename) # noqa
11
  l1, l2 = text2lists(text)
12
  assert l2[0] in [""]
13
- assert "国际\n中\n双语" in l1[0]
 
14
 
15
 
16
  def test_shakespeare1000():
17
  """Separate first 1000.
18
-
19
  from pathlib import Path
20
  import zipfile
21
  dir_loc = r""
@@ -34,11 +35,11 @@ def test_shakespeare1000():
34
  break
35
  line += 1
36
  Path(f"data/shakespeare-zh-en-{numb_lines}.txt").write_text("\n".join(text1000), encoding="utf8")
37
-
38
  tset = cmat2test(cmat)
39
  df = pd.DataFrame(tset).rename(columns=dict(zip(range(0, 3), ['x', 'y', 'cos'])))
40
  plot_df(df)
41
-
42
  """
43
  # text1000a = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
44
  # text2000 = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
@@ -46,5 +47,12 @@ def test_shakespeare1000():
46
 
47
  # l1000a, l10002b = text2lists(text1000)
48
  # l2000a, l2000b = text2lists(text2000)
49
-
50
  l4000, r4000 = text2lists(text4000)
 
 
 
 
 
 
 
 
4
  from radiobee.text2lists import text2lists
5
 
6
 
7
+ def test_text2lists_dual1():
8
  """Test text2lists data\test-dual.txt."""
9
  filename = r"data\test-dual.txt"
10
  text = loadtext(filename) # noqa
11
  l1, l2 = text2lists(text)
12
  assert l2[0] in [""]
13
+ assert "国际\n中\n双语"[:2] in l1[0]
14
+ assert '2021' in l2[5]
15
 
16
 
17
  def test_shakespeare1000():
18
  """Separate first 1000.
19
+
20
  from pathlib import Path
21
  import zipfile
22
  dir_loc = r""
 
35
  break
36
  line += 1
37
  Path(f"data/shakespeare-zh-en-{numb_lines}.txt").write_text("\n".join(text1000), encoding="utf8")
38
+
39
  tset = cmat2test(cmat)
40
  df = pd.DataFrame(tset).rename(columns=dict(zip(range(0, 3), ['x', 'y', 'cos'])))
41
  plot_df(df)
42
+
43
  """
44
  # text1000a = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
45
  # text2000 = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
 
47
 
48
  # l1000a, l10002b = text2lists(text1000)
49
  # l2000a, l2000b = text2lists(text2000)
50
+
51
  l4000, r4000 = text2lists(text4000)
52
+
53
+
54
+ def test_test_dual2():
55
+ """Test data/test-dual.txt."""
56
+ test_dual = Path("data/test-dual.txt").read_text(encoding="utf8")
57
+
58
+ l_dual, r_dual = text2lists(test_dual)
tests/test_text2lists_bug2.py CHANGED
@@ -7,10 +7,8 @@ from radiobee.text2lists import text2lists
7
  def test_text2lists_bug2():
8
  """Test text2lists data\问题2测试文件.txt."""
9
  filename = r"data\问题2测试文件.txt"
10
- text = loadtext(filename) # noqa
11
- l1, l2 = text2lists(text)
12
- # assert l2[0] in [""]
13
- # assert "国际\n中\n双语" in l1[0]
14
 
15
- assert len(l1) == 4
16
- assert len(l2) == 5
 
7
  def test_text2lists_bug2():
8
  """Test text2lists data\问题2测试文件.txt."""
9
  filename = r"data\问题2测试文件.txt"
10
+ textbug2 = loadtext(filename) # noqa
11
+ l1, l2 = text2lists(textbug2)
 
 
12
 
13
+ assert len(l1) == 5
14
+ assert len(l2) == 4