
	# podcast-search

	A tool for searching the terapyon channel podcast.


	## Usage


	### Title list

	- Place the following file in the `store` folder
	  - `title-list-202301-202501.parquet`
	- The file has the following columns
	  - id: int
	  - date: str (e.g. 2023-01-09)
	  - length: int
	  - audio: str (audio file URL)
	  - title: str

	Example of the title list file:

	<div>
	<style scoped>
	.dataframe tbody tr th:only-of-type {
	vertical-align: middle;
	}

	.dataframe tbody tr th {
	vertical-align: top;
	}

	.dataframe thead th {
	text-align: right;
	}
	</style>
	<table border="1" class="dataframe">
	<thead>
	<tr style="text-align: right;">
	<th></th>
	<th>id</th>
	<th>date</th>
	<th>length</th>
	<th>audio</th>
	<th>title</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<th>0</th>
	<td>69</td>
	<td>2023-01-09</td>
	<td>20993616</td>
	<td>https://anchor.fm/s/14480e04/podcast/play/6323...</td>
	<td>#69 2023年新年挨拶から 2022年の振り返りと2023年の抱負</td>
	</tr>
	<tr>
	<th>1</th>
	<td>70</td>
	<td>2023-03-09</td>
	<td>103287296</td>
	<td>https://anchor.fm/s/14480e04/podcast/play/6621...</td>
	<td>#70 PyCon JP Association代表理事退任と今後の展望をIqbalさんと語る</td>
	</tr>
	<tr>
	<th>2</th>
	<td>71</td>
	<td>2023-03-22</td>
	<td>116393694</td>
	<td>https://anchor.fm/s/14480e04/podcast/play/6706...</td>
	<td>#71 hirokikyさんをゲストに自然言語処理系AI Chat GPT / Whisp...</td>
	</tr>
	<tr>
	<th>3</th>
	<td>72</td>
	<td>2023-05-04</td>
	<td>49642320</td>
	<td>https://anchor.fm/s/14480e04/podcast/play/6976...</td>
	<td>#72 PyCon US 2023 ひとり振り返り</td>
	</tr>
	<tr>
	<th>4</th>
	<td>73</td>
	<td>2023-05-24</td>
	<td>150643013</td>
	<td>https://anchor.fm/s/14480e04/podcast/play/7094...</td>
	<td>#73 Nyohoさんをゲストに Scratchからディープラーニングや数学の話</td>
	</tr>
	</tbody>
	</table>
	</div>
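
The column list can also be written down as a small check. The following is a minimal sketch, assuming each row is available as a plain Python dict; `TITLE_LIST_COLUMNS` and `validate_title_row` are hypothetical names for illustration and are not part of this repository. Writing the Parquet file itself (for example with pandas) is outside the sketch.

```python
from datetime import datetime

# Column names and types as listed in this README.
TITLE_LIST_COLUMNS = {
    "id": int,
    "date": str,    # date string such as "2023-01-09"
    "length": int,
    "audio": str,   # audio file URL
    "title": str,
}


def validate_title_row(row: dict) -> None:
    """Raise ValueError if a row does not match the documented columns."""
    if set(row) != set(TITLE_LIST_COLUMNS):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for name, expected_type in TITLE_LIST_COLUMNS.items():
        if not isinstance(row[name], expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    # strptime raises ValueError when the string is not a valid
    # year-month-day calendar date.
    datetime.strptime(row["date"], "%Y-%m-%d")


# First row of the example table (the audio URL is truncated there as well).
example_row = {
    "id": 69,
    "date": "2023-01-09",
    "length": 20993616,
    "audio": "https://anchor.fm/s/14480e04/podcast/play/6323...",
    "title": "#69 2023年新年挨拶から 2022年の振り返りと2023年の抱負",
}
validate_title_row(example_row)  # passes silently
```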

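	### Creating the transcript data

	- Create a `data` folder (at the same level as `src`)
	- Put the srt files in the `data` folder
	  - (Files named as follows allow the ID to be read from the filename)
	    - The extension is `.srt`
	    - The filename contains exactly one ID (an integer)
	    - The ID is separated from the rest of the name by `-` or `_` on both sides
	- Run the following script. It creates one `parquet` file per srt file in the `store` folder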

	```
	% python src/episode.py
	```
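
The filename rules above can be illustrated with a short sketch. This is not the code in `src/episode.py`; `extract_episode_id` is a hypothetical helper, and the example filenames are made up. It also assumes that the start and end of the filename count as boundaries, which the rules do not state explicitly.

```python
import re
from pathlib import Path


def extract_episode_id(filename: str) -> int:
    """Return the single integer ID found in an srt filename."""
    path = Path(filename)
    if path.suffix != ".srt":
        raise ValueError(f"not an .srt file: {filename}")
    # Split the name (without extension) on the two allowed separators.
    tokens = re.split(r"[-_]", path.stem)
    # Keep only tokens made up entirely of ASCII digits.
    ids = [t for t in tokens if re.fullmatch(r"[0-9]+", t)]
    if len(ids) != 1:
        raise ValueError(f"expected exactly one ID in {filename}, found {len(ids)}")
    return int(ids[0])


extract_episode_id("terapyon-69-transcript.srt")  # -> 69
extract_episode_id("ep_70_final.srt")             # -> 70
```

A name with two numeric parts, such as `a-1-2-b.srt`, raises `ValueError` in this sketch because the ID would be ambiguous.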

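	### Creating the database

	The following command builds the persistent DuckDB database in one go, from table creation through loading the three required datasets.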

	```
	% python src/store.py all
	```

	Details of the command above:

	- Create tables (create table)
	  - `python src/store.py create`
	- Insert the title list
	  - `python src/store.py podcastinsert`
	- Insert episodes and text
	  - `python src/store.py episodeinsert`
	- Vectorize (embedding)
	  - `python src/store.py embed`
	- Index the vector data
	  - `python src/store.py index`
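
The relationship between `all` and the individual subcommands can be sketched as below. This assumes that `all` runs the five steps in the order listed, which this README does not confirm; `expand_command` and `build_invocations` are hypothetical helpers and not part of `src/store.py`.

```python
# Subcommand names as listed in this README; the order is an assumption.
STEPS = ["create", "podcastinsert", "episodeinsert", "embed", "index"]


def expand_command(command: str) -> list:
    """Map `all` to every step, or a single step name to itself."""
    if command == "all":
        return list(STEPS)
    if command in STEPS:
        return [command]
    raise ValueError(f"unknown command: {command}")


def build_invocations(command: str) -> list:
    """Return the argument lists equivalent to one `store.py` call."""
    return [["python", "src/store.py", step] for step in expand_command(command)]


for argv in build_invocations("all"):
    print(" ".join(argv))
```

Running the steps one by one can be useful when only a later stage, such as `embed` or `index`, needs to be repeated.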


	### Search UI

	```
	% streamlit run src/app.py
	```

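	- Select one or more podcast titles. If none is selected, all episodes are searched
	- Enter the search terms in the text box
	- Ten candidate sentences are shown
	- Click the left edge of a table row to show its text below the table
	- Show the audio timing (minutes and seconds): not yet implemented
	- Play the audio from that timing in place: planned, but the implementation approach is undecided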
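
The search behaviour described above (optional title filter, ten candidates) can be sketched in plain Python. This is an illustration only: the actual app presumably queries the DuckDB vector index, and the record keys (`episode_id`, `start`, `text`, `vector`), the function names, and the use of cosine similarity are all assumptions. Turning the search words into a query vector is left out. `format_timestamp` shows one possible way to render the not-yet-implemented minutes and seconds display.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, sentences, selected_ids=(), limit=10):
    """Return the `limit` sentences most similar to the query.

    An empty `selected_ids` means no filter, i.e. all episodes are searched.
    """
    candidates = [
        s for s in sentences
        if not selected_ids or s["episode_id"] in selected_ids
    ]
    ranked = sorted(
        candidates,
        key=lambda s: cosine_similarity(query_vector, s["vector"]),
        reverse=True,
    )
    return ranked[:limit]


def format_timestamp(seconds):
    """Format a start offset in seconds as minutes:seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


format_timestamp(125)  # -> "2:05"
```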