# Speaker-Dependent Acoustic-to-Articulatory Inversion Model

## Model Overview
This model performs speaker-dependent Acoustic-to-Articulatory Inversion (AAI), predicting articulatory trajectories from acoustic features. The neural network is built with PyTorch and uses BiLSTM (Bidirectional Long Short-Term Memory) layers to capture temporal dependencies in the acoustic data; a CNN then smooths the predicted trajectories to make them more natural. The input consists of multi-frame MFCC (Mel-Frequency Cepstral Coefficients) features, and the output is a set of predicted articulatory positions over time. The model is trained on 80% of the samples from a particular speaker and tested on the remaining 20% of samples from the same speaker.
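
The model card does not include feature-extraction code, so the following is only a minimal sketch of how multi-frame MFCC input could be assembled with librosa and numpy (both in the dependency list). The 13 MFCCs + deltas + delta-deltas (39 dimensions) stacked over an 11-frame context window is an assumption, chosen purely because 39 × 11 matches the model's 429-dimensional input.

```python
import numpy as np
import librosa

def extract_features(wav_path, context=5):
    """Build multi-frame MFCC features of shape (T, 429).

    NOTE: the 13 MFCC + delta + delta-delta layout (39 dims) and the
    11-frame context window are assumptions that merely match the
    429-dimensional input stated in this card.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)
    d1 = librosa.feature.delta(mfcc, order=1)            # (13, T)
    d2 = librosa.feature.delta(mfcc, order=2)            # (13, T)
    feats = np.vstack([mfcc, d1, d2]).T                  # (T, 39)

    # Stack each frame with `context` neighbours on either side.
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + len(feats)] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)               # (T, 429)
```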

## Intended Use
The model is designed for speech researchers and professionals who want to understand the relationship between speech acoustics and articulatory movements. It can be applied in linguistic research, speech synthesis, and speech therapy.

### Use Cases
1. **Speech Analysis:** Studying how different speech sounds relate to articulatory positions.
2. **Speech Synthesis:** Serving as part of systems that generate speech from articulatory features.
3. **Speech Therapy:** Analyzing articulatory trajectories for individuals with speech disorders.

## Dependencies
- python 3.7.3
- numpy 1.16.3
- pytorch 1.1.0
- scipy 1.2.1
- librosa 0.6.3
- matplotlib
- psutil

## Trained Datasets
Being speaker dependent, this model is trained on data from one particular speaker:
- mocha: http://data.cstr.ed.ac.uk/mocha/mjjn0

## Model Architecture
- **Hidden Dimension:** 400
- **Input Dimension:** 429 (acoustic features per frame)
- **Output Dimension:** 16 (articulatory trajectories)
- **Batch Size:** 8
- **BiLSTM Layers:** 2 bidirectional LSTM layers
- **CNN:** 1D CNN
- **Linear Layers:** Input and output layers with batch normalization
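
The card gives dimensions but no code, so here is a minimal PyTorch sketch consistent with the numbers above. The layer ordering, the depthwise `Conv1d` smoothing, its kernel size, and the class and argument names are assumptions; only the 429/400/16 dimensions and the 2-layer BiLSTM come from this card.

```python
import torch
import torch.nn as nn

class AAINet(nn.Module):
    """Sketch of a BiLSTM-based AAI network matching the listed sizes.

    Layer ordering, CNN kernel size, and padding are assumptions;
    only the dimensions are taken from the model card.
    """
    def __init__(self, input_dim=429, hidden_dim=400, output_dim=16):
        super().__init__()
        self.fc_in = nn.Linear(input_dim, hidden_dim)
        self.bn_in = nn.BatchNorm1d(hidden_dim)
        self.blstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.fc_out = nn.Linear(2 * hidden_dim, output_dim)
        # 1D CNN applied over time to smooth each predicted trajectory.
        self.smooth = nn.Conv1d(output_dim, output_dim, kernel_size=5,
                                padding=2, groups=output_dim, bias=False)

    def forward(self, x):                       # x: (batch, T, 429)
        h = self.fc_in(x)                       # (batch, T, 400)
        h = self.bn_in(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.blstm(h)                    # (batch, T, 800)
        y = self.fc_out(h)                      # (batch, T, 16)
        y = self.smooth(y.transpose(1, 2)).transpose(1, 2)
        return y                                # (batch, T, 16)
```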

The architecture is designed to accommodate smoothing of the articulatory trajectories in preprocessing, with customizable cutoff frequencies; a sketch of such a filter follows.
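
A minimal sketch of this kind of low-pass smoothing using scipy (in the dependency list). The 10 Hz default cutoff and the 100 Hz trajectory frame rate are assumptions; the card only states that the cutoff frequency is customizable.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_smooth(traj, cutoff_hz=10.0, frame_rate_hz=100.0, order=5):
    """Zero-phase low-pass filtering of each articulatory channel.

    traj: (T, n_channels) array. The default cutoff and frame rate
    are assumptions; only the customizable cutoff is stated in the card.
    """
    nyquist = frame_rate_hz / 2.0
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    return filtfilt(b, a, traj, axis=0)
```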

## Model Training
- **Optimizer:** Adam
- **Loss Function:** A combination of RMSE and Pearson correlation, capturing both error minimization and correlation maximization (see the sketch after this list).
- **Training Procedure:** Early stopping based on validation loss was employed to prevent overfitting, with the learning rate adjusted whenever the validation loss increased.
- **Epochs:** Trained over multiple epochs with batch updates and dynamic learning rate adjustments.
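
The card states that the loss combines RMSE and Pearson correlation but not how they are weighted, so the following sketch uses an equal-weight mix; `alpha` and the per-channel averaging are assumptions.

```python
import torch

def aai_loss(pred, target, alpha=0.5, eps=1e-8):
    """RMSE plus (1 - Pearson correlation), averaged over channels.

    pred, target: (batch, T, n_articulators). The weighting `alpha`
    is an assumption; the card only names the two components.
    """
    rmse = torch.sqrt(torch.mean((pred - target) ** 2))

    # Pearson correlation per articulator channel, then averaged.
    p = pred - pred.mean(dim=1, keepdim=True)
    t = target - target.mean(dim=1, keepdim=True)
    pcc = (p * t).sum(dim=1) / (
        torch.sqrt((p ** 2).sum(dim=1) * (t ** 2).sum(dim=1)) + eps)
    return alpha * rmse + (1.0 - alpha) * (1.0 - pcc.mean())
```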

## Evaluation
The model was evaluated on the held-out 20% of samples from the same speaker, with RMSE (Root Mean Square Error) and Pearson correlation (PCC) used to quantify performance; a sketch of the metric computation follows the results. The evaluation results are:
- **RMSE:** 0.761
- **PCC:** 0.810
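
For reference, a minimal sketch of how these metrics could be computed with numpy and scipy. Computing RMSE and PCC per articulator and averaging across the 16 channels is an assumption; the card reports only the final numbers.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(pred, target):
    """Channel-averaged RMSE and Pearson correlation.

    pred, target: (T, 16) arrays of predicted and measured
    trajectories. Per-channel averaging is an assumption.
    """
    rmse = np.sqrt(np.mean((pred - target) ** 2, axis=0)).mean()
    pcc = np.mean([pearsonr(pred[:, i], target[:, i])[0]
                   for i in range(pred.shape[1])])
    return rmse, pcc
```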