griko committed (verified)
Commit 49c777b · Parent: a1d56e8

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +101 -0
  2. config.json +1 -0
  3. model.h5 +3 -0
  4. requirements.txt +8 -0
  5. scaler.joblib +3 -0
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ language: multilingual
+ license: apache-2.0
+ datasets:
+ - voxceleb2
+ - timit
+ libraries:
+ - speechbrain
+ - librosa
+ tags:
+ - age-estimation
+ - speaker-characteristics
+ - speaker-recognition
+ - audio-regression
+ - voice-analysis
+ ---
+
+ # Age Estimation Model
+
+ This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an ANN regressor to predict speaker age from audio input. The model uses ECAPA embeddings and Librosa acoustic features, and was trained on both the VoxCeleb2 and TIMIT datasets.
+
+ ## Model Performance Comparison
+
+ We provide multiple pre-trained models with different architectures and feature sets. Here is a comparison of their performance:
+
+ | Model | Architecture | Features | Training Data | Test MAE | Best For |
+ |-------|-------------|----------|---------------|----------|----------|
+ | VoxCeleb2 SVR (223) | SVR | ECAPA + Librosa (223-dim) | VoxCeleb2 | 7.88 years | Best performance on VoxCeleb2 |
+ | VoxCeleb2 SVR (192) | SVR | ECAPA only (192-dim) | VoxCeleb2 | 7.89 years | Lightweight deployment |
+ | TIMIT ANN (192) | ANN | ECAPA only (192-dim) | TIMIT | 4.95 years | Clean studio recordings |
+ | Combined ANN (223) | ANN | ECAPA + Librosa (223-dim) | VoxCeleb2 + TIMIT | 6.93 years | Best general performance |
+
+ You may find other models [here](https://huggingface.co/griko).
+
+ ## Model Details
+ - Input: Audio file (automatically converted to 16 kHz, mono)
+ - Output: Predicted age in years (continuous value)
+ - Features:
+   - SpeechBrain ECAPA-TDNN embedding [192 features] (see the extraction sketch below)
+   - Additional Librosa features [31 features]
+ - Regressor: Artificial neural network, hyperparameters tuned with Optuna
+ - Performance:
+   - Combined test set: 6.93 years Mean Absolute Error (MAE)
+
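+ The packaged pipeline performs preprocessing and embedding extraction internally. Purely as an illustration, here is a minimal sketch of how a 192-dim ECAPA embedding can be computed with SpeechBrain, assuming the standard `speechbrain/spkrec-ecapa-voxceleb` checkpoint (this repository does not confirm which checkpoint the pipeline uses):
+
+ ```python
+ import torch
+ import torchaudio
+ # On speechbrain < 1.0 this import lives in speechbrain.pretrained instead
+ from speechbrain.inference.speaker import EncoderClassifier
+
+ classifier = EncoderClassifier.from_hparams(
+     source="speechbrain/spkrec-ecapa-voxceleb"  # assumed checkpoint
+ )
+
+ # Load audio and convert to 16 kHz mono, matching the model's input format
+ signal, sr = torchaudio.load("path/to/audio.wav")
+ signal = signal.mean(dim=0, keepdim=True)
+ signal = torchaudio.functional.resample(signal, sr, 16000)
+
+ embedding = classifier.encode_batch(signal).squeeze()  # shape: (192,)
+ ```
+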
+ ## Features
+ 1. SpeechBrain ECAPA-TDNN embeddings (192 dimensions)
+ 2. Librosa acoustic features (31 dimensions, sketched below):
+    - 13 MFCCs
+    - 13 Delta MFCCs
+    - Zero crossing rate
+    - Spectral centroid
+    - Spectral bandwidth
+    - Spectral contrast
+    - Spectral flatness
+
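+ The pipeline computes these features for you; the sketch below shows how the 31 Librosa values could be derived. The mean-over-frames aggregation and the reduction of spectral contrast to a single value are assumptions based on the feature names in `config.json`, not a confirmed reimplementation:
+
+ ```python
+ import numpy as np
+ import librosa
+
+ y, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
+
+ mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
+ d_mfcc = librosa.feature.delta(mfcc)                 # (13, frames)
+ zcr = librosa.feature.zero_crossing_rate(y)
+ centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
+ bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
+ contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
+ flatness = librosa.feature.spectral_flatness(y=y)
+
+ # 5 scalar features + 13 MFCC means + 13 delta-MFCC means = 31 values,
+ # ordered as in config.json's "feature_names"
+ features = np.concatenate([
+     [zcr.mean()], [centroid.mean()], [bandwidth.mean()],
+     [contrast.mean()], [flatness.mean()],
+     mfcc.mean(axis=1), d_mfcc.mean(axis=1),
+ ])
+ ```
+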
+ ## Training Data
+ The model was trained on a combination of datasets:
+ - VoxCeleb2:
+   - YouTube interview recordings
+   - Age data from Wikidata and public sources
+   - Voice activity detection applied
+ - TIMIT:
+   - Studio-quality recordings
+   - Original age annotations
+ - All audio preprocessed to 16 kHz, mono
+
+ ## Installation
+
+ ```bash
+ pip install "git+https://github.com/griko/voice-age-regression.git[full]"
+ ```
+
+ ## Usage
+
+ ```python
+ from age_regressor import AgeRegressionPipeline
+
+ # Load the pipeline
+ regressor = AgeRegressionPipeline.from_pretrained(
+     "griko/age_reg_ann_ecapa_librosa_combined"
+ )
+
+ # Single file prediction
+ result = regressor("path/to/audio.wav")
+ print(f"Predicted age: {result[0]:.1f} years")
+
+ # Batch prediction
+ results = regressor(["audio1.wav", "audio2.wav"])
+ print(f"Predicted ages: {[f'{age:.1f}' for age in results]} years")
+ ```
+
+ ## Limitations
+ - The model was trained on a mix of YouTube interviews and studio recordings
+ - Performance may vary across audio qualities and recording conditions
+ - Age predictions are approximate estimates, not exact measurements, and should not be used for medical or legal purposes
+
+ ## Citation
+ If you use this model in your research, please cite:
+ ```bibtex
+ TBD
+ ```
config.json ADDED
@@ -0,0 +1 @@
+ {"feature_names": ["0_speechbrain_embedding", "1_speechbrain_embedding", "2_speechbrain_embedding", "3_speechbrain_embedding", "4_speechbrain_embedding", "5_speechbrain_embedding", "6_speechbrain_embedding", "7_speechbrain_embedding", "8_speechbrain_embedding", "9_speechbrain_embedding", "10_speechbrain_embedding", "11_speechbrain_embedding", "12_speechbrain_embedding", "13_speechbrain_embedding", "14_speechbrain_embedding", "15_speechbrain_embedding", "16_speechbrain_embedding", "17_speechbrain_embedding", "18_speechbrain_embedding", "19_speechbrain_embedding", "20_speechbrain_embedding", "21_speechbrain_embedding", "22_speechbrain_embedding", "23_speechbrain_embedding", "24_speechbrain_embedding", "25_speechbrain_embedding", "26_speechbrain_embedding", "27_speechbrain_embedding", "28_speechbrain_embedding", "29_speechbrain_embedding", "30_speechbrain_embedding", "31_speechbrain_embedding", "32_speechbrain_embedding", "33_speechbrain_embedding", "34_speechbrain_embedding", "35_speechbrain_embedding", "36_speechbrain_embedding", "37_speechbrain_embedding", "38_speechbrain_embedding", "39_speechbrain_embedding", "40_speechbrain_embedding", "41_speechbrain_embedding", "42_speechbrain_embedding", "43_speechbrain_embedding", "44_speechbrain_embedding", "45_speechbrain_embedding", "46_speechbrain_embedding", "47_speechbrain_embedding", "48_speechbrain_embedding", "49_speechbrain_embedding", "50_speechbrain_embedding", "51_speechbrain_embedding", "52_speechbrain_embedding", "53_speechbrain_embedding", "54_speechbrain_embedding", "55_speechbrain_embedding", "56_speechbrain_embedding", "57_speechbrain_embedding", "58_speechbrain_embedding", "59_speechbrain_embedding", "60_speechbrain_embedding", "61_speechbrain_embedding", "62_speechbrain_embedding", "63_speechbrain_embedding", "64_speechbrain_embedding", "65_speechbrain_embedding", "66_speechbrain_embedding", "67_speechbrain_embedding", "68_speechbrain_embedding", "69_speechbrain_embedding", "70_speechbrain_embedding", "71_speechbrain_embedding", "72_speechbrain_embedding", "73_speechbrain_embedding", "74_speechbrain_embedding", "75_speechbrain_embedding", "76_speechbrain_embedding", "77_speechbrain_embedding", "78_speechbrain_embedding", "79_speechbrain_embedding", "80_speechbrain_embedding", "81_speechbrain_embedding", "82_speechbrain_embedding", "83_speechbrain_embedding", "84_speechbrain_embedding", "85_speechbrain_embedding", "86_speechbrain_embedding", "87_speechbrain_embedding", "88_speechbrain_embedding", "89_speechbrain_embedding", "90_speechbrain_embedding", "91_speechbrain_embedding", "92_speechbrain_embedding", "93_speechbrain_embedding", "94_speechbrain_embedding", "95_speechbrain_embedding", "96_speechbrain_embedding", "97_speechbrain_embedding", "98_speechbrain_embedding", "99_speechbrain_embedding", "100_speechbrain_embedding", "101_speechbrain_embedding", "102_speechbrain_embedding", "103_speechbrain_embedding", "104_speechbrain_embedding", "105_speechbrain_embedding", "106_speechbrain_embedding", "107_speechbrain_embedding", "108_speechbrain_embedding", "109_speechbrain_embedding", "110_speechbrain_embedding", "111_speechbrain_embedding", "112_speechbrain_embedding", "113_speechbrain_embedding", "114_speechbrain_embedding", "115_speechbrain_embedding", "116_speechbrain_embedding", "117_speechbrain_embedding", "118_speechbrain_embedding", "119_speechbrain_embedding", "120_speechbrain_embedding", "121_speechbrain_embedding", "122_speechbrain_embedding", "123_speechbrain_embedding", "124_speechbrain_embedding", 
"125_speechbrain_embedding", "126_speechbrain_embedding", "127_speechbrain_embedding", "128_speechbrain_embedding", "129_speechbrain_embedding", "130_speechbrain_embedding", "131_speechbrain_embedding", "132_speechbrain_embedding", "133_speechbrain_embedding", "134_speechbrain_embedding", "135_speechbrain_embedding", "136_speechbrain_embedding", "137_speechbrain_embedding", "138_speechbrain_embedding", "139_speechbrain_embedding", "140_speechbrain_embedding", "141_speechbrain_embedding", "142_speechbrain_embedding", "143_speechbrain_embedding", "144_speechbrain_embedding", "145_speechbrain_embedding", "146_speechbrain_embedding", "147_speechbrain_embedding", "148_speechbrain_embedding", "149_speechbrain_embedding", "150_speechbrain_embedding", "151_speechbrain_embedding", "152_speechbrain_embedding", "153_speechbrain_embedding", "154_speechbrain_embedding", "155_speechbrain_embedding", "156_speechbrain_embedding", "157_speechbrain_embedding", "158_speechbrain_embedding", "159_speechbrain_embedding", "160_speechbrain_embedding", "161_speechbrain_embedding", "162_speechbrain_embedding", "163_speechbrain_embedding", "164_speechbrain_embedding", "165_speechbrain_embedding", "166_speechbrain_embedding", "167_speechbrain_embedding", "168_speechbrain_embedding", "169_speechbrain_embedding", "170_speechbrain_embedding", "171_speechbrain_embedding", "172_speechbrain_embedding", "173_speechbrain_embedding", "174_speechbrain_embedding", "175_speechbrain_embedding", "176_speechbrain_embedding", "177_speechbrain_embedding", "178_speechbrain_embedding", "179_speechbrain_embedding", "180_speechbrain_embedding", "181_speechbrain_embedding", "182_speechbrain_embedding", "183_speechbrain_embedding", "184_speechbrain_embedding", "185_speechbrain_embedding", "186_speechbrain_embedding", "187_speechbrain_embedding", "188_speechbrain_embedding", "189_speechbrain_embedding", "190_speechbrain_embedding", "191_speechbrain_embedding", "zero_crossing_rate", "spectral_centroid", "spectral_bandwidth", "spectral_contrast", "spectral_flatness", "mfcc_0", "mfcc_1", "mfcc_2", "mfcc_3", "mfcc_4", "mfcc_5", "mfcc_6", "mfcc_7", "mfcc_8", "mfcc_9", "mfcc_10", "mfcc_11", "mfcc_12", "d_mfcc_0", "d_mfcc_1", "d_mfcc_2", "d_mfcc_3", "d_mfcc_4", "d_mfcc_5", "d_mfcc_6", "d_mfcc_7", "d_mfcc_8", "d_mfcc_9", "d_mfcc_10", "d_mfcc_11", "d_mfcc_12"], "model_type": "ann", "feature_set": "ecapa_librosa"}
model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d6188ff5cd01ebd4140611f82c0cec1159819ba02efefa2fd64bbf592a5138c
+ size 309660
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ librosa
+ scikit-learn
+ pandas
+ soundfile
+ speechbrain
+ tensorflow
+ torch
+ torchaudio
scaler.joblib ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a7cf3b23a79abaefb9bdc88822a3904247bf4a3800fe9db7b5cd0d169419b1d
+ size 12767
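
Both model.h5 and scaler.joblib are stored with Git LFS, so the blobs above are only pointer files; the actual binaries are resolved on download. A short sketch of fetching them with huggingface_hub (the pipeline's `from_pretrained` shown in the Usage section handles this automatically):

```python
from huggingface_hub import hf_hub_download

# Resolves the LFS pointers to the real files (309,660 and 12,767 bytes)
model_path = hf_hub_download(
    repo_id="griko/age_reg_ann_ecapa_librosa_combined", filename="model.h5"
)
scaler_path = hf_hub_download(
    repo_id="griko/age_reg_ann_ecapa_librosa_combined", filename="scaler.joblib"
)
```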