Upload folder using huggingface_hub
- README.md +101 -0
- config.json +1 -0
- model.h5 +3 -0
- requirements.txt +8 -0
- scaler.joblib +3 -0
README.md
ADDED
@@ -0,0 +1,101 @@
---
language: multilingual
license: apache-2.0
datasets:
- voxceleb2
- timit
libraries:
- speechbrain
- librosa
tags:
- age-estimation
- speaker-characteristics
- speaker-recognition
- audio-regression
- voice-analysis
---

# Age Estimation Model

This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an ANN regressor to predict speaker age from audio input. It uses ECAPA embeddings together with Librosa acoustic features and was trained on both the VoxCeleb2 and TIMIT datasets.

## Model Performance Comparison

We provide several pre-trained models with different architectures and feature sets. Their performance compares as follows:

| Model | Architecture | Features | Training Data | Test MAE | Best For |
|-------|--------------|----------|---------------|----------|----------|
| VoxCeleb2 SVR (223) | SVR | ECAPA + Librosa (223-dim) | VoxCeleb2 | 7.88 years | Best performance on VoxCeleb2 |
| VoxCeleb2 SVR (192) | SVR | ECAPA only (192-dim) | VoxCeleb2 | 7.89 years | Lightweight deployment |
| TIMIT ANN (192) | ANN | ECAPA only (192-dim) | TIMIT | 4.95 years | Clean studio recordings |
| Combined ANN (223) | ANN | ECAPA + Librosa (223-dim) | VoxCeleb2 + TIMIT | 6.93 years | Best general performance |

You can find our other models [here](https://huggingface.co/griko).

## Model Details
- Input: audio file (converted to 16 kHz, mono, single channel)
- Output: predicted age in years (continuous value)
- Features:
  - SpeechBrain ECAPA-TDNN embedding (192 features)
  - Additional Librosa acoustic features (31 features)
- Regressor: artificial neural network, with hyperparameters tuned via Optuna
- Performance:
  - Combined test set: 6.93 years mean absolute error (MAE)

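The 16 kHz mono conversion noted above can be sketched with plain NumPy. This is a minimal illustration only: the released pipeline presumably resamples with a proper audio library (librosa or torchaudio), and the linear interpolation below is a simplification of real resampling.

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz (naive linear interpolation)."""
    if audio.ndim == 2:                    # (samples, channels) -> average channels
        audio = audio.mean(axis=1)
    if sr != target_sr:
        duration = len(audio) / sr
        n_out = int(round(duration * target_sr))
        t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
        t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(t_out, t_in, audio)
    return audio.astype(np.float32)

# Example: 1 second of stereo audio at 44.1 kHz -> 16000 mono samples
stereo = np.random.randn(44100, 2)
mono = to_16k_mono(stereo, 44100)
print(mono.shape)  # (16000,)
```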

## Features
1. SpeechBrain ECAPA-TDNN embeddings (192 dimensions)
2. Librosa acoustic features (31 dimensions):
   - 13 MFCCs
   - 13 delta MFCCs
   - Zero crossing rate
   - Spectral centroid
   - Spectral bandwidth
   - Spectral contrast
   - Spectral flatness

## Training Data
The model was trained on a combination of datasets:
- VoxCeleb2:
  - YouTube interview recordings
  - Age data from Wikidata and public sources
  - Voice activity detection applied
- TIMIT:
  - Studio-quality recordings
  - Original age annotations
- All audio preprocessed to 16 kHz, mono
## Installation

```bash
pip install "git+https://github.com/griko/voice-age-regression.git[full]"
```

## Usage

```python
from age_regressor import AgeRegressionPipeline

# Load the pipeline
regressor = AgeRegressionPipeline.from_pretrained(
    "griko/age_reg_ann_ecapa_librosa_combined"
)

# Single file prediction
result = regressor("path/to/audio.wav")
print(f"Predicted age: {result[0]:.1f} years")

# Batch prediction
results = regressor(["audio1.wav", "audio2.wav"])
print(f"Predicted ages: {[f'{age:.1f}' for age in results]} years")
```

## Limitations
- The model was trained on a mix of YouTube interviews and studio recordings
- Performance may vary across audio qualities and recording conditions
- Age predictions are approximate estimates, not exact measurements, and should not be used for medical or legal purposes

## Citation
If you use this model in your research, please cite:
```bibtex
TBD
```
config.json
ADDED
@@ -0,0 +1 @@
{"feature_names": ["0_speechbrain_embedding", "1_speechbrain_embedding", "2_speechbrain_embedding", "3_speechbrain_embedding", "4_speechbrain_embedding", "5_speechbrain_embedding", "6_speechbrain_embedding", "7_speechbrain_embedding", "8_speechbrain_embedding", "9_speechbrain_embedding", "10_speechbrain_embedding", "11_speechbrain_embedding", "12_speechbrain_embedding", "13_speechbrain_embedding", "14_speechbrain_embedding", "15_speechbrain_embedding", "16_speechbrain_embedding", "17_speechbrain_embedding", "18_speechbrain_embedding", "19_speechbrain_embedding", "20_speechbrain_embedding", "21_speechbrain_embedding", "22_speechbrain_embedding", "23_speechbrain_embedding", "24_speechbrain_embedding", "25_speechbrain_embedding", "26_speechbrain_embedding", "27_speechbrain_embedding", "28_speechbrain_embedding", "29_speechbrain_embedding", "30_speechbrain_embedding", "31_speechbrain_embedding", "32_speechbrain_embedding", "33_speechbrain_embedding", "34_speechbrain_embedding", "35_speechbrain_embedding", "36_speechbrain_embedding", "37_speechbrain_embedding", "38_speechbrain_embedding", "39_speechbrain_embedding", "40_speechbrain_embedding", "41_speechbrain_embedding", "42_speechbrain_embedding", "43_speechbrain_embedding", "44_speechbrain_embedding", "45_speechbrain_embedding", "46_speechbrain_embedding", "47_speechbrain_embedding", "48_speechbrain_embedding", "49_speechbrain_embedding", "50_speechbrain_embedding", "51_speechbrain_embedding", "52_speechbrain_embedding", "53_speechbrain_embedding", "54_speechbrain_embedding", "55_speechbrain_embedding", "56_speechbrain_embedding", "57_speechbrain_embedding", "58_speechbrain_embedding", "59_speechbrain_embedding", "60_speechbrain_embedding", "61_speechbrain_embedding", "62_speechbrain_embedding", "63_speechbrain_embedding", "64_speechbrain_embedding", "65_speechbrain_embedding", "66_speechbrain_embedding", "67_speechbrain_embedding", "68_speechbrain_embedding", "69_speechbrain_embedding", "70_speechbrain_embedding", 
"71_speechbrain_embedding", "72_speechbrain_embedding", "73_speechbrain_embedding", "74_speechbrain_embedding", "75_speechbrain_embedding", "76_speechbrain_embedding", "77_speechbrain_embedding", "78_speechbrain_embedding", "79_speechbrain_embedding", "80_speechbrain_embedding", "81_speechbrain_embedding", "82_speechbrain_embedding", "83_speechbrain_embedding", "84_speechbrain_embedding", "85_speechbrain_embedding", "86_speechbrain_embedding", "87_speechbrain_embedding", "88_speechbrain_embedding", "89_speechbrain_embedding", "90_speechbrain_embedding", "91_speechbrain_embedding", "92_speechbrain_embedding", "93_speechbrain_embedding", "94_speechbrain_embedding", "95_speechbrain_embedding", "96_speechbrain_embedding", "97_speechbrain_embedding", "98_speechbrain_embedding", "99_speechbrain_embedding", "100_speechbrain_embedding", "101_speechbrain_embedding", "102_speechbrain_embedding", "103_speechbrain_embedding", "104_speechbrain_embedding", "105_speechbrain_embedding", "106_speechbrain_embedding", "107_speechbrain_embedding", "108_speechbrain_embedding", "109_speechbrain_embedding", "110_speechbrain_embedding", "111_speechbrain_embedding", "112_speechbrain_embedding", "113_speechbrain_embedding", "114_speechbrain_embedding", "115_speechbrain_embedding", "116_speechbrain_embedding", "117_speechbrain_embedding", "118_speechbrain_embedding", "119_speechbrain_embedding", "120_speechbrain_embedding", "121_speechbrain_embedding", "122_speechbrain_embedding", "123_speechbrain_embedding", "124_speechbrain_embedding", "125_speechbrain_embedding", "126_speechbrain_embedding", "127_speechbrain_embedding", "128_speechbrain_embedding", "129_speechbrain_embedding", "130_speechbrain_embedding", "131_speechbrain_embedding", "132_speechbrain_embedding", "133_speechbrain_embedding", "134_speechbrain_embedding", "135_speechbrain_embedding", "136_speechbrain_embedding", "137_speechbrain_embedding", "138_speechbrain_embedding", "139_speechbrain_embedding", 
"140_speechbrain_embedding", "141_speechbrain_embedding", "142_speechbrain_embedding", "143_speechbrain_embedding", "144_speechbrain_embedding", "145_speechbrain_embedding", "146_speechbrain_embedding", "147_speechbrain_embedding", "148_speechbrain_embedding", "149_speechbrain_embedding", "150_speechbrain_embedding", "151_speechbrain_embedding", "152_speechbrain_embedding", "153_speechbrain_embedding", "154_speechbrain_embedding", "155_speechbrain_embedding", "156_speechbrain_embedding", "157_speechbrain_embedding", "158_speechbrain_embedding", "159_speechbrain_embedding", "160_speechbrain_embedding", "161_speechbrain_embedding", "162_speechbrain_embedding", "163_speechbrain_embedding", "164_speechbrain_embedding", "165_speechbrain_embedding", "166_speechbrain_embedding", "167_speechbrain_embedding", "168_speechbrain_embedding", "169_speechbrain_embedding", "170_speechbrain_embedding", "171_speechbrain_embedding", "172_speechbrain_embedding", "173_speechbrain_embedding", "174_speechbrain_embedding", "175_speechbrain_embedding", "176_speechbrain_embedding", "177_speechbrain_embedding", "178_speechbrain_embedding", "179_speechbrain_embedding", "180_speechbrain_embedding", "181_speechbrain_embedding", "182_speechbrain_embedding", "183_speechbrain_embedding", "184_speechbrain_embedding", "185_speechbrain_embedding", "186_speechbrain_embedding", "187_speechbrain_embedding", "188_speechbrain_embedding", "189_speechbrain_embedding", "190_speechbrain_embedding", "191_speechbrain_embedding", "zero_crossing_rate", "spectral_centroid", "spectral_bandwidth", "spectral_contrast", "spectral_flatness", "mfcc_0", "mfcc_1", "mfcc_2", "mfcc_3", "mfcc_4", "mfcc_5", "mfcc_6", "mfcc_7", "mfcc_8", "mfcc_9", "mfcc_10", "mfcc_11", "mfcc_12", "d_mfcc_0", "d_mfcc_1", "d_mfcc_2", "d_mfcc_3", "d_mfcc_4", "d_mfcc_5", "d_mfcc_6", "d_mfcc_7", "d_mfcc_8", "d_mfcc_9", "d_mfcc_10", "d_mfcc_11", "d_mfcc_12"], "model_type": "ann", "feature_set": "ecapa_librosa"}
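The `feature_names` list in `config.json` fixes the input column order: the 192 ECAPA embedding dimensions first, then the 31 Librosa features. A quick sanity-check sketch that reconstructs this order programmatically (the names below mirror the config content above):

```python
# Rebuild the expected 223-entry feature order declared in config.json
feature_names = [f"{i}_speechbrain_embedding" for i in range(192)]
feature_names += ["zero_crossing_rate", "spectral_centroid", "spectral_bandwidth",
                  "spectral_contrast", "spectral_flatness"]
feature_names += [f"mfcc_{i}" for i in range(13)]
feature_names += [f"d_mfcc_{i}" for i in range(13)]

config = {"feature_names": feature_names,
          "model_type": "ann",
          "feature_set": "ecapa_librosa"}
print(len(config["feature_names"]))  # 223
```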
model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8d6188ff5cd01ebd4140611f82c0cec1159819ba02efefa2fd64bbf592a5138c
size 309660
requirements.txt
ADDED
@@ -0,0 +1,8 @@
librosa
scikit-learn
pandas
soundfile
speechbrain
tensorflow
torch
torchaudio
scaler.joblib
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4a7cf3b23a79abaefb9bdc88822a3904247bf4a3800fe9db7b5cd0d169419b1d
size 12767