mamed0v committed
Commit · 2aa8b90 · Parent(s): 43c6748
Loaded W2V model

README.md CHANGED
@@ -1,3 +1,84 @@
---
license: cc-by-4.0
---

# Turkmen Word2Vec Model 🇹🇲💬

Welcome to the Turkmen Word2Vec Model project! This open-source initiative aims to provide a robust word embedding solution for the Turkmen language. 🚀

## Introduction 🌟

The Turkmen Word2Vec Model is designed to create high-quality word embeddings for the Turkmen language. By combining Word2Vec with Turkmen-specific preprocessing, this project offers a valuable resource for a range of natural language processing tasks in Turkmen.

## Requirements 📋

To use this project, you'll need:
- Python 3.6+
- NLTK
- Gensim
- tqdm

## Installation 🔧

1. Clone this repository:
```
git clone https://github.com/yourusername/turkmen-word2vec.git
cd turkmen-word2vec
```

2. Create a virtual environment (optional but recommended):
```
python -m venv venv
source venv/bin/activate
```

3. Install the required packages (a minimal `requirements.txt` is sketched below):
```
pip install -r requirements.txt
```
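This commit does not include a `requirements.txt`; a minimal one covering the dependencies listed above would be:

```
nltk
gensim
tqdm
```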
## Usage 🚀

1. Prepare your Turkmen text data in a file (one sentence per line).

2. Update the `CONFIG` dictionary in `train_turkmen_word2vec.py` with your desired parameters and file paths.

3. Run the script:
```
python train_turkmen_word2vec.py
```

4. The script will preprocess the data, train the model, and save it along with its metadata.

5. You can then use the trained model in your projects:
```python
from gensim.models import Word2Vec

model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
similar_words = model.wv.most_similar("salam", topn=10)  # Example usage
```
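Note that training transliterates Turkmen-specific letters to ASCII and lowercases everything (see `REPLACEMENTS` in `train_turkmen_word2vec.py`), so vocabulary keys are plain lowercase ASCII. A minimal sketch of normalizing a query the same way before lookup; the `normalize` helper is illustrative and not part of this repository:

```python
from gensim.models import Word2Vec

# Mirror the training-time preprocessing: lowercase, then transliterate.
# `normalize` is a hypothetical helper based on the script's REPLACEMENTS table.
REPLACEMENTS = {'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's'}

def normalize(word: str) -> str:
    word = word.lower()
    for original, replacement in REPLACEMENTS.items():
        word = word.replace(original, replacement)
    return word

model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
query = normalize("Döwlet")   # -> "dowlet"
if query in model.wv:         # KeyedVectors supports `in` for vocabulary checks
    print(model.wv.most_similar(query, topn=10))
```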

## Configuration ⚙️

You can customize the model's behavior by modifying the `CONFIG` dictionary in `train_turkmen_word2vec.py`. Here are the available options (a reference configuration follows the list):

- `input_file`: Path to your input text file
- `output_dir`: Directory to save the model and metadata
- `model_name`: Name of your model
- `vector_size`: Dimensionality of the word vectors
- `window`: Maximum distance between the current and predicted word
- `min_count`: Minimum frequency of words to be included in the model
- `sg`: Training algorithm (1 for skip-gram, 0 for CBOW)
- `epochs`: Number of training epochs
- `negative`: Number of negative samples for negative sampling
- `sample`: Threshold for downsampling higher-frequency words
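For reference, the model shipped in this commit was trained with the values below (taken from the script's `CONFIG` and the saved metadata); the two paths are placeholders you must set yourself:

```python
CONFIG = {
    "input_file": "path/to/your/input/file.txt",  # placeholder
    "output_dir": "path/to/output/directory",     # placeholder
    "model_name": "turkmen_word2vec",
    "vector_size": 300,
    "window": 5,
    "min_count": 15,
    "sg": 1,        # skip-gram
    "epochs": 10,
    "negative": 15,
    "sample": 1e-5,
}
```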

## Contact 📬

Bahtiyar Mamedov - [@gargamelix](https://t.me/gargamelix) - [email protected]

Project Link: [https://huggingface.co/mamed0v/turkmen-word2vec](https://huggingface.co/mamed0v/turkmen-word2vec)

---

Happy embedding! 🎉 If you find this project useful, please give it a star ⭐️ and share it with others who might benefit from it.

tkm_w2v/turkmen_word2vec.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0dcfca41a2df6e9b33407c9e554ae6bb11f971ccc9d6b11284f2bf1be94cdec
size 4997173
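The model files in this commit are stored as Git LFS pointers like the one above, so a plain clone fetches only these small pointer files. To download the actual binaries, install Git LFS before cloning:

```
git lfs install
git clone https://huggingface.co/mamed0v/turkmen-word2vec
```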

tkm_w2v/turkmen_word2vec.model.syn1neg.npy ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:93ee559be4c33938e84af1f5c19cbc98c77701d9ba176edcb592ffd90b5d3d83
size 184434128

tkm_w2v/turkmen_word2vec.model.wv.vectors.npy ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0206dffa00d0a6f0130d6ef109ba557a81a62d6905a66926d88cda3c5445c285
size 184434128

tkm_w2v/turkmen_word2vec_metadata.txt ADDED
@@ -0,0 +1,7 @@
Model: turkmen_word2vec
Vocabulary size: 153695
Vector size: 300
Window size: 5
Min count: 15
Training epochs: 10
Final training loss: 80079792.0
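A note on the last line: gensim's `get_latest_training_loss()` reports a cumulative running total for the whole training run rather than a per-epoch figure, so a large absolute value here is expected and not by itself a sign of poor convergence.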

train_turkmen_word2vec.py ADDED
@@ -0,0 +1,181 @@
"""
Turkmen Word2Vec Model

This script preprocesses Turkmen text data and trains a Word2Vec model.
It's designed for open-source use and easy adaptation to other projects.

Requirements:
- Python 3.6+
- Dependencies: nltk, gensim, tqdm

Usage:
1. Prepare your Turkmen text data in a file (one sentence per line).
2. Update the CONFIG dictionary with your desired parameters.
3. Run the script: python train_turkmen_word2vec.py

The script will preprocess the data, train the model, and save it for future use.
"""

import re
import logging
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import List, Dict

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from tqdm import tqdm
from gensim.models import Word2Vec

# Configuration
CONFIG = {
    "input_file": "path/to/your/input/file.txt",
    "output_dir": "path/to/output/directory",
    "model_name": "turkmen_word2vec",
    "vector_size": 300,
    "window": 5,
    "min_count": 15,
    "sg": 1,
    "epochs": 10,
    "negative": 15,
    "sample": 1e-5,
}

# Setup logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Ensure required NLTK data is available
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Load stop words (using Turkish as a close approximation to Turkmen)
STOP_WORDS = set(stopwords.words('turkish'))

# Character replacements for Turkmen-specific letters. Applied before the
# regex below strips everything outside [a-zA-Z ], so both cases are mapped.
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

def preprocess_sentence(sentence: str) -> List[str]:
    """
    Preprocess a single sentence.

    Args:
        sentence (str): Input sentence.

    Returns:
        List[str]: List of preprocessed tokens.
    """
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)

    sentence = re.sub(r'[^a-zA-Z ]', ' ', sentence)
    sentence = sentence.lower()

    tokens = word_tokenize(sentence)
    return [word for word in tokens if word not in STOP_WORDS and len(word) > 2]

def process_chunk(chunk: List[str]) -> List[List[str]]:
    """
    Process a chunk of sentences in parallel.

    Args:
        chunk (List[str]): List of sentences to process.

    Returns:
        List[List[str]]: List of preprocessed sentences (as token lists).
    """
    return [preprocess_sentence(sentence) for sentence in chunk]

def load_and_preprocess(file_path: str) -> List[List[str]]:
    """
    Load and preprocess the input file.

    Args:
        file_path (str): Path to the input file.

    Returns:
        List[List[str]]: List of preprocessed sentences (as token lists).
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        sentences = f.readlines()

    # Guard against a zero chunk size when the corpus has fewer lines than CPU cores.
    chunk_size = max(1, len(sentences) // multiprocessing.cpu_count())
    chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

    processed_sentences = []
    with ProcessPoolExecutor() as executor:
        for chunk_result in tqdm(executor.map(process_chunk, chunks), total=len(chunks), desc="Preprocessing"):
            processed_sentences.extend(chunk_result)

    return processed_sentences

def train_word2vec(sentences: List[List[str]], params: Dict) -> Word2Vec:
    """
    Train the Word2Vec model.

    Args:
        sentences (List[List[str]]): Preprocessed sentences.
        params (Dict): Model parameters.

    Returns:
        Word2Vec: Trained Word2Vec model.
    """
    model = Word2Vec(sentences=sentences, vector_size=params['vector_size'],
                     window=params['window'], min_count=params['min_count'],
                     workers=multiprocessing.cpu_count(), sg=params['sg'],
                     epochs=params['epochs'], negative=params['negative'],
                     sample=params['sample'], compute_loss=True)
    return model

def save_model(model: Word2Vec, output_dir: str, model_name: str) -> None:
    """
    Save the trained model and its metadata.

    Args:
        model (Word2Vec): Trained Word2Vec model.
        output_dir (str): Directory to save the model.
        model_name (str): Name of the model.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    model_path = output_path / f"{model_name}.model"
    model.save(str(model_path))
    logging.info(f"Model saved to {model_path}")

    # Save model metadata
    metadata_path = output_path / f"{model_name}_metadata.txt"
    with open(metadata_path, 'w', encoding='utf-8') as f:
        f.write(f"Model: {model_name}\n")
        f.write(f"Vocabulary size: {len(model.wv.key_to_index)}\n")
        f.write(f"Vector size: {model.vector_size}\n")
        f.write(f"Window size: {model.window}\n")
        f.write(f"Min count: {model.min_count}\n")
        f.write(f"Training epochs: {model.epochs}\n")
        f.write(f"Final training loss: {model.get_latest_training_loss()}\n")
    logging.info(f"Model metadata saved to {metadata_path}")

def main():
    """Main execution function."""
    logging.info("Starting Turkmen Word2Vec model training")

    # Load and preprocess data
    processed_sentences = load_and_preprocess(CONFIG['input_file'])
    logging.info(f"Preprocessed {len(processed_sentences)} sentences")

    # Train model
    model = train_word2vec(processed_sentences, CONFIG)
    logging.info("Model training completed")

    # Save model and metadata
    save_model(model, CONFIG['output_dir'], CONFIG['model_name'])

    logging.info("Process completed successfully")

if __name__ == "__main__":
    main()
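
One practical note on the artifacts this script produces: the saved model keeps full training state (the `syn1neg.npy` file alone is ~184 MB in this commit). If you only need lookups, a lighter pattern is to persist just the `KeyedVectors`; a minimal sketch, where the `.kv` filename is an arbitrary choice rather than something this repository ships:

```python
from gensim.models import Word2Vec, KeyedVectors

# Load the full model once, then persist only the vectors + vocabulary.
model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
model.wv.save("tkm_w2v/turkmen_word2vec.kv")  # hypothetical slim artifact

# Later, inference needs only the KeyedVectors (no training state).
wv = KeyedVectors.load("tkm_w2v/turkmen_word2vec.kv")
print(wv.most_similar("salam", topn=5))
```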