mamed0v committed
Commit 2aa8b90 · 1 Parent(s): 43c6748

Loaded W2V model
README.md CHANGED
@@ -1,3 +1,84 @@
 ---
 license: cc-by-4.0
 ---
+ # Turkmen Word2Vec Model 🇹🇲💬
+
+ Welcome to the Turkmen Word2Vec Model project! This open-source initiative aims to provide a robust word embedding solution for the Turkmen language. 🚀
+
+ ## Introduction 🌟
+
+ The Turkmen Word2Vec Model is designed to create high-quality word embeddings for the Turkmen language. By leveraging the power of Word2Vec and incorporating Turkmen-specific preprocessing, this project offers a valuable resource for various natural language processing tasks in Turkmen.
+
+ ## Requirements 📋
+
+ To use this project, you'll need:
+
+ - Python 3.6+
+ - NLTK
+ - Gensim
+ - tqdm
+
+ ## Installation 🔧
+
+ 1. Clone this repository:
+ ```
+ git clone https://github.com/yourusername/turkmen-word2vec.git
+ cd turkmen-word2vec
+ ```
+
+ 2. Create a virtual environment (optional but recommended):
+ ```
+ python -m venv venv
+ source venv/bin/activate
+ ```
+
+ 3. Install the required packages (a sketch of a minimal requirements file follows this list):
+ ```
+ pip install -r requirements.txt
+ ```
+
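The `requirements.txt` file itself is not part of this commit. A minimal sketch covering only the dependencies listed under Requirements (unpinned; add version pins as your environment requires) might look like:

```
nltk
gensim
tqdm
```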
+ ## Usage 🚀
+
+ 1. Prepare your Turkmen text data in a file (one sentence per line).
+
+ 2. Update the `CONFIG` dictionary in `train_turkmen_word2vec.py` with your desired parameters and file paths.
+
+ 3. Run the script:
+ ```
+ python train_turkmen_word2vec.py
+ ```
+
+ 4. The script will preprocess the data, train the model, and save it along with its metadata.
+
+ 5. You can then use the trained model in your projects (a further usage sketch follows this list):
+ ```python
+ from gensim.models import Word2Vec
+
+ model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
+ similar_words = model.wv.most_similar("salam", topn=10) # Example usage
+ ```
+
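Beyond `most_similar`, the loaded model exposes the standard Gensim `KeyedVectors` operations. A minimal sketch, assuming the model path from step 5; the example words are placeholders and must actually exist in the trained vocabulary:

```python
from gensim.models import Word2Vec

model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")

# Raw 300-dimensional vector for a single word (raises KeyError if out of vocabulary)
vector = model.wv["salam"]

# Cosine similarity between two in-vocabulary words ("dost" is a hypothetical example)
score = model.wv.similarity("salam", "dost")

# Guard against out-of-vocabulary lookups before accessing a vector
if "kitap" in model.wv:
    print(model.wv["kitap"][:5])
```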
+ ## Configuration ⚙️
+
+ You can customize the model's behavior by modifying the `CONFIG` dictionary in `train_turkmen_word2vec.py`. Here are the available options (a filled-in example follows the list):
+
+ - `input_file`: Path to your input text file
+ - `output_dir`: Directory to save the model and metadata
+ - `model_name`: Name of your model
+ - `vector_size`: Dimensionality of the word vectors
+ - `window`: Maximum distance between the current and predicted word
+ - `min_count`: Minimum frequency of words to be included in the model
+ - `sg`: Training algorithm (1 for skip-gram, 0 for CBOW)
+ - `epochs`: Number of training epochs
+ - `negative`: Number of negative samples for negative sampling
+ - `sample`: Threshold for downsampling higher-frequency words
+
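For reference, a filled-in `CONFIG` might look like the sketch below. The `input_file` path is a hypothetical placeholder; the remaining values simply mirror the defaults shipped in `train_turkmen_word2vec.py`:

```python
CONFIG = {
    "input_file": "data/turkmen_sentences.txt",  # hypothetical path: one sentence per line
    "output_dir": "tkm_w2v",                     # directory used for the published model files
    "model_name": "turkmen_word2vec",
    "vector_size": 300,   # dimensionality of the word vectors
    "window": 5,          # max distance between current and predicted word
    "min_count": 15,      # drop words occurring fewer than 15 times
    "sg": 1,              # 1 = skip-gram, 0 = CBOW
    "epochs": 10,
    "negative": 15,       # negative samples per positive example
    "sample": 1e-5,       # downsampling threshold for frequent words
}
```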
+ ## Contact 📬
+
+ Bahtiyar Mamedov - [@gargamelix](https://t.me/gargamelix) - [email protected]
+
+ Project Link: [https://huggingface.co/mamed0v/turkmen-word2vec](https://huggingface.co/mamed0v/turkmen-word2vec)
+
+ ---
+
+ Happy embedding! 🎉 If you find this project useful, please give it a star ⭐️ and share it with others who might benefit from it.
tkm_w2v/turkmen_word2vec.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e0dcfca41a2df6e9b33407c9e554ae6bb11f971ccc9d6b11284f2bf1be94cdec
+ size 4997173
tkm_w2v/turkmen_word2vec.model.syn1neg.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93ee559be4c33938e84af1f5c19cbc98c77701d9ba176edcb592ffd90b5d3d83
+ size 184434128
tkm_w2v/turkmen_word2vec.model.wv.vectors.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0206dffa00d0a6f0130d6ef109ba557a81a62d6905a66926d88cda3c5445c285
+ size 184434128
tkm_w2v/turkmen_word2vec_metadata.txt ADDED
@@ -0,0 +1,7 @@
+ Model: turkmen_word2vec
+ Vocabulary size: 153695
+ Vector size: 300
+ Window size: 5
+ Min count: 15
+ Training epochs: 10
+ Final training loss: 80079792.0
train_turkmen_word2vec.py ADDED
@@ -0,0 +1,181 @@
+ """
+ Turkmen Word2Vec Model
+
+ This script preprocesses Turkmen text data and trains a Word2Vec model.
+ It's designed for open-source use and easy adaptation to other projects.
+
+ Requirements:
+ - Python 3.6+
+ - Dependencies: nltk, gensim, tqdm
+
+ Usage:
+ 1. Prepare your Turkmen text data in a file (one sentence per line).
+ 2. Update the CONFIG dictionary with your desired parameters.
+ 3. Run the script: python train_turkmen_word2vec.py
+
+ The script will preprocess the data, train the model, and save it for future use.
+ """
+
+ import re
+ import logging
+ import multiprocessing
+ from pathlib import Path
+ from typing import List, Dict, Tuple
+
+ import nltk
+ from nltk.tokenize import word_tokenize
+ from nltk.corpus import stopwords
+ from tqdm import tqdm
+ from concurrent.futures import ProcessPoolExecutor
+ from gensim.models import Word2Vec
+
+ # Configuration
+ CONFIG = {
+     "input_file": "path/to/your/input/file.txt",
+     "output_dir": "path/to/output/directory",
+     "model_name": "turkmen_word2vec",
+     "vector_size": 300,
+     "window": 5,
+     "min_count": 15,
+     "sg": 1,
+     "epochs": 10,
+     "negative": 15,
+     "sample": 1e-5,
+ }
+
+ # Setup logging
+ logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
+
+ # Ensure required NLTK data is available
+ nltk.download('stopwords', quiet=True)
+ nltk.download('punkt', quiet=True)
+
+ # Load stop words (using Turkish as a close approximation to Turkmen)
+ STOP_WORDS = set(stopwords.words('turkish'))
+
+ # Character replacements for Turkmen-specific letters
+ REPLACEMENTS = {
+     'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
+     'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
+ }
+
+ def preprocess_sentence(sentence: str) -> List[str]:
+     """
+     Preprocess a single sentence.
+
+     Args:
+         sentence (str): Input sentence.
+
+     Returns:
+         List[str]: List of preprocessed tokens.
+     """
+     for original, replacement in REPLACEMENTS.items():
+         sentence = sentence.replace(original, replacement)
+
+     sentence = re.sub(r'[^a-zA-Z ]', ' ', sentence)
+     sentence = sentence.lower()
+
+     tokens = word_tokenize(sentence)
+     return [word for word in tokens if word not in STOP_WORDS and len(word) > 2]
+
+ def process_chunk(chunk: List[str]) -> List[List[str]]:
+     """
+     Process a chunk of sentences in parallel.
+
+     Args:
+         chunk (List[str]): List of sentences to process.
+
+     Returns:
+         List[List[str]]: List of preprocessed sentences (as token lists).
+     """
+     return [preprocess_sentence(sentence) for sentence in chunk]
+
+ def load_and_preprocess(file_path: str) -> List[List[str]]:
+     """
+     Load and preprocess the input file.
+
+     Args:
+         file_path (str): Path to the input file.
+
+     Returns:
+         List[List[str]]: List of preprocessed sentences (as token lists).
+     """
+     with open(file_path, 'r', encoding='utf-8') as f:
+         sentences = f.readlines()
+
+     # Guard against a zero chunk size when there are fewer sentences than CPU cores
+     chunk_size = max(1, len(sentences) // multiprocessing.cpu_count())
+     chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
+
+     processed_sentences = []
+     with ProcessPoolExecutor() as executor:
+         for chunk_result in tqdm(executor.map(process_chunk, chunks), total=len(chunks), desc="Preprocessing"):
+             processed_sentences.extend(chunk_result)
+
+     return processed_sentences
+
+ def train_word2vec(sentences: List[List[str]], params: Dict) -> Word2Vec:
+     """
+     Train the Word2Vec model.
+
+     Args:
+         sentences (List[List[str]]): Preprocessed sentences.
+         params (Dict): Model parameters.
+
+     Returns:
+         Word2Vec: Trained Word2Vec model.
+     """
+     model = Word2Vec(sentences=sentences, vector_size=params['vector_size'],
+                      window=params['window'], min_count=params['min_count'],
+                      workers=multiprocessing.cpu_count(), sg=params['sg'],
+                      epochs=params['epochs'], negative=params['negative'],
+                      sample=params['sample'], compute_loss=True)
+     return model
+
+ def save_model(model: Word2Vec, output_dir: str, model_name: str) -> None:
+     """
+     Save the trained model and its metadata.
+
+     Args:
+         model (Word2Vec): Trained Word2Vec model.
+         output_dir (str): Directory to save the model.
+         model_name (str): Name of the model.
+     """
+     output_path = Path(output_dir)
+     output_path.mkdir(parents=True, exist_ok=True)
+
+     model_path = output_path / f"{model_name}.model"
+     model.save(str(model_path))
+     logging.info(f"Model saved to {model_path}")
+
+     # Save model metadata
+     metadata_path = output_path / f"{model_name}_metadata.txt"
+     with open(metadata_path, 'w', encoding='utf-8') as f:
+         f.write(f"Model: {model_name}\n")
+         f.write(f"Vocabulary size: {len(model.wv.key_to_index)}\n")
+         f.write(f"Vector size: {model.vector_size}\n")
+         f.write(f"Window size: {model.window}\n")
+         f.write(f"Min count: {model.min_count}\n")
+         f.write(f"Training epochs: {model.epochs}\n")
+         f.write(f"Final training loss: {model.get_latest_training_loss()}\n")
+     logging.info(f"Model metadata saved to {metadata_path}")
+
+ def main():
+     """Main execution function."""
+     logging.info("Starting Turkmen Word2Vec model training")
+
+     # Load and preprocess data
+     processed_sentences = load_and_preprocess(CONFIG['input_file'])
+     logging.info(f"Preprocessed {len(processed_sentences)} sentences")
+
+     # Train model
+     model = train_word2vec(processed_sentences, CONFIG)
+     logging.info("Model training completed")
+
+     # Save model and metadata
+     save_model(model, CONFIG['output_dir'], CONFIG['model_name'])
+
+     logging.info("Process completed successfully")
+
+ if __name__ == "__main__":
+     main()
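As a quick sanity check of the preprocessing pipeline above, a small sketch; the sample sentence and the printed output are illustrative assumptions, not taken from the repository:

```python
# Assumes train_turkmen_word2vec.py is on the import path; importing it runs the
# module-level NLTK downloads and CONFIG setup, which is harmless here.
from train_turkmen_word2vec import preprocess_sentence

tokens = preprocess_sentence("Türkmenistanyň paýtagty Aşgabat şäheridir.")
print(tokens)
# Expected shape of the output: ASCII-folded, lowercased tokens longer than two
# characters, e.g. ['turkmenistanyn', 'paytagty', 'asgabat', 'saheridir']
```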