# Turkmen Word2Vec Model
Welcome to the Turkmen Word2Vec Model project! This open-source initiative aims to provide a robust word embedding solution for the Turkmen language.
## Introduction
The Turkmen Word2Vec Model is designed to create high-quality word embeddings for the Turkmen language. By leveraging the power of Word2Vec and incorporating Turkmen-specific preprocessing, this project offers a valuable resource for various natural language processing tasks in Turkmen.
## Requirements
To use this project, you'll need:
- Python 3.6+
- NLTK
- Gensim
- tqdm
## Metadata

- Model: `turkmen_word2vec`
- Vocabulary size: 153,695
- Vector size: 300
- Window size: 5
- Min count: 15
- Training epochs: 10
- Final training loss: 80,079,792.0
## Turkmen-Specific Character Replacement
One of the key features of this project is its handling of Turkmen-specific characters. The Turkmen alphabet includes several characters that are not present in the standard Latin alphabet. To ensure compatibility and improve processing, the preprocessing step applies a custom character-replacement system.
### Replacement Map
Here's the character replacement map used in the preprocessing step:
```python
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}
```
This mapping ensures that:
- Special Turkmen characters are converted to their closest Latin alphabet equivalents.
- The essence of the original text is preserved while making it more processable for standard NLP tools.
- Both lowercase and uppercase variants are handled appropriately.
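As a quick illustration, the map can be applied with plain `str.replace` calls. The `transliterate` helper below is a hypothetical name introduced for this example, not part of the project's code:

```python
# Character map from the project (Turkmen letters -> Latin approximations).
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

def transliterate(text: str) -> str:
    # Hypothetical helper: apply each replacement in turn.
    # Replacements produce ASCII only, so no substitution re-triggers another.
    for original, replacement in REPLACEMENTS.items():
        text = text.replace(original, replacement)
    return text

print(transliterate("Türkmençe"))  # -> Turkmenche
```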
### Implementation

The replacement is implemented in the `preprocess_sentence` function:
```python
from typing import List

def preprocess_sentence(sentence: str) -> List[str]:
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # ... (rest of the preprocessing steps)
```
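The remaining steps are elided above. A minimal self-contained sketch, assuming the rest of the pipeline lowercases the text and keeps only alphabetic tokens (an assumption — the actual script may use NLTK's tokenizer instead), might look like:

```python
import re
from typing import List

# Character map from the project's preprocessing step.
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

def preprocess_sentence(sentence: str) -> List[str]:
    # 1. Transliterate Turkmen-specific characters (from the project).
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # 2. Lowercase and keep only alphabetic tokens (assumed steps).
    return re.findall(r"[a-z]+", sentence.lower())

print(preprocess_sentence("Salam, Türkmenistan!"))  # ['salam', 'turkmenistan']
```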
This step is crucial as it:
- Standardizes the text, making it easier to process and analyze.
- Maintains the semantic meaning of words while adapting them to a more universal character set.
- Improves compatibility with existing NLP tools and libraries that might not natively support Turkmen characters.
By implementing this character replacement, we ensure that our Word2Vec model can effectively learn from and represent Turkmen text, despite the unique characteristics of the Turkmen alphabet.
## Installation
Clone this repository:

```bash
git clone https://github.com/yourusername/turkmen-word2vec.git
cd turkmen-word2vec
```

Create a virtual environment (optional but recommended):

```bash
python -m venv venv
source venv/bin/activate
```

Install the required packages:

```bash
pip install -r requirements.txt
```
## Usage
Prepare your Turkmen text data in a file (one sentence per line).
Update the `CONFIG` dictionary in `train_turkmen_word2vec.py` with your desired parameters and file paths.

Run the script:

```bash
python train_turkmen_word2vec.py
```
The script will preprocess the data, train the model, and save it along with its metadata.
You can then use the trained model in your projects:
```python
from gensim.models import Word2Vec

model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
similar_words = model.wv.most_similar("salam", topn=10)  # Example usage
```
## Configuration

You can customize the model's behavior by modifying the `CONFIG` dictionary in `train_turkmen_word2vec.py`. Here are the available options:
- `input_file`: Path to your input text file
- `output_dir`: Directory to save the model and metadata
- `model_name`: Name of your model
- `vector_size`: Dimensionality of the word vectors
- `window`: Maximum distance between the current and predicted word
- `min_count`: Minimum frequency of words to be included in the model
- `sg`: Training algorithm (1 for skip-gram, 0 for CBOW)
- `epochs`: Number of training epochs
- `negative`: Number of negative samples for negative sampling
- `sample`: Threshold for downsampling higher-frequency words
## Contact
Bahtiyar Mamedov - @gargamelix - [email protected]
Project Link: https://huggingface.co./mamed0v/turkmen-word2vec
Happy embedding! If you find this project useful, please give it a star and share it with others who might benefit from it.