---
library_name: transformers
language:
- en
- hy
base_model:
- intfloat/multilingual-e5-base
tags:
- sentence-transformers
---
|
|
|
# Armenian-Text-Embeddings-1 |
|
|
|
## Model Details |
|
- **Model Name**: Armenian-Text-Embeddings-1 |
|
- **Model Type**: Text Embeddings for Armenian Language |
|
- **Base Model**: intfloat/multilingual-e5-base |
|
- **Version**: 1.0.0 |
|
- **License**: Apache 2.0 |
|
- **Last Updated**: November 2024 |
|
- **Model Architecture**: Transformer-based embedding model
|
- **Input**: Armenian text |
|
- **Output**: Dense vector embeddings |
|
|
|
## Quick Start |
|
```python |
|
import torch.nn.functional as F |
|
|
|
from torch import Tensor |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('Metric-AI/armenian-text-embeddings-1') |
|
model = AutoModel.from_pretrained('Metric-AI/armenian-text-embeddings-1') |
|
|
|
|
|
# Mean-pool the token embeddings, ignoring padding positions flagged by the attention mask.
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
|
|
|
|
|
# Each input text should start with "query: " or "passage: ", even for non-English texts. |
|
# For tasks other than retrieval, you can simply use the "query: " prefix. |
|
input_texts = [ |
|
'query: Ինչպե՞ս պատրաստել տոլմա', # How to make tolma |
|
'query: Քանի՞ գրամ սպիտակուց է հարկավոր օրական', # How many grams of protein needed daily |
|
|
|
"""passage: Տոլմայի բաղադրատոմս՝ |
|
Բաղադրիչներ՝ |
|
- 500գ աղացած միս |
|
- 1 բաժակ բրինձ |
|
- Խաղողի տերևներ |
|
- 2 գլուխ սոխ |
|
- Համեմունքներ՝ աղ, սև պղպեղ, քարի |
|
|
|
Պատրաստման եղանակը՝ |
|
1. Միսը խառնել բրնձի, մանր կտրատած սոխի և համեմունքների հետ |
|
2. Խաղողի տերևները լվանալ և թողնել տաք ջրի մեջ 10 րոպե |
|
3. Լցոնել տերևները և դասավորել կաթսայի մեջ |
|
4. Եփել դանդաղ կրակի վրա 45-60 րոպե""", # Detailed tolma recipe |
|
|
|
"""passage: Սպիտակուցի օրական չափաբաժինը կախված է մարդու քաշից, սեռից և ֆիզիկական ակտիվությունից: |
|
Միջին հաշվով, կանանց համար խորհուրդ է տրվում 46-50 գրամ սպիտակուց օրական: |
|
Մարզիկների համար այս թիվը կարող է հասնել մինչև 1.6-2 գրամ մարմնի քաշի յուրաքանչյուր կիլոգրամի համար: |
|
Հղիների համար պահանջվում է լրացուցիչ 25 գրամ սպիտակուց: |
|
|
|
Սպիտակուցի հարուստ աղբյուրներ են՝ |
|
- Հավի միս (31գ/100գ) |
|
- Ձու (13գ/100գ) |
|
- Ոսպ (25գ/100գ) |
|
- Մածուն (3.5գ/100գ)"""] # Detailed protein intake advice |
|
|
|
# Tokenize the input texts |
|
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt') |
|
outputs = model(**batch_dict) |
|
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask']) |
|
|
|
# normalize embeddings |
|
embeddings = F.normalize(embeddings, p=2, dim=1) |
|
scores = (embeddings[:2] @ embeddings[2:].T) * 100 |
|
print(scores.tolist()) |
|
|
|
# [[83.96063232421875, 30.283924102783203], [32.504661560058594, 82.4246826171875]] |
|
``` |
|
|
|
## Support for Sentence Transformers |
|
|
|
Below is an example of how to use the model with sentence_transformers, reusing the `input_texts` list from the Quick Start.
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1') |
|
|
|
embeddings = model.encode(input_texts, normalize_embeddings=True) |
|
``` |
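
The embeddings returned by `encode` can be scored the same way as in the Quick Start. As a small follow-up (reusing the `input_texts` and `embeddings` from the snippet above), the cosine-similarity helper from sentence_transformers reproduces the query-passage scores:

```python
from sentence_transformers import util

# Rows correspond to the two queries, columns to the two passages.
scores = util.cos_sim(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
```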
|
|
|
|
|
## Intended Use |
|
### Primary Intended Uses |
|
- Retrieval-augmented generation (RAG) |
|
- Semantic search in Armenian (see the retrieval sketch after this list)
|
- Document similarity computation |
|
- Cross-lingual text understanding |
|
- Text classification tasks |
|
- Information retrieval |
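
As an illustration of the semantic search and retrieval use cases above, here is a minimal sketch built on sentence_transformers. The two-document corpus and the query are made-up examples; the `query: `/`passage: ` prefixes follow the convention described in the Quick Start.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1')

# Toy document collection, prefixed with "passage: " as the model expects.
documents = [
    'passage: Երևանը Հայաստանի մայրաքաղաքն է։',       # Yerevan is the capital of Armenia.
    'passage: Արագածը Հայաստանի ամենաբարձր լեռն է։',   # Aragats is the highest mountain in Armenia.
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Embed the query with the "query: " prefix and rank documents by cosine similarity.
query_embedding = model.encode(['query: Ո՞րն է Հայաստանի մայրաքաղաքը'],  # What is the capital of Armenia?
                               normalize_embeddings=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```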
|
|
|
## Training Data |
|
### Dataset Details |
|
- **Source**: Reddit dataset with English-Armenian translations |
|
- **Size**: 1.08M translated title/body pairs
|
- **Content Type**: Title and body text pairs (see the sketch after this list)
|
- **Token Statistics**: |
|
- Training Set: |
|
- Translated Title Tokens: 23,921,393 |
|
- Translated Body Tokens: 194,200,654 |
|
- Test Set: |
|
- Translated Title Tokens: 242,443 |
|
- Translated Body Tokens: 1,946,164 |
|
- **Split Ratio**: 99% train, 1% test |
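
The exact preprocessing pipeline is not published. A plausible sketch, assuming each record carries an Armenian-translated title and body that are mapped to the E5-style prefixes this model expects (field names below are hypothetical):

```python
# Hypothetical preprocessing sketch; field names are illustrative, not the actual schema.
def to_training_pair(record: dict) -> tuple[str, str]:
    query = 'query: ' + record['translated_title']      # Reddit post title (Armenian translation)
    passage = 'passage: ' + record['translated_body']   # Reddit post body (Armenian translation)
    return query, passage
```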
|
|
|
## Training Procedure |
|
### Training Details |
|
- **Weight Averaging** (see the sketch after this list):
|
- Base model (multilingual-e5-base): 0.6 weight |
|
- Fine-tuned model: 0.4 weight |
|
- **Training Duration**: 2 days |
|
- **Hardware**: 4 x NVIDIA A100 40GB GPUs |
|
- **Training Parameters**: |
|
- Epochs: 5 |
|
- Batch Size: 256 per GPU (1,024 total across 4 GPUs)
|
- Learning Rate: 5e-5 |
|
- Weight Decay: 0.01 |
|
- Warmup Steps: 1000 |
|
- Maximum Sequence Length: 128 tokens |
|
- FP16 Training: Enabled |
|
- Gradient Clipping: 1.0 |
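
The 0.6/0.4 weight averaging listed above amounts to a parameter-wise interpolation between the base and fine-tuned checkpoints. A minimal sketch is shown below; the fine-tuned checkpoint path is a placeholder, not a published artifact.

```python
import torch
from transformers import AutoModel

base = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
finetuned = AutoModel.from_pretrained('path/to/finetuned-checkpoint')  # placeholder path

finetuned_state = finetuned.state_dict()
averaged_state = {}
for name, param in base.state_dict().items():
    if torch.is_floating_point(param):
        # 0.6 * base + 0.4 * fine-tuned, as described in the training details.
        averaged_state[name] = 0.6 * param + 0.4 * finetuned_state[name]
    else:
        # Integer buffers (e.g. position ids) are copied unchanged.
        averaged_state[name] = param

base.load_state_dict(averaged_state)
base.save_pretrained('averaged-model')
```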
|
|
|
### Optimization Configuration |
|
- **Framework**: DeepSpeed Stage 2 (an illustrative config sketch follows this list)
|
- **Optimizer**: AdamW with auto weight decay |
|
- **Mixed Precision**: FP16 with dynamic loss scaling |
|
- **ZeRO Optimization**: Stage 2 with: |
|
- Allgather partitions |
|
- Overlap communications |
|
- Contiguous gradients |
|
- **Additional Features**: |
|
- Gradient checkpointing |
|
- Tensor parallelism (size: 2) |
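
The settings above roughly correspond to a DeepSpeed configuration of the following shape. The actual file used for training is not published, so this is an illustrative sketch assembled from the listed values, not the exact configuration.

```python
# Illustrative DeepSpeed ZeRO Stage 2 configuration; values taken from the lists above.
ds_config = {
    "fp16": {"enabled": True, "loss_scale": 0},  # loss_scale 0 enables dynamic loss scaling
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5, "weight_decay": 0.01},
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 256,
}
```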
|
|
|
## Performance and Limitations |
|
### Capabilities |
|
- Effective for semantic similarity tasks in Armenian |
|
- Suitable for document classification and clustering |
|
|
|
### Limitations |
|
- Performance may vary on domain-specific terminology |
|
- May not capture Armenian-specific cultural contexts effectively |
|
- Limited by the quality of training data translations |
|
|
|
### Known Biases |
|
- May exhibit biases present in Reddit content |
|
|
|
## Environmental Impact |
|
- **Training Hardware**: 4 x NVIDIA A100 40GB |
|
- **Training Duration**: 48 hours |
|
- **Estimated Energy Consumption**: ~384 kWh (based on A100 power consumption)
|
|
|
## Ethical Considerations |
|
- **Data Privacy**: Training data from public Reddit content |
|
- **Potential Misuse**: Could be misused for content manipulation or spam |
|
- **Bias**: May perpetuate social biases present in Reddit content |
|
- **Recommendations**: |
|
- Monitor system outputs for harmful content |
|
- Implement content filtering for production use |
|
- Regular bias assessment recommended |
|
|
|
## Technical Specifications |
|
- **Model Size**: ~278M parameters (based on e5-base) |
|
- **Embedding Dimension**: 768 (see the quick check below)
|
- **Max Sequence Length**: 512 tokens (fine-tuned with sequences of up to 128 tokens)
|
- **Framework Compatibility**: |
|
- PyTorch |
|
- Hugging Face Transformers |
|
- DeepSpeed |
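
A quick, self-contained way to confirm the embedding dimension without downloading the full weights:

```python
from transformers import AutoConfig

# The pooled embedding dimensionality equals the transformer hidden size.
config = AutoConfig.from_pretrained('Metric-AI/armenian-text-embeddings-1')
print(config.hidden_size)  # 768
```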
|
|
|
## Citation |
|
```bibtex |
|
@misc{armenian-text-embeddings-1, |
|
    author = {Spartak Bughdaryan and Zaruhi Navasardyan and Bagrat Minasyan and Hrant Davtyan},
|
title = {Armenian-Text-Embeddings-1: Enhanced Armenian Language Embeddings}, |
|
year = {2024}, |
|
howpublished = {\url{https://metric.am/blog/announcing-armenian-text-embeddings/}} |
|
} |
|
``` |
|
|
|
## Additional Information |
|
### Base Model References |
|
- multilingual-e5-base: [https://huggingface.co./intfloat/multilingual-e5-base](https://huggingface.co./intfloat/multilingual-e5-base) |
|
|
|
### Acknowledgments |
|
- intfloat for the original multilingual-e5-base model |
|
- Reddit community for the source content |
|
- DeepSpeed team for the optimization toolkit
|
|
|
## Version History |
|
- 1.0.0 (November 2024): Initial release |