SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2 on the ai-job-embedding-finetuning dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Mubin/allmini-ai-embedding-similarity")
# Run inference
sentences = [
    'NLP algorithm development, statistical modeling, biomedical informatics',
    "skills for this position are:Natural Language Processing (NLP)Python (Programming Language)Statistical ModelingHigh-Performance Liquid Chromatography (HPLC)Java Job Description:We are seeking a highly skilled NLP Scientist to develop our innovative and cutting-edge NLP/AI solutions to empower life science. This involves working directly with our clients, as well as cross-functional Biomedical Science, Engineering, and Business leaders, to identify, prioritize, and develop NLP/AI and Advanced analytics products from inception to delivery.Key requirements and design innovative NLP/AI solutions.Develop and validate cutting-edge NLP algorithms, including large language models tailored for healthcare and biopharma use cases.Translate complex technical insights into accessible language for non-technical stakeholders.Mentor junior team members, fostering a culture of continuous learning and growth.Publish findings in peer-reviewed journals and conferences.Engage with the broader scientific community by attending conferences, workshops, and collaborating on research projects. Qualifications:Ph.D. or master's degree in biomedical NLP, Computer Science, Biomedical Informatics, Computational Linguistics, Mathematics, or other related fieldsPublication records in leading computer science or biomedical informatics journals and conferences are highly desirable\n\nRegards,Guru Prasath M US IT RecruiterPSRTEK Inc.Princeton, NJ [email protected]: 609-917-9967 Ext:114",
    'Skills :\na) Azure Data Factory – Min 3 years of project experiencea. Design of pipelinesb. Use of project with On-prem to Cloud Data Migrationc. Understanding of ETLd. Change Data Capture from Multiple Sourcese. Job Schedulingb) Azure Data Lake – Min 3 years of project experiencea. All steps from design to deliverb. Understanding of different Zones and design principalc) Data Modeling experience Min 5 Yearsa. Data Mart/Warehouseb. Columnar Data design and modelingd) Reporting using PowerBI Min 3 yearsa. Analytical Reportingb. Business Domain Modeling and data dictionary\nInterested please apply to the job, looking only for W2 candidates.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Triplet

Metric ai-job-validation ai-job-test
cosine_accuracy 0.9703 0.9804

Training Details

Training Dataset

ai-job-embedding-finetuning

  • Dataset: ai-job-embedding-finetuning at b18b3c2
  • Size: 812 training samples
  • Columns: query, job_description_pos, and job_description_neg
  • Approximate statistics based on the first 812 samples:
    query job_description_pos job_description_neg
    type string string string
    details
    • min: 7 tokens
    • mean: 15.03 tokens
    • max: 38 tokens
    • min: 6 tokens
    • mean: 216.92 tokens
    • max: 256 tokens
    • min: 6 tokens
    • mean: 217.63 tokens
    • max: 256 tokens
  • Samples:
    query job_description_pos job_description_neg
    Data Engineering Lead, Databricks administration, Neo4j expertise, ETL processes Requirements

    Experience: At least 6 years of hands-on experience in deploying production-quality code, with a strong preference for experience in Python, Java, or Scala for data processing (Python preferred).Technical Proficiency: Advanced knowledge of data-related Python packages and a profound understanding of SQL and Databricks.Graph Database Expertise: Solid grasp of Cypher and experience with graph databases like Neo4j.ETL/ELT Knowledge: Proven track record in implementing ETL (or ELT) best practices at scale and familiarity with data pipeline tools.

    Preferred Qualifications

    Professional experience using Python, Java, or Scala for data processing (Python preferred)

    Working Conditions And Physical Requirements

    Ability to work for long periods at a computer/deskStandard office environment

    About The Organization

    Fullsight is an integrated brand of our three primary affiliate companies – SAE Industry Technologies Consortia, SAE International and Performance Review Institute – a...
    skills through a combination of education, work experience, and hobbies. You are excited about the complexity and challenges of creating intelligent, high-performance systems while working with a highly experienced and driven data science team.

    If this described you, we are interested. You can be an integral part of a cross-disciplinary team working on highly visible projects that improve performance and grow the intelligence in our Financial Services marketing product suite. Our day-to-day work is performed in a progressive, high-tech workspace where we focus on a friendly, collaborative, and fulfilling environment.

    Key Duties/Responsibilities

    Leverage a richly populated feature stores to understand consumer and market behavior. 20%Implement a predictive model to determine whether a person or household is likely to open a lending or deposit account based on the advertising signals they've received. 20%Derive a set of new features that will help better understand the interplay betwe...
    Snowflake data warehousing, Python design patterns, AWS tools expertise Requirements:
    - Good communication; and problem-solving abilities- Ability to work as an individual contributor; collaborating with Global team- Strong experience with Data Warehousing- OLTP, OLAP, Dimension, Facts, Data Modeling- Expertise implementing Python design patterns (Creational, Structural and Behavioral Patterns)- Expertise in Python building data application including reading, transforming; writing data sets- Strong experience in using boto3, pandas, numpy, pyarrow, Requests, Fast API, Asyncio, Aiohttp, PyTest, OAuth 2.0, multithreading, multiprocessing, snowflake python connector; Snowpark- Experience in Python building data APIs (Web/REST APIs)- Experience with Snowflake including SQL, Pipes, Stream, Tasks, Time Travel, Data Sharing, Query Optimization- Experience with Scripting language in Snowflake including SQL Stored Procs, Java Script Stored Procedures; Python UDFs- Understanding of Snowflake Internals; experience in integration with Reporting; UI applications- Stron...
    skills and ability to lead detailed data analysis meetings/discussions.

    Ability to work collaboratively with multi-functional and cross-border teams.

    Good English communication written and spoken.

    Nice to have;

    Material master create experience in any of the following areas;

    SAP

    GGSM

    SAP Data Analyst, MN/Remote - Direct Client
    Cloud Data Engineering, Databricks Pyspark, Data Warehousing Design Experience of Delta Lake, DWH, Data Integration, Cloud, Design and Data Modelling. Proficient in developing programs in Python and SQLExperience with Data warehouse Dimensional data modeling. Working with event based/streaming technologies to ingest and process data. Working with structured, semi structured and unstructured data. Optimize Databricks jobs for performance and scalability to handle big data workloads. Monitor and troubleshoot Databricks jobs, identify and resolve issues or bottlenecks. Implement best practices for data management, security, and governance within the Databricks environment. Experience designing and developing Enterprise Data Warehouse solutions. Proficient writing SQL queries and programming including stored procedures and reverse engineering existing process. Perform code reviews to ensure fit to requirements, optimal execution patterns and adherence to established standards.

    Requirements:

    You are:

    Minimum 9+ years of experience is required. 5+ years...
    QualificationsExpert knowledge of using and configuring GCP (Vertex), AWS, Azure Python: 5+ years of experienceMachine Learning libraries: Pytorch, JaxDevelopment tools: Bash, GitData Science frameworks: DatabricksAgile Software developmentCloud Management: Slurm, KubernetesData Logging: Weights and BiasesOrchestration, Autoscaling: Ray, ClearnML, WandB etc.
    Optional QualificationsExperience training LLMs and VLMsML for Robotics, Computer Vision etc.Developing Browser Apps/Dashboards, both frontend and backend Javascript, React, etc. Emancro is committed to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or Veteran status.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Evaluation Dataset

ai-job-embedding-finetuning

  • Dataset: ai-job-embedding-finetuning at b18b3c2
  • Size: 101 evaluation samples
  • Columns: query, job_description_pos, and job_description_neg
  • Approximate statistics based on the first 101 samples:
    query job_description_pos job_description_neg
    type string string string
    details
    • min: 10 tokens
    • mean: 15.78 tokens
    • max: 51 tokens
    • min: 9 tokens
    • mean: 220.13 tokens
    • max: 256 tokens
    • min: 21 tokens
    • mean: 213.07 tokens
    • max: 256 tokens
  • Samples:
    query job_description_pos job_description_neg
    Big Data Engineer, Spark, Hadoop, AWS/GCP Skills • Expertise and hands-on experience on Spark, and Hadoop echo system components – Must Have • Good and hand-on experience* of any of the Cloud (AWS/GCP) – Must Have • Good knowledge of HiveQL & SparkQL – Must Have Good knowledge of Shell script & Java/Scala/python – Good to Have • Good knowledge of SQL – Good to Have • Good knowledge of migration projects on Hadoop – Good to Have • Good Knowledge of one of the Workflow engines like Oozie, Autosys – Good to Have Good knowledge of Agile Development– Good to Have • Passionate about exploring new technologies – Good to Have • Automation approach – Good to Have
    Thanks & RegardsShahrukh KhanEmail: [email protected]
    experience:

    GS-14:

    Supervisory/Managerial Organization Leadership

    Supervises an assigned branch and its employees. The work directed involves high profile data science projects, programs, and/or initiatives within other federal agencies.Provides expert advice in the highly technical and specialized area of data science and is a key advisor to management on assigned/delegated matters related to the application of mathematics, statistical analysis, modeling/simulation, machine learning, natural language processing, and computer science from a data science perspective.Manages workforce operations, including recruitment, supervision, scheduling, development, and performance evaluations.Keeps up to date with data science developments in the private sector; seeks out best practices; and identifies and seizes opportunities for improvements in assigned data science program and project operations.


    Senior Expert in Data Science

    Recognized authority for scientific data analysis using advanc...
    Time series analysis, production operations, condition-based monitoring Experience in Production Operations or Well Engineering Strong scripting/programming skills (Python preferable)

    Desired:

    Strong time series surveillance background (eg. OSI PI, PI AF, Seeq) Strong scripting/programming skills (Python preferable) Strong communication and collaboration skills Working knowledge of machine learning application (eg. scikit-learn) Working knowledge of SQL and process historians Delivers positive results through realistic planning to accomplish goals Must be able to handle multiple concurrent tasks with an ability to prioritize and manage tasks effectively



    Apex Systems is

    Apex Systems is a world-class IT services company that serves thousands of clients across the globe. When you join Apex, you become part of a team that values innovation, collaboration, and continuous learning. We offer quality career resources, training, certifications, development opportunities, and a comprehensive benefits package. Our commitment to excellence is reflected in man...
    Qualifications:· 3-5 years of experience as a hands-on analyst in an enterprise setting, leveraging Salesforce, Marketo, Dynamics, and similar tools.· Excellent written and verbal communication skills.· Experience with data enrichment processes and best practices.· Strong understanding of B2B sales & marketing for large, complex organizations.· Expertise in querying, manipulating, and analyzing data using SQL and/or similar languages.· Advanced Excel skills and experience with data platforms like Hadoop and Databricks.· Proven proficiency with a data visualization tool like Tableau or Power BI.· Strong attention to detail with data quality control and integration expertise.· Results-oriented, self-directed individual with multi-tasking, problem-solving, and independent learning abilities.· Understanding of CRM systems like Salesforce and Microsoft Dynamics.· Solid grasp of marketing practices, principles, KPIs, and data types.· Familiarity with logical data architecture and cloud data ...
    Senior Data Analyst jobs with expertise in Power BI, NextGen EHR, and enterprise ETL. requirements.Reporting and Dashboard Development: Design, develop, and maintain reports for the HRSA HCCN Grant and other assignments. Create and maintain complex dashboards using Microsoft Power BI.Infrastructure Oversight: Monitor and enhance the data warehouse, ensuring efficient data pipelines and timely completion of tasks.Process Improvements: Identify and implement internal process improvements, including automating manual processes and optimizing data delivery.Troubleshooting and Maintenance: Address data inconsistencies using knowledge of various database structures and workflow best practices, including NextGen EHR system.Collaboration and Mentorship: Collaborate with grant PHCs and analytic teams, mentor less senior analysts, and act as a project lead for specific deliverables.
    Experience:Highly proficient in SQL and experienced with reporting packages.Enterprise ETL experience is a major plus!data visualization tools (e.g., Tableau, Power BI, Qualtrics).Azure, Azure Data Fa...
    Qualifications

    3 to 5 years of experience in exploratory data analysisStatistics Programming, data modeling, simulation, and mathematics Hands on working experience with Python, SQL, R, Hadoop, SAS, SPSS, Scala, AWSModel lifecycle executionTechnical writingData storytelling and technical presentation skillsResearch SkillsInterpersonal SkillsModel DevelopmentCommunicationCritical ThinkingCollaborate and Build RelationshipsInitiative with sound judgementTechnical (Big Data Analysis, Coding, Project Management, Technical Writing, etc.)Problem Solving (Responds as problems and issues are identified)Bachelor's Degree in Data Science, Statistics, Mathematics, Computers Science, Engineering, or degrees in similar quantitative fields


    Desired Qualification(s)

    Master's Degree in Data Science, Statistics, Mathematics, Computer Science, or Engineering


    Hours: Monday - Friday, 8:00AM - 4:30PM

    Locations: 820 Follin Lane, Vienna, VA 22180
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • batch_sampler: no_duplicates

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step ai-job-validation_cosine_accuracy ai-job-test_cosine_accuracy
0 0 0.9307 -
1.0 51 0.9703 0.9804

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Downloads last month
0
Safetensors
Model size
22.7M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for Mubin/allmini-ai-embedding-similarity

Finetuned
(205)
this model

Dataset used to train Mubin/allmini-ai-embedding-similarity

Evaluation results