chilly-magician committed on
Commit
78a1cf9
1 Parent(s): e3a68be

[up]: training section

Files changed (3)
  1. README.md +141 -17
  2. calculate_metrics.py +0 -0
  3. test_query_parser.py +0 -0
README.md CHANGED
@@ -4,7 +4,9 @@ base_model: tiiuae/falcon-7b-instruct
4
  license: apache-2.0
5
  language:
6
  - en
7
- pipeline_tag: text-generation
8
  tags:
9
  - search-queries
10
  - instruct-fine-tuned
@@ -352,32 +354,94 @@ print(output)
352
 
353
  ### Training Data
354
 
355
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
356
 
357
- [More Information Needed]
358
 
359
- ### Training Procedure
360
 
361
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
362
 
363
- #### Preprocessing [optional]
364
 
365
- [More Information Needed]
366
 
 
367
 
368
- #### Training Hyperparameters
369
 
370
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
371
 
372
- #### Speeds, Sizes, Times [optional]
373
 
374
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
375
 
376
- [More Information Needed]
377
 
378
- ## Evaluation
379
 
380
- <!-- This section describes the evaluation protocols and provides the results. -->
381
 
382
  ### Testing Data, Factors & Metrics
383
 
@@ -395,9 +459,69 @@ print(output)
395
 
396
  #### Metrics
397
 
398
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
399
-
400
- [More Information Needed]
401
 
402
  ### Results
403
 
 
4
  license: apache-2.0
5
  language:
6
  - en
7
+ pipeline_tag: text-generation
8
+ datasets:
9
+ - EmbeddingStudio/query-parsing-instructions-falcon
10
  tags:
11
  - search-queries
12
  - instruct-fine-tuned
 
354
 
355
  ### Training Data
356
 
357
+ We used synthetically generated query parsing instructions:
358
+ * We generated lists of possible filters for 63 customer categories:
359
+ * [Raw version of filters dataset](https://huggingface.co/datasets/EmbeddingStudio/synthetic-search-filters-raw)
360
+ * [Split by representations](https://huggingface.co/datasets/EmbeddingStudio/synthetic-search-filters)
361
+ * We randomly selected up to 150 combinations of filters (1-3 filters in each combination), so that each filter's representation appears at most twice.
362
+ * For a given category and combination we [generated](https://huggingface.co/datasets/EmbeddingStudio/synthetic-search-queries) with GPT-4 Turbo:
363
+ * 2 search queries and their parsed versions with unstructured parts.
364
+ * 2 search queries and their parsed versions without unstructured parts.
365
+ * Using the filters, queries, and parsed versions, we prepared [72.5k Falcon-format instructions](EmbeddingStudio/query-parsing-instructions-falcon).
366
 
367
+ **Warning:** The EmbeddingStudio team warns you that the generated queries **weren't curated enough**; they will be curated later, once we finish our product-market fit stage.
368
 
369
+ #### Principles of train / test splitting
370
 
371
+ Since we are fine-tuning an LLM to follow zero-shot query parsing instructions, we want to test:
372
+ * Ability to work well with unseen domains
373
+ * Ability to work well with unseen filters
374
+ * Ability to work well with unseen queries
375
 
376
+ For these purposes we:
377
+ 1. Put 5 categories into the test split, completely separated from train: `Telecommunication Companies, Legal Services, Enterprise Software Development, Artificial Intelligence and Machine Learning, Documentation and Knowledge Sharing`.
378
+ 2. Set aside, for each company category that appears in train, one filter and the queries related to it.
379
+ 3. Selected 5% of the remaining queries and put them into test.
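
The three splitting rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual script: the record layout (`category` and `filters` keys), the `held_out_filter` mapping, and the function name are assumptions made for the example.

```python
import random

# The five categories the README lists as fully held out from train.
UNSEEN_CATEGORIES = {
    "Telecommunication Companies",
    "Legal Services",
    "Enterprise Software Development",
    "Artificial Intelligence and Machine Learning",
    "Documentation and Knowledge Sharing",
}


def split_train_test(records, held_out_filter, test_fraction=0.05, seed=0):
    """Split records into train and test following the three rules.

    records: list of dicts with hypothetical keys 'category' and 'filters'.
    held_out_filter: hypothetical dict mapping a train category to the one
        filter name that is reserved for the test split.
    """
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        category = record["category"]
        if category in UNSEEN_CATEGORIES:
            test.append(record)  # rule 1: unseen domain
        elif held_out_filter.get(category) in record["filters"]:
            test.append(record)  # rule 2: unseen filter
        elif rng.random() < test_fraction:
            test.append(record)  # rule 3: unseen query (about 5%)
        else:
            train.append(record)
    return train, test
```

Rule 3 here samples each remaining record independently, so the test share is only approximately 5%; an exact share would need a shuffle-and-slice instead.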
380
 
381
+ #### Filters generation details
382
 
383
+ We used GPT-4 Turbo to generate several possible filters for 63 company categories. For each filter we also generated some possible representations. For example, the filter `Date` can be represented as `dd/mm/YYYY`, `YYYY-mm-dd`, or in words such as `2024 Jan 17`.
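
To make the idea of "representations" concrete, here is a minimal sketch that renders one `Date` value in the three forms mentioned above. The mapping name and the `"YYYY Mon dd"` key are hypothetical labels chosen for this example; the dataset's own representation names may differ.

```python
from datetime import date

# Hypothetical mapping from a representation label to a strftime pattern.
DATE_REPRESENTATIONS = {
    "dd/mm/YYYY": "%d/%m/%Y",
    "YYYY-mm-dd": "%Y-%m-%d",
    "YYYY Mon dd": "%Y %b %d",  # month abbreviation depends on the locale
}


def render_date(value, representation):
    """Format a date according to one of the named representations."""
    return value.strftime(DATE_REPRESENTATIONS[representation])


example = date(2024, 1, 17)
for name in DATE_REPRESENTATIONS:
    print(name, "->", render_date(example, name))
```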
384
 
385
+ #### Queries generation details
386
 
387
+ We also used GPT-4 Turbo to generate search queries and their parsed versions. The main principles were:
388
+ * If the passed schema doesn't contain a possible filter, do not generate the query itself or a possible filter.
389
+ * If a selected combination of representations contains an enumeration, we ask GPT-4 Turbo to map values between the search query and the parsed version.
390
+ * If a selected combination of representations contains a pattern, we ask GPT-4 Turbo to follow that pattern.
391
 
392
+ #### Instructions generation details
393
 
394
+ To generate the instructions we used the following ideas:
395
+ 1. A zero-shot query parser should be schema-agnostic. Naming cases like `snake_case, CamelCase, http-headers-like` should not break the generation process.
396
+ 2. A zero-shot query parser should be insensitive to spelling errors.
397
+ 3. Training instructions should be in the following order:
398
+ * Category
399
+ * Schema
400
+ * Query
401
+
402
+ With this order, the LLM can be used as follows: generate the embedding of the category -> schema part once, so inference will be faster.
403
 
404
+ We assume that the term `schema agnostic` means something wider: being able to work not only with JSON, but also with HTML, Markdown, YAML, etc. We are working on it.
405
+
406
+ Our approach to achieving these abilities was as follows:
407
+ 1. For each query we generated a version with a mistake.
408
+ 2. We added to each parsed version an additional field `Correct`, which contains the corrected version of the search query.
409
+ 3. For each query we randomly selected and used a case for schema fields and a case for filter and representation names.
410
+ 4. For each query we additionally generated two instructions:
411
+ * One where we removed one filter from the provided schema and parsed version
412
+ * One where we removed all related filters from the provided schema and parsed version
413
+
414
+ **Warning:** The EmbeddingStudio team asks you to carefully curate the datasets on your own.
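
The case randomization and filter removal described in this section could look roughly like the sketch below. It is an illustration under stated assumptions, not the dataset's real generation code: the helper names, the flat `schema` dict, and the `parsed` list of `{"Name": ..., "Value": ...}` items are all hypothetical.

```python
import re


def split_words(name):
    """Split a field name into lowercase words, whatever case it uses."""
    # Insert a space at lower-to-upper transitions, e.g. "ReleaseDate".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]


def to_snake_case(name):
    return "_".join(split_words(name))


def to_camel_case(name):
    return "".join(word.capitalize() for word in split_words(name))


def to_http_header(name):
    return "-".join(word.capitalize() for word in split_words(name))


def recase_schema(schema, convert):
    """Rename every schema field with the chosen case converter."""
    return {convert(key): value for key, value in schema.items()}


def drop_filter(schema, parsed, filter_name):
    """Remove one filter from both the schema and the parsed query."""
    new_schema = {k: v for k, v in schema.items() if k != filter_name}
    new_parsed = [item for item in parsed if item["Name"] != filter_name]
    return new_schema, new_parsed
```

For example, `recase_schema({"release_date": "date"}, to_http_header)` yields a schema keyed by `Release-Date`, and `drop_filter` produces the "one filter removed" variant of an instruction.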
415
+
416
+ ### Training Procedure
417
 
418
+ 1. Mixed Precision Regime
419
+ 2. Supervised Fine-Tuning
420
+ 3. Three epochs with cosine scheduler
421
 
422
+ All details are in Training Hyperparameters.
423
+
424
+ #### Preprocessing [optional]
425
+
426
+ The preprocessing steps are not detailed in the provided code. Typically, preprocessing involves tokenization, normalization, data augmentation, and handling of special tokens. In this training setup, the tokenizer was configured with `add_prefix_space=True` and `use_fast=False`, which might indicate special considerations for tokenizing certain languages or text formats.
427
+
428
+ #### Training Hyperparameters
429
+ | Hyperparameter | Value | Description |
430
+ |--------------------------------------|------------------------------|-------------------------------------------------------|
431
+ | **Training Regime** | Mixed Precision (bfloat16) | Utilizes bfloat16 for efficient memory usage and training speed. |
432
+ | **Model Configuration** | Causal Language Model | Incorporates LoRA (Low-Rank Adaptation) for training efficiency. |
433
+ | **Quantization Configuration** | Bits and Bytes (BnB) | Uses settings like `load_in_4bit` and `bnb_4bit_quant_type` for model quantization. |
434
+ | **Training Environment** | CUDA-enabled Device | Indicates GPU acceleration for training. |
435
+ | **Learning Rate** | 2e-4 | Determines the step size at each iteration while moving toward a minimum of a loss function. |
436
+ | **Weight Decay** | 0.001 | Helps in regularizing and preventing overfitting. |
437
+ | **Warmup Ratio** | 0.03 | Fraction of total training steps used for the learning rate warmup. |
438
+ | **Optimizer** | Paged AdamW (32-bit) | Optimizes the training process with efficient memory usage. |
439
+ | **Gradient Accumulation Steps** | 2 | Reduces memory consumption and allows for larger effective batch sizes. |
440
+ | **Max Grad Norm** | 0.3 | Maximum norm for the gradients. |
441
+ | **LR Scheduler Type** | Cosine | Specifies the learning rate schedule. |
442
+ | **PEFT Configurations** | LoraConfig | Details like `lora_alpha`, `lora_dropout`, and `r` for LoRA adaptations. |
443
+ | **Training Dataset Segmentation** | Train and Test Sets | Segmentation of the dataset for training and evaluation. |
444
+ | **Max Sequence Length** | 1024 | Maximum length of the input sequences. |
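
To show how the learning rate, warmup ratio, and cosine scheduler from the table interact, here is a minimal pure-Python sketch of the usual linear-warmup plus cosine-decay shape. The peak rate `2e-4` and warmup ratio `0.03` come from the table; `total_steps=1000` is a hypothetical value, and the exact schedule used by the training library may differ in details.

```python
import math


def learning_rate(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 15, 30, 515, 1000):
    print(step, learning_rate(step, total_steps))
```

With gradient accumulation of 2, one scheduler step corresponds to two forward/backward passes, so the effective batch size is twice the per-device batch size.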
445
 
446
  ### Testing Data, Factors & Metrics
447
 
 
459
 
460
  #### Metrics
461
 
462
+ ##### Total metrics
463
+
464
+ | Category | Recall | Precision | F1 | Accuracy |
465
+ | ------------------------------------------------ | ------ | --------- | ----- | -------- |
466
+ | Telecommunication Companies [+] | 0.70 | 0.67 | 0.68 | 0.52 |
467
+ | Legal Services [+] | 0.80 | 0.74 | 0.77 | 0.63 |
468
+ | Enterprise Software Development [+] | 0.78 | 0.71 | 0.74 | 0.59 |
469
+ | Artificial Intelligence and Machine Learning [+] | 0.77 | 0.78 | 0.78 | 0.63 |
470
+ | Documentation and Knowledge Sharing [+] | 0.68 | 0.65 | 0.66 | 0.50 |
471
+ | Educational Institutions | 0.55 | 0.51 | 0.53 | 0.36 |
472
+ | Job Recruitment Agencies | 0.58 | 0.51 | 0.54 | 0.37 |
473
+ | Banking Services | 0.73 | 0.81 | 0.76 | 0.62 |
474
+ | Investment Services | 0.50 | 0.50 | 0.50 | 0.33 |
475
+ | Insurance Services | 0.77 | 0.77 | 0.77 | 0.62 |
476
+ | Financial Planning and Advisory | 0.65 | 0.67 | 0.66 | 0.49 |
477
+ | Credit Services | 0.60 | 0.65 | 0.63 | 0.45 |
478
+ | Payment Processing | 0.79 | 0.74 | 0.76 | 0.62 |
479
+ | Mortgage and Real Estate Services | 1.00 | 1.00 | 1.00 | 1.00 |
480
+ | Taxation Services | 0.52 | 0.57 | 0.54 | 0.37 |
481
+ | Risk Management and Compliance | 1.00 | 0.95 | 0.98 | 0.95 |
482
+ | Digital and Mobile Banking | 0.72 | 0.71 | 0.71 | 0.55 |
483
+ | Retail Stores (Online and Offline) | 0.96 | 0.87 | 0.92 | 0.85 |
484
+ | Automotive Dealerships | 0.52 | 0.53 | 0.53 | 0.36 |
485
+ | Restaurants and Food Delivery Services | 0.76 | 0.77 | 0.76 | 0.62 |
486
+ | Entertainment and Media Platforms | 0.80 | 0.84 | 0.82 | 0.70 |
487
+ | Government Services | 0.58 | 0.65 | 0.61 | 0.44 |
488
+ | Travelers and Consumers | 0.89 | 0.89 | 0.89 | 0.80 |
489
+ | Logistics and Supply Chain Management | 0.56 | 0.59 | 0.58 | 0.41 |
490
+ | Customer Support Services | 0.60 | 0.54 | 0.57 | 0.40 |
491
+ | Market Research Firms | 0.52 | 0.49 | 0.51 | 0.34 |
492
+ | Mobile App Development | 0.81 | 0.79 | 0.80 | 0.67 |
493
+ | Game Development | 0.94 | 0.94 | 0.94 | 0.88 |
494
+ | Cloud Computing Services | 0.64 | 0.62 | 0.63 | 0.46 |
495
+ | Data Analytics and Business Intelligence | 0.63 | 0.61 | 0.62 | 0.45 |
496
+ | Cybersecurity Software | 0.54 | 0.59 | 0.57 | 0.39 |
497
+ | User Interface/User Experience Design | 0.63 | 0.64 | 0.63 | 0.46 |
498
+ | Internet of Things (IoT) Development | 0.89 | 0.71 | 0.79 | 0.65 |
499
+ | Project Management Tools | 0.80 | 0.83 | 0.81 | 0.69 |
500
+ | Version Control Systems | 0.77 | 0.73 | 0.75 | 0.60 |
501
+ | Continuous Integration/Continuous Deployment | 0.85 | 0.83 | 0.84 | 0.72 |
502
+ | Issue Tracking and Bug Reporting | 0.64 | 0.62 | 0.63 | 0.46 |
503
+ | Collaborative Development Environments | 0.68 | 0.67 | 0.68 | 0.51 |
504
+ | Team Communication and Chat Tools | 0.94 | 0.91 | 0.93 | 0.87 |
505
+ | Task and Time Management | 0.78 | 0.78 | 0.78 | 0.64 |
506
+ | Customer Support and Feedback | 0.88 | 0.82 | 0.85 | 0.74 |
507
+ | Cloud-based Development Environments | 0.81 | 0.81 | 0.81 | 0.68 |
508
+ | Image Stock Platforms | 0.88 | 0.85 | 0.87 | 0.76 |
509
+ | Video Hosting and Portals | 0.86 | 0.88 | 0.87 | 0.77 |
510
+ | Social Networks | 0.60 | 0.57 | 0.59 | 0.41 |
511
+ | Professional Social Networks | 0.68 | 0.69 | 0.68 | 0.52 |
512
+ | Dating Apps | 0.90 | 0.90 | 0.90 | 0.82 |
513
+ | Aggregate | 0.73 | 0.72 | 0.73 | 0.59 |
514
+
515
+ ##### Unseen domains metrics
516
+
517
+ | Category | Recall | Precision | F1 | Accuracy |
518
+ | ------------------------------------------------ | ------ | --------- | ----- | -------- |
519
+ | Telecommunication Companies [+] | 0.70 | 0.67 | 0.68 | 0.52 |
520
+ | Legal Services [+] | 0.80 | 0.74 | 0.77 | 0.63 |
521
+ | Enterprise Software Development [+] | 0.78 | 0.71 | 0.74 | 0.59 |
522
+ | Artificial Intelligence and Machine Learning [+] | 0.77 | 0.78 | 0.78 | 0.63 |
523
+ | Documentation and Knowledge Sharing [+] | 0.68 | 0.65 | 0.66 | 0.50 |
524
+ | Aggregate | 0.75 | 0.71 | 0.73 | 0.57 |
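
The README does not spell out how the four metrics are computed. The sketch below shows one plausible per-query computation over sets of `(filter name, value)` pairs. The set-based comparison and the definition of accuracy as `TP / (TP + FP + FN)` are assumptions; the latter is chosen only because the per-category accuracy values in the tables are numerically consistent with it (for example, F1 0.68 corresponds to about 0.52).

```python
def filter_metrics(expected, predicted):
    """Compare expected and predicted filters as sets of (name, value) pairs."""
    expected, predicted = set(expected), set(predicted)
    tp = len(expected & predicted)  # filters parsed correctly
    fp = len(predicted - expected)  # filters that should not be there
    fn = len(expected - predicted)  # filters that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumed definition: share of correct filters among all filters seen.
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


expected = {("Date", "2024-01-17"), ("Type", "invoice")}
predicted = {("Date", "2024-01-17"), ("Type", "receipt"), ("Author", "Ann")}
print(filter_metrics(expected, predicted))
```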
525
 
526
  ### Results
527
 
calculate_metrics.py ADDED
File without changes
test_query_parser.py ADDED
File without changes