|
---
datasets:
- BEE-spoke-data/bees-internal
language:
- en
license: apache-2.0
---
|
|
|
# BeeTokenizer |
|
|
|
> note: this is **literally** a tokenizer trained on beekeeping text |
|
|
|
After minutes of hard work, it is now available. |
|
|
|
|
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

output = tokenizer(test_string)
print(f"Test string: {test_string}")
print(f"Tokens ({len(output.input_ids)}):\n\t{output.input_ids}")
```
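
As a quick sanity check, you can decode the IDs back into text with the standard `decode` method; a minimal sketch (reusing `tokenizer` and `output` from above) looks like this:

```python
# round-trip: IDs -> text (should recover the original sentence,
# up to whitespace normalization)
decoded = tokenizer.decode(output.input_ids)
print(decoded)
```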
|
|
|
|
|
## Notes |
|
|
|
1. the default tokenizer (on branch `main`) has a vocab size of 32000 (see the snippet after this list to verify)
2. it is based on the `SentencePieceBPETokenizer` class
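
If you want to verify the vocab size yourself, or pin a specific branch with the `revision` argument, a minimal sketch looks like this:

```python
from transformers import AutoTokenizer

# `revision` selects a branch or commit; "main" is the default
tokenizer = AutoTokenizer.from_pretrained(
    "BEE-spoke-data/BeeTokenizer", revision="main"
)
print(tokenizer.vocab_size)  # 32000 on `main`, per note 1 above
```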
|
|
|
<details> |
|
<summary>How to Tokenize Text and Retrieve Offsets</summary> |
|
|
|
To tokenize a complex sentence and also retrieve the offset mapping (the character span each token covers in the original string), you can use the following Python snippet:
|
|
|
```python
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

# Sample complex sentence related to beekeeping
test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

# Tokenize the input string and get the offset mapping
# (offset mappings are only available with fast tokenizers)
output = tokenizer(test_string, return_offsets_mapping=True)

print(f"Test string: {test_string}")

# Tokens
tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
print(f"Tokens: {tokens}")

# Offsets
offsets = output['offset_mapping']
print(f"Offsets: {offsets}")
```
|
|
|
This should result in the following (_Feb '24 version_): |
|
|
|
```
>>> print(f"Test string: {test_string}")
Test string: When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination.
>>>
>>> # Tokens
>>> tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
>>> print(f"Tokens: {tokens}")
Tokens: ['When', '▁dealing', '▁with', '▁Varroa', '▁destructor', '▁mites,', "▁it's", '▁cru', 'cial', '▁to', '▁administer', '▁the', '▁right', '▁acar', 'icides', '▁during', '▁the', '▁late', '▁autumn', '▁months,', '▁but', '▁only', '▁after', '▁ensuring', '▁that', '▁the', '▁worker', '▁bee', '▁population', '▁is', '▁free', '▁from', '▁pesticide', '▁contam', 'ination.']
>>>
>>> # Offsets
>>> offsets = output['offset_mapping']
>>> print(f"Offsets: {offsets}")
Offsets: [(0, 4), (4, 12), (12, 17), (17, 24), (24, 35), (35, 42), (42, 47), (47, 51), (51, 55), (55, 58), (58, 69), (69, 73), (73, 79), (79, 84), (84, 90), (90, 97), (97, 101), (101, 106), (106, 113), (113, 121), (121, 125), (125, 130), (130, 136), (136, 145), (145, 150), (150, 154), (154, 161), (161, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 199), (199, 206), (206, 214)]
```
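
Each `(start, end)` pair indexes directly into the original string, so the offsets let you map every token back to the exact text it covers (a span includes the leading space that SentencePiece renders as `▁`). A minimal sketch, continuing with the `tokens`, `offsets`, and `test_string` variables from above:

```python
# map each token back to the exact substring it covers
for (start, end), token in zip(offsets, tokens):
    print(f"{token!r:>15} -> {test_string[start:end]!r}")
```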
|
|
|
If you compare this to the output of [the Llama tokenizer](https://huggingface.co./fxmarty/tiny-llama-fast-tokenizer) (below), you can quickly see which is better suited for beekeeping-related language modeling: BeeTokenizer covers the sentence in 35 tokens, while the Llama tokenizer needs 54 and splits domain terms such as `Varroa` and `acaricides` into several pieces.
|
|
|
```
>>> print(f"Test string: {test_string}")
Test string: When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination.
>>> # Tokens
>>> tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
>>> print(f"Tokens: {tokens}")
Tokens: ['<s>', '▁When', '▁dealing', '▁with', '▁Var', 'ro', 'a', '▁destruct', 'or', '▁mit', 'es', ',', '▁it', "'", 's', '▁cru', 'cial', '▁to', '▁admin', 'ister', '▁the', '▁right', '▁ac', 'ar', 'ic', 'ides', '▁during', '▁the', '▁late', '▁aut', 'umn', '▁months', ',', '▁but', '▁only', '▁after', '▁ens', 'uring', '▁that', '▁the', '▁worker', '▁be', 'e', '▁population', '▁is', '▁free', '▁from', '▁p', 'estic', 'ide', '▁cont', 'am', 'ination', '.']
>>> offsets = output['offset_mapping']
>>> print(f"Offsets: {offsets}")
Offsets: [(0, 0), (0, 4), (4, 12), (12, 17), (17, 21), (21, 23), (23, 24), (24, 33), (33, 35), (35, 39), (39, 41), (41, 42), (42, 45), (45, 46), (46, 47), (47, 51), (51, 55), (55, 58), (58, 64), (64, 69), (69, 73), (73, 79), (79, 82), (82, 84), (84, 86), (86, 90), (90, 97), (97, 101), (101, 106), (106, 110), (110, 113), (113, 120), (120, 121), (121, 125), (125, 130), (130, 136), (136, 140), (140, 145), (145, 150), (150, 154), (154, 161), (161, 164), (164, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 191), (191, 196), (196, 199), (199, 204), (204, 206), (206, 213), (213, 214)]
```
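
To reproduce the comparison yourself, a minimal sketch (reusing `test_string` from above) is to load both tokenizers and count the tokens each needs for the same sentence:

```python
from transformers import AutoTokenizer

# compare token counts for the same sentence
for repo in ("BEE-spoke-data/BeeTokenizer", "fxmarty/tiny-llama-fast-tokenizer"):
    tok = AutoTokenizer.from_pretrained(repo)
    print(f"{repo}: {len(tok(test_string).input_ids)} tokens")
```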
|
</details>