---
language:
- en
license: apache-2.0
datasets:
- BEE-spoke-data/bees-internal
---

# BeeTokenizer

> note: this is **literally** a tokenizer trained on beekeeping text

After minutes of hard work, it is now available.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

output = tokenizer(test_string)
print(f"Test string: {test_string}")
print(f"Tokens:\n\t{output.input_ids}")
```

## Notes

- the default tokenizer (on branch `main`) has a vocab size of 32128 (see the quick check below)
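If you want to double-check the vocabulary size after loading, the minimal sketch below prints it. The `revision` argument is a standard `from_pretrained` option and is only needed if you want a branch other than `main`; the expected value of 32128 comes from the note above.

```python
from transformers import AutoTokenizer

# load the default tokenizer; pass a different `revision` to pull another branch
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer", revision="main")

# base vocab size vs. total size including any added special tokens
print(tokenizer.vocab_size)  # expected to be 32128 for the default tokenizer
print(len(tokenizer))        # may differ slightly if extra tokens were added
```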
## How to Tokenize Text and Retrieve Offsets

To tokenize a complex sentence and also retrieve the offset mapping, you can use the following Python snippet (`return_offsets_mapping=True` is only supported by fast tokenizers, which is what `AutoTokenizer` loads for this repo):

```python
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

# Sample complex sentence related to beekeeping
test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

# Tokenize the input string and return the character offsets of each token
output = tokenizer.encode_plus(test_string, return_offsets_mapping=True)

print(f"Test string: {test_string}")

# Tokens
tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
print(f"Tokens: {tokens}")

# Offsets
offsets = output['offset_mapping']
print(f"Offsets: {offsets}")
```

This should produce the following output (_Nov 2023 version_):
>>> print(f"Test string: {test_string}")
  Test string: When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination.
  >>> 
  >>> # Tokens
  >>> tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
  >>> print(f"Tokens: {tokens}")
  Tokens: ['▁When', '▁dealing', '▁with', '▁Varroa', '▁destructor', '▁mites,', "▁it's", '▁cru', 'cial', '▁to', '▁administer', '▁the', '▁right', '▁acar', 'icides', '▁during', '▁the', '▁late', '▁autumn', '▁months,', '▁but', '▁only', '▁after', '▁ensuring', '▁that', '▁the', '▁worker', '▁bee', '▁population', '▁is', '▁free', '▁from', '▁pesticide', '▁contamination', '.']
  >>> 
  >>> # Offsets
  >>> offsets = output['offset_mapping']
  >>> print(f"Offsets: {offsets}")
  Offsets: [(0, 4), (4, 12), (12, 17), (17, 24), (24, 35), (35, 42), (42, 47), (47, 51), (51, 55), (55, 58), (58, 69), (69, 73), (73, 79), (79, 84), (84, 90), (90, 97), (97, 101), (101, 106), (106, 113), (113, 121), (121, 125), (125, 130), (130, 136), (136, 145), (145, 150), (150, 154), (154, 161), (161, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 199), (199, 213), (213, 214)]
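
Each entry in `offset_mapping` is a `(start, end)` character span into the original string, so you can slice `test_string` to recover the exact text a token covers. A minimal sketch, reusing `output` and `test_string` from the snippet above:

```python
# pair every token with the slice of the original string it came from
for token_id, (start, end) in zip(output['input_ids'], output['offset_mapping']):
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(f"{token!r:>18} -> {test_string[start:end]!r}")
```

This is handy when projecting token-level predictions (e.g. span labels) back onto the raw text.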