Unable to set max number of tokens in input higher than 1024
Hey!
I am fine-tuning this model on my own data, but if I set the maximum number of tokens higher than 1024 I get this error:
'IndexError: index out of range in self'
which I did not expect, since I chose this model specifically because it handles longer input sequences.
Does anyone know why that could be the case?
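For reference, here is a minimal sketch of how the advertised length limits of this checkpoint can be inspected (this assumes the LED config attributes exposed by transformers; the values in the comments are what I would expect for allenai/led-base-16384):

from transformers import AutoConfig, AutoTokenizer

checkpoint = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

# Longest sequence the tokenizer will produce by default
print(tokenizer.model_max_length)                # expected: 16384
# LED keeps separate position-embedding limits for encoder and decoder
print(config.max_encoder_position_embeddings)    # expected: 16384
print(config.max_decoder_position_embeddings)    # expected: 1024

In general, an 'IndexError: index out of range in self' raised inside an embedding layer means that some token id or position index exceeded the size of the corresponding embedding table.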
Can you paste your code here?
I get the exact same error, 'IndexError: index out of range in self', when I set the maximum number of tokens higher than 1024. Is there a solution to this problem?
Here is my code:
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding
data_files = {"train":"/content/drive/MyDrive/THESIS/train_baseline_documents.csv", "test":"/content/drive/MyDrive/THESIS/test_baseline_documents.csv", "validation":"/content/drive/MyDrive/THESIS/val_baseline_documents.csv"}
splitted_dataset = load_dataset("csv", data_files=data_files)
# Load the tokenizer of the LED model and tokenize the dataset
checkpoint = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
    return tokenizer(example["Document"], truncation=True, padding=True)
tokenized_dataset = splitted_dataset.map(tokenize_function, batched=True)
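# Note: the tokenizer call inside tokenize_function also accepts an explicit
# max_length to cap the sequence length, e.g. (hypothetical value, not what
# the code above uses):
# tokenizer(example["Document"], truncation=True, padding=True, max_length=4096)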
tokenized_dataset = tokenized_dataset.rename_column("Credibility", "labels")
print(tokenized_dataset)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
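# Note: DataCollatorWithPadding pads each batch dynamically at training time,
# so the padding=True in tokenize_function above is not strictly required.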
# Training arguments
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="/home/theotsio", per_device_train_batch_size=6, seed=42)
# Load the model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
# Train the pretrained model on the specific task
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
from codecarbon import EmissionsTracker
tracker = EmissionsTracker()
tracker.start()
# Run the training
trainer.train()
tracker.stop()
# Predict on the validation and test sets
predictions_val = trainer.predict(tokenized_dataset["validation"])
predictions_test = trainer.predict(tokenized_dataset["test"])
import numpy as np
# Compute the predicted class labels for validation and test
preds_val = np.argmax(predictions_val.predictions, axis=-1)
preds_test = np.argmax(predictions_test.predictions, axis=-1)
import datasets
metric = datasets.load_metric("accuracy")
print("The validation accuracy", metric.compute(predictions=preds_val, references=predictions_val.label_ids))
print("The test set accuracy", metric.compute(predictions=preds_test, references=predictions_test.label_ids))
This is my code. Thank you for your time.