Update README.md
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Summarize long text and get a SparkNotes-like summary of any topic!

- Generalizes reasonably well to academic & narrative text.
- This is the XL checkpoint, which **produces even better summaries [from a human evaluation perspective](https://long-t5-xl-book-summary-examples.netlify.app/)**.

A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on ASR is [here](https://longt5-booksum-example.netlify.app/).
A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):

> In this chapter, the monster explains how he intends to exact revenge on "the little b\*\*\*\*" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.

While this is a crude example, try running this copypasta through other summarization models to see the difference in comprehension (_even though it's not even a "long" text!_).

* * *
<!-- TOC -->

- [Description](#description)
- [How-To in Python](#how-to-in-python)
  - [Beyond the basics](#beyond-the-basics)
    - [Adjusting parameters](#adjusting-parameters)
    - [LLM.int8 Quantization](#llmint8-quantization)
- [About](#about)
  - [Intended uses & limitations](#intended-uses--limitations)
  - [Training and evaluation data](#training-and-evaluation-data)
  - [Eval results](#eval-results)
- [FAQ](#faq)
  - [How can I run inference with this on CPU?](#how-can-i-run-inference-with-this-on-cpu)
  - [How to run inference over a very long (30k+ tokens) document in batches?](#how-to-run-inference-over-a-very-long-30k-tokens-document-in-batches)
  - [How to fine-tune further?](#how-to-fine-tune-further)
  - [Are there simpler ways to run this?](#are-there-simpler-ways-to-run-this)
- [Training procedure](#training-procedure)
  - [Updates](#updates)
  - [Training hyperparameters](#training-hyperparameters)
  - [Framework versions](#framework-versions)

<!-- /TOC -->
### Beyond the basics

There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for reduced memory consumption.

#### Adjusting parameters
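Other parameters related to beam-search text generation can be passed in the same call as the input text. As a minimal sketch (the parameter values below are illustrative defaults, not tuned recommendations):

```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
)

long_text = "Here is a lot of text I don't want to read. Replace me."

result = summarizer(
    long_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,          # block repeated 3-grams in the summary
    encoder_no_repeat_ngram_size=4,  # discourage copying 4-grams verbatim from the source
    num_beams=4,                     # beam-search width
    early_stopping=True,
)
print(result[0]["summary_text"])
```

Wider beams and the no-repeat constraints generally trade speed for less repetitive, more abstractive output.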
#### LLM.int8 Quantization

> alternative section title: how to get this monster to run inference on free colab runtimes

Per [this PR](https://github.com/huggingface/transformers/pull/20341), LLM.int8 is now supported for `long-t5` models. Per **initial tests**, summarization quality seems to hold while using _significantly_ less memory! \*

How-to: make sure you have pip-installed the **latest `main` branch of `transformers` from GitHub**, along with the `bitsandbytes` package.

Install the latest `main` branch:
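A from-source install along these lines should work (a sketch, not the exact pinned commands):

```sh
pip install -U git+https://github.com/huggingface/transformers.git
pip install -U bitsandbytes accelerate
```

Here `bitsandbytes` supplies the int8 kernels and `accelerate` handles device placement; both are required for 8-bit loading.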
Then load the model in 8-bit:

```python
from transformers import AutoModelForSeq2SeqLM

# load_in_8bit applies LLM.int8 at load time (requires bitsandbytes + accelerate)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    load_in_8bit=True,
    device_map="auto",
)
```
The above is already present in the Colab demo linked at the top of the model card.

Do you like to ask questions? Great. But first, check out the [how LLM.int8 works blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) by huggingface.

\* More rigorous metrics-based research comparing beam-search summarization with and without LLM.int8 will take place over time.

* * *
### Intended uses & limitations

While this model seems to improve factual consistency, **don't take summaries as foolproof; check anything that seems odd**.

Specifically, watch for negation statements (i.e., the model says _this thing does not have [ATTRIBUTE]_ when instead it should have said _this thing has lots of [ATTRIBUTE]_).

- I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually check for this by comparing a particular statement with what the surrounding sentences imply.

### Training and evaluation data

`kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209).

- For **initial fine-tuning**, only input text with 12288 input tokens or less and 1024 output tokens or less was used (_i.e., longer rows were dropped before training_) for memory reasons. After a quick analysis, summaries in the 12288-16384 token range are a **small** minority in this dataset.
- In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. **Therefore, take the validation-set results with a grain of salt; the primary metrics should (always) come from the test set.**
- The **final stages of fine-tuning** used the standard 16384-token input/1024-token output conventions, preserving the standard in/out lengths (_and truncating longer sequences_). This did not seem to change the loss/performance much.
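For illustration, the length filtering described above might look like the following (a sketch: the `chapter`/`summary_text` column names are assumptions to verify against the dataset):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/long-t5-tglobal-xl-16384-book-summary"
)
ds = load_dataset("kmfoda/booksum", split="train")

def within_limits(example, max_in: int = 12288, max_out: int = 1024) -> bool:
    # keep rows whose tokenized source/target fit the training limits
    n_in = len(tokenizer(example["chapter"]).input_ids)
    n_out = len(tokenizer(example["summary_text"]).input_ids)
    return n_in <= max_in and n_out <= max_out

ds_filtered = ds.filter(within_limits)
```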
### Eval results

Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.

**Please read the note above: due to the training methods, performance on the validation set looks better than the results on the test set will be.** The model achieves the following results on the evaluation set:

- eval_loss: 1.2756
- eval_rouge1: 41.8013
### How to run inference over a very long (30k+ tokens) document in batches?

See `summarize.py` in [the code for my hf space Document Summarization](https://huggingface.co/spaces/pszemraj/document-summarization/blob/main/summarize.py) :)

You can also use the same code to split a document into batches of 4096 tokens, etc., and iterate over them with the model. This is useful in situations where CUDA memory is limited; a rough sketch of the idea follows.
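The sketch below is illustrative only (`summarize.py` in the space linked above is the maintained version):

```python
from transformers import AutoTokenizer, pipeline

model_name = "pszemraj/long-t5-tglobal-xl-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model_name)

def summarize_in_batches(text: str, batch_length: int = 4096) -> str:
    # tokenize once, then walk over fixed-size windows of input ids
    ids = tokenizer(text, truncation=False).input_ids
    summaries = []
    for i in range(0, len(ids), batch_length):
        chunk = tokenizer.decode(ids[i : i + batch_length], skip_special_tokens=True)
        summaries.append(summarizer(chunk)[0]["summary_text"])
    # join the per-batch summaries into one running summary
    return "\n\n".join(summaries)
```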
**Update:** see the section on the `textsum` package below.
### How to fine-tune further?

See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization).
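A representative invocation of the example script (a sketch: these are standard `run_summarization.py` flags, but the BookSum column names and the exact lengths are assumptions to check):

```sh
python run_summarization.py \
    --model_name_or_path pszemraj/long-t5-tglobal-xl-16384-book-summary \
    --dataset_name kmfoda/booksum \
    --text_column chapter \
    --summary_column summary_text \
    --max_source_length 16384 \
    --max_target_length 1024 \
    --do_train \
    --output_dir ./long-t5-xl-booksum-ft
```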
### Are there simpler ways to run this?

For this reason, I created a Python package utility. It's called [textsum](https://github.com/pszemraj/textsum), and you can use it to load models and summarize things in a few lines of code.
```sh
pip install textsum
```

```python
from textsum.summarize import Summarizer

summarizer = Summarizer(
    model_name_or_path="pszemraj/long-t5-tglobal-xl-16384-book-summary"
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```
This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.
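For example, the CLI can batch-summarize a directory of text files (the `textsum-dir` entry point and its flags are assumptions here; verify with the package README or `textsum-dir --help`):

```sh
# summarize every text file in a directory (see --help for model-selection flags)
textsum-dir /path/to/dir
```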
For details, explanations, and documentation, see the README (_linked above_) or the [wiki](https://github.com/pszemraj/textsum/wiki).

* * *
## Training procedure

### Updates

Updates to this model/model card will be posted here when relevant. The model seems to be fairly converged; if updates/improvements are possible using the `BookSum` dataset, this repo will be updated.

### Training hyperparameters