pszemraj committed on
Commit
ba3e462
·
1 Parent(s): 3845156

Update README.md

Files changed (1)
  1. README.md +46 -45
README.md CHANGED
@@ -131,10 +131,10 @@ model-index:
131
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
132
  </a>
133
 
134
- Summarize long text and get a SparkNotes-esque summary of arbitrary topics!
135
 
136
- - Generalizes reasonably well to academic & narrative text.
137
- - This is the XL checkpoint, which **from a human-evaluation perspective, [produces even better summaries](https://long-t5-xl-book-summary-examples.netlify.app/)**.
138
 
139
  A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on ASR is [here](https://longt5-booksum-example.netlify.app/).
140
 
@@ -144,7 +144,7 @@ A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/
144
 
145
  > In this chapter, the monster explains how he intends to exact revenge on "the little b\*\*\*\*" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.
146
 
147
- While a somewhat crude example, try running this copypasta through other summarization models to see the difference in comprehension (_despite it not even being a "long" text!_)
148
 
149
  * * *
150
 
@@ -152,21 +152,24 @@ While a somewhat crude example, try running this copypasta through other summari
152
 
153
  <!-- TOC -->
154
 
155
- - [Description](#description)
156
- - [How-To in Python](#how-to-in-python)
157
- - [Beyond the basics](#beyond-the-basics)
158
- - [About](#about)
159
- - [Intended uses & limitations](#intended-uses--limitations)
160
- - [Training and evaluation data](#training-and-evaluation-data)
161
- - [Eval results](#eval-results)
162
- - [FAQ](#faq)
163
- - [How can I run inference with this on CPU?](#how-can-i-run-inference-with-this-on-cpu)
164
- - [How to run inference over a very long (30k+ tokens) document in batches?](#how-to-run-inference-over-a-very-long-30k-tokens-document-in-batches)
165
- - [How to fine-tune further?](#how-to-fine-tune-further)
166
- - [Training procedure](#training-procedure)
167
- - [Updates](#updates)
168
- - [Training hyperparameters](#training-hyperparameters)
169
- - [Framework versions](#framework-versions)
 
 
 
170
 
171
  <!-- /TOC -->
172
 
@@ -201,7 +204,7 @@ print(result[0]["summary_text"])
201
 
202
  ### Beyond the basics
203
 
204
- There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for decreased memory devouring.
205
 
206
  #### Adjusting parameters
207
 
@@ -209,11 +212,11 @@ Pass [other parameters related to beam search textgen](https://huggingface.co/bl
209
 
210
  #### LLM.int8 Quantization
211
 
212
- > alternate section title: how to get this monster to run inference on free Colab runtimes
213
 
214
- Per [this PR](https://github.com/huggingface/transformers/pull/20341) LLM.int8 is now supported for `long-t5` models. Per **initial testing** summarization quality appears to hold while requiring _significantly_ less memory! \*
215
 
216
- How-to: essentially ensure you have pip installed from the **latest GitHub repo main** version of `transformers`, and `bitsandbytes`
217
 
218
  install the latest `main` branch:
219
 
@@ -238,11 +241,11 @@ model = AutoModelForSeq2SeqLM.from_pretrained(
238
  )
239
  ```
240
 
241
- The above is already present in the Colab demo linked at the top of the model card.
242
 
243
- Do you love to ask questions? Awesome. But first, check out the [how LLM.int8 works blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) by huggingface.
244
 
245
- \* More rigorous metric-based investigation into comparing beam-search summarization with and without LLM.int8 will take place over time.
246
 
247
  * * *
248
 
@@ -250,25 +253,25 @@ Do you love to ask questions? Awesome. But first, check out the [how LLM.int8 wo
250
 
251
  ### Intended uses & limitations
252
 
253
- While this model seems to improve upon factual consistency, **do not take summaries to be foolproof and check things that seem odd**.
254
 
255
- Specifically: negation statements (i.e., model says: _This thing does not have [ATTRIBUTE]_ where instead it should have said _This thing has a lot of [ATTRIBUTE]_).
256
 
257
- - I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually fact-check this by comparing a specific claim to what the surrounding sentences imply.
258
 
259
  ### Training and evaluation data
260
 
261
  `kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209).
262
 
263
- - **Initial fine-tuning** only used input text with 12288 tokens input or less and 1024 tokens output or less (_i.e. rows with longer were dropped before training_) for memory reasons. Per brief analysis, summaries in the 12288-16384 range in this dataset are in the **small** minority
264
- - In addition, this initial training combined the training and validation sets and trained on these in aggregate to increase the functional dataset size. **Therefore, take the validation set results with a grain of salt; primary metrics should be (always) the test set.**
265
- - **final phases of fine-tuning** used the standard conventions of 16384 input/1024 output keeping everything (truncating longer sequences). This did not appear to change the loss/performance much.
266
 
267
  ### Eval results
268
 
269
  Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.
270
 
271
- **Please read the note above as due to training methods, validation set performance looks better than the test set results will be**. The model achieves the following results on the evaluation set:
272
 
273
  - eval_loss: 1.2756
274
  - eval_rouge1: 41.8013
@@ -307,15 +310,17 @@ lol
307
 
308
  See `summarize.py` in [the code for my hf space Document Summarization](https://huggingface.co/spaces/pszemraj/document-summarization/blob/main/summarize.py) :)
309
 
310
- You can also use the same code to split a document into batches of 4096, etc., and run over those with the model. This is useful in situations where CUDA memory is limited.
 
 
311
 
312
  ### How to fine-tune further?
313
 
314
  See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization)
315
 
316
- ### Is there an easier way to use this?
317
 
318
- I have created a python package utility for this reason. It's called [textsum](https://github.com/pszemraj/textsum), and you can use it to load models and summarize things in a few lines of code.
319
 
320
  ```sh
321
  pip install textsum
@@ -330,18 +335,14 @@ summarizer = Summarizer(
330
  model_name_or_path="pszemraj/long-t5-tglobal-xl-16384-book-summary"
331
  )
332
 
333
- # summarize a long string
334
- out_str = summarizer.summarize_string(
335
- "This is a long string of text that will be summarized."
336
- )
337
  print(f"summary: {out_str}")
338
-
339
  ```
340
 
341
- This package provides easy-to-use interfaces for using summarization models on text documents of arbitrary length. Currently implemented interfaces include a python API, CLI, and a shareable demo app.
342
-
343
- For details, explanations, and docs, see the README (_linked above_) or the [wiki](https://github.com/pszemraj/textsum/wiki).
344
 
 
345
 
346
  * * *
347
 
@@ -349,7 +350,7 @@ For details, explanations, and docs, see the README (_linked above_) or the [wik
349
 
350
  ### Updates
351
 
352
- Updates to this model/model card will be posted here as relevant. The model seems fairly converged; if updates/improvements are possible using the `BookSum` dataset, this repo will be updated.
353
 
354
  ### Training hyperparameters
355
 
 
131
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
132
  </a>
133
 
134
+ Summarize long text and get a SparkNotes-like summary of any topic!
135
 
136
+ - Generalizes reasonably well to academic & narrative text.
137
+ - This is the XL checkpoint, which **produces even better summaries [from a human evaluation perspective](https://long-t5-xl-book-summary-examples.netlify.app/)**.
138
 
139
  A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on ASR is [here](https://longt5-booksum-example.netlify.app/).
140
 
 
144
 
145
  > In this chapter, the monster explains how he intends to exact revenge on "the little b\*\*\*\*" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.
146
 
147
+ While this is a crude example, try running this copypasta through other summarization models to see the difference in comprehension (_even though it's not even a "long" text!_).
148
 
149
  * * *
150
 
 
152
 
153
  <!-- TOC -->
154
 
155
+ - [Description](#description)
156
+ - [How-To in Python](#how-to-in-python)
157
+ - [Beyond the basics](#beyond-the-basics)
158
+ - [Adjusting parameters](#adjusting-parameters)
159
+ - [LLM.int8 Quantization](#llmint8-quantization)
160
+ - [About](#about)
161
+ - [Intended uses & limitations](#intended-uses--limitations)
162
+ - [Training and evaluation data](#training-and-evaluation-data)
163
+ - [Eval results](#eval-results)
164
+ - [FAQ](#faq)
165
+ - [How can I run inference with this on CPU?](#how-can-i-run-inference-with-this-on-cpu)
166
+ - [How to run inference over a very long (30k+ tokens) document in batches?](#how-to-run-inference-over-a-very-long-30k-tokens-document-in-batches)
167
+ - [How to fine-tune further?](#how-to-fine-tune-further)
168
+ - [Are there simpler ways to run this?](#are-there-simpler-ways-to-run-this)
169
+ - [Training procedure](#training-procedure)
170
+ - [Updates](#updates)
171
+ - [Training hyperparameters](#training-hyperparameters)
172
+ - [Framework versions](#framework-versions)
173
 
174
  <!-- /TOC -->
175
 
 
204
 
205
  ### Beyond the basics
206
 
207
+ There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for reduced memory consumption.
208
 
209
  #### Adjusting parameters
210
 
 
212
 
213
  #### LLM.int8 Quantization
214
 
215
+ > alternative section title: how to get this monster to run inference on free Colab runtimes
216
 
217
+ Per [this PR](https://github.com/huggingface/transformers/pull/20341), LLM.int8 is now supported for `long-t5` models. Per **initial tests**, the summarization quality seems to hold while using _significantly_ less memory! \*
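To give a rough sense of where the memory savings come from, here is a back-of-the-envelope sketch of weight memory at different precisions. The parameter count (~3 billion for the XL checkpoint) is an assumption used for illustration, not a figure from this model card, and the function name is hypothetical. Activations, beam-search state, and the portion of values LLM.int8 keeps in 16-bit are ignored, so real usage will be higher.

```python
# Rough weight-memory estimate per precision (weights only).
# ASSUMPTION: ~3 billion parameters for the XL checkpoint (approximate, illustrative).
N_PARAMS = 3_000_000_000

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Estimated weight memory for each precision.
estimates = {
    name: weight_memory_gb(N_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

Under these assumptions, 8-bit weights take roughly a quarter of the fp32 footprint, which is what makes smaller GPUs plausible targets.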
218
 
219
+ How-to: basically make sure you have pip-installed the **latest GitHub repo main** version of `transformers`, and also the `bitsandbytes` package.
220
 
221
  install the latest `main` branch:
222
 
 
241
  )
242
  ```
243
 
244
+ The above is already present in the Colab demo linked at the top of the model card.
245
 
246
+ Do you like to ask questions? Great. But first, check out the [how LLM.int8 works blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) by huggingface.
247
 
248
+ \* More rigorous metrics-based research comparing beam-search summarization with and without LLM.int8 will take place over time.
249
 
250
  * * *
251
 
 
253
 
254
  ### Intended uses & limitations
255
 
256
+ While this model seems to improve factual consistency, **don't take summaries as foolproof and check things that seem odd**.
257
 
258
+ Specifically: negation statements (i.e., the model says: _this thing does not have [ATTRIBUTE]_, when instead it should have said _this thing has lots of [ATTRIBUTE]_).
259
 
260
+ - I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually check this by comparing a particular statement with what the surrounding sentences imply.
261
 
262
  ### Training and evaluation data
263
 
264
  `kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209).
265
 
266
+ - For **initial fine-tuning**, only rows with 12288 input tokens or fewer and 1024 output tokens or fewer were used (_i.e. rows longer than that were dropped before training_) for memory reasons. Per a brief analysis, summaries in the 12288-16384 range are in the **small** minority in this dataset.
267
+ - In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. **Therefore, take the validation set results with a grain of salt; the primary metrics should (always) be those on the test set.**
268
+ - The **final stages of fine-tuning** used the standard convention of 16384 input/1024 output tokens, keeping every row (_and truncating longer sequences_). This did not seem to change the loss/performance much.
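The two data-preparation strategies described above (dropping over-length rows first, then keeping everything and truncating) can be sketched on already-tokenized rows. This is a minimal illustration, not the actual training code: the `input_ids`/`labels` field names and both helper functions are assumptions made for the example.

```python
from typing import Dict, List

# Length limits taken from the text above.
MAX_IN_INITIAL = 12288  # initial fine-tuning: input length cutoff
MAX_IN_FINAL = 16384    # final stages: input truncation length
MAX_OUT = 1024          # output (summary) length limit in both phases

# ASSUMPTION: each row is a dict of token-id lists with these hypothetical keys.
Row = Dict[str, List[int]]


def filter_rows(rows: List[Row], max_in: int = MAX_IN_INITIAL,
                max_out: int = MAX_OUT) -> List[Row]:
    """Initial phase: drop any row whose input or output is too long."""
    return [
        r for r in rows
        if len(r["input_ids"]) <= max_in and len(r["labels"]) <= max_out
    ]


def truncate_rows(rows: List[Row], max_in: int = MAX_IN_FINAL,
                  max_out: int = MAX_OUT) -> List[Row]:
    """Final phase: keep every row, cutting sequences down to the limits."""
    return [
        {"input_ids": r["input_ids"][:max_in], "labels": r["labels"][:max_out]}
        for r in rows
    ]


# Toy data: one short row and one over-length row.
toy_rows = [
    {"input_ids": [0] * 5000, "labels": [0] * 300},
    {"input_ids": [0] * 20000, "labels": [0] * 1500},
]
kept = filter_rows(toy_rows)        # only the short row survives
truncated = truncate_rows(toy_rows)  # both rows survive, the long one is cut
```

Filtering shrinks the dataset but never feeds the model a clipped document; truncation keeps every example at the cost of losing the tail of long ones.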
269
 
270
  ### Eval results
271
 
272
  Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.
273
 
274
+ **Please read the note above: because of the training methods, validation set performance looks better than the test set results will be**. The model achieves the following results on the evaluation set:
275
 
276
  - eval_loss: 1.2756
277
  - eval_rouge1: 41.8013
 
310
 
311
  See `summarize.py` in [the code for my hf space Document Summarization](https://huggingface.co/spaces/pszemraj/document-summarization/blob/main/summarize.py) :)
312
 
313
+ You can also use the same code to split a document into batches of 4096, etc., and iterate over them with the model. This is useful in situations where CUDA memory is limited.
314
+
315
+ **Update:** see the section on the `textsum` package below.
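As a rough illustration of the batching idea, the sketch below splits a list of token ids into consecutive batches of at most 4096 tokens. It does not reproduce the linked `summarize.py`; `split_into_batches` is a hypothetical helper, and in practice each batch would be passed to the model and the partial summaries joined afterwards.

```python
from typing import List


def split_into_batches(token_ids: List[int], batch_size: int = 4096) -> List[List[int]]:
    """Split token ids into consecutive, non-overlapping batches.

    Every batch has `batch_size` tokens except possibly the last one.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [
        token_ids[start:start + batch_size]
        for start in range(0, len(token_ids), batch_size)
    ]


# Toy example: a "document" of 10,000 token ids.
document_ids = list(range(10_000))
batches = split_into_batches(document_ids, batch_size=4096)
print([len(b) for b in batches])
```

Non-overlapping batches are the simplest choice; a sliding window with some overlap is a common variation when context at batch boundaries matters.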
316
 
317
  ### How to fine-tune further?
318
 
319
  See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization)
320
 
321
+ ### Are there simpler ways to run this?
322
 
323
+ I created a Python package utility for this purpose. It's called [textsum](https://github.com/pszemraj/textsum), and you can use it to load models and summarize things in a few lines of code.
324
 
325
  ```sh
326
  pip install textsum
 
335
  model_name_or_path="pszemraj/long-t5-tglobal-xl-16384-book-summary"
336
  )
337
 
338
+ long_string = "This is a long string of text that will be summarized."
339
+ out_str = summarizer.summarize_string(long_string)
 
 
340
  print(f"summary: {out_str}")
 
341
  ```
342
 
343
+ This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.
 
 
344
 
345
+ For details, explanations, and documentation, see the README (_linked above_) or the [wiki](https://github.com/pszemraj/textsum/wiki).
346
 
347
  * * *
348
 
 
350
 
351
  ### Updates
352
 
353
+ Updates to this model/model card will be posted here when relevant. The model seems to be fairly converged; if updates/improvements are possible using the `BookSum` dataset, this repo will be updated.
354
 
355
  ### Training hyperparameters
356