omkarenator committed
Commit e58e006 · verified
1 Parent(s): 84a7120

Update arxiv examples (#3)

- Update arxiv examples (8c01f34cc02a8d64ce4c84c7911320ed878ffcab)

curated.py CHANGED
@@ -24,7 +24,10 @@ overview = (
  "Individual Filtering Discussion for Each Source",
  style="margin-bottom: 5px",
  ),
- Li(B("Estimated Reading Time: 25 minutes"),style="margin-bottom: 5px", ),
  ),
  ),
  )
@@ -34,9 +37,10 @@ curated_sources_intro = Div(
  P(
  "While massive amount of data can be crawled and obtained from the Internet, there are certain sources contain data in additional formats (e.g. PDF documents), or organized and published as official dumps (e.g. Wikipedia). We refer to these sources as curated sources. These dataset often comprises high-quality data that contain domain-specificity, such as academic publications or domain specific discussions. TxT360 was strongly influenced by The Pile",
  D_cite(bibtex_key="thepile"),
- " regarding both inclusion of the dataset and filtering techniques.",
  ),
- P("These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide high quality data. And as mentioned above, they are excluded from the web dataset via URL matching. Details about each of the sources are provided below. ",
  ),
  P(
  "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and book3."
@@ -566,7 +570,7 @@ se_examples = DV2(
  )
  phil_examples = DV("data/curated_samples/philpapers_raw.json", 2, "PhilPapers")
  arx_examples = DV2(
- "data/curated_samples/arxiv_raw.json", "data/curated_samples/arxiv_extract.json", 3
  )
  s2o_examples = DV("data/curated_samples/s2orc_raw.json", 0, "S2ORC")
  s2oa_examples = DV("data/curated_samples/s2orc_abstract_raw.json", 0, "S2ORC Abstract")
@@ -859,19 +863,19 @@ filtering_process = Div(
  ),
  ),
  table_div_s2o,
- # Details(
- # Summary("S2ORC Filtering Examples -- need to update"),
- # Div(
- # P("examples are missing"),
- # style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
- # ),
- # style="""
- # background-color: #FFFAEA; /* Light yellow background */
- # padding: 15px;
- # border-radius: 12px;
- # margin-bottom: 15px
- # """,
- # ),
  ),
  ),
  Section(
@@ -912,19 +916,19 @@ filtering_process = Div(
  ),
  ),
  table_div_s2oa,
- #Details(
  # Summary("S2ORC Abstract Filtering Examples "),
- # Div(
- # P("examples are missing"),
- # style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
- # ),
- # style="""
- # background-color: #FFFAEA; /* Light yellow background */
- # padding: 15px;
- # border-radius: 12px;
- # margin-bottom: 15px
- # """,
- # ),
  )
  ),
  Section(
@@ -1201,9 +1205,9 @@ filtering_process = Div(
  P(B("Unique Data Preparation Challenges: ")),
  Ul(
  Li(
- "The converesation and forum style structure can be a very helpful signal for language model training. During processing the dataset, we try to encode such structure but without introducing too much noise. We choose to use an",
  D_code("<AUTHOR>", language="html"),
- " tag to encode the main thread text by the original poster, and use a ",
  D_code("<COMMENT>", language="html"),
  " tag to encode the replies. We initially choose ",
  D_code("<P>", language="html"),
@@ -1289,7 +1293,9 @@ filtering_process = Div(
  "All content was downloaded leading to high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
  ),
  P(B("Unique Data Preparation Challenges: ")),
- P("The Freelaw text uses a lot of whitespaces and newlines to format the document visually. These lines are not necessary for language model learning and sometimes have confusing semantic meanings. We attempt to unify how whitespaces appear in this dataset with the following heuristics."),
  Ul(
  Li(
  "Consecutive whitespaces and tabs were found. Consecutive Whitespaces and tabes were reduce to one, single whitespace.",
@@ -1309,8 +1315,9 @@ filtering_process = Div(
  ),
  Li(
  "All form feed (",
- D_code("\\f", language="bash"),
- ")characters were removed.", style="margin-bottom: -3px"
  ),
  ),
  P(B("Filters Applied: ")),
 
  "Individual Filtering Discussion for Each Source",
  style="margin-bottom: 5px",
  ),
+ Li(
+ B("Estimated Reading Time: 25 minutes"),
+ style="margin-bottom: 5px",
+ ),
  ),
  ),
  )
 
  P(
  "While massive amount of data can be crawled and obtained from the Internet, there are certain sources contain data in additional formats (e.g. PDF documents), or organized and published as official dumps (e.g. Wikipedia). We refer to these sources as curated sources. These dataset often comprises high-quality data that contain domain-specificity, such as academic publications or domain specific discussions. TxT360 was strongly influenced by The Pile",
  D_cite(bibtex_key="thepile"),
+ " regarding both inclusion of the dataset and filtering techniques.",
  ),
+ P(
+ "These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide high quality data. And as mentioned above, they are excluded from the web dataset via URL matching. Details about each of the sources are provided below. ",
  ),
  P(
  "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and book3."
 
  )
  phil_examples = DV("data/curated_samples/philpapers_raw.json", 2, "PhilPapers")
  arx_examples = DV2(
+ "data/curated_samples/arxiv_raw.json", "data/curated_samples/arxiv_markdown.json", 3
  )
  s2o_examples = DV("data/curated_samples/s2orc_raw.json", 0, "S2ORC")
  s2oa_examples = DV("data/curated_samples/s2orc_abstract_raw.json", 0, "S2ORC Abstract")
 
  ),
  ),
  table_div_s2o,
+ # Details(
+ # Summary("S2ORC Filtering Examples -- need to update"),
+ # Div(
+ # P("examples are missing"),
+ # style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
+ # ),
+ # style="""
+ # background-color: #FFFAEA; /* Light yellow background */
+ # padding: 15px;
+ # border-radius: 12px;
+ # margin-bottom: 15px
+ # """,
+ # ),
  ),
  ),
  Section(
 
  ),
  ),
  table_div_s2oa,
+ # Details(
  # Summary("S2ORC Abstract Filtering Examples "),
+ # Div(
+ # P("examples are missing"),
+ # style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
+ # ),
+ # style="""
+ # background-color: #FFFAEA; /* Light yellow background */
+ # padding: 15px;
+ # border-radius: 12px;
+ # margin-bottom: 15px
+ # """,
+ # ),
  )
  ),
  Section(
 
  P(B("Unique Data Preparation Challenges: ")),
  Ul(
  Li(
+ "The converesation and forum style structure can be a very helpful signal for language model training. During processing the dataset, we try to encode such structure but without introducing too much noise. We choose to use an",
  D_code("<AUTHOR>", language="html"),
+ " tag to encode the main thread text by the original poster, and use a ",
  D_code("<COMMENT>", language="html"),
  " tag to encode the replies. We initially choose ",
  D_code("<P>", language="html"),
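
Review note: the hunk above describes encoding forum structure with an `<AUTHOR>` tag for the original post and a `<COMMENT>` tag for each reply. A minimal sketch of that idea follows. The function name `encode_thread`, the one-line-per-message layout, and the absence of closing tags are all assumptions made for illustration; the diff does not show how the tags are actually emitted.

```python
from typing import List


def encode_thread(post: str, replies: List[str]) -> str:
    """Hypothetical helper, not taken from curated.py or the TxT360 pipeline."""
    # The original poster's text is prefixed with the <AUTHOR> tag.
    lines = [f"<AUTHOR> {post}"]
    # Each reply is prefixed with the <COMMENT> tag (layout is an assumption).
    lines.extend(f"<COMMENT> {reply}" for reply in replies)
    return "\n".join(lines)


print(encode_thread("How do I parse JSON?", ["Use the json module.", "Thanks!"]))
```

Keeping the tags as plain text prefixes adds structure with very few extra tokens, which matches the stated goal of not introducing too much noise.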
 
  "All content was downloaded leading to high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
  ),
  P(B("Unique Data Preparation Challenges: ")),
+ P(
+ "The Freelaw text uses a lot of whitespaces and newlines to format the document visually. These lines are not necessary for language model learning and sometimes have confusing semantic meanings. We attempt to unify how whitespaces appear in this dataset with the following heuristics."
+ ),
  Ul(
  Li(
  "Consecutive whitespaces and tabs were found. Consecutive Whitespaces and tabes were reduce to one, single whitespace.",
 
  ),
  Li(
  "All form feed (",
+ D_code("\\f", language="bash"),
+ ")characters were removed.",
+ style="margin-bottom: -3px",
  ),
  ),
  P(B("Filters Applied: ")),
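
Review note: two of the Freelaw whitespace heuristics are visible in this diff: runs of spaces and tabs are reduced to a single space, and form feed (`\f`) characters are removed. A minimal sketch of just those two rules is below. The function name `normalize_freelaw_whitespace` is hypothetical, the order of the two steps is an assumption, and the heuristics that fall outside the displayed hunks (for example, newline handling) are deliberately not reproduced.

```python
import re


def normalize_freelaw_whitespace(text: str) -> str:
    """Hypothetical helper, not taken from the TxT360 code base."""
    # Remove every form feed character.
    text = text.replace("\f", "")
    # Collapse each run of spaces and tabs into one space.
    # Newlines are left untouched here; their handling is not shown in the diff.
    return re.sub(r"[ \t]+", " ", text)


sample = "IN THE \t\t COURT\f OF   APPEALS\nNo. 12"
print(normalize_freelaw_whitespace(sample))
```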
data/curated_samples/arxiv_markdown.json ADDED
The diff for this file is too large to render. See raw diff
 
data/curated_samples/arxiv_raw.json CHANGED
The diff for this file is too large to render. See raw diff