Merge branch 'main' of https://huggingface.co./novakat/nerkor-cars-onpp-hubert into main
README.md
CHANGED
@@ -23,4 +23,61 @@ inference:
 
 ## Limitations
 
-- max_seq_length = 448
+- max_seq_length = 448
+
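One common way to work within the 448-token limit above is to split longer inputs into overlapping windows and tag each window separately. The sketch below only illustrates that idea: the helper `split_into_windows` and the `stride` value are hypothetical, and it assumes the limit applies to an already tokenized sequence, without accounting for any special tokens the tokenizer may add.

```python
from typing import List, Sequence

MAX_SEQ_LENGTH = 448  # limit stated in the model card


def split_into_windows(tokens: Sequence[str],
                       max_len: int = MAX_SEQ_LENGTH,
                       stride: int = 64) -> List[List[str]]:
    """Split `tokens` into windows of at most `max_len` items.

    Consecutive windows overlap by `stride` tokens (a hypothetical choice),
    so an entity shorter than the overlap that is cut at one window
    boundary appears whole in the next window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return windows
```

Predictions for tokens in the overlapping regions then have to be reconciled, for example by keeping the prediction from the window in which the token is farther from the edge.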
+## Training data
+
+The underlying corpus, [NerKor+CARS-ONPP](https://github.com/novakat/NYTK-NerKor-Cars-OntoNotesPP), was derived from [NYTK-NerKor](https://github.com/nytud/NYTK-NerKor), a Hungarian gold-standard named-entity-annotated corpus of about 1 million tokens.
+It adds about 12k tokens of text (individual sentences) about motor vehicles (cars, buses, motorcycles) from the news archive of [hvg.hu](hvg.hu).
+While the annotation in NYTK-NerKor followed the CoNLL2002 labelling standard with just four NE categories (`PER`, `LOC`, `MISC`, `ORG`), this version of the corpus features over 30 entity types, including all entity types used in the OntoNotes 5.0 English NER annotation.
+The new annotation elaborates on subtypes of the `LOC` and `MISC` entity types and covers non-names such as times and dates, quantities, languages, and nationalities or religious or political groups. It also adds further entity subtypes not present in the OntoNotes 5.0 annotation (see below).
+
+## Tags derived from the OntoNotes 5.0 annotation
+
+Names are annotated according to the following set of types:
+
+| Tag | Description |
+|---|---------|
+| `PER` | = PERSON: People, including fictional |
+| `FAC` | = FACILITY: Buildings, airports, highways, bridges, etc. |
+| `ORG` | = ORGANIZATION: Companies, agencies, institutions, etc. |
+| `GPE` | Geopolitical entities: countries, cities, states |
+| `LOC` | = LOCATION: Non-GPE locations, mountain ranges, bodies of water |
+| `PROD` | = PRODUCT: Vehicles, weapons, foods, etc. (not services) |
+| `EVENT` | Named hurricanes, battles, wars, sports events, etc. |
+| `WORK_OF_ART` | Titles of books, songs, etc. |
+| `LAW` | Named documents made into laws |
+
+The following are also annotated in a style similar to names:
+
+| Tag | Description |
+|---|---------|
+| `NORP` | Nationalities or religious or political groups |
+| `LANGUAGE` | Any named language |
+| `DATE` | Absolute or relative dates or periods |
+| `TIME` | Times smaller than a day |
+| `PERCENT` | Percentage (including "%") |
+| `MONEY` | Monetary values, including unit |
+| `QUANTITY` | Measurements, as of weight or distance |
+| `ORDINAL` | "first", "second" |
+| `CARDINAL` | Numerals that do not fall under another type |
+
+## Additional tags (not in OntoNotes 5)
+Further subtypes of names of type `MISC`:
+
+| Tag | Description |
+|-|-|
+| `AWARD` | Awards and prizes |
+| `CAR` | Cars and trucks |
+| `MEDIA` | Media outlets, TV channels, news portals |
+| `SMEDIA` | Social media platforms |
+| `PROJ` | Projects and initiatives |
+| `MISC` | Unresolved subtypes of MISC entities |
+| `MISC-ORG` | Organization-like unresolved subtypes of MISC entities |
+
+Further non-name entities:
+
+| Tag | Description |
+|-|-|
+| `DUR` | Time duration |
+| `ID` | Identifier |
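Because the fine-grained tags listed above elaborate on the four CoNLL2002 categories, predictions can be collapsed back to `PER`, `LOC`, `MISC`, `ORG` when comparing against the original NYTK-NerKor annotation. The following is a minimal sketch under stated assumptions, not the corpus authors' conversion: only the `MISC` subtypes are named as such in the text, so the grouping of `GPE` and `FAC` under `LOC`, the grouping of `PROD`, `EVENT`, `WORK_OF_ART` and `LAW` under `MISC`, the dropping of non-name types, and the `B-`/`I-` tag prefixes are all assumptions made for illustration.

```python
# Hypothetical mapping from fine-grained tags to the four CoNLL2002
# categories. Only the MISC subtypes (AWARD ... MISC-ORG) are stated in the
# text; the remaining grouping is an assumption.
COARSE = {
    "PER": "PER", "ORG": "ORG",
    "LOC": "LOC", "GPE": "LOC", "FAC": "LOC",
    "PROD": "MISC", "EVENT": "MISC", "WORK_OF_ART": "MISC", "LAW": "MISC",
    "AWARD": "MISC", "CAR": "MISC", "MEDIA": "MISC", "SMEDIA": "MISC",
    "PROJ": "MISC", "MISC": "MISC", "MISC-ORG": "MISC",
}


def collapse_tag(tag: str) -> str:
    """Map one (optionally B-/I- prefixed) tag to a coarse category.

    Non-name types such as DATE or MONEY have no entry in COARSE and are
    mapped to "O" (outside any entity), which is an assumption as well.
    """
    if tag[:2] in ("B-", "I-"):
        prefix, entity_type = tag[:2], tag[2:]
    else:
        prefix, entity_type = "", tag
    coarse = COARSE.get(entity_type)
    return prefix + coarse if coarse else "O"


if __name__ == "__main__":
    fine = ["B-GPE", "O", "B-CAR", "I-CAR", "B-DATE", "B-MISC-ORG"]
    print([collapse_tag(t) for t in fine])
```

The prefix is checked explicitly rather than split on the first hyphen, so that an unprefixed `MISC-ORG` is not mistaken for a tag with prefix `MISC`.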