HunNERwiki: Automatically generated NE tagged corpus for Hungarian




The text of the corpus is generated automatically from Hungarian Wikipedia articles. It carries Named Entity (NE) tags following the CoNLL standard (Person, Organization, Location and Miscellaneous), together with additional morphological annotation. It is the largest NE-tagged corpus built for Hungarian to date and can be used for training and testing NE recognizers. Because it uses the standard tagset, the performance of systems trained on the hunNERwiki corpus can be compared with that of other state-of-the-art systems.
Besides the obvious advantage of a fully automatic building and annotation procedure (reduced annotation cost), the novelty of the corpus lies in its use of collaboratively constructed resources (Wikipedia, DBpedia).

  • in-house software, hunmorph, hundisambig
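
To illustrate how CoNLL-style NE tagging is typically consumed, the following is a minimal Python sketch that collects entity spans from token-per-line data. It rests on several assumptions that the description above does not confirm: a tab-separated layout with the token in the first column and the NE tag in the last, BIO-prefixed tags (B-PER, I-PER, O, and so on), and blank lines between sentences. The function name extract_entities, the sample sentence and the placeholder morphology column are all hypothetical; the column order should be checked against the actual corpus files before use.

```python
from typing import Iterable, List, Optional, Tuple


def extract_entities(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from BIO-tagged lines.

    Assumed layout (hypothetical, not confirmed for hunNERwiki):
    tab-separated columns, token first, NE tag last.
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []          # tokens of the entity being built
    current: Optional[str] = None   # its type, e.g. "PER"

    def flush() -> None:
        # Store the entity collected so far, if any, and reset the state.
        nonlocal tokens, current
        if tokens and current is not None:
            entities.append((" ".join(tokens), current))
        tokens, current = [], None

    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            flush()                 # a blank line ends the sentence
            continue
        cols = line.split("\t")
        token, tag = cols[0], cols[-1]
        if tag.startswith("I-") and current == tag[2:]:
            tokens.append(token)    # continue the open entity
        elif tag.startswith(("B-", "I-")):
            flush()                 # start a new entity
            tokens, current = [token], tag[2:]
        else:
            flush()                 # "O" or an unrecognized tag
    flush()
    return entities


# Made-up sample; the third column is a placeholder, not real hunmorph output.
sample = [
    "Kossuth\tKossuth\tNOUN\tB-PER",
    "Lajos\tLajos\tNOUN\tI-PER",
    "Budapesten\tBudapest\tNOUN\tB-LOC",
    "élt\tél\tVERB\tO",
    ".\t.\tPUNCT\tO",
    "",
]

print(extract_entities(sample))
```

Reading the tag from the last column rather than a fixed index keeps the sketch usable if the number of morphological columns differs from what is assumed here.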