ChisNERE: a premodern Chinese corpus with named entity and relation annotation

Mar 4, 2025
TANG Xuemei (唐雪梅), Zekun Deng, Jun Wang, Qi Su
Abstract
This work contributes to the digital humanities approach to studying premodern Chinese history and culture by creating a large-scale dataset annotated with named entities and relations. Following carefully designed annotation guidelines, we labeled over 200,000 characters, producing a dataset that contains 30,000 named entities across six types and 7,000 relations spanning twenty categories. In experiments on this dataset with pre-trained language models and large language models, named entity recognition (NER) reached an initial F1 score of 91.32 percent. Relation extraction (RE) with a pre-trained language model reached an F1 score of 85.32 percent. While there is still room for improvement, our annotated dataset and models provide a useful starting point for extracting semantic information from premodern Chinese texts. The dataset represents an effort to connect history and technology, improving the accessibility and preservation of premodern Chinese cultural treasures. It can also facilitate downstream tasks such as cultural analysis, knowledge graph construction, and computational understanding of premodern Chinese. Overall, this research is a significant step toward digitally exploring premodern Chinese documents and offers a pathway for future work on knowledge organization and computational analysis of this valuable cultural legacy.
Type
Publication
Digital Scholarship in the Humanities