Ming Jiang | publications

2024

COLM

How Well Do LLMs Identify Cultural Unity in Diversity? Jialin Li, Junli Wang, Junjie Hu, and Ming Jiang arXiv preprint arXiv:2408.05102
arXiv

Prompting Large Vision-Language Models for Compositional Reasoning Timothy Ossowski, Ming Jiang, and Junjie Hu arXiv preprint arXiv:2401.11337
NAACL

CPopQA: Ranking cultural concept popularity by LLMs Ming Jiang, and Mansi Joshi In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) [Abs]
Many recent studies examining the knowledge capacity of large language models (LLM) have focused on knowledge explicitly learned from the pretraining data or implicitly inferable from similar contexts. However, the extent to which an LLM effectively captures corpus-level statistical trends of concepts for reasoning, especially long-tail ones, is largely underexplored. In this study, we introduce a novel few-shot question-answering task (CPopQA) that examines LLMs’ statistical ranking abilities for long-tail cultural concepts (e.g., holidays), particularly focusing on these concepts’ popularity in the United States and the United Kingdom, respectively. We curate a dataset of 457 holidays across 58 countries, generating a total of 9,000 QA testing pairs. Experiments on four strong LLMs show that open-sourced LLMs still lag way behind close LLM API (e.g., GPT-3.5) in statistical ranking of cultural concepts. Notably, GPT-3.5 exhibited its potential to identify geo-cultural proximity across continents.
CVPR

Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024) [Abs]
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments. At each navigation step the agent selects from possible candidate locations and then makes the move. For better navigation planning the lookahead exploration strategy aims to effectively evaluate the agent’s next action by accurately anticipating the future environment of candidate locations. To this end some existing works predict RGB images for future environments while this strategy suffers from image distortion and high computational cost. To address these issues we propose the pre-trained hierarchical neural radiance representation model (HNR) to produce multi-level semantic features for future environments which are more robust and efficient than pixel-wise RGB reconstruction. Furthermore with the predicted future environmental representations our lookahead VLN model is able to construct the navigable future path tree and select the optimal path via efficient parallel evaluation. Extensive experiments on the VLN-CE datasets confirm the effectiveness of our method.

2023

arXiv

Empowering LLM-based machine translation with cultural awareness Binwei Yao, Ming Jiang, Diyi Yang, and Junjie Hu arXiv preprint arXiv:2305.14328
Big Data

PHAED: A speaker-aware parallel hierarchical attentive encoder-decoder model for multi-turn dialogue generation Zihao Wang, Ming Jiang, and Junli Wang IEEE Transactions on Big Data [Abs]
"This article presents a novel open-domain dialogue generation model emphasizing the differentiation of speakers in multi-turn conversations. Differing from prior work that treats the conversation history as a long text, we argue that capturing relative social relations among utterances (i.e., generated by either the same speaker or different persons) benefits the machine capturing fine-grained context information from a conversation history to improve context coherence in the generated response. Given that, we propose a Parallel Hierarchical Attentive Encoder-Decoder (PHAED) model that can effectively leverage conversation history by modeling each utterance with the awareness of its speaker and contextual associations with the same speaker’s previous messages. Specifically, to distinguish the speaker roles over a multi-turn conversation (involving two speakers), we regard the utterances from one speaker as responses and those from the other as queries. After understanding queries via hierarchical encoder with inner-query and inter-query encodings, transformer-xl style decoder reuses the hidden states of previously generated responses to generate a new response. Our empirical results with three large-scale benchmarks show that PHAED significantly outperforms baseline models on both automatic and human evaluations. Furthermore, our ablation study shows that dialogue models with speaker tokens can generally decrease the possibility of generating non-coherent responses."

2022

Dissertation

The influence of optical character recognition quality on the robustness of semantic encoding Ming Jiang Thesis at University of Illinois Urbana Champaign [Abs]
"Historical textual collections, digitized by machine scanning and optical character recognition (OCR), offer unique opportunities for exploring and disseminating heritage knowledge. Research innovations in this field, including recent advances in natural language processing (NLP), have been widely promoted as promising new tools for supporting research on these collections. Unfortunately, the inevitable OCR noise in these digitized materials challenges the performance of advanced NLP techniques, which are generally built for born-digital corpora. Moreover, the black-box NLP further makes it hard to understand the effects of OCR errors on NLP algorithms. This dissertation concentrates on the problem mentioned above, with a specific focus on the robustness of word embedding techniques such as word2vec, BERT, etc. for semantic encoding of OCR’d texts. We explore the problem through three interrelated parts of the studies. The first two parts compare various word embedding technologies to capture their latent characteristics on texts with OCR quality issues; Part I examines document-level encoding; Part II investigates sentence- and word-level encoding. Finally, the last part analyzes the effect of different levels of OCR noise on a specific word embedding methodology. Experimental results show that: (1) fine-tuned BERT outperforms pre-trained BERT when encoding OCR’d texts; (2) BERT-based dynamic embeddings are more sensitive to OCR errors than static embeddings in encoding words and sentences; (3) coarse-grained encoding (e.g., document-level) mitigates OCR noise interference on word embeddings, while fine-grained encoding (e.g., word-level) reduces the robustness of word embeddings to OCR noise; (4) OCR noise in unseen testing data can reduce embedding performance and downstream outcomes, while noise in the training corpus can benefit embedding robustness; and, (5) OCR noise does matter in scientific relation classification. Following our results, we recommend that scholars analyze their data with regard to both text granularity and data quality in training and testing corpora, in order to select the appropriate embedding tool for their analyses."
JCDL

A prototype Gutenberg-Hathitrust sentence-level parallel corpus for OCR error analysis: Pilot investigations Ming Jiang, Ryan C Dubnicek, Glen Worthey, Ted Underwood, and J Stephen Downie In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries [Abs]
"This exploratory study proposes a prototype sentence-level parallel corpus to support studying optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019[21] and GT4HistOCR[23], generally aligned content by artifact publishing characteristics such as documents or lines, which is limited to explore OCR noise concentrating on natural language granularity like sentences and chapters. Building upon an existing volume-aligned corpus that collected human-proofread texts from Project Gutenberg and paired OCR views from HathiTrust Digital Library, we extracted and aligned 167,079 sentences from 189 sampled books in four domains published from 1793 to 1984. To support downstream research on OCR quality, we conducted an analysis of OCR errors with a specific focus on their associations with the source text metadata. We found that sampled data in agriculture has a higher ratio of real-word errors than other domains, while sentences from social-science volumes contain more non-word errors. Besides, data sampled from early-age volumes tend to have a high ratio of non-word errors, while samples from recently-published volumes is likely to have more real-word errors. Following our findings, we suggest that scholars should consider the potential influence of source data characteristics on their findings in the study of OCR quality issues."
IJDL

Evaluating BERT-based scientific relation classifiers for scholarly knowledge graph construction on digital library collections Ming Jiang, Jennifer D’Souza, Sören Auer, and J Stephen Downie International Journal on Digital Libraries [Abs]
"The rapid growth of research publications has placed great demands on digital libraries (DL) for advanced information management technologies. To cater to these demands, techniques relying on knowledge-graph structures are being advocated. In such graph-based pipelines, inferring semantic relations between related scientific concepts is a crucial step. Recently, BERT-based pre-trained models have been popularly explored for automatic relation classification. Despite significant progress, most of them were evaluated in different scenarios, which limits their comparability. Furthermore, existing methods are primarily evaluated on clean texts, which ignores the digitization context of early scholarly publications in terms of machine scanning and optical character recognition (OCR). In such cases, the texts may contain OCR noise, in turn creating uncertainty about existing classifiers’ performances. To address these limitations, we started by creating OCR-noisy texts based on three clean corpora. Given these parallel corpora, we conducted a thorough empirical evaluation of eight BERT-based classification models by focusing on three factors: (1) BERT variants; (2) classification strategies; and, (3) OCR noise impacts. Experiments on clean data show that the domain-specific pre-trained BERT is the best variant to identify scientific relations. The strategy of predicting a single relation each time outperforms the one simultaneously identifying multiple relations in general. The optimal classifier’s performance can decline by around 10% to 20% in F-score on the noisy corpora. Insights discussed in this study can help DL stakeholders select techniques for building optimal knowledge-graph-based systems."

2021

CHR

Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C Dubnicek, Ted Underwood, and J Stephen Downie In Proceedings of the Second Conference on Computational Humanities Research [Abs]
"Digital humanities (DH) scholars have been increasingly interested in using BERT for document representation in computational text analysis. However, most word embeddings, including BERT embeddings, have been developed using “clean” corpora, while DH research is usually based on digitized texts with optical character recognition (OCR) errors. Will these errors introduced by the digitization process reduce BERT’s performance and distort the research findings? To shed light on the impact of OCR quality on BERT models, we conducted an empirical study on the resilience of BERT embeddings (pre-trained and fine-tuned) to OCR errors by measuring BERT’s ability to enable classification of book excerpts by subject domain. We developed specialized parallel corpora for this task consisting of matching pairs of OCR’d text (19,049 volumes) and “clean” re-keyed text (4,660 volumes) from English-language books in six domains published from 1780 to 1993. This study is the first to systematically quantify OCR impact on contextualized word embedding techniques with a use case of OCR’d book datasets curated by digital libraries (DL). Experimental results show that pre-trained BERT is less robust when used on OCR’d texts; however, fine-tuning pre-trained BERT on OCR’d texts significantly improves its resilience to OCR noise in classification tasks according to the changes of classifier performance. These findings should assist DH scholars who are interested in using BERT for scholarly purposes."
JCDL

Evaluating BERT’s Encoding of Intrinsic Semantic Features of OCR’d Digital Library Collections Ming* Jiang, Yuerong* Hu, Glen Worthey, Ryan C Dubnicek, Ted Underwood, and J Stephen Downie In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (Poster)
iConference

The Gutenberg-HathiTrust Pallel Corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C Dubnicek, Boris Capitanu, Deren Kudeki, and J Stephen Downie iConference 2021 (Poster) [Abs]
"This paper proposes large-scale parallel corpora of Englishlanguage publications for exploring the effects of optical character recognition (OCR) errors in the scanned text of digitized library collections on various corpus-based research. We collected data from: (1) Project Gutenberg (Gutenberg) for a human-proofread clean corpus; and, (2) HathiTrust Digital Library (HathiTrust) for an uncorrected OCR-impacted corpus. Our data is parallel regarding the content. So far as we know, this is the first large-scale benchmark dataset intended to evaluate the effects of text noise in digital libraries. In total, we collected and aligned 19,049 pairs of uncorrected OCR-impacted and human-proofread books in six domains published from 1780 to 1993."

2020

ICADL

Improving Scholarly Knowledge Representation: Evaluating BERT-based Models for Scientific Relation Classification Runner-up of Best Student Paper Award Ming Jiang, Jennifer D’Souza, Sören Auer, and J Stephen Downie In Proceedings of the 22nd International Conference on Asia-Pacific Digital Libraries (ICADL 2020)
ASIS&T

Targeting precision: A Hybrid Scientific Relation Extraction Pipeline for Improved Scholarly Knowledge Organization Ming Jiang, Jennifer D’Souza, Sören Auer, and J Stephen Downie In Proceedings of the 83rd Annual Meeting of the Association for Informatin Science and Technology (ASIS&T 2020)
JCDL

Improving Digital Libraries’ Provision of Digital Humanties Datasets: A Case Study of HTRC Literature Dataset Yuerong Hu, Ming Jiang, Ted Underwood, and J Stephen Downie In 2020 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020) [Abs]
"This paper investigates the limitations and challenges of the curated datasets provided by digital libraries in support of digital humanities (DH) research. Our presented work provides a use case utilizing an English literature dataset of 178,381 volumes curated by the HathiTrust Research Center (HTRC) for measuring the change of three literature genres. These volumes were selected from over 17 million digitized items in the HathiTrust Digital Library. We demonstrate our methods and workflow for improving the representativeness and scholarly usability of the existing datasets. We analyzed and effectively overcame three common limitations: duplicate volumes, uneven distribution of data and OCR errors. We suggest that stakeholders of digital libraries should flag and address these limitations to improve their provisions’ usability in the context of digital humanities research."

2019

EMNLP

TIGEr: Text-to-Image Grounding for Image Caption Evaluation Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019) [Abs] [Code]
This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr allows to evaluate caption quality not only based on how well a caption represents image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr has a higher consistency with human judgments than alternative existing metrics. We also comprehensively assess the metric’s effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.
EMNLP

REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning Ming Jiang, Junjie Hu, Qiuyuan Huang, Lei Zhang, Jana Diesner, and Jianfeng Gao In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019) [Abs]
Popular metrics used for evaluating image captioning systems, such as BLEU and CIDEr, provide a single score to gauge the system’s overall effectiveness. This score is often not informative enough to indicate what specific errors are made by a given system. In this study, we present a fine-grained evaluation method REO for automatically measuring the performance of image captioning systems. REO assesses the quality of captions from three perspectives: 1) Relevance to the ground truth, 2) Extraness of the content that is irrelevant to the ground truth, and 3) Omission of the elements in the images and human references. Experiments on three benchmark datasets demonstrate that our method achieves a higher consistency with human judgments and provides more intuitive evaluation results than alternative metrics.
TextGraphs

A Constituency Parsing Tree based Method for Relation Extraction from Abstracts of Scholarly Publications Ming Jiang, and Jana Diesner In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing at EMNLP-IJCNLP 2019 [Abs]
We present a simple, rule-based method for extracting entity networks from the abstracts of scientific literature. By taking advantage of selected syntactic features of constituent parsing trees, our method automatically extracts and constructs graphs in which nodes represent text-based entities (in this case, noun phrases) and their relationships (in this case, verb phrases or preposition phrases). We use two benchmark datasets for evaluation and compare with previously presented results for these data. Our evaluation results show that the proposed method leads to accuracy rates that are comparable to or exceed the results achieved with state-of-the-art, learning-based methods in several cases.

2018

SUNBELT

Reliable Construction of Semantic Networks based on Text Data and Measurement of Effects in Text-based Networks Jana Diesner, Ming Jiang, and Siva Ratna Kumari Narisetti International Network for Social Network Analysis (SUNBELT 2018)

2016

COLING

Says Who...? Identification of Expert versus Layman Critics’ Reviews of Documentary Films Ming Jiang, and Jana Diesner In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING 2016) [Abs]
We extend classic review mining work by building a binary classifier that predicts whether a review of a documentary film was written by an expert or a layman with 90.70% accuracy (F1 score), and compare the characteristics of the predicted classes. A variety of standard lexical and syntactic features was used for this supervised learning task. Our results suggest that experts write comparatively lengthier and more detailed reviews that feature more complex grammar and a higher diversity in their vocabulary. Layman reviews are more subjective and contextualized in peoples’ everyday lives. Our error analysis shows that laymen are about twice as likely to be mistaken as experts than vice versa. We argue that the type of author might be a useful new feature for improving the accuracy of predicting the rating, helpfulness and authenticity of reviews. Finally, the outcomes of this work might help researchers and practitioners in the field of impact assessment to gain a more fine-grained understanding of the perception of different types of media consumers and reviewers of a topic, genre or information product.
Hypertext

Issue-focused Documentaries Versus Other Films: Rating and Type Prediction based on User-authored Reviews Ming Jiang, and Jana Diesner In Proceedings of the 27th ACM Conference on Hypertext and Social Media (Hypertext 2016) [Abs]
User-authored reviews offer a window into micro-level engagement with issue-focused documentary films, which is a critical yet insufficiently understood topic in media impact assessment. Based on our data, features, and supervised learning method, we find that ratings of non-documentary (feature film) reviews can be predicted with higher accuracy (73.67%, F1 score) than ratings of documentary reviews (68.05%). We also constructed a classifier that separates reviews of documentaries from reviews of feature films with an accuracy of 71.32%. However, as our goal with this paper is not to improve the accuracy of predicting the rating and type or genre of film reviews, but to advance our understanding of the perception of documentaries in comparison to feature films, we also identified commonalities and differences between these two types of films as well as between low versus high ratings. We find that in contrast to reviews of feature films, comments on documentaries are shorter but composed of longer sentences, are less emotional, contain less positive and more negative terms, are lexically more concise, and are more focused on verbs than on nouns and adjectives. Compared to low-rated reviews, comments with a high rating are shorter, are more emotional and contain more positive than negative sentiment, and have less question marks and more exclamation points. Overall, this work contributes to advancing our understanding of the impact of different types of information products on individual information consumers.
iConference

Assessing Public Awareness of Social Justice Documentary Films based on News Coverage Versus Social Media Jana Diesner, Rezvaneh Rezapour, and Ming Jiang In iConference 2016 [Abs]
The comprehensive measurement of the impact that information products have on individuals, groups and society is of practical relevance to many actors, including philanthropic funding organizations. In this paper we focus on assessing one dimension of impact, namely public awareness, which we conceptualize as the amount and substance of attention that information products gain from the press and social media. We are looking at a type of products that philanthropic organizations fund, namely social justice documentaries. Using topic modeling as a text summarization technique, we find that films from certain domains, such as “Politics and Government” and “Environment and Nature,” attract more attention than productions on others, such as “Gender and Ethnicity”. We also observe that film-related public discourse on social media (Facebook and non-expert reviews) has a higher overlap with the content of a film than press coverage of films does. This is partially due to the fact that social media users focus more on the topics of a production whereas the press pays strong attention to cinematographic and related features.

2013

Geomatica

Pedestrian Navigation Services: Challenges and Current Trends Hassan A Karimi, Ming Jiang, and Rui Zhu Geomatica 2013 [Abs]
"With the success and popularity of vehicle navigation services, the demand for Pedestrian Navigation Services (PNS) has increased in recent years. PNS, while overlap in functionality with vehicle navigation services, must be designed specifically for the wayfinding and navigational needs and preferences of pedestrians. One major shortcoming of most existing PNS in outdoors is that they utilize and provide services based on road networks, resulting in PNS that do not effectively and properly track pedestrians as they usually walk on sidewalks, which have more segments and are narrower than roads. Challenges in building PNS include constructing appropriate sidewalk networks, continually tracking users in real time on sidewalks without interruption, and providing personalized routes as well as directions. In this paper, these challenges are highlighted and current trends in PNS, for both outdoors and indoors, are discussed and analyzed. A prototype PNS designed for the University of Pittsburg’s main campus sidewalk network (PNS-Pitt) is also discussed."
SOCA

A Novel Information Search and Recommendation Services Platform based on an Indexing Network Xiaodong Deng, Ming Jiang, Haichun Sun, Yangjie Zhang, Junjun Liu, Yu Guo, Xin Wang, Dajie Ge, Pengwei Wang, Zhijun Ding, and others In 2013 IEEE 6th International Conference on Service-Oriented Computing and Applications (SOCA 2013) [Abs]
With the rapid development of Internet technology, information resources on the Internet become more abundant, but also bring some problems like diversity, heterogeneity, disorder, and redundancy. Given a brief expression like search keywords only, users’ needs are ambiguous. Therefore, current technologies of search applications relying on direct keyword matching cannot meet the requirements of users exactly. Service applications are hoped to be more intelligent and knowledgeable. To solve such challenge, this paper attempts to organize web pages into a semantic association graph based on a novel model - indexing network, which can provide more valuable information and services for users. An information search and recommendation services platform is implemented using cloud distributed systems (Hadoop + Habse + Zookeeper) based on Sugon-Tongji cloud platform that is located at Tongji University. 70 million web pages are crawled on the Internet and their corresponding indexing network model are constructed. Several novel services are implemented and provided on the platform, such as search location, search navigation, category/keyword recommendation, and the interaction interface of the indexing network. System demonstration and experimental analysis show that the proposed platform can provide and support more knowledgeable and valuable information search-related services, thereby better meeting the growing needs of Internet users.
TREC

PITT at TREC 2013 Contextual Suggestion Track. Ming Jiang, and Daqing He In TREC 2013 [Abs]
This paper reports the IRIS Lab@Pitt’s participation to 2013 TREC Contextual Suggestion track, which focuses on technology and issues related to location-based recommender systems (LBRSs). Besides the data provided by the track, our recommendation algorithms also retrieve information from Yelp for creating candidate, example and user profiles. Our algorithms uses linear regression model to combine multiple attributes of candidate profiles into the calculation, and performed 5-fold cross validation for training and testing on 2012 track data. The two runs we submitted this year both obtained reasonable good performance comparing with the median results of all runs.