"Historical textual collections, digitized by machine scanning and optical character recognition (OCR), offer unique opportunities for exploring and disseminating heritage knowledge. Research innovations in this field, including recent advances in natural language processing (NLP), have been widely promoted as promising new tools for supporting research on these collections. Unfortunately, the inevitable OCR noise in these digitized materials challenges the performance of advanced NLP techniques, which are generally built for born-digital corpora. Moreover, the black-box NLP further makes it hard to understand the effects of OCR errors on NLP algorithms. This dissertation concentrates on the problem mentioned above, with a specific focus on the robustness of word embedding techniques such as word2vec, BERT, etc. for semantic encoding of OCR’d texts. We explore the problem through three interrelated parts of the studies. The first two parts compare various word embedding technologies to capture their latent characteristics on texts with OCR quality issues; Part I examines document-level encoding; Part II investigates sentence- and word-level encoding. Finally, the last part analyzes the effect of different levels of OCR noise on a specific word embedding methodology. Experimental results show that: (1) fine-tuned BERT outperforms pre-trained BERT when encoding OCR’d texts; (2) BERT-based dynamic embeddings are more sensitive to OCR errors than static embeddings in encoding words and sentences; (3) coarse-grained encoding (e.g., document-level) mitigates OCR noise interference on word embeddings, while fine-grained encoding (e.g., word-level) reduces the robustness of word embeddings to OCR noise; (4) OCR noise in unseen testing data can reduce embedding performance and downstream outcomes, while noise in the training corpus can benefit embedding robustness; and, (5) OCR noise does matter in scientific relation classification. Following our results, we recommend that scholars analyze their data with regard to both text granularity and data quality in training and testing corpora, in order to select the appropriate embedding tool for their analyses."