Text mining workshop with Elena Knudsen

Elena Knudsen

In this workshop, Elena will introduce text mining using R, its applications, and how the Tidy Text format can be used to handle data more efficiently.

Text mining is a method of data analysis that extracts information from unstructured text. It takes a body of text, such as novels, articles, or historical documents, and applies statistical techniques to draw conclusions about that sample. Text mining relies on Natural Language Processing (NLP) to break down and interpret large amounts of text. NLP typically covers tasks such as speech recognition, text classification, and natural-language understanding and generation. Text mining focuses specifically on the frequency and association of terms, not necessarily on the meaning behind them.

The Tidy Text format gives text samples a structure that makes analysis easier. Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. The Tidy Text format is therefore a table with one token per row. A token is a meaningful unit of text, such as a word, that we want to use for analysis. In tidy text mining, the token stored in each row is most often a single word, but it can also be an n-gram, a sentence, or a paragraph.
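To make the one-token-per-row idea concrete, here is a minimal sketch in R. It assumes the dplyr and tidytext packages are installed; the two sample lines and the names text_df and tidy_words are made up for illustration and are not taken from the workshop materials.

```r
library(dplyr)
library(tidytext)

# Two made-up lines of text, one row per line (illustrative data only)
text_df <- tibble(
  line = 1:2,
  text = c("The quick brown fox", "jumps over the lazy dog")
)

# unnest_tokens() splits the `text` column so that each row holds one token.
# By default the token is a single word, converted to lowercase.
tidy_words <- text_df %>%
  unnest_tokens(output = word, input = text)

# The result keeps the `line` column and adds a `word` column
print(tidy_words)
```

The same function can produce other token types. For example, passing token = "ngrams" and n = 2 would give one two-word sequence per row instead of one word per row.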

By keeping the input and output in tidy tables, users can apply a variety of R packages to the same data. R is a programming language that is often used for statistical computing and graphical presentation. This workshop will introduce the basics of these concepts and how they connect. Elena will also show how these tools can be applied and why they are useful.
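As a rough illustration of how tidy tables let different packages work on the same data, the sketch below tokenizes text with tidytext and then counts term frequencies with dplyr. The sample sentences and the names docs and word_counts are hypothetical, and this is only one possible pipeline, not necessarily the one used in the workshop.

```r
library(dplyr)
library(tidytext)

# Made-up sample documents (illustrative data only)
docs <- tibble(
  doc = c("a", "b"),
  text = c("Whales swim in the sea. The sea is deep.",
           "The ship sailed the sea.")
)

word_counts <- docs %>%
  unnest_tokens(word, text) %>%           # tidytext: one word per row
  anti_join(stop_words, by = "word") %>%  # dplyr: drop common stop words such as "the"
  count(word, sort = TRUE)                # dplyr: term frequency, most frequent first

print(word_counts)
```

Because every step takes a tidy table and returns a tidy table, the output could be passed straight on to another package, such as ggplot2 for plotting, without reshaping it.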
