Chapter 6

Latent Dirichlet Allocation

Extracting Topics from Software Engineering Data

Joshua Charles Campbell*; Abram Hindle*; Eleni Stroulia*    * Department of Computing Science, University of Alberta, Edmonton, AB, Canada

Abstract

Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories.

This chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Next, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, we discuss the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.

Keywords

Latent Dirichlet allocation

Topic modeling

Software engineering

Tutorial

6.1 Introduction

Whether they consist of code, bug/issue reports, mailing-list messages, requirements specifications, or documentation, software repositories include text documents. This textual information is an invaluable source of information, and can potentially be used in a variety of software-engineering activities. The textual descriptions of bugs can be compared against each other to recognize duplicate bug reports. The textual documentations of software modules can be used to recommend relevant source-code fragments. E-mail messages can be analyzed to better understand the skills and roles of software developers, and to recognize their concerns about the project status and their sentiments about their work and teammates. This is why we have recently witnessed numerous research projects investigating the application of text-analysis methods to software text assets.

The simplest approach to analyzing textual documents is to use a vector-space model, which views documents (and queries) as frequency vectors of words: "the" occurred once, "my" occurred twice, "bagel" occurred zero times, and so on. Effectively, a vector-space model views terms as dimensions in a high-dimensional space, so that each document is represented by a point in that space on the basis of the frequency of terms it includes. This model suffers from two major shortcomings. First, it makes the consideration of all words impractical: since each word is a dimension, considering all words would imply expensive computations in a very high-dimensional space. Second, it assumes that all words are independent. In response to these two shortcomings, methods for extracting topic models (that is, thematic topics corresponding to related bags of words) were developed.
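To make the vector-space view concrete, here is a minimal Python sketch (the function name and example are ours, not part of the chapter's tutorial code) that turns a document into a frequency vector over a fixed vocabulary:

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "my", "bagel"]
print(term_frequency_vector("The bagel is on my plate, near my cup.", vocabulary))
# [1, 2, 1]
```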

A thematic topic is a collection of words which are somehow related. For example, a topic might consist of the words “inning,” “home,” “batter,” “strike,” “catcher,” “foul,” and “pitcher,” which are all related to the game of baseball. The most well known topic-model methods are latent semantic indexing (LSI) and latent Dirichlet allocation (LDA). LSI employs singular-value decomposition to describe relationships between terms and concepts as a matrix. LDA arranges and rearranges words into buckets, which represent topics, until it estimates that it has found the most likely arrangement. In the end, having identified the topics relevant to a document collection (as sets of related words) LDA associates each document in the subject collection with a weighted list of topics.

LDA has recently emerged as the method of choice for working with large collections of text documents. There is a wealth of publications reporting its applications in a variety of text-analysis tasks in general and software engineering in particular. LDA can be used to summarize, cluster, link, and preprocess large collections of data because it produces a weighted list of topics for every document in a collection dependent on the properties of the whole. These lists can then be compared, counted, clustered, paired, or fed into more advanced algorithms. Furthermore, each topic comprises a weighted list of words which can be used in summaries.

This chapter provides an overview of LDA and its relevance to analyzing textual software-engineering data. First, in Section 6.2, we review the mining-software-repositories literature for example applications of LDA. In Section 6.3, we discuss the mathematical model underlying LDA. In Section 6.4, we present a tutorial on how to use state-of-the-art software tools to generate an LDA model of a software-engineering corpus. In Section 6.5, we discuss some typical pitfalls encountered when using LDA. Finally, in Section 6.6, we conclude with a summary of the important points one must be aware of when considering using this method.

6.2 Applications of LDA in Software Analysis

The LDA method was originally formulated by Blei et al. [1], and it soon became quite popular within the software-engineering community. LDA’s popularity comes from the variety of its potential applications.

LDA excels at feature reduction, and can be employed as a preprocessing step for other models, such as machine learning algorithms. LDA can also be used to augment the inputs to machine learning and clustering algorithms by producing additional features from documents. One example of this type of LDA usage was described by Wang and Wong [2], who employed it in a recommender system. Similarly, labeled LDA can be used to create vectors of independent features from arbitrary feature sets such as tags.

An important use of LDA is linking software artifacts. There are many instances of such artifact-linking applications, such as measuring coupling between code modules [3] and matching code modules with natural-language requirement specifications [4] for traceability purposes. Asuncion et al. [5] applied LDA to textual documentation and source-code modules and used the topic-document matrix to indicate traceability between the two. Thomas et al. [6] focused on yet another traceability problem, linking e-mail messages to source-code modules. Gethers et al. [7] investigated the effectiveness of LDA for traceability-link recovery. They combined information retrieval techniques, including the Jensen-Shannon model, the vector space model, and the relational topic model using LDA. They concluded that each technique had its positives and negatives, yet the integration of the methods tended to produce the best results. Working within the information retrieval tradition, Savage et al. [8], Poshyvanyk [9], and McMillan et al. [10] have explored the use of information retrieval techniques such as LSI [11] and LDA to recover software traceability links in source code and other documents. For a general literature survey of traceability techniques (including LDA), the interested reader should refer to De Lucia et al. [12].

Baldi et al. [13] labeled LDA-extracted topics and compared them with aspects in software development. Baldi et al. claim that some topics do map to aspects such as security, logging, and cross-cutting concerns, which was somewhat corroborated by Hindle et al. [14].

Clustering is frequently used to compare and identify (dis)similar documents and code, or to quantify the overlap between two sets of documents. Clustering algorithms can be applied to the topic probability vectors produced by LDA. LDA has been used in a clustering context for issue-report querying and for deduplication. Lukins et al. [15] applied LDA topic analysis to issue reports, leveraging LDA inference to infer whether queries, topics, and issue reports were related to each other. Alipour et al. [16] leveraged LDA topics as contextual features for issue-report deduplication, and found that they added useful contextual information to issue/bug deduplication. Campbell et al. [17] used LDA to examine the coverage of popular project documentation by applying LDA to two collections of documents at once, user questions and project documentation, and then clustering and comparing the LDA output.

Often LDA is used to summarize the contents of large datasets. This is done by manually or automatically labeling the most popular topics produced by unlabeled LDA. Labeled LDA can be used to track specific features over time—for example, to measure the fragmentation of a software ecosystem as in Han et al. [18].

Even though LDA topics are assumed to be implicit and not observable, there is substantial work on assessing the interpretability of those summaries by developers. Labeling software artifacts using LDA was investigated by De Lucia et al. [19]. By using multiple information retrieval approaches such as LDA and LSI, they labeled and summarized source code and compared it against human-generated labels. Hindle et al. [20] investigated if developers could interpret and label LDA topics. They reported limited success, with 50% being successfully labeled by the developers, and that nonexperts tend to do poorly at labeling topics on systems they have not dealt with.

Finally, there has been some research on the appropriate choice of LDA parameters and hyperparameters: α, β, and the number of topics K. Grant and Cordy [21] were concerned with choosing K. Panichella et al. [22] proposed LDA-GA, a genetic-algorithm approach to searching for appropriate LDA parameters and hyperparameters. LDA-GA needs an evaluation measure, so Panichella et al. used software-engineering-specific tasks that allowed their genetic algorithm to optimize the number of topics for cost-effectiveness.

6.3 How LDA Works

The input of LDA is a collection of documents and a few parameters. The output is a probabilistic model describing (a) how strongly each word is associated with each topic and (b) how strongly each topic is associated with each document. Under this model, a document is generated by first drawing, on the basis of (b), a random list of topics, often containing some topics repeatedly, with the same length as the number of words in the document. That list of topics is then transformed into a list of words by turning each topic into a word on the basis of (a).

LDA is a generative model. This means that it works with the probability of observing a particular dataset given some assumptions about how that dataset was created. At its core is the assumption that a document is generated by a small number of “topics.” An LDA “topic” is a probability distribution, assigning to each word in the collection vocabulary a probability.

Topics are considered hypothetical and unobservable, which is to say that they do not actually exist in documents. This means that, first, we know that documents are not actually generated from a set of topics. Instead, we are using the concept of a topic as a simplified model for what must be a more complicated process, the process of writing a document. Second, documents do not come with information about what topics are present, what words those topics contain, and how much of each topic is in each document. Therefore, we must infer the topic characteristics from a collection of collections of words. Each document is usually assumed to be generated by a few of the total number of possible topics. So, every word in every document is assumed to be attributable to one of the document’s topics.

Though, of course, words do have a particular order in a document, LDA does not consider their order. Each word is assigned an individual probability of being generated. That is, the probability of a topic k generating a word v is a value ϕ_{k,v} [23]. The sum of these probabilities over the vocabulary for a topic k must be 1:

$$\sum_{v} \phi_{k,v} = 1.$$

Furthermore, the ϕ_{k,v} values are assumed to come from a random variable ϕ_k with a symmetric Dirichlet distribution (which is the origin of the name of the LDA method). The symmetric Dirichlet distribution has one parameter, β, which determines whether a topic is narrow (i.e., focuses on a few words) or broad (i.e., covers a bigger spread of words). If β is 1, every possible word distribution is equally likely a priori: a topic that generates a given word often is just as probable as a topic that generates it rarely. If β is less than 1, most words will be extremely unlikely, while a few will make up the majority of the words generated. In other words, larger values of β lead to broad topics, and smaller values of β lead to narrow topics.

In summary, words are assumed to come from topics, with the probability of each word under a given topic itself drawn from a Dirichlet distribution. Thus, if we know that β is a very small positive number, we know that a topic which would generate every word with equal probability is itself very unlikely to exist.
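The effect of β can be seen directly by sampling topic-word distributions from a symmetric Dirichlet prior. Below is a minimal numpy sketch (our own illustration, not part of the chapter's tutorial code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size = 10

# A symmetric Dirichlet prior repeats one concentration value, beta,
# for every word in the vocabulary.
for beta in (0.01, 1.0, 10.0):
    phi_k = rng.dirichlet(np.full(vocabulary_size, beta))
    print(f"beta={beta:<5} phi_k={np.round(phi_k, 2)}")
# Small beta: nearly all mass falls on one or two words (a narrow topic).
# Large beta: mass spreads over many words (a broad topic).
```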

The “document” is an important concept involved in understanding how LDA works. A document is also a probability distribution: every possible topic has a probability of occurring between 0 and 1, and the sum of these probabilities is 1. The probability that a document d will generate a word from a specific topic k is θ_{d,k}. Again, the θ_{d,k} values are probabilities, but the probability of θ_{d,k} taking on a particular value comes from a Dirichlet distribution with parameter α. The topics that a document is observed to generate form a vector, Z_d. This vector is N_d words long, representing every word in the document. If α is near 1, we expect to see documents with few topics and documents with many topics in equal proportion. If α is less than 1, we expect most documents to use only a few topics. If α is greater than 1, we expect most documents to use almost every topic.

To summarize, words come from topics. The probability of a word being generated by a specific topic comes from a symmetric Dirichlet distribution. The probability of a document containing a word from a specific topic is dictated by a different symmetric Dirichlet distribution. The words that a document is observed to generate form a vector, W_d, which is obtained by observing the topic indicated by each entry of the Z_d vector.

Figure 6.1 shows the words present in each document coming from topic 1 in nine different LDA models of varying parameters. Additionally, it shows α, β, θ, ϕ, Z, and W. By following the arrows, we can see how each prior generates each observed posterior. Figure 6.1 shows the effects that α has on θ, that θ has on Z, and that Z has on W. As we increase α, we observe that more documents contain words from topic 1. Additionally, it shows the effect that β has on ϕ and that ϕ has on W. As we increase β, we end up with topic 1 including a larger variety of vocabulary words. For example, the plot in column 1 and row 1 in Figure 6.1 shows that each document uses a much smaller subset of the possible vocabulary than the plot in column 1 and row 3 below it. Similarly, the plot in column 3 and row 1 shows that many more documents contain words from topic 1 than the plot in column 1 and row 1.

Figure 6.1 Nine example LDA models produced by varying α and β. Arrows show the relationship between prior and posterior.

The LDA process consists in allocating and reallocating weights (or probabilities if normalized) in θ and ϕ until the lower bound of the total probability of observing the input documents is maximized. Conceptually, this is accomplished by dividing up topics among words and by dividing up documents among topics. This iterative process can be implemented in many different ways.

Technically, the generative process LDA assumes is as follows, given a corpus of M documents, each of length N_d [1]:

1. For every topic $k \in \{1,\ldots,K\}$, choose $\phi_k \sim \mathrm{Dir}(\beta)$.

2. For every document $d \in \{1,\ldots,M\}$:

(a) choose $\theta_d \sim \mathrm{Dir}(\alpha)$;

(b) for every word $j \in \{1,\ldots,N_d\}$ in document d:

i. choose a topic $z_{d,j} \sim \mathrm{Multinomial}(\theta_d)$;

ii. choose a word $w_{d,j} \sim \mathrm{Multinomial}(\phi_{z_{d,j}})$.
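Because the process is so short, it can be executed directly. The following numpy sketch (the function name and toy corpus sizes are our own, and for simplicity every document has the same length N) samples a toy corpus exactly as steps 1 and 2 prescribe:

```python
import numpy as np

def generate_corpus(M, N, K, V, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process (steps 1-2 above)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)     # step 1: one word distribution per topic
    corpus = []
    for d in range(M):
        theta_d = rng.dirichlet(np.full(K, alpha))    # step 2(a): topic mixture for document d
        z_d = rng.choice(K, size=N, p=theta_d)        # step 2(b)i: a topic for each word slot
        w_d = [rng.choice(V, p=phi[z]) for z in z_d]  # step 2(b)ii: a word from each chosen topic
        corpus.append(w_d)
    return corpus

print(generate_corpus(M=3, N=8, K=2, V=5, alpha=0.1, beta=0.1))
```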

Thus, the probability of observing word v at position j in document d, $p(w_{d,j} = v \mid \alpha, \beta, K)$, is

$$\int_{\Theta} \sum_{k=1}^{K} \int_{\Phi} p(w_{d,j} = v \mid \phi_k)\, p(z_{d,j} = k \mid \theta_d)\, p(\phi_k \mid \beta)\, p(\theta_d \mid \alpha)\, \mathrm{d}\phi_k\, \mathrm{d}\theta_d,$$

integrating over all possible probability vectors of length K (Θ) and of length V (Φ). The goal of LDA software is to maximize the probability

$$p(\theta, Z \mid W, \alpha, \beta, K)$$

by choosing θ and Z given a corpus W and parameters α and β. Unfortunately, this problem is intractable [1], so the values of θ and ϕ that maximize the above probability are estimated by LDA software. The exact technique employed to estimate the maximum differs between different pieces of software.

6.4 LDA Tutorial

In this tutorial, we illustrate the LDA method in the context of analyzing textual data extracted from the issue-tracking system of a popular project.

1. The first task involves acquiring the issue-tracker data and representing it in a convenient format such as JavaScript Object Notation (JSON).

2. Then we transform the text of the input data—namely, we convert the text to word counts, where words are represented as integer IDs.

3. We apply LDA software on the transformed documents to produce a topic-document matrix and a topic-word matrix.

4. We then summarize the top words from the topic-word matrix to produce topic-word summaries, and store the topic-document matrix.

5. Finally, we analyze the document matrix and the topics. The objective of this analysis step is to (a) examine the latent topics discovered, (b) plot the topic relevance over time, and (c) cluster the issues (i.e., input documents) according to their associated topics.

6.4.1 Materials

This tutorial will use source code that the authors have developed to run LDA on issues collected from issue-tracking systems. For the sake of simplicity, the authors have provided a configured 64-bit x86 Ubuntu virtual machine for VirtualBox with all the software and appropriate data already loaded and available. The file is called LDA-Tutorial.ova, and can be downloaded from http://archive.org/details/LDAinSETutorial/ and https://archive.org/29/items/LDAinSETutorial/.

Download the ova file and import it into VirtualBox. Alternatively, use VirtualBox to export it to a raw file to write directly to a USB stick or hard drive. The username and the password of this virtual image are tutorial. On boot-up the virtual image will open to an Lubuntu 14.04 desktop. The source code for this tutorial is located in the /home/tutorial/lda-chapter-tutorial directory, which is also linked to from the desktop.

To access and browse the source code of the tutorial, visit http://bitbucket.org/abram/lda-chapter-tutorial/ and git clone that project, or download a zip file of the tutorial data and source code from http://webdocs.cs.ualberta.ca/~hindle1/2014/lda-chapter-tutorial.zip. The data directory contains the issue-tracker data for the Bootstrap project. The important source code file is lda_from_json.py, which depends on lda.py. We use lda_from_json.py to apply the LDA algorithm, implemented by Vowpal Wabbit, to issue-tracker issues. It is highly recommended to use the virtual machine, as Vowpal Wabbit and other dependencies are already installed and configured.

6.4.2 Acquiring Software-Engineering Data

The data source for this tutorial is the issues and comments of the Bootstrap issue tracker. Bootstrap is a popular JavaScript-based website front-end framework that allows preprocessing, templating, and dynamic-content management of webpages. Bootstrap is a very active project, and its developer community regularly reports issues regarding Web browser compatibility and developer support. As of March 2014, Bootstrap had 13,182 issues in its issue tracker.

Our first task is to acquire the issue-tracker data for Bootstrap. To this end we have written a Github issue-tracker extractor that relies on the Github application programming interface (API) and the Ruby Octokit library. Our program github_issues_to_json.rb (included in the chapter tutorial repository) uses the Github API to download the issues and comments from the Github issue tracker. One must first sign up to Github as a registered user and provide the GHUSERNAME and GHPASSWORD in the config.json file in the root of the chapter repository. One can also specify GHUSER (the target Github user) and GHPROJECT (the target project to mirror) in config.json or as environment variables. github_issues_to_json.rb downloads every page of issues and issue comments, and saves this data to a JSON file, resembling the original format obtained from the Github API. The file created, large.json, contains both issues and comments, stored as a list of JSON objects (issues), each of which contains a list of comments. Mirroring Bootstrap takes a couple of minutes because of the thousands of issues and thousands of comments within Bootstrap's issue tracker.
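For readers who prefer Python to Ruby, a rough equivalent of the mirroring step is sketched below. Treat it as an assumption-laden illustration: the endpoint and parameters are the standard Github issues API, but unlike github_issues_to_json.rb it does not fetch the per-issue comments, and modern Github authentication prefers tokens over passwords.

```python
import json
import requests  # third-party: pip install requests

def fetch_issues(user, project, username, password):
    """Page through the Github issues API until an empty page is returned."""
    issues, page = [], 1
    while True:
        response = requests.get(
            f"https://api.github.com/repos/{user}/{project}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            auth=(username, password))  # credentials as in config.json
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break
        issues.extend(batch)
        page += 1
    return issues

# issues = fetch_issues("twbs", "bootstrap", GHUSERNAME, GHPASSWORD)
# with open("large.json", "w") as f:
#     json.dump(issues, f)
```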

Once we have downloaded the Bootstrap issues (and comments) into large.json, we need to load and prepare that data for LDA. Most LDA programs will require that documents are preprocessed.

6.4.3 Text Analysis and Data Transformation

In this section we will cover preprocessing the data for LDA. Generally, those who use LDA apply the following preprocessing steps:

- Loading text
- Transforming text
- Lexical analysis of text
- Optionally removing stop words
- Optionally stemming
- Optionally removing uncommon or very common words
- Building a vocabulary
- Transforming each text document into a word bag

6.4.3.1 Loading text

Loading the text is usually a matter of parsing a data file or querying a database where the text is stored. In this case, it is a JSON file containing Github issue-tracker API call results.

6.4.3.2 Transforming text

The next step is to transform the text into its final textual representation, the form the documents will take when given to LDA. Some text is structured and thus must be processed: perhaps section headers and other markup need to be removed, or, if the input text is raw HTML, perhaps one needs to strip the HTML tags before use. For the issue-tracker data we could include author names in the comments and in the issue description. This might allow for author-oriented topics, but it might also confuse later analysis when topics mention authors who were never named directly in the text. In this tutorial we have chosen to concatenate the title and the full description of each issue report, so that topics will have access to both fields.
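As a concrete illustration of loading and transforming, the sketch below reads large.json and concatenates each issue's title and description. The key names ("title", "body") are the Github API's spellings, and may differ if your dump is structured differently:

```python
import json

with open("large.json") as f:
    issues = json.load(f)  # a list of issue objects, as produced by the mirroring step

# Concatenate the title and the full description of each issue;
# "body" can be null in the Github API, hence the `or ""` guard.
documents = [(issue["title"] or "") + " " + (issue["body"] or "")
             for issue in issues]
```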

Many uses of LDA in software analysis include source code in the document texts. Using source code requires lexing, parsing, filtering, and often renaming values. When feeding source code to LDA, some users do not want comments, some do not want identifiers, some do not want keywords, and some want only identifiers. Thus, the task of converting documents to a textual representation is nontrivial, especially if documents are marked up.

6.4.3.3 Lexical analysis

The next step is the lexical analysis of the texts. We need to split the words or tokens out of the text in order to eventually count them.

With source code we apply lexical analysis, where one extracts tokens from source code in a fashion similar to how compilers perform lexical analysis before parsing.

With natural language text, words and punctuation are separated where appropriate. For example, some words, such as initialisms, contain periods, but most of the time a period indicates the end of a sentence and is not a part of the word. With texts about source code, it might be useful to have some tokens start with a period—for instance, if you are analyzing cascading style sheets (CSS) or texts with CSS snippets, where a period prefix indicates a CSS class.
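A minimal tokenizer along these lines might look as follows; the regular expression is our own choice, written so that a leading period survives in tokens such as CSS class names:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, keeping a leading period
    so CSS class names like ".navbar" survive as single tokens."""
    return re.findall(r"\.?[a-z][a-z0-9]*", text.lower())

print(tokenize("The .navbar class breaks in IE8."))
# ['the', '.navbar', 'class', 'breaks', 'in', 'ie8']
```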

6.4.3.4 Stop word removal

Often words appear in texts which are not useful in topic analysis. Such words are called stop words. It is common in natural language processing and information retrieval systems to filter out stop words before executing a query or building a model. Whether a word is considered a stop word or not depends on the analysis, but there are sets of common stop words available. Some users of natural language processing and LDA tools view terms such as “the,” “at,” and “a” as unnecessary, whereas other researchers, depending on the context, might view such determiners and prepositions as important. We have included stop_words, a text file that contains the words that we do not wish to include in topics in this tutorial. Each word extracted from a document is dropped if it appears in our stop word list.
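Applying the stop word list is a one-line filter. Here is a small sketch assuming, per the tutorial, a stop_words file with one word per line:

```python
with open("stop_words") as f:       # one stop word per line
    stop_words = set(line.strip() for line in f)

tokens = ["the", "navbar", "class", "breaks", "in", "ie8"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # e.g., ['navbar', 'class', 'breaks', 'ie8'] if "the" and "in" are stop words
```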

6.4.3.5 Stemming

Since words in languages such as English have multiple forms and tenses, it is common practice to stem words. Stemming is the process of reducing words to their original root. Stemming is optional and is often used to reduce vocabulary sizes. For instance, given the words “act,” “acting,” “acted,” and “acts,” the stem for all four words will be “act.” Thus, if a sentence contains any of the words, on stemming, we will resolve it to the same stem. Unfortunately, sometimes stemming reduces the semantic meaning of a term. For example, “acted” is in the past tense, but this information will be lost if the word is stemmed. Stemming is not always necessary.

Stemming software is readily available. NLTK comes with implementations of the Porter and Snowball stemmers. One caveat with stemmers is that they often produce word roots that are not words or that conflict with other words. Sometimes this leads to unreadable output from LDA unless one keeps the original documents and their original words.
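For example, with NLTK's Porter stemmer the "act" family from above collapses to a single stem; the last line illustrates the caveat that stems are not always real words:

```python
from nltk.stem import PorterStemmer  # pip install nltk; a Snowball stemmer is also available

porter = PorterStemmer()
print([porter.stem(w) for w in ["act", "acting", "acted", "acts"]])
# ['act', 'act', 'act', 'act']

print(porter.stem("happily"))  # 'happili': not a real word
```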

6.4.3.6 Common and uncommon word removal

Since LDA is often used to find topics shared across documents, it is common practice to filter out exceptionally common words and very infrequent words. Words that appear in only one document are often viewed as unimportant, because they cannot form a topic linking multiple documents. Unfortunately, if very infrequent words are left in, documents which do not contain such a word can still become associated with it via the topics that include it. Common words are often removed because they muddle topic summaries and make interpretation more difficult.
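One straightforward way to implement both filters is by document frequency; the thresholds below are illustrative choices of our own, not the tutorial's values:

```python
from collections import Counter

def filter_by_document_frequency(docs, min_docs=2, max_fraction=0.8):
    """Drop words that occur in fewer than min_docs documents or in more
    than max_fraction of all documents (docs is a list of token lists)."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    keep = {word for word, n in doc_freq.items()
            if min_docs <= n <= max_fraction * len(docs)}
    return [[w for w in doc if w in keep] for doc in docs]
```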

Once the documents have been preprocessed and prepared via lexing, filtering, and stemming, we can start indexing them for use as documents within an LDA implementation.

6.4.3.7 Building a vocabulary

In this tutorial, we use the Vowpal Wabbit software of Langford et al. [24]. Vowpal Wabbit accepts a sparse document-word matrix format in which each line represents a document and each element of the line is an integer word ID joined by a colon to that word's count within the document. We provide lda.py, found within the lda-chapter-tutorial directory, a program that converts text to Vowpal Wabbit's input format and parses its output format.

One difficulty encountered using LDA libraries and programs is that you often have to maintain your own vocabulary or dictionary. We also have to calculate and provide the size of the vocabulary to Vowpal Wabbit, expressed as a number of bits, log₂(|vocabulary|), rounded up (see the sketch below).
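To make the format concrete, the sketch below renders one document as an input line and computes the vocabulary size in bits. The leading "|" is Vowpal Wabbit's namespace marker as we understand its input format; in the tutorial these details are handled for you by lda.py:

```python
import math

def to_vw_line(word_counts):
    """Render one document as 'id:count' pairs in Vowpal Wabbit's sparse format."""
    pairs = " ".join(f"{word_id}:{count}"
                     for word_id, count in sorted(word_counts.items()))
    return "| " + pairs

print(to_vw_line({0: 2, 7: 1, 12: 3}))  # -> "| 0:2 7:1 12:3"

vocabulary_size = 5000
print(math.ceil(math.log2(vocabulary_size)))  # 13 bits suffice for 5000 words
```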

6.4.4 Applying LDA

We choose 20 for the number of topics, for the sake of reading and interpreting them. The appropriate number of topics depends on the intent behind the analysis. If one wants to use LDA for dimensionality reduction, keeping the number of topics low may be important. If one wants to cluster documents using LDA, a larger number of topics might be warranted. Conceptual-coupling analysis might be better served by many topics than by few.

We provide our parameters to Vowpal Wabbit: α set to 0.01, β set to 0.01 (called ρ in Vowpal Wabbit), and K, the number of topics. The value 0.01 is a common default for α and β in many pieces of LDA software. These parameters should be chosen on the basis of the desired breadth of documents and topics, respectively; a sketch of a full Vowpal Wabbit invocation follows the list below.

- If documents that discuss only a few topics and never mention all others are desired, α should be set small, to around 1/K. With this setting, almost all documents will almost never mention more than a few topics.

- Inversely, if documents that discuss almost every possible topic but focus on some more than others are desired, α should be set closer to 1. With this setting, almost all documents will discuss almost every topic, but not in equal proportions.

- Setting β is similar to setting α, except that β controls the breadth of words belonging to each topic.
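Putting the parameters together, an invocation might look like the sketch below. The flag spellings follow Vowpal Wabbit's LDA mode as we understand it (check vw --help for your version); in the tutorial, lda_from_json.py issues this call for you, so treat this as an illustration rather than the tutorial's exact command:

```python
import subprocess

subprocess.run([
    "vw",
    "--lda", "20",             # K: the number of topics
    "--lda_alpha", "0.01",     # document-topic Dirichlet parameter (alpha)
    "--lda_rho", "0.01",       # topic-word Dirichlet parameter (beta)
    "--lda_D", "13182",        # total number of documents
    "-b", "13",                # vocabulary size in bits: 2^13 >= |vocabulary|
    "--passes", "2",
    "--cache_file", "vw.cache",
    "-d", "documents.vw",      # input in the sparse id:count format shown earlier
    "-p", "predictions.txt",   # document-topic matrix output
    "--readable_model", "topics.txt",  # topic-word matrix output
], check=True)
```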

Vowpal Wabbit reads the input documents and parameters and outputs a document-topic matrix and a topic-word matrix. predictions-00.txt, where 00 is the number of topics, is a file containing the document-topic matrix: each line corresponds to one document, and each column holds that document's weight for one topic. If multiple passes are used, the last M lines of predictions-00.txt, where M is the number of documents, are the final predictions for the document-topic matrix. In the topic-word matrix output, the first token of each line is the word ID, and the remaining K tokens are that word's allocation for each topic (topics are columns).
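Reading the final document-topic matrix back is then a matter of keeping only the last M lines. A hedged sketch follows; the defensive slice on each line guards against a trailing example tag, which some vw configurations append:

```python
def read_document_topic_matrix(path, num_documents, num_topics):
    """Return the final document-topic weights from a vw predictions file.

    With multiple passes, only the last num_documents lines hold the final
    estimates; earlier lines are intermediate predictions."""
    with open(path) as f:
        lines = f.readlines()[-num_documents:]
    return [[float(x) for x in line.split()[:num_topics]] for line in lines]

theta = read_document_topic_matrix("predictions.txt",
                                   num_documents=13182, num_topics=20)
```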

6.4.5 LDA Output Summarization

Our program lda.py produces summary.json, a JSON summary of the top topic words for each topic extracted, ranked by weight. Two other JSON files are created, document_topic_matrix.json and document_topic_map.json. The first file (matrix) contains the documents and weights represented by JSON lists. The second file (map) contains the documents, represented by their IDs, mapped to lists of weights. document_topic_map.json thus contains both the original document ID and the document weights, whereas the matrix file uses indices as IDs. lda-topics.json is also produced; it lists the weights of the words associated with each topic, as lists of lists. lids.json is a list of document IDs in the order presented to Vowpal Wabbit, which is the order used in the document_topic_matrix.json file. dicts.json maps words to their integer IDs. You can download the JSON and comma-separated value (CSV) output of our particular run from https://archive.org/29/items/LDAinSETutorial/bootstrap-output.zip.

6.4.5.1 Document and topic analysis

Now that the topics have been extracted, let us take a look. Table 6.1 depicts the 20 topics extracted from the Bootstrap project issue tracker. The words shown are the top 10 ranking words from each topic, that is, the most heavily allocated words in the topic.

Table 6.1

The Top 10 Ranked Words of the 20 Topics Extracted from Bootstrap’s Issue-Tracker Issues

Topic No. | Top 10 Topic Words
1 | grey blazer cmd clipboard packagist webview kizer ytimg vi wrench
2 | lodash angular betelgeuse ree redirects codeload yamlish prototypejs deselect manufacturer
3 | color border background webkit image gradient white default rgba variables
4 | asp contrast andyl runat hyperlink consolidating negatory pygments teuthology ftbastler
5 | navbar class col css width table nav screen http span
6 | phantomjs enforcefocus jshintrc linting focusin network chcp phantom humans kevinknelson
7 | segmented typical dlabel signin blockquotes spotted hyphens tax jekyllrb hiccups
8 | modal input button form btn data http tooltip popover element
9 | dropdown issue chrome menu github https http firefox png browser
10 | zepto swipe floor chevy flipped threshold enhanced completeness identified cpu
11 | grid width row container columns fluid column min media responsive
12 | div class li href carousel ul data tab id tabs
13 | parent accordion heading gruntfile validator ad mapped errorclass validclass collapseone
14 | bootstrap github https css http js twitter docs pull don
15 | left rtl support direction location hash dir ltr languages offcanvas
16 | percentage el mistake smile spelling plnkr portuguese lokesh boew ascii
17 | font icon sm lg size xs md glyphicons icons glyphicon
18 | tgz cdn bootstrapcdn composer netdna libs yml host wamp cdnjs
19 | npm js npmjs lib http install bin error ruby node
20 | license org mit apache copyright xl cc spec gpl holder

Each topic is assigned a number by the LDA software; however, the order in which numbers are assigned is arbitrary and has no meaning. If you ran LDA again with different seeds or a different timing (depending on the implementation), you would get different topics, or similar topics in different orders. Nonetheless, we can see in Table 6.1 that many of these topics are related to the Bootstrap project. Topic summaries such as these are often your first canary in the coal mine: they give you some idea of the health of your LDA output. If they are full of random tokens and numbers, one might consider stripping such tokens out of the analysis. If we look at topic 20, we see a set of terms: license org mit apache copyright xl cc spec gpl holder. MIT, Apache, GPL, and CC are all copyright licenses, and all of these licenses have terms and require attribution. Perhaps documents related to topic 20 are about licensing. How do we verify whether topic 20 is about licensing or not?

Using the document-topic matrix, we can look at the documents that rank high for topic 20. We load the CSV file, document_topic_map.csv, or the JSON file, document_topic_map.json, with a spreadsheet program (LibreOffice is included with the virtual machine), R, or Python, and sort the data in descending order on the T20 (topic 20) column (a short pandas sketch after Table 6.3 shows one way to perform this sorting). Right at the top is issue 2054. Browsing large.json or visiting issue 2054 on Github, we can see that the subject of the issue is “Migrate to MIT License.” The next issues relate to licensing for image assets (#3942), JavaScript minification (unrelated, but still weighted heavily toward topic 20) (#3057), a PhantomJS error (#10811), and two licensing issues (#6342 and #966). Table 6.2 provides more details about these six issues. The LDA Python program also produces the file document_topic_map_norm.csv, in which the topic weights are normalized. Reviewing the top-weighted documents from the normalized CSV file reveals different issues, but four of the six top issues are still licensing relevant (#11785, #216, #855, and #10693 are licensing related, but #9987 and #12366 are not). Table 6.3 provides more details about these six issues from the normalized data.

Table 6.2

Issue Texts of the Top Documents Related to Topic 20 (Licensing) from document_topic_map.csv

https://github.com/twbs/bootstrap/issues/2054 (cweagans)
Migrate to MIT License
I’m wanting to include Bootstrap in a Drupal distribution that I’m working on. Because I’m using the Drupal.org packaging system, I cannot include Bootstrap because the APLv2 is not compatible with GPLv2 …

https://github.com/twbs/bootstrap/issues/3942 (justinshepard)
License for Glyphicons is unclear
The license terms for Glyphicons when used with Bootstrap needs to be clarified. For example, including a link to Glyphicons on every page in a prominent location isn’t possible or appropriate for some projects. …

https://github.com/twbs/bootstrap/issues/3057 (englishextra)
bootstrap-dropdown.js clearMenus() needs ; at the end
bootstrap-dropdown.js when minified with JSMin::minify produces error in Firefox error console saying clearMenus() needs ; …

https://github.com/twbs/bootstrap/issues/10811 (picomancer)
“PhantomJS must be installed locally” error running qunit:files task
I’m attempting to install Bootstrap in an LXC virtual machine, getting “PhantomJS must be installed locally” error. …

https://github.com/twbs/bootstrap/issues/6342 (mdo)
WIP: Bootstrap 3
While our last major version bump (2.0) was a complete rewrite of the docs, CSS, and JavaScript, the move to 3.0 is equally ambitious, but for a different reason: Bootstrap 3 will be mobile-first. …
MIT License is discussed.

https://github.com/twbs/bootstrap/issues/966 (andrijas)
Icons as font instead of img
Hi Any reason you opted to include image based icons in bootstrap which are limited to the 16px dimensions? For example http://somerandomdude.com/work/iconic/ is available as open source fonts—means you can include icons in headers, buttons of various size etc since its vector based. …
License of icons is discussed.


Table 6.3

Issue Texts of the Top Normalized Documents Related to Topic 20 (Licensing) from document_topic_map_norm.csv

https://github.com/twbs/bootstrap/pull/12366 (mdo)
Change a word
Blank + Documentation change

https://github.com/twbs/bootstrap/pull/9987 (cvrebert)
Change ‘else if’ to ‘else’
Blank + Provided a patch changing else if to else

https://github.com/twbs/bootstrap/pull/10693 (mdo)
Include a copy of the CC-BY 3.0 License that the docs are under
This adds a copy of the Creative Commons Attribution 3.0 Unported license to the repo. /cc @mdo

https://github.com/twbs/bootstrap/issues/855 (mistergiri)
Can i use bootstrap in my premium theme?
Can i use bootstrap in my premium cms theme and sell it?

https://github.com/twbs/bootstrap/issues/216 (caniszczyk)
Add CC BY license to documentation
At the moment, there’s no license associated with the bootstrap documentation. We should license it under CC BY as it’s as liberal as the software license (CC BY). …

https://github.com/twbs/bootstrap/issues/11785 (tlindig)
License in the README.md
At bottom of README.md is written: Copyright and license Copyright 2013 Twitter, Inc under the Apache 2.0 license. With 3.1 you switched to MIT. It looks like you forgott to update this part too.

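For reference, the sorting described above takes a couple of lines in pandas; we assume, per the text, that document_topic_map.csv has one column per topic named T1 through T20:

```python
import pandas as pd

doc_topic = pd.read_csv("document_topic_map.csv")

# The six issues most strongly allocated to topic 20 (licensing).
print(doc_topic.sort_values("T20", ascending=False).head(6))
```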

6.4.5.2 Visualization

Looking at the numbers and topics is not enough; usually we want to explore the data visually to tease out interesting information. One can use simple tools such as spreadsheets to make basic visualizations.

Common visualization tasks with LDA include the following:

- Plotting document to topic association over time.
- Plotting the document-topic matrix.
- Plotting the document-word matrix.
- Plotting the association between two distinct kinds of documents within the same LDA run.

Given the CSV files, one can visualize the prevalence of topics over time. Figure 6.2 depicts the proportional topic weights of the first 128 issues over time against topics 15-20 from Table 6.1.

Figure 6.2 Example of using simple spreadsheet charting to visualize part of the document-topic matrix of Bootstrap (topics 15-20 of the first 128 issues).

From the spreadsheet inspection, the reader should notice that the document topic weights are somewhat noisy and hard to immediately interpret. For instance, it is hard to tell when a topic is popular and when it becomes less popular. Alternatively, one might ask if a topic is constantly referenced over time or if it is periodically popular. One method of gaining an overview is to bin or group documents by their date (e.g., weekly, biweekly, monthly) and then plot the mean topic weight of one topic per time bin over time. This allows one to produce a visualization depicting peaks of topic relevance over time. With the tutorial files we have included an R script called plotter.R that produces a summary of the 20 topics extracted combined with the dates extracted from the issue tracker. This R script produces Figure 6.3, a plot of the average relevance of documents per 2-week period over time. This plot is very similar to the plots in Hindle et al. [20]. If one looks at the bottom right corner of Figure 6.3, in the plot of topic 20 one can see that topic 20 peaks from time to time, but is not constantly discussed. This matches our perception of the licensing discussions found within the issue tracker: they occur when licenses need to be clarified or change, but they do not change all the time. This kind of overview can be integrated into project dashboards to give managers an overview of issue-tracker discussions over time.

Figure 6.3 Average topic weight for Bootstrap issues in 2-week bins. The topics are described in Table 6.1.
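The same binning can be reproduced in pandas. In this sketch the date column name, created_at, is our assumption (the tutorial's plotter.R joins dates extracted from the issue tracker; your CSV may store them differently):

```python
import pandas as pd

doc_topic = pd.read_csv("document_topic_map.csv")
doc_topic["created_at"] = pd.to_datetime(doc_topic["created_at"])

# Mean topic-20 weight per 2-week bin, as plotted in Figure 6.3.
binned = doc_topic.set_index("created_at")["T20"].resample("2W").mean()
print(binned.head())
```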

Further directions for readers to explore include using different kinds of documents, such as documentation, commits, issues, and source code, and then relying on LDA’s document-topic matrix to link these artifacts. We hope this tutorial has helped illustrate how LDA can be used to gain an overview of unstructured data within a repository and infer relationships between documents.

6.5 Pitfalls and Threats to Validity

This section summarizes the threats to validity and describes the potential pitfalls and hazards that practitioners may face when using LDA.

One pitfall is that different pieces of LDA software output different types of data. Some LDA software packages report probabilities, while others report word counts or other weights. While one can convert between probabilities and word counts, it is important to consider whether each document receives equal weight, or whether longer documents should receive more weight than shorter documents.

6.5.1 Criterion Validity

Criterion validity relates to the ability of a method to correspond with other measurements that are collected in order to study the same concept. LDA topics are not necessarily intuitive ideas, concepts, or topics. Therefore, results from LDA may not correspond with results from topic labeling performed by humans.

An erroneous assumption frequently made by LDA users is that an LDA topic will represent a more traditional topic that humans write about, such as sports, computers, or Africa. It is important to remember that LDA topics may not correspond to an intuitive domain concept. This problem was explored in Hindle et al. [20]. Thus, working with LDA-produced topics has some hazards: for example, even if LDA produces a recognizable sports topic, it may be combined with other topics, or there may be other sports topics.

6.5.2 Construct Validity

Construct validity relates to the ability of research to measure what it intended to measure. LDA topics are independent topics extracted from word distributions. This independence means that correlated or co-occurring concepts or ideas will not necessarily be given their own topic, and if they are, the documents might be split between topics.

One should be aware of the constraints and properties of LDA when trying to infer if LDA output shows an activity or not. LDA topics are not necessarily intuitive ideas, concepts, or topics. Comparisons between topics in terms of document association can be troublesome owing to the independence assumption of topics.

Finally, it is necessary to remember that LDA assumes that topic-word probabilities and document-topic probabilities are Dirichlet distributions. Furthermore, many pieces of LDA software use symmetric Dirichlet distributions. This implies the assumption that the Dirichlet parameters are the same for every word (β) or topic (α), respectively, and that these parameters are known beforehand. In most software, this means that α and β must be set carefully.

6.5.3 Internal Validity

Internal validity refers to how well conclusions can be made about causal effects and relationships. An important aspect of LDA is that topics are independent; thus, if two ideas are being studied to see if one causes the other, one has to guard against LDA’s word allocation strategy.

This means that each occurrence of a word is attributed to a single topic. Even if LDA produced a recognizable “sports” topic and a recognizable “news” topic, their combination is assumed never to occur. “Sports news” may then appear as a third topic, independent of the first two, or it may be spread across other topics whose focus is neither sports nor news. The independence of topics makes their comparison problematic.

For example, it might be desirable to ask if two topics overlap in some way. Table 6.4 depicts the correlation between every pair of topics as described by the document-topic matrix. Because of their independence they are not allowed to correlate: the output of LDA has topic-to-topic correlation values that are never significantly different from zero, as shown by the confidence intervals in Table 6.4.

Table 6.4

Topic-Topic Correlation Matrix (95% Confidence Intervals of the Correlation Amount)

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5
Topic 1 | 1 | −0.22 to 0.17 | −0.21 to 0.18 | −0.22 to 0.17 | −0.16 to 0.24
Topic 2 | −0.22 to 0.17 | 1 | −0.22 to 0.17 | −0.23 to 0.16 | −0.23 to 0.16
Topic 3 | −0.21 to 0.18 | −0.22 to 0.17 | 1 | −0.22 to 0.18 | −0.22 to 0.18
Topic 4 | −0.22 to 0.17 | −0.23 to 0.16 | −0.22 to 0.18 | 1 | −0.23 to 0.16
Topic 5 | −0.16 to 0.24 | −0.23 to 0.16 | −0.22 to 0.18 | −0.23 to 0.16 | 1
Topic 6 | −0.11 to 0.28 | −0.22 to 0.17 | −0.21 to 0.18 | −0.22 to 0.17 | −0.21 to 0.19
Topic 7 | −0.22 to 0.17 | −0.23 to 0.16 | −0.21 to 0.18 | −0.23 to 0.16 | −0.23 to 0.16
Topic 8 | −0.22 to 0.17 | −0.23 to 0.16 | −0.21 to 0.18 | −0.23 to 0.17 | −0.23 to 0.16
Topic 9 | −0.23 to 0.17 | −0.24 to 0.15 | −0.19 to 0.21 | −0.23 to 0.16 | −0.24 to 0.16
Topic 10 | −0.22 to 0.17 | −0.23 to 0.16 | −0.21 to 0.18 | −0.23 to 0.16 | −0.11 to 0.28

| Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10
Topic 1 | −0.11 to 0.28 | −0.22 to 0.17 | −0.22 to 0.17 | −0.23 to 0.17 | −0.22 to 0.17
Topic 2 | −0.22 to 0.17 | −0.23 to 0.16 | −0.23 to 0.16 | −0.24 to 0.15 | −0.23 to 0.16
Topic 3 | −0.21 to 0.18 | −0.21 to 0.18 | −0.21 to 0.18 | −0.19 to 0.21 | −0.21 to 0.18
Topic 4 | −0.22 to 0.17 | −0.23 to 0.16 | −0.23 to 0.17 | −0.23 to 0.16 | −0.23 to 0.16
Topic 5 | −0.21 to 0.19 | −0.23 to 0.16 | −0.23 to 0.16 | −0.24 to 0.16 | −0.11 to 0.28
Topic 6 | 1 | −0.2 to 0.2 | −0.22 to 0.18 | −0.22 to 0.17 | −0.22 to 0.18
Topic 7 | −0.2 to 0.2 | 1 | −0.21 to 0.18 | −0.21 to 0.18 | −0.23 to 0.17
Topic 8 | −0.22 to 0.18 | −0.21 to 0.18 | 1 | −0.23 to 0.16 | −0.22 to 0.17
Topic 9 | −0.22 to 0.17 | −0.21 to 0.18 | −0.23 to 0.16 | 1 | −0.05 to 0.33
Topic 10 | −0.22 to 0.18 | −0.23 to 0.17 | −0.22 to 0.17 | −0.05 to 0.33 | 1


From the same LDA model as Figure 6.1.
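Computing such a correlation matrix from a document-topic matrix is straightforward. The sketch below uses synthetic Dirichlet-distributed rows as a stand-in for real LDA output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 100 synthetic documents over 10 topics, standing in for LDA's theta matrix.
theta = pd.DataFrame(rng.dirichlet(np.full(10, 0.1), size=100),
                     columns=[f"Topic {k}" for k in range(1, 11)])

print(theta.corr().round(2))  # pairwise topic-topic correlations, as in Table 6.4
```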

To show that an event caused a change in LDA output, one should use a different data source and manual validation. LDA output changes given different α and β parameters, and sometimes, given a test, one could tune these parameters to make the LDA output pass or fail that test. One has to provide motivation for the choice of α and β in order for any conclusions drawn from LDA output to be convincing.

6.5.4 External Validity

External validity is about generalization and how broadly findings can be made. LDA topics are relevant only to the corpora from which they were extracted; the topics, and the words associated with them, might not generalize to other corpora. On the other hand, LDA can be applied to numerous collections of documents, and thus external validity can be addressed in some situations.

6.5.5 Reliability

Reliability is about how well one can repeat the findings of a study. With LDA, the exact topics found cannot be found again without sharing the initial parameters and random seeds. Thus, all LDA studies should report their parameters. Yet, even if parameters are reported, different LDA implementations will return different results, and the same implementation might produce different topics or different topic orderings each time it is run. Others might therefore not be able to replicate the exact topics found or their ordering.

Since LDA models are fitted iteratively, it is important to ensure that they have had adequate time to converge before use; otherwise, the model does not correspond to the input data. The time required for convergence depends on the number of topics, documents, and vocabulary words. For example, with Vowpal Wabbit, 100 topics, and 20,000 documents, each pass takes a matter of seconds on modest hardware, but at least two passes are recommended. To choose the correct number of passes, the output should be examined and the number of passes increased until the output stops changing significantly.

6.6 Conclusions

LDA is a powerful tool for working with collections of structured, unstructured, and semistructured text documents, of which there are plenty in software repositories. Our literature review has documented the abundance of LDA applications in software analysis: document clustering for issue/bug deduplication, linking between code, documentation, requirements, and communications for traceability, and summarizing the association of events and documents with software life-cycle activities.

We have demonstrated a simple case of using LDA to explore the contents of an issue-tracker repository and showed how the topics link back to documents. We also discussed how to visualize the output.

LDA, however, relies on a complex underlying probabilistic model and a number of assumptions. Therefore, even though off-the-shelf software is available to compute LDA models, the user of this software must be aware of potential pitfalls and caveats. This chapter has outlined the basics of the underlying conceptual model and discussed these pitfalls in order to enable the informed use of this powerful method.

References

[1] Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.

[2] Wang H, Wong K. Recommendation-assisted personal web. In: IEEE ninth world congress on services (SERVICES). Washington, DC, USA: IEEE Computer Society; 2013:136–140.

[3] Poshyvanyk D, Marcus A. The conceptual coupling metrics for object-oriented systems. In: 22nd IEEE international conference on software maintenance, ICSM’06. Washington, DC, USA: IEEE Computer Society; 2006:469–478.

[4] Ramesh B. Factors influencing requirements traceability practice. Commun ACM. 1998;41(12):37–44.

[5] Asuncion HU, Asuncion AU, Taylor RN. Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering, ICSE ’10, vol. 1. New York, NY, USA: ACM; 2010:95–104.

[6] Thomas SW, Adams B, Hassan AE, Blostein D. Validating the use of topic models for software evolution. In: Proceedings of the 10th IEEE working conference on source code analysis and manipulation, SCAM ’10. Washington, DC, USA: IEEE Computer Society; 2010:55–64.

[7] Gethers M, Oliveto R, Poshyvanyk D, Lucia AD. On integrating orthogonal information retrieval methods to improve traceability recovery. In: Proceedings of the 27th IEEE international conference on software maintenance (ICSM). Washington, DC, USA: IEEE Computer Society; 2011:133–142.

[8] Savage T, Dit B, Gethers M, Poshyvanyk D. Topicxp: exploring topics in source code using latent Dirichlet allocation. In: Proceedings of the 2010 IEEE international conference on software maintenance, ICSM ’10. Washington, DC, USA: IEEE Computer Society; 2010:1–6.

[9] Poshyvanyk D. Using information retrieval to support software maintenance tasks. Ph.D. thesis Detroit, MI, USA: Wayne State University; 2008.

[10] McMillan C, Poshyvanyk D, Revelle M. Combining textual and structural analysis of software artifacts for traceability link recovery. In: Proceedings of the 2009 ICSE workshop on traceability in emerging forms of software engineering, TEFSE ’09. Washington, DC, USA: IEEE Computer Society; 2009:41–48.

[11] Marcus A, Sergeyev A, Rajlich V, Maletic JI. An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering, WCRE ’04. Washington, DC, USA: IEEE Computer Society; 2004:214–223.

[12] De Lucia A, Marcus A, Oliveto R, Poshyvanyk D. Information retrieval methods for automated traceability recovery. In: Software and systems traceability. Berlin: Springer; 2012:71–98.

[13] Baldi PF, Lopes CV, Linstead EJ, Bajracharya SK. A theory of aspects as latent topics. In: Proceedings of the 23rd ACM SIGPLAN conference on object-oriented programming systems languages and applications, OOPSLA ’08. New York, NY, USA: ACM; 2008:543–562.

[14] Hindle A, Ernst NA, Godfrey MW, Mylopoulos J. Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 8th working conference on mining software repositories. New York, NY, USA: ACM; 2011:163–172.

[15] Lukins SK, Kraft NA, Etzkorn LH. Source code retrieval for bug localization using latent Dirichlet allocation. In: Proceedings of the 2008 15th working conference on reverse engineering, WCRE ’08. Washington, DC, USA: IEEE Computer Society; 2008:155–164.

[16] Alipour A, Hindle A, Stroulia E. A contextual approach towards more accurate duplicate bug report detection. In: Proceedings of the tenth international workshop on mining software repositories. Piscataway, NJ, USA: IEEE Press; 2013:183–192.

[17] Campbell JC, Zhang C, Xu Z, Hindle A, Miller J. Deficient documentation detection: a methodology to locate deficient project documentation using topic analysis. In: MSR. 2013:57–60.

[18] Han D, Zhang C, Fan X, Hindle A, Wong K, Stroulia E. Understanding Android fragmentation with topic analysis of vendor-specific bugs. In: WCRE. 2012:83–92.

[19] De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S. Using IR methods for labeling source code artifacts: is it worthwhile? In: IEEE 20th international conference on program comprehension (ICPC). Washington, DC, USA: IEEE Computer Society; 2012:193–202.

[20] Hindle A, Bird C, Zimmermann T, Nagappan N. Relating requirements to implementation via topic analysis: do topics extracted from requirements make sense to managers and developers? In: ICSM. 2012:243–252.

[21] Grant S, Cordy JR. Estimating the optimal number of latent concepts in source code analysis. In: Proceedings of the 10th IEEE working conference on source code analysis and manipulation, SCAM ’10. Washington, DC, USA: IEEE Computer Society; 2010:65–74.

[22] Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A. How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. In: Proceedings of the 2013 international conference on software engineering. Piscataway, NJ, USA: IEEE Press; 2013:522–531.

[23] Wikipedia. Latent Dirichlet allocation—Wikipedia, the free encyclopedia. 2014. http://en.wikipedia.org/w/index.php?title=Latent_Dirichlet_allocation&oldid=610319663 [Online; accessed 15.07.14].

[24] Langford J, Li L, Strehl A. Vowpal Wabbit. Technical report; 2007. http://hunch.net/~vw/.

