Data pull

The amount of data we collect through the GitHub API is small enough to fit in memory, so we can work with it directly in a pandas dataframe. If more data is required, we recommend storing it in a database, such as MongoDB.
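One way to check that a dataframe comfortably fits in memory is to ask pandas for its footprint. The following is a minimal sketch; sample_df and its values are invented stand-ins for the real results, not data from the GitHub API.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe of GitHub results; values are invented.
sample_df = pd.DataFrame({
    "description": ["A web framework", "A CLI tool", None],
    "language": ["Python", "Go", "Rust"],
    "watchers_count": [120, 45, 7],
})

# deep=True also counts the memory held by Python strings in object columns.
total_bytes = int(sample_df.memory_usage(deep=True).sum())
print(f"{total_bytes / 1024:.1f} KiB")
```

If the reported size approaches the available RAM, that is a reasonable signal to move the data into a database instead.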

We use JSON tools to convert the results into clean JSON and then build a dataframe from it.

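import json
import pandas as pd
import bson.json_util as json_util

# Serialize BSON-specific types (for example, ObjectId) into plain JSON
sanitized = json.loads(json_util.dumps(results))
# Flatten nested fields such as owner.login into separate columns.
# pd.json_normalize (pandas 1.0+) replaces the deprecated
# pandas.io.json.json_normalize and already returns a DataFrame.
df = pd.json_normalize(sanitized)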
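To see what the normalization step does, here is a minimal sketch on two invented records shaped roughly like GitHub search results. The names sample_results and sample_df and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records shaped like GitHub search results (values are made up).
sample_results = [
    {"name": "repo-a", "language": "Python", "owner": {"login": "alice", "id": 1}},
    {"name": "repo-b", "language": "Go", "owner": {"login": "bob", "id": 2}},
]

# Nested dictionaries become dotted column names such as "owner.login".
sample_df = pd.json_normalize(sample_results)
print(sorted(sample_df.columns))
```

Each nested owner field ends up in its own column, which is why the full dataframe contains many columns prefixed with owner.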

The dataframe df contains one column for every field returned by the GitHub API. We can list them by typing the following:

df.columns 
 
Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url',
       'clone_url', 'collaborators_url', 'comments_url', 'commits_url',
       'compare_url', 'contents_url', 'contributors_url', 'default_branch',
       'deployments_url', 'description', 'downloads_url', 'events_url',
       'fork', 'forks', 'forks_count', 'forks_url', 'full_name',
       'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url',
       'has_downloads', 'has_issues', 'has_pages', 'has_projects',
       'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id',
       'issue_comment_url', 'issue_events_url', 'issues_url', 'keys_url',
       'labels_url', 'language', 'languages_url', 'merges_url',
       'milestones_url', 'mirror_url', 'name', 'notifications_url',
       'open_issues', 'open_issues_count', 'owner.avatar_url',
       'owner.events_url', 'owner.followers_url', 'owner.following_url',
       'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id',
       'owner.login', 'owner.organizations_url',
       'owner.received_events_url', 'owner.repos_url', 'owner.site_admin',
       'owner.starred_url', 'owner.subscriptions_url', 'owner.type',
       'owner.url', 'private', 'pulls_url', 'pushed_at', 'releases_url',
       'score', 'size', 'ssh_url', 'stargazers_count', 'stargazers_url',
       'statuses_url', 'subscribers_url', 'subscription_url', 'svn_url',
       'tags_url', 'teams_url', 'trees_url', 'updated_at', 'url',
       'watchers', 'watchers_count', 'year'],
      dtype='object')

Next, we select a subset of variables for further analysis, based on what each of them means. We skip the technical variables related to URLs, owner information, and IDs. The remaining columns carry information that is likely to help us identify new technology trends:

description: A user description of a repository
watchers_count: The number of watchers
size: The size of the repository in kilobytes
forks_count: The number of forks
open_issues_count: The number of open issues
language: The programming language the repository is written in
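The selection described above can be sketched as follows. The dataframe here is a small invented stand-in for the full set of results, and the names selected_columns and df_selected are assumptions introduced for this example.

```python
import pandas as pd

# Hypothetical stand-in for the full dataframe; all values are invented.
df = pd.DataFrame({
    "full_name": ["alice/repo-a", "bob/repo-b"],
    "html_url": ["https://example.com/a", "https://example.com/b"],
    "description": ["A web framework", "A CLI tool"],
    "watchers_count": [120, 45],
    "size": [2048, 512],
    "forks_count": [30, 12],
    "open_issues_count": [5, 1],
    "language": ["Python", "Go"],
})

# Keep only the variables chosen for the analysis.
selected_columns = [
    "description",
    "watchers_count",
    "size",
    "forks_count",
    "open_issues_count",
    "language",
]
# .copy() avoids modifying the original dataframe through the subset later on.
df_selected = df[selected_columns].copy()
print(df_selected.head())
```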

We have selected watchers_count as the criterion for measuring the popularity of repositories. This number indicates how many people are interested in the project. Alternatively, we could use forks_count, which captures a slightly different aspect of popularity: it counts how many times the repository has been forked, so it reflects the group of people who actually worked with the code.
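To illustrate how the two criteria can rank repositories differently, here is a minimal sketch on invented numbers. The dataframe repos and every value in it are hypothetical.

```python
import pandas as pd

# Hypothetical repositories; the counts are invented for illustration.
repos = pd.DataFrame({
    "name": ["repo-a", "repo-b", "repo-c", "repo-d"],
    "watchers_count": [500, 300, 200, 100],
    "forks_count": [20, 150, 90, 5],
})

# Rank repositories under each popularity criterion.
by_watchers = repos.sort_values("watchers_count", ascending=False)["name"].tolist()
by_forks = repos.sort_values("forks_count", ascending=False)["name"].tolist()

# Pearson correlation of the ranks, which is the Spearman rank correlation.
rank_corr = repos["watchers_count"].rank().corr(repos["forks_count"].rank())

print(by_watchers)
print(by_forks)
print(rank_corr)
```

A rank correlation well below 1 would suggest that the two measures capture different audiences, so the choice of criterion can change which repositories appear most popular.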
