Data pull

The amount of data we collect through GitHub API is such that it fits in memory. We can deal with it directly in a pandas dataframe. If more data is required, we would recommend storing it in a database, such as MongoDB.

We use JSON tools to convert the results into a clean JSON and to create a dataframe.

from pandas.io.json import json_normalize 
import json 
import pandas as pd 
import bson.json_util as json_util sanitized = json.loads(json_util.dumps(results)) normalized = json_normalize(sanitized) df = pd.DataFrame(normalized)

The dataframe df contains columns related to all the results returned by GitHub API. We can list them by typing the following:

df.columns 
 
Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url', 
       'clone_url', 'collaborators_url', 'comments_url', 'commits_url', 
       'compare_url', 'contents_url', 'contributors_url', 'default_branch', 
       'deployments_url', 'description', 'downloads_url', 'events_url', 
'fork', 'forks', 'forks_count', 'forks_url', 'full_name', 'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url', 'has_downloads', 'has_issues', 'has_pages', 'has_projects', 'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id', 'issue_comment_url',
'issue_events_url', 'issues_url', 'keys_url', 'labels_url', 'language', 'languages_url', 'merges_url', 'milestones_url', 'mirror_url', 'name', 'notifications_url', 'open_issues', 'open_issues_count', 'owner.avatar_url', 'owner.events_url', 'owner.followers_url', 'owner.following_url', 'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id', 'owner.login',
'owner.organizations_url', 'owner.received_events_url', 'owner.repos_url', 'owner.site_admin', 'owner.starred_url', 'owner.subscriptions_url', 'owner.type', 'owner.url', 'private', 'pulls_url', 'pushed_at', 'releases_url', 'score', 'size', 'ssh_url', 'stargazers_count', 'stargazers_url', 'statuses_url', 'subscribers_url', 'subscription_url', 'svn_url', 'tags_url', 'teams_url', 'trees_url', 'updated_at', 'url',
'watchers', 'watchers_count', 'year'], dtype='object')

Then, we select a subset of variables which will be used for further analysis. Our choice is based on the meaning of each of them. We skip all the technical variables related to URLs, owner information, or ID. The remaining columns contain information which is very likely to help us identify new technology trends:

  • description: A user description of a repository
  • watchers_count: The number of watchers
  • size: The size of repository in kilobytes
  • forks_count: The number of forks
  • open_issues_count: The number of open issues
  • language: The programming language the repository is written in

We have selected watchers_count as the criterion to measure the popularity of repositories. This number indicates how many people are interested in the project. However, we may also use forks_count which gives us slightly different information about the popularity. The latter represents the number of people who actually worked with the code, so it is related to a different group.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset