
Architecture


Our ranalyze module takes a list of subreddits, a date range, and a database file. It visits each specified subreddit and, for every post created within the date range, stores the post and its associated comments in the database.
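As a minimal sketch of the date-range check described above (not the actual ranalyze code): the function name in_date_range, the created_utc key, and the inclusive boundaries are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def in_date_range(created_utc, start, end):
    """Return True if a UTC timestamp falls within [start, end], inclusive.

    Hypothetical helper: the real module may treat the boundaries differently.
    """
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return start <= created <= end


# Example date range: all of January 2016 (UTC).
start = datetime(2016, 1, 1, tzinfo=timezone.utc)
end = datetime(2016, 1, 31, 23, 59, 59, tzinfo=timezone.utc)

# Stand-in posts; a real job would get these from the reddit API.
posts = [
    {"id": "a1", "created_utc": datetime(2016, 1, 15, tzinfo=timezone.utc).timestamp()},
    {"id": "b2", "created_utc": datetime(2016, 2, 2, tzinfo=timezone.utc).timestamp()},
]

# Only posts created inside the range are kept for storage.
kept = [p["id"] for p in posts if in_date_range(p["created_utc"], start, end)]
print(kept)  # ['a1']
```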

Design

Modules

To retrieve information from reddit we use PRAW (Python Reddit API Wrapper), which has built-in functionality to avoid being throttled by reddit. We wrote our own database wrapper (ranalyze/pkg/database.py) for storing posts and comments; it provides a simple way to add comment and post data from the API to the database. To track progress on machines that run scraping jobs, we also include a progress module that prints progress updates to standard output while a job runs.

All of these modules are combined in ranalyze.py. When run, it parses the command line arguments for the configuration of the scraping job, opens a database connection through our database wrapper, and adds or updates every post and comment found in the specified subreddits and date range.
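The "adds or updates" step can be sketched as follows. This is an illustration only and does not reproduce ranalyze/pkg/database.py: the function add_or_update, the reduced four-column table, and the sample id are assumptions.

```python
import sqlite3
import time


def add_or_update(conn, entry):
    """Update the row with this id if it exists, otherwise insert it.

    Hypothetical helper operating on a reduced version of the entries table.
    """
    now = int(time.time())
    cur = conn.execute(
        "UPDATE entries SET up_votes = ?, text_content = ?, time_updated = ? "
        "WHERE id = ?",
        (entry["up_votes"], entry["text_content"], now, entry["id"]),
    )
    if cur.rowcount == 0:
        # No existing row matched, so this entry is new.
        conn.execute(
            "INSERT INTO entries (id, up_votes, text_content, time_updated) "
            "VALUES (?, ?, ?, ?)",
            (entry["id"], entry["up_votes"], entry["text_content"], now),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    "id TEXT PRIMARY KEY, up_votes INTEGER, text_content TEXT, time_updated INTEGER)"
)

# The first scrape inserts the entry; a later scrape updates it in place.
add_or_update(conn, {"id": "abc123", "up_votes": 10, "text_content": "hello"})
add_or_update(conn, {"id": "abc123", "up_votes": 25, "text_content": "hello"})

rows = conn.execute("SELECT id, up_votes FROM entries").fetchall()
print(rows)  # [('abc123', 25)]
```

Re-running a job over the same date range then refreshes vote counts and time_updated without creating duplicate rows.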

Data

Scraping jobs write their data to a SQLite database through our database wrapper. Both comments and posts are stored in a single table, entries, with the following columns:

id - (text, primary key) the id of the comment or post as provided by reddit
permalink - (text, unique) the link to the post's comment section. Relevant only for posts.
root_id - (text) for a comment, the id of the post it belongs to
up_votes - (integer) the net upvotes (upvotes - downvotes) for a post
up_ratio - (real) upvotes / (total votes)
time_submitted - (integer) the UTC timestamp of when the post was submitted
time_updated - (integer) the UTC timestamp of when this entry in the database was most recently updated
posted_by - (text) reddit account name of the person who posted the entry
title - (text) the title of the submitted post. Relevant only for posts.
subreddit - (text) the subreddit that the entry came from
external_url - (text) if the submission is an external link, then that link. Relevant only for posts.
text_content - (text) the body of the post if the entry is a self post, or the comment body if the entry is a comment
parent_id - (text, foreign key referencing id) the id of the entry that a comment replies to: the post's id for a direct reply to the post, or the parent comment's id for a reply to another comment. Relevant only for comments.
gilded - (integer) 1 if the entry is gilded, 0 if not
deleted - (integer) 1 if the comment or post was deleted after the original scrape, 0 otherwise
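The column list above could be expressed as a SQLite table roughly like this. It is a sketch assembled from the descriptions, not the schema text used by the database wrapper; the sample rows and the query are illustrative assumptions.

```python
import sqlite3

# Column names, types, and constraints follow the list above; nothing else
# (defaults, NOT NULL, indexes) is assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id             TEXT PRIMARY KEY,
    permalink      TEXT UNIQUE,
    root_id        TEXT,
    up_votes       INTEGER,
    up_ratio       REAL,
    time_submitted INTEGER,
    time_updated   INTEGER,
    posted_by      TEXT,
    title          TEXT,
    subreddit      TEXT,
    external_url   TEXT,
    text_content   TEXT,
    parent_id      TEXT REFERENCES entries(id),
    gilded         INTEGER,
    deleted        INTEGER
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Made-up sample data: one post, one reply to the post, one reply to that comment.
sample = [
    ("p1", None, None, 100, "A post title"),
    ("c1", "p1", "p1", 200, None),
    ("c2", "p1", "c1", 300, None),
]
conn.executemany(
    "INSERT INTO entries (id, root_id, parent_id, time_submitted, title) "
    "VALUES (?, ?, ?, ?, ?)",
    sample,
)

# All comments under post p1, oldest first, regardless of nesting depth.
comment_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM entries WHERE root_id = ? ORDER BY time_submitted", ("p1",)
    )
]
print(comment_ids)  # ['c1', 'c2']
```

Note that SQLite does not enforce the parent_id foreign key unless PRAGMA foreign_keys = ON is set on the connection.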