Architecture

Our ranalyze module takes a list of subreddits, a date range, and a database file. The program visits each specified subreddit and stores every post, along with its associated comments, in the database if the post was created within the date range.
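The core filtering decision above can be sketched as a small helper. This is an illustrative sketch, not the project's actual code: the function name in_date_range is hypothetical, but the input shape matches what Reddit reports (post creation times as Unix timestamps).

```python
from datetime import datetime, timezone

def in_date_range(created_utc, after, before):
    """Return True if a post's creation time falls inside [after, before].

    created_utc is a Unix timestamp (the form Reddit reports for posts);
    after and before are timezone-aware datetimes bounding the scrape.
    """
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return after <= created <= before
```

Posts that fail this check are skipped, so a scraping job only touches the slice of each subreddit that falls inside the configured range.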

Design


Modules

To retrieve information from Reddit we use PRAW (the Python Reddit API Wrapper), which provides built-in rate limiting to avoid being throttled by Reddit. We have written a custom database wrapper (ranalyze/pkg/database.py) for storing posts and comments; it provides a simple way to move post and comment data from the API into the database. Additionally, for progress tracking on the machines performing these scraping jobs, we include a progress module that prints progress updates to standard output while a job runs. All of these modules come together in ranalyze.py: when run, it parses the command-line arguments that configure the scraping job, opens a database connection through our wrapper, and adds or updates all posts and comments found in the specified date range and subreddits.
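The "add or update" behavior of the wrapper can be sketched with the standard-library sqlite3 module. This is a simplified stand-in, not the real API of ranalyze/pkg/database.py: the function name store_entries, the column names, and the sample rows are all hypothetical, but the upsert-by-id idea is what lets repeated scraping jobs refresh existing rows instead of duplicating them.

```python
import sqlite3

def store_entries(conn, entries):
    """Insert or update scraped posts and comments in a single 'entries'
    table, keyed on Reddit's unique id. Re-scraping the same id replaces
    the old row rather than creating a duplicate."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id TEXT PRIMARY KEY, subreddit TEXT, text TEXT, created_utc REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
        entries,
    )
    conn.commit()

# Hypothetical sample rows standing in for data returned by PRAW
rows = [
    ("t3_abc", "learnpython", "How do I ...", 1500000000.0),
    ("t1_def", "learnpython", "Try this ...", 1500000100.0),
]
conn = sqlite3.connect(":memory:")
store_entries(conn, rows)
```

Keying on the Reddit id means a job that overlaps an earlier date range simply refreshes rows already in the database.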

Data

Data from scraping jobs is written to a SQLite database through our database wrapper. For our purposes, both posts and comments are stored in a single table, called entries, with the following columns: