Ranalyze


Project Introduction

The goal of this project is to streamline the process by which employees of the Lineberger Cancer Center document and classify Internet Tobacco Vendors. We will begin by tracking the Reddit popularity of certain keywords and interests, with the goal of providing insight into how these communities and interests have changed over time. We then plan to expand the project to provide a predictive analysis tool for websites with ambiguous internet tobacco vendor status, to aid employees in pruning false positives.

Tweet: Our aim is to provide utilities aiding in the process of scraping, caching, and identifying internet tobacco vending websites.

Standing Meetings

Group Meeting: Monday/Wednesday 2:45
Client Meeting: Friday 1:30

Contact Information and Team Roles

Architect: Rourke Creighton - Email
Client Liaison: Daniel Chiquito - Email
Project Manager: Lukas O’Daniel - Email
Writer: Bryan Iddings - Email

Team Rules


Project Concept

This project is intended to enhance the effectiveness and efficiency of the Internet Tobacco Vendors Study (ITVS). To provide such an enhancement, our team will focus on two main goals: automating and streamlining much of the ITVS’s research process, namely scraping, caching, and identifying internet tobacco vendors; and providing an intuitive mechanism for up-to-date social media statistics, specifically information about particular subreddits.

The current process of identifying internet tobacco vendors is tedious and repetitive, involving manually visiting over ten thousand websites to determine whether each is an actual vendor or a false positive. To improve the efficiency of this process, we will implement an automated website analyzer to aid in determining a site’s tobacco vending status. An effective analyzer can be built by identifying features indicative of internet tobacco vendors and feeding them to a simple classifier, yielding a self-improving tool that produces a confidence level alongside each prediction.
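
As a rough sketch of this concept (assuming scikit-learn, one of the candidate libraries listed under Platform Analysis below; the example pages, labels, and features are placeholders, not the real ITVS data):

    # Sketch only: assumes scikit-learn; training examples are placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical labeled training examples: page text with ITV labels.
    pages = [
        "buy cheap cigarettes online free shipping",   # known vendor
        "local news weather and sports coverage",      # false positive
    ]
    labels = [1, 0]  # 1 = internet tobacco vendor, 0 = not a vendor

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(pages, labels)

    # predict_proba provides a confidence level alongside the prediction.
    confidence = model.predict_proba(["discount tobacco wholesale"])[0][1]
    print("ITV confidence: %.2f" % confidence)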

Currently, the ITVS only accounts for statically served tobacco vendors. We will broaden the study by creating a flexible, easy-to-use data collection tool for the online e-cigarette community on Reddit. Our tool will maintain an up-to-date database of specific information about posts in the community’s various subreddits, e.g. post title, number of up/down votes, number of comments, etc. This data can be used to track and analyze trends in the e-cigarette community.
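
For illustration, collection of post data might look like the following (assuming the PRAW library as the Reddit API client; PRAW is not specified in this document, and the credentials, subreddit name, and stored fields are examples):

    # Sketch only: assumes PRAW; credentials and subreddit are placeholders.
    import praw

    reddit = praw.Reddit(client_id="...",
                         client_secret="...",
                         user_agent="ranalyze")

    for post in reddit.subreddit("electronic_cigarette").new(limit=100):
        record = {
            "title": post.title,
            "score": post.score,              # net up-votes
            "num_comments": post.num_comments,
            "url": post.url,                  # external URL, if any
            "permalink": post.permalink,
            "created_utc": post.created_utc,
        }
        # record would be written to the relational database here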

Our stretch goals for this project involve providing scripts and services to automate and streamline other parts of the ITVS research and data collection process. Potential areas for improvement include scripts to transform data between formats conducive to manual work and those conducive to automation.

User Stories

Personas

Jason: Project manager for the ITVS project. Jason is in charge of coordinating the more technical aspects of ITVS data processing. As such, he needs a thorough understanding of how to use every aspect of every tool, as well as the basic principles the tools are employing. Jason must be capable of performing every step in the process himself. Jason also needs the ability to train newcomers (Mysterio) to the project in order to delegate steps in the data processing pipeline.

Dmitriy: Staffer responsible for maintaining our tools long term. In addition to being able to use our tools, Dmitriy needs a complete understanding of the inner workings of our product to be able to modify and update them as necessary. Dmitriy will be involved in our design process so that he is fully up to date with our project at the end of the semester. Dmitriy needs comprehensive documentation for as much functionality as possible, as well as any design decisions we make.

Mysterio: Anyone who uses our tools. Mysterio will be a person subordinate to Jason who will take at least partial responsibility for using some component of our tools. Mysterio is assumed to have basic computing experience. Mysterio will be replacing Jason, and so Jason is assumed to be present for basic training and allocation of responsibilities. For any further questions, Mysterio needs a basic user manual that describes the function of all of our tools. Mysterio is not concerned with the implementation of our tools.

Use Cases

Data Collection

ranalyze can be used to collect data from the social media website Reddit. Data can be gathered from multiple subreddits and includes information such as post title, post content, external URL, up-vote count, comment count, comment text, etc. This data can be automatically updated to reflect up-to-date information in non-static fields such as up-vote count and comment count.

Data Analysis

ranalyze can also be used to flexibly extract data categorically, or based on the presence and/or frequency of specified keywords.
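
For illustration, assuming the collected data lives in a SQLite database with a posts table (the schema and file path here are hypothetical), a keyword-based extraction could be as simple as:

    # Sketch only: assumes a hypothetical "posts" table with title,
    # content, and permalink columns.
    import sqlite3

    conn = sqlite3.connect("ranalyze.db")  # placeholder path
    keyword = "e-juice"
    pattern = "%" + keyword + "%"
    rows = conn.execute(
        "SELECT title, permalink FROM posts"
        " WHERE title LIKE ? OR content LIKE ?",
        (pattern, pattern),
    ).fetchall()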

ITV Identification

tbd provides a machine learning approach to the identification of internet tobacco vendors (ITV). Through the analysis of various metrics, a website is assigned a score representing the likelihood that it is, in fact, an ITV.

Requirements

ranalyze

This utility will scrape a user-specified set of subreddits over a particular date range and output the results into a relational database for further manipulation and analysis. Within the limits of the Reddit API, every post, comment, and piece of relevant data and metadata (title, URL, up-vote count, up-vote ratio, comment body, etc.) will be collected and stored. This process will run on a regular schedule and update non-static metadata incrementally.
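
A sketch of the incremental update step, reusing the hypothetical PRAW client and posts schema from the earlier sketches:

    # Sketch only: refreshes non-static metadata (score, comment count)
    # for posts already stored in the database.
    def update_posts(reddit, conn):
        for (post_id,) in conn.execute("SELECT id FROM posts").fetchall():
            post = reddit.submission(id=post_id)  # re-fetch current state
            conn.execute(
                "UPDATE posts SET score = ?, num_comments = ? WHERE id = ?",
                (post.score, post.num_comments, post_id),
            )
        conn.commit()

The regular schedule itself could come from an external scheduler such as cron.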

Platform Analysis and Selection

Programming Language/Libraries

Python 3: Selected over Python 2, which is approaching its end of life.
Selenium: Headless browser package for automated scraping of dynamically generated sites.
openpyxl: Python library for reading and writing Microsoft Excel files.
Machine Learning Library: scikit-learn and TensorFlow are potential candidates under review.

Code Quality Assurance

PyUnit/nosetests: Reduce bug severity with test-driven development.
Travis CI: Continuous integration, namely automated unit testing.
CodeClimate: Ensure style-guide adherence and best coding practices.
Manual Code Review: All code is peer-reviewed prior to incorporation.

Getting Started

Open ranalyze-itvs.vipapps.unc.edu

Use the Search tab to search for Reddit posts and comments. The results can be downloaded to a CSV file, which will include more columns than are shown in the search results.

Use the Word Frequency tab to see which words are most commonly used, displayed as a word cloud or a table. The weight of each word is calculated as weight = number_of_posts_with_word * x + number_of_times_word_was_used * y, where x and y are specified by the sliders.
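
For example, with x = 1 and y = 0.5, a word that appears in 10 posts and is used 25 times overall has weight 10 * 1 + 25 * 0.5 = 22.5. As a minimal Python sketch:

    # The weight formula above, as a function; x and y are the slider values.
    def weight(posts_with_word, times_word_used, x, y):
        return posts_with_word * x + times_word_used * y

    weight(10, 25, x=1, y=0.5)  # -> 22.5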

Use the Import tab to upload a CSV file of permalinks to be scraped and added to the database.

Use the Settings tab to add or remove subreddits from the scraping schedule. You can also set the default values for x and y for the Word Frequency tab.

Frequently Asked Questions

How do I search?

Searching is done in the Search tab by entering a list of space-separated keywords in keywords mode, or by entering an expression in expression mode.

Expressions are of the format "word" and "other word" or not "third word"
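
For example, the expression "vape" and "juice" would match only posts containing both words (this example is illustrative).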

What do the word frequency sliders do?

The sliders determine the coefficients x and y in the weight calculation used to rank words in the word cloud. The formula is weight = number_of_posts_with_word * x + number_of_times_word_was_used * y.

I deleted a subreddit from the settings. Why is it still in the search results?

Deleting a subreddit from the settings page only stops the automatic scraping of that subreddit; all previously collected data is retained. If you want to delete data, you will need to access the database directly.

I added a subreddit to the settings. Why can’t I find any results for it?

Subreddits are scraped automatically, but the initial scrape of a newly added subreddit takes time: after a subreddit is added on the settings page, its most recent 1000 posts are scraped, and fetching that many posts can take a while.