1 Purpose

The purpose of this course project is to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits.

  1. It will provide you with more experience using data wrangling tools on real life data sets.
  2. It helps you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to find insights in data.
  3. It starts to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.

The course is structured in a way that allows you to work on your project as you progress through the weeks. Thus, you should not have to cram during the last one week of the term to complete your project. Rather, I plan to have you work on the project and use some of the in-class time to do peer evaluation of your code.

2 Project Goal

The principal goal of this project is to import a real life data set, clean and tidy the data, and perform basic exploratory data analysis; all while using Google Colab to produce an HTML report that is fully reproducible.

3 Project Data

You can choose a dataset on your own or select one from the four options provided below. All four data sets contain key attributes that will demonstrate the data science capabilities that you have learned throughout this course. You may even need to learn new skills not taught to accomplish your mission. These include working with:

Available data sets include:

You can choose from one of the following four data sets. Each dataset has its own challenges and strengths.

4 Project Report

You will write an HTML report that provides the sections in the grading rubric below. You will need to import, assess, clean & tidy the data, and then come up with your own research questions that you would like to answer from the data by performing exploratory data analysis (if you’d like to perform a predictive model to answer your hypothesis that is fine but it is not required). Some thoughts to help you:

Although each data set’s data dictionary contains some additional questions worth pursuing, try to be creative in your analysis and investigate the data in a way that your classmates most likely will not. Creativity is an essential ingredient for a good data scientist!

Section Standard Possible percentages
Introduction 1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?
1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)
1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.
1.4 Explain how your analysis will help the consumer of your analysis.
15
Packages Required 2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.
2.2 Messages and warnings resulting from loading the package are suppressed.
2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).
10
Data Preparation 3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.
3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).
3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.
3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.
3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.
20
Exploratory Data Analysis 4.1 Uncover new information in the data that is not self-evident (i.e. do not just plot the data as it is; rather, slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information).
4.2 Provide findings in the form of plots and tables. Show me you can display findings in different ways.
4.3 Graph(s) are carefully tuned for desired purpose. One graph illustrates one primary point and is appropriately formatted (plot and axis titles, legend if necessary, scales are appropriate, appropriate geoms used, etc.).
4.4 Table(s) carefully constructed to make it easy to perform important comparisons. Careful styling highlights important features. Size of table is appropriate.
4.5 Insights obtained from the analysis are thoroughly, yet succinctly, explained. Easy to see and understand the interesting findinsg that you uncovered.
20
Summary 6.1 Summarize the problem statement you addressed.
6.2 Summarize how you addressed this problem statement (the data used and the methodology employed).
6.3 Summarize the interesting insights that your analysis provided.
6.4 Summarize the implications to the consumer of your analysis.
6.5 Discuss the limitations of your analysis and how you, or someone else, could improve or build on it.
20
Formatting & Other Requirements 7.1 Proper coding style is followed and code is well commented.
7.2 Coding is systematic - complicated problem broken down into sub-problems that are individually much simpler. Code is efficient, correct, and minimal. Code uses appropriate data structure (list, data frame, vector/matrix/array). Code checks for common errors.
7.3 Achievement, mastery, cleverness, creativity: Tools and techniques from the course are applied very competently and, perhaps,somewhat creatively. Perhaps student has gone beyond what was expected and required, e.g., extraordinary effort, additional tools not addressed by this course, unusually sophisticated application of tools from course.
7.4 .ipynb fully executes without any errors and HTML produced matches the HTML report submitted by student(s).
15


I expect your report to tell a story with the data. I do not want you to just report some statistics that you find but, rather, to provide a coherent narrative of your findings.

Any additional details regarding the final project will be provided in class.