Data Project

The data project is the main assessment for this course. There is no final exam — your project is your final submission.

Working independently, you will find a real dataset, explore it using the tools you have learned in this course, and write up your findings as a fully reproducible R document.


Download the project template

The template below guides you through each stage of the project. It includes example code and explanations for every section. Use it as your starting point.

data-project-template.zip

Unzip the file and open data-project.Rproj in RStudio to get started.


Timeline

Work on your project progressively throughout the semester rather than leaving it all to the end. Use the Pre sections of the template to share your progress with the teacher before the final submission.

Stage | What to submit | When
Pre_1–Pre_3 | Research question, data description, data loaded into R | Session 9
Pre_4–Pre_6 | Data cleaned and explored | Session 11
Final project | Complete write-up rendered to HTML | Last day of term

Project structure

Your final project should follow this structure. The template walks you through each section with examples.

1. Introduction

State your research question clearly. What are you trying to find out, and why is it interesting? Describe your dataset — where does it come from, what does it contain, and what are the key variables?

2. Exploratory data analysis

Before fitting any models, explore your data visually and numerically. Include:

  • Summary statistics for your key variables
  • At least two well-labelled plots that reveal something interesting about your data
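
As a rough sketch of what this section might contain, the example below uses base R and the built-in mtcars data as a stand-in for your own dataset. The choice of mpg as the outcome and wt and am as explanatory variables is an assumption made purely for illustration, and the template may use different packages or functions.

```r
# Illustrative only: mtcars ships with R and stands in for your own data.
cars <- mtcars

# am is stored as 0/1, so convert it to a labelled categorical variable.
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Summary statistics for the key variables.
print(summary(cars[, c("mpg", "wt", "am")]))

# Mean of the outcome within each category.
group_means <- aggregate(mpg ~ am, data = cars, FUN = mean)
print(group_means)

# Plot 1: numerical outcome against a numerical explanatory variable.
plot(mpg ~ wt, data = cars,
     xlab = "Weight (1000 lbs)",
     ylab = "Fuel economy (miles per gallon)",
     main = "Fuel economy against weight")

# Plot 2: numerical outcome across the levels of a categorical variable.
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission",
        ylab = "Fuel economy (miles per gallon)",
        main = "Fuel economy by transmission type")
```

Whatever plots you choose, label the axes with units and add a title, as above, so that each plot can be read on its own.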

3. Statistical analysis

Fit at least one regression model to address your research question. Your analysis should include:

  • A visualisation of the relationship between your outcome and explanatory variables
  • A regression table with coefficient estimates
  • An interpretation of the coefficients in plain language
  • A residual analysis to check model assumptions
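
The four items above could be sketched as follows, again using base R and the built-in mtcars data in place of your own dataset. The model formula and variable names are assumptions for illustration, not a required specification, and the template may present the regression table differently.

```r
# Illustrative only: mtcars stands in for your own data.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Visualise the outcome against the numerical explanatory variable,
# coloured by the categorical one.
point_colours <- c("steelblue", "darkorange")
plot(cars$wt, cars$mpg,
     col = point_colours[cars$am], pch = 19,
     xlab = "Weight (1000 lbs)",
     ylab = "Fuel economy (miles per gallon)",
     main = "Fuel economy against weight, by transmission")
legend("topright", legend = levels(cars$am), col = point_colours, pch = 19)

# Fit a linear regression with one numerical and one categorical predictor.
fit <- lm(mpg ~ wt + am, data = cars)

# Regression table: estimates, standard errors, t values and p values.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# Residuals against fitted values: look for curvature or a funnel shape.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals against fitted values")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals.
qqnorm(resid(fit))
qqline(resid(fit))
```

When you interpret the table, the coefficient on a numerical variable is the expected change in the outcome for a one-unit increase in that variable with the other variables held fixed, and the coefficient on a categorical variable is the expected difference from the reference level.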

4. Discussion

Interpret your findings in relation to your research question. Address:

  • What do your results mean?
  • What are the limitations of your analysis?
  • What further questions does your work raise?

5. Citations and references

List any data sources, packages, and literature you have cited.
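
For the packages, R can generate citation text itself. The sketch below uses the base function citation(); the stats package is only an example, so substitute the packages you actually loaded.

```r
# Citation for R itself.
r_citation <- citation()
print(r_citation)

# Citation for a specific package; "stats" ships with R and is only an example.
pkg_citation <- citation("stats")
print(pkg_citation)

# The same citation in BibTeX form, in case you keep a .bib file.
print(toBibtex(r_citation))
```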


Assessment criteria

Your project will be assessed on four criteria:

Criterion | What we are looking for
Technical skills | Code runs without errors; document renders cleanly to HTML; code is readable and well-commented
Data wrangling | Data is loaded, cleaned, and prepared appropriately; variables are handled correctly
Visualisation | Plots are informative, clearly labelled, and appropriate for the data type
Interpretation | Findings are explained accurately and in plain language; limitations are acknowledged

Choosing a dataset

You are free to use any dataset that interests you, as long as it has at least one numerical outcome variable and at least two explanatory variables (at least one numerical, at least one categorical).

Some good places to find datasets:

  • TidyTuesday — weekly datasets shared by the R community
  • Kaggle — large collection of datasets across many topics
  • Our World in Data — data on global development, health, and society
  • e-Stat — Japanese government statistics portal
  • Google Dataset Search — search engine for publicly available datasets
Tip: A good dataset for this course

A good dataset has between 100 and 10,000 rows, contains a mix of numerical and categorical variables, and is on a topic you are genuinely curious about. Avoid datasets that are too clean — a little messiness is good practice.
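
One way to screen a candidate dataset against the requirements in this section is a small helper like the one below. The function check_dataset() is hypothetical (it is not part of the template), and it only counts column types, so a categorical variable stored as numbers will be counted as numerical until you convert it to a factor.

```r
# Hypothetical helper: counts rows and variable types in a data frame.
check_dataset <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_cat <- vapply(
    df,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )
  list(
    n_rows = nrow(df),
    n_numeric = sum(is_num),
    n_categorical = sum(is_cat),
    # One numerical outcome plus one numerical explanatory variable means
    # at least two numerical columns, plus at least one categorical column.
    meets_variable_rule = sum(is_num) >= 2 && sum(is_cat) >= 1,
    # Suggested size from the tip above: between 100 and 10,000 rows.
    within_suggested_size = nrow(df) >= 100 && nrow(df) <= 10000
  )
}

# Example with the built-in iris data (150 rows, 4 numerical columns, 1 factor).
result <- check_dataset(iris)
str(result)
```

Passing this check says nothing about whether the dataset is interesting or suits your research question; it only rules out datasets that cannot meet the minimum requirements.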


Submission

Submit your rendered HTML file, together with its source file, on Moodle by the deadline. Name the HTML file data-project_yourname.html.

Your document must be fully reproducible — the teacher should be able to re-render it from the source .qmd or .Rmd file and get the same output. Include your source file in the submission.
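
A few habits make re-rendering more likely to give the same output. The sketch below shows three of them in base R, under stated assumptions: the seed value is arbitrary, and the file name my-data.csv is hypothetical.

```r
# 1. Fix the random seed before any step that uses random numbers,
#    so the same rows are drawn on every render. The value is arbitrary.
set.seed(2024)
rows_first <- sample(nrow(mtcars), 5)

set.seed(2024)
rows_again <- sample(nrow(mtcars), 5)
print(identical(rows_first, rows_again))  # TRUE: same seed, same draw

# 2. Build file paths relative to the project folder instead of using
#    absolute paths that exist only on your own computer.
data_path <- file.path("data", "my-data.csv")  # hypothetical file name
print(data_path)

# 3. Record the R version and loaded packages at the end of the document.
print(sessionInfo())
```

Before submitting, it is worth restarting R and rendering the document once from a clean session, since objects left over in your workspace can hide missing steps.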