Data Project

The data project is the main assessment for this course. There is no final exam — your project is your final submission.

Working independently, you will find a real dataset, explore it using the tools you have learned in this course, and write up your findings as a fully reproducible R document.

Download the project template

The template below guides you through each stage of the project. It includes example code and explanations for every section. Use it as your starting point.

data-project.zip

Unzip the file and open data-project.Rproj in RStudio to get started.

Project folder structure

Your project folder is organised as follows:

data-project/
├── data-project.Rproj
├── data-project.qmd
├── data_raw/          # your original data file — never modify this
├── data_processed/    # optional: cleaned data saved from R
├── figures/           # plots saved with ggsave()
└── scripts/           # optional: separate R scripts for long tasks

A few important habits:

Raw data is read-only. Place your downloaded dataset in data_raw/ and never edit it directly. Always read from data_raw/ in your code.
Save cleaned data to data_processed/ if your cleaning steps are complex — but for most projects, cleaning inside the .qmd is fine.
Save your plots to figures/ using ggsave() with descriptive file names.
The scripts/ folder is there if you need it — for example, if your data cleaning is very long and you want to keep it separate. Most students will not need it.
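
The habits above can be sketched in a few lines of R. This is a minimal illustration, not part of the template: so that it runs anywhere, it builds a throwaway copy of the folder layout in a temporary directory and uses the built-in mtcars data as a stand-in for a downloaded dataset. The file names survey.csv, survey_clean.csv and mpg-by-weight.png are hypothetical. In your own project the folders already exist, and you would write paths relative to data-project.Rproj (for example "data_raw/survey.csv").

```r
library(ggplot2)

# Throwaway project folder so the sketch runs anywhere (assumption: in a
# real project these folders already exist next to data-project.Rproj).
project <- file.path(tempdir(), "data-project")
for (d in c("data_raw", "data_processed", "figures")) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# Stand-in for a downloaded dataset (hypothetical file name).
raw_path <- file.path(project, "data_raw", "survey.csv")
write.csv(mtcars, raw_path, row.names = FALSE)

# 1. Always read from data_raw/ and never write back to it.
raw <- read.csv(raw_path)

# 2. Clean in R; if the cleaning is long, save the result to data_processed/.
clean <- raw[!is.na(raw$mpg), ]
clean$am <- factor(clean$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))
clean_path <- file.path(project, "data_processed", "survey_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)

# 3. Save plots to figures/ with a descriptive file name.
p <- ggplot(clean, aes(x = wt, y = mpg, colour = am)) +
  geom_point() +
  labs(title = "Fuel economy by car weight",
       x = "Weight (1000 lb)", y = "Miles per gallon",
       colour = "Transmission")
plot_path <- file.path(project, "figures", "mpg-by-weight.png")
ggsave(plot_path, plot = p, width = 6, height = 4, dpi = 300)
```

Because every step reads from the raw file and writes somewhere else, re-running the code from the top always starts from the same untouched data.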

Timeline

Work on your project progressively throughout the semester rather than leaving it all to the end. Use the Pre sections of the template (Pre_1 to Pre_6) to share your progress with the teacher before the final submission.

Stage	What to submit	When
Pre_1–Pre_3	Research question, data description, data loaded into R	Session 9
Pre_4–Pre_6	Data cleaned and explored	Session 11
Final project	Complete write-up rendered to HTML	Last day of term

Project structure

The template walks you through each section with examples and prompts. It is a guide, not a strict requirement: you are free to adapt the structure to suit your data and research question. What matters is that your final document tells a clear, well-supported story.

1. Introduction

State your research question clearly. What are you trying to find out, and why is it interesting? Describe your dataset — where does it come from, what does it contain, and what are the key variables?

2. Exploratory data analysis

Before fitting any models, explore your data visually and numerically. Include:

Summary statistics for your key variables
At least two well-labelled plots that reveal something interesting about your data
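
As a rough sketch of what this section might contain, the example below computes summary statistics and builds two labelled plots. It uses the built-in mtcars data as a placeholder for your own dataset, and the choice of variables (mpg as outcome, wt and cyl as explanatory variables) is an assumption made purely for illustration.

```r
library(ggplot2)

# Built-in example data; replace with your own cleaned data frame.
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # treat number of cylinders as categorical

# Summary statistics for the key variables.
summary(cars[, c("mpg", "wt", "cyl")])

# Mean of the outcome within each category.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = mean)
mpg_by_cyl

# Plot 1: distribution of the outcome variable.
p_hist <- ggplot(cars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(title = "Distribution of fuel economy",
       x = "Miles per gallon", y = "Number of cars")

# Plot 2: outcome against a numerical variable, coloured by a categorical one.
p_scatter <- ggplot(cars, aes(x = wt, y = mpg, colour = cyl)) +
  geom_point() +
  labs(title = "Fuel economy by weight and cylinders",
       x = "Weight (1000 lb)", y = "Miles per gallon",
       colour = "Cylinders")

print(p_hist)
print(p_scatter)
```

Every axis and legend is given a readable label with labs() rather than left as a raw column name, which is what "well-labelled" is taken to mean here.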

3. Statistical analysis

Fit at least one regression model to address your research question. Your analysis should include:

A visualisation of the relationship between your outcome and explanatory variables
A regression table with coefficient estimates
An interpretation of the coefficients in plain language
A residual analysis to check model assumptions
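
The four items above might look roughly like this for a simple linear model. The sketch uses only base R and the built-in mtcars data. The model mpg ~ wt + am is an assumption chosen for illustration, and your own model and diagnostics may reasonably differ.

```r
# Built-in example data; replace with your own cleaned data frame.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# 1. Visualise the outcome against the explanatory variables.
plot(cars$wt, cars$mpg, col = as.integer(cars$am), pch = 19,
     xlab = "Weight (1000 lb)", ylab = "Miles per gallon")
legend("topright", legend = levels(cars$am),
       col = seq_along(levels(cars$am)), pch = 19)

# 2. Fit a linear regression with one numerical and one categorical predictor.
fit <- lm(mpg ~ wt + am, data = cars)

# 3. Regression table: estimates, standard errors, t values, p-values.
coef_table <- summary(fit)$coefficients
round(coef_table, 3)

# Interpretation guide (fill in the numbers from the table above):
#   wt       : expected change in mpg per extra 1000 lb, transmission fixed
#   ammanual : expected difference in mpg, manual vs automatic, weight fixed

# 4. Residual analysis: look for curvature or a funnel shape.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points should fall close to the reference line.
qqnorm(resid(fit))
qqline(resid(fit))
```

A residuals-versus-fitted plot with no visible pattern, together with a roughly straight Q-Q plot, is consistent with the linearity, constant-variance and normality assumptions. Clear curvature or fanning is worth reporting as a limitation.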

4. Discussion

Interpret your findings in relation to your research question. Address:

What do your results mean?
What are the limitations of your analysis?
What further questions does your work raise?

5. Citations and references

List all data sources, R packages, and literature that you have used or cited.

Assessment criteria

Your project will be assessed on four criteria:

Criterion	What we are looking for
Technical skills	Code runs without errors; document renders cleanly to HTML; code is readable and well-commented
Data wrangling	Data is loaded, cleaned, and prepared appropriately; variables are handled correctly
Visualisation	Plots are informative, clearly labelled, and appropriate for the data type
Interpretation	Findings are explained accurately and in plain language; limitations are acknowledged

Choosing a dataset

You are free to use any dataset that interests you, as long as it has at least one numerical outcome variable and at least two explanatory variables (at least one numerical, at least one categorical).
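
One quick way to check a candidate dataset against this requirement is sketched below. The helper check_dataset() is hypothetical (it is not part of the template or of any package) and treats factor, character and logical columns as categorical, which is an assumption about how your data is stored.

```r
# Hypothetical helper: checks a data frame against the minimum
# requirements for a chosen outcome column.
check_dataset <- function(data, outcome) {
  stopifnot(is.data.frame(data), outcome %in% names(data))

  # Every column other than the outcome is a potential explanatory variable.
  explanatory <- data[, setdiff(names(data), outcome), drop = FALSE]

  is_num <- vapply(explanatory, is.numeric, logical(1))
  is_cat <- vapply(
    explanatory,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )

  c(
    numeric_outcome         = is.numeric(data[[outcome]]),
    numeric_explanatory     = any(is_num),
    categorical_explanatory = any(is_cat)
  )
}

# iris has a numerical outcome, numerical predictors and one factor,
# so all three checks are TRUE.
check_dataset(iris, outcome = "Sepal.Length")
```

Categories stored as numbers (for example 0/1 codes) are counted as numerical by this check until you convert them with factor(), so a FALSE for the categorical check does not necessarily rule a dataset out.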

Some good places to find datasets:

TidyTuesday — weekly datasets shared by the R community
Kaggle — large collection of datasets across many topics
Our World in Data — data on global development, health, and society
e-Stat — Japanese government statistics portal
Google Dataset Search — search engine for publicly available datasets

A good dataset for this course

A good dataset has between 100 and 10,000 rows, contains a mix of numerical and categorical variables, and is on a topic you are genuinely curious about. Avoid datasets that are too clean: a little messiness gives you useful practice in data cleaning.

Submission

Submit your rendered HTML file on Moodle by the deadline. Name your file data-project_yourname.html.

Your document must be fully reproducible — the teacher should be able to re-render it from the source .qmd file and get the same output. Include your source .qmd file in your submission.