Data Project
The data project is the main assessment for this course. There is no final exam — your project is your final submission.
Working independently, you will find a real dataset, explore it using the tools you have learned in this course, and write up your findings as a fully reproducible R document.
Download the project template
The template below guides you through each stage of the project. It includes example code and explanations for every section. Use it as your starting point.
Unzip the file and open data-project.Rproj in RStudio to get started.
Project folder structure
Your project folder is organised as follows:
data-project/
├── data-project.Rproj
├── data-project.qmd
├── data_raw/ # your original data file — never modify this
├── data_processed/ # optional: cleaned data saved from R
├── figures/ # plots saved with ggsave()
└── scripts/ # optional: separate R scripts for long tasks
A few important habits:
- Raw data is read-only. Place your downloaded dataset in data_raw/ and never edit it directly. Always read from data_raw/ in your code.
- Save cleaned data to data_processed/ if your cleaning steps are complex; for most projects, cleaning inside the .qmd is fine.
- Save your plots to figures/ using ggsave() with descriptive file names.
- The scripts/ folder is there if you need it, for example if your data cleaning is very long and you want to keep it separate. Most students will not need it.
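The habits above can be sketched in a few lines of R. This is a minimal illustration, not the template's own code: the built-in mtcars data stands in for a downloaded dataset, the file names are hypothetical, and a temporary folder plays the role of the project root so the sketch runs anywhere. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Stand-in project root so the sketch runs anywhere (assumption). In your own
# project the working directory is the folder that holds data-project.Rproj,
# so plain relative paths such as "data_raw/my-file.csv" are enough.
root <- file.path(tempdir(), "data-project")
for (d in c("data_raw", "data_processed", "figures")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Simulate a downloaded dataset (hypothetical file name). In a real project
# this file arrives by download and is never written from R.
raw_path <- file.path(root, "data_raw", "cars_raw.csv")
write.csv(mtcars, raw_path, row.names = TRUE)

# Always read from data_raw/ and leave the file itself untouched
cars_raw <- read.csv(raw_path, row.names = 1)

# Clean in R: recode the 0/1 transmission column as a labelled factor
cars_clean <- transform(
  cars_raw,
  am = factor(am, labels = c("automatic", "manual"))
)

# Optionally save the cleaned data to data_processed/
clean_path <- file.path(root, "data_processed", "cars_clean.csv")
write.csv(cars_clean, clean_path, row.names = FALSE)

# Save a plot to figures/ with a descriptive file name
p <- ggplot(cars_clean, aes(x = wt, y = mpg, colour = am)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Transmission")
fig_path <- file.path(root, "figures", "mpg-vs-weight-by-transmission.png")
ggsave(fig_path, plot = p, width = 6, height = 4, dpi = 300)
```

Because every change happens in R, re-running the code always rebuilds the cleaned data and the figure from the original file.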
Timeline
Work on your project progressively throughout the semester rather than leaving it all to the end. Use the Pre sections of the template to share your progress with the teacher before the final submission.
| Stage | What to submit | When |
|---|---|---|
| Pre_1–Pre_3 | Research question, data description, data loaded into R | Session 9 |
| Pre_4–Pre_6 | Data cleaned and explored | Session 11 |
| Final project | Complete write-up rendered to HTML | Last day of term |
Project structure
The template walks you through each section with examples and prompts. The template is a guide, not a strict requirement — you are free to adapt the structure to suit your data and research question. What matters is that your final document tells a clear, well-supported story.
1. Introduction
State your research question clearly. What are you trying to find out, and why is it interesting? Describe your dataset — where does it come from, what does it contain, and what are the key variables?
2. Exploratory data analysis
Before fitting any models, explore your data visually and numerically. Include:
- Summary statistics for your key variables
- At least two well-labelled plots that reveal something interesting about your data
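As a rough sketch of what this section might contain, the code below computes summary statistics and builds two labelled plots. The built-in mtcars data is a stand-in for your own dataset, and the choice of variables is an assumption made purely for illustration. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R; replace it with your own data frame
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Summary statistics for the key variables
summary(cars[, c("mpg", "wt", "am")])

# Mean and standard deviation of the outcome within each group
mpg_by_am <- aggregate(
  mpg ~ am,
  data = cars,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)
mpg_by_am

# Plot 1: distribution of the outcome variable
p_hist <- ggplot(cars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(
    title = "Distribution of fuel efficiency",
    x = "Miles per gallon",
    y = "Number of cars"
  )

# Plot 2: outcome compared across a categorical variable
p_box <- ggplot(cars, aes(x = am, y = mpg)) +
  geom_boxplot() +
  labs(
    title = "Fuel efficiency by transmission type",
    x = "Transmission",
    y = "Miles per gallon"
  )

p_hist
p_box
```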
3. Statistical analysis
Fit at least one regression model to address your research question. Your analysis should include:
- A visualisation of the relationship between your outcome and explanatory variables
- A regression table with coefficient estimates
- An interpretation of the coefficients in plain language
- A residual analysis to check model assumptions
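The four items above could look roughly like the following. This is a sketch under assumptions, not a required recipe: mtcars stands in for your data, and the model (fuel efficiency explained by weight and transmission type) is chosen only to show one numerical and one categorical explanatory variable. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Stand-in data: replace with your own cleaned data frame
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Visualise the relationship between the outcome and explanatory variables
p_fit <- ggplot(cars, aes(x = wt, y = mpg, colour = am)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Transmission")

# Fit a linear regression with one numerical and one categorical predictor
fit <- lm(mpg ~ wt + am, data = cars)

# Regression table: estimate, standard error, t value, and p-value per term
coef_table <- summary(fit)$coefficients
round(coef_table, 3)

# Reading the table: the wt estimate is the expected change in mpg for each
# extra 1000 lbs, holding transmission type fixed; the ammanual estimate is
# the expected difference between manual and automatic cars of equal weight.

# Residual analysis: residuals vs fitted values, and a normal Q-Q plot
diagnostics <- data.frame(fitted = fitted(fit), resid = resid(fit))

p_resid <- ggplot(diagnostics, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

p_qq <- ggplot(diagnostics, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")

p_fit
p_resid
p_qq
```

In the residual plot, a random scatter around the dashed line supports the linearity and constant-variance assumptions; a curve or funnel shape suggests the model needs rethinking.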
4. Discussion
Interpret your findings in relation to your research question. Address:
- What do your results mean?
- What are the limitations of your analysis?
- What further questions does your work raise?
5. Citations and references
List any data sources, packages, and literature you have cited.
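R can generate the citation for itself and for any installed package, which may help when compiling this list. A small sketch, assuming ggplot2 is one of the packages you used:

```r
# Citation for R itself
r_citation <- citation()
r_citation

# Citation for a specific package (ggplot2 is used here as an example)
pkg_citation <- citation("ggplot2")
pkg_citation

# The same entry in BibTeX form, if you keep references in a .bib file
toBibtex(pkg_citation)
```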
Assessment criteria
Your project will be assessed on four criteria:
| Criterion | What we are looking for |
|---|---|
| Technical skills | Code runs without errors; document renders cleanly to HTML; code is readable and well-commented |
| Data wrangling | Data is loaded, cleaned, and prepared appropriately; variables are handled correctly |
| Visualisation | Plots are informative, clearly labelled, and appropriate for the data type |
| Interpretation | Findings are explained accurately and in plain language; limitations are acknowledged |
Choosing a dataset
You are free to use any dataset that interests you, as long as it has at least one numerical outcome variable and at least two explanatory variables (at least one numerical, at least one categorical).
Some good places to find datasets:
- TidyTuesday — weekly datasets shared by the R community
- Kaggle — large collection of datasets across many topics
- Our World in Data — data on global development, health, and society
- e-Stat — Japanese government statistics portal
- Google Dataset Search — search engine for publicly available datasets
A good dataset has between 100 and 10,000 rows, contains a mix of numerical and categorical variables, and is on a topic you are genuinely curious about. Avoid datasets that are too clean — a little messiness is good practice.
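Once a candidate dataset is loaded, a quick check against these guidelines might look like the sketch below. The function check_dataset() is a hypothetical helper written for this illustration, not part of the template, and the built-in iris data stands in for your own.

```r
# Hypothetical helper: summarises a data frame against the guidelines above
check_dataset <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_cat <- vapply(
    df,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )
  list(
    n_rows        = nrow(df),
    n_numeric     = sum(is_num),
    n_categorical = sum(is_cat),
    # Guideline: between 100 and 10,000 rows
    size_ok       = nrow(df) >= 100 && nrow(df) <= 10000,
    # One numerical outcome plus one numerical explanatory variable means at
    # least two numeric columns, plus at least one categorical column
    variables_ok  = sum(is_num) >= 2 && sum(is_cat) >= 1,
    # Total number of missing cells, a rough indicator of messiness
    n_missing     = sum(is.na(df))
  )
}

# Stand-in data: iris ships with R
check <- check_dataset(iris)
str(check)
```

One caveat: a categorical variable stored as numbers (for example 0/1 codes) is counted as numerical here, so inspect the columns yourself as well.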
Submission
Submit your rendered HTML file on Moodle by the deadline. Name your file data-project_yourname.html.
Your document must be fully reproducible — the teacher should be able to re-render it from the source .qmd file and get the same output. Include your source .qmd file in your submission.
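Two small habits help a re-render give the same output: fixing the random seed before any step that uses random numbers, and recording the software versions used. The sketch below is an illustration only; the seed value is arbitrary and mtcars is a stand-in dataset.

```r
# Fix the seed before any random step (sampling rows, jittered points, etc.)
set.seed(2024)  # arbitrary value; any fixed integer works
sample_a <- sample(nrow(mtcars), size = 10)

# Re-running with the same seed reproduces the same draw
set.seed(2024)
sample_b <- sample(nrow(mtcars), size = 10)
identical(sample_a, sample_b)

# Record the R version and loaded packages at the end of the document
sessionInfo()
```

Reading data with paths relative to the project folder (for example from data_raw/) rather than absolute paths on your own computer also keeps the document renderable on another machine.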