E Group project
You will work in groups of 3 to 5 members. It is up to you to form the groups based on your grade expectations, affinity, complementary skills, etc. You must communicate the group composition no later than November 30th in a single email to email@example.com listing the members of the group and, if you have it, a preliminary description of the topic of the project (~ 3 lines).
Aim of the project
You will analyze a real dataset of your choice using the statistical methodology that we have seen in the lessons and labs. The purpose is to demonstrate that you know how to apply and interpret some of the studied statistical techniques (such as simple/multiple linear regression, logistic regression, or any other methods covered in the course) in a real-case scenario that is appealing to you.
Structure of the report
Use the following mandatory structure when writing your report:
- Abstract. Provide a concise summary of the project. It must not exceed 250 words.
- Introduction. State the problem to be studied. Provide some context, the question(s) that you want to address, the motivation for its importance, references, etc. Use the way we introduced the case studies covered in the course as a template (but you will need to elaborate more).
- Statistical analysis. Make use of some of the aforementioned statistical techniques, choosing the ones that are most suitable for your particular case study. You can either cover several techniques at a more superficial level, or one or two in more depth. Justify their adequacy, carry out the analyses, and present the results as plots and summaries, explaining how you obtained them. Provide a critical discussion of the outputs and give insights about them.
- Conclusions. Summary of what was addressed in the project and of the most important conclusions. Takeaway messages. The conclusions are not required to be spectacular, but fair and honest in terms of what you discovered.
- References. Cite the sources of information that you have employed (for the data, for information on the data, for the statistical analyses, etc.).
Mandatory format guidelines:
- Structure: title, authors, abstract, first section, second section and so on. Like this. Do not use a cover page.
- Font size: 12 points.
- Spacing: single space, single column.
- Length limit: fewer than 5000 words and fewer than 15 pages. You are not required to make use of all the space.
All students in a group will receive the same grade. Take this into account when forming the group. The grading is on a scale of 0-10 (plus 2 bonus points) and follows this breakdown:
- Originality of the problem studied and data acquisition process (up to 2 points).
- Statistical analyses presented and their depth (up to 3 points). At least two different techniques should be employed (simple and multiple linear regression count as different, but the use of other techniques as well is strongly encouraged). Graded depending on their adequacy to the problem studied and on the knowledge of them that you demonstrate.
- Accuracy of the interpretation of the analyses (up to 2 points). Graded depending on the detail and rigor of the insights you elaborate from the statistical analyses.
- Reproducibility of the study (1.5 points). Awarded if the code for reproducing the study, as well as the data, is provided in a ready-to-use way (e.g. the outputs from R Commander’s report mode along with the data).
- Presentation of the report (1.5 points). This involves the correct usage of English, the readability, the conciseness and the overall presentation quality.
- Excellence (2 bonus points). Awarded for creativity of the analysis, use of advanced statistical tools, use of points briefly covered in lessons/labs, advanced insights into the methods, completeness of the report, use of advanced presentation tools, etc. Only awarded if the sum of regular points is above 7.5.
The ratio of project quality to group size might be taken into account in extreme cases (e.g. a poor report written by 5 people, or an extremely good report written by 3 people).
Evidence of academic fraud will have serious consequences, such as a zero grade for the whole group and the reporting of the fraud to the pertinent academic authorities. Academic fraud includes (but is not limited to) plagiarism, use of sources without proper credit, project outsourcing and the use of external tutoring not mentioned explicitly.
- Think about a topic that could be reused for other subjects, or take inspiration from previous projects you did. In that way, this project could serve as the quantitative part of another subject’s project. If you do this, mention it explicitly in the report.
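The arithmetic of the breakdown above can be sketched in R as follows. This is only an illustration: `project_grade()` and `max_points` are hypothetical names, not an official grading tool, and the sketch ignores the group-size adjustment reserved for extreme cases. It also assumes the bonus is simply added on top of the regular points, since no cap on the total is stated.

```r
# Maximum points per regular criterion, taken from the grading breakdown.
max_points <- c(originality = 2, analyses = 3, interpretation = 2,
                reproducibility = 1.5, presentation = 1.5)

# Hypothetical helper: combines regular points with the excellence bonus.
project_grade <- function(points, bonus = 0) {
  # All five criteria must be given, each within its allowed range.
  stopifnot(setequal(names(points), names(max_points)))
  points <- points[names(max_points)]
  stopifnot(all(points >= 0), all(points <= max_points),
            bonus >= 0, bonus <= 2)
  regular <- sum(points)
  # The bonus only counts if the regular points are above 7.5.
  if (regular > 7.5) regular + bonus else regular
}

# Regular points add up to 8, so 1 bonus point is added.
project_grade(c(originality = 1.5, analyses = 2.5, interpretation = 1.5,
                reproducibility = 1.5, presentation = 1), bonus = 1)  # 9

# Regular points add up to 7, so the bonus is ignored.
project_grade(c(originality = 1, analyses = 2, interpretation = 1.5,
                reproducibility = 1.5, presentation = 1), bonus = 2)  # 7
```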
- Data sources. Here are some useful data sources:
- A list of all the datasets included in R. See Section 2.9.1 of the lab notes for how to load them.
- Some datasets employed in the course.
- The World Bank contains a huge collection of economic and sociological variables for countries and regions, for long periods of time.
- SIPRI contains several databases about international transfers of arms.
- The Global Health Observatory is the World Health Organization’s main health statistics repository.
- Sport statistics (teams, players) are a great source if you like sports. Sport webpages usually have a section on statistics.
- Inspiration for the project’s topic.
- The case studies covered (and left as exercises) in the lab notes might serve as a good starting point for defining a project.
- The Economist usually has some good and up-to-date political/economic analyses that could serve as motivation.
- Try to quantify the impact on society of certain laws (traffic, education, gender violence, etc.).
- Is there a continuous variable that you would like to predict from others? (linear regression)
- Is there a binary variable that you would like to predict from others? (logistic regression)
- Would you like to assess which combinations of variables explain most of the variability of your data, so you can visualize it in 2D or 3D? (principal component analysis)
- Would you like to group individuals according to several characteristics in order to classify them? (clustering)
- Use R Commander’s report mode (Appendix D) to simplify the generation of graphs and summaries directly from the statistical analysis. Use that code to make the analysis reproducible.
- Make use of office hours before it is too late.
- Pro-tip: if you come to my office with a printed draft of the project, I can give you some quick feedback on what could be improved.
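The four technique questions in the tips above (linear regression, logistic regression, principal component analysis, clustering) can be illustrated with a minimal R sketch. Everything specific in it is an assumption made for the example only: the built-in datasets `mtcars` and `iris`, the chosen variables, the number of groups, and the use of k-means as the clustering method. Your project should use its own data and the methods that suit it.

```r
# Browse the datasets shipped with base R (first few entries only).
head(data(package = "datasets")$results[, c("Item", "Title")])

# Load two of them for the illustration.
data(mtcars)  # design and fuel consumption of 32 cars
data(iris)    # flower measurements for 150 irises of 3 species

# Continuous response predicted from others: multiple linear regression.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)

# Binary response predicted from others: logistic regression.
# am is coded 0 (automatic) / 1 (manual).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_glm)

# Combinations of variables explaining most of the variability:
# principal component analysis on the standardized measurements.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)  # proportion of variance explained by each component

# Grouping individuals by several characteristics: k-means clustering.
# Three groups is an arbitrary choice for this example.
set.seed(1)  # k-means uses random starts; fix the seed for reproducibility
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)
table(cluster = km$cluster, species = iris$Species)
```

Calling `biplot(pca)` afterwards draws the data on the first two principal components, which gives the 2D view mentioned above. Keeping the whole analysis in a script like this, together with the data, is one way to make the study ready to reproduce.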
Submit the reports through Aula Global before 16:59 on December 23rd. Do not submit by email. Reports received after the deadline will not be evaluated.