ββ Attaching core tidyverse packages ββββββββββββββββββββββββ tidyverse 2.0.0 ββ
β dplyr 1.1.4 β readr 2.1.5
β forcats 1.0.0 β stringr 1.5.1
β ggplot2 3.5.1 β tibble 3.2.1
β lubridate 1.9.4 β tidyr 1.3.1
β purrr 1.0.2
ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
β dplyr::filter() masks stats::filter()
β dplyr::lag() masks stats::lag()
βΉ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 252,565
Columns: 7
$ athlete_name <chr> "A Dijiang", "A Lamusi", "Gunnar Aaby", "Edgar Aabye", "Cβ¦
$ year <int> 1992, 2012, 1920, 1900, 1932, 1932, 1952, 2000, 1996, 200β¦
$ sex <fct> M, M, M, M, F, F, M, M, F, F, M, M, M, M, M, M, M, M, M, β¦
$ country <fct> "China", "China", "Denmark", "Denmark/Sweden", "Netherlanβ¦
$ sport <fct> "Basketball", "Judo", "Football", "Tug-Of-War", "Athleticβ¦
$ medal <fct> No medal, No medal, No medal, Gold, No medal, No medal, Nβ¦
$ event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweiβ¦
Question1-Analysis
How has gender equality in athlete participation evolved across Olympic history, and what trends are visible in the representation of men and women from 1896 to 2024?
Yr_num <- olympics_dfclean |>distinct(athlete_name, year, sex) |># Ensure each athlete is only counted once per year per sex (avoid duplicates)group_by(year, sex) |>summarise(athlete_count =n()) |># Count the number of athletes per year and gendergroup_by(year) |># Group again by year to compute total athletes in each Olympic yearmutate(total =sum(athlete_count), # Total athletes for that yearpercentage =round((athlete_count / total) *100, 1)) |># Percentage of each genderarrange(year, sex)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# View the tableYr_num
# A tibble: 61 Γ 5
# Groups: year [31]
year sex athlete_count total percentage
<int> <fct> <int> <int> <dbl>
1 1896 M 176 176 100
2 1900 F 23 1215 1.9
3 1900 M 1192 1215 98.1
4 1904 F 6 642 0.9
5 1904 M 636 642 99.1
6 1906 F 6 838 0.7
7 1906 M 832 838 99.3
8 1908 F 44 2008 2.2
9 1908 M 1964 2008 97.8
10 1912 F 53 2386 2.2
# βΉ 51 more rows
Key Insight:
The Olympic Games have transformed in terms of gender representation from 1896 to 2024. Initially, female participation was minimal, with women making up less than 3% of athletes before 1920. This changed gradually, reaching about 10β15% by the 1960s. Significant growth occurred in the 1970s, followed by more increases in the 1980s and 1990s. By 2000, women represented 38.2% of athletes, climbing to 44.3% by 2012. The most recent Paris Olympics also saw a female athlete representation of 47.8% in 2020 and 49.1% in the 2024 Paris Olympics.
# Save the interactive plothtmlwidgets::saveWidget(p1_plotly, "out/interactive_plot1.html", selfcontained =TRUE)
Question2-Analysis
Among the most popular Olympic sports since 1992, which show a balanced gender representation, and which show persistent gender disparities?
# Get top 10 sports by total number of athletestop_sport <- olympics_dfclean |>filter(year >=1992) |># Focus only on Olympic years from 1992 onwarddistinct(athlete_name, year, sport, sex) |># Remove duplicate entries group_by(sport) |>summarise(athlete_num =n()) |># Count number of unique athlete entries per sporttop_n(10, athlete_num) # Select top 10 sports with the highest participation# Get the gender breakdown within those top sportsgender_sport <- olympics_dfclean |>filter(year >=1992, sport %in% top_sport$sport) |>distinct(athlete_name, year, sport, sex) |>group_by(sport, sex) |>summarise(athlete_num =n()) |>group_by(sport) |># Group again to calculate percentage per sportmutate(total =sum(athlete_num), #Create a new column 'total': total number of athletes (male + female) in each sportpercentage =round((athlete_num / total) *100, 1)) |># Create a new column 'percentage': % of athletes by gender in that sportarrange(sport, sex)
`summarise()` has grouped output by 'sport'. You can override using the
`.groups` argument.
gender_sport
# A tibble: 20 Γ 5
# Groups: sport [10]
sport sex athlete_num total percentage
<fct> <fct> <int> <int> <dbl>
1 Athletics F 8160 18372 44.4
2 Athletics M 10212 18372 55.6
3 Cycling F 956 3376 28.3
4 Cycling M 2420 3376 71.7
5 Football F 1516 4042 37.5
6 Football M 2526 4042 62.5
7 Handball F 1429 3037 47.1
8 Handball M 1608 3037 52.9
9 Hockey F 1563 3341 46.8
10 Hockey M 1778 3341 53.2
11 Judo F 1471 3531 41.7
12 Judo M 2060 3531 58.3
13 Rowing F 1875 5004 37.5
14 Rowing M 3129 5004 62.5
15 Sailing F 1202 3538 34
16 Sailing M 2336 3538 66
17 Shooting F 1328 3489 38.1
18 Shooting M 2161 3489 61.9
19 Swimming F 3515 7900 44.5
20 Swimming M 4385 7900 55.5
Key Insight:
The above analysis indicates that gender representation in the top 10 Olympic sports since 1992 has become more balanced in some areas, although notable differences still exist. Sports such as Handball and Hockey display relatively even participation between men and women, with women representing approximately 47% of all athletes. On the contrary, some sports, like Cycling and sailing, have a lower female representation. For instance, in Cycling, women account for around 28.3% of participants, with a difference of over 1,464 athletes compared to men. Female participation rates in widely participated sports like Athletics and Swimming are around 44β45%, which indicates that gradual movement toward greater inclusion.
Question2-Visualization
p2 <-ggplot(gender_sport, aes(x =reorder(sport, -total), y = athlete_num, fill = sex)) +geom_bar(stat ="identity", position ="stack", width =0.7) +scale_fill_manual(values =c("F"="#f7b6d2", "M"="#aec7e8"), labels =c("Female", "Male")) +labs(title ="Gender Distribution in Top 10 Olympic Sports (Since 1992)", x ="Sport", y ="No. of Athletes", caption ="Source: Kaggle | Author: Brandon") +theme_light(base_size =13) +theme(plot.title =element_text(face ="bold", size =13, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1),legend.position ="top",legend.title =element_text(size =12),legend.text =element_text(size =12))ggplotly(p2)
p2_plotly <-ggplotly(p2)# Save the interactive plothtmlwidgets::saveWidget(p2_plotly, "out/interactive_plot2.html", selfcontained =TRUE)
Question3-Analysis
Which countries had gender-balanced athlete representation (45%β55% female) at the 2024 Olympics, and how are these countries distributed globally?
female_participation <- olympics_dfclean |>filter(year ==2024) |>distinct(country, sex, athlete_name) |># Ensure each athlete is counted only once per country/sexgroup_by(country, sex) |>summarise(athletes_num =n()) |># Count athletes by country and genderpivot_wider( names_from = sex,values_from = athletes_num,values_fill =0) |># Reshape so that the Female and Male athlete are separate columnsmutate(total = F + M, # Create a total column for calculating total athletes per countrypercentage_female =round(F / total *100, 1)) # Calculate % of female athletes
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
#Keep countries with at least 30 total athletesfemale_participation_filtered <- female_participation |>filter(total >=30)#Identify gender-balanced countries (45β55% female)balanced_countries <- female_participation_filtered |>filter(percentage_female >=45& percentage_female <=55) |># Filter to near-balancemutate(diff_from_50 =abs(percentage_female -50)) |># How close each country is to 50%arrange(diff_from_50) # Sort from most balanced to least (within the range above)balanced_countries
# A tibble: 31 Γ 6
# Groups: country [31]
country F M total percentage_female diff_from_50
<fct> <int> <int> <int> <dbl> <dbl>
1 Slovenia 47 47 94 50 0
2 Belgium 88 89 177 49.7 0.300
3 Spain 202 198 400 50.5 0.5
4 Portugal 37 38 75 49.3 0.700
5 Germany 225 232 457 49.2 0.800
6 Jamaica 32 33 65 49.2 0.800
7 France 295 306 601 49.1 0.900
8 Romania 47 49 96 49 1
9 Ukraine 69 72 141 48.9 1.10
10 New Zealand 101 107 208 48.6 1.40
# βΉ 21 more rows
# Count how many countries had at least 30 athletes n_total_countries <-n_distinct(female_participation_filtered$country)n_total_countries
[1] 66
# Count how many countries are gender-balancedn_balanced <-nrow(balanced_countries)n_balanced
[1] 31
Key Insight:
The above analysis looked at 2024 Olympic teams and identified countries with gender-balanced participation (45%β55% female athletes). among countries with at least 30 total athletes, 31 countries fell within this range. At the top of the list, Slovenia had a perfect 50/50 split between men and women (47 each). Countries like Belgium, Spain, Portugal, and Germany also had close-to-even splits, ranging from 49.2% to 50.5% female. Even larger delegations such as France, the United States, and Great Britain achieved balanced representation.
Question3-Visualization οΈ
# Merges gender participation data with world map shapes and classifies countriesworld_map <-ne_countries(scale ="medium", returnclass ="sf")df_merge <-left_join(world_map, female_participation_filtered, by =c("name_long"="country")) |># Merge the cleaned Olympics gender data (female_participation_filtered) with the map data (world_map)mutate(percentage_female =replace_na(percentage_female, 0), # NA for percentage_female, if didnt have the Olympics databalanced =if_else(percentage_female >=45& percentage_female <=55, TRUE, FALSE)) #Like excel operation, assigns TRUE if the countryβs female participation is between 45% and 55%, Otherwise, assigns FALSEmap_graph <-ggplot(df_merge) +geom_sf(aes(geometry = geometry, fill = balanced), color ="white", size =0.2) +scale_fill_manual(values =c("TRUE"="#E53888", "FALSE"="grey90"),name ="Gender-Balanced (45β55%)",labels =c("FALSE"="Not Balanced", "TRUE"="Balanced")) +labs(title ="Countries with Gender-Balanced Teams at the 2024 Olympics",subtitle ="Only countries with β₯ 30 athletes") +coord_sf(expand =FALSE) +theme_void(base_size =18) +theme(legend.position ="bottom",legend.title =element_text(size =16),legend.text =element_text(size =14),plot.title =element_text(hjust =0.5, size =16, face ="bold"),plot.subtitle =element_text(hjust =0.5, size =16),plot.caption =element_text(hjust =0.5, size =16),plot.margin =margin(10, 10, 10, 10))map_graph
# save the plot as a fileggsave("out/plot3.png", map_graph, width =11, height =6, dpi =300)
Question4-Analysis
Among countries with at least 30 total athletes, which had the lowest percentage of female representation at the 2024 Olympics?
female_participation <- olympics_dfclean |>filter(year ==2024) |>distinct(country, sex, athlete_name) |>group_by(country, sex) |>summarise(athletes =n()) |>pivot_wider(names_from = sex, # Create new columns based on unique values in 'sex'values_from = athletes, # Fill those new columns with numbers from the athletes columnvalues_fill =0 ) |>#If country has no athletes for gender, fill with 0 instead of NAmutate(total = F + M, # Create new column for total athletes per countrypercentage_female =round(F / total *100, 1)) # % of athletes who are female
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
female_participation_filtered <- female_participation |>filter(total >=30)# Identify bottom 15 countries by % female athletesbottom15 <- female_participation_filtered |>arrange(desc(percentage_female)) |>tail(15) |># Sort the last 15 rows arrange(percentage_female) |>mutate(percentage_male =round(100- percentage_female, 1)) # Add % male for the analysis table# Calculate additional gender gap columnbottom15 <- bottom15 |>mutate(gap_size =abs(percentage_female - percentage_male)) # Add gap size column (% difference)bottom15
This analysis identifies the 15 countries with the lowest percentage of female athletes at the 2024 Summer Olympics, based on delegations with at least 30 total participants. Croatia had the lowest proportion, with women making up only 20.5% of its Olympic team. Other countries with low female representation included Argentina (23.1%), Iran (26.8%), and Morocco (29.5%).Most of these countries had less than 40% female participation, revealing that gender gaps still persist in Olympic representation in certain regions.
Question4-Visualization οΈ
gap_graph <-ggplot(bottom15, aes(y =reorder(country, percentage_female))) +# Gradient segmentgeom_segment(aes(x = percentage_female, xend = percentage_male,yend = country, color = gap_size), size =2.5) +# Gradient legendscale_color_gradient(low ="#FFE5B4", high ="#D73027", name ="Gap Size (%)") +guides(color =guide_colorbar(barheight =unit(0.4, "npc"), barwidth =unit(0.02, "npc"))) +# Female pointgeom_point(aes(x = percentage_female), color ="deeppink", size =7) +geom_text(aes(x = percentage_female, label =paste0(percentage_female, "%")),color ="deeppink", size =8, fontface ="bold",hjust =1, nudge_x =-6) +# Male pointgeom_point(aes(x = percentage_male), color ="steelblue", size =7) +geom_text(aes(x = percentage_male, label =paste0(percentage_male, "%")),color ="steelblue", size =8, fontface ="bold",hjust =0, nudge_x =6) +# X-axis scalescale_x_continuous(limits =c(0, 100),expand =expansion(mult =c(0.05, 0.05))) +# Labelslabs(title ="Gender Gap in Countries with Lowest % of Female Athletes (2024 Olympics)",subtitle ="Gradient lines reflect how large the gap is between male and female athletes",x ="% Female vs % Male",y ="Country",caption ="Filtered: Countries with β₯ 30 athletes") +theme_light(base_size =25) +theme(plot.title =element_text(face ="bold", hjust =0.5, size =30),plot.subtitle =element_text(size =25, hjust =0.5, color ="gray40"),plot.caption =element_text(size =23, hjust =0.5), legend.title =element_text(size =18), legend.text =element_text(size =16), axis.title.x =element_text(size =20, face ="bold", margin =margin(t =12)), axis.title.y =element_text(size =24),axis.text.x =element_text(size =22),axis.text.y =element_text(size =22, lineheight =1.5),plot.margin =margin(15, 20, 15, 20))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
βΉ Please use `linewidth` instead.
# Save the graphggsave("out/plot4.png", gap_graph, width =16, height =12, dpi =400, units ="in")
Question5-Analysis
Among all female athletes, which ones have won the most gold medals for their country? (One top athlete per country)
#Filter only female athletes who won gold medalstop_female_gold <- olympics_dfclean |>filter(sex =="F", medal =="Gold") |>#Remove duplicates β only count unique medals by athlete, year, and eventdistinct(athlete_name, year, event, country, medal) |>#Count the number of gold medals per athlete within each countrygroup_by(country, athlete_name) |>summarise(gold_medals =n()) |>#For each country, keep only the athlete with the highest gold medal countgroup_by(country) |>filter(gold_medals ==max(gold_medals)) |>filter(row_number() ==1) |>#Sort by number of gold medals and keep only the top 10 countriesarrange(desc(gold_medals)) |>head(10)
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
top_female_gold
# A tibble: 10 Γ 3
# Groups: country [10]
country athlete_name gold_medals
<fct> <chr> <int>
1 Soviet Union Larysa (diriy-) 9
2 New Zealand Lisa Carrington 8
3 United States Jennifer (-cumpelik) 8
4 Czechoslovakia Vra (-odloilov) 7
5 Russia Svetlana Romashina 7
6 East Germany Kristin Otto 6
7 Italy Maria Vezzali 6
8 Australia McKEON Emma 5
9 China Chen Ruolin 5
10 Germany Birgit Fischer-schmidt 5
Key Insight:
The above analysis highlights the top-performing female Olympic athletes based on the number of gold medals won, while selecting only one top athlete per country. The athlete with the most gold medals is Larysa from the Soviet Union, with 9 golds, followed closely by Lisa Carrington (New Zealand) and Jennifer (USA), each with 8 gold medals.
Question5-Visualization οΈ
p3 <-ggplot(top_female_gold, aes(x =reorder(athlete_name, gold_medals), y = gold_medals, color = country)) +geom_segment(aes(xend = athlete_name, y =0, yend = gold_medals), size =1.2) +geom_point(size =5) +coord_flip() +labs(title ="Top 10 Female Athletes with Most Gold Medals (One per Country)",subtitle ="Each line represents a country's top female gold medalist",x =NULL, # Remove "Athlete" label so as to reduce overlapy ="Gold Medals",color ="Country", caption ="Source: Kaggle | Author: Brandon") +theme_minimal(base_size =12) +theme(plot.title =element_text(face ="bold", hjust =0.5),plot.subtitle =element_text(hjust =0.5),legend.position ="right",axis.text.y =element_text(margin =margin(r =10))) # add the spacing so names will not overlap the labelp3_plotly <-ggplotly(p3)p3_plotly
# Save the plot as a filehtmlwidgets::saveWidget(p3_plotly, "out/interactive_plot5.html", selfcontained =TRUE)
Question6-Analysis
Which countries with substantial initial participation (β₯ 50 female athletes in 2000s) have demonstrated the fastest average percentage growth in supporting female Olympians from 2000 to 2024?
female_growth <- olympics_dfclean |>filter(sex =="F", year >=2000) |># Scope the the focus on only the female athletes from 2000 onwardsdistinct(athlete_name, country, year) |>group_by(country, year) |>summarise(female_athletes =n()) |>arrange(country, year) |>group_by(country) |>summarise(start =first(female_athletes), # First data point (starting value like 2000)end =last(female_athletes), # Last data point (ending value like 2024)years =n(), # Number of Olympic Games that take into accounttotal_growth_percentage =round(((end - start) / start) *100, 1), # Overall % growth from first to last year avg_growth_percentage =round(total_growth_percentage / (years -1), 1),) |># Average % growth per Olympic cyclefilter(start >=50, years >=3) |># Keep only countries that had at least 50 female athletes in the base year and appeared in at least 3 Gamesarrange(desc(avg_growth_percentage))
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
# Store top 10 countries into the top10 data frame top10_countries_df <-head(female_growth, 10)top10_countries <- top10_countries_df$country# Prepare trend data separately for the data visualization belowtrend_data <- olympics_dfclean |>filter(sex =="F", year >=2000, year <=2024, country %in% top10_countries) |>distinct(athlete_name, country, year) |>group_by(country, year) |>summarise(female_athletes =n())
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
trend_data
# A tibble: 70 Γ 3
# Groups: country [10]
country year female_athletes
<fct> <int> <int>
1 Brazil 2000 87
2 Brazil 2004 115
3 Brazil 2008 125
4 Brazil 2012 113
5 Brazil 2016 203
6 Brazil 2020 150
7 Brazil 2024 164
8 Canada 2000 142
9 Canada 2004 128
10 Canada 2008 146
# βΉ 60 more rows
Key Insight:
This analysis examines which countries with substantial female participation in the early 2000s (β₯ 50 athletes) have achieved the fastest average growth rates in supporting female Olympians by 2024. France leads the list, growing from 125 to 295 female athletes, an average increase of 22.7% per Olympics. Poland, Netherlands, Japan, and Spain also show strong, consistent growth, each with over 16% average growth across seven Olympic Games. Some large delegations such as the United States, Germany, and Great Britain experienced more modest growth, likely due to already high participation levels in 2000.
Question6-Visualization οΈ
# Custom colorsline_colors <-c("#FFB6C1", "#B0E0E6", "#98FB98", "#D8BFD8", "#F0E68C","#FFA07A", "#AFEEEE", "#90EE90", "#FFD700", "#9370DB")# Updated plotp4 <-ggplot(trend_data, aes(x = year, y = female_athletes, color = country, group = country)) +geom_line(size =1.2) +geom_point(size =2.5) +scale_color_manual(values = line_colors) +facet_wrap(~ country, ncol =2, scales ="free_y") +scale_y_continuous(labels = scales::comma) +scale_x_continuous(breaks =seq(2000, 2024, by =4), expand =expansion(mult =c(0.05, 0.1))) +labs(title ="Female Olympic Participation Trends by Country (2000β2024)",subtitle ="Top 10 countries with the fastest average growth in female athlete participation",x ="Olympic Year",y ="Number of Female Athletes",caption ="Source: Kaggle | Author: Brandon") +theme_light(base_size =23) +theme(plot.title =element_text(size =25, face ="bold", hjust =0.5),plot.subtitle =element_text(size =20, hjust =0.5, color ="gray30"),axis.title =element_text(size =20),axis.text.x =element_text(size =18, angle =45, hjust =1),axis.text.y =element_text(size =18, margin =margin(r =15)),strip.text =element_text(face ="bold", size =18), legend.position ="none",plot.caption =element_text(size =15))# Save the plot ggsave("out/plot6.png", p4, width =14, height =13, dpi =400)