DATA VISUALIZATION WITH R
A disease is a condition that develops gradually over a relatively long period of time [1]. It can be brought on by internal dysfunction or by external sources such as infections. Internal immune system abnormalities, for instance, can result in a wide range of diseases, including different types of immunodeficiency, hypersensitivity, allergies, and autoimmune disorders.
Influenza is a contagious viral infection primarily brought on by the influenza virus A or B. Although the heart, brain, and muscles can also be impacted, the upper respiratory system—which includes the nose, throat, bronchi, and infrequently the lungs—is where it most commonly manifests. It occurs frequently and causes a sizable amount of morbidity and mortality based on the pandemic, epidemic, or seasonal patterns.
The flu may resemble a common cold, with a runny nose, sneezing, and a sore throat. Its symptoms also include fever, aching muscles, chills and sweats, and headache. The flu mostly gets better with time, but its consequences can occasionally be fatal, particularly when it occurs alongside age-related illnesses or immune deficiencies. Although the annual influenza vaccine isn’t 100% effective, it lowers the risk of developing serious infection-related consequences [2].
In this research, we will analyze a plethora of data that includes specifics about Influenza that occurred throughout the United States from 1888 to 2013. Project Tycho, which collaborates with national and international health institutes and researchers to make data more accessible for use in advancing global health, is the source of this information.
The analysis is carried out by visualizing the data.
Every year, a significant proportion of the population in the United States falls sick with the flu, which can also lead to severe complications that end in hospitalization or even death. Influenza viruses are highly unpredictable, and they can create new difficulties every year for vaccine producers, public health organizations, healthcare professionals, and patients [3].
This underlines the significance of influenza research. The genetic makeup of influenza viruses, as well as the resultant proteins that interact with the immune system, is constantly evolving. Based on this, I formulated some research questions to investigate the prevalence of the influenza virus across the states of the United States in the 19th and 20th centuries.
In this project we study the spread of influenza across the USA to find meaningful patterns in the data through visualization.
The main objectives of the project are to analyze the data and to study the behavior of influenza in the USA.
Through the use of graphs, map plots, and other visualization techniques, I was able to respond to the following queries from the isolated data.
R is a powerful language widely used for statistical computing, data analysis, and machine learning. We analyze and visualize the results using R. Below is a list of the libraries used for the research.
tidyverse: The tidyverse is an opinionated collection of R packages designed for data science. All of its packages share a common data structure, grammar, and design philosophy.
ggplot2: ggplot2 is an R package devoted to data visualization. It makes producing graphics more efficient while also improving their quality and appearance.
ggpubr: The ggpubr package facilitates the creation of publication-ready ggplot2-based graphs.
lubridate: lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.
forcats: The forcats package offers a collection of tools that address typical issues with factors, such as reordering levels or changing their values.
RColorBrewer: RColorBrewer is an R package that provides ready-made color palettes for graphics.
Countless libraries are available for exploratory analysis in R, so an important step in the analysis is to understand which libraries are necessary and relevant. Outlining the research questions helped me identify and import the appropriate ones.
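The library setup described above can be sketched as follows. This is a minimal sketch, not the original setup chunk: the vector name `required_pkgs` is hypothetical, and `usmap` is added as an assumption because `plot_usmap()` and `statepop` appear later in the analysis even though the package is not in the list.

```r
# Packages named in the list above, plus usmap (an assumption: it is not in
# the list, but plot_usmap() and statepop are used later in the analysis).
required_pkgs <- c("tidyverse", "ggplot2", "ggpubr", "lubridate",
                   "forcats", "RColorBrewer", "usmap")

# Which of them are not installed in the current R library?
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Install before running: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach every package; character.only = TRUE lets library() take a string.
  invisible(lapply(required_pkgs, library, character.only = TRUE))
}
```

Checking for missing packages first gives one readable message instead of a failure at the first `library()` call.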
The CSV file which contains the data is loaded into R, and from that a data frame is created. Since our area of research is related to the disease Influenza, we have selected only the data points corresponding to that.
data <- read.csv('ProjectTycho_Level2_v1.1.0.csv')
influenza_df <- data[data$disease == 'INFLUENZA',]
The following preprocessing steps were carried out:
- The from_date and to_date columns are converted to the Date type.
- The year and month are extracted from the from_date column into new columns.
## correcting the data types of the date columns
influenza_df$from_date <- as.Date(influenza_df$from_date, '%Y-%m-%d')
influenza_df$to_date <- as.Date(influenza_df$to_date, '%Y-%m-%d')
## extracting year and month from from_date into new columns
influenza_df['year'] <- year(influenza_df$from_date)
influenza_df['month'] <- month(influenza_df$from_date)
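As a small illustration of the date conversion above, the sketch below uses a made-up vector (`raw_dates` is hypothetical, and its last value is deliberately invalid) to show that `as.Date()` with an explicit format returns `NA` for values it cannot parse, so it may be worth counting `NA`s after the conversion. It also shows base-R equivalents of `year()` and `month()`.

```r
# Toy stand-in for the from_date column; the last value is deliberately malformed.
raw_dates <- c("1919-11-02", "1928-12-16", "1951-13-01")

parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Values that do not match the format become NA rather than raising an error.
n_bad <- sum(is.na(parsed))

# Base-R equivalents of lubridate::year() and lubridate::month().
yr <- as.integer(format(parsed, "%Y"))
mo <- as.integer(format(parsed, "%m"))
```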
Here we inspect the data type of each column along with a preview of its values, which gives us an idea of the dataset we are working with.
glimpse(influenza_df)
Rows: 236,673
Columns: 13
$ epi_week <int> 191945, 191945, 191945, 191945, 191945, 191945, 191945, 191945, 191945, 19…
$ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "U…
$ state <chr> "AL", "AR", "CA", "CT", "FL", "GA", "IL", "IN", "IA", "LA", "MA", "MT", "N…
$ loc <chr> "ALABAMA", "ARKANSAS", "CALIFORNIA", "CONNECTICUT", "FLORIDA", "GEORGIA", …
$ loc_type <chr> "STATE", "STATE", "STATE", "STATE", "STATE", "STATE", "STATE", "STATE", "S…
$ disease <chr> "INFLUENZA", "INFLUENZA", "INFLUENZA", "INFLUENZA", "INFLUENZA", "INFLUENZ…
$ event <chr> "CASES", "CASES", "CASES", "CASES", "CASES", "CASES", "CASES", "CASES", "C…
$ number <int> 4, 24, 31, 5, 31, 37, 56, 29, 5, 12, 20, 2, 26, 35, 8, 1, 1, 16, 21, 13, 1…
$ from_date <date> 1919-11-02, 1919-11-02, 1919-11-02, 1919-11-02, 1919-11-02, 1919-11-02, 1…
$ to_date <date> 1919-11-08, 1919-11-08, 1919-11-08, 1919-11-08, 1919-11-08, 1919-11-08, 1…
$ url <chr> "https://www.tycho.pitt.edu/raw/PDF/1919/46.pdf", "https://www.tycho.pitt.…
$ year <dbl> 1919, 1919, 1919, 1919, 1919, 1919, 1919, 1919, 1919, 1919, 1919, 1919, 19…
$ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11…
After filtering the data frame for influenza, only data from 1919 to 1951 was available. All of the following analysis covers this time frame unless mentioned otherwise.
The primary goal of my research was to determine whether the prevalence of influenza shifted over time. Looking at the year-by-year progression of the disease from 1919 to 1951 gives a general understanding of its spread. To determine in which year the cases peaked and whether there is a yearly pattern in the reported cases, the average cases for each year from 1919 to 1951 are plotted. The line graph shows that the highest average was reported in 1950, followed by 1951, 1928, 1920, 1945, and 1941. The years 1924 through 1927 saw the lowest averages. In 1950, there were seven times as many cases as in 1919. The graph also suggests a slight upward trend in reported cases, with high year-to-year variability.
avg_cases_per_year <- influenza_df%>%group_by(year = year(from_date))%>%summarise(count = mean(number))
p1 <- ggplot(avg_cases_per_year, aes(x = year, y = count)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = avg_cases_per_year$year) +
ggtitle('Average influenza cases over the years (1919 to 1951)') +
theme(axis.text.x = element_text(angle = 90, hjust=1))
p1
Fig 1: Average Influenza cases over the years
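The yearly figure above is computed with `mean(number)`, that is, the mean over individual records rather than a yearly total. The sketch below uses invented numbers (the `toy` data frame is hypothetical) to show that a mean and a total can rank years differently when the number of reporting records changes from year to year.

```r
# Invented records: 1919 has more reports, 1920 has larger individual reports.
toy <- data.frame(
  year   = c(1919, 1919, 1919, 1919, 1920, 1920),
  number = c(10, 10, 10, 10, 15, 15)
)

# Mean per record, as in the analysis above.
mean_by_year <- tapply(toy$number, toy$year, mean)

# Total per year, for comparison.
total_by_year <- tapply(toy$number, toy$year, sum)

peak_by_mean  <- names(which.max(mean_by_year))
peak_by_total <- names(which.max(total_by_year))
```

In this toy case the two summaries disagree about the peak year, which is why the choice of summary matters when reading the yearly plot.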
Secondly, I wanted to identify the time periods with the most reported cases. The data was grouped by reporting period (from_date to to_date, six days apart), and a table listing the 10 periods with the most reported cases was produced. The most cases were reported between December 16 and December 22, 1928. Interestingly, although 1951 had one of the highest yearly averages, no period from that year appears in the top 10 table. This suggests that, compared with the other peak years, the cases in 1951 were more evenly distributed across the year.
periods_high_spread <- influenza_df%>%group_by(from_date, to_date)%>%summarise(count = sum(number))%>%arrange(-count)%>%head(10)
`summarise()` has grouped output by 'from_date'. You can override using the `.groups`
argument.
periods_high_spread
options(dplyr.summarise.inform = FALSE)
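The grouping above uses the from_date and to_date pair as the reporting period. As a quick check of how long such a period is, the sketch below takes the first pair shown in the `glimpse()` output (1919-11-02 to 1919-11-08). Whether every row follows the same pattern is an assumption that would need to be checked on the full data.

```r
# First reporting period shown in the glimpse() output.
from_date <- as.Date("1919-11-02")
to_date   <- as.Date("1919-11-08")

# Difference between the two dates, in days.
days_between <- as.integer(to_date - from_date)

# Length of the period when both endpoints are counted.
days_inclusive <- days_between + 1L
```

The two dates are six days apart, which corresponds to a seven-day period when both endpoints are counted, in line with the weekly `epi_week` column.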
Here, a bar graph is used to analyze the average number of cases recorded per month from 1919 to 1951. Most cases were reported in December, followed by January, February, and March, which fall in the winter season in the USA. Reported influenza cases therefore appear to be particularly prevalent in winter. The fewest cases were documented in June, July, and August. This is consistent with Emily Elert’s findings, which state that because vitamin D and melatonin both require sunlight to be produced, the shorter days of winter lead to low amounts of these compounds. As a result, our immune systems are weakened, which lowers our ability to fight off the infection. In colder, drier environments, the influenza virus may also be able to survive longer and infect more people [4].
avg_cases_per_month <- influenza_df%>%group_by(month = month(from_date))%>%summarise(count = mean(number))
p2 <- ggplot(avg_cases_per_month, aes( x = month, y = count , fill = count)) +
geom_bar(stat = 'identity')+
scale_x_continuous(breaks = avg_cases_per_month$month) +
ggtitle('Average influenza cases for months over the years (1919 to 1951)') +
theme(axis.text.x = element_text(angle = 90, hjust=1))
p2
Fig 2: Average Influenza cases for months over the years
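To make the seasonal reading of the monthly plot explicit, months can be grouped into seasons before averaging. The sketch below is illustrative only: the monthly values are invented, `season_of()` is a hypothetical helper, and the meteorological season boundaries (December to February as winter) are an assumption.

```r
# Invented monthly averages, shaped like a winter-peaking series.
toy_monthly <- data.frame(
  month = 1:12,
  count = c(30, 28, 22, 12, 6, 3, 2, 3, 5, 9, 15, 32)
)

# Hypothetical helper: map a month number to a meteorological season.
season_of <- function(m) {
  ifelse(m %in% c(12, 1, 2), "winter",
  ifelse(m %in% 3:5, "spring",
  ifelse(m %in% 6:8, "summer", "autumn")))
}

toy_monthly$season <- season_of(toy_monthly$month)

# month.abb is a built-in constant with abbreviated month names.
toy_monthly$label <- month.abb[toy_monthly$month]

# Average of the monthly values within each season.
season_mean <- tapply(toy_monthly$count, toy_monthly$season, mean)
```

The `label` column could also replace the numeric tick labels on the x axis of the monthly bar chart.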
I then wanted to investigate which years had the greatest influenza disease spread. On the basis of the information from part 5.5.1, the top 10 years for average spread were chosen. As a result, we have decided to focus our subsequent investigation in this particular section on the following years: 1950, 1951, 1928, 1949, 1920, 1945, 1941, 1943, 1929, and 1944.
In terms of the average cases recorded in a month, the top 10 years with the most cases were displayed. I also included the month-by-month count and corresponding years in the data, which showed a pattern that was consistent with the general tendency of more occurrences being reported in the winter months in the USA.
top10_years <- avg_cases_per_year%>%arrange(-count)%>%head(10)
top10_influenza_years <- influenza_df[influenza_df$year %in% top10_years$year,]
p3 <- ggplot(top10_influenza_years, aes( x = month, y = number, group = 1)) +
stat_summary(fun = 'mean', geom = 'bar') +
ggtitle('Average influenza cases for each months of the top 10 years') +
scale_x_continuous(breaks = 1:12) +
theme(axis.text.x = element_text(angle = 90, hjust=1, size = 5))
p3 + facet_wrap(.~factor(top10_influenza_years$year, levels = as.character(top10_years$year)),
ncol = 5, nrow = 2, scales = 'free')
Fig 3: Average Influenza cases for each months of the top 10 years
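The facet panels above are ordered by rank rather than chronologically because the year is wrapped in `factor()` with explicit `levels`. The sketch below shows the mechanism on a short made-up vector (`years_ranked` and `obs_year` are hypothetical and use only the first three years of the ranking).

```r
# First three years of the ranking given in the text, in rank order.
years_ranked <- c(1950, 1951, 1928)

# Made-up observations in arbitrary order.
obs_year <- c(1928, 1950, 1951, 1928)

# Default factor: levels are sorted, so panels would appear chronologically.
default_f <- factor(obs_year)

# Explicit levels: panels follow the ranking instead.
ranked_f <- factor(obs_year, levels = as.character(years_ranked))

default_levels <- levels(default_f)
ranked_levels  <- levels(ranked_f)
```

Storing such a factor as a column of the data frame and faceting on that column is a common alternative to building the factor inside the facet formula.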
My next step in this assignment was to examine the prevalence of the illness in various states around the country in order to answer my second research question. I hypothesized that this study might contribute to a better understanding of the relationship between state geography and disease occurrence.
This chart depicts the average number of cases reported in each state from 1919 to 1951. The state with the most average cases reported over this time was Mississippi, followed by Texas. When compared to the states in the North, the average number of cases recorded by the southern states appears to be a little higher.
# getting average no. of cases per state
map_df <- influenza_df%>%
group_by(state)%>%
summarise(count = mean(number))%>%
arrange(-count)
# merging with US data
new_df <- left_join(statepop, map_df, by = c("abbr" = "state"))
# Get centroids
centroid_labels <- usmapdata::centroid_labels("states")
# Join centroids to data
state_labels <- merge(new_df, centroid_labels, by = "fips")
# plot choropleth graph
p4.1 <- plot_usmap(data = new_df, regions = "state", values = "count", color = "white") +
geom_text(data = state_labels, aes(x = x, y = y, label = abbr.x), color = 'white', size = 3) +
scale_fill_gradient(low = "pink", high = "dark red", name = "avg cases") +
theme(legend.position = "right") +
ggtitle('Average spread across US states over the years')
Warning: Ignoring unknown parameters: linewidth
p4.1
Fig 4: Average spread across US states over the years
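Because the map data is built with a left join, any state without influenza records stays in the table with a missing `count`. The sketch below reproduces that behavior in base R with `merge(..., all.x = TRUE)` on invented data: the two toy tables and the gap for "AK" are hypothetical, and only the `abbr` and `state` column names come from the join above.

```r
# Toy stand-in for the state table (column abbr, as in the join above).
all_states <- data.frame(abbr = c("AL", "AK", "MS"), stringsAsFactors = FALSE)

# Toy averages; "AK" is deliberately absent.
avg_cases <- data.frame(state = c("AL", "MS"), count = c(12.5, 40.0),
                        stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of all_states, like a left join.
joined <- merge(all_states, avg_cases,
                by.x = "abbr", by.y = "state", all.x = TRUE)

# States that found no match end up with NA in count.
unmatched <- joined$abbr[is.na(joined$count)]
```

Listing the unmatched states this way is a quick check on which regions of the map have no data behind them.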
This chart gives a clearer view of the same data as in 5.5.5. A horizontal bar plot is used to show the numerical differences between the average cases reported in each state. According to the graph, the top 5 states by average cases in this time frame are Mississippi, Texas, South Carolina, Iowa, and Arkansas. The average number of cases reported in Mississippi is more than double that of the next-highest state. The lowest averages were reported in Delaware and Vermont; one possible reason is that these states are smaller than the others. The 10 states with the highest and the 10 with the lowest average spread were chosen for additional study in the section that follows.
p4.2 <-influenza_df%>%group_by(state)%>%
summarise(count = mean(number))%>%
ggplot() +
geom_col(aes(count, state, fill = count)) +
geom_text(aes(x = count, y = state,label = round(count,2), hjust = -0.1), size = 2.5) +
theme(axis.text.y = element_text(size = 5)) +
# single title matching the figure caption (a second ggtitle() would override the first)
ggtitle('Average spread across US states over the years')
p4.2
Fig 5: Average spread across US states over the years
Here we plot the 10 states with the highest and the 10 with the lowest average cases over the years. In the first graph, bright blue represents the highest count, whereas in the second graph it denotes the lowest count. Among the 10 states with the highest average spread, all except Mississippi have relatively similar averages. At the low end, Delaware, Vermont, and New Hampshire have lower averages than the other states.
p5.1 <- influenza_df%>%
group_by(state)%>%
summarise(count = mean(number))%>%
arrange(-count)%>%
head(10)%>%
ggdotchart(x = "state", y = "count",
color = "count", # Color by groups
sorting = "descending", # Sort value in descending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
theme_cleveland() + # Add dashed grids
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8)) +
ggtitle('Top 10 states with the highest average spread')
p5.2 <- influenza_df%>%
group_by(state)%>%
summarise(count = mean(number))%>%
arrange(count)%>%
head(10)%>%
ggdotchart(x = "state", y = "count",
color = "count", # Color by groups
sorting = "ascending", # Sort value in ascending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
scale_color_continuous(trans = "reverse") +
theme_cleveland() + # Add dashed grids
ggtitle('Top 10 states with the lowest average spread') +
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8) )
ggarrange(p5.1, p5.2, ncol = 2)
Fig 6: Top 10 states with the highest and lowest average spread over years
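The highest and lowest rankings above share the same steps: average by group, sort, and keep the first n rows. The sketch below wraps those steps in a small base-R function. The function `rank_by_mean()` and the toy data are hypothetical and are not part of the original analysis, which uses dplyr pipelines.

```r
# Hypothetical helper: mean of `number` per group, sorted, first n groups kept.
rank_by_mean <- function(df, group_col, n = 10, highest = TRUE) {
  means <- vapply(split(df$number, df[[group_col]]), mean, numeric(1))
  ordered <- sort(means, decreasing = highest)
  out <- head(ordered, n)
  data.frame(group = names(out), count = as.numeric(out),
             row.names = NULL, stringsAsFactors = FALSE)
}

# Invented records for four states.
toy <- data.frame(
  state  = c("MS", "MS", "TX", "TX", "VT", "VT", "DE"),
  number = c(50, 70, 30, 20, 2, 4, 1),
  stringsAsFactors = FALSE
)

top2    <- rank_by_mean(toy, "state", n = 2, highest = TRUE)
bottom2 <- rank_by_mean(toy, "state", n = 2, highest = FALSE)
```

The same helper could be called with `"loc"` as the grouping column for the city-level rankings.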
Here we plot the state-wise averages for the top 10 years by average reported cases. The graph shows that in 1950, 1945, 1928, 1943, 1949, and 1929 the cases were concentrated mainly in one or two states. In the remaining years the cases were spread across more than three states, which indicates a country-wide or region-wide spread of the disease.
p6 <- ggplot(top10_influenza_years, aes(number, state) ) +
stat_summary(fun = 'mean', geom = 'col') +
ggtitle('State-wise average cases for top 10 years with highest spread') +
theme(axis.text.y = element_text(size = 5),
axis.text.x = element_text(size = 6))
p6 + facet_wrap(.~factor(top10_influenza_years$year, levels = as.character(top10_years$year)),
ncol = 5, nrow = 2, scales = 'free')
Fig 7: State-wise average cases for top 10 years with highest spread
Since the data also includes the cities where cases were reported, a comparison was made to see whether states or cities reported more cases. The bar graph shows that, over the years, states reported more cases on average than cities.
p7 <- influenza_df%>%group_by(loc_type)%>%
summarise(count = mean(number))%>%
ggplot(aes(x = loc_type, y = count, fill = loc_type)) +
geom_col() +
ggtitle('Average spread across location type over the years')
p7
Fig 8: Average spread across location type over the years
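The `event` column contains both CASES and DEATHS records (the DEATHS filter is applied later in the analysis), so an average of `number` taken without filtering pools the two kinds of records. The sketch below uses invented data to show how a cross-tabulation by location type and event can be inspected before comparing averages. Whether pooling changes the conclusions for the real data is not established here.

```r
# Invented records mixing both location types and both event types.
toy <- data.frame(
  loc_type = c("STATE", "STATE", "CITY", "CITY", "CITY"),
  event    = c("CASES", "DEATHS", "DEATHS", "DEATHS", "CASES"),
  number   = c(120, 8, 3, 5, 40),
  stringsAsFactors = FALSE
)

# How many records fall in each loc_type / event combination?
record_counts <- table(toy$loc_type, toy$event)

# Average pooled over events, as in an unfiltered comparison.
pooled_mean <- tapply(toy$number, toy$loc_type, mean)

# Average kept separate for each event type.
mean_by_both <- tapply(toy$number, list(toy$loc_type, toy$event), mean)
```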
My next goal was to analyze the average spread of the disease in each city, which answered my third research question.
Over the years, Erie City reported far more cases on average than other US cities, about twice as many as the second-highest city, New Port. The cities with the lowest averages were Waterloo and Barre, which had significantly fewer cases than the other cities.
# top 10 cities with highest average spread over the years
p8.1 <- influenza_df%>%
filter(loc_type == 'CITY')%>%
group_by(loc)%>%
summarise(count = mean(number))%>%
arrange(-count)%>%
head(10)%>%
ggdotchart(x = "loc", y = "count",
color = "count", # Color by groups
sorting = "descending", # Sort value in descending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
theme_cleveland() + # Add dashed grids
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8)) +
ggtitle('Top 10 cities with the highest average spread')
# top 10 cities with lowest average spread over the years
p8.2 <- influenza_df%>%
filter(loc_type == 'CITY')%>%
group_by(loc)%>%
summarise(count = mean(number))%>%arrange(count)%>%head(10)%>%
ggdotchart(x = "loc", y = "count",
color = "count", # Color by groups
sorting = "ascending", # Sort value in ascending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
scale_color_continuous(trans = "reverse") +
theme_cleveland() + # Add dashed grids
ggtitle('Top 10 cities with the lowest average spread') +
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8) )
ggarrange(p8.1, p8.2, ncol = 2)
Fig 9: Top 10 cities with the highest and lowest average spread over years
Analyzing the death cases of influenza disease
In this section, I examined the data to identify the 10 states with the highest and the 10 with the lowest influenza-related deaths, which answered my fourth research question. Notably, the highest average spread does not coincide with the highest mortality: although Mississippi had the highest average spread, New York had the highest average reported deaths in the USA, while Idaho reported the lowest. On the other hand, the states with lower spread also had lower deaths, as Delaware and Vermont illustrate.
# Top 10 states with the highest deaths due to influenza over years
p9.1 <- influenza_df%>%
filter(event == 'DEATHS')%>%
group_by(state)%>%
summarise(count = mean(number))%>%
arrange(-count)%>%
head(10)%>%
ggdotchart(x = "state", y = "count",
color = "count", # Color by groups
sorting = "descending", # Sort value in descending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
theme_cleveland() + # Add dashed grids
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8)) +
ggtitle('Top 10 states with the highest deaths')
# Top 10 states with the lowest deaths due to influenza over years
p9.2 <- influenza_df%>%
filter(event == 'DEATHS')%>%
group_by(state)%>%
summarise(count = mean(number))%>%
arrange(count)%>%
head(10)%>%
ggdotchart(x = "state", y = "count",
color = "count", # Color by groups
sorting = "ascending", # Sort value in ascending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
scale_color_continuous(trans = "reverse") +
theme_cleveland() + # Add dashed grids
ggtitle('Top 10 states with the lowest deaths') +
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8) )
ggarrange(p9.1, p9.2, ncol = 2)
Fig 10: Top 10 states with the highest and lowest deaths
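The relationship between average spread and average deaths discussed above is read off the rankings. One way to quantify it would be a rank correlation across states. The sketch below uses invented numbers (the `state_summary` table is hypothetical, with values chosen only to mimic the pattern described in the text), so the resulting coefficient says nothing about the real data.

```r
# Invented per-state averages; not taken from the dataset.
state_summary <- data.frame(
  state      = c("MS", "TX", "NY", "VT", "DE"),
  avg_cases  = c(60, 30, 12, 3, 2),
  avg_deaths = c(4, 5, 20, 2, 1),
  stringsAsFactors = FALSE
)

# Spearman's rank correlation is less sensitive to a single extreme state
# than Pearson's correlation.
rho <- cor(state_summary$avg_cases, state_summary$avg_deaths,
           method = "spearman")
```

With the real data, `avg_cases` and `avg_deaths` would come from averaging `number` by state after filtering `event` to CASES and DEATHS respectively.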
Finally, the 10 cities with the highest and the 10 with the lowest average deaths were identified. It’s interesting to note that the cities with the highest and lowest spread also had the highest and lowest disease-related deaths. Erie had the highest number of deaths, followed by New York, Clifton, and Wilkes-Barre. Another intriguing fact is that there were nearly no documented deaths in Waterloo, Sioux City, Muskogee, Grand Forks, and Aberdeen.
# Top 10 cities with highest avg deaths over years
p10.1 <- influenza_df%>%
filter(event == 'DEATHS')%>%
filter(loc_type == 'CITY')%>%
group_by(loc)%>%
summarise(count = mean(number))%>%
arrange(-count)%>%
head(10)%>%
ggdotchart(x = "loc", y = "count",
color = "count", # Color by groups
sorting = "descending", # Sort value in descending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
theme_cleveland() + # Add dashed grids
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8)) +
ggtitle('Top 10 cities with highest avg deaths')
# Top 10 cities with lowest avg deaths over years
p10.2 <- influenza_df%>%
filter(event == 'DEATHS')%>%
filter(loc_type == 'CITY')%>%
group_by(loc)%>%
summarise(count = mean(number))%>%
arrange(count)%>%
head(10)%>%
ggdotchart(x = "loc", y = "count",
color = "count", # Color by groups
sorting = "ascending", # Sort value in ascending order
rotate = TRUE, # Rotate vertically
dot.size = 4 , # Large dot size
ggtheme = theme_pubr(), # ggplot2 theme
) +
scale_color_continuous(trans = "reverse") +
theme_cleveland() + # Add dashed grids
ggtitle('Top 10 cities with lowest avg deaths') +
theme(legend.position = "none", axis.text.y = element_text( size = 7 ),
axis.text.x = element_text( size = 8 ),
title = element_text(size = 8) )
ggarrange(p10.1, p10.2, ncol = 2)
Fig 11: Top 10 cities with highest and lowest avg deaths
As observed in the preceding visualizations, New York had the highest number of influenza-related deaths during the period studied, while Mississippi had the highest average reported cases. The city with the most fatalities was Erie. The flu appears to have been more prevalent in the southern United States. Reported incidence was highest in 1950 and lowest from 1924 to 1927. From 1919 to 1951, the average number of reported cases appears to increase. The winter months of December, January, and February appear to have the highest number of reported cases.
This project focused on the spread of influenza across the US from 1919 to 1951. Different research questions were addressed using R and various visualization techniques. Plotting the average number of influenza cases recorded each year made it possible to determine which years had the most and the least influenza transmission (figure 1). By additionally analyzing the average distribution of cases across the months of the year, I was able to identify an association between the number of cases and winter; the supporting evidence is illustrated in figure 2 and figure 3. This investigation also looked into the states and cities with the highest and lowest average reported cases and deaths.
R was used extensively in this project to represent a significant amount of data. A substantial amount of data was handled and cleaned according to the research interests, which made it possible to present several graphs that answered the research questions.
These analyses may help researchers in this field gain a general understanding of the pattern of disease spread over a given time span. Studying how diseases behave in various states and cities can also guide the precautions to be taken. These analyses could serve as the starting point for more in-depth future studies that could improve society’s well-being.