quarto-blog/_site/search.json

[
{
"objectID": "posts/welcome/index.html",
"href": "posts/welcome/index.html",
"title": "Welcome To My Blog",
"section": "",
"text": "This is the first post in a Quarto blog. Welcome!\n\nSince this post doesn't specify an explicit image, the first image in the post will be used in the listing page of posts."
},
{
"objectID": "index.html",
"href": "index.html",
"title": "About",
"section": "",
"text": "I am a highly accomplished Medical Technologist with an extensive 14-year track record in the medical industry, consistently demonstrating the ability to effectively bridge the divide between medical professionals and information technologists. Proficient in the application of machine learning techniques to enhance medical data analysis and adept at developing innovative R Shiny apps to streamline healthcare processes and improve patient outcomes."
},
{
"objectID": "index.html#bio",
"href": "index.html#bio",
"title": "About",
"section": "",
"text": "I am a highly accomplished Medical Technologist with an extensive 14-year track record in the medical industry, consistently demonstrating the ability to effectively bridge the divide between medical professionals and information technologists. Proficient in the application of machine learning techniques to enhance medical data analysis and adept at developing innovative R Shiny apps to streamline healthcare processes and improve patient outcomes."
},
{
"objectID": "index.html#education",
"href": "index.html#education",
"title": "About",
"section": "Education",
"text": "Education\nCampbell University | Buies Creek, NC\nDoctorate of Health Sciences | August 2020 - May 2023\nUniversity of Central Florida | Orlando, FL\nM.S. in Healthcare Informatics | August 2018 - May 2020\nWestern Carolina University | Cullowhee, NC\nB.S. in Clinical Laboratory Science | August 2005 - May 2009"
},
{
"objectID": "index.html#experience",
"href": "index.html#experience",
"title": "About",
"section": "Experience",
"text": "Experience\nRoche Diagnostics | IT Workflow Consultant | Oct 2021 - Present\nRoche Diagnostics | Field Application Specialist | July 2012 - Sept 2021\nCape Fear Valley Hospital | Lead Medical Laboratory Scientist | June 2011 - July 2012\nCape Fear Valley Hospital | Medical Laboratory Scientist | June 2009 - June 2011"
},
{
"objectID": "blog.html",
"href": "blog.html",
"title": "Posts",
"section": "",
"text": "Diabetes in Rural North Carolina : Data Collection and Cleaning\n\n\nThis is the second post in the series exploring Diabetes in rural North Carolina. This post will explore the data used for this project, from collection, cleaning, and analysis ready data.\n\n\n\n\n\n\n\n\n\nJul 25, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n \n\n\n\n\nImporting Excel Data with Multiple Header Rows\n\n\nA solution for importing Excel Data that contains two header rows.\n\n\n\n\n\n\n\n\n\nJun 22, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n \n\n\n\n\nBasic Exploration of WHO Tuberculosis Data\n\n\nToday I am going to dive into some real life data from the World Health Organization (WHO), exploring new and relapse cases of Tuberculosis. I clean up the data, and then make a few graphs to explore different variables.\n\n\n\n\n\n\n\n\n\nFeb 13, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n \n\n\n\n\nLine Graphs and Interactivity\n\n\nTableau for Healthcare Chapter 10. Static and Interactive examples\n\n\n\n\n\n\n\n\n\nFeb 10, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n \n\n\n\n\nFacets and a Lesson in Humility\n\n\nA look at Tableau for Healthcare Chapter 8. Table Lens graph.\n\n\n\n\n\n\n\n\n\nJan 29, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n \n\n\n\n\nMy Start to R\n\n\nA short introduction to my blog, and R journey.\n\n\n\n\n\n\n\n\n\nJan 24, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\nNo matching items"
},
{
"objectID": "posts/post-with-code/index.html",
"href": "posts/post-with-code/index.html",
"title": "Post With Code",
"section": "",
"text": "This is a post with executable code."
},
{
"objectID": "posts/2020-01-04_my-start-to-r/my-start-to-r.html",
"href": "posts/2020-01-04_my-start-to-r/my-start-to-r.html",
"title": "My Start to R",
"section": "",
"text": "Today starts my attempt at sharing my R journey with the world! I have been learning R off and on since late 2019, and I have begun to take it much more seriously as I work through my Data Analytics class at UCF. My love for all things numbers and graphs has really blossomed, and I am choosing to share that love with anyone who cares to read. I will not claim to be the best at R, or any programming for that matter, but these are my attempts. Each post in this series will replicate a graph created in Tableau from the book Tableau for Healthcare. Today's graph is a simple horizontal bar chart. In transferring to both a new blog site and computer I have unfortunately lost the original bar graph, but trust me, the one I created looks just like it.\n\nLoad Libraries\n\nlibrary(tidyr)\nlibrary(magrittr)\nlibrary(ggplot2)\nlibrary(stringr)\nlibrary(dplyr)\n\n\n\nImport Data\n\nds <- readxl::read_excel(\n path = \"Tableau 10 Training Practice Data.xlsx\" \n ,sheet = \"02 - Patient Falls-Single Hosp\"\n )\n\n\n\nClean Data Names\n\n#should make reusable formula at later time\nnames(ds) <- tolower(names(ds))\nnames(ds) <- str_replace_all(names(ds),\" \", \"_\")\n\n\n\nConvert Data to Long Form\n\nds1 <- ds %>% \n gather(\"patient_falls_no_injury_rate\" , \"patient_falls_with_injury_rate\"\n ,key = \"injury\" \n ,value = \"rate\" ) %>% \n mutate(injury = (injury == \"patient_falls_with_injury_rate\"))\n\n\n\nGraph 5.1\n\nb1 <- ds %>% \n ggplot(mapping = aes(x = reorder(type_of_care,total_patient_falls_rate ) , y = total_patient_falls_rate)) +\n geom_col(fill = \"#2b83ba\") + \n coord_flip() +\n scale_y_continuous(breaks = NULL) +\n theme(axis.ticks = element_blank()) +\n labs(title = \"Rate of Patient Falls (per 1,000 Patient Days)\\nby Type of Care for FY2017\"\n ,x = NULL\n ,y = NULL\n ) +\n theme_classic() +\n geom_text(aes(label = format(total_patient_falls_rate, digits = 2)), nudge_y = -.25, color = \"white\")\n \nb1\n\n\n\n\n\n\n\n\nCitationBibTeX 
citation:@online{belanger2020,\n author = {Belanger, Kyle},\n title = {My {Start} to {R}},\n date = {2020-01-24},\n langid = {en}\n}\nFor attribution, please cite this work as:\nBelanger, Kyle. 2020. “My Start to R.” January 24, 2020."
},
{
"objectID": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html",
"href": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html",
"title": "Basic Exploration of WHO Tuberculosis Data",
"section": "",
"text": "Today I am going to dive into some real life data from the World Health Organization (WHO), exploring new and relapse cases of Tuberculosis. I clean up the data, and then make a few graphs to explore different variables."
},
{
"objectID": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#a-different-way-to-look",
"href": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#a-different-way-to-look",
"title": "Basic Exploration of WHO Tuberculosis Data",
"section": "A different way to look",
"text": "A different way to look\nCould there be any correlation between a country's population and the number of TB cases? Maybe it's just as simple as having more people means more people to get sick? Let's bring in another data set, again from the World Bank (found here), which contains total population data by country.\n\npop_raw <- read.csv(\"API_SP.POP.TOTL_DS2_en_csv_v2_713131.csv\"\n ,skip = 4)\n#If this looks familiar it's because it is, the data set looks very similar to the GDP data\n#In the future this could be moved to a function to make cleaning much easier\npop1 <- pop_raw %>% \n select(-(Indicator.Name:X2012)\n ,-X2019\n ,-X) %>% \n pivot_longer(cols = X2013:X2018\n ,names_to = \"year\" \n ,values_to = \"population\") %>% \n mutate_if(is.character\n ,str_remove_all\n ,pattern = \"X(?=\\\\d*)\")\n\n#now let's combine this into our overall data set\n\nwho_combined <- who_combined %>% \n mutate(year = as.character(year)) %>% \n left_join(y = pop1) %>% \n select(-Country.Name)\n\n#now let's graph again\n\ng3 <- who_combined %>% \n filter(str_detect(age,\"014|15plus|u\"),year == 2018) %>% \n group_by(country) %>% \n summarise(sum_tb_cases = (sum(values,na.rm = TRUE)/10000)\n ,population = first(population)/1000000\n ,who_region = first(g_whoregion)) %>% \n mutate(\n label = ifelse((population>250), yes = as.character(country),no = \"\")) %>%\n ggplot(aes(x = population, y = sum_tb_cases )) +\n geom_point(aes(color = who_region)) +\n ggrepel::geom_text_repel(aes(x = population, y = sum_tb_cases, label = label)) +\n labs(\n title = \"Total TB Cases by Country compared to Population\"\n ,x = \"Population (in Millions)\"\n ,y = \"Total TB Case (per 10,000 cases)\"\n ,color = \"WHO Region\"\n ) +\n theme_bw() \n\n g3 \n\n\n\n\n\nFurther Exploration\nMaybe we are on to something, the more people, the more likely they are to get sick! 
However, India seems to have a very large number of cases, so let's break these cases down further by age group for 2018.\n\ng4 <- who_combined %>% \n filter(year == 2018\n ,country == \"India\"\n ,!(str_detect(age,\"15plus|ageunk|u|014\"))\n ,(str_detect(sex,\"m|f\"))\n ) %>% \n mutate(age_range = glue::glue(\"{age_start} -- {age_end}\")) %>% \n ggplot(aes(x = reorder(age_range, as.numeric(age_start)), y = (values/1000), fill = sex)) +\n geom_col(position = \"dodge\") +\n labs(\n title = \"TB Cases in India by age and gender 2018\"\n ,x = NULL\n ,y = \"Total Cases (per 1000)\"\n ,fill = \"Gender\") +\n scale_fill_manual(labels = c(\"Female\",\"Male\"), values = c(\"#e9a3c9\",\"#67a9cf\") )\n \ng4\n\n\n\n\nThere seems to be a huge spike in cases after adolescence. Females have a sharp decline the older they get, whereas male cases stay elevated with a slight decrease at 55."
},
{
"objectID": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#last-exploration",
"href": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#last-exploration",
"title": "Basic Exploration of WHO Tuberculosis Data",
"section": "Last Exploration",
"text": "Last Exploration\nLet's look at overall cases in India, going back to 1980, and see if there have been any trends. To get these numbers we will go back to our raw data and strip everything out except the total count.\n\ng5 <- who_raw %>% \n filter(country == \"India\") %>% \n select(year, c_newinc) %>% \n ggplot(aes(x = year, y = c_newinc/1000000)) +\n geom_line() +\n geom_point() +\n labs(\n title = \"New and Relapse Tuberculosis Cases In India \\n1980 -- 2018\"\n ,x = NULL\n ,y = \"Total Cases (in millions)\") +\n theme_bw() +\n theme(plot.title = element_text(hjust = 0.5)) + #center title \n scale_x_continuous(breaks = seq(1980,2020,5)) +\n scale_y_continuous(breaks = scales::pretty_breaks(n=10)) #different way to add tick marks\ng5\n\n\n\n\nCases were steadily rising from 1980 to 1990, then suddenly fell off. Starting in the early 2010s there was a sharp increase, and the number of new and relapse cases just keeps growing."
},
{
"objectID": "posts/2020-01-29_facets-and-humility/facets-and-a-lesson-in-humility.html",
"href": "posts/2020-01-29_facets-and-humility/facets-and-a-lesson-in-humility.html",
"title": "Facets and a Lesson in Humility",
"section": "",
"text": "Today's post is a lesson in Facets, as well as humility. The task this week was to replicate the graph in Chapter 8 of Tableau for Healthcare in R. The graph in question is called a Table Lens (this is the name the book uses, however I did have trouble finding this name in Google searches). It is a collection of charts with a common theme, this time looking at countries in various WHO regions and some statistics associated with mortality as well as health expenditure. I say this is a lesson in humility as I have read through the excellent book R for Data Science, and yet the idea of faceting a ggplot graph slipped my mind. This ended with hours of trying to find a package in R to line up graphs, and way more time than I care to admit spent on getting things perfect. I did find such a package called cowplot, which can be found here. While this is an excellent package, its use was unnecessary and I reverted back to using the excellent facet feature of GGplot, which can be seen below! 
\n\nLoad Libraries\n\nlibrary(magrittr) #pipes\nlibrary(ggplot2) #ploting \nlibrary(dplyr)\nlibrary(tidyr)\n\n\n\nImport Data\n\nds <- readxl::read_xlsx(path = \"../2020-01-04_my-start-to-r/Tableau 10 Training Practice Data.xlsx\"\n ,sheet = \"03 - WHO Life Expect & Mort\"\n )\n\n\n\nClean Names and Transform\n\nvarnames <- c(\"who_region\", \"country\", \"year\" , \"sex\" , \"life_expect_birth\" , \"neo_mort\"\n ,\"under_five_mort\" , \"health_expenditure\")\nnames(ds) <- varnames\n\n# Order Countries based on Life Expectancy at Birth\n\nds$country <- factor(ds$country, levels = ds$country[order(ds$life_expect_birth)]) \n\n#To \"Long\" Form\n\nds1 <- ds %>% pivot_longer(5:8)#select columns 5 throuh 8, leave new columns at default names\n\n# Set up labels for Facet, as well as function for Facet Labeller\n\nfacet_labels <- list(\n\"life_expect_birth\" = \"Life Expectancy at Birth \" \n,\"neo_mort\" = \"Neonatal Mortality Rate\" \n,\"under_five_mort\" = \"Under-Five Mortality Rate\"\n,\"health_expenditure\" = \"Health Expenditure per Capita (US$)\" )\n\nvariable_labeller <- function(variable,value){\n return(facet_labels[value])\n}\n\n\n\nGraphs\n\nhightlight_countries <- (c(\"Mauritania\", \"South Africa\")) \n\ng1 <- ds1 %>% filter(who_region == \"Africa\") %>% \n mutate(name = factor(name, levels = c(\"life_expect_birth\" , \"neo_mort\"\n ,\"under_five_mort\" , \"health_expenditure\"))\n ,highlight = country %in% hightlight_countries) %>% \n ggplot(aes(x = country, y = value, fill = highlight)) +\n geom_col(show.legend = FALSE) +\n coord_flip() +\n labs(\n title = \"World Bank Life Expectancy, Neonatal & Under-Five Mortality Rates, and Health Expenditure Analysis\"\n ,x = NULL\n ,y = NULL\n ) +\n facet_grid(~name, scales = \"free_x\",labeller = variable_labeller) +\n theme_bw() +\n geom_text(aes(label = round(value, 0)), hjust = 0) +\n scale_y_continuous(expand = expand_scale(mult = c(0,0.2))) +\n scale_fill_manual(values = c(\"TRUE\" = \"#fc8d59\", \"FALSE\" = 
\"#2b83ba\"))\ng1\n\n\n\n\n\n\n\n\nReusehttps://creativecommons.org/licenses/by/4.0/CitationBibTeX citation:@online{belanger2020,\n author = {Belanger, Kyle},\n title = {Facets and a {Lesson} in {Humility}},\n date = {2020-01-29},\n langid = {en}\n}\nFor attribution, please cite this work as:\nBelanger, Kyle. 2020. “Facets and a Lesson in Humility.”\nJanuary 29, 2020."
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html",
"title": "Line Graphs and Interactivity",
"section": "",
"text": "Today's post is all about line graphs using both ggplot for a static graph as well as a package called plotly for interactivity (more on this later). The example graph and data are again coming from Tableau for Healthcare, Chapter 10."
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#load-libraries",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#load-libraries",
"title": "Line Graphs and Interactivity",
"section": "Load Libraries",
"text": "Load Libraries\nAs always, the first step is to load in our libraries. I am using quite a few here; some are a bit overkill for this example, but I wanted to play around with some fun features today.\n\nlibrary(magrittr) #pipes\nlibrary(ggplot2) #plotting \nlibrary(dplyr) # data manipulation\nlibrary(tidyr) # tidy data\nlibrary(lubridate) #work with dates\nlibrary(stringr) # manipulate strings\nlibrary(plotly)"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#import-data",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#import-data",
"title": "Line Graphs and Interactivity",
"section": "Import Data",
"text": "Import Data\nNext let's import our data; this week we are using the sheet Flu Occurrence FY2013-2016. I am unsure if this is from a real data set or not, but it is good for demonstration purposes! After importing we can glimpse at our data to understand what is contained within.\n\nds <- readxl::read_xlsx(path = \"../2020-01-04_my-start-to-r/Tableau 10 Training Practice Data.xlsx\"\n ,sheet = \"05 - Flu Occurrence FY2013-2016\"\n )\nds %>% glimpse()\n\nRows: 48\nColumns: 4\n$ Date <dttm> 2012-10-27, 2012-11-24, …\n$ `Tests (+) for Influenza (count)` <dbl> 995, 3228, 22368, 24615, …\n$ `Total Respiratory Specimens Tested (count)` <dbl> 18986, 24757, 66683, 7561…\n$ `% Tests (+) for Influenza` <dbl> 0.05240704, 0.13038737, 0…"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#transform-data",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#transform-data",
"title": "Line Graphs and Interactivity",
"section": "Transform Data",
"text": "Transform Data\nI went a bit overboard today with renaming the variables. I wanted to practice writing a function and while it might not be the prettiest or the best way to do this, it worked for what I was trying to accomplish. Note the use of sapply, which lets us run the function on each column name.\n\nformat_names <- function(x) {\n #Function to set all names to lower case, and strip unneeded characters\n x <- tolower(x)\n x <- str_replace_all(x,c(#set each pattern equal to replacement\n \" \" = \"_\"\n ,\"\\\\(\\\\+\\\\)\" = \"pos\" #regular expression to match (+)\n ,\"\\\\(\" = \"\"\n ,\"\\\\)\" = \"\"\n ,\"\\\\%\" = \"pct\"\n )\n ) \n }\n\n#run the format name function on all names from DS\ncolnames(ds) <- sapply(colnames(ds),format_names) \n\nNow is where the fun really starts! For this particular data set there are a couple things we need to add to replicate the example. In the original data set the date is stored with month, day, and year; the day is irrelevant and we need to pull out the month as well as the year. For this we can use the lubridate package. First we pull out the month and set it as a factor. For this example our year actually starts in October, so we set our factor to start at October (10), and end with September (9). We then pull out the year, which presents us with a different problem. Again our year starts in October, instead of January. To solve this I have created a variable called date adjustment; in this column, if our month is 10 or greater we will place a 1, if not a 0. We then set our fiscal year to be the actual year plus the date adjustment, which allows us to have our dates in the right fiscal year. 
Last, the percent column is currently listed as a decimal, so we will convert this to a percentage.\n\n# split date time\nds1 <- ds %>% mutate(\n #create month column, then set factors and labels to start fiscal year in Oct\n month = month(ds$date)\n ,month = factor(month\n ,levels = c(10:12, 1:9)\n ,labels = c(month.abb[10:12],month.abb[1:9]))\n ,year = year(ds$date)\n ,date_adjustment = ifelse(month(ds$date) >= 10, 1,0 )\n ,fiscal_year = factor(year + date_adjustment)\n #convert % Pos from decimal to pct\n ,pct_tests_pos_for_influenza = round(pct_tests_pos_for_influenza * 100, digits = 0)\n )\n\nds1 %>% glimpse()\n\nRows: 48\nColumns: 8\n$ date <dttm> 2012-10-27, 2012-11-24, 2012…\n$ tests_pos_for_influenza_count <dbl> 995, 3228, 22368, 24615, 1179…\n$ total_respiratory_specimens_tested_count <dbl> 18986, 24757, 66683, 75614, 5…\n$ pct_tests_pos_for_influenza <dbl> 5, 13, 34, 33, 23, 17, 11, 6,…\n$ month <fct> Oct, Nov, Dec, Jan, Feb, Mar,…\n$ year <dbl> 2012, 2012, 2012, 2013, 2013,…\n$ date_adjustment <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,…\n$ fiscal_year <fct> 2013, 2013, 2013, 2013, 2013,…"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#ggplot",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#ggplot",
"title": "Line Graphs and Interactivity",
"section": "GGplot",
"text": "GGplot\nThe graph here is pretty straightforward with one exception, group! For this line graph we want ggplot to connect the lines of the same year; if we do not explicitly state this using the group mapping, ggplot will try to connect all the lines together, which of course is not at all what we want!\n\ng1 <- ds1 %>% \n ggplot(aes(x = month, y = pct_tests_pos_for_influenza, color = fiscal_year\n ,group = fiscal_year)) +\n geom_line() +\n labs(\n x = NULL\n ,y = \"% Tests (+) for Influenza\"\n ,color = NULL\n ,title = \"Flu Viral Surveillance: % Respiratory Specimens Positive for Influenza \\nOctober - September \\nFor Flu Seasons 2013 - 2016\"\n ) +\n theme_classic() +\n scale_y_continuous(breaks = seq(0,40,5)) +\n scale_color_manual(values = c(\"#a6611a\",\"#dfc27d\",\"#80cdc1\",\"#018571\"))\n\ng1"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#plotly",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#plotly",
"title": "Line Graphs and Interactivity",
"section": "plotly",
"text": "plotly\nOne of the nice features of Tableau is the fact the graphs are interactive, while a good graph should speak for itself, end users love pretty things. I have been experimenting with Plotly, which has an open source package for R (as well as many other programming languages!). This example only just scratches the surface, but there will be many more to come!\n\ng2 <- ds1 %>% \n plot_ly(x = ~month, y = ~pct_tests_pos_for_influenza, type = \"scatter\", mode = \"lines\" \n ,color = ~fiscal_year\n ,colors = c(\"#a6611a\",\"#dfc27d\",\"#80cdc1\",\"#018571\")\n , hoverinfo = 'y') %>% \n layout(xaxis = list(\n title = \"\"\n )\n ,yaxis = list(\n title = \"% Tests (+) for Influenza\"\n )\n ,title = \"Flu Viral Surveillance: % Respiratory Specimens Positive for Influenza\"\n ,legend = list(\n x = 100\n ,y = 0.5\n ) \n \n )\n\ng2"
},
{
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "",
"text": "This is the second post in the series exploring Diabetes in rural North Carolina. This post will explore the data used for this project, from collection and cleaning to analysis-ready data."
},
{
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#overall",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#overall",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Overall",
"text": "Overall\nOverall there are four data sources that have been used to create the analysis ready data for this project. There is one additional metadata file that contains the list of all county FIPS codes, used for linking the various data sets. All data sets use the county FIPS as the county identifier; the county name is added at the end using the metadata. The image below shows the steps taken to achieve the analysis data set, as well as a table below showing the structure of each data set.\n\n\n\n\n\nData Sources\n\n\nData\nStructure\nSource\nNotes\n\n\n\n\n2010 Census Rural/Urban Housing\none row per county\nUS Census\nNA\n\n\nCounty Health Rankings\none row per county, year\nCounty Health Rankings\nRaw data is one year per file\n\n\nPopulation Estimates\none row per county, year, age group\nUS Census\nNA\n\n\nDiabetes Data\none row per county, year\nCDC Diabetes Atlas\nRaw data is one year per file"
},
{
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#rural-housing",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#rural-housing",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Rural Housing",
"text": "Rural Housing\nThe first data set comes from the US Census, and contains the number of housing units inside both Urban and Rural areas. The raw data was taken and used to calculate the percentage of housing units in rural areas, as well as adding the classifications of Rural, Mostly Rural, and Mostly Urban. More about these classifications can be read here. This data set is from the 2010 US Census, which is then used to set the rural classification until the next Census (2020).\nView greeter script here\n\n\n\nRural Housing Data Set\n\n\nCounty Fips\nPct Rural\nRural\n\n\n\n\n05131\n20.41\nMostly Urban\n\n\n05133\n69.29\nMostly Rural\n\n\n05135\n77.84\nMostly Rural\n\n\n05137\n100.00\nRural\n\n\n05139\n55.07\nMostly Rural\n\n\n05141\n100.00\nRural\n\n\n\nNote: \n\n\n\n\n Displaying 6 of 3,143 rows"
},
{
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#county-health-rankings",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#county-health-rankings",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "County Health Rankings",
"text": "County Health Rankings\nThe second data set comes from County Health Rankings and contains data for the risk factors associated with diabetes; this data set is compiled from many different data sources. The data was downloaded by year, and then combined to form one data set. County Health Rankings uses this data to rate health outcomes across all counties of the United States; for this analysis four categories have been extracted from the overall data set. Note that the food environment index is itself a combined measure: it is a score of both access to healthy food based on distance to grocery stores, as well as access based on cost.\nView greeter script here\n\n\n\nCounty Health Rankings Sources\n\n\nMeasure\nData Source\nFirst Year Available\n\n\n\n\nAdult smoking\nBehavioral Risk Factor Surveillance System\n2010\n\n\nAdult obesity\nCDC Diabetes Interactive Atlas\n2010\n\n\nPhysical inactivity\nCDC Diabetes Interactive Atlas\n2011\n\n\nFood environment index\nUSDA Food Environment Atlas, Map the Meal Gap\n2014\n\n\n\nSource: \n\n\n\n\n https://www.countyhealthrankings.org/explore-health-rankings/measures-data-sources/2020-measures\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCounty Risk Factors Data Set\n\n\nCounty Fips\nYear\nAdult Smoking Percent\nAdult Obesity Percent\nPhysical Inactivity Percent\nFood Environment Index\n\n\n\n\n01001\n2010\n28.1\n30.0\nNA\nNA\n\n\n01003\n2010\n23.1\n24.5\nNA\nNA\n\n\n01005\n2010\n22.7\n36.4\nNA\nNA\n\n\n01007\n2010\nNA\n31.7\nNA\nNA\n\n\n01009\n2010\n23.4\n31.5\nNA\nNA\n\n\n01011\n2010\nNA\n37.3\nNA\nNA\n\n\n\nNote: \n\n\n\n\n\n\n\n Displaying 6 of 34,555 rows"
},
{
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#population-estimates",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#population-estimates",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Population Estimates",
"text": "Population Estimates\nThe third data set also comes from the US Census and contains population estimates for each county in the United States broken down by: year, age-group, sex, race, and ethnicity. For each row in the table the percent of each type of population was calculated using the yearly population total for the county. This breakdown is useful for this project as African-Americans and Hispanics suffer from diabetes at a higher rate than other groups.\nView greeter script here\n\n\n\n\n\nUS Population Estimates Data Set\n\n\nCounty Fips\nYear\nAge Group\nYear Total Population\nTotal Male Population\nTotal Female Population\nWhite Male Population\nWhite Female Population\nBlack Male Population\nBlack Female Population\nAmerican Indian Male Population\nAmerican Indian Female Population\nAsian Male Population\nAsian Female Population\nNative Hawaiian Male Population\nNative Hawaiian Female Population\nNot Hispanic Male Population\nNot Hispanic Female Population\nHispanic Male Population\nHispanic Female Population\nPct Hispanic Female Population\nPct Male\nPct Female\nPct White Male Population\nPct White Female Population\nPct Black Male Population\nPct Black Female Population\nPct American Indian Male Population\nPct American Indian Female Population\nPct Asian Male Population\nPct Asian Female Population\nPct Native Hawaiian Male Population\nPct Native Hawaiian Female Population\nPct not Hispanic Male Population\nPct not Hispanic Female Population\nPct Hispanic Male 
Population\n\n\n\n\n01001\n2010\n0-4\n54773\n1863\n1712\n1415\n1314\n356\n319\n3\n2\n13\n15\n0\n0\n1778\n1653\n85\n59\n0.11\n3.40\n3.13\n2.58\n2.40\n0.65\n0.58\n0.01\n0.00\n0.02\n0.03\n0.00\n0.00\n3.25\n3.02\n0.16\n\n\n01001\n2010\n5-9\n54773\n1984\n1980\n1506\n1517\n398\n369\n15\n6\n15\n22\n1\n4\n1916\n1908\n68\n72\n0.13\n3.62\n3.61\n2.75\n2.77\n0.73\n0.67\n0.03\n0.01\n0.03\n0.04\n0.00\n0.01\n3.50\n3.48\n0.12\n\n\n01001\n2010\n10-14\n54773\n2163\n2129\n1657\n1621\n427\n409\n13\n13\n23\n19\n4\n1\n2098\n2064\n65\n65\n0.12\n3.95\n3.89\n3.03\n2.96\n0.78\n0.75\n0.02\n0.02\n0.04\n0.03\n0.01\n0.00\n3.83\n3.77\n0.12\n\n\n01001\n2010\n15-19\n54773\n2182\n2047\n1601\n1551\n497\n426\n13\n6\n25\n16\n4\n2\n2125\n1996\n57\n51\n0.09\n3.98\n3.74\n2.92\n2.83\n0.91\n0.78\n0.02\n0.01\n0.05\n0.03\n0.01\n0.00\n3.88\n3.64\n0.10\n\n\n01001\n2010\n20-24\n54773\n1573\n1579\n1223\n1219\n306\n316\n6\n7\n6\n7\n3\n2\n1511\n1537\n62\n42\n0.08\n2.87\n2.88\n2.23\n2.23\n0.56\n0.58\n0.01\n0.01\n0.01\n0.01\n0.01\n0.00\n2.76\n2.81\n0.11\n\n\n01001\n2010\n25-29\n54773\n1574\n1617\n1251\n1235\n289\n341\n1\n4\n9\n23\n6\n3\n1505\n1570\n69\n47\n0.09\n2.87\n2.95\n2.28\n2.25\n0.53\n0.62\n0.00\n0.01\n0.02\n0.04\n0.01\n0.01\n2.75\n2.87\n0.13\n\n\n\n\n\nNote: \n\n Displaying 6 of 565560 rows"
},
{
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#diabetes-percentages",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#diabetes-percentages",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Diabetes Percentages",
"text": "Diabetes Percentages\nThe final data set comes from the CDC Diabetes Atlas and contains the estimated prevalence of diabetes in each county of the United States, by year. The data set also includes the upper and lower estimated limits, see the previous post for an explanation of how these numbers are calculated. The data was downloaded by year, and then merged into one data set for the project.\nView greeter script here\n\n\n\nUS Diabetes Data\n\n\nYear\nCounty Fips\nDiabetes Percentage\nDiabetes Lower Limit\nDiabetes Upper Limit\n\n\n\n\n2010\n01001\n11.2\n8.8\n13.9\n\n\n2010\n01003\n10.2\n8.7\n11.9\n\n\n2010\n01005\n13.0\n10.6\n15.9\n\n\n2010\n01007\n10.6\n8.2\n13.3\n\n\n2010\n01009\n12.6\n9.8\n15.7\n\n\n2010\n01011\n16.1\n12.4\n20.4"
},
{
"objectID": "posts/2020-06-22_excel-data-multiple-headers/importing-excel-data-with-multiple-header-rows.html",
"href": "posts/2020-06-22_excel-data-multiple-headers/importing-excel-data-with-multiple-header-rows.html",
"title": "Importing Excel Data with Multiple Header Rows",
"section": "",
"text": "Problem\nRecently I tried to important some Microsoft Excel data into R, and ran into an issue were the data actually had two different header rows. The top row listed a group, and then the second row listed a category within that group. Searching goggle I couldnt really find a good example of what I was looking for, so I am putting it here in hopes of helping someone else!\n\n\nExample Data\nI have created a small Excel file to demonstrate what I am talking about. Download it here. This is the data from Excel. \n\n\nCheck Data\nFirst we will read the file in using the package readxl and view the data without doing anything special to it.\n\nlibrary(readxl) # load the readxl library\nlibrary(tidyverse) # load the tidyverse for manipulating the data\nfile_path <- \"example_data.xlsx\" # set the file path\nds0 <- read_excel(file_path) # read the file\nds0\n\n# A tibble: 7 × 7\n Name `Test 1` ...3 ...4 `Test 2` ...6 ...7 \n <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n1 <NA> Run 1 Run 2 Run 3 Run 1 Run 2 Run 3\n2 Max 22 23 24 25 26 27 \n3 Phoebe 34 34 32 34 51 12 \n4 Scamp 35 36 21 22 23 24 \n5 Chance 1234 1235 1236 1267 173 1233 \n6 Aimee 420 123 690 42 45 12 \n7 Kyle 22 23 25 26 67 54 \n\n\n\n\nNew Header Names\n\nStep 1\nFirst lets read back the data, this time however with some options. 
We will set n_max to 2 so that only the first two rows are read, and set col_names to FALSE so the first row is not read as headers.\n\nds1 <- read_excel(file_path, n_max = 2, col_names = FALSE)\nds1\n\n# A tibble: 2 × 7\n ...1 ...2 ...3 ...4 ...5 ...6 ...7 \n <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n1 Name Test 1 <NA> <NA> Test 2 <NA> <NA> \n2 <NA> Run 1 Run 2 Run 3 Run 1 Run 2 Run 3\n\n\n\n\nStep 2\nNow that we have our headers, let's first transpose them to a vertical matrix using the base function t(), then turn the result back into a tibble so we can use the tidyr fill() function.\n\nnames <- ds1 %>%\n t() %>% #transpose to a matrix\n as_tibble() #back to tibble\nnames\n\n# A tibble: 7 × 2\n V1 V2 \n <chr> <chr>\n1 Name <NA> \n2 Test 1 Run 1\n3 <NA> Run 2\n4 <NA> Run 3\n5 Test 2 Run 1\n6 <NA> Run 2\n7 <NA> Run 3\n\n\nNote that tidyr fill() cannot work row-wise, hence the need to flip the tibble so it is long rather than wide.\n\n\nStep 3\nNow we use the tidyr fill() function to fill each NA with the value found above it.\n\nnames <- names %>% fill(V1) #use tidyr fill to fill in the NA's\nnames\n\n# A tibble: 7 × 2\n V1 V2 \n <chr> <chr>\n1 Name <NA> \n2 Test 1 Run 1\n3 Test 1 Run 2\n4 Test 1 Run 3\n5 Test 2 Run 1\n6 Test 2 Run 2\n7 Test 2 Run 3\n\n\n\n\nStep 4\nThis is where my data differed from many of the examples I could find online. Because the second row is also a header, we cannot just get rid of it. 
We can solve this using paste() combined with dplyr mutate() to form a new column that combines the first and second columns.\n\nnames <- names %>%\n mutate(\n new_names = paste(V1,V2, sep = \"_\")\n )\nnames\n\n# A tibble: 7 × 3\n V1 V2 new_names \n <chr> <chr> <chr> \n1 Name <NA> Name_NA \n2 Test 1 Run 1 Test 1_Run 1\n3 Test 1 Run 2 Test 1_Run 2\n4 Test 1 Run 3 Test 1_Run 3\n5 Test 2 Run 1 Test 2_Run 1\n6 Test 2 Run 2 Test 2_Run 2\n7 Test 2 Run 3 Test 2_Run 3\n\n\n\n\nStep 4a\nOne more small cleanup task: in the example data the first column header, Name, did not have a second label, which has created a name with an NA attached. We can use stringr to remove this NA.\n\nnames <- names %>% mutate(across(new_names, ~str_remove_all(.,\"_NA\")))\nnames\n\n# A tibble: 7 × 3\n V1 V2 new_names \n <ch
}
]