quarto-blog/_site/search.json

331 lines
124 KiB
JSON
Raw Permalink Normal View History

2023-10-11 09:41:10 -04:00
[
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html",
"href": "posts/2024-08-09-learning-Julia/index.html",
"title": "Learning Julia by WebScraping Amtrak Data",
2023-10-11 09:41:10 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Recently two things happened quite close together that started me on the journey to this post.\nSo these two things lead me to this, pulling Amtrak data from the web using Julia. I do not claim to be an expert on Julia but I am learning and I wanted to share my journey, nor to I claim to be an expert at Web Scraping. Taking those things in account lets follow along."
2023-10-12 10:33:35 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#load-packages",
"href": "posts/2024-08-09-learning-Julia/index.html#load-packages",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Load Packages",
"text": "Load Packages\nFirst off I will load the Julia packages I am going to use. The first three all have to do with web scraping, and getting the data off the website. CairoMakie will be used to make the plot. All of the rest are for data wrangling. I already have all of these packages in this project environment so I just need to let the Julia REPL know to load them. If you are brand new to Julia this site really helped explain the idea of project environments to me. I also use VSCode along with the Julia extension which does a great job of handling the project environment.\n\nusing HTTP\nusing Gumbo\nusing Cascadia\nusing DataFrames\nusing DataFramesMeta\nusing Dates\nusing Statistics\nusing CategoricalArrays\nusing CairoMakie"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#setting-up-the-web-scraping",
"href": "posts/2024-08-09-learning-Julia/index.html#setting-up-the-web-scraping",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Setting up the Web Scraping",
"text": "Setting up the Web Scraping\nNow that the packages are loaded, we can start setting up the web scraping. From my internet searching I found that Amtrak does have an API but it is quite challenging to use. I found this website Amtrak Status which does a great job of showing the data I was looking for. In this example I am just going to pull data for two trains, train 97 and train 98. You can see in the link I set those as the train numbers, and if you follow the link you will see it sets it up in a nice table to view the historical data. When then use the HTTP package to get the raw website data and then use Gumbo to parse the HTML into a table. The Cascadia package gives the various CSS selectors to help pull the info I want of the entire page. The page table does not have an ids but it is also the only table on the page. I was able to use the CSS Selector “tr” to get each row of the table into a vector. If we examine the third item in the rows vector we see that it has the information we want (the first two rows are headers for the table)\n\n\nurl = \"https://juckins.net/amtrak_status/archive/html/history.php?train_num=97%2C98&station=&date_start=07%2F01%2F2024&date_end=07%2F31%2F2024\";\nresp = HTTP.get(url);\npage = parsehtml(String(resp.body));\n\nrows = eachmatch(sel\"tr\",page.root);\n\nrows[3]"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#creating-the-dataframe",
"href": "posts/2024-08-09-learning-Julia/index.html#creating-the-dataframe",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Creating the DataFrame",
"text": "Creating the DataFrame\nNow that each row of the table is stored in a vector we need to rebuild the table into a dataframe in Julia. First I am intializing an empty dataframe by creating each column that will hold data. The column names match those of the header in the table on the website. Then I loop through each item in the rows vector. The text variable is a vector of all the td elements in the row. If the text vector is not empty and has more than one item in it, then we loop through the items and push the text into the row_data vector. Finally we push the row_data vector into the dataframe created prior to the loop. By having the nested if I can remove the footer column at the end of the table from the website. The website table header uses a different CSS selector than the rest of the table but the footer does not. At the end of the loop I now have the same table that is on the website but stored as a dataframe in Julia.\n\n# create empty DataFrame and then populate it with the table from website\ndf = DataFrame(train = String[], origin_date = [], station = String[], sch_dp = [], act_dp = String[], comments = [], s_disrupt = [], cancellations = [])\n\nfor i in rows\n text = eachmatch(Selector(\"td\"), i)\n row_data = []\n if !isempty(text) && length(text) > 1\n for item in text\n push!(row_data, nodeText(item))\n end\n push!(df, row_data)\n end\nend"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#cleaning-the-dataframe",
"href": "posts/2024-08-09-learning-Julia/index.html#cleaning-the-dataframe",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Cleaning the DataFrame",
"text": "Cleaning the DataFrame\nComing from R I am quite familiar with data cleaning using dpylr and the rest of the tidyverse packages. When looking at options I really liked what the DataFramesMeta package brings, so I have used that here to get the data were I want it. I first filter out any trains that have a service disruption as well as any that are blank in the departure column. Next I select only the station, train, and the comments column. I originally tried using the two departure columns but was having an issue with trains that arrived at the stations on day but then left the next. These were causing the delay to be quite large as it was calculating as if it actually left before arriving. The comments column has what I needed I just had to pull the string out and convert it to a numeric. After selecting the columns I first create the delay column. This pulled the comment string out of the comment column only if it contains Dp: as this indicates how late or early the train left. Next I am pulling out the time in minutes and hours from the delay string and converting those numbers to integers. The total delay column adds the minutes and hours together and if the word late is not in the column it will convert the number to negative. A negative delay in this case means the train left early. Finally I transform the columns to categorical so that they are easier to work with in the future. You can notice that for the last transformation I could not figure out how to select two columns using the transform macro. Also for those coming from R note the .=> this is the broadcast operator and it lets Julia know to perform the action on the entire vector (I think I am explaining this right!) I end the block by showing the first 5 rows of the modified dataframe.\n\n\nmod_df = @chain df begin\n @rsubset :act_dp != \"\" && :s_disrupt != \"SD\"\n @select :train :station :comments\n #can't perform match if there is nothing there\n @rtransform :delay = occursin(r\"Dp:\", :comments) ? match(r\"Dp:.*\", :comments).match : \"\"\n @rtransform :min = occursin(r\"min\", :delay) ? parse(Int,match(r\"([0-9]*) min\", :delay)[1]) : Int(0)\n @rtransform :hour = occursin(r\"hr\", :delay) ? parse(Int,match(r\"([0-9]*) hr\", :delay)[1]) *60 : Int(0)\n @rtransform :total_delay_mins = :min + :hour |> x -> occursin(r\"late\", :delay) ? x : x *-1 #if word late does not appear, train left early\n transform([:station, :train] .=> categorical, renamecols = false)\nend\n\nfirst(mod_df, 5)\n\n5×7 DataFrame\n\n\n\nRow\ntrain\nstation\ncomments\ndelay\nmin\nhour\ntotal_delay_mins\n\n\n\nCat…\nCat…\nAny\nAbstract…\nInt64\nInt64\nInt64\n\n\n\n\n1\n97\nRMT\nDp: 1 min late.\nDp: 1 min late.\n1\n0\n1\n\n\n2\n98\nFLO\nAr: 7 min early. | Dp: On time.\nDp: On time.\n0\n0\n0\n\n\n3\n98\nKTR\nDp: 12 min late.\nDp: 12 min late.\n12\n0\n12\n\n\n4\n97\nPTB\nDp: 6 min late.\nDp: 6 min late.\n6\n0\n6\n\n\n5\n97\nRVR\nAr: 8 min late. | Dp: 5 min late.\nDp: 5 min late.\n5\n0\n5"
2023-10-12 10:33:35 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#grouping-and-summarizing",
"href": "posts/2024-08-09-learning-Julia/index.html#grouping-and-summarizing",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Grouping and Summarizing",
"text": "Grouping and Summarizing\nNow that I have the data I want, I want to group and summarize to create some graphs. Again using DataFramesMeta and the by keyword I can group by the train and station columns and then create the mean, median, max, and min columns. This action felt very to summarize in dplyr. DataFramesMeta does allow you to do the grouping and combining as two separate steps, but the by keyword combines in into one step. I then ordered by the station column and then by the train column. I then created a column that shows the difference in the mean delay between the two trains. I didnt end up using this for now but I might make something with it later. Last I created two columns that contain the level code for the station and train columns. I will talk about the reason for this in the next section. The function levelcode is from the CategoricalArrays package and it creates an integer column that matches the level of the categorical name. Last I display the first 5 rows of the dataframe.\n\ngd = @chain mod_df begin\n @by _ [:train,:station] begin\n :mean = Float32[Statistics.mean(:total_delay_mins)]\n :median = Statistics.median(:total_delay_mins)\n :max = maximum(:total_delay_mins)\n :min = minimum(:total_delay_mins) \n end \n @orderby :station :train\n @groupby :station\n @transform :diff = [missing; diff(:mean)]\n @rtransform _ begin\n :station_code = levelcode(:station)\n :train_code = levelcode(:train)\n end\nend\n\nfirst(gd, 5)\n\n5×9 DataFrame\n\n\n\nRow\ntrain\nstation\nmean\nmedian\nmax\nmin\ndiff\nstation_code\ntrain_code\n\n\n\nCat…\nCat…\nFloat32\nFloat64\nInt64\nInt64\nFloat32?\nInt64\nInt64\n\n\n\n\n1\n97\nALX\n70.4\n50.0\n287\n0\nmissing\n1\n1\n\n\n2\n98\nALX\n101.387\n77.0\n399\n-16\n30.9871\n1\n2\n\n\n3\n97\nBAL\n53.3333\n27.0\n267\n3\nmissing\n2\n1\n\n\n4\n98\nBAL\n120.226\n104.0\n414\n0\n66.8925\n2\n2\n\n\n5\n97\nCHS\n71.1\n53.0\n286\n0\nmissing\n3\n1"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#plotting",
"href": "posts/2024-08-09-learning-Julia/index.html#plotting",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Plotting",
"text": "Plotting\nComing from R and the ggplot package (also having played around a bit in Plotly for R) there was a rather step learning curve to Makie! I do feel there is a ton of flexibility in Makie, but learning to use it is a beast, and was probably the hardest part of this whole thing. The first challenge was Makie does not like categorical variables (at least for barplots, dont know if this is always true), thus the need for using the level codes so I could pass a numerical vector to the x axis. I am then able to label that axis with the categorical labels. Makie does also allow you to just call the barplot function without all the other set up, and it will automatically create the figure and axis, however I wanted to do it manually and really build up the graph. First step was setting a color gradient, I used Dark2 from the ColorBrewer schemes, just as a personal preference for one I really like. Next up I create the figure. Directly from the Makie docs, The Figure is the outermost container object. I could pass some arguments to the Figure constructor, and change size or colors, but for this one I just left everything as the defaults. Next up is creating the axis. I placed it at position 1,1 within the previously created figure. I also pass labels for the x and y axis, a title, and then the labels for the xticks. The label roation is in radian so pi/2 rotates the labels 90 degrees. Next I generate the barplot. Not the ! in the function call allows for plotting on an existing axis. (More info on the Bang Operator) Last I set up Labels and Colors for the Legend, and the place the Legend at position 1,2 of the existing figure.\n\ncolors = cgrad(:Dark2_6)\nf = Figure();\nax = Axis(f[1,1], xlabel = \"Station\", ylabel = \"Mean Delay (mins)\", title = \"Mean Delay by Station\", xticks = (1:length(levels(gd.station_code)), levels(gd.station)), xticklabelrotation = pi/2)\nbarplot!(ax, gd.station_code, gd.mean, dodge = gd.train_code, color = colors[gd.train_code]) \n\nlabels = [\"$i\" for i in unique(gd.train)]\nelements = [PolyElement(polycolor = colors[i]) for i in unique(gd.train_code)]\n\nLegend(f[1,2],elements, labels, \"Train Number\")\n\n\nf"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-08-09-learning-Julia/index.html#conclusion",
"href": "posts/2024-08-09-learning-Julia/index.html#conclusion",
"title": "Learning Julia by WebScraping Amtrak Data",
"section": "Conclusion",
"text": "Conclusion\nThere is still a lot that could be done with this data set, and I am interested to keep playing around with it to see what kind of insights I could gather. Overall I learned a lot about Julia but as I learned with R there is always more to learn! I look forward to see where this journey takes me."
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2023-10-12_DHSC_Capstone/index.html",
"href": "posts/2023-10-12_DHSC_Capstone/index.html",
"title": "Reflex Testing using Machine Learning in the Clinical Laboratory",
2023-10-12 10:33:35 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Full Paper\nTo view the full paper please go to the following link\n\n\nAbstract\nIntroduction: This research study focuses on developing and testing a machine learning algorithm to predict the FT4 result or diagnose hyper or hypothyroidism in clinical chemistry. The goal is to bridge the gap between hard-coded reflex testing and fully manual reflective testing using machine learning algorithms. The significance of this study lies in the increasing healthcare costs, where laboratory services contribute significantly to medical decisions and budgets. By implementing automated reflex testing with machine learning algorithms, unnecessary laboratory tests can be reduced, resulting in cost savings and improved efficiency in the healthcare system.\nMethods: The study was performed using the Medical Information Mart for Intensive Care (MIMIC) database for data collection. The database consists of de-identified health-related data from critical care units. Eighteen variables, including patient demographics and lab values, were selected for the study. The data set was filtered based on specific criteria, and an outcome variable was created to determine if the Free T4 value was diagnostic. The data handling and modeling were performed using R and R Studio. Regression and classification models were screened using a random grid search to tune hyperparameters, and random forest models were selected as the final models based on their performance. The selected hyperparameters for both regression and classification models are specified.\nResults: The study analyzed a dataset of 11,340 observations, randomly splitting it into a training set (9071 observations) and a testing set (2269 observations) based on the Free T4 laboratory diagnostic value stratification. Classification algorithms were used to predict whether Free T4 would be diagnostic, achieving an accuracy of 0.796 and an AUC of 0.918. The model had a sensitivity of 0.632 and a specificity of 0.892. The importance of individual analytes was assessed, with TSH being the most influential variable. The study also evaluated the predictability of Free T4 results using regression, achieving a Root Mean Square Error (RMSE) of 0.334. The predicted results had an accuracy of 0.790, similar to the classification model.\nDiscussion: The study found that the diagnostic value of Free T4 can be accurately predicted 80% of the time using machine learning algorithms. However, the model had limitations in terms of sensitivity, with a false negative rate of 16% for elevated TSH results and 20% for decreased TSH results. The model achieved a specificity of 89% but did not meet the threshold for clinical deployment. The importance of individual analytes was explored, revealing unexpected correlations between TSH and hematology results, which could be valuable for future algorithms. Real-world applications could use predictive models in clinical decision-making systems to determine the need for Free T4 lab tests based on predictions and patient signs and symptoms. However, implementing such algorithms in existing laboratory information systems poses challenges.\n\n\n\n\nReuseCC BY 4.0CitationBibTeX citation:@online{belanger2023,\n author = {Belanger, Kyle},\n title = {Reflex {Testing} Using {Machine} {Learning} in the {Clinical}\n {Laboratory}},\n date = {2023-10-12},\n langid = {en}\n}\nFor attribution, please cite this work as:\nBelanger, Kyle. 2023. “Reflex Testing Using Machine Learning in\nthe Clinical Laboratory.” October 12, 2023."
2023-10-11 09:41:10 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2021-02-26_tidytuesday-hbcu-enrollment/tidytuesday-2021-week-6-hbcu-enrolment.html",
"href": "posts/2021-02-26_tidytuesday-hbcu-enrollment/tidytuesday-2021-week-6-hbcu-enrolment.html",
"title": "TidyTuesday 2021 Week 6: HBCU Enrollment",
2024-06-08 08:28:40 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Introduction\nRecently I was struggling to find a data project to work on, I felt a bit stuck with some of my current projects, so I begun to scour the internet to find something to work on. I stumbled upon (TidyTuesday)[https://github.com/rfordatascience/tidytuesday] a weekly project where untidy data is posted from various sources, for the goal of practicing cleaning and visualizing. There is not right or wrong answers for TidyTuesday, this was exactly what I was looking for! This week (well by the time this was posted, a few weeks ago) the data set was about Historically Black Colleges and Universities. Within the posted data there were a few different data sets, I chose to work with the set dealing with High school Graduation rates, throughout this post I will explain my steps for cleaning and then present a few different graphs. It should also be noted that in the first section my code blocks will build upon themselves, so the same code will be duplicated as I add more steps to it.\n\n\nLoad Data\nIn this first block we will load some required libraries as well as load in the raw data. This dataset contains data for Highschool graduation rates by race. One thing to point out here is the use of import::from(), will its use here is a bit overkill, it was more for my practice. In this case I am importing the function %nin from the Hmisc package, which in the opposite of the function %in% from base R.\n\nlibrary(dplyr)\nlibrary(ggplot2)\n\nimport::from(Hmisc, `%nin%`)\n\nhs_students_raw <- readxl::read_xlsx(\"104.10.xlsx\", sheet = 1)\n\nglimpse(hs_students_raw)\n\nRows: 48\nColumns: 19\n$ Total <dbl> 1910…\n$ `Total, percent of all persons age 25 and over` <dbl> 13.5…\n$ `Standard Errors - Total, percent of all persons age 25 and over` <chr> \"(—)…\n$ White1 <chr> \"—\",…\n$ `Standard Errors - White1` <chr> \"(†)…\n$ Black1 <chr> \"—\",…\n$ `Standard Errors - Black1` <chr> \"(†)…\n$ Hispanic <chr> \"—\",…\n$ `Standard Errors - Hispanic` <chr> \"(†)…\n$ `Total - Asian/Pacific Islander` <chr> \"—\",…\n$ `Standard Errors - Total - Asian/Pacific Islander` <chr> \"(†)…\n$ `Asian/Pacific Islander - Asian` <chr> \"—\",…\n$ `Standard Errors - Asian/Pacific Islander - Asian` <chr> \"(†)…\n$ `Asian/Pacific Islander - Pacific Islander` <chr> \"—\",…\n$ `Standard Errors - Asian/Pacific Islander - Pacific Islander` <chr> \"(†)…\n$ `American Indian/\\r\\nAlaska Native` <chr> \"—\",…\n$ `Standard Errors - American Indian/\\r\\nAlaska Native` <chr> \"(†)…\n$ `Two or more race` <chr> \"—\",…\n$ `Standard Errors - Two or more race` <chr> \"()\n\n\nNow we are going to start cleaning the data. First I am going to filter for years 1985 and up, prior to this year the data set is a bit spardic, so to keep it clean I am only going to look at 1985 and up. There are also 3 odd years (19103,19203,19303) that I am not sure what those are so I will remove that data as well.\n\nhs_students <- hs_students_raw %>% \n filter(Total >= 1985) %>% \n filter(Total %nin% c(19103, 19203, 19303))\n\nNext I am going to convert all columns to be numeric, because of some blanks in the original import all of the columns read in as characters instead of numeric.\n\nhs_students <- hs_students_raw %>% \n filter(Total >= 1985) %>% \n filter(T
2023-10-11 11:15:38 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
2024-06-08 08:28:40 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "This is the second post in the series exploring Diabetes in rural North Carolina. This post will explore the data used for this project, from collection, cleaning, and analysis ready data."
2023-10-11 11:15:38 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#overall",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#overall",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Overall",
"text": "Overall\nOverall there are four data sources that have been used to create the analysis ready data for this project. There is one additional metadata file that contains the list of all county FIP codes, used for linking the various data sets. All data sets use the county FIPS as the county identifier, the county name is added at the end using the metadata. The image below shows the steps taken to achieve the analysis data set, as well as a table below showing the structure of each data set.\n\n\n\n\n\nData Sources\n\n\nData\nStructure\nSource\nNotes\n\n\n\n\n2010 Census Rural/Urban Housing\none row per county\nUS Census\nNA\n\n\nCounty Health Rankings\none row per county, year\nCounty Health Rankings\nRaw data is one year per file\n\n\nPopulation Estimates\none row per county, year, age group\nUS Census\nNA\n\n\nDiabetes Data\none row per county, year\nCDC Diabetes Atlas\nRaw data is one year per file"
2023-10-11 15:22:24 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#rural-housing",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#rural-housing",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Rural Housing",
"text": "Rural Housing\nThe first data set comes from the US Census, and contains the amount of housing units inside both Urban and Rural areas. The raw data was taken and used to calculate the percentage of housing units in rural areas, as well as adding the classifications of Rural, Mostly Rural, and Mostly Urban. More about these classifications can be read here. This data set is from the 2010 US Census, which is then used to set the rural classification until the next Census (2020).\nView greeter script here\n\n\n\nRural Housing Data Set\n\n\nCounty Fips\nPct Rural\nRural\n\n\n\n\n05131\n20.41\nMostly Urban\n\n\n05133\n69.29\nMostly Rural\n\n\n05135\n77.84\nMostly Rural\n\n\n05137\n100.00\nRural\n\n\n05139\n55.07\nMostly Rural\n\n\n05141\n100.00\nRural\n\n\n\nNote: \n\n\n\n\n Displaying 6 of 3,143 rows"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#county-health-rankings",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#county-health-rankings",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "County Health Rankings",
"text": "County Health Rankings\nThe second data set comes from County Health Rankings and contains data for the risk factors associated with diabetes, this data set is complied from many different data sources. The data was downloaded by year, and then combine to form one data set. County Health Rankings uses this data to rate health outcomes across all counties of the United States, for this analysis four categories have been extracted from the overall data set. Note that the food environment index is itself a combine measure, it is a score of both access to healthy food based on distance to grocery stores, as well as access based on cost.\nView greeter script here\n\n\n\nCounty Health Rankings Sources\n\n\nMeasure\nData Source\nFirst Year Available\n\n\n\n\nAdult smoking\nBehavioral Risk Factor Surveillance System\n2010\n\n\nAdult obesity\nCDC Diabetes Interactive Atlas\n2010\n\n\nPhysical inactivity\nCDC Diabetes Interactive Atlas\n2011\n\n\nFood environment index\nUSDA Food Environment Atlas, Map the Meal Gap\n2014\n\n\n\nSource: \n\n\n\n\n https://www.countyhealthrankings.org/explore-health-rankings/measures-data-sources/2020-measures\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCounty Risk Factors Data Set\n\n\nCounty Fips\nYear\nAdult Smoking Percent\nAdult Obesity Percent\nPhysical Inactivity Percent\nFood Environment Index\n\n\n\n\n01001\n2010\n28.1\n30.0\nNA\nNA\n\n\n01003\n2010\n23.1\n24.5\nNA\nNA\n\n\n01005\n2010\n22.7\n36.4\nNA\nNA\n\n\n01007\n2010\nNA\n31.7\nNA\nNA\n\n\n01009\n2010\n23.4\n31.5\nNA\nNA\n\n\n01011\n2010\nNA\n37.3\nNA\nNA\n\n\n\nNote: \n\n\n\n\n\n\n\n Displaying 6 of 34,555 rows"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#population-estimates",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#population-estimates",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Population Estimates",
"text": "Population Estimates\nThe third data set also comes from the US Census and contains population estimates for each county in the United States broken down by: year, age-group, sex, race, and ethnicity. For each row in the table the percent of each type of population was calculated using the yearly population total for the county. This breakdown is useful for this project as African-Americans and Hispanics suffer from diabetes at a higher rate then other groups.\nView greeter script here\n\n\n\n\nUS Population Estimates Data Set\n\n\nCounty Fips\nYear\nAge Group\nYear Total Population\nTotal Male Population\nTotal Female Population\nWhite Male Population\nWhite Female Population\nBlack Male Population\nBlack Female Population\nAmerican Indian Male Population\nAmerican Indian Female Population\nAsian Male Population\nAsian Female Population\nNative Hawaiian Male Population\nNative Hawaiian Female Population\nNot Hispanic Male Population\nNot Hispanic Female Population\nHispanic Male Population\nHispanic Female Population\nPct Hsipanic Female Population\nPct Male\nPct Female\nPct White Male Population\nPct White Female Population\nPct Black Male Population\nPct Black Female Population\nPct American Indian Male Population\nPct American Indian Female Population\nPct Asian Male Population\nPct Asian Female Population\nPct Native Hawaiian Male Population\nPct Native Hawaiian Female Population\nPct not Hispanic Male Population\nPct not Hispanic Female Population\nPct Hispanic Male Population\n\n\n\n\n01001\n2010\n0-4\n54773\n1863\n1712\n1415\n1314\n356\n319\n3\n2\n13\n15\n0\n0\n1778\n1653\n85\n59\n0.11\n3.40\n3.13\n2.58\n2.40\n0.65\n0.58\n0.01\n0.00\n0.02\n0.03\n0.00\n0.00\n3.25\n3.02\n0.16\n\n\n01001\n2010\n5-9\n54773\n1984\n1980\n1506\n1517\n398\n369\n15\n6\n15\n22\n1\n4\n1916\n1908\n68\n72\n0.13\n3.62\n3.61\n2.75\n2.77\n0.73\n0.67\n0.03\n0.01\n0.03\n0.04\n0.00\n0.01\n3.50\n3.48\n0.12\n\n\n01001\n2010\n10-14\n54773\n2163\n2129\n1657\n1621\n427\n409\n13\n13\n23\n19\n4\n1\n2098\n2064\n65\n65\n0.12\n3.95\n3.89\n3.03\n2.96\n0.78\n0.75\n0.02\n0.02\n0.04\n0.03\n0.01\n0.00\n3.83\n3.77\n0.12\n\n\n01001\n2010\n15-19\n54773\n2182\n2047\n1601\n1551\n497\n426\n13\n6\n25\n16\n4\n2\n2125\n1996\n57\n51\n0.09\n3.98\n3.74\n2.92\n2.83\n0.91\n0.78\n0.02\n0.01\n0.05\n0.03\n0.01\n0.00\n3.88\n3.64\n0.10\n\n\n01001\n2010\n20-24\n54773\n1573\n1579\n1223\n1219\n306\n316\n6\n7\n6\n7\n3\n2\n1511\n1537\n62\n42\n0.08\n2.87\n2.88\n2.23\n2.23\n0.56\n0.58\n0.01\n0.01\n0.01\n0.01\n0.01\n0.00\n2.76\n2.81\n0.11\n\n\n01001\n2010\n25-29\n54773\n1574\n1617\n1251\n1235\n289\n341\n1\n4\n9\n23\n6\n3\n1505\n1570\n69\n47\n0.09\n2.87\n2.95\n2.28\n2.25\n0.53\n0.62\n0.00\n0.01\n0.02\n0.04\n0.01\n0.01\n2.75\n2.87\n0.13\n\n\n\nNote: \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n Displaying 6 of 565560 rows"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#diabetes-percentages",
"href": "posts/2020-07-25_diabetes-data-collection-and-cleaning/diabetes-in-rural-north-carolina-data-collection-and-cleaning.html#diabetes-percentages",
"title": "Diabetes in Rural North Carolina : Data Collection and Cleaning",
"section": "Diabetes Percentages",
"text": "Diabetes Percentages\nThe final data set comes from the CDC Diabetes Atlas and contains the estimated prevalence of diabetes in each county of the United States, by year. The data set also includes the upper and lower estimated limits, see the previous post for an explanation of how these numbers are calculated. The data was downloaded by year, and then merged into one data set for the project.\nView greeter script here\n\n\n\nUS Diabetes Data\n\n\nYear\nCounty Fips\nDiabetes Percentage\nDiabetes Lower Limit\nDiabetes Upper Limit\n\n\n\n\n2010\n01001\n11.2\n8.8\n13.9\n\n\n2010\n01003\n10.2\n8.7\n11.9\n\n\n2010\n01005\n13.0\n10.6\n15.9\n\n\n2010\n01007\n10.6\n8.2\n13.3\n\n\n2010\n01009\n12.6\n9.8\n15.7\n\n\n2010\n01011\n16.1\n12.4\n20.4"
2024-06-08 08:28:40 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-06-22_excel-data-multiple-headers/importing-excel-data-with-multiple-header-rows.html",
"href": "posts/2020-06-22_excel-data-multiple-headers/importing-excel-data-with-multiple-header-rows.html",
"title": "Importing Excel Data with Multiple Header Rows",
2024-06-08 08:28:40 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Problem\nRecently I tried to important some Microsoft Excel data into R, and ran into an issue were the data actually had two different header rows. The top row listed a group, and then the second row listed a category within that group. Searching goggle I couldnt really find a good example of what I was looking for, so I am putting it here in hopes of helping someone else!\n\n\nExample Data\nI have created a small Excel file to demonstrate what I am talking about. Download it here. This is the data from Excel. \n\n\nCheck Data\nFirst we will read the file in using the package readxl and view the data without doing anything special to it.\n\nlibrary(readxl) # load the readxl library\nlibrary(tidyverse) # load the tidyverse for manipulating the data\nfile_path <- \"example_data.xlsx\" # set the file path\nds0 <- read_excel(file_path) # read the file\nds0\n\n# A tibble: 7 × 7\n Name `Test 1` ...3 ...4 `Test 2` ...6 ...7 \n <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n1 <NA> Run 1 Run 2 Run 3 Run 1 Run 2 Run 3\n2 Max 22 23 24 25 26 27 \n3 Phoebe 34 34 32 34 51 12 \n4 Scamp 35 36 21 22 23 24 \n5 Chance 1234 1235 1236 1267 173 1233 \n6 Aimee 420 123 690 42 45 12 \n7 Kyle 22 23 25 26 67 54 \n\n\n\n\nNew Header Names\n\nStep 1\nFirst lets read back the data, this time however with some options. We will set the n_max equal to 2, to only read the first two rows, and set col_names to FALSE so we do not read the first row as headers.\n\nds1 <- read_excel(file_path, n_max = 2, col_names = FALSE)\nds1\n\n# A tibble: 2 × 7\n ...1 ...2 ...3 ...4 ...5 ...6 ...7 \n <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n1 Name Test 1 <NA> <NA> Test 2 <NA> <NA> \n2 <NA> Run 1 Run 2 Run 3 Run 1 Run 2 Run 3\n\n\n\n\nStep 2\nNow that we have our headers lets first transpose them to a vertical matrix using the base function t(), then we will turn it back into a tibble to allow us to use tidyr fill function.\n\nnames <- ds1 %>%\n t() %>% #transpose to a matrix\n as_tibble() #back to tibble\nnames\n\n# A tibble: 7 × 2\n V1 V2 \n <chr> <chr>\n1 Name <NA> \n2 Test 1 Run 1\n3 <NA> Run 2\n4 <NA> Run 3\n5 Test 2 Run 1\n6 <NA> Run 2\n7 <NA> Run 3\n\n\nNote that tidyr fill can not work row wise, thus the need to flip the tibble so it is long vs wide.\n\n\nStep 3\nNow we use tidyr fill function to fill the NAs with whatever value it finds above.\n\nnames <- names %>% fill(V1) #use dplyr fill to fill in the NA's\nnames\n\n# A tibble: 7 × 2\n V1 V2 \n <chr> <chr>\n1 Name <NA> \n2 Test 1 Run 1\n3 Test 1 Run 2\n4 Test 1 Run 3\n5 Test 2 Run 1\n6 Test 2 Run 2\n7 Test 2 Run 3\n\n\n\n\nStep 4\nThis is where my data differed from many of the examples I could find online. Because the second row is also a header we can not just get rid of them. We can solve this using paste() combined with dplyr mutate to form a new column that combines the first and second column.\n\nnames <- names %>%\n mutate(\n new_names = paste(V1,V2, sep = \"_\")\n )\nnames\n\n# A tibble: 7 × 3\n V1 V2 new_names \n <chr> <chr> <chr> \n1 Name <NA> Name_NA \n2 Test 1 Run 1 Test 1_Run 1\n3 Test 1 Run 2 Test 1_Run 2\n4 Test 1 Run 3 Test 1_Run 3\n5 Test 2 Run 1 Test 2_Run 1\n6 Test 2 Run 2 Test 2_Run 2\n7 Test 2 Run 3 Test 2_Run 3\n\n\n\n\nStep 4a\nOne more small clean up task, in the example data the first column header Name, did not have a second label, this has created a name with an NA attached. We can use stringr to remove this NA.\n\nnames <- names %>% mutate(across(new_names, ~str_remove_all(.,\"_NA\")))\nnames\n\n# A tibble: 7 × 3\n V1 V2 new_names \n <ch
2023-10-12 08:45:37 -04:00
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html",
"title": "Line Graphs and Interactivity",
"section": "",
"text": "Todays post is all about line graphs using both ggplot for a static graph as well as a package called plotly for interactivity (more on this later). The example graph and data is again coming from Tableau for Healthcare, Chapter 10."
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#load-libraries",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#load-libraries",
"title": "Line Graphs and Interactivity",
"section": "Load Libraries",
"text": "Load Libraries\nAs always first step is to load in our libraries, I am using quite a few here, some are a bit overkill for this example but I wanted to play around with some fun features today.\n\nlibrary(magrittr) #pipes\nlibrary(ggplot2) #ploting \nlibrary(dplyr) # data manipulation\nlibrary(tidyr) # tidy data\nlibrary(lubridate) #work with dates\nlibrary(stringr) # manipulate strings\nlibrary(plotly)"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#import-data",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#import-data",
"title": "Line Graphs and Interactivity",
"section": "Import Data",
"text": "Import Data\nNext lets import our data, this week we are using the sheet Flu Occurrence FY2013-2016. I am unsure if this is form a real data set or not but it is good for demonstration purposes! After importing we can glimpse at our data to understand what is contained within.\n\nds <- readxl::read_xlsx(path = \"../2020-01-04_my-start-to-r/Tableau 10 Training Practice Data.xlsx\"\n ,sheet = \"05 - Flu Occurrence FY2013-2016\"\n )\nds %>% glimpse()\n\nRows: 48\nColumns: 4\n$ Date <dttm> 2012-10-27, 2012-11-24, …\n$ `Tests (+) for Influenza (count)` <dbl> 995, 3228, 22368, 24615, …\n$ `Total Respiratory Specimens Tested (count)` <dbl> 18986, 24757, 66683, 7561…\n$ `% Tests (+) for Influenza` <dbl> 0.05240704, 0.13038737, 0…"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#transform-data",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#transform-data",
"title": "Line Graphs and Interactivity",
"section": "Transform Data",
"text": "Transform Data\nI went a bit overboard today with renaming the variables. I wanted to practice writing a function and while it might not be the prettiest or the best way to do this, it worked for what I was trying to accomplish. Note the use of sapply, which lets us run the function on each column name.\n\nformat_names <- function(x) {\n #Fucntion to set all names to lower case, and strip unneeded characters\n x <- tolower(x)\n x <- str_replace_all(x,c(#set each pattern equal to replacement\n \" \" = \"_\"\n ,\"\\\\(\\\\+\\\\)\" = \"pos\" #regualr experssion to match (+)\n ,\"\\\\(\" = \"\"\n ,\"\\\\)\" = \"\"\n ,\"\\\\%\" = \"pct\"\n )\n ) \n }\n\n#run the format name function on all names from DS\ncolnames(ds) <- sapply(colnames(ds),format_names) \n\nNow is were the fun really starts! For this particular data set there are a couple things we need to add to replicate the example. In the original data set the date is stored with month, day, and year; the day is irrelevant and we need to pull out the month as well as the year. For this we can use the lubridate package, first we pull out the month and set it as a factor. For this example our year actually starts in October, so we set our factor to start at October (10), and end with September (9). We then pull out the year, which presents us with a different problem. Again our year starts in October, instead of January. To solve this I have created a variable called date adjustment, in this column is our month is 10 or greater, we will place a 1, if not a 0. We then set our fiscal year to be the actual year plus the date adjustment, this allows us to have our dates in the right fiscal year. Last the percent column is currently listed as a decimal, so we will convert this to a percentage.\n\n# split date time\nds1 <- ds %>% mutate(\n #create month column, then set factors and labels to start fiscal year in Oct\n month = month(ds$date)\n ,month = factor(month\n ,levels = c(10:12, 1:9)\n ,labels = c(month.abb[10:12],month.abb[1:9]))\n ,year = year(ds$date)\n ,date_adjustment = ifelse(month(ds$date) >= 10, 1,0 )\n ,fiscal_year = factor(year + date_adjustment)\n #convert % Pos from decmial to pct\n ,pct_tests_pos_for_influenza = round(pct_tests_pos_for_influenza * 100, digits = 0)\n )\n\nds1 %>% glimpse()\n\nRows: 48\nColumns: 8\n$ date <dttm> 2012-10-27, 2012-11-24, 2012…\n$ tests_pos_for_influenza_count <dbl> 995, 3228, 22368, 24615, 1179…\n$ total_respiratory_specimens_tested_count <dbl> 18986, 24757, 66683, 75614, 5…\n$ pct_tests_pos_for_influenza <dbl> 5, 13, 34, 33, 23, 17, 11, 6,…\n$ month <fct> Oct, Nov, Dec, Jan, Feb, Mar,…\n$ year <dbl> 2012, 2012, 2012, 2013, 2013,…\n$ date_adjustment <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,…\n$ fiscal_year <fct> 2013, 2013, 2013, 2013, 2013,…"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#ggplot",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#ggplot",
"title": "Line Graphs and Interactivity",
"section": "GGplot",
"text": "GGplot\nThe graph here is pretty straight forward with one exception, group! For this line graph we want ggplot to connect the lines of the same year, if we do not explicitly state this using the group mapping, ggplot will try to connect all the lines together, which of course is not at all what we want!\n\ng1 <- ds1 %>% \n ggplot(aes(x = month, y = pct_tests_pos_for_influenza, color = fiscal_year\n ,group = fiscal_year)) +\n geom_line() +\n labs(\n x = NULL\n ,y = \"% Tests (+) for Influenza\"\n ,color = NULL\n ,title = \"Flu Viral Surveillance: % Respiratory Specimens Positive for Influenza \\nOctober - September \\nFor Flu Seasons 2013 - 2016\"\n ) +\n theme_classic() +\n scale_y_continuous(breaks = seq(0,40,5)) +\n scale_color_manual(values = c(\"#a6611a\",\"#dfc27d\",\"#80cdc1\",\"#018571\"))\n\ng1"
},
{
"objectID": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#plotly",
"href": "posts/2020-02-10_line-graphs-and-interactivity/line-graphs-and-interactivity.html#plotly",
"title": "Line Graphs and Interactivity",
"section": "plotly",
"text": "plotly\nOne of the nice features of Tableau is the fact the graphs are interactive, while a good graph should speak for itself, end users love pretty things. I have been experimenting with Plotly, which has an open source package for R (as well as many other programming languages!). This example only just scratches the surface, but there will be many more to come!\n\ng2 <- ds1 %>% \n plot_ly(x = ~month, y = ~pct_tests_pos_for_influenza, type = \"scatter\", mode = \"lines\" \n ,color = ~fiscal_year\n ,colors = c(\"#a6611a\",\"#dfc27d\",\"#80cdc1\",\"#018571\")\n , hoverinfo = 'y') %>% \n layout(xaxis = list(\n title = \"\"\n )\n ,yaxis = list(\n title = \"% Tests (+) for Influenza\"\n )\n ,title = \"Flu Viral Surveillance: % Respiratory Specimens Positive for Influenza\"\n ,legend = list(\n x = 100\n ,y = 0.5\n ) \n \n )\n\ng2"
2023-10-12 08:52:22 -04:00
},
2023-10-12 10:33:35 -04:00
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-01-04_my-start-to-r/my-start-to-r.html",
"href": "posts/2020-01-04_my-start-to-r/my-start-to-r.html",
"title": "My Start to R",
2023-10-12 11:26:16 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Today starts my attempt at sharing my R journey with the world! I have been learning R off and on now since late 2019, I have begun to take it much more serious as I work through my Data Analytics class at UCF. My love for all things numbers and graphs has really blossomed, and I am choosing to share that love with anyone who cares to read. I will not claim to be the best at R, or any programming for that matter, but these are my attempts. Each post in this serious will be replicated a graph created in Tableau from the book Tableau for Healthcare. Todays graph is a simple horizontal bar chart, in transferring to both a new blog site and computer I have unfortunately lost the original bar graph, but trust me the one I created looks just like it.\n\nLoad Libraries\n\nlibrary(tidyr)\nlibrary(magrittr)\nlibrary(ggplot2)\nlibrary(stringr)\nlibrary(dplyr)\n\n\n\nImport Data\n\nds <- readxl::read_excel(\n path = \"Tableau 10 Training Practice Data.xlsx\" \n ,sheet = \"02 - Patient Falls-Single Hosp\"\n )\n\n\n\nClean Data Names\n\n#should make reusable forumla at later time\nnames(ds) <- tolower(names(ds))\nnames(ds) <- str_replace_all(names(ds),\" \", \"_\")\n\n\n\nConvert Data to Long Form\n\nds1 <- ds %>% \n gather(\"patient_falls_no_injury_rate\" , \"patient_falls_with_injury_rate\"\n ,key = \"injury\" \n ,value = \"rate\" ) %>% \n mutate(injury = (injury == \"patient_falls_with_injury_rate\"))\n\n\n\nGraph 5.1\n\nb1 <- ds %>% \n ggplot(mapping = aes(x = reorder(type_of_care,total_patient_falls_rate ) , y = total_patient_falls_rate)) +\n geom_col(fill = \"#2b83ba\") + \n coord_flip() +\n scale_y_continuous(breaks = NULL) +\n theme(axis.ticks = element_blank()) +\n labs(title = \"Rate of Patient Falls (per 1,000 Pateint Days)\\nby Type of Care for FY2017\"\n ,x = NULL\n ,y = NULL\n ) +\n theme_classic() +\n geom_text(aes(label = format(total_patient_falls_rate, digits = 2)), nudge_y = -.25, color = \"white\")\n \nb1\n\n\n\n\n\n\n\n\n\n\n\n\nReuseCC BY 4.0CitationBibTeX citation:@online{belanger2020,\n author = {Belanger, Kyle},\n title = {My {Start} to {R}},\n date = {2020-01-24},\n langid = {en}\n}\nFor attribution, please cite this work as:\nBelanger, Kyle. 2020. “My Start to R.” January 24, 2020."
2023-10-12 08:52:22 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "blog.html",
"href": "blog.html",
"title": "Posts",
2023-10-12 11:26:16 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Learning Julia by WebScraping Amtrak Data\n\n\n\n\n\n\nJulia\n\n\ndataViz\n\n\n\n\n\n\n\n\n\nAug 27, 2024\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nDoes a US Born Players Birthdate affect their shot at the NHL\n\n\n\n\n\n\ntidytuesday\n\n\nR\n\n\ndataViz\n\n\n\nInspired by TidyTuesday Week 2 - 2024 dataset about Candian Players, lets look at the same anaylyis for American Born Players\n\n\n\n\n\nJun 8, 2024\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nReflex Testing using Machine Learning in the Clinical Laboratory\n\n\nThis post contains the abstract of my Capstone for the Doctorate of Health Science program at Campbell University. \n\n\n\n\n\n\n\n\nOct 12, 2023\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nTidyTuesday 2021 Week 6: HBCU Enrollment\n\n\nTidyTuesday 2021 Week 6: HBCU Enrollment. Posts looks at tidying the data ,as well as making some graphs about the data. \n\n\n\nTidyTuesday\n\n\n\n\n\n\n\n\n\nFeb 26, 2021\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nConverting From Blogdown to Distill\n\n\nA meta post on transferring from a blogdown to distill blog site \n\n\n\nDistill\n\n\n\n\n\n\n\n\n\nJan 12, 2021\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nDiabetes in Rural North Carolina : Data Collection and Cleaning\n\n\nThis is the second post in the series exploring Diabetes in rural North Carolina. This post will explore the data used for this project, from collection, cleaning, and analysis ready data. \n\n\n\n\n\n\n\n\nJul 25, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nDiabetes in Rural North Carolina : Exploring Prevalence Trends\n\n\nThis post introduces the exploration of the Diabetes epidemic in North Carolina \n\n\n\n\n\n\n\n\nJun 25, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nImporting Excel Data with Multiple Header Rows\n\n\nA solution for importing Excel Data that contains two header rows. \n\n\n\n\n\n\n\n\nJun 22, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nBasic Exploration of WHO Tuberculosis Data\n\n\nToday I am going to dive into some real life data from the World Health Organization (WHO), exploring new and relapse cases of Tuberculosis. I clean up the data, and then make a few graphs to explore different variables. \n\n\n\n\n\n\n\n\nFeb 13, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nLine Graphs and Interactivity\n\n\nTableau for Healthcare Chapter 10. Static and Interactive examples \n\n\n\n\n\n\n\n\nFeb 10, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nFacets and a Lesson in Humility\n\n\nA look at Tableau for Healthcare Chapter 8. Table Lens graph. \n\n\n\n\n\n\n\n\nJan 29, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\n\n\n\n\n\n\nMy Start to R\n\n\nA short introduction to my blog, and R journey. \n\n\n\n\n\n\n\n\nJan 24, 2020\n\n\nKyle Belanger\n\n\n\n\n\n\nNo matching items"
2023-10-12 08:52:22 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "index.html",
"href": "index.html",
"title": "About",
"section": "",
"text": "I am a highly accomplished Medical Technologist with an extensive 14-year track record in the medical industry, consistently demonstrating the ability to effectively bridge the divide between medical professionals and information technologists. Proficient in the application of machine learning techniques to enhance medical data analysis and adept at developing innovative R Shiny apps to streamline healthcare processes and improve patient outcomes."
2023-10-12 08:52:22 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "index.html#bio",
"href": "index.html#bio",
"title": "About",
"section": "",
"text": "I am a highly accomplished Medical Technologist with an extensive 14-year track record in the medical industry, consistently demonstrating the ability to effectively bridge the divide between medical professionals and information technologists. Proficient in the application of machine learning techniques to enhance medical data analysis and adept at developing innovative R Shiny apps to streamline healthcare processes and improve patient outcomes."
2023-10-12 08:52:22 -04:00
},
2023-10-12 09:05:03 -04:00
{
2024-08-27 11:43:02 -04:00
"objectID": "index.html#education",
"href": "index.html#education",
"title": "About",
"section": "Education",
"text": "Education\nCampbell University | Buies Creek, NC\nDoctorate of Health Sciences | August 2020 - May 2023\nUniversity of Central Florida | Orlando, FL\nM.S. in Healthcare Informatics | August 2018 - May 2020\nWestern Carolina University | Cullowhee, NC\nB.S. in Clinical Laboratory Science | August 2005 - May 2009"
2023-10-12 11:26:16 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "index.html#experience",
"href": "index.html#experience",
"title": "About",
"section": "Experience",
"text": "Experience\nRoche Diagnositcs | IT Workflow Consultant | Oct 2021 - Present\nRoche Diagnostics | Field Application Specialist | July 2012 - Sept 2021\nCape Fear Valley Hospital | Lead Medical Laboratory Scientist | June 2011 - July 2012\nCape Fear Valley Hospital | Medical Laboratory Scientist | June 2009 - June 2011"
2023-10-12 11:26:16 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-01-29_facets-and-humility/facets-and-a-lesson-in-humility.html",
"href": "posts/2020-01-29_facets-and-humility/facets-and-a-lesson-in-humility.html",
"title": "Facets and a Lesson in Humility",
"section": "",
"text": "Todays post is a lesson in Facets, as well as humility. The task this week was to replicate the graph in Chapter 8 of Tableau for Healthcare in R. The graph in question is called a Table Lens (This is the name the book uses, however I did have trouble finding this name in Google searches), it is a collection of charts with a common theme, this time looking at countries in various WHO regions and some statistics associated with mortality as well as health expenditure. I say this is a lesson in humiltiy as I have read through the excellent book R for Data Science, and yet the idea of faceting a ggplot graph slipped my mind. This ended with hours of trying to find a package in R to line up graphs, and way more time then I care to admit spent on getting things prefect. I did find such a package called cowplots, which can be found here. While this is an excellent package, its use was unecessary and I reverted back to using the excellent facet feature of GGplot, which can be seen below! \n\nLoad Libraries\n\nlibrary(magrittr) #pipes\nlibrary(ggplot2) #ploting \nlibrary(dplyr)\nlibrary(tidyr)\n\n\n\nImport Data\n\nds <- readxl::read_xlsx(path = \"../2020-01-04_my-start-to-r/Tableau 10 Training Practice Data.xlsx\"\n ,sheet = \"03 - WHO Life Expect & Mort\"\n )\n\n\n\nClean Names and Transform\n\nvarnames <- c(\"who_region\", \"country\", \"year\" , \"sex\" , \"life_expect_birth\" , \"neo_mort\"\n ,\"under_five_mort\" , \"health_expenditure\")\nnames(ds) <- varnames\n\n# Order Countries based on Life Expectancy at Birth\n\nds$country <- factor(ds$country, levels = ds$country[order(ds$life_expect_birth)]) \n\n#To \"Long\" Form\n\nds1 <- ds %>% pivot_longer(5:8)#select columns 5 throuh 8, leave new columns at default names\n\n# Set up labels for Facet, as well as function for Facet Labeller\n\nfacet_labels <- list(\n\"life_expect_birth\" = \"Life Expectancy at Birth \" \n,\"neo_mort\" = \"Neonatal Mortality Rate\" \n,\"under_five_mort\" = \"Under-Five Mortality Rate\"\n,\"health_expenditure\" = \"Health Expenditure per Capita (US$)\" )\n\nvariable_labeller <- function(variable,value){\n return(facet_labels[value])\n}\n\n\n\nGraphs\n\nhightlight_countries <- (c(\"Mauritania\", \"South Africa\")) \n\ng1 <- ds1 %>% filter(who_region == \"Africa\") %>% \n mutate(name = factor(name, levels = c(\"life_expect_birth\" , \"neo_mort\"\n ,\"under_five_mort\" , \"health_expenditure\"))\n ,highlight = country %in% hightlight_countries) %>% \n ggplot(aes(x = country, y = value, fill = highlight)) +\n geom_col(show.legend = FALSE) +\n coord_flip() +\n labs(\n title = \"World Bank Life Expectancy, Neonatal & Under-Five Mortality Rates, and Health Expenditure Analysis\"\n ,x = NULL\n ,y = NULL\n ) +\n facet_grid(~name, scales = \"free_x\",labeller = variable_labeller) +\n theme_bw() +\n geom_text(aes(label = round(value, 0)), hjust = 0) +\n scale_y_continuous(expand = expand_scale(mult = c(0,0.2))) +\n scale_fill_manual(values = c(\"TRUE\" = \"#fc8d59\", \"FALSE\" = \"#2b83ba\"))\ng1\n\n\n\n\n\n\n\n\n\n\n\n\nReuseCC BY 4.0CitationBibTeX citation:@online{belanger2020,\n author = {Belanger, Kyle},\n title = {Facets and a {Lesson} in {Humility}},\n date = {2020-01-29},\n langid = {en}\n}\nFor attribution, please cite this work as:\nBelanger, Kyle. 2020. “Facets and a Lesson in Humility.”\nJanuary 29, 2020."
2023-10-12 11:26:16 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html",
"href": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html",
"title": "Basic Exploration of WHO Tuberculosis Data",
2023-10-12 11:26:16 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Today I am going to dive into some real life data from the World Health Organization (WHO), exploring new and relapse cases of Tuberculosis. I clean up the data, and then make a few graphs to explore different variables."
2023-10-12 11:26:16 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#a-different-way-to-look",
"href": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#a-different-way-to-look",
"title": "Basic Exploration of WHO Tuberculosis Data",
"section": "A different way to look",
"text": "A different way to look\nCould there be any correlation between a countries population and the amount of TB cases? Maybe its just as simple as having more people means more people to get sick? Lets bring in another data set, again from World Bank Found Here, this contains total population data by country.\n\npop_raw <- read.csv(\"API_SP.POP.TOTL_DS2_en_csv_v2_713131.csv\"\n ,skip = 4)\n#If this looks famialer its because it is, the data set looks very simalar to the GDP data\n#In the future this could be moved to a function to allow cleaning much easier\npop1 <- pop_raw %>% \n select(-(Indicator.Name:X2012)\n ,-X2019\n ,-X) %>% \n pivot_longer(cols = X2013:X2018\n ,names_to = \"year\" \n ,values_to = \"population\") %>% \n mutate_if(is.character\n ,str_remove_all\n ,pattern = \"X(?=\\\\d*)\")\n\n#now lets combine this into are overall data set\n\nwho_combined <- who_combined %>% \n mutate(year = as.character(year)) %>% \n left_join(y = pop1) %>% \n select(-Country.Name)\n\n#now lets Graph again\n\ng3 <- who_combined %>% \n filter(str_detect(age,\"014|15plus|u\"),year == 2018) %>% \n group_by(country) %>% \n summarise(sum_tb_cases = (sum(values,na.rm = TRUE)/10000)\n ,population = first(population)/1000000\n ,who_region = first(g_whoregion)) %>% \n mutate(\n label = ifelse((population>250), yes = as.character(country),no = \"\")) %>%\n ggplot(aes(x = population, y = sum_tb_cases )) +\n geom_point(aes(color = who_region)) +\n ggrepel::geom_text_repel(aes(x = population, y = sum_tb_cases, label = label)) +\n labs(\n title = \"Total TB Cases by Country compared to Gross Domestic Product (GDP)\"\n ,x = \"Population (in Millions)\"\n ,y = \"Total TB Case (per 10,000 cases)\"\n ,color = \"WHO Region\"\n ) +\n theme_bw() \n\n g3 \n\n\n\n\n\n\n\n\n\nFurther Exploration\nMaybe we are on to something, the more people, the more likely they are to get sick! However India seems to have a very large number of cases so lets break these cases down further by age group for 2018.\n\ng4 <- who_combined %>% \n filter(year == 2018\n ,country == \"India\"\n ,!(str_detect(age,\"15plus|ageunk|u|014\"))\n ,(str_detect(sex,\"m|f\"))\n ) %>% \n mutate(age_range = glue::glue(\"{age_start} -- {age_end}\")) %>% \n ggplot(aes(x = reorder(age_range, as.numeric(age_start)), y = (values/1000), fill = sex)) +\n geom_col(position = \"dodge\") +\n labs(\n title = \"TB Case in India by age and gender 2018\"\n ,x = NULL\n ,y = \"Total Cases (per 1000)\"\n ,fill = \"Gender\") +\n scale_fill_manual(labels = c(\"Female\",\"Male\"), values = c(\"#e9a3c9\",\"#67a9cf\") )\n \ng4\n\n\n\n\n\n\n\n\nThere seems to be a huge spike in cases after adolescences. Females have a sharp decline the older they get, where as male case stay elevated with a slight decrease at 55."
},
{
"objectID": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#last-exploration",
"href": "posts/2020-02-13_basic-who-TB-data/basic-exploration-of-who-tuberculosis-data.html#last-exploration",
"title": "Basic Exploration of WHO Tuberculosis Data",
"section": "Last Exploration",
"text": "Last Exploration\nLets look at overall cases in India, going back to 1980 and see if there as been any trends. To get these numbers we will go back to our raw data and strip everything out expect the total count\n\ng5 <- who_raw %>% \n filter(country == \"India\") %>% \n select(year, c_newinc) %>% \n ggplot(aes(x = year, y = c_newinc/1000000)) +\n geom_line() +\n geom_point() +\n labs(\n title = \"New and Relapse Tuberculosis Cases In India \\n1980 -- 2018\"\n ,x = NULL\n ,y = \"Total Cases (in millions)\") +\n theme_bw() +\n theme(plot.title = element_text(hjust = 0.5)) + #center title \n scale_x_continuous(breaks = seq(1980,2020,5)) +\n scale_y_continuous(breaks = scales::pretty_breaks(n=10)) #different way to add tick marks\ng5\n\n\n\n\n\n\n\n\nCases were steadily rising from 1980 to 1990, then suddenly feel off. Starting in the early 2010s there was a sharp increase and the amount of new and relapse cases just keep growing."
},
{
"objectID": "posts/2020-06-25_diabetes-prevalence-in-nc/diabetes-in-rural-north-carolina-exploring-prevalence-trends.html",
"href": "posts/2020-06-25_diabetes-prevalence-in-nc/diabetes-in-rural-north-carolina-exploring-prevalence-trends.html",
"title": "Diabetes in Rural North Carolina : Exploring Prevalence Trends",
2023-10-12 11:26:16 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "Update\n2022-15-03: Since this was posted the CDC has updated how county level diabetes prevalance is calculated. The data presented here is using previous calcualtions and may no longer be correct. More can be read here\n\n\nAbstract\nDiabetes is growing at an epidemic rate in the United States. In North Carolina alone, diabetes and prediabetes cost an estimated $10.9 billion each year (American Diabetes Asssociation, 2015). This post introduces the exploration of the Diabetes epidemic in North Carolina. Through a series of posts this project will examine various public data available on diabetes and explore possible solutions to address the rise of diabetes in North Carolina. This investigation stems from the Capstone project of my Health Care Informatics Masters program. This post will answer the following questions:\n\n\nWhat is the overall trend of diabetes prevalence in the United States?\n\n\n\n\nWhat is the trend of diabetes at a State Level and how does diabetes prevalence vary by state and region?\n\n\n\n\nHow do trends in diabetes prevalence vary across counties of North Carolina?\n\n\n\n\nIn which counties of North Carolina does the largest change in diabetes prevalence occur?\n\n\n\n\nHow does change in diabetes prevalence compare between rural and urban counties?\n\n\n\n\nEnviroment\nThis section contains technical information for deeper analysis and reproduction. Casual readers are invited to skip it.\nPackages used in this report.\n\n\nCode\n# Attach these packages so their functions don't need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nlibrary(magrittr) # enables piping : %>%\nlibrary(dplyr) # data wrangling\nlibrary(ggplot2) # graphs\nlibrary(tidyr) # data tidying\nlibrary(maps)\nlibrary(mapdata)\nlibrary(sf)\nlibrary(readr)\n\n\nDefinitions of global object (file paths, factor levels, object groups ) used throughout the report.\n\n\nCode\n#set ggplot theme\nggplot2::theme_set(theme_bw())\n\n\n\n\nData\nThe data for this exploration comes from several sources:\n\nThe Diabetes data set for state and county levels were sourced from the US Diabetes Surveillance System; Division of Diabetes Translation - Centers for Disease Control and Prevention. The data was downloaded one year per file, and compiled into a single data set for analysis.\nThe Diabetes data set for National level data were sourced from the CDCs National Health Interview Survey (NHIS)\nThe list of rural counties was taken from The Office of Rural Health Policy, the list is available here\n\n\n\n\nCode\n# load the data, and have all column names in lowercase\n\nnc_diabetes_data_raw <- read_csv(\"https://raw.githubusercontent.com/mmmmtoasty19/nc-diabetes-epidemic-2020/62bdaa6971fbdff09214de7c013d40122abbe40d/data-public/derived/nc-diabetes-data.csv\") %>% \n rename_all(tolower)\n\nus_diabetes_data_raw <- read_csv(\"https://github.com/mmmmtoasty19/nc-diabetes-epidemic-2020/raw/62bdaa6971fbdff09214de7c013d40122abbe40d/data-public/raw/us_diabetes_totals.csv\"\n ,skip = 2)\n\nrural_counties <- read_csv(\"https://github.com/mmmmtoasty19/nc-diabetes-epidemic-2020/raw/b29bfd93b20b73a7000d349cb3b55fd0822afe76/data-public/metadata/rural-counties.csv\")\n\ncounty_centers_raw <- read_csv(\"https://github.com/mmmmtoasty19/nc-diabetes-epidemic-2020/raw/b29bfd93b20b73a7000d349cb3b55fd0822afe76/data-public/raw/nc_county_centers.csv\", col_names = c(\"county\", \"lat\",\"long\"))\n\ndiabetes_atlas_data_raw <- read_csv(\"https://raw.githubusercontent.com/mmmmtoasty19/nc-diabetes-epidemic-2020/b29bfd93b20b73a7000d349cb3b55fd0822afe76/data-public/raw/DiabetesAtlasData.csv\"\n ,col_types = cols(LowerLimit = col_skip(), \n UpperLimit = col_skip(),\n Percentage = col_double()), skip = 2)\n\n\n\n\n\nCode\n# load in both US State Map and NC County Map\n\nnc_counties_map_raw <- st_as_sf(map(\"county\",region = \"north carolina
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html",
"href": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html",
"title": "Converting From Blogdown to Distill",
2024-08-23 16:30:17 -04:00
"section": "",
2024-08-27 11:43:02 -04:00
"text": "I have since converted this blog to a quarto blog, but am leaving this post up in case anyone finds it useful"
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html#code-folding",
"href": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html#code-folding",
"title": "Converting From Blogdown to Distill",
"section": "Code Folding",
"text": "Code Folding\nWhen I converted my blog on 12/30/2020, code folding was not included as an option by default in distill. At that time, an excellent package called Codefolder added the functionality. Since going live with the blog, code folding has been added to distill.1 Code folding is available for either the whole document or individual code sections. The default caption is “Show Code”, but instead of typing code_folding=TRUE, you can provide a string to change the caption.\n\n# Some awesome code \n# That does awesome things"
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html#customizing-the-home-page",
"href": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html#customizing-the-home-page",
"title": "Converting From Blogdown to Distill",
"section": "Customizing the Home Page",
"text": "Customizing the Home Page\nBy default, a distill blogs home page will be the blog index page. I chose to edit my home page to be a landing page for myself and then have the blog index as a separate page. When creating a new blog, this is the default YAML header for your index page.\n---\ntitle: \"New Site\"\nsite: distill::distill_website\nlisting: posts\n---\nThe critical piece here is the line site: distill::distill_website. This line is what is needed to render the website. For my home page, I decided to use the package Postcard, which is used to generate simple landing pages. I wont go into every step as there is already a great post by Alison Hill on how to do that. However, I will point out the most crucial part of the new index page the YAML header needs to contain these two lines.\noutput:\n postcards::trestles\nsite: distill::distill_website"
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html#footnotes",
"href": "posts/2021-01-12_blogdown-to-distill/creating-a-distill-blog.html#footnotes",
"title": "Converting From Blogdown to Distill",
"section": "Footnotes",
"text": "Footnotes\n\n\nNote that as of publishing, code folding is only available in the development version of distill↩"
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-05-15-US-NHL-Birthrate/index.html",
"href": "posts/2024-05-15-US-NHL-Birthrate/index.html",
"title": "Does a US Born Players Birthdate affect their shot at the NHL",
"section": "",
"text": "This post is inspired by this fantastic blog post on Jlaws Blog. In it they explore how in the first chapter Malcolm Gladwells Outliers he discusses how in Canadian Junior Hockey there is a higher likelihood for players to be born in the first quarter of the year. As it appears cutoff dates for USA hockey are different and they are currently using June 1st (if my internet searches are to be believed), I wondered if the same analysis would hold true for American Born Players."
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-05-15-US-NHL-Birthrate/index.html#distribution-of-births-by-month-in-the-united-states",
"href": "posts/2024-05-15-US-NHL-Birthrate/index.html#distribution-of-births-by-month-in-the-united-states",
"title": "Does a US Born Players Birthdate affect their shot at the NHL",
"section": "Distribution of Births by Month in the United States",
"text": "Distribution of Births by Month in the United States\nThe data for US Birth Rates can be pulled from CDC Wonder. The particular table of interest is the Natality, 2007 - 2022. CDC Wonder has a quite interesting API that requires a request with quite a few XML parameters. Thankfully you can build the request on the website and a nice package already exists to send the query. Check out the Wonderapi Page for more info.\n\nusa_raw <- wonderapi::send_query(\"D66\", here::here(\"posts\", \"2024-05-15-US-NHL-Birthrate\", \"cdc_wonder_request.xml\"))\n\nusa_births <- usa_raw %>%\n dplyr::group_by(Month) %>%\n dplyr::summarise(country_births = sum(Births), .groups = \"drop\") %>%\n dplyr::mutate(country_pct = country_births / sum(country_births))\n\n\nDistribution of Births Compared to Expected\nThe data from CDC Wonder pulls in quite nice, the only addition is adding a column for expected Births. This column gives each day of each month an equal chance for a person being born. Based on the data the summer months (June through August), and September have a slightly higher actual birth vs expected. Based on cut off Dates many of these kids would be the oldest in their age groups.\n\nusa_births %>%\n dplyr::mutate(expected_births = dplyr::case_when(\n Month %in% c(\"April\", \"June\", \"September\", \"November\") ~ 30 / 365\n , Month == \"February\" ~ 28 / 365\n , .default = 31 / 365\n )\n , difference = country_pct - expected_births\n , dplyr::across(Month, ~factor(., levels = month.name))\n , dplyr::across(c(country_pct, expected_births, difference), ~scales::percent(., accuracy = .1))\n ) %>%\n dplyr::arrange(Month) %>%\n dplyr::rename_with(~stringr::str_replace_all(., \"_\", \" \")) %>%\n dplyr::rename_with(stringr::str_to_title) %>%\n kableExtra::kbl() %>%\n kableExtra::kable_styling()\n\n\n\n\nMonth\nCountry Births\nCountry Pct\nExpected Births\nDifference\n\n\n\n\nJanuary\n5118343\n8.2%\n8.5%\n-0.3%\n\n\nFebruary\n4758741\n7.6%\n7.7%\n-0.1%\n\n\nMarch\n5205579\n8.3%\n8.5%\n-0.2%\n\n\nApril\n5001651\n8.0%\n8.2%\n-0.3%\n\n\nMay\n5226642\n8.3%\n8.5%\n-0.2%\n\n\nJune\n5226141\n8.3%\n8.2%\n0.1%\n\n\nJuly\n5528731\n8.8%\n8.5%\n0.3%\n\n\nAugust\n5635283\n9.0%\n8.5%\n0.5%\n\n\nSeptember\n5448101\n8.7%\n8.2%\n0.5%\n\n\nOctober\n5348495\n8.5%\n8.5%\n0.0%\n\n\nNovember\n5059952\n8.1%\n8.2%\n-0.2%\n\n\nDecember\n5227828\n8.3%\n8.5%\n-0.2%"
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-05-15-US-NHL-Birthrate/index.html#hockey-data",
"href": "posts/2024-05-15-US-NHL-Birthrate/index.html#hockey-data",
"title": "Does a US Born Players Birthdate affect their shot at the NHL",
"section": "Hockey Data",
"text": "Hockey Data\nWhile I wish I could sit and type out how I sat and figured out the complexity of the NHL Stats API and how to bring it into R. In reality I took a great guide, that being Jlaws post, and tweaked what I needed. Instead of Canadian players, I pulled out just the US Born players and their birth dates. I did also pull out positions to see if that will make any sort of difference. What pulls out of the NHL API has a ton of great details and I look forward to diving into what is available to see what kind of graphics can be built.\n08/27/2024 Update Due to the the Coyotes moving to Utah, I had to edit the code slightly to adjust for this. When gathering the active roster data the API was returning a blank response. This was causing Tidyr Hoist to fail because it could not pull the columns from the nested data frame. I added a check to see if the data frame is empty and if it is, then I return an empty data frame and skip this step.\n\nteams <- httr::GET(\"https://api.nhle.com/stats/rest/en/team\") %>%\n httr::content() %>%\n .[[\"data\"]] %>%\n tibble::tibble(data = .) %>%\n tidyr::unnest_wider(data)\n\nget_roster <- function(team){\n df <- httr::GET(glue::glue(\"https://api-web.nhle.com/v1/roster/{team}/20232024\")) %>%\n httr::content() %>%\n purrr::flatten() %>%\n tibble::tibble(data = .)\n\n if (!nrow(df) == 0) {\n df <- df |>\n tidyr::hoist(\n .col = \"data\"\n , \"firstName\" = list(\"firstName\", 1L)\n , \"lastName\" = list(\"lastName\", 1L)\n , \"positionCode\"\n , \"birthDate\"\n , \"birthCountry\"\n )\n }\n return(df)\n}\n\nusa_roster <- purrr::map(teams$triCode, get_roster) %>%\n purrr::list_rbind() %>%\n dplyr::filter(!is.na(firstName)) %>%\n dplyr::filter(birthCountry == \"USA\") %>%\n dplyr::mutate(\n mob = lubridate::month(lubridate::ymd(birthDate), label = TRUE, abbr = FALSE)\n , mob_id = lubridate::month(lubridate::ymd(birthDate))\n ) %>%\n dplyr::count(mob_id, mob, name = \"players\") %>%\n dplyr::mutate(player_pct = players / sum(players))"
2024-08-23 16:30:17 -04:00
},
{
2024-08-27 11:43:02 -04:00
"objectID": "posts/2024-05-15-US-NHL-Birthrate/index.html#graph-it",
"href": "posts/2024-05-15-US-NHL-Birthrate/index.html#graph-it",
"title": "Does a US Born Players Birthdate affect their shot at the NHL",
"section": "Graph It",
"text": "Graph It\nLets now take a look at the graph. Using the ggimage package we can place nice logos for both the United States and NHL on the graph. This stands out quite nicely versus just using a colored point. Interesting enough the graph seems to show being born early on in the year may mean making the NHL is more likely.\n\nnhl_icon <- \"https://pbs.twimg.com/media/F9sTTAYakAAkRv6.png\"\nusa_icon <- \"https://cdn-icons-png.flaticon.com/512/197/197484.png\"\n\ncombined <- usa_roster %>%\n dplyr::left_join(usa_births, by = c(\"mob\" = \"Month\")) %>%\n dplyr::mutate(\n random = dplyr::case_when(\n mob_id %in% c(4, 6, 9, 11) ~ 30 / 365,\n mob_id %in% c(1, 3, 5, 7, 8, 10, 12) ~ 31 / 365,\n mob_id == 2 ~ 28 / 365\n )\n )\n\n# labels <- combined %>% glue::glue_data(\"{mob} <br> n = {players}\")\n\ng1 <- combined %>%\n ggplot(aes(x = forcats::fct_reorder(mob, -mob_id))) +\n geom_line(aes(y = random, group = 1), linetype = 2, color = \"grey60\") +\n geom_linerange(aes(ymin = country_pct, ymax = player_pct)) +\n geom_image(aes(image = nhl_icon, y = player_pct), size = 0.1) +\n geom_image(aes(image = usa_icon, y = country_pct), size = 0.08) +\n geom_text(aes(label = scales::percent(player_pct, accuracy = .1),\n y = dplyr::if_else(player_pct > country_pct, player_pct + .006, player_pct - .006)), size = 5) +\n geom_text(aes(label = scales::percent(country_pct, accuracy = .1),\n y = dplyr::if_else(country_pct > player_pct, country_pct + .006, country_pct - .006)), size = 5) +\n scale_y_continuous(labels = scales::percent) +\n # scale_x_discrete(labels = labels) +\n coord_flip() +\n labs(\n x = \"Month of Birth\"\n , y = \"Percentage of Births\"\n , title = \"Are United States Born NHL Players More Likely to be Born Early in the Year?\"\n , subtitle = \"Comparing the distribution of birth months between US NHL players and US in general\"\n , caption = glue::glue(\n \"<img src = {nhl_icon} width = '15' height=' 15' /> - US NHL Players Birth Month Distribution <br />\n <img src = {usa_icon} width = '15' height=' 15' /> - US Birth Month (2007-2022) Distribution\"\n )\n ) +\n theme_minimal() +\n theme(\n plot.caption = element_markdown()\n , plot.title.position = \"plot\"\n , text = element_text(size = 16)\n , axis.text = element_markdown()\n )\n\n\ng1\n\n\n\n\n\n\n\n# Stats ----\n\nbroom::tidy(chisq.test(x = combined$players, p = combined$country_pct))\n\n# A tibble: 1 × 4\n statistic p.value parameter method \n <dbl> <dbl> <dbl> <chr> \n1 7.34 0.771 11 Chi-squared test for given probabilities\n\n\nIf we look at this from a more stats based perspective, running a chi square test on the amount of players in the NHL per month, based on the US expected birth rate, we do see however there is quite a high p value. This is lets us know we can not reject the Null hypothesis that these are the same thing."
2024-09-27 14:53:58 -04:00
},
{
"objectID": "posts/2024-08-28-Genie_Microblog_Part1/index.html",
"href": "posts/2024-08-28-Genie_Microblog_Part1/index.html",
"title": "GenieFramework Microblog Part 1",
"section": "",
"text": "Quite some time ago I started attempting to learn python and web development. I claim in no way to be even close to an expert on either one of those things, in fact I am truly a beginner at both. While I never continued with python I have always enjoyed web development, creating quite a few small R Shiny apps along the way. As I have decided to learn Julia I am instantly drawn to Web Development, and I decided to try out the Genie Framework. While there exist some tutorials on the web, I find the all contain small pieces of information but lack putting everything together. Form my learning python days, I know that there exists a WONDERFUL tutorial for the python Flask framework (Found Here). I decided to challenge myself and recreate his website using Genie, and the Model, View, Controller model. I will attempt to document what I do to try and help others along the way. As stated above I AM NOT AN EXPERT, so at any time there is a good chance I am not doing something the best way possible! I encourage everyone to follow along and make suggestions for improvements. I am going to try my best to go in the order Miguel did, but for some chapters I will skip sections or combine things as needed to make them work for the framework."
},
{
"objectID": "posts/2024-08-28-Genie_Microblog_Part1/index.html#introduction",
"href": "posts/2024-08-28-Genie_Microblog_Part1/index.html#introduction",
"title": "GenieFramework Microblog Part 1",
"section": "",
"text": "Quite some time ago I started attempting to learn python and web development. I claim in no way to be even close to an expert on either one of those things, in fact I am truly a beginner at both. While I never continued with python I have always enjoyed web development, creating quite a few small R Shiny apps along the way. As I have decided to learn Julia I am instantly drawn to Web Development, and I decided to try out the Genie Framework. While there exist some tutorials on the web, I find the all contain small pieces of information but lack putting everything together. Form my learning python days, I know that there exists a WONDERFUL tutorial for the python Flask framework (Found Here). I decided to challenge myself and recreate his website using Genie, and the Model, View, Controller model. I will attempt to document what I do to try and help others along the way. As stated above I AM NOT AN EXPERT, so at any time there is a good chance I am not doing something the best way possible! I encourage everyone to follow along and make suggestions for improvements. I am going to try my best to go in the order Miguel did, but for some chapters I will skip sections or combine things as needed to make them work for the framework."
},
{
"objectID": "posts/2024-08-28-Genie_Microblog_Part1/index.html#getting-started",
"href": "posts/2024-08-28-Genie_Microblog_Part1/index.html#getting-started",
"title": "GenieFramework Microblog Part 1",
"section": "Getting Started",
"text": "Getting Started\nMiguels blog does a great job of going into installing python and flask as well as setting up virtual environments in python. I am going to skip most of this as there is great documentation out there on how to install Julia and set up a project (Genie will actually take care of this for us). Instead I will link here what I would say are the three prerequisites for getting started.\n\nDownload and install Julia\nThe IDE of your choice (I use VSCode, and the Julia Extension)\nAdd Genie to your Julia environment (see below)\n\nTo add Genie to your Julia environment, open the Julia REPL and type the following:\npkg> add Genie # press ] from julia> prompt to enter Pkg mode"
},
{
"objectID": "posts/2024-08-28-Genie_Microblog_Part1/index.html#creating-the-app",
"href": "posts/2024-08-28-Genie_Microblog_Part1/index.html#creating-the-app",
"title": "GenieFramework Microblog Part 1",
"section": "Creating The App",
"text": "Creating The App\nGenie will take care of creating a new directory for us, but we will want to open the Julia REPL from whatever directory we want the app folder to live in. Once that has been decided open a Julia REPL and type the following:\njulia> using Genie\n\njulia> Genie.Generator.newapp(\"Microblog\")\nUpon executing the command, Genie will:\n\nmake a new dir called Microblog and cd() into it,\ninstall all the apps dependencies\ncreate a new Julia project (adding the Project.toml and Manifest.toml files),\nactivate the project,\nautomatically load the new apps environment into the REPL,\nstart the web server on the default Genie port (port 8000) and host (127.0.0.1 aka localhost).\n\nAt this point you can confirm that everything worked as expected by visiting http://127.0.0.1:8000 in your favorite web browser. You should see Genies welcome page. If at any point you want to exit the REPL and reload the app perform the following:\njulia> using Genie\n\njulia> Genie.loadapp()\n\njulia> up()\nThis will reload the app and activate the web server. You can again visit http://127.0.0.1:8000 to test that everything is working."
},
{
"objectID": "posts/2024-08-28-Genie_Microblog_Part1/index.html#creating-a-hello-world-genie-app",
"href": "posts/2024-08-28-Genie_Microblog_Part1/index.html#creating-a-hello-world-genie-app",
"title": "GenieFramework Microblog Part 1",
"section": "Creating a Hello World Genie App",
"text": "Creating a Hello World Genie App\nWhile Genie by default has a welcome page, lets change it to a simple Hello World page to make the app our own. Open routes.jl and change the “/” route to the following:\n\n\nroutes.jl\n\nroute(\"/\") do\n \"Hello World!\"\nend\n\nIf we go to http://127.0.0.1:8000 we should now see the following:\n\n\n\nHello World Screenshot"
2023-10-11 09:41:10 -04:00
}
]