Kyle Belanger
2020-06-22
Importing Excel Data with Multiple Header Rows
<h1 class="title">Importing Excel Data with Multiple Header Rows</h1>
<p class="subtitle lead"></p><p>A solution for importing Excel Data that contains two header rows.</p><p></p>
<div class="quarto-title-meta">
<div class="quarto-title-meta-heading">Author</div>
<div class="quarto-title-meta-contents">
<p>Kyle Belanger</p>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">June 22, 2020</p>
<section id="problem" class="level1">
<p>Recently I tried to import some Microsoft Excel data into R and ran into an issue where the data actually had two different header rows. The top row listed a group, and the second row listed a category within that group. Searching Google, I couldn’t find a good example of what I was looking for, so I am putting it here in hopes of helping someone else!</p>
<section id="example-data" class="level1">
<h1>Example Data</h1>
<p>I have created a small Excel file to demonstrate what I am talking about. Download it <a href="https://github.com/mmmmtoasty19/kyleb/tree/master/content/post/2020-06-15-importing-excel-data-with-multiple-headers/example_data.xlsx">here</a>. This is the data from Excel. <img src="example_data_img1.png" class="img-fluid" alt="image of example data"></p>
<section id="check-data" class="level1">
<h1>Check Data</h1>
<p>First we will read the file in using the package readxl and view the data without doing anything special to it.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(readxl) <span class="co"># load the readxl library</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(tidyverse) <span class="co"># load the tidyverse for manipulating the data</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>file_path <span class="ot"><-</span> <span class="st">"example_data.xlsx"</span> <span class="co"># set the file path</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>ds0 <span class="ot"><-</span> <span class="fu">read_excel</span>(file_path) <span class="co"># read the file</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>ds0</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 7 × 7
Name `Test 1` ...3 ...4 `Test 2` ...6 ...7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
2 Max 22 23 24 25 26 27
3 Phoebe 34 34 32 34 51 12
4 Scamp 35 36 21 22 23 24
5 Chance 1234 1235 1236 1267 173 1233
6 Aimee 420 123 690 42 45 12
7 Kyle 22 23 25 26 67 54 </code></pre>
<section id="new-header-names" class="level1">
<h1>New Header Names</h1>
<section id="step-1" class="level3">
<h3 class="anchored" data-anchor-id="step-1">Step 1</h3>
<p>First, let’s read the data again, this time with some options. We set n_max to 2 to read only the first two rows, and set col_names to FALSE so that the first row is not treated as headers.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>ds1 <span class="ot"><-</span> <span class="fu">read_excel</span>(file_path, <span class="at">n_max =</span> <span class="dv">2</span>, <span class="at">col_names =</span> <span class="cn">FALSE</span>)</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>ds1</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 2 × 7
...1 ...2 ...3 ...4 ...5 ...6 ...7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Name Test 1 <NA> <NA> Test 2 <NA> <NA>
2 <NA> Run 1 Run 2 Run 3 Run 1 Run 2 Run 3</code></pre>
<section id="step-2" class="level3">
<h3 class="anchored" data-anchor-id="step-2">Step 2</h3>
<p>Now that we have our headers, let’s first transpose them into a vertical matrix using the base function t(), then turn the result back into a tibble so that we can use tidyr’s fill() function.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>names <span class="ot"><-</span> ds1 <span class="sc">%>%</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">t</span>() <span class="sc">%>%</span> <span class="co">#transpose to a matrix</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">as_tibble</span>() <span class="co">#back to tibble</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>names</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 7 × 2
V1 V2
<chr> <chr>
1 Name <NA>
2 Test 1 Run 1
3 <NA> Run 2
4 <NA> Run 3
5 Test 2 Run 1
6 <NA> Run 2
7 <NA> Run 3</code></pre>
<p>Note that tidyr’s fill() cannot work row-wise, hence the need to flip the tibble so that it is long rather than wide.</p>
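<p>Here is a minimal sketch of the transpose step that runs without the Excel file, assuming the two header rows are typed in by hand as a matrix. The column names group and category are my own illustrative choices, not something readxl produces; setting them explicitly means the result does not depend on the automatic V1 and V2 names.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tibble)

# Hand-typed stand-in for the two header rows of the example file
header_rows <- rbind(
  c("Name", "Test 1", NA, NA, "Test 2", NA, NA),
  c(NA, "Run 1", "Run 2", "Run 3", "Run 1", "Run 2", "Run 3")
)

# Transpose so that each header column becomes one row (7 rows, 2 columns)
header_long <- t(header_rows)

# Illustrative column names (an assumption, not from the original file)
colnames(header_long) <- c("group", "category")

# Back to a tibble so tidyr functions can be applied column-wise
header_tbl <- as_tibble(header_long)
header_tbl</code></pre></div>
</div>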
<section id="step-3" class="level3">
<h3 class="anchored" data-anchor-id="step-3">Step 3</h3>
<p>Now we use tidyr’s fill() function to replace each NA with the nearest non-missing value above it.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>names <span class="ot"><-</span> names <span class="sc">%>%</span> <span class="fu">fill</span>(V1) <span class="co">#use tidyr fill to fill in the NA's</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>names</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 7 × 2
V1 V2
<chr> <chr>
1 Name <NA>
2 Test 1 Run 1
3 Test 1 Run 2
4 Test 1 Run 3
5 Test 2 Run 1
6 Test 2 Run 2
7 Test 2 Run 3</code></pre>
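<p>To make the direction of the fill concrete, here is a small sketch on made-up labels (not the example file). By default fill() carries values downward, which suits this layout because the group label sits in the first column of each group. The .direction argument is part of tidyr; the data below is purely illustrative.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tibble)
library(tidyr)

# Made-up labels with gaps, including a gap in the first position
header_labels <- tibble(group = c(NA, "Test 1", NA, "Test 2", NA))

# Default direction is "down": each NA takes the value above it,
# so a leading NA has nothing to copy and stays NA
filled_down <- fill(header_labels, group)
filled_down$group
# NA "Test 1" "Test 1" "Test 2" "Test 2"

# "up" copies from the value below instead, leaving a trailing NA
filled_up <- fill(header_labels, group, .direction = "up")
filled_up$group
# "Test 1" "Test 1" "Test 2" "Test 2" NA</code></pre></div>
</div>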
<section id="step-4" class="level3">
<h3 class="anchored" data-anchor-id="step-4">Step 4</h3>
<p>This is where my data differed from many of the examples I could find online. Because the second row is also a header, we cannot simply discard it. We can solve this by using paste() inside dplyr’s mutate() to form a new column that combines the first and second columns.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>names <span class="ot"><-</span> names <span class="sc">%>%</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">mutate</span>(</span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> <span class="at">new_names =</span> <span class="fu">paste</span>(V1,V2, <span class="at">sep =</span> <span class="st">"_"</span>)</span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a> )</span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a>names</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 7 × 3
V1 V2 new_names
<chr> <chr> <chr>
1 Name <NA> Name_NA
2 Test 1 Run 1 Test 1_Run 1
3 Test 1 Run 2 Test 1_Run 2
4 Test 1 Run 3 Test 1_Run 3
5 Test 2 Run 1 Test 2_Run 1
6 Test 2 Run 2 Test 2_Run 2
7 Test 2 Run 3 Test 2_Run 3</code></pre>
<section id="step-4a" class="level3">
<h3 class="anchored" data-anchor-id="step-4a">Step 4a</h3>
<p>One more small cleanup task: in the example data, the first column header, Name, did not have a second label, which has created a name with an NA attached. We can use stringr to remove this NA.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>names <span class="ot"><-</span> names <span class="sc">%>%</span> <span class="fu">mutate</span>(<span class="fu">across</span>(new_names, <span class="sc">~</span><span class="fu">str_remove_all</span>(.,<span class="st">"_NA"</span>)))</span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>names</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 7 × 3
V1 V2 new_names
<chr> <chr> <chr>
1 Name <NA> Name
2 Test 1 Run 1 Test 1_Run 1
3 Test 1 Run 2 Test 1_Run 2
4 Test 1 Run 3 Test 1_Run 3
5 Test 2 Run 1 Test 2_Run 1
6 Test 2 Run 2 Test 2_Run 2
7 Test 2 Run 3 Test 2_Run 3</code></pre>
<section id="step-5" class="level3">
<h3 class="anchored" data-anchor-id="step-5">Step 5</h3>
<p>Now that our new name column is the way we want it, we can use dplyr’s pull() to return a vector of just that column.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>names <span class="ot"><-</span> names <span class="sc">%>%</span> <span class="fu">pull</span>(new_names)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<section id="final-data" class="level1">
<h1>Final Data</h1>
<p>Now that we have a vector of column names, let’s read in the original file using our new names. We set the skip argument to 2 to skip the first two rows, and set col_names equal to our vector of names. Note that in the last step I used the janitor package to convert the names to snake case (the default for the clean_names() function).</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb14"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>example_data <span class="ot"><-</span> readxl<span class="sc">::</span><span class="fu">read_excel</span>(file_path, <span class="at">col_names =</span> names, <span class="at">skip =</span> <span class="dv">2</span>) <span class="sc">%>%</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a> janitor<span class="sc">::</span><span class="fu">clean_names</span>()</span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>example_data</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 6 × 7
name test_1_run_1 test_1_run_2 test_1_run_3 test_2_run_1 test_2_run_2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Max 22 23 24 25 26
2 Phoebe 34 34 32 34 51
3 Scamp 35 36 21 22 23
4 Chance 1234 1235 1236 1267 173
5 Aimee 420 123 690 42 45
6 Kyle 22 23 25 26 67
# ℹ 1 more variable: test_2_run_3 <dbl></code></pre>
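<p>If you need this more than once, the steps above can be collected into a small helper. The sketch below is one way to do it, not a tested package function: the name combine_header_rows and the column names group and category are hypothetical, and it assumes exactly two header rows with group labels in the first row. It uses if_else() to skip the separator when there is no second label, instead of removing the NA text afterwards.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: collapse two header rows into one vector of names.
# header_rows is assumed to be the two-row result of
# read_excel(path, n_max = 2, col_names = FALSE).
combine_header_rows <- function(header_rows, sep = "_") {
  # One row per spreadsheet column: first column = group, second = category
  header_long <- t(as.matrix(header_rows))
  colnames(header_long) <- c("group", "category")

  as_tibble(header_long) %>%
    fill(group) %>% # carry each group label down over its blank cells
    mutate(
      # keep the group alone when there is no second-row label
      new_name = if_else(
        is.na(category),
        group,
        paste(group, category, sep = sep)
      )
    ) %>%
    pull(new_name)
}

# Usage, assuming example_data.xlsx is in the working directory:
# header_rows <- readxl::read_excel("example_data.xlsx", n_max = 2, col_names = FALSE)
# example_data <- readxl::read_excel(
#   "example_data.xlsx",
#   col_names = combine_header_rows(header_rows),
#   skip = 2
# )</code></pre></div>
</div>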
<section id="other-help" class="level1">
<h1>Other Help</h1>
<p>While searching for solutions to my problem I found two good examples; however, neither did exactly what I was trying to do.</p>
<ol type="1">
<li><p>This post by Lisa DeBruine is pretty close to what I was trying to accomplish and gave me a good starting point. Read it <a href="https://debruine.github.io/posts/multi-row-headers/">here</a>.</p></li>
<li><p>This post by Alison Hill solves a similar but slightly different problem. In her data the second row is actually metadata, not a second set of headers. Read it <a href="https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/">here</a>.</p></li>
<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div id="quarto-reuse" class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</a></div></div></section><section class="quarto-appendix-contents"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{belanger2020,
author = {Belanger, Kyle},
title = {Importing {Excel} {Data} with {Multiple} {Header} {Rows}},
date = {2020-06-22},
langid = {en}
}
</code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-belanger2020" class="csl-entry quarto-appendix-citeas" role="listitem">
Belanger, Kyle. 2020. <span>“Importing Excel Data with Multiple Header
Rows.”</span> June 22, 2020.
</div></div></section></div></main> <!-- /main -->
