I’ve recently been working on the analysis of a large-scale complex survey to understand asset poverty in Canada. The analysis takes place in the QICSS datalab and requires vetting before results can be released. It produces a massive volume of information across two years of the survey. My goal is to transform that information into an understandable format that can be digested by me and, eventually, by others. I’ll describe the great improvement in my workflow toward that goal.
The awful workflow:
SAS –> .rtf –> Excel –> Microsoft Word
(1) SAS output to .rtf. For each year of the survey, SAS would run the analysis and generate an .rtf file, 237 pages long, containing hundreds of tables (think poverty rates calculated several different ways across numerous demographic categories). (2) A very capable research assistant would then manually enter the data into an Excel workbook with 5 sheets corresponding to the tables. (3) The RA or I would then copy the information into a Microsoft Word document and format the tables for presentation.
* Problems: There are a number of obvious problems here. First, it’s horribly inefficient to move from format to format to format. Second, coordinating the labor involved takes a long time. Third, and most importantly, the process is error prone. The probability of errors in the final product rises in proportion to the number of mouse clicks required to copy-paste or type data manually. I confronted these errors and corrected them when I found them, but this was also time consuming, and I was constantly wondering what I had missed. I thought, “there must be a better way.”
Slightly less awful workflow:
SAS –> Excel1 –> Excel2 –> Microsoft Word
Same as the awful workflow, minus the .rtf step. Via the ODS output statement, I learned to bypass the .rtf and send all the output directly into one Excel file. That Excel file still has over 4,000 lines of information that must be reduced to a manageable number. Moving from Excel1 to Excel2 requires copying, pasting, and deleting: going from a lot of information to a little information. Importantly, the information itself does not change (hopefully). The tables are then analyzed and prepared at the Excel2 stage and moved to Microsoft Word for presentation. At this last stage, much work goes into adjusting column widths, row heights, text alignment, etc. A lot of clicking happens.
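For readers curious what bypassing the .rtf looks like, here is a minimal sketch of the ODS approach; the file name, dataset, and variables below are placeholders for illustration, not the ones from my actual program:

```sas
/* Route procedure output straight to a workbook instead of an .rtf file.  */
/* "poverty_output.xlsx" and the dataset/variable names are illustrative.  */
ods excel file="poverty_output.xlsx" options(sheet_interval="proc");

proc freq data=survey_year1;
  tables region*poverty_flag / nocol nopercent;
run;

ods excel close;
```

Every procedure that runs between the two ODS statements lands in the workbook, one sheet per procedure, with no manual transcription.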
* While the likelihood of error has been reduced, it’s still a long process getting from SAS output to Microsoft Word. And when you finish all this work, only to discover at the end that you recoded a variable wrong and have to redo it, you want to pull your hair out.
A much improved, but not perfect, workflow:
SAS –> Excel1 –> Excel2 –> R –> .pdf
The first three steps are the same. The big difference comes at the R stage. With the help of another research assistant (thanks, Chris), I learned the basics of a few tools in R (Sweave, ggplot2, xtable). These allowed me to load the data from Excel2, saved as a series of flat .csv files (not Excel worksheets), into R. From R, I could perform two important functions entirely by writing code. No clicks required! First, I could generate tables to accompany the manuscript. Once the data are in R, it takes less than 5 seconds to run the .Rnw file and generate the .pdf output (via LaTeX).
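To make the .Rnw step concrete, here is a minimal sketch of what such a file looks like; the file name, data frame, and caption are made up for illustration, not taken from my actual project:

```latex
\documentclass{article}
\begin{document}

% An R chunk embedded in the LaTeX source; Sweave runs the code and
% splices the resulting LaTeX table into the document.
<<poverty-table, results=tex, echo=FALSE>>=
library(xtable)                          # converts data frames to LaTeX tables
rates <- read.csv("poverty_rates.csv")   # a flat file exported from Excel2
print(xtable(rates, caption = "Asset poverty rates by group"))
@

\end{document}
```

Running Sweave on the file, then LaTeX on the resulting .tex, produces the .pdf; swapping in a new .csv and rerunning regenerates every table.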
* Assuming no errors crept in between Excel1 and Excel2, there will be no errors in the output. Further, when I discover my next coding error, the time to reproduce the results will be a fraction of what it would be in the slightly less awful scenario: all I need to do is load the new data and rerun. Likewise, when the next cycle of the survey is released, I can again just load the new output and rerun. I estimate my production time will be cut by about 75%.
* An important feature that I will save for another post is that R also lets me generate presentations (beamer class) that include tables and bar charts built from the data imported into R. Again, with a single command I generate an entire presentation that integrates text, tables, graphs, images, and charts. The best thing about it: no more clicks! (Well, at least a lot fewer.)