How I reduced the clicking

I’ve recently been working on an analysis of a large-scale complex survey to understand asset poverty in Canada. The analysis takes place in the QICSS datalab and requires vetting before results can be released. It produces a massive volume of information across two years of the survey. My goal is to transform that information into an understandable format that can be digested by me and, eventually, others. I’ll describe the great improvement in my workflow toward that goal.

Awful workflow:
SAS –> .rtf –> Excel –> Microsoft Word
(1) SAS output to .rtf. For each year of the survey, SAS would run the analysis and generate an .rtf file, 237 pages long, containing hundreds of tables (think poverty rates calculated several different ways across numerous demographic categories). (2) A very capable research assistant would then manually enter the data into an Excel workbook with 5 sheets corresponding to the tables. (3) The RA or I would then copy the information into a Microsoft Word document and format the tables for presentation.
* Problems: There are a number of obvious problems here. First, it’s horribly inefficient to move from format to format to format. Second, coordinating the labor involved takes a long time. Third, and most importantly, the process is error prone. The probability of errors in the final product rises with the number of mouse clicks required to copy, paste, or type data manually. I confronted these errors and corrected them when I found them, but this was also time consuming, and I was constantly wondering what I missed. I thought, “there must be a better way.”

Slightly less awful workflow:
SAS –> Excel1 –> Excel2 –> Microsoft Word
Same as the awful workflow, minus the .rtf step. Via the SAS ODS output statement I learned to bypass the .rtf and send all the output directly into one Excel file. That Excel file still has over 4,000 lines of information that must be reduced to a manageable number. Moving from Excel1 to Excel2 requires copying, pasting, and deleting: going from a lot of information to a little information. Importantly, the information itself does not change (hopefully). The tables are then analyzed and prepared at the Excel2 stage and moved to Microsoft Word for presentation. At this last stage much work goes into adjusting column and row heights and widths, text alignment, etc. A lot of clicking happens.
* While the likelihood of error has been reduced, it’s still a long process getting from SAS output to Microsoft Word. When you finish all this work, only to discover at the end that you recoded a variable wrong and have to redo everything, you want to pull your hair out.

A much improved but not perfect workflow:
SAS –> Excel1 –> Excel2 –> R –> .pdf
The first three steps are the same. The big difference comes at the R stage. With the help of another research assistant (thanks, Chris) I learned the basics of a few packages in R (Sweave, ggplot2, xtable). These allowed me to load the data from Excel2, saved as a series of flat .csv files (not Excel worksheets), into R. From R, I could perform two important functions entirely by writing code. No clicks required! First, I could generate tables to accompany the manuscript. Once the data is in R, it takes less than 5 seconds to run the .Rnw file and generate the .pdf output (via LaTeX).
* Assuming no errors crept in during the Excel1 –> Excel2 step, there will be no errors in the output. Further, when I discover my next coding error, the time to reproduce the results will be a fraction of what it would be in the slightly less awful scenario: all I need to do is upload the new data and rerun. And when the next cycle of the survey is released I can, again, just upload the new output and rerun. I estimate my production time will be cut by about 75%.
* An important feature that I will save for another post: R also allows me to generate presentations (beamer class) that include tables and bar charts built from the data imported into R. Again, all in one click I generate an entire presentation that integrates text, tables, graphs, images, and charts. The best thing about it: no more clicks! (Well, at least a lot fewer.)
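The Excel2 –> R –> .pdf step above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: the data frame here is a tiny inline stand-in (in the real workflow it would come from `read.csv()` on one of the flat .csv files), and all column names and numbers are made up.

```r
# Minimal sketch of the Excel2 -> R -> LaTeX table step.
# In the real workflow the data frame would come from, e.g.,
# read.csv("excel2_export.csv"); here it is built inline so the
# example is self-contained. All values are hypothetical.
library(xtable)

rates <- data.frame(
  group             = c("All households", "Lone parents", "Seniors"),
  asset_poverty_pct = c(22.4, 41.7, 15.3)   # made-up rates
)

# Build the LaTeX table; inside an .Rnw file this LaTeX code is what
# Sweave passes along for compilation into the .pdf.
tab   <- xtable(rates, caption = "Asset poverty rates by group", digits = 1)
latex <- print(tab, include.rownames = FALSE, print.results = FALSE)
cat(latex)
```

Dropping this chunk into the .Rnw file means the table regenerates itself every run: replace the .csv, rerun, done.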

Data visualization in social work research part II

In part I of this post I demonstrated the merits of the dot plot. In part II I will show an improved way to visualize a hierarchical regression model. Below is a fairly standard table of multiple regression output for three models building sequentially on each other: Table 3 from Shiovitz-Ezra, S., & Leitsch, S. A. (2010). The role of social relationships in predicting loneliness: The national social life, health, and aging project. Social Work Research, 34(3), 157–167.
Using R again, I built a graph that overlays the three models and displays the coefficients graphically. This is based largely on the R code from David Sparks.
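A minimal sketch of an overlaid coefficient plot of this kind, using ggplot2. To keep it self-contained, the three models, variable names, estimates, and standard errors below are all fabricated stand-ins, not the published Shiovitz-Ezra & Leitsch results:

```r
# Sketch of an overlaid coefficient plot: three nested models, one row
# per predictor, point = estimate, line = 95% CI. All numbers are made up.
library(ggplot2)

coefs <- data.frame(
  term  = rep(c("Age", "Female", "Married"), times = 3),
  model = rep(c("Model 1", "Model 2", "Model 3"), each = 3),
  est   = c(0.10, -0.25, 0.30,  0.08, -0.20, 0.28,  0.05, -0.18, 0.22),
  se    = c(0.04,  0.10, 0.09,  0.04,  0.09, 0.09,  0.03,  0.09, 0.08)
)

p <- ggplot(coefs, aes(x = term, y = est, colour = model)) +
  geom_hline(yintercept = 0, linetype = "dashed") +   # zero = no effect
  geom_pointrange(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
                  position = position_dodge(width = 0.6)) +
  coord_flip() +                                      # variables on the y-axis
  labs(y = "Coefficient estimate", x = NULL, colour = NULL)
```

Dodging the three models side by side within each variable's row is what lets the reader see, at a glance, how an estimate shrinks or grows as blocks of predictors are added.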


Coefficient plot

Coefficient plot walkthrough

I reordered the models and variables into the desired order. Below is the replication of Shiovitz-Ezra et al.


Data visualization in social work research part I

Okay, here we go, my long overdue post on data visualization in social work research.

You can find study details and data here on the Dataverse network.

First, let’s examine a logistic regression table published in Trocmé, N., Knoke, D., & Blackstock, C. (2004). Pathways to the overrepresentation of Aboriginal children in Canada’s child welfare system. Social Service Review, 78(4), 577–600.


I’ve replicated Trocmé et al.’s results to produce a dotplot in R based on the work of William Jacoby.
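A Jacoby-style dotplot of odds ratios can be sketched in base R along these lines. The variable names, odds ratios, and significance flags below are invented placeholders, not the Trocmé et al. estimates (those are in the published table):

```r
# Sketch of a dotplot of odds ratios from a logistic regression.
# Solid dots mark statistically significant ORs; open dots, non-significant.
# All names and values are hypothetical.
or  <- c("Predictor A" = 2.1, "Predictor B" = 1.6,
         "Predictor C" = 1.2, "Predictor D" = 0.8)
sig <- c(TRUE, TRUE, FALSE, FALSE)

dotchart(or, pch = ifelse(sig, 19, 1), xlab = "Odds ratio")
abline(v = 1, lty = 2)   # baseline: OR = 1 means no association
```

The dashed line at OR = 1 is what makes the distance-from-baseline reading possible, and the filled-versus-open dot coding carries the significance information without asterisks.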


This visual representation of the data appears to have several advantages.

1. The results are shown in visual form rather than forcing the reader to interpret numbers.

2. It is easier to distinguish statistically significant variables from those that are not, based on solid black vs. empty dots.

3. The strength of the relationship between each IV and the DV is more intuitive based on the distance from the baseline of OR = 1.

In a recent brownbag discussion, Toni suggested making the size of each dot correspond to the size of the Wald statistic (a form of effect size). The graph is also missing information that helps the viewer appraise the amount of uncertainty, e.g., standard errors. I’ll show an example of how to report the standard errors in the next post on this subject. For now, I am curious whether you can think of additional advantages or disadvantages of the dotplot in comparison to the standard regression table.
