effort with the prospect of many false starts. While I hope that the tidy data framework is
not one of those false starts, I also don’t see it as the final solution. I hope others will build
on this framework to develop even better data storage strategies and better tools.
Surprisingly, I have found few principles to guide the design of tidy data that acknowledge
both statistical and cognitive factors. To date, my work has been driven by my experience
doing data analysis, my knowledge of relational database design, and my own rumination on
the tools of data analysis. The human factors, user-centered design, and human-computer
interaction communities may be able to add to this conversation, but the design of data and
tools to work with it has not been an active research topic in those fields. In the future, I
hope to use methodologies from these fields (user-testing, ethnography, talk-aloud protocols)
to improve our understanding of the cognitive side of data analysis, and to further improve
our ability to design appropriate tools.
Other formulations of tidy data are possible. For example, it would be possible to construct
a set of tools for dealing with values stored in multidimensional arrays. This is a common
storage format for large biomedical datasets generated by microarrays or fMRIs. It’s also
necessary for many multivariate methods based on matrix manipulation. Fortunately, because
there are many efficient tools for working with high-dimensional arrays, even sparse ones,
such an array-tidy format is not only likely to be quite compact and efficient, but should
also connect easily with the mathematical basis of statistics. This, in fact, is the
approach taken by the pandas Python data analysis library (McKinney 2010). Even more
interestingly, we could consider tidy tools that can ignore the underlying data representation
and automatically choose between array-tidy and dataframe-tidy formats to optimise memory
usage and performance.
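To make the equivalence between the two formats concrete, the sketch below builds a small multidimensional array and the corresponding dataframe-tidy table with pandas. The dimension names (subjects, conditions, time points) and all values are invented for illustration; this is one possible encoding, not a format defined by the paper.

```python
import numpy as np
import pandas as pd

# Array-tidy: a dense value array, with one vector of labels per dimension.
rng = np.random.default_rng(0)
values = rng.normal(size=(2, 3, 4))  # subjects x conditions x time points

subjects = ["s1", "s2"]
conditions = ["a", "b", "c"]
times = [0, 1, 2, 3]

# Dataframe-tidy: one row per value, one column per dimension plus the value.
index = pd.MultiIndex.from_product(
    [subjects, conditions, times], names=["subject", "condition", "time"]
)
tidy = pd.Series(values.ravel(), index=index, name="value").reset_index()

# The two representations carry the same information: the row order of the
# flattened array matches the index product, so reshaping recovers the array.
roundtrip = tidy["value"].to_numpy().reshape(
    len(subjects), len(conditions), len(times)
)
assert np.array_equal(roundtrip, values)
```

The round trip at the end is the point: a tool could pick whichever representation is cheaper for a given operation and convert losslessly between them.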
Apart from tidying, there are many other tasks involved in cleaning data: parsing dates
and numbers, identifying missing values, correcting character encodings (for international
data), matching similar but not identical values (created by typos), verifying experimental
design, and filling in structural missing values, not to mention model-based data cleaning that
identifies suspicious values. Can we develop other frameworks to make these tasks easier?
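Two of the cleaning tasks listed above, parsing dates and numbers while identifying missing values, can be sketched in a few lines with pandas. The column names, sentinel string, and thousands-separator convention are invented for illustration.

```python
import pandas as pd

# Hypothetical messy input: a sentinel string standing in for a missing
# date, and counts stored as strings with thousands separators and blanks.
raw = pd.DataFrame({
    "date": ["2011-01-05", "2011-01-06", "not recorded"],
    "count": ["12", "3,400", ""],
})

# Parsing dates: coerce unparseable entries to missing (NaT) rather than fail.
dates = pd.to_datetime(raw["date"], errors="coerce")

# Parsing numbers: strip separators, then coerce blanks to NaN.
counts = pd.to_numeric(
    raw["count"].str.replace(",", "", regex=False), errors="coerce"
)

clean = pd.DataFrame({"date": dates, "count": counts})
```

Each step makes the missing values explicit instead of leaving them encoded as strings, which is what lets later tidy tools treat them uniformly.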
7. Acknowledgements
This work wouldn’t be possible without the many conversations I’ve had about data and how
to deal with them statistically. I’d particularly like to thank Phil Dixon, Di Cook, and Heike
Hofmann, who have put up with numerous questions over the years. I’d also like to thank the
users of the reshape package who have provided many challenging problems, and my students
who continue to challenge me to explain what I know in a way that they can understand. I’d
also like to thank Bob Muenchen, Burt Gunter, Nick Horton and Garrett Grolemund who
gave detailed comments on earlier drafts, and to particularly thank Ross Gayler who provided
the nice example of the challenges of defining a variable and Ben Bolker who showed me the
natural equivalence between a paired t-test and a mixed effects model.
References
Codd EF (1990). The Relational Model for Database Management: Version 2. Addison-Wesley