How to define tidy data if there are repeated measures?

Question

machine

May 6, 2022, 16:02

Some time ago I read R for Data Science, which gives the following definition of tidy data:

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.

Each observation must have its own row.

Each value must have its own cell.

Back then, the idea seemed quite reasonable, but now the concept no longer seems consistent to me. Chapter 12 says that table1 is tidy, and this is what it looks like:

I would expect the country to be the observation unit and therefore, countries must not appear in multiple rows. I expect the following to be the tidy data table of table1:

The problem with my solution seems to be that values (of year) end up as part of variable names. On the other hand, table1, which is claimed to be tidy, has no obvious observation unit as I understand it. We could say that the combination of the two columns country and year forms the observation unit, but nowhere is such a compound observation unit defined (neither in the book nor in the publication on tidy data).
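Since the two tables are not reproduced here, the following is a minimal sketch of both layouts in plain Python. The rows are the table1 values from R for Data Science; the helper pivot_wider and the wide column names such as cases_1999 are assumptions made for illustration, not necessarily the exact layout of the second table.

```python
# table1 from R for Data Science: one row per (country, year) pair.
table1 = [
    {"country": "Afghanistan", "year": 1999, "cases": 745, "population": 19987071},
    {"country": "Afghanistan", "year": 2000, "cases": 2666, "population": 20595360},
    {"country": "Brazil", "year": 1999, "cases": 37737, "population": 172006362},
    {"country": "Brazil", "year": 2000, "cases": 80488, "population": 174504898},
    {"country": "China", "year": 1999, "cases": 212258, "population": 1272915272},
    {"country": "China", "year": 2000, "cases": 213766, "population": 1280428583},
]


def pivot_wider(rows, id_col, names_from, value_cols):
    """Hypothetical helper: one row per id_col value, one column per
    (value column, names_from value) combination."""
    wide = {}
    for row in rows:
        out = wide.setdefault(row[id_col], {id_col: row[id_col]})
        for col in value_cols:
            # The year value becomes part of the column name here.
            out[f"{col}_{row[names_from]}"] = row[col]
    return list(wide.values())


wide = pivot_wider(table1, "country", "year", ["cases", "population"])
for row in wide:
    print(row)
```

In the wide result each country appears in exactly one row, but the year values have moved into the column names, which is the problem described above.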

Topic data time-series

Category Data Science

machine · Accepted Answer · April 7, 2020, 07:52

By coincidence, I found a paper dealing with the problem I asked about. The paper defines tidy temporal data as follows:

Index is a variable with inherent ordering from past to present.

Key is a set of variables that define observational units over time.

Each observation should be uniquely identified by index and key.

Each observational unit should be measured at a common interval, if regularly spaced.

So in table1, country is the key and year is the index of the table. This makes the table tidy.
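The two checkable rules, uniqueness of key plus index and a common interval, can be sketched in plain Python. The helper check_tidy_temporal is hypothetical and is not taken from the paper; it only looks at the (country, year) pairs of table1 and leaves out the measured values.

```python
from collections import defaultdict

# (key, index) pairs from table1: country is the key, year is the index.
observations = [
    ("Afghanistan", 1999), ("Afghanistan", 2000),
    ("Brazil", 1999), ("Brazil", 2000),
    ("China", 1999), ("China", 2000),
]


def check_tidy_temporal(rows):
    """Hypothetical helper: return (duplicates, gaps) for (key, index) rows.

    duplicates lists every (key, index) pair seen more than once.
    gaps is the set of differences between consecutive index values
    within each key; a single element means one common interval.
    """
    seen = set()
    duplicates = []
    index_by_key = defaultdict(list)
    for key, index in rows:
        if (key, index) in seen:
            duplicates.append((key, index))
        seen.add((key, index))
        index_by_key[key].append(index)

    gaps = set()
    for values in index_by_key.values():
        ordered = sorted(set(values))
        gaps.update(later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return duplicates, gaps


duplicates, gaps = check_tidy_temporal(observations)
print("duplicates:", duplicates)
print("gaps:", gaps)
```

For table1 there are no duplicate pairs and the only gap is one year, so both checks pass. Irregularly spaced data would show more than one gap, which the definition above still allows.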

Erwan · Accepted Answer · April 6, 2020, 17:26

As you correctly observed, there is no assumption that an observation must be defined by a single key (variable). In this example an observation is defined by the pair country + year, hence table1 is the correct tidy version. It's not a complex case at all: this is very common, sometimes with more than two variables.

In general, "tidying" a dataset increases the number of rows and often decreases the number of columns. One way to see that your second table is not tidy is that it would require new columns for every new year added to the dataset. As you noticed, another indication is simply that it requires variable values in the column names, which is a very bad design idea in general.
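As a rough illustration of the row and column counts, here is a sketch in plain Python that converts a wide table into the tidy layout. The wide column names (cases_1999 and so on) and the helper pivot_longer are assumptions for this example; the values are those of table1.

```python
# Assumed wide layout: one row per country, year values inside column names.
wide = [
    {"country": "Afghanistan", "cases_1999": 745, "population_1999": 19987071,
     "cases_2000": 2666, "population_2000": 20595360},
    {"country": "Brazil", "cases_1999": 37737, "population_1999": 172006362,
     "cases_2000": 80488, "population_2000": 174504898},
    {"country": "China", "cases_1999": 212258, "population_1999": 1272915272,
     "cases_2000": 213766, "population_2000": 1280428583},
]


def pivot_longer(rows, id_col):
    """Hypothetical helper: split names like 'cases_1999' into a variable
    name and a year, giving one row per (id_col, year) pair."""
    long_rows = {}
    for row in rows:
        for name, value in row.items():
            if name == id_col:
                continue
            variable, year = name.rsplit("_", 1)
            key = (row[id_col], int(year))
            record = long_rows.setdefault(
                key, {id_col: row[id_col], "year": int(year)}
            )
            record[variable] = value
    return list(long_rows.values())


tidy = pivot_longer(wide, "country")
print(len(wide), "rows ->", len(tidy), "rows")
print(len(wide[0]), "columns ->", len(tidy[0]), "columns")
```

Adding the year 2001 would add two columns to the wide table, whereas the tidy table keeps its four columns and only gains three rows.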

That being said, tidy data isn't a magical solution to every data design problem: it tends to multiply the number of rows to an extent that makes it impractical in some cases.
