Row-based file formats, also known as Entity-attribute-value data (EAV), organize data by record, keeping all the data associated with a record in the same row. Row based files are the traditional way of organizing data.
In this article
1. Data Files
Row-based formats have three data sets that are stored in tabular form in your media of choice, such as text files, SQL databases, AWS, Azure, GCP, NoSQL databases, to name a few.
The three data sets are:
a. Respondent table - contains the respondent IDs
b. Dictionary table - defines the attributes
c. Data table - contains the data values
This section provides information about the three data sets and general guidance about Harmoni requirements for row-based tables.
a. Respondent table
A tabular form of the respondents included in the data, listing all ids. It is important that the data is sorted.
b. Dictionary table
A tabular form of the dictionary that determines the order in which you should have the questions and pre-coded answers.
In addition, it specifies the type of the questions, for example, questions with...
- numerical answers (e.g., exact age)
- text answers (e.g., email address)
- weight (numerical value but marked as weight)
- single-response questions with pre-coded answers
- multiple-response questions with pre-coded answers.
The table should be in the form:
<Question Label, Answer Label, Question Type, [Question Label Desc.], [Answer Label Desc.]>
(The last two fields are optional)
When there are no answer labels (for those questions with no pre-coded answers, for instance in the case of a question with a numerical answer), the answer label should be null or empty.
Following is a sample table illustrating a dictionary with all possible question types. In this example, question types with numerical answers (e.g., exact age), text answers (e.g., email address), weight (numerical value but marked as weight), single-response questions with pre-coded answers, and multiple-response questions with pre-coded answers, are coded as nv, tv, wt, sr, and mr, respectively. Any user-chosen codes are acceptable, so long as the codes are unique.
The dictionary table needs to be sorted by Question Label then Answer Label in the preferred order.
Question Label | Answer Label | Question Type | Question Label Description (optional) | Answer Label Description (optional) |
---|---|---|---|---|
Respondent ID (text) | tv | |||
Q11 | Male | sr | Gender | Males |
Q11 | Female | sr | Females | |
Marital Status | Married | sr | ||
Marital Status | Widowed | sr | ||
Marital Status | Divorced | sr | ||
Marital Status | Separated | sr | ||
Marital Status | Never married | sr | ||
Weight | wt | Main Weight | ||
Exact Age | nv | |||
News Sources | Internet | mr | Sources Used | |
News Sources | Newspapers | mr | Sources Used | |
News Sources | News magazines | mr | Sources Used | |
News Sources | Television | mr | Sources Used | |
News Sources | Radio | mr | Sources Used |
Using the two optional columns will produce different labels in Harmoni. For Q11 in the table above, here is an example of the axis and element labels, with and without the optional fields.
c. Data Table
A tabular form of the data, this is a long list of answers for each question asked of a respondent. The table should be in the form:
<Respondent ID, Question Label, Answer>
The table should be sorted by respondent ID then by question label.
Respondent ID | Question Label | Answer |
---|---|---|
0 | Respondent ID (text) | 0 |
1 | Respondent ID (text) | 1 |
0 | Gender | Male |
1 | Gender | Male |
0 | Marital Status | Divorced |
1 | Marital Status | Never married |
So as long as the respondent, dictionary, and data tables can be obtained, the actual data can be stored in any format the user chooses.
Data Accessibility
Typically, we envisage the data to be in a storage accessible via SQL commands. The following examples illustrate some sample SQL commands to obtain the respondent count, dictionary table, and data table.
1) caseIDs (respondent IDs)
SELECT distinct CaseIdCol FROM OurDataTable ORDER BY CaseIdCol;
2) Dictionary
SELECT DISTINCT ItemLabelCol, ValueLabelCol, ItemTypeCol, ItemLabelColDesc, ValueLabelColDesc FROM OurDictionaryTable order by ItemLabelColOrder, ElementLabelColOrder;
3) Data
SELECT DISTINCT CaseIdCol, ItemLabelCol, ValueLabelCol FROM OurDataTable ORDER BY CaseIdCol, ItemLabelCol;
2. Multi-Level Linking
Continuing on with the example in the section above, the caseIDs are represented in this data as Respondent ID and determine the link to other levels.
Respondent ID | Question Label | Answer |
---|---|---|
0 | Respondent ID (text) | 0 |
1 | Respondent ID (text) | 1 |
0 | Gender | Male |
1 | Gender | Male |
0 | Marital Status | Divorced |
1 | Marital Status | Never married |
The next level dataset down (child) will have its own caseIDs. These caseIDs are represented in the data as Visit ID. The child data will link with the parent data on Respondent ID.
Visit ID | Question Label | Answer |
---|---|---|
11 | Respondent ID | 0 |
12 | Respondent ID | 0 |
13 | Respondent ID | 1 |
11 | Visit ID | 11 |
12 | Visit ID | 12 |
13 | Visit ID | 13 |
11 | Location | Supermarket |
12 | Location | Convenience Store |
13 | Location | Supermarket |
If more levels are required, for example a purchase level, the same logic applies. The grandchild data will link with the child data on Visit ID.
Purchase ID | Question Label | Answer |
---|---|---|
0 | Visit ID | 11 |
1 | Visit ID | 11 |
2 | Visit ID | 12 |
3 | Visit ID | 12 |
4 | Visit ID | 12 |
5 | Visit ID | 13 |
0 | Purchase ID | 0 |
1 | Purchase ID | 1 |
2 | Purchase ID | 2 |
3 | Purchase ID | 3 |
4 | Purchase ID | 4 |
5 | Purchase ID | 5 |
0 | Beverage | Coffee |
1 | Beverage | Tea |
2 | Beverage | Energy Drink |
3 | Beverage | Flavored Milk |
4 | Beverage | Sports Drink |
5 | Beverage | Bottled Water |
3. Multiple Datasets on the Same Level
It is also possible to link datasets so that they are created at the same level. As with multi-level datasets, these must be linked by caseIDs. For example, in a dataset for visit levels that is measuring transport methods, this transport method data will act as a child to the existing visit level data, and will link on Visit ID. In this file the caseIDs are 11, 12 and 13.
Visit ID | Question Label | Answer |
---|---|---|
11 | Visit ID | 11 |
12 | Visit ID | 12 |
13 | Visit ID | 13 |
11 | Transport method | Bus |
12 | Transport method | Walk |
13 | Transport method | Car |
4. Useful Information
- Any ID that does not exist in the CaseIDs table will be ignored, even if it appears in the data table.
- Any variable not found in the dictionary table will be ignored, even if it exists in the data table.
- Any variable listed in the dictionary table that has no corresponding response in the data table will be output with a count of zero.
Where to from here?
Learn more about Harmoni