This is a brief guide to the essentials of writing a Stata program file that reads raw data, a type of file known as a Stata dictionary file.
Stata: How to Write a Dictionary to Read Raw Data
Information You Need Before You Start
Before you start to write a Stata program to read raw data, you need to know the following about your data. Most of this information can be found in the codebook.
- Your variable names:
Pick the variables you need from the codebook and note their mnemonic names, e.g., sex, status, age_sex, yr8st2a, aidsknow, etc. Note that case matters: YEAR and year are two different variables.
If the codebook only has a description of each variable, you will have to make up variable names yourself. Stata variable names must meet the following criteria:
- Eight characters or fewer in length in older versions of Stata; Stata 7 and later allow up to 32 characters.
- Combinations of letters (A-Z or a-z), digits (0-9), and the underscore character (_) only!
- The first character of a variable name must be a letter or the underscore character (e.g., _var1 or OCCUP).
- A variable name must not be one of Stata's reserved names. See the Stata User's Guide for a list. Most of these start with an underscore or are the names of data types, so if you start your variable names with a letter you are pretty safe.
- The start position of each variable, i.e., the first column of the variable's position in the record.
- Each variable's end position or its length, i.e., the number of columns the variable takes up.
- Whether each variable is string or numeric. String variables may contain numbers or may actually be numbers. Hint: If you know that all of the values of a string variable are actually numbers, define that variable as numeric.
- The number of decimal places desired, if any, for numeric variables.
- The input file name and location of the raw data file you want to read in. The file extensions .raw or .dat work best for raw data, although Stata will read in files of other extensions.
- You may want to label your variables with something more descriptive than the short variable name. These labels can be added later, however.
- If you are very ambitious, you can also add value labels to individual variables, e.g., "Male", "Female", "Unknown". These can also be added later.
Putting It Together
The Stata dictionary program can be written with any editor or word processor. Be sure to save it as an ASCII file with the file extension .dct.
- A Stata dictionary program begins with a line that looks like this:
dictionary using mydata.dat {
where mydata.dat is the name of your raw data file. If the file is somewhere besides the Stata directory (or your home directory on CUNIX), add the path name, e.g.:
dictionary using d:\MyDocs\sz2\mydata.dat {
dictionary using ~sz2/data/mydata.dat {
You don't need quotes around the name of the data file unless the file or directory name contains spaces or other odd characters (a bad practice).
- Then come the definitions of the individual variables. Each variable is defined by a line with the following 5 items:
- The directive _column (note the leading underscore) followed by the starting column of the variable in parentheses.
- The variable type (usually you only need to indicate string variables)
- The mnemonic name of the variable
- The variable's input format, which consists of:
- a "%" sign
- a number stating the variable width
- a "." (period) followed by a number indicating the number of decimal places (omitted for integers and string variables)
- a letter indicating the format. The format is f for numbers and s for strings.
Some examples of input formats:
%2f      2-column integer variable
%12s     12-column string variable
%8.2f    8-column number with 2 implied decimal places
(Note: periods actually typed in the data override formats declared in the program.)
- An optional variable label in double quotes. Labels can be up to 80 characters long.
- The program ends with a "}" (closing brace). You also need a return character after the last line: before you save the file, move your cursor to the beginning of the line below the "}". Finally, save your file with the file extension .dct, e.g., test.dct.
Example
Here is an example of a codebook, the information that should come with a raw text file.
Variable   Description           Columns   Format
IDNUM      Assigned ID Number    1-3
FNAME      First Name            4-15      String
LNAME      Last Name             16-27     String
AGE        Age at Death          28-29
SEX        Sex                   30
             1=male
             2=female
BYEAR      Birth Year            31-34
DYEAR      Death Year            35-38
STATUS     Status                39
             1=poor
             2=middle class
             3=rich
INTAX      Inheritance Tax       40-47     2 implied decimals
And here is what the Stata dictionary program for the above data looks like.
dictionary using test.dat {
    _column(1)          idnum   %3f
    _column(4)   str12  fname   %12s
    _column(16)  str12  lname   %12s
    _column(28)         age     %2f
    _column(30)         sex     %1f
    _column(31)         byear   %4f    "Year of Birth"
    _column(35)         dyear   %4f    "Year of Death"
    _column(39)         status  %1f    "Socioeconomic Status"
    _column(40)         intax   %8.2f  "Inheritance Tax"
}
In the example above, fname and lname are 12-column string variables and intax has 2 decimal places. Only byear, dyear, status, and intax have labels, as the other mnemonic variable names are self-explanatory. Also note that the variable names are in lower case. This just makes for easier typing when you get to the analysis stage. You could have used upper case, but remember that case matters: IDNUM is not the same as idnum.
Note that you don't have to write a definition line for every variable in the dataset. You can skip the ones you don't need.
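As a minimal sketch of skipping variables, the dictionary below reads only idnum, sex, and intax from the same test.dat file. Because _column() gives an absolute position in the record, the columns in between are simply passed over. The comment line beginning with * is an addition for illustration.

```stata
dictionary using test.dat {
* read only three of the nine variables described in the codebook
    _column(1)   idnum   %3f
    _column(30)  sex     %1f
    _column(40)  intax   %8.2f  "Inheritance Tax"
}
```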
Executing the Program
To run the Stata dictionary program, start up Stata and give the command:
infile using filename
where filename is the name of your dictionary file, e.g., test.dct. You don't have to type the .dct extension. If all is well, you will see the program appear on the screen, followed by a message that Stata has read N observations. Check that N is the right number of observations for your dataset. Then check the variables with the describe command, which produces the following results:
Contains data
  obs:            15
 vars:             9
 size:           840 (99.8% of memory free)
-------------------------------------------------------------------------------------
  1. idnum     float   %9.0g
  2. fname     str12   %12s
  3. lname     str12   %12s
  4. age       float   %9.0g
  5. sex       float   %9.0g
  6. byear     float   %9.0g     Year of Birth
  7. dyear     float   %9.0g     Year of Death
  8. status    float   %9.0g     Socioeconomic Status
  9. intax     float   %9.0g     Inheritance Tax
-------------------------------------------------------------------------------------
Sorted by:
Note that all of the numeric variables have the type float. This is inefficient. To change them to their most efficient type, give the command compress, which produces the following results:
idnum was float now byte
age was float now byte
sex was float now byte
byear was float now int
dyear was float now int
status was float now byte
fname was str12 now str8
lname was str12 now str9
Now save this as a Stata dataset with the command:
save mydata
where mydata is the name being assigned to the file.
You do not have to add the .dta extension. The default location is the Stata directory so you should add a path.
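The steps so far can be collected in a short do-file. The sketch below rests on a few assumptions: test.dct is in the current directory, nothing in memory needs to be kept, and the replace option (an addition not used above) is acceptable because it overwrites any existing mydata.dta.

```stata
* clear any data in memory; infile will not read over an existing dataset
clear
* read the raw data as described by test.dct
infile using test
* review variable names, storage types, and labels
describe
* shrink each variable to the smallest storage type that holds its values
compress
* save as mydata.dta; replace overwrites an existing file of that name
save mydata, replace
```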
As always, CHECK YOUR DATA. Stata has a nice summarize command to give you summary statistics, but there is no substitute for doing frequencies (tab1) on the variables you will be using in analysis.
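Labels that were left out of the dictionary can be added once the data are in memory, and they make the frequency tables easier to read. The sketch below takes its label text and codes from the example codebook. The value-label names sexlbl and statlbl are made up for illustration, and the final save with replace is an assumption that overwriting mydata.dta is acceptable.

```stata
* variable labels for two variables that had none in the dictionary
label variable idnum "Assigned ID Number"
label variable age "Age at Death"
* define value labels from the codebook, then attach them to the variables
label define sexlbl 1 "male" 2 "female"
label values sex sexlbl
label define statlbl 1 "poor" 2 "middle class" 3 "rich"
label values status statlbl
* frequency tables for the coded variables, now showing the labels
tab1 sex status
* summary statistics for the continuous variables
summarize age byear dyear intax
* re-save so the labels are stored with the dataset
save mydata, replace
```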
Example of a Program to Read Data with Multiple Records/Case
Here's the data layout for a file with records in 3 different formats for each observation:
Variable   Description              Record   Columns   Format
idnum      Assigned ID Number       1        1-4
treetype   Type of Tree             1        5-6
idnum      Assigned ID Number       2        1-4
soilphn    Soil PH - North Side     2        5-7       (2 decimals)
soilphe    Soil PH - East Side      2        8-10      (2 decimals)
soilphs    Soil PH - South Side     2        11-13     (2 decimals)
soilphw    Soil PH - West Side      2        14-16     (2 decimals)
idnum      Assigned ID Number       3        1-4
height     Height of Tree           3        5-9       (1 decimal)
circ       Circumference of Tree    3        10-14     (1 decimal)
And here's what the Stata dictionary program for the above data looks like:
dictionary using tree.dat {
    _lines(3)
    _line(1)
        _column(1)   idnum     %4f
        _column(5)   treetype  %2f
    _line(2)
        _column(5)   soilphn   %3.2f  "Soil PH - North Side"
        _column(8)   soilphe   %3.2f  "Soil PH - East Side"
        _column(11)  soilphs   %3.2f  "Soil PH - South Side"
        _column(14)  soilphw   %3.2f  "Soil PH - West Side"
    _line(3)
        _column(5)   height    %5.1f
        _column(10)  circ      %5.1f
}
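Reading this file works the same way as in the first example. The sketch below assumes the dictionary was saved as tree.dct (a file name chosen here for illustration) in the current directory. Since each observation spans three records, the number of observations Stata reports should equal the number of lines in tree.dat divided by 3.

```stata
* clear memory, then read three records per observation as tree.dct directs
clear
infile using tree
* confirm that all eight variables were created
describe
* check the decimal placement: soil pH values should fall between 0 and 14
summarize soilphn soilphe soilphs soilphw
* check that heights and circumferences are in a plausible range
summarize height circ
```

Note that idnum is read only from the first record of each observation. The copies on records 2 and 3 are skipped because the dictionary has no definition line for them.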
Statistical software is available on all the CUIT workstations and in Lehman Library's DSSC. For assistance visit or contact the Data Service within the DSSC.