Introduction to Writing a Dictionary File


This is a brief guide to the essentials of writing a Stata program file that reads raw data, a type of file known as a Stata dictionary file.

Information You Need Before You Start

Before you start to write a Stata program to read raw data, you need to know the following about your data. Most of this information can be found in the codebook.

  1. Your variable names:

    Pick the variables you need from the codebook and note their mnemonic names, e.g., sex, status, age_sex, yr8st2a, aidsknow, etc. Note that case counts: YEAR and year are two different variables.

    If the codebook only has a description of each variable, you will have to make up variable names yourself. Stata variable names must meet the following criteria:

    • Eight characters or fewer in length.
    • Combinations of letters (A-Z or a-z), digits (0-9), and the underscore character (_) only.
    • The first character of a variable name must be a letter or the underscore character (e.g., _var1 or OCCUP).
    • A variable name must not be one of Stata's reserved names. See the Stata User's Guide for a list. Most of these start with an underscore or are the names of data types, so if you start your variable names with a letter you are pretty safe.
  2. The start position of each variable, i.e., the first column of the variable's position in the record.
  3. Each variable's end position or its length, i.e., the number of columns the variable takes up.
  4. Whether the variable is string or numeric. String variables may contain numbers or may actually be numbers. Hint: If you know that all of the values of a string variable are actually numbers, define that variable as numeric.
  5. The number of decimal places desired, if any, for numeric variables.
  6. The input file name and location of the raw data file you want to read in. The file extensions .raw or .dat work best for raw data, although Stata will read in files of other extensions.
  7. You may want to label your variables with something longer than the 8-character variable name. These labels can be added later, however.
  8. If you are very ambitious, you can also add value labels to individual variables, e.g., "Male", "Female", "Unknown". These can also be added later.
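
Variable labels and value labels (items 7 and 8) can be attached once the data are in Stata. A minimal sketch, assuming a numeric variable sex is already in memory; the label name sexlbl and the code 9 for "Unknown" are made up for illustration and should be replaced by whatever your codebook specifies:

```stata
* Descriptive label for the variable itself
label variable sex "Sex of respondent"

* Define a set of value labels (sexlbl is a hypothetical name),
* then attach that set to the variable
label define sexlbl 1 "Male" 2 "Female" 9 "Unknown"
label values sex sexlbl
```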

Putting It Together

The Stata dictionary program can be written with any editor or word processor. Be sure to save it as an ASCII file with the file extension .dct.

  1. A Stata dictionary program begins with a line that looks like this:

            dictionary using mydata.dat {

    where mydata.dat is the name of your raw data file. If the file is somewhere besides the Stata directory (or your home directory on CUNIX), add the path name, e.g.:

            dictionary using d:\MyDocs\sz2\mydata.dat {
            dictionary using ~sz2/data/mydata.dat {

    You don't need quotes around the name of the data file unless it has spaces or other odd characters in the file or directory name (a bad practice).
  2. Then come the definitions of the individual variables. Each variable is defined by a line with up to 5 items, the four listed here plus an optional label (see item 3):
    • The directive _column (note the leading underscore) followed by the starting column of the variable in parentheses.
    • The variable type (usually you only need to indicate string variables)
    • The mnemonic name of the variable
    • The variable input format which consists of
      • a "%" sign
      • a number stating the variable width
      • a "." (period) followed by a number indicating the number of decimal places (omitted for integers and string variables)
      • a letter indicating the format. The format is f for numbers and s for strings.

        Some examples of input formats:

        %2f      2 column integer variable
        %12s     12 column string variable
        %8.2f    8 column number with 2 implied decimal places

        (Note: periods actually typed in the data override formats declared in the program.)
    The format statement is actually more complicated than the above, but this will do for most data. See "infile" in the Reference Manual for more information if you have numbers in scientific notation or numbers with commas.
  3. You can add a label (optional). Labels can be up to 80 characters long.
  4. The program ends with a "}" (closing brace). You also need a return character at the end of the last line; that is, before you save the file, move your cursor to the beginning of the next line below the "}". Finally, save your file with the file extension .dct, e.g., test.dct.
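
Assembled, the pieces described above look roughly like the following sketch. The file name, path, variable names, and labels are hypothetical; the quotes around the file name are there only because this made-up path contains spaces.

```stata
dictionary using "d:\My Docs\sz2\my data.dat" {
  * Comment lines inside a dictionary start with an asterisk.
  * Layout of each line: _column(start)  [type]  name  %format  "optional label"
  _column(1)         id      %4f    "Respondent ID"
  _column(5)  str10  city    %10s   "City of residence"
  _column(15)        income  %8.2f  "Income (2 implied decimals)"
}
```

Following the note on implied decimals, a field holding 00123456 read with %8.2f should come in as 1234.56, while a field with a typed period, such as 1234.5, keeps the value as typed.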

Example

Here is an example of a codebook, the information that should come with a raw data file.

Variable  Description        Columns  Format
 
IDNUM     Assigned ID Number    1-3
FNAME     First Name            4-15  String
LNAME     Last Name            16-27  String
AGE       Age at Death         28-29
SEX       Sex                  30
             1=male
             2=female
BYEAR     Birth Year           31-34
DYEAR     Death Year           35-38
STATUS    Status               39
             1=poor
             2=middle class
             3=rich
INTAX     Inheritance Tax      40-47  2 implied decimals

And here is what the Stata dictionary program for the above data looks like.

dictionary using test.dat {
  _column(1) idnum %3f
  _column(4) str12 fname %12s
  _column(16) str12 lname %12s
  _column(28) age %2f
  _column(30) sex %1f
  _column(31) byear %4f "Year of Birth"
  _column(35) dyear %4f "Year of Death"
  _column(39) status %1f "Socioeconomic Status"
  _column(40) intax %8.2f "Inheritance Tax"
}

In the example above, fname and lname are 12 column string variables and intax has 2 decimal places. Only byear, dyear, status, and intax have labels, as the other mnemonic variable names are self-explanatory. Also note that the names of the variables are in lower case. This just makes for easier typing when you get to the analysis stage. You could have used upper case, but remember that case matters: IDNUM is not the same as idnum.

Note that you don't have to write a definition line for every variable in the dataset. You can skip the ones you don't need.
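
For instance, a dictionary that reads only three of the nine variables from the codebook above might look like the sketch below. It assumes the same test.dat file; the columns in between are simply never referenced.

```stata
dictionary using test.dat {
  * Only the variables needed for the analysis are defined.
  _column(1)  idnum %3f
  _column(28) age   %2f
  _column(40) intax %8.2f "Inheritance Tax"
}
```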

Executing the Program

To run the Stata dictionary program, start up Stata and give the command:

infile using filename

where filename is the name of your file, e.g., test.dct. You don't have to type the .dct extension. If all is well, you will see the program appear on the screen followed by a message that it has read N observations. Check that N is the right number of observations for your dataset. Then check the variables with the describe command, which produces results like the following:

          Contains data
             obs:            15
             vars:             9
             size:          840 (99.8% of memory free)
   -------------------------------------------------------------------------------------
          1. idnum         float      %9.0g
          2. fname         str12      %12s
          3. lname         str12      %12s
          4. age           float      %9.0g
          5. sex           float      %9.0g
          6. byear         float      %9.0g          Year of Birth
          7. dyear         float      %9.0g          Year of Death
          8. status        float      %9.0g          Socioeconomic Status
          9. intax         float      %9.0g          Inheritance Tax
   -------------------------------------------------------------------------------------
    Sorted by:

Note that all of the numeric variables have the type float. This is inefficient. To change them to their most efficient type, give the command compress producing the following results:

idnum was float now byte
age was float now byte
sex was float now byte
byear was float now int
dyear was float now int
status was float now byte
fname was str12 now str8
lname was str12 now str9

Now save this as a Stata dataset with the command: save mydata where mydata is the name being assigned to the file.

You do not have to add the .dta extension.  The default location is the Stata directory so you should add a path.

As always, CHECK YOUR DATA.  Stata has a nice summarize command to give you summary statistics, but there is no substitute for doing frequencies (tab1) on the variables you will be using in analysis.
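
The steps in this section can be collected into a short do-file. This is a sketch under a few assumptions: test.dct is in the current directory, the path d:\MyDocs\sz2\ is borrowed from the earlier path example and should be replaced with your own, and sex and status stand in for whichever variables you plan to analyze.

```stata
* Start with nothing in memory, then read the raw data via the dictionary
clear
infile using test

* Inspect the variables, then shrink them to their most efficient types
describe
compress

* Save as a Stata dataset (.dta is added automatically);
* add the replace option if the file already exists
save "d:\MyDocs\sz2\mydata"

* Check the data: summary statistics, then frequencies
summarize
tab1 sex status
```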

Example of a Program to Read Data with Multiple Records/Case

   Here's the data layout for a file with records in 3 different formats for each observation:

Variable  Description            Record  Columns  Format

idnum     Assigned ID Number     1        1-4
treetype  Type of Tree           1        5-6
idnum     Assigned ID Number     2        1-4
soilphn   Soil PH - North Side   2        5-7     (2 decimals)
soilphe   Soil PH - East Side    2        8-10    (2 decimals)
soilphs   Soil PH - South Side   2       11-13    (2 decimals)
soilphw   Soil PH - West Side    2       14-16    (2 decimals)
idnum     Assigned ID Number     3        1-4
height    Height of Tree         3        5-9     (1 decimal)
circ      Circumference of Tree  3       10-14    (1 decimal)
And here's what the Stata dictionary program for the above data looks like:

dictionary using tree.dat {
_lines(3)
_line(1)
_column(1) idnum %4f
_column(5) treetype %2f
_line(2)
_column(5) soilphn %3.2f "Soil PH - North Side"
_column(8) soilphe %3.2f "Soil PH - East Side"
_column(11) soilphs %3.2f "Soil PH - South Side"
_column(14) soilphw %3.2f "Soil PH - West Side"
_line(3)
_column(5) height %5.1f
_column(10) circ %5.1f
}
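
Running this dictionary works the same way as before. A brief sketch, assuming the dictionary was saved as tree.dct (a file name chosen here for illustration) in the current directory:

```stata
* Read the three-record-per-case file via the dictionary
clear
infile using tree

* Each group of 3 raw records should become one observation,
* so the count should be the number of lines in tree.dat divided by 3
count

* Look at the first observation to confirm values from all three records
list idnum treetype soilphn height circ in 1
```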