Below is Chapter 14, A Sample Session, from the Getting Started with Stata for Windows manual.


Chapter 14: A Sample Session

In this chapter we will use the auto.dta file shipped with Stata. If you wish to follow along, you must load this dataset. Launch Stata and choose Open from the File menu, then select the auto.dta file from the directory in which you installed Stata. Alternatively, type the use command shown below, adjusting the path to match your installation.
. use c:\stata\auto, clear
(1978 Automobile Data)
To see what the data we loaded contain, we type describe:
. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          14 Oct 2002 09:02
 size:         3,478 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.) 
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign

Listing can be informative

Here are some of our data:
. list make mpg in 1/10

     +---------------------+
     | make            mpg |
     |---------------------|
  1. | AMC Concord      22 |
  2. | AMC Pacer        17 |
  3. | AMC Spirit       22 |
  4. | Buick Century    20 |
  5. | Buick Electra    15 |
     |---------------------|
  6. | Buick LeSabre    18 |
  7. | Buick Opel       26 |
  8. | Buick Regal      20 |
  9. | Buick Riviera    16 |
 10. | Buick Skylark    19 |
     +---------------------+
Question: Which cars yield the lowest gas mileage?
. l make mpg in 1/5

     +---------------------+
     | make            mpg |
     |---------------------|
  1. | AMC Concord      22 |
  2. | AMC Pacer        17 |
  3. | AMC Spirit       22 |
  4. | Buick Century    20 |
  5. | Buick Electra    15 |
     +---------------------+
Question: Which 5 cars yield the highest gas mileage?
. l make mpg in -5/-1

     +-------------------+
     | make          mpg |
     |-------------------|
 70. | VW Dasher      23 |
 71. | VW Diesel      41 |
 72. | VW Rabbit      25 |
 73. | VW Scirocco    25 |
 74. | Volvo 260      17 |
     +-------------------+
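
Note that in 1/5 and in -5/-1 select observations by their position in the current sort order. describe reported the data as sorted by foreign, so the two listings above show the first and last five cars in the file rather than the least and most efficient ones. A minimal sketch of one way to answer the two questions follows; it assumes auto.dta is already loaded, and it uses preserve and restore only so that the observation order seen in the rest of the chapter is left unchanged:

```stata
* Sketch: list the cars with the lowest and highest mileage.
* Assumes auto.dta is already in memory.
preserve
* Put the least efficient cars first.
sort mpg
* Five lowest-mpg cars.
list make mpg in 1/5
* Five highest-mpg cars.
list make mpg in -5/-1
* Put the data back in their original order.
restore
```

Several cars share the same mileage, so which of the tied cars appear in each listing can vary from run to run.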

Descriptive statistics

Question: We are not familiar with 1978 prices. What is the average price of a car in these data?
. summarize price

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
Aside: summarize works like list—without arguments it provides a summary of all of the data:
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        make |         0
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5
    headroom |        74    2.993243    .8459948        1.5          5
-------------+--------------------------------------------------------
       trunk |        74    13.75676    4.277404          5         23
      weight |        74    3019.459    777.1936       1760       4840
      length |        74    187.9324    22.26634        142        233
        turn |        74    39.64865    4.399354         31         51
displacement |        74    197.2973    91.83722         79        425
-------------+--------------------------------------------------------
  gear_ratio |        74    3.014865    .4562871       2.19       3.89
     foreign |        74    .2972973    .4601885          0          1
Note: make has 0 observations because it is a string variable; a mean cannot be calculated for it, but summarize does not treat this as an error. rep78 has only 69 observations because it is missing for five cars.

Descriptive statistics, continued

Question: What is the average price of cars that are below and above the mean MPG?
. summarize price if mpg<21.3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        43     7091.86    3425.019       3291      15906

. summarize price if mpg>=21.3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        31    4879.968    1344.659       3299       9735
Aside: if can be suffixed to almost all commands. This is one of Stata's more useful features.
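
As an illustration of the if qualifier with other commands, here is a short sketch. It assumes auto.dta is already loaded; the particular cutoffs (21.3, 10000, 3) are chosen only for the example:

```stata
* Assumes auto.dta is already in memory.
* Combine conditions with & (and) or | (or).
summarize price if mpg < 21.3 & foreign == 1
* if works with other commands too.
count if mpg >= 21.3
list make price if price > 10000
* Missing values count as larger than any number, so exclude them explicitly.
count if rep78 > 3 & !missing(rep78)
```

The last line matters for rep78 because it is missing for some cars; without the !missing() condition those cars would be counted as having rep78 greater than 3.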

Question: What is the median MPG?

. summarize mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of Wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. Dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005
Answer: 20.
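
The median and the mean do not have to be read off the screen and retyped. A minimal sketch, assuming auto.dta is already loaded, of how the results that summarize leaves behind in r() could be used instead (the macro name mbar is arbitrary):

```stata
* Assumes auto.dta is already in memory.
quietly summarize mpg, detail
display "Median MPG = " r(p50)
display "Mean MPG   = " r(mean)
* Save the mean before another command overwrites r().
local mbar = r(mean)
summarize price if mpg < `mbar'
summarize price if mpg >= `mbar'
```

Because mpg is recorded in whole numbers, this splits the cars exactly as the hand-typed cutoff of 21.3 did earlier.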


Descriptive statistics, continued

Our dataset contains the variable foreign, which is 0 if the car was manufactured in the United States or Canada and 1 otherwise.

Problem: Obtain summary statistics for price and MPG for each value of foreign.

There are two solutions to this problem:

  1. Type the two commands:
     . summarize price mpg if foreign==0
     . summarize price mpg if foreign==1
  2. Or, you could do the following:

. sort foreign

. by foreign: summarize price mpg

-> foreign = Domestic

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        52    6072.423    3097.104       3291      15906
         mpg |        52    19.82692    4.743297         12         34

_______________________________________________________________________________
-> foreign = Foreign

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        22    6384.682    2621.915       3748      12990
         mpg |        22    24.77273    6.611187         14         41
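
The by prefix requires the data to be sorted by the grouping variable, which is why sort foreign came first. As a sketch, assuming auto.dta is already loaded, the two steps can be combined in either of these equivalent forms:

```stata
* Assumes auto.dta is already in memory.
* bysort sorts by foreign and then runs the command once per group.
bysort foreign: summarize price mpg
* The same request written with the sort option of by.
by foreign, sort: summarize price mpg
```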

More on by

Problem: It appears that the average MPG of domestic and foreign cars differs. Test the hypothesis that the means are equal.
. ttest mpg, by(foreign)

Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
 Foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72

                Ho: mean(Domestic) - mean(Foreign) = diff = 0

     Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
       t =  -3.6308                t =  -3.6308              t =  -3.6308
   P < t =   0.0003          P > |t| =   0.0005          P > t =   0.9997
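
The reported t statistic and degrees of freedom can be checked by hand from the table. A short worked calculation, using the rounded figures printed above:

```latex
t = \frac{\bar{x}_{\text{Domestic}} - \bar{x}_{\text{Foreign}}}{\mathrm{SE}(\text{diff})}
  = \frac{19.82692 - 24.77273}{1.362162}
  = \frac{-4.94581}{1.362162}
  \approx -3.63,
\qquad
\text{df} = n_{\text{Domestic}} + n_{\text{Foreign}} - 2 = 52 + 22 - 2 = 72
```

The two-sided p-value of 0.0005 in the middle column is the one relevant to the hypothesis that the means are equal.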

Analysis note: We have established that in 1978 domestic cars had poorer gas mileage than foreign cars.

Descriptive statistics, making tables

Problem: Obtain counts of the number of domestic and foreign cars.
. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
Problem: The data contain the variable rep78, which records each car's frequency-of-repair record (1 = poor, ..., 5 = excellent). Obtain frequency counts.

. tabulate rep78

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
Problem: We have 74 cars; only 69 have frequency-of-repair records recorded. List the cars for which data is missing.

. list make if missing(rep78)

     +---------------+
     | make          |
     |---------------|
  3. | AMC Spirit    |
  7. | Buick Opel    |
 45. | Plym. Sapporo |
 51. | Pont. Phoenix |
 64. | Peugeot 604   |
     +---------------+

Descriptive statistics, making tables, continued

Problem: Compare frequency-of-repair records for domestic and foreign cars (i.e., make a two-way table).

. tabulate rep78 foreign

    Repair |
    Record |       Car type
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69
Problem: Domestic cars appear to have poorer frequency-of-repair records. Is the difference statistically significant? Obtain a chi-square (even though there are not at least 5 cars expected in each cell):

. tabulate rep78 foreign, chi2

    Repair |
    Record |       Car type
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69 

          Pearson chi2(4) =  27.2640   Pr = 0.000
Analysis note: We find that frequency-of-repair records differ between domestic and foreign cars. In 1978, domestic cars appear poorer in this regard.
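
Because several cells are sparse, it may be worth looking at the expected counts and at a test that does not depend on the large-sample approximation. A minimal sketch, assuming auto.dta is already loaded:

```stata
* Assumes auto.dta is already in memory.
* Show the expected count under independence beneath each observed count.
tabulate rep78 foreign, chi2 expected
* Fisher's exact test does not rely on the large-sample approximation.
tabulate rep78 foreign, exact
```

If the exact test agrees with the Pearson chi-squared result, the conclusion above does not hinge on the small expected counts.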


Descriptive statistics, correlation matrices

Question: What is the correlation between MPG and weight of car?
. correlate mpg weight
(obs=74)

             |      mpg   weight
-------------+------------------
         mpg |   1.0000
      weight |  -0.8072   1.0000
Problem: Compare the correlation for domestic and foreign cars.

. correlate mpg weight if foreign==0
(obs=52)

             |      mpg   weight
-------------+------------------
         mpg |   1.0000
      weight |  -0.8759   1.0000


. correlate mpg weight if foreign==1
(obs=22)

             |      mpg   weight
-------------+------------------
         mpg |   1.0000
      weight |  -0.6829   1.0000
Note: We could have obtained both correlations at once by typing by foreign: correlate mpg weight instead.

Descriptive statistics, correlation matrices, continued

Aside: We can produce correlation matrices containing as many variables as we wish.

. correlate mpg weight price length displacement
(obs=74)

             |      mpg   weight    price   length displa~t
-------------+---------------------------------------------
         mpg |   1.0000
      weight |  -0.8072   1.0000
       price |  -0.4686   0.5386   1.0000
      length |  -0.7958   0.9460   0.4318   1.0000
displacement |  -0.7056   0.8949   0.4949   0.8351   1.0000
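
All five variables above are complete, so every correlation uses all 74 cars. That changes once a variable with missing values, such as rep78, is included. A short sketch, assuming auto.dta is already loaded, of the difference between correlate and pwcorr in that case:

```stata
* Assumes auto.dta is already in memory.
* correlate drops any car with a missing value in any listed variable.
correlate mpg weight rep78
* pwcorr uses all available pairs; sig adds p-values, obs adds counts.
pwcorr mpg weight rep78, sig obs
```

With rep78 in the list, correlate bases every entry on the 69 cars that have a repair record, while pwcorr still uses all 74 cars for the mpg and weight pair.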

Graphing data

Problem: We know the average MPG of domestic and foreign cars differs. We have learned that domestic and foreign cars differ in other ways as well, such as in frequency-of-repair record. We found a negative correlation of MPG and weight—as we would expect—but the correlation appears stronger for domestic cars. Examine, with an eye toward modeling, the relationship between MPG and weight. Begin with a graph.
. scatter mpg weight

Comment: scatter is explained in the Graphics Reference Manual; in brief, typing scatter y x draws a graph of y against x. The relationship, we note, appears to be nonlinear.

Note: When you draw a graph, the Graph window appears, probably covering up your Results window. Click on the Results button to put your Results window back on top. Want to see the graph again? Click on the Graph button.

Graphing data, continued

Next, we draw separate graphs for foreign and domestic cars.

. sort foreign

. scatter mpg weight, by(foreign, total row(1))

Syntax note: Here by() is an option, placed to the right of the comma, so scatter itself decides what to do with the grouping information. What scatter did was draw separate graphs for domestic and foreign cars in a single image. We have only two groups, but scatter will allow any number; the individual graphs just get smaller. The total suboption added an overall graph to the image, and the row(1) suboption presented the graphs in a single row.

Analysis note: The relationship is not only nonlinear; the domestic-car relationship appears to differ from that of foreign cars.

Model estimation: linear regression

Restatement of problem: We are to model the relationship between MPG and weight.

Plan of attack: Based on the graphs, we judge the relationship nonlinear and will model MPG as a quadratic in weight. Also based on the graphs, we judge the relationship to be different for domestic and foreign cars. We will include an indicator (dummy) variable for foreign and evaluate afterwards whether this adequately describes the difference. Thus, we will fit the model:

$$\text{mpg} = \beta_0 + \beta_1\,\text{weight} + \beta_2\,\text{weight}^2 + \beta_3\,\text{foreign} + \epsilon$$

foreign is already a 0/1 variable, so we only need to create the weight-squared variable:

. gen wtsq = weight^2

. regress mpg weight wtsq foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   52.25
       Model |  1689.15372     3   563.05124           Prob > F      =  0.0000
    Residual |   754.30574    70  10.7757963           R-squared     =  0.6913
-------------+------------------------------           Adj R-squared =  0.6781
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.2827

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0165729   .0039692    -4.18   0.000    -.0244892   -.0086567
        wtsq |   1.59e-06   6.25e-07     2.55   0.013     3.45e-07    2.84e-06
     foreign |    -2.2035   1.059246    -2.08   0.041      -4.3161   -.0909002
       _cons |   56.53884   6.197383     9.12   0.000     44.17855    68.89913
------------------------------------------------------------------------------

Model estimation: linear regression, continued

Aside: Stata can estimate many kinds of models, including logistic regression, Cox proportional hazards, etc. Click on Help, choose Search..., and enter estimation for a complete list or look up estimation in the index of the Stata Base Reference Manual.


Continuation of attack: We obtain the predicted values:
. predict mpghat
(option xb assumed; fitted values)
Comment: Be sure to read [U] 23 Estimation and post-estimation commands. There are a number of features available to you after estimation—one is calculation of predicted values. predict just created a new variable called mpghat equal to

$$-0.0165729\,\text{weight} + 1.59 \times 10^{-6}\,\text{wtsq} - 2.2035\,\text{foreign} + 56.53884$$
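
To make the fitted equation concrete, here is a worked prediction for a hypothetical 3,000-pound domestic car (foreign = 0), using the rounded coefficients from the regression table above (weight -.0165729, wtsq 1.59e-06, foreign -2.2035, _cons 56.53884):

```latex
\begin{aligned}
\widehat{\text{mpg}}
  &= 56.53884 - 0.0165729\,(3000) + 1.59\times10^{-6}\,(3000)^2 - 2.2035\,(0) \\
  &= 56.53884 - 49.7187 + 14.31 \\
  &\approx 21.1
\end{aligned}
```

For a foreign car of the same weight, the prediction shifts by the foreign coefficient, giving roughly 21.1 - 2.2 = 18.9 mpg. Because the printed coefficients are rounded, values computed by predict may differ slightly.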

Model estimation: linear regression, continued

We can now graph the data and the predicted curve.

Continuation of attack: We just created mpghat with predict. We could graph the fit and data, but we want to evaluate the fit on the foreign and domestic data separately to determine if our shift parameter is adequate. Thus, we will draw the graphs separately:

. sort weight

. scatter mpg weight || line mpghat weight || if foreign==0

. scatter mpg weight || line mpghat weight || if foreign==1

"scatter mpg weight" says to graph mpg vs. weight as a scatterplot. "line mpghat weight" says to graph mpghat vs. weight as a line plot. The || in between says to join the two commands (overlay the two graphs).

Model estimation: linear regression, continued

Problem: You show your results to an engineer. "No," he says. "It should take twice as much energy to move 2,000 pounds 1 mile compared with moving 1,000 pounds, and therefore twice as much gasoline. Miles per gallon is not a quadratic in weight; gallons per mile is a linear function of weight."

You go back to the computer:

. gen gpm = 1/mpg

. label var gpm "Gallons per mile"

. sort foreign 

. scatter gpm weight, by(foreign, total row(1))

Satisfied the engineer is indeed correct, you rerun the regression:

. regress gpm weight foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =  113.97
       Model |  .009117618     2  .004558809           Prob > F      =  0.0000
    Residual |   .00284001    71      .00004           R-squared     =  0.7625
-------------+------------------------------           Adj R-squared =  0.7558
       Total |  .011957628    73  .000163803           Root MSE      =  .00632

------------------------------------------------------------------------------
         gpm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .0000163   1.18e-06    13.74   0.000     .0000139    .0000186
     foreign |   .0062205   .0019974     3.11   0.003     .0022379    .0102032
       _cons |  -.0007348   .0040199    -0.18   0.855    -.0087504    .0072807
------------------------------------------------------------------------------
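You find that foreign cars in 1978 were less efficient once weight is taken into account: the coefficient on foreign is positive, so at any given weight a foreign car is estimated to use about .0062 more gallons per mile than a domestic car. Foreign cars may have yielded better gas mileage than domestic cars in 1978, but this was only because they were so light.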
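
The comparison at a fixed weight can be made explicit with the stored coefficients. A minimal sketch, assuming gpm has been created and regress gpm weight foreign has just been run; the 3,000-pound weight is a hypothetical value chosen for illustration:

```stata
* Assumes gpm exists and regress gpm weight foreign has just been run.
* Predicted gallons per mile for a hypothetical 3,000-pound car.
display "Domestic gpm: " _b[_cons] + _b[weight]*3000
display "Foreign gpm:  " _b[_cons] + _b[weight]*3000 + _b[foreign]
* Convert the predictions back to miles per gallon.
display "Domestic mpg: " 1/(_b[_cons] + _b[weight]*3000)
display "Foreign mpg:  " 1/(_b[_cons] + _b[weight]*3000 + _b[foreign])
```

Using the rounded coefficients printed in the table, these work out to roughly 0.048 gallons per mile (about 21 mpg) for the domestic car and 0.054 gallons per mile (about 18 mpg) for the foreign car.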


© Copyright 2005 Stata Corporation.