These were given at Stata User Meetings (or Stata Conferences), and are summarized here with their abstracts and downloadables. Stata User Meetings (or Conferences) are a principal opportunity for face-to-face networking within the Stata user community, and are also an opportunity to feed ideas and wishes for further development back to Stata Corporation, the makers of Stata statistical software. These meetings take place in many locations all over the world, but the largest ones are usually in London, UK, where a large concentration of Stata users is based at locations mutually accessible by train, bus or bicycle. This page contains all my own presentations given at these meetings, in reverse order of presentation date.
To find out more about Stata User Meetings (or Conferences), click here for the official Stata versions of the proceedings, or click here for the Boston College Department of Economics versions maintained by Kit Baum and listed with RePEc. To find out more about Stata Statistical Software, click here.
Stata Corporation
Stata User Meeting proceedings at RePEc
Stata User Meeting proceedings at StataCorp
Return to Roger Newson's main documents page
Return to Roger Newson's main resource page
Inverse treatment-propensity weights are a standard method for adjusting for predictors of exposure to a treatment. As a treatment-propensity score is a balancing score, it makes sense to do balance checks on the corresponding treatment-propensity weights. It is also a good idea to do variance-inflation checks, to estimate how much the propensity weights might inflate the variance of an estimated treatment effect, in the pessimistic scenario in which the weights are not really necessary. In Stata, the SSC package somersd can be used for balance checks, and the SSC package haif can be used for variance-inflation checks. It is argued that balance and variance-inflation checks are also necessary in the case of completeness-propensity weights, which are intended to remove imbalance in predictors of completeness between the subsample with complete data and the full sample of subjects with complete or incomplete data. However, the usage of somersd, scsomersd, and haif must be modified, because we are removing imbalance between the complete sample and the full sample, instead of between the treated subsample and the untreated subsample. An example will be presented, from a clinical trial in which the author was involved, and in which nearly a quarter of randomized subjects had no final outcome data. A post-hoc sensitivity analysis is presented, using inverse completeness-propensity weights.
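For illustration, a minimal sketch of the standard treatment-propensity balance check described above (the variables treat, conf1 and conf2 are illustrative assumptions, and somersd is assumed to be installed from SSC):
    logit treat conf1 conf2
    predict pscore, pr
    generate double ipw = cond(treat, 1/pscore, 1/(1-pscore))   // inverse treatment-propensity weights
    * propensity-weighted Somers' D of each confounder with respect to treatment; values near zero indicate balance
    somersd treat conf1 conf2 [pweight=ipw], transf(z)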
Download presentation
Download example do-file
Return to top of page
Statisticians make their living producing tables (and plots). We present an update of a general family of methods for making customized tables, called the DCRIL path (decode, characterize, reshape, insert, list), with customized table cells (using the package sdecode), customized column attributes (using the package chardef), customized column labels (using the package xrewide), and/or customized inserted gap-row labels (using the package insingap), and listing these tables to automatically-generated documents. This demonstration uses the package listtab to list Markdown tables for browser-ready HTML documents, which Stata users like to generate, and the package docxtab to list .docx tables for printer-ready .docx documents, which our superiors like us to generate.
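A minimal sketch of the final listing step for a Markdown table, assuming listtab's delimiter(), begin(), end() and headlines() options (auto data; the column headings are made up for illustration):
    sysuse auto, clear
    listtab make price mpg in 1/5, delimiter(" | ") begin("| ") end(" |") headlines("| Make | Price | MPG |" "| --- | --- | --- |")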
Download presentation
Download example do-files
Return to top of page
A resultsset is a Stata dataset created as output by a Stata command. It may be listed and/or saved in a disk file and/or (in Stata Version 16 or higher) written to a data frame (or resultsframe) in the memory, without damaging any existing data frames. Commands creating resultssets include parmest, parmby, xcontract, xcollapse, descsave, xsvmat, and xdir. Commands useful for processing resultsframes include xframeappend, fraddinby, and invdesc. We survey the ways in which resultsset processing has been changed by resultsframes.
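A minimal sketch of one route from estimation results to a resultsframe, going via a temporary file rather than any frame-writing option, to keep to syntax I am sure of (auto data; requires parmest from SSC and the frames system of Stata 16 or higher):
    sysuse auto, clear
    regress mpg weight foreign
    tempfile pf
    parmest, saving(`"`pf'"', replace)        // resultsset with one observation per parameter
    frame create results
    frame results: use `"`pf'"', clear        // load the resultsset into its own frame, leaving the original data untouched
    frame results: list parm estimate min95 max95 p, clean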
Download presentation
Download example do-files
Return to top of page
Ridit functions are specified with respect to an identified probability distribution. They are like ranks, only expressed on a scale from 0 to 1 (for unfolded ridits), or -1 to 1 (for folded ridits). Ridit functions have generalised inverses called percentile functions. A native ridit is a ridit of a variable with respect to its own distribution. Native ridits can be computed using the ridit() function of Nick Cox's SSC package egenmore. Alternatively, weighted ridits can be computed using the SSC package wridit. This has a handedness() option, where handedness(right) specifies a right-continuous ridit (also known as a cumulative distribution function), handedness(left) specifies a left-continuous ridit, and handedness(center) (the default) specifies a ridit function discontinuous at its mass points. wridit now has a module fridit, computing foreign ridits of a variable with respect to a distribution other than its own, specifying the foreign distribution in another data frame. An application of ridits is ridit splines, which are splines in a ridit function, typically computed using the SSC package polyspline. As an example, we may fit a ridit spline to a training set, and use it for prediction in a test set, using foreign ridits of an X-variable in the test set with respect to the distribution of the X-variable in the training set. The model parameters are typically values of an outcome variable corresponding to percentiles of the X-variable in the training set. This practice stabilises (or Winsorises) outcome values corresponding to X-values in the test set outside the range of X-values in the training set.
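A minimal sketch of a native-ridit spline fit, assuming wridit's generate() option and polyspline's refpts(), generate() and power() options (auto data; both packages from SSC):
    sysuse auto, clear
    wridit weight, generate(rweight)                          // native ridit of weight, on a 0-1 scale
    polyspline rweight, refpts(0(0.25)1) generate(rsp_) power(3)
    regress mpg rsp_*, noconstant                             // parameters: fitted mpg at the 0, 25, 50, 75 and 100 percentiles of weight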
Download presentation
Download example do-files
Return to top of page
Metadatasets are Stata datasets, in files or in frames, which may have one observation per file, per dataset, per variable, or per variable value. Metadatasets can be used to modify a Stata database, or to make a Stata database self-documenting, especially if converted to non-Stata formats, such as HTML or even Microsoft Excel. We present some user-written packages, updated to Stata Version 16, for creating and using metadatasets. The xdir package creates a resultsset with one observation per file in a folder conforming to a user-specified pattern. The descgen package inputs an xdir resultsset, and generates a new variable indicating whether each file is a Stata dataset, and other new variables containing dataset attributes, such as the dataset label and characteristics, the sort key of variables, and the numbers of observations and variables. The vallabdef package inputs a dataset with one observation per value per value label name, and generates Stata value labels. The vallabsave package loads and saves value labels from and to label-only datasets, and transfers value labels between data frames. The descsave package creates a metadataset with one observation per variable in a dataset, and data on variable attributes (including characteristics). The invdesc package modifies the variable attributes of the dataset in the current frame, inputting a descsave resultsset in a second data frame to set the variable attributes, and inputting value labels from a dataset in a third data frame. The datasets containing the variable attributes and value labels may be produced as resultssets by Stata packages, or produced manually in a spreadsheet using LibreOffice Calc or Microsoft Excel, and input into Stata datasets using import delimited or import excel.
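A minimal sketch of one such metadataset, assuming descsave's saving() option (auto data; the file name is illustrative):
    sysuse auto, clear
    descsave, saving(autovars.dta, replace)   // metadataset with one observation per variable of the auto data
    use autovars.dta, clear
    describe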
Download presentation
Download example do-file
Download example .txt input files
Return to top of page
Scientists frequently work with pairs of alternative variables intended to measure the same quantity. Examples include measured and predicted disease prevalences in primary-care practices, and marks awarded to student exam scripts by two different teachers. Statistical methods developed for use with such pairs of variables (A and B) may aim to measure components of disagreement between the variables (like discordance, bias and scale differential), or they may aim to estimate one variable from the other (calibration). The Bland-Altman plot is the standard way of presenting a pair of alternative measures, and allows us to visualise discordance, bias and scale differential at the same time. However, it lacks parameters with confidence limits. The SSC packages somersd, scsomersd and rcentile can be used to estimate rank parameters. They can measure discordance using Kendall's τa between A and B, bias using the mean sign and percentiles of A-B, and scale differential using Kendall's τa between A-B and A+B. For calibration (predicting A from B), we can use the SSC packages wridit and polyspline to define a ridit spline of A with respect to B. We can then plot the observed B and the predicted A (with confidence limits) against the ridit of B, to create a continuous alternative to the standard decile plot commonly used for calibration.
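A minimal sketch of the rank-parameter step, assuming a dataset in memory with the two alternative measures stored in variables A and B (somersd from SSC):
    generate double diff = A - B
    generate double sum  = A + B
    somersd A B, taua transf(z)        // discordance: Kendall's tau-a between A and B
    somersd diff sum, taua transf(z)   // scale differential: Kendall's tau-a between A-B and A+B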
Download presentation
Download example dataset and do-files
Return to top of page
The Clinical Practice Research Datalink (CPRD) is a centrally-managed data warehouse, storing data provided by the primary-care sector of the United Kingdom (UK) National Health Service (NHS). Medical researchers request retrievals from this database, which take the form of a collection of text datasets, whose format can be complicated. I have written a flagship package cprdutil, with multiple modules to input into Stata the many text dataset types provided in a CPRD retrieval. These text datasets may be converted either to Stata value labels or to Stata datasets, which can be created complete with value labels, variable labels, and numeric Stata dates. I have also written a fleet of satellite packages, to input into Stata the text datasets for retrievals of linked data, in which data are provided from non-CPRD sources, with CPRD identifier variables as a foreign key to allow data linkage. The modules of cprdutil are introduced. A demonstration example is given, in which a minimal CPRD database is produced in Stata, using cprdutil, and some principles of sensible programming practice for creating large databases are illustrated.
Download presentation
Download example do-files
Return to top of page
Given a random variable X, the ridit function R_X(.) specifies its distribution. The SSC package wridit can compute ridits (possibly weighted) for a variable. A ridit spline in a variable X is a spline in the ridit R_X(X). The SSC package polyspline can be used with wridit to generate an unrestricted ridit-spline basis for an X-variable, with the feature that, in a regression model, the parameters corresponding to the basis variables are equal to mean values of the outcome variable at a list of percentiles of the X-variable. Ridit splines are especially useful in propensity weighting. The user may define a primary propensity score in the usual way, by fitting a regression model of the treatment variable with respect to the confounders, and then using the predicted values of the treatment variable. A secondary propensity score is then defined by regressing the treatment variable with respect to a ridit-spline basis in the primary propensity score. We have found that secondary propensity scores can predict the treatment variable as well as the corresponding primary propensity scores, as measured using the unweighted Somers' D with respect to the treatment variable. However, secondary propensity weights frequently perform better than primary propensity weights at standardizing out the treatment-propensity association, as measured using the propensity-weighted Somers' D with respect to the treatment variable. Also, when we measure the treatment effect, secondary propensity weights may cause considerably less variance inflation than primary propensity weights. This is because the secondary propensity score is less likely to produce extreme propensity weights than the primary propensity score.
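A minimal sketch of a secondary propensity score, assuming illustrative variable names (treat, conf1-conf3) and assuming wridit's generate() option and polyspline's refpts(), generate() and power() options:
    logit treat conf1 conf2 conf3
    predict pscore1, pr                                     // primary propensity score
    wridit pscore1, generate(rps)                           // ridit of the primary score
    polyspline rps, refpts(0(0.2)1) generate(ps_) power(3)  // ridit-spline basis in the primary score
    logit treat ps_*, noconstant
    predict pscore2, pr                                     // secondary propensity score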
Download presentation
Download example do-file
Return to top of page
The Rubin method of confounder adjustment, in its 21st-century version, is a two-phase method for using observational data to estimate a causal treatment effect on an outcome variable. It involves first finding a propensity model in the joint distribution of a treatment variable and its confounders (the design phase), and then estimating the treatment effect from the conditional distribution of the outcome, given the treatments and confounders (the analysis phase). In the design phase, we want to limit the level of spurious treatment effect that might be caused by any residual imbalance between treatment and confounders that may remain after adjusting for the propensity score by propensity matching and/or weighting and/or stratification. A good measure of this is Somers' D(W|X), where W is a confounder or a propensity score, and X is the treatment variable. The SSC package somersd calculates Somers' D for a wide range of sampling schemes, allowing matching and/or weighting and/or restriction to comparisons within strata. Somers' D has the feature that, if Y is an outcome, then a higher-magnitude D(Y|X) cannot be secondary to a lower-magnitude D(W|X), implying that D(W|X) can be used to set an upper bound to the size of a spurious treatment effect on an outcome. For a binary treatment variable X, D(W|X) gives an upper bound to the size of a difference between the proportions, in the two treatment groups, that can be caused for a binary outcome. If D(W|X) is less than 0.5, then it can be doubled to give an upper bound to the size of a difference between the means, in the two treatment groups, that can be caused for an equal-variance Normal outcome, expressed in units of the common standard deviation for the two treatment groups. We illustrate this method using a familiar dataset, with examples using propensity matching, weighting and stratification. We use the SSC package haif in the design phase, to check for variance inflation caused by propensity adjustment, and use the SSC package scenttest (an addition to the punaf family) to estimate the treatment effect in the analysis phase.
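For instance, a design-phase balance check for a stratified analysis might be sketched as follows (treat, conf1 and conf2 are illustrative assumptions; somersd is from SSC, and its wstrata() option is assumed):
    logit treat conf1 conf2
    predict pscore, pr
    xtile pgroup = pscore, nquantiles(5)                    // propensity-score quintile groups
    * D(W|X) for each confounder, comparing only pairs within the same propensity quintile
    somersd treat conf1 conf2, wstrata(pgroup) transf(z)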
Download presentation
Download example do-file
Return to top of page
Somers' D(Y|X) is an asymmetric measure of ordinal association between two variables Y and X, on a scale from -1 to 1. It is defined as the difference between the conditional probabilities of concordance and discordance between two randomly-sampled (X,Y)-pairs, given that the two X-values are ordered. The somersd package enables the user to estimate Somers' D for a wide range of sampling schemes, allowing clustering and/or sampling-probability weighting and/or restriction to comparisons within strata. Somers' D has the useful feature that a larger D(Y|X) cannot be secondary to a smaller D(W|X) with the same sign, enabling us to make scientific statements that the first ordinal association cannot be caused by the second. An important practical example, especially for public-health scientists, is the case where Y is an outcome, X an exposure, and W a propensity score. However, an audience accustomed to other measures of association may be culture-shocked, if we present associations measured using Somers' D. Fortunately, under some commonly-used models, Somers' D is related monotonically to an alternative association measure, which may be more clearly related to the practical question of how much good we can do. These relationships are nearly linear (or log-linear) over the range of Somers' D values from -0.5 to 0.5. We present examples with X and Y binary, with X binary and Y a survival time, with X binary and Y conditionally Normal, and with X and Y bivariate Normal. Somers' D can therefore be used as a common currency for comparing a wide range of associations between variables, not limited to a particular model.
Download presentation
Download example do-file
Return to top of page
So-called non-parametric methods are in fact based on estimating and testing parameters, usually either rank parameters or spline parameters. Two comprehensive packages for estimating these are somersd (for rank parameters) and bspline (for spline parameters). Both of these estimate a wide range of parameters, but both are frequently found to be difficult to use by casual users. This presentation introduces rcentile, an easy-to-use front end for somersd, and polyspline, an easy-to-use front end for bspline. rcentile estimates percentiles with confidence limits, optionally allowing for clustered sampling and sampling-probability weights. The confidence intervals are saved in a Stata matrix, with one row per percentile, which the user can save to a resultsset using the xsvmat package. polyspline inputs an X-variable and a user-defined list of reference points and outputs a basis of variables for a polynomial or for another unrestricted spline. This basis can be included in the covariate list for an estimation command, and the corresponding parameters will be values of the polynomial or spline at the reference points, or differences between these values. By default, the spline will simply be a polynomial, with a degree one less than the number of reference points. However, if the user specifies a lower degree, then the spline will have knots interpolated sensibly between the reference points.
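A minimal sketch of the polyspline workflow, assuming its refpts(), generate() and power() options (auto data):
    sysuse auto, clear
    * cubic spline basis whose parameters are fitted values of mpg at five reference weights
    polyspline weight, refpts(2000 2750 3500 4250 5000) generate(ws_) power(3)
    regress mpg ws_*, noconstant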
Download presentation
Download example do-file
Return to top of page
Factor variables are defined as categorical variables with integer values, which may represent values of some other kind, specified by a value label. We frequently want to generate such variables in Stata datasets, especially resultssets, which are output Stata datasets produced by Stata programs, such as the official Stata statsby command and the SSC packages parmest and xcontract. This is because categorical string variables can only be plotted after conversion to numeric variables, and because these numeric variables are also frequently used in defining a key of variables, which identify observations in the resultsset uniquely in a sensible sort order. The sencode package is downloadable, and frequently downloaded, from SSC, and is a "super" version of encode, which inputs a string variable and outputs a numeric factor variable. Its added features include a replace option allowing the output numeric variable to replace the input string variable, a gsort() option allowing the numeric values to be ordered in ways other than alphabetical order of the input string values, and a manyto1 option allowing multiple output numeric values to map to the same input string value. The sencode package is well-established, and has existed since 2001. However, some tips will be given on ways of using it that are not immediately obvious, but which the author has found very useful over the years when mass-producing resultssets. These applications use sencode with other commands, such as the official Stata command split and the SSC packages factmerg, factext and fvregen.
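A minimal sketch of the basic round trip (auto data; sencode and sdecode from SSC):
    sysuse auto, clear
    sdecode foreign, generate(origin)   // decode the numeric factor to a string variable
    sencode origin, replace             // re-encode it in place as a value-labelled numeric variable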
Download presentation
Download example do-file
Return to top of page
Applied scientists, especially public health scientists, frequently want to know how much good can be caused by a proposed intervention. For instance, they might want to estimate how much we could decrease the level of a disease, in a dream scenario where the whole world stopped smoking, assuming that a regression model fitted to a sample is true. Alternatively, they may want to compare the same scenario between regression models fitted to different datasets, as when disease rates in different subpopulations are standardized to a common distribution of gender and age, using the same logistic regression model with different parameters in each subpopulation. In statistics, scenarios can be defined as alternative versions of a dataset, with the same variables, but with different values in the observations, or even with non-corresponding observations. Using regression methods, we may estimate scenario means of a Y-variable in scenarios with specified X-values, and compare these scenario means. In Stata Versions 11 and 12, the standard tool for estimating scenario means is margins. A suite of packages is introduced for estimating scenario means and their comparisons, using margins, together with nlcom to implement Normalizing and variance-stabilizing transformations. margprev estimates scenario prevalences for binary variables. marglmean estimates scenario arithmetic means for non-negative valued variables. regpar estimates 2 scenario prevalences, together with their difference, the population attributable risk (PAR). punaf estimates 2 scenario arithmetic means from cohort or cross-sectional data, together with their ratio, the population unattributable fraction (PUF), which is subtracted from 1 to give the population attributable fraction (PAF). punafcc estimates an arithmetic mean between-scenario rate ratio for cases or non-survivors in case-control or survival data, respectively. This mean rate ratio, also known as a PUF, is also subtracted from 1 to estimate a PAF. These packages use the log transformation for arithmetic means and their ratios, the logit transformation for prevalences, and the hyperbolic arctangent or Fisher's z transformation for differences between prevalences. Examples are presented for these packages.
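The underlying scenario machinery can be sketched with official commands alone (auto data; the variable choices are purely illustrative, and the packages above add the transformations and confidence intervals):
    sysuse auto, clear
    logit foreign mpg weight
    * two scenario prevalences: the data as observed, and a scenario with every car set to a weight of 2500 lb
    margins, at((asobserved) _all) at(weight=2500)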
Download presentation
Download example do-file
Return to top of page
Splines, including polynomials, are traditionally used to model non-linear relationships involving continuous predictors. However, when they are included in linear models (or generalized linear models), the estimated parameters for polynomials are not easy for non-mathematicians to understand, and the estimated parameters for other splines are often not easy even for mathematicians to understand. It would be easier if the parameters were values of the polynomial or spline at reference points on the X-axis, or differences or ratios between the values of the spline at the reference points and the value of the spline at a base reference point. The bspline package can be downloaded from SSC, and generates spline bases for inclusion in the design matrices of linear models, based on Schoenberg B-splines. The package now has a recently added module flexcurv, which inputs a sequence of reference points on the X-axis, and outputs a spline basis, based on equally-spaced knots generated automatically, whose parameters are the values of the spline at the reference points. This spline basis can be modified by excluding the spline vector at a base reference point and including the unit vector. If this is done, then the parameter corresponding to the unit vector will be the value of the spline at the base reference point, and the parameters corresponding to the remaining reference spline vectors will be differences between the values of the spline at the corresponding reference points and the value of the spline at the base reference point. The spline bases are therefore extensions, to continuous factors, of the bases of unit vectors and/or indicator functions used to model discrete factors. It is possible to combine these bases for different continuous and/or discrete factors in the same way, using product bases in a design matrix to estimate factor-value combination means and/or factor-value effects and/or factor interactions.
Download presentation
Download example do-files
Return to top of page
parmest peripherals: fvregen, invcise, and qqvalue.
Presented at the 16th UK Stata User Meeting, 9-10 September, 2010.
The parmest package is used with Stata estimation commands to produce output datasets (or resultssets) with one observation per estimated parameter, and data on parameter names, estimates, confidence limits, P-values, and other parameter attributes. These resultssets can then be input to other Stata programs to produce tables, listings, plots, and secondary resultssets containing derived parameters. Three recently-added packages for post-parmest processing are fvregen, invcise, and qqvalue. fvregen is used when the parameters belong to models containing factor variables, introduced in Stata Version 11. It regenerates these factor variables in the resultsset, enabling the user to plot, list, or tabulate factor levels with estimates and confidence limits of parameters specific to these factor levels. invcise calculates standard errors inversely from confidence limits produced without standard errors, such as those for medians and for Hodges-Lehmann median differences. These standard errors can then be input, with the estimates, into the metaparm module of parmest, to produce confidence intervals for linear combinations of medians or of median differences, such as those used in meta-analysis or interaction estimation. qqvalue inputs the P-values in a resultsset, and creates a new variable containing the frequentist q-values, calculated by inverting a multiple-test procedure designed to control the familywise error rate (FWER) or the false discovery rate (FDR). The frequentist q-value for each P-value is the minimum FWER or FDR for which that P-value would be in the discovery set, if the specified multiple-test procedure was used on the full set of P-values. fvregen, invcise, qqvalue, and parmest can be downloaded from SSC.
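A minimal sketch of the parmest-then-qqvalue workflow (auto data; the method() and qvalue() options shown are assumptions about qqvalue's syntax):
    sysuse auto, clear
    regress mpg weight length displacement gear_ratio
    tempfile pf
    parmest, saving(`"`pf'"', replace)
    use `"`pf'"', clear
    qqvalue p, method(simes) qvalue(qval)   // Simes q-values calculated from the parmest P-values
    list parm estimate p qval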
Download presentation
Download first example do-file
Download second example do-file
Return to top of page
Insufficient confounder adjustment is viewed as a common source of "false discoveries", especially in the epidemiology sector. However, adjustment for "confounders" that are correlated with the exposure, but which do not independently predict the outcome, may cause loss of power to detect the exposure effect. On the other hand, choosing confounders based on "stepwise" methods is subject to many hazards, which imply that the confidence interval eventually published is likely not to have the advertized coverage probability for the effect that we wanted to know. We would like to be able to find a model in the data on exposures and confounders, and then to estimate the parameters of that model from the conditional distribution of the outcome, given the exposures and confounders. The haif package, downloadable from SSC, calculates the homoskedastic adjustment inflation factors (HAIFs), by which the variances and standard errors of coefficients for a matrix of X-variables are scaled (or inflated), if a matrix of unnecessary confounders A is also included in a regression model, assuming equal variances (homoskedasticity). These can be calculated from the A- and X-variables alone, and can be used to inform the choice of a set of models eventually fitted to the outcome data, together with the usual criteria involving causality and prior opinion. Examples are given of the use of HAIFs and their ratios.
Download presentation
Return to top of page
parmest and extensions.
Presented at the 14th UK Stata User Meeting, 8-9 September, 2008.
The parmest package creates output datasets (or resultssets) with one observation for each of a set of estimated parameters, and data on the parameter estimates, standard errors, degrees of freedom, t- or z-statistics, P-values, confidence limits, and other parameter attributes specified by the user. It is especially useful when parameter estimates are "mass-produced", as in a genome scan. Versions of the package have existed on SSC since 1998, when it contained the single command parmest. However, the package has since been extended with additional commands. The metaparm command allows the user to mass-produce confidence intervals for linear combinations of uncorrelated parameters. Examples include confidence intervals for a weighted arithmetic or geometric mean parameter in a meta-analysis, or for differences or ratios between parameters, or for interactions, defined as differences (or ratios) between differences (or ratios). The parmcip command is a lower-level utility, inputting variables containing estimates, standard errors, and degrees of freedom, and outputting variables containing confidence limits and P-values. As an example, we may input genotype frequencies and calculate confidence intervals for geometric mean homozygote/heterozygote ratios for genetic polymorphisms, measuring the size and direction of departures from Hardy-Weinberg equilibrium.
Download presentation
Return to top of page
The cendif module is part of the somersd package, and calculates confidence intervals for the Hodges-Lehmann median difference between values of a variable in two subpopulations. The traditional Lehmann formula, unlike the formula used by cendif, assumes that the two subpopulation distributions are different only in location, and that the subpopulations are therefore equally variable. The cendif formula therefore contrasts with the Lehmann formula as the unequal-variance t-test contrasts with the equal-variance t-test. In a simulation study, designed to test cendif to destruction, the performance of cendif was compared to that of the Lehmann formula, using coverage probabilities and median confidence interval width ratios. The simulations involved sampling from pairs of Normal or Cauchy distributions, with subsample sizes ranging from 5 to 40, and between-subpopulation variability scale ratios ranging from 1 to 4. If the sample numbers were equal, then both methods gave coverage probabilities close to the advertized confidence level. However, if the sample numbers were unequal, then the Lehmann coverage probabilities were over-conservative if the smaller sample was from the less variable population, and over-liberal if the smaller sample was from the more variable population. The cendif coverage probability was usually closer to the advertized level, if the smaller sample was not very small. However, if the sample sizes were 5 and 40, and the two populations were equally variable, then the Lehmann coverage probability was close to its advertized level, while the cendif coverage probability was over-liberal. The cendif confidence interval, in its present form, is therefore robust both to non-Normality and to unequal variability, but may be less robust to the possibility that the smaller sample size is very small. Possibilities for improvement are discussed.
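A minimal sketch of the command itself (auto data; cendif is part of the somersd package on SSC):
    sysuse auto, clear
    * Hodges-Lehmann median difference in mpg between foreign and domestic cars, with confidence limits
    cendif mpg, by(foreign)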
Download presentation
Return to top of page
Somers' D and Kendall's tau-a are parameters behind rank or "non-parametric" statistics, interpreted as differences between proportions. Given two bivariate data pairs (X_1,Y_1) and (X_2,Y_2), Kendall's tau-a is the difference between the probability that the two pairs are concordant and the probability that the two pairs are discordant, and Somers' D is the difference between the corresponding conditional probabilities, given that the X-values are ordered. The somersd package computes confidence intervals for both parameters. The Stata 9 version of somersd uses Mata, and greatly extends the definition of Somers' D, allowing the X- and/or Y-variables to be left- or right-censored, and allowing multiple versions of Somers' D for multiple sampling schemes for pairs of X,Y-pairs. In particular, we may define stratified versions of Somers' D, in which we only compare pairs from the same stratum. The strata may be defined by grouping a Rubin-Rosenbaum propensity score, based on the values of multiple confounders for an association between an exposure variable X and an outcome variable Y. Therefore, rank statistics can have not only confidence intervals, but confounder-adjusted confidence intervals. Usually, we either estimate D(Y|X) as a measure of the effect of X on Y, or estimate D(X|Y) as a measure of the performance of X as a predictor of Y, compared to other predictors. Alternative rank-based measures of the effect of X on Y include the Hodges-Lehmann median difference and the Theil-Sen median slope, both of which are defined in terms of Somers' D and estimated using the somersd package.
Download presentation
Return to top of page
Most Stata users make their living producing results in a form accessible to end users. Most of these end users cannot immediately understand Stata logs. However, they can understand tables (in paper, PDF, HTML, spreadsheet or word processor documents) and plots (produced using Stata or non-Stata software). Tables are produced by Stata as resultsspreadsheets, and plots are produced by Stata as resultsplots. Sometimes (but not always), resultsspreadsheets and resultsplots are produced using resultssets. Resultssets, resultsspreadsheets and resultsplots are all produced, directly or indirectly, as output by Stata commands. A resultsset is a Stata dataset, which is a table, whose rows are Stata observations and whose columns are Stata variables. A resultsspreadsheet is a table in generic text format, conforming to a TeX or HTML convention, or to another convention with a column separator string and possibly left and right row delimiter strings. A resultsplot is a plot produced as output, using a resultsset or a resultsspreadsheet as input. Resultsset-producing programs include statsby, parmby, parmest, collapse, contract, xcollapse and xcontract. Resultsspreadsheet-producing programs include outsheet, listtex, estout and estimates table. Resultsplot-producing programs include eclplot and smileplot. There are two main approaches (or dogmas) for generating resultsspreadsheets and resultsplots. The resultsset-central dogma is followed by parmest and parmby users, and states: "Datasets make resultssets, which make resultsplots and resultsspreadsheets". The resultsspreadsheet-central dogma is followed by estout and estimates table users, and states: "Datasets make resultsspreadsheets, which make resultssets, which make resultsplots". The two dogmas are complementary, and each dogma has its advantages and disadvantages. The resultsspreadsheet dogma is much easier for the casual user to learn to apply in a hurry, and is therefore probably preferred by most users most of the time. The resultsset dogma is more difficult for most users to learn, but is more convenient for users who wish to program everything in do-files, with little or no manual cutting and pasting.
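A minimal sketch of the resultsset-central route (auto data; parmby is part of the parmest package on SSC, and the file name is illustrative):
    sysuse auto, clear
    * one observation per parameter per by-group, saved as a resultsset
    parmby "regress mpg weight", by(foreign) saving(myparms.dta, replace)
    use myparms.dta, clear
    list foreign parm estimate min95 max95 p if parm=="weight"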
Download projection
Download first example do-file
Download second example do-file
Return to top of page
Confidence intervals may be presented as publication-ready tables or as presentation-ready plots. eclplot produces plots of estimates and confidence intervals. It inputs a dataset (or resultsset) with one observation per parameter and variables containing estimates, lower and upper confidence limits, and a fourth variable, against which the confidence intervals are plotted. This resultsset can be used for producing both plots and tables, and may be generated using a spreadsheet or using statsby, postfile or the unofficial Stata parmest package. Currently, eclplot offers 7 plot types for the estimates and 8 plot types for the confidence intervals, each corresponding to a graph twoway subcommand. These plot types can be combined to produce 56 combined plot types, some of which are more useful than others, and all of which can be either horizontal or vertical. eclplot has a plot() option, allowing the user to superimpose other plots to add features such as stars for P-values. eclplot can be used either by typing a command, which may have multiple lines and sub-suboptions, or by using a dialog, which generates the command for users not fluent in the Stata graphics language. This presentation includes a demonstration of eclplot, using both commands and dialogs.
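A minimal sketch of the command route (auto data; parmest, sencode and eclplot from SSC; the temporary file is illustrative):
    sysuse auto, clear
    regress mpg weight length foreign
    tempfile pf
    parmest, saving(`"`pf'"', replace)
    use `"`pf'"', clear
    sencode parm, generate(parmid)                    // numeric plotting variable made from the parameter names
    eclplot estimate min95 max95 parmid, horizontal   // confidence intervals plotted against the parameters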
Download projection
Download entire presentation
Return to top of page
A resultsset is a Stata dataset created as output by a Stata program. It can be used as input to other Stata programs, which may in turn output the results as publication-ready plots or tables. Programs that create resultssets include xcontract, xcollapse, parmest, parmby and descsave. Stata resultssets do a similar job to SAS output data sets, which are saved to disk files. However, in Stata, the user typically has the options of saving a resultsset to a disk file, writing it to the memory (overwriting any pre-existing data set), or simply listing it. Resultssets are often saved to temporary files, using the tempfile command. This lecture introduces programs that create resultssets, and also programs that do things with resultssets after they have been created. listtex outputs resultssets to tables that can be inserted into a Microsoft Word, HTML or LaTeX document. eclplot inputs resultssets and creates confidence interval plots. Other programs, such as sencode and sdecode, process resultssets after they are created and before they are listed, tabulated or plotted. These programs, used together, have a power not always appreciated if the user simply reads the on-line help for each package. This lecture is a survey lecture, and is based on a handout and a set of example do-files, which can be downloaded with or without the presentation.
Download presentation
Download handout
Download example do-files
Return to top of page
Scientists often have good reasons for wanting to calculate multiple confidence intervals and/or P-values, especially when scanning a genome. However, if we do this, then the probability of not observing at least one "significant" difference tends to fall, even if all null hypotheses are true. A sceptical public will rightly ask whether a difference is "significant" when considered as one of a large number of parameters estimated. This presentation demonstrates some solutions to this problem, using the unofficial Stata packages parmest and smileplot. The parmest package allows the calculation of Bonferroni-corrected or Sidak-corrected confidence intervals for multiple estimated parameters. The smileplot package contains two programs, multproc (which carries out multiple test procedures) and smileplot (which presents their results graphically by plotting the P-value on a reverse log scale on the vertical axis against the parameter estimate on the horizontal axis). A multiple test procedure takes, as input, a set of estimates and P-values, and rejects a subset (possibly empty) of the null hypotheses corresponding to these P-values. Multiple test procedures have traditionally controlled the family-wise error rate (FWER), typically enabling the user to be 95% confident that all the rejected null hypotheses are false, and that all the corresponding "discoveries" are real. The price of this confidence is that the power to detect a difference of a given size tends to zero as the number of measured parameters becomes large. Therefore, recent work has concentrated on procedures that control the false discovery rate (FDR), such as the Simes procedure and the Yekutieli-Benjamini procedure. FDR-controlling procedures attempt to control the number of false discoveries as a proportion of the number of true discoveries, typically enabling the user to be 95% confident that some of the discoveries are real, or 90% confident that most of the discoveries are real. This less stringent requirement causes power to "bottom out" at a non-zero level as the number of tests becomes large. The smileplot package offers a selection of multiple test procedures of both kinds. This presentation uses data provided by the ALSPAC Study Team at the Institute of Child Health at Bristol University, UK.
Download presentation
Return to top of page
parmest and friends.
Presented at the 8th UK Stata User Meeting, 20-21 May, 2002.
Statisticians make their living mostly by producing confidence intervals and P-values. However, the ones supplied in the Stata log are not in any fit state to be delivered to the end user, who usually at least wants them tabulated and formatted, and may appreciate them even more if they are plotted on a graph for immediate impact. The parmest package was developed to make this easy, and consists of two programs. These are parmest, which converts the latest estimation results to a data set with one observation per estimated parameter and data on confidence intervals, P-values and other estimation results, and parmby, a "quasi-byable" front end to parmest, which is like statsby, but creates a data set with one observation per parameter per by-group instead of a data set with one observation per by-group. The parmest package can be used together with a team of other Stata programs to produce a wide range of tables and plots of confidence intervals and P-values. The programs descsave and factext can be used with parmby to create plots of confidence intervals against values of a categorical factor included in the fitted model, using dummy variables produced by xi or tabulate. The user may easily fit multiple models, produce a parmby output data set for each one, and concatenate these output data sets using the program dsconcat to produce a combined data set, which can then be used to produce tables or plots involving parameters from all the models. For instance, the user might tabulate or plot unadjusted and adjusted regression parameters side by side, together with their confidence limits and/or P-values. The parmest team is particularly useful when dealing with large volumes of results derived from multiple multi-parameter models, which are particularly common in the world of epidemiology. This version of the presentation is a post-publication update, made in response to changes in the parmest package suggested by Bill Gould of StataCorp after seeing the original presentation.
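A minimal sketch of the multiple-model workflow (auto data; parmby and dsconcat from SSC, and the file names are illustrative):
    sysuse auto, clear
    parmby "regress mpg weight", saving(unadj.dta, replace)                // unadjusted model
    parmby "regress mpg weight foreign length", saving(adj.dta, replace)   // adjusted model
    clear
    dsconcat unadj adj                                                     // concatenate the two resultssets
    list parm estimate min95 max95 p if parm=="weight"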
Download presentation
Return to top of page
Splines are traditionally used to model non-linear relationships involving continuous predictors, usually confounders. One example is in asthma epidemiology, where splines are used to model a seasonal and longer-term time trend in asthma-related hospital admissions, which must be eliminated in a search for shorter-term epidemics caused by pollution episodes. Usually, the spline is included in a regression model by defining a basis of splines, and including this basis amongst the X-variates, together with the predictors of interest. The basis is typically a plus-function basis, a truncated-power basis, or a Schoenberg B-spline basis. With any of these options, the parameters estimated by the regression model will not be easy to explain in words to non-mathematicians. An STB insert (sg151 in STB-57) presented two programs for generating spline bases. One of these (bspline) generates Schoenberg B-splines. The other program (frencurv, short for "French curve") generates an alternative spline basis, whose parameters are simply values of the spline at reference points along the horizontal axis. In the example from asthma epidemiology, these parameters might be the expected hospital admissions counts on the first day of each month, in the absence of a pollution episode. The expected pollution-free admissions counts on other days of the month are interpolated between the parameters, using the spline. These parameters can be presented, with their confidence limits, to non-technical people. Confidence limits can also be computed for differences and/or ratios between expected values at different reference points, using lincom.
Download presentation
Return to top of page
So-called "non-parametric" methods are in fact based on population
parameters, which are zero under the null hypothesis.
Two of these parameters are Kendall's tau-a and Somers' D.
both of which measure ordinal correlation between two variables
X and Y. If X is a binary variable,
then Somers' D(Y|X) is the parameter tested by a Wilcoxon rank-sum test.
It is more informative to have
confidence limits for these parameters than P-values alone,
for three main reasons. First, it might discourage people from
arguing that a high P-value proves a null hypothesis.
Second, for continuous
data, Kendall's tau-a is often related to the classical Pearson
correlation by Greiner's relation,
so we can use Kendall's tau-a to
define robust confidence limits for Pearson's correlation.
Third, we might want to know confidence limits for differences
between two Kendall's tau-a or Somers' D parameters,
because a larger Kendall's tau-a or Somers' D
cannot be secondary to a smaller one.
The program somersd
calculates confidence intervals for Somers' D or Kendall's tau-a,
using jackknife variances. There is a choice of transformations,
including Fisher's z, Daniels' arcsine, Greiner's rho, and the
z-transform of Greiner's rho. A cluster
option is available, intended
for measuring intra-class correlation (such as exists between
measurements on pairs of sisters). The estimation results are
saved as for a model fit, so that differences can be estimated using lincom
.
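A minimal sketch of the lincom usage (auto data; somersd from SSC):
    sysuse auto, clear
    somersd foreign price weight, transf(z)   // Somers' D of price and of weight with respect to foreign, on Fisher's z scale
    lincom price - weight                     // confidence interval for the difference between the two transformed parameters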
Download presentation
Return to top of page
Roger B. Newson
Email: r.newson@qmul.ac.uk
Text written: 21 May 2024. (Papers and presentations may have been revised since then.)
Return to top of page
Return to Roger Newson's main documents page
Return to Roger Newson's main resource page