These were given at Stata User Meetings (or Stata Conferences), and are summarized here with their abstracts and downloadables. Stata User Meetings (or Conferences) are a principal opportunity for face-to-face networking within the Stata user community, and are also an opportunity to feed ideas and wishes for further development back to Stata Corporation, the makers of Stata statistical software. These meetings take place in many locations all over the world, but the largest ones are usually in London, UK, where a large concentration of Stata users is based at locations mutually accessible by train, bus or bicycle. This page contains all my own presentations given at these meetings, in reverse order of presentation date.

To find out more about Stata User Meetings (or Conferences), click here for the official Stata versions of the proceedings, or click here for the Boston College of Economics versions maintained by Kit Baum and listed with RePEc. To find out more about Stata Statistical Software, click here.

Stata Corporation

Stata User Meeting proceedings at RePEc

Stata User Meeting proceedings at StataCorp

Return to Roger Newson's main documents page

Return to Roger Newson's main resource page

Metadatasets are Stata datasets, in files or in frames,
which may have one observation per file, per dataset, per variable, or per variable value.
Metadatasets can be used to modify a Stata database,
or to make a Stata database self-documenting,
especially if converted to non-Stata formats, such as HTML or even Microsoft Excel.
We present some user-written packages, updated to Stata Version 16, for creating and using metadatasets.
The `xdir` package creates a resultsset with one observation per file in a folder
conforming to a user-specified pattern.
The `descgen` package inputs an `xdir` resultsset,
and generates a new variable indicating whether each file is a Stata dataset,
and other new variables containing dataset attributes, such as the dataset label and characteristics,
the sort key of variables,
and the numbers of observations and variables.
The `vallabdef` package inputs a dataset with 1 observation per label name per value per value label,
and generates Stata value labels.
The `vallabsave` package loads and saves value labels from and to label-only datasets,
and transfers value labels between data frames.
The `descsave` package creates a metadataset with one observation per variable in a dataset,
and data on variable attributes (including characteristics).
The `invdesc` package modifies the variable attributes of the dataset in the current frame,
inputting a `descsave` resultsset in a second data frame to set the variable attributes,
and inputting value labels from a dataset in a third data frame.
The datasets containing the variable attributes and value labels
may be produced as resultssets by Stata packages,
or produced manually in a spreadsheet using LibreOffice Calc or Microsoft Excel,
and input into Stata datasets using `import delimited` or `import excel`.

Download presentation

Download example do-file

Download example .txt input files

Return to top of page

Scientists frequently work with pairs of alternative variables intended to measure the same quantity.
Examples include measured and predicted disease prevalences in primary-care practices,
and marks awarded to student exam scripts by two different teachers.
Statistical methods developed for use with such pairs of variables (*A* and *B*)
may aim to measure components of disagreement between the variables
(like discordance, bias and scale differential),
or they may aim to estimate one variable from the other (calibration).
The Bland-Altman plot is the standard way of presenting a pair of alternative measures,
and allows us to visualise discordance, bias and scale differential at the same time.
However, it lacks parameters with confidence limits.
The SSC packages `somersd`, `scsomersd` and `rcentile`
can be used to estimate rank parameters.
They can measure discordance using Kendall's τ_{a} between *A* and *B*,
bias using the mean sign and percentiles of *A-B*,
and scale differential using Kendall's τ_{a} between *A-B* and *A+B*.
For calibration (predicting *A* from *B*),
we can use the SSC packages `wridit` and `polyspline`
to define a ridit spline of *A* with respect to *B*.
We can then plot the observed *B* and the predicted *A* (with confidence limits)
against the ridit of *B*,
to create a continuous alternative to the standard decile plot commonly used for calibration.
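
The three rank parameters named above can be sketched in a few lines of Python. This is only an illustration of the definitions, using made-up data and hypothetical function names; it is not the code of `somersd` or `scsomersd`, and it gives point estimates without confidence limits.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_a(x, y):
    """Kendall's tau-a: mean of sign(x_i - x_j) * sign(y_i - y_j) over all pairs i < j."""
    pairs = list(combinations(range(len(x)), 2))
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    return total / len(pairs)


def mean_sign(values):
    """Mean sign: proportion of positive values minus proportion of negative values."""
    return sum(sign(v) for v in values) / len(values)


# Made-up paired measurements A and B (illustrative only).
a = [10.0, 12.0, 15.0, 11.0, 18.0]
b = [9.0, 13.0, 14.0, 10.0, 16.0]
diff = [ai - bi for ai, bi in zip(a, b)]
total = [ai + bi for ai, bi in zip(a, b)]

tau_ab = kendall_tau_a(a, b)                  # concordance between A and B
bias = mean_sign(diff)                        # mean sign of A-B
scale_differential = kendall_tau_a(diff, total)  # tau-a between A-B and A+B

print(tau_ab, bias, scale_differential)
```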

Download presentation

Download example dataset and do-files

Return to top of page

The Clinical Practice Research Datalink (CPRD) is a centrally-managed data warehouse,
storing data provided by the
primary-care sector of the United Kingdom (UK) National
Health Service (NHS).
Medical researchers request retrievals from this database,
which take the form of a collection of text datasets,
whose format can be complicated.
I have written a flagship package `cprdutil`,
with multiple modules to input into Stata the many text dataset types provided in a CPRD retrieval.
These text datasets may be converted either to Stata value labels or to Stata datasets,
which can be created complete with value labels, variable labels, and numeric Stata dates.
I have also written a fleet of satellite packages,
to input into Stata the text datasets for retrievals of linked data,
in which data are provided from non-CPRD sources,
with CPRD identifier variables as a foreign key to allow data linkage.
The modules of `cprdutil` are introduced.
A demonstration example is given,
in which a minimal CPRD database is produced in Stata, using `cprdutil`,
and some principles of sensible programming practice for creating large databases are illustrated.

Download presentation

Download example do-files

Return to top of page

Given a random variable *X*, the ridit function *R_{X}(·)* specifies its distribution.
The SSC package

Download presentation

Download example do-file

Return to top of page

The Rubin method of confounder adjustment, in its 21st-century version, is a two-phase method for using
observational data to estimate a causal treatment effect on an outcome variable. It involves first finding a
propensity model in the joint distribution of a treatment variable and its confounders (the design phase), and
then estimating the treatment effect from the conditional distribution of the outcome, given the treatments
and confounders (the analysis phase). In the design phase, we want to limit the level of spurious treatment
effect that might be caused by any residual imbalance between treatment and confounders that may remain,
after adjusting for the propensity score by propensity matching and/or weighting and/or stratification. A
good measure of this is Somers' *D(W|X)*, where *W* is a confounder or a propensity score, and *X* is the
treatment variable. The SSC package `somersd` calculates Somers' *D* for a wide range of sampling schemes,
allowing matching and/or weighting and/or restriction to comparisons within strata. Somers' *D* has the
feature that, if *Y* is an outcome, then a higher-magnitude *D(Y|X)* cannot be secondary to a lower-magnitude
*D(W|X)*, implying that *D(W|X)* can be used to set an upper bound to the size of a spurious treatment effect
on an outcome. For a binary treatment variable *X*, *D(W|X)* gives an upper bound to the size of a difference
between the proportions, in the two treatment groups, that can be caused for a binary outcome. If *D(W|X)*
is less than 0.5, then it can be doubled to give an upper bound to the size of a difference between the means,
in the two treatment groups, that can be caused for an equal-variance Normal outcome, expressed in units
of the common standard deviation for the two treatment groups. We illustrate this method using a familiar
dataset, with examples using propensity matching, weighting and stratification. We use the SSC package
`haif` in the design phase, to check for variance inflation caused by propensity adjustment, and use the SSC
package `scenttest` (an addition to the `punaf` family) to estimate the treatment effect in the analysis phase.
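
The doubling rule for Normal outcomes can be checked numerically. The sketch below assumes a binary *X* and an equal-variance Normal outcome, for which Somers' *D* equals 2Φ(δ/√2) − 1, where δ is the difference between means in units of the common standard deviation. The function names are hypothetical, and this is not code from `somersd`.

```python
from statistics import NormalDist


def somers_d_from_std_mean_diff(delta):
    """Somers' D for binary X and equal-variance Normal Y, given the mean
    difference delta in units of the common standard deviation."""
    return 2.0 * NormalDist().cdf(delta / 2 ** 0.5) - 1.0


def std_mean_diff_from_somers_d(d):
    """Inverse of the relation above: standardized mean difference for a given D."""
    return 2 ** 0.5 * NormalDist().inv_cdf((d + 1.0) / 2.0)


# For D below 0.5, the standardized mean difference stays below 2*D.
for d in (0.05, 0.10, 0.25, 0.49):
    delta = std_mean_diff_from_somers_d(d)
    print(f"D = {d:.2f}  delta = {delta:.4f}  2*D = {2 * d:.2f}")
```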

Download presentation

Download example do-file

Return to top of page

Somers' *D(Y|X)* is an asymmetric measure of ordinal association
between two variables *Y* and *X*, on a scale from -1 to 1.
It is defined as the difference between the *conditional*
probabilities of concordance and discordance between two randomly-sampled
*(X,Y)*-pairs, *given* that the two *X*-values are
ordered. The `somersd` package enables the user to estimate Somers'
*D* for a wide range of sampling schemes, allowing clustering and/or
sampling-probability weighting and/or restriction to comparisons within
strata. Somers' *D* has the useful feature that a larger *D(Y|X)*
cannot be secondary to a smaller *D(W|X)* with the same sign,
enabling us to make scientific statements that the first ordinal
association cannot be caused by the second. An important practical example,
especially for public-health scientists, is the case where *Y* is an
outcome, *X* an exposure, and *W* a propensity score.
However, an audience accustomed to other measures of association may be
culture-shocked, if we present associations measured using Somers'
*D*. Fortunately, under some commonly-used models, Somers' *D* is
related monotonically to an alternative association measure, which may be
more clearly related to the practical question of how much good we can do.
These relationships are nearly linear (or log-linear) over the range of
Somers' *D* values from -0.5 to 0.5. We present examples with
*X* and *Y* binary, with *X* binary and *Y* a
survival time, with *X* binary and *Y* conditionally Normal,
and with *X* and *Y* bivariate Normal. Somers' *D* can
therefore be used as a common currency for comparing a wide range of
associations between variables, not limited to a particular model.
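
The definition of Somers' *D(Y|X)* given above can be written out directly. The following Python sketch uses made-up data and a hypothetical function name; it computes only the sample statistic, whereas the `somersd` package also provides confidence intervals and handles clustering, weights and strata.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def somers_d(y, x):
    """Sample Somers' D(Y|X): concordances minus discordances, divided by the
    number of pairs whose X-values are ordered (that is, not tied).
    Raises ZeroDivisionError if all X-values are tied."""
    num = 0
    den = 0
    for i, j in combinations(range(len(x)), 2):
        sx = sign(x[i] - x[j])
        sy = sign(y[i] - y[j])
        num += sx * sy
        den += sx * sx
    return num / den


# Made-up data: binary exposure X and an ordinal outcome Y.
x = [0, 0, 0, 1, 1, 1]
y = [1, 2, 4, 3, 5, 6]

d_yx = somers_d(y, x)  # 8 concordant and 1 discordant among 9 ordered pairs
d_xy = somers_d(x, y)  # same numerator, but all 15 pairs have ordered Y-values
print(d_yx, d_xy)
```

The two results differ because the measure is asymmetric: the denominator counts only the pairs in which the conditioning variable is ordered.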

Download presentation

Download example do-file

Return to top of page

So-called non-parametric methods are in fact based on estimating and testing parameters, usually either rank
parameters or spline parameters. Two comprehensive packages for estimating these are `somersd` (for rank
parameters) and `bspline` (for spline parameters). Both of these estimate a wide range of parameters, but
both are frequently found to be difficult to use by casual users. This presentation introduces `rcentile`, an
easy-to-use front end for `somersd`, and `polyspline`, an easy-to-use front end for `bspline`.
`rcentile` estimates percentiles with confidence limits, optionally allowing for clustered sampling and
sampling-probability weights. The confidence intervals are saved in a Stata matrix, with one row per percentile,
which the user can save to a resultsset using the `xsvmat` package. `polyspline` inputs an *X*-variable
and a user-defined list of reference points and outputs a basis of variables for a polynomial or for
another unrestricted spline. This basis can be included in the covariate list for an estimation command, and the
corresponding parameters will be values of the polynomial or spline at the reference points, or differences between
these values. By default, the spline will simply be a polynomial, with a degree one less than the number of
reference points. However, if the user specifies a lower degree, then the spline will have knots interpolated
sensibly between the reference points.
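
For the default polynomial case, a basis whose parameters are the values of the curve at the reference points is the Lagrange basis. The sketch below illustrates that idea only; the function names, reference points and quadratic are made up, and this is not how `polyspline` is implemented.

```python
def lagrange_basis(x, ref_points):
    """Return [L_0(x), ..., L_k(x)], where L_j equals 1 at ref_points[j]
    and 0 at every other reference point."""
    values = []
    for j, rj in enumerate(ref_points):
        term = 1.0
        for m, rm in enumerate(ref_points):
            if m != j:
                term *= (x - rm) / (rj - rm)
        values.append(term)
    return values


def f(x):
    """An arbitrary quadratic used as the 'true' curve in this illustration."""
    return 2.0 + 3.0 * x - 0.1 * x * x


# Three reference points give a polynomial of degree two.
ref_points = [0.0, 5.0, 10.0]
params = [f(r) for r in ref_points]  # parameters = values at reference points


def fitted(x):
    """Evaluate the polynomial from its values at the reference points."""
    return sum(p * b for p, b in zip(params, lagrange_basis(x, ref_points)))


print(params, fitted(7.3), f(7.3))
```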

Download presentation

Download example do-file

Return to top of page

Factor variables are defined as categorical variables with integer
values, which may represent values of some other kind, specified by a
value label. We frequently want to generate such variables in Stata
datasets, especially resultssets, which are output Stata datasets
produced by Stata programs, such as the official Stata `statsby` command
and the SSC packages `parmest` and `xcontract`. This is because
categorical string variables can only be plotted after conversion to
numeric variables, and because these numeric variables are also
frequently used in defining a key of variables, which identify
observations in the resultsset uniquely in a sensible sort order. The
`sencode` package is downloadable, and frequently downloaded, from SSC,
and is a “super” version of `encode`, which inputs a string variable and
outputs a numeric factor variable. Its added features include a
`replace` option allowing the output numeric variable to replace the
input string variable, a `gsort()` option allowing the numeric values to
be ordered in ways other than alphabetical order of the input string
values, and a `manyto1` option allowing multiple output numeric values
to map to the same input string value. The `sencode` package is
well–established, and has existed since 2001. However, some tips will
be given on ways of using it that are not immediately obvious, but
which the author has found very useful over the years when
mass–producing resultssets. These applications use `sencode` with other
commands, such as the official Stata command `split` and the SSC
packages `factmerg`, `factext` and `fvregen`.
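
The general idea of encoding a string variable as an integer-valued factor with a value label, in an order chosen by the user, can be sketched as follows. This is a language-neutral illustration in Python with a hypothetical `encode` function and made-up data; it does not reproduce the syntax, defaults or algorithm of `sencode`.

```python
def encode(strings, sort_key=None):
    """Map each distinct string to an integer code 1, 2, ... and return
    (codes, labels). Codes follow order of first appearance unless a
    sort_key is given, in which case distinct strings are sorted by it."""
    distinct = list(dict.fromkeys(strings))  # distinct values, first-appearance order
    if sort_key is not None:
        distinct = sorted(distinct, key=sort_key)
    code_of = {s: k for k, s in enumerate(distinct, start=1)}
    labels = {k: s for s, k in code_of.items()}  # plays the role of a value label
    return [code_of[s] for s in strings], labels


severity = ["mild", "severe", "moderate", "mild"]

# Order of first appearance.
codes_default, labels_default = encode(severity)

# User-specified order, so that plots and sort keys follow clinical severity.
rank = {"mild": 1, "moderate": 2, "severe": 3}
codes_ranked, labels_ranked = encode(severity, sort_key=rank.get)

print(codes_default, labels_default)
print(codes_ranked, labels_ranked)
```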

Download presentation

Download example do-file

Return to top of page

Applied scientists, especially public health scientists, frequently want to know how much good can be caused
by a proposed intervention. For instance, they might want to estimate how much we could decrease the level
of a disease, in a dream scenario where the whole world stopped smoking, assuming that a regression model
fitted to a sample is true. Alternatively, they may want to compare the same scenario between regression
models fitted to different datasets, as when disease rates in different subpopulations are standardized to a
common distribution of gender and age, using the same logistic regression model with different parameters in
each subpopulation. In statistics, scenarios can be defined as alternative versions of a dataset, with the same
variables, but with different values in the observations, or even with non–corresponding observations. Using
regression methods, we may estimate scenario means of a *Y*–variable in scenarios with specified *X*–values,
and compare these scenario means. In Stata Versions 11 and 12, the standard tool for estimating scenario
means is `margins`. A suite of packages is introduced for estimating scenario means and their comparisons,
using `margins`, together with `nlcom` to implement Normalizing and variance–stabilizing transformations.
`margprev` estimates scenario prevalences for binary variables. `marglmean` estimates scenario arithmetic
means for non–negative valued variables. `regpar` estimates 2 scenario prevalences, together with their
difference, the population attributable risk (PAR). `punaf` estimates 2 scenario arithmetic means from cohort
or cross–sectional data, together with their ratio, the population unattributable fraction (PUF), which is
subtracted from 1 to give the population attributable fraction (PAF). `punafcc` estimates an arithmetic mean
between–scenario rate ratio for cases or non–survivors in case–control or survival data, respectively. This
mean rate ratio, also known as a PUF, is also subtracted from 1 to estimate a PAF. These packages use
the log transformation for arithmetic means and their ratios, the logit transformation for prevalences, and
the hyperbolic arctangent or Fisher’s *z* transformation for differences between prevalences. Examples are
presented for these packages.
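
The arithmetic linking the PUF, the PAF and the log transformation can be sketched as below. The scenario means and the standard error of the log ratio are made-up inputs (in practice the standard error would come from the fitted model), the function name and the labelling of the two scenarios are assumptions, and a Normal-based interval is assumed. This is not the code of `punaf`.

```python
from math import exp, log
from statistics import NormalDist


def paf_with_ci(mean_baseline, mean_fantasy, se_log_ratio, level=0.95):
    """PUF = mean_fantasy / mean_baseline and PAF = 1 - PUF.
    Confidence limits are computed for log(PUF), back-transformed,
    and then subtracted from 1 to give limits for the PAF."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    log_puf = log(mean_fantasy / mean_baseline)
    puf = exp(log_puf)
    puf_lower = exp(log_puf - z * se_log_ratio)
    puf_upper = exp(log_puf + z * se_log_ratio)
    return {
        "puf": puf,
        "puf_ci": (puf_lower, puf_upper),
        "paf": 1.0 - puf,
        "paf_ci": (1.0 - puf_upper, 1.0 - puf_lower),
    }


# Made-up inputs: prevalence 0.20 in the baseline scenario, 0.15 in the
# fantasy scenario, and an assumed standard error of 0.05 for the log ratio.
result = paf_with_ci(0.20, 0.15, 0.05)
print(result)
```

Working on the log scale keeps the limits for the PUF positive, so the upper limit for the PAF stays below 1.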

Download presentation

Download example do-file

Return to top of page

Splines, including polynomials, are traditionally used to model non-linear relationships involving continuous
predictors. However, when they are included in linear models (or generalized linear models), the estimated
parameters for polynomials are not easy for non–mathematicians to understand, and the estimated parameters
for other splines are often not easy even for mathematicians to understand. It would be easier if the
parameters were values of the polynomial or spline at reference points on the *X*–axis, or differences or ratios
between the values of the spline at the reference points and the value of the spline at a base reference point.
The `bspline` package can be downloaded from SSC, and generates spline bases for inclusion in the design
matrices of linear models, based on Schoenberg *B*–splines. The package now has a recently added module
`flexcurv`, which inputs a sequence of reference points on the *X*–axis, and outputs a spline basis, based on
equally–spaced knots generated automatically, whose parameters are the values of the spline at the reference
points. This spline basis can be modified by excluding the spline vector at a base reference point and
including the unit vector. If this is done, then the parameter corresponding to the unit vector will be the
value of the spline at the base reference point, and the parameters corresponding to the remaining reference
spline vectors will be differences between the values of the spline at the corresponding reference points and
the value of the spline at the base reference point. The spline bases are therefore extensions, to continuous
factors, of the bases of unit vectors and/or indicator functions used to model discrete factors. It is possible
to combine these bases for different continuous and/or discrete factors in the same way, using product bases
in a design matrix to estimate factor–value combination means and/or factor–value effects and/or factor
interactions.
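
The reparameterization described above relies on the basis functions summing to one, so that swapping one basis vector for the unit vector turns the remaining parameters into differences from the base value. The sketch below checks this numerically, using a polynomial (Lagrange) basis as a simple stand-in for the reference splines; the names and numbers are made up, and this is not the `flexcurv` algorithm.

```python
def reference_basis(x, ref_points):
    """Stand-in basis: function j equals 1 at ref_points[j] and 0 at the
    other reference points (a Lagrange polynomial basis, used here only
    for illustration in place of reference splines)."""
    values = []
    for j, rj in enumerate(ref_points):
        term = 1.0
        for m, rm in enumerate(ref_points):
            if m != j:
                term *= (x - rm) / (rj - rm)
        values.append(term)
    return values


ref_points = [0.0, 5.0, 10.0]
values_at_refs = [2.0, 14.5, 22.0]  # made-up curve values at the reference points
base = 0                            # index of the base reference point


def curve_from_values(x):
    """Original parameterization: parameters are values at reference points."""
    basis = reference_basis(x, ref_points)
    return sum(v * b for v, b in zip(values_at_refs, basis))


def curve_from_differences(x):
    """Modified parameterization: the unit vector carries the base value, and
    the remaining basis functions carry differences from the base value."""
    basis = reference_basis(x, ref_points)
    total = values_at_refs[base] * 1.0
    for j, b in enumerate(basis):
        if j != base:
            total += (values_at_refs[j] - values_at_refs[base]) * b
    return total


for x in (0.0, 2.5, 5.0, 7.3, 10.0):
    print(x, curve_from_values(x), curve_from_differences(x))
```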

Download presentation

Download example do-files

Return to top of page

`parmest` peripherals: `fvregen`, `invcise`, and `qqvalue`.
Presented at the 16th UK Stata User Meeting, 9-10 September, 2010.
The `parmest` package is used with Stata estimation commands to produce output datasets (or resultssets)
with one observation per estimated parameter, and data on parameter names, estimates, confidence limits,
*P*-values, and other parameter attributes. These resultssets can then be input to other Stata programs
to produce tables, listings, plots, and secondary resultssets containing derived parameters.
Three recently-added packages for post-`parmest` processing are `fvregen`, `invcise`, and `qqvalue`.
`fvregen` is used when the
parameters belong to models containing factor variables, introduced in Stata Version 11.
It regenerates these factor variables in the resultsset,
enabling the user to plot, list, or tabulate factor levels with estimates and
confidence limits of parameters specific to these factor levels. `invcise` calculates standard errors inversely
from confidence limits produced without standard errors, such as those for medians and for Hodges-Lehmann
median differences. These standard errors can then be input, with the estimates, into the `metaparm` module
of `parmest`, to produce confidence intervals for linear combinations of medians or of median differences, such
as those used in meta-analysis or interaction estimation. `qqvalue` inputs the *P*-values in a resultsset, and
creates a new variable containing the frequentist *q*-values, calculated by inverting a multiple-test procedure
designed to control the familywise error rate (FWER) or the false discovery rate (FDR). The frequentist
*q*-value for each *P*-value is the minimum FWER or FDR for which that *P*-value would be in the discovery
set, if the specified multiple-test procedure was used on the full set of *P*-values.
`fvregen`, `invcise`, `qqvalue`, and `parmest` can be downloaded from SSC.
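
The idea of calculating a standard error inversely from confidence limits can be sketched as follows, assuming that the limits behave like Normal-based limits, so that the interval width is twice the critical value times the standard error. The function name is hypothetical, and `invcise` itself may use a different distribution or formula.

```python
from statistics import NormalDist


def se_from_limits(lower, upper, level=0.95):
    """Standard error implied by a confidence interval, assuming the interval
    width equals 2 * z * SE for the Normal critical value z."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (upper - lower) / (2.0 * z)


# Round trip with made-up numbers: build limits from a known SE, then recover it.
estimate, true_se = 3.0, 0.5
z95 = NormalDist().inv_cdf(0.975)
lower, upper = estimate - z95 * true_se, estimate + z95 * true_se
recovered_se = se_from_limits(lower, upper)
print(recovered_se)
```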

Download presentation

Download first example do-file

Download second example do-file

Return to top of page

Insufficient confounder adjustment is viewed as a common source of "false
discoveries", especially in the epidemiology sector. However, adjustment for
"confounders" that are correlated with the exposure, but which do not
independently predict the outcome, may cause loss of power to detect the
exposure effect. On the other hand, choosing confounders based on "stepwise"
methods is subject to many hazards, which imply that the confidence interval
eventually published is likely not to have the advertized coverage
probability for the effect that we wanted to know. We would like to be able
to find a model in the data on exposures and confounders, and then to
estimate the parameters of that model from the conditional distribution of
the outcome, given the exposures and confounders. The `haif` package,
downloadable from SSC, calculates the homoskedastic adjustment inflation
factors (HAIFs), by which the variances and standard errors of coefficients
for a matrix of *X*-variables are scaled (or inflated), if a matrix of
unnecessary confounders *A* is also included in a regression model, assuming
equal variances (homoskedasticity). These can be calculated from the *A*- and
*X*-variables alone, and can be used to inform the choice of a set of models
eventually fitted to the outcome data, together with the usual criteria
involving causality and prior opinion. Examples are given of the use of HAIFs
and their ratios.

Download presentation

Return to top of page

`parmest` and extensions.
Presented at the 14th UK Stata User Meeting, 8-9 September, 2008.
The `parmest` package creates output datasets (or resultssets) with one
observation for each of a set of estimated parameters, and data on the
parameter estimates, standard errors, degrees of freedom, *t*- or *z*-statistics,
*P*-values, confidence limits, and other parameter attributes specified by the
user. It is especially useful when parameter estimates are "mass-produced",
as in a genome scan. Versions of the package have existed on SSC since 1998,
when it contained the single command `parmest`. However, the package has since
been extended with additional commands. The `metaparm` command allows the user
to mass-produce confidence intervals for linear combinations of uncorrelated
parameters. Examples include confidence intervals for a weighted arithmetic
or geometric mean parameter in a meta-analysis, or for differences or ratios
between parameters, or for interactions, defined as differences (or ratios)
between differences (or ratios). The `parmcip` command is a lower-level utility,
inputting variables containing estimates, standard errors, and degrees of
freedom, and outputting variables containing confidence limits and *P*-values.
As an example, we may input genotype frequencies and calculate confidence
intervals for geometric mean homozygote/heterozygote ratios for genetic
polymorphisms, measuring the size and direction of departures from
Hardy-Weinberg equilibrium.
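
The calculation behind a confidence interval for a linear combination of uncorrelated parameters can be sketched as below. The function name and inputs are made up, and Normal (*z*) critical values are assumed for simplicity, so this illustrates the arithmetic rather than the behaviour of `metaparm` or `parmcip`.

```python
from math import sqrt
from statistics import NormalDist


def linear_combination(estimates, std_errors, weights, level=0.95):
    """Estimate, standard error, confidence limits and two-sided P-value for
    sum(w_i * theta_i), assuming the parameter estimates are uncorrelated."""
    est = sum(w * e for w, e in zip(weights, estimates))
    se = sqrt(sum((w * s) ** 2 for w, s in zip(weights, std_errors)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(est / se)))
    return est, se, (est - z_crit * se, est + z_crit * se), p_value


# Made-up example: difference between two independent estimates.
est, se, limits, p_value = linear_combination(
    estimates=[1.2, 0.7], std_errors=[0.3, 0.4], weights=[1.0, -1.0]
)
print(est, se, limits, p_value)
```

For ratios or geometric means, the same calculation would be applied to log-scale estimates and the results back-transformed.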

Download presentation

Return to top of page

The `cendif` module is part of the `somersd` package, and calculates confidence
intervals for the Hodges-Lehmann median difference between values of a
variable in two subpopulations. The traditional Lehmann formula, unlike the
formula used by `cendif`, assumes that the two subpopulation distributions
are different only in location, and that the subpopulations are therefore
equally variable. The `cendif` formula therefore contrasts with the Lehmann
formula as the unequal-variance *t*-test contrasts with the equal-variance
*t*-test. In a simulation study, designed to test `cendif` to destruction,
the performance of `cendif` was compared to that of the Lehmann formula,
using coverage probabilities and median confidence interval width ratios.
The simulations involved sampling from pairs of Normal or Cauchy
distributions, with subsample sizes ranging from 5 to 40, and
between-subpopulation variability scale ratios ranging from 1 to 4. If the
sample numbers were equal, then both methods gave coverage probabilities
close to the advertized confidence level. However, if the sample numbers
were unequal, then the Lehmann coverage probabilities were over-conservative
if the smaller sample was from the less variable population, and
over-liberal if the smaller sample was from the more variable population.
The `cendif` coverage probability was usually closer to the advertized level,
if the smaller sample was not very small. However, if the sample sizes were
5 and 40, and the two populations were equally variable, then the Lehmann
coverage probability was close to its advertized level, while the `cendif`
coverage probability was over-liberal. The `cendif` confidence interval,
in its present form, is therefore robust both to non-Normality and to
unequal variability, but may be less robust to the possibility that the
smaller sample size is very small. Possibilities for improvement are
discussed.
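
The point estimate that these confidence intervals surround is simple to state: the Hodges-Lehmann median difference is the median of all pairwise differences between the two samples. A minimal Python sketch with made-up data and a hypothetical function name follows; it does not compute the `cendif` or Lehmann confidence limits.

```python
from statistics import median


def hodges_lehmann_difference(group_a, group_b):
    """Median of all pairwise differences a - b between the two samples."""
    return median(a - b for a in group_a for b in group_b)


# Made-up samples from two subpopulations.
group_a = [5, 7, 9]
group_b = [1, 2, 6]
hl = hodges_lehmann_difference(group_a, group_b)
print(hl)
```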

Download presentation

Return to top of page

Somers' *D* and Kendall's tau-a are parameters behind rank or "non-parametric"
statistics, interpreted as differences between proportions. Given two
bivariate data pairs *(X_1,Y_1)* and *(X_2,Y_2)*, Kendall's tau-a is the
difference between the probability that the two pairs are concordant
and the probability that the two pairs are discordant, and Somers' *D* is the
difference between the corresponding conditional probabilities, given
that the *X*-values are ordered. The `somersd` package computes confidence
intervals for both parameters. The Stata 9 version of `somersd` uses Mata,
and greatly extends the definition of Somers' *D*, allowing the *X*- and/or
*Y*-variables to be left- or right-censored, and allowing multiple versions
of Somers' *D* for multiple sampling schemes for pairs of *X,Y*-pairs.
In particular, we may define stratified versions of Somers' *D*, in which we
only compare pairs from the same stratum. The strata may be defined by
grouping a Rubin-Rosenbaum propensity score, based on the values of
multiple confounders for an association between an exposure variable *X* and
an outcome variable *Y*. Therefore, rank statistics can have not only
confidence intervals, but confounder-adjusted confidence intervals.
Usually, we either estimate *D(Y|X)* as a measure of the effect of *X* on *Y*,
or estimate *D(X|Y)* as a measure of the performance of *X* as a predictor of *Y*,
compared to other predictors. Alternative rank-based measures of the effect
of *X* on *Y* include the Hodges-Lehmann median difference and the Theil-Sen
median slope, both of which are defined in terms of Somers' *D* and estimated
using the `somersd` package.
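
Restricting comparisons to pairs from the same stratum can be illustrated with a small sketch. The data, the strata and the function name are made up, only the sample statistic is computed, and this is not the `somersd` implementation. In this example an overall association between *X* and *Y* disappears once comparisons are limited to pairs within strata.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def somers_d(y, x, strata=None):
    """Sample Somers' D(Y|X). If strata is given, only pairs of observations
    from the same stratum are compared."""
    num = 0
    den = 0
    for i, j in combinations(range(len(x)), 2):
        if strata is not None and strata[i] != strata[j]:
            continue  # skip between-stratum comparisons
        sx = sign(x[i] - x[j])
        num += sx * sign(y[i] - y[j])
        den += sx * sx
    return num / den


# Made-up data: stratum B has more exposed subjects and higher outcomes.
x = [0, 0, 1, 0, 1, 1]
y = [1, 3, 2, 8, 7, 9]
strata = ["A", "A", "A", "B", "B", "B"]

d_overall = somers_d(y, x)
d_within = somers_d(y, x, strata)
print(d_overall, d_within)
```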

Download presentation

Return to top of page

Most Stata users make their living producing results in a form accessible to end users.
Most of these end users cannot immediately understand Stata logs.
However, they can understand tables (in paper, PDF, HTML, spreadsheet or word processor documents)
and plots (produced using Stata or non-Stata software).
Tables are produced by Stata as resultsspreadsheets, and plots are produced by Stata as resultsplots.
Sometimes (but not always), resultsspreadsheets and resultsplots are produced using resultssets.
Resultssets, resultsspreadsheets and resultsplots are all produced, directly or indirectly, as output by Stata commands.
A resultsset is a Stata dataset, which is a table,
whose rows are Stata observations and whose columns are Stata variables.
A resultsspreadsheet is a table in generic text format, conforming to a TeX or HTML convention, or to another convention
with a column separator string and possibly left and right row delimiter strings.
A resultsplot is a plot produced as output, using a resultsset or a resultsspreadsheet as input.
Resultsset-producing programs include `statsby`, `parmby`, `parmest`, `collapse`, `contract`,
`xcollapse` and `xcontract`.
Resultsspreadsheet-producing programs include `outsheet`, `listtex`, `estout` and `estimates table`.
Resultsplot-producing programs include `eclplot` and `smileplot`.
There are two main approaches (or dogmas) for generating resultsspreadsheets and resultsplots.
The resultsset-central dogma is followed by `parmest` and `parmby` users, and states:
"Datasets make resultssets, which make resultsplots and resultsspreadsheets".
The resultsspreadsheet-central dogma is followed by `estout` and `estimates table` users, and states:
"Datasets make resultsspreadsheets, which make resultssets, which make resultsplots".
The two dogmas are complementary, and each dogma has its advantages and disadvantages.
The resultsspreadsheet dogma is much easier for the casual user to learn to apply in a hurry,
and is therefore probably preferred by most users most of the time.
The resultsset dogma is more difficult for most users to learn, but is more convenient
for users who wish to program *everything* in do-files,
with little or no manual cutting and pasting.

Download projection

Download first example do-file

Download second example do-file

Return to top of page

Confidence intervals may be presented as publication-ready tables or as presentation-ready plots. `eclplot`
produces plots of estimates and confidence intervals. It inputs a dataset (or resultsset) with one observation
per parameter and variables containing estimates, lower and upper confidence limits, and a fourth variable,
against which the confidence intervals are plotted. This resultsset can be used for producing both plots
and tables, and may be generated using a spreadsheet or using `statsby`, `postfile` or the unofficial Stata
`parmest` package. Currently, `eclplot` offers 7 plot types for the estimates and 8 plot types for the confidence
intervals, each corresponding to a `graph twoway` subcommand. These plot types can be combined to produce
56 combined plot types, some of which are more useful than others, and all of which can be either horizontal
or vertical. `eclplot` has a `plot()` option, allowing the user to superimpose other plots to add features such
as stars for *P*-values. `eclplot` can be used either by typing a command, which may have multiple lines and
sub-suboptions, or by using a dialog, which generates the command for users not fluent in the Stata graphics
language. This presentation includes a demonstration of `eclplot`, using both commands and dialogs.

Download projection

Download entire presentation

Return to top of page

A resultsset is a Stata dataset created as output by a Stata program.
It can be used as input to other Stata programs, which may in turn output the results
as publication-ready plots or tables. Programs that create resultssets include
`xcontract`, `xcollapse`, `parmest`, `parmby` and `descsave`.
Stata resultssets do a similar job to SAS output data sets,
which are saved to disk files. However, in Stata, the user typically has the options of saving a resultsset
to a disk file, writing it to the memory (overwriting any pre-existing data set), or simply listing it.
Resultssets are often saved to temporary files, using the `tempfile` command.
This lecture introduces programs that create resultssets,
and also programs that do things with resultssets after they have been created.
`listtex` outputs resultssets to tables that can be inserted into a Microsoft Word, HTML or LaTeX document.
`eclplot` inputs resultssets and creates confidence interval plots.
Other programs, such as `sencode` and `sdecode`, process resultssets after they are created
and before they are listed, tabulated or plotted.
These programs, used together, have a power not always appreciated if the user simply reads the on-line help
for each package.
This lecture is a survey lecture, and is based on a handout and a set of example do-files,
which can be downloaded with or without the presentation.

Download presentation

Download handout

Download example do-files

Return to top of page

Scientists often have good reasons for wanting to calculate multiple confidence intervals and/or *P*-values,
especially when scanning a genome. However, if we do this, then the probability of *not* observing
at least one "significant" difference tends to fall, even if all null hypotheses are true.
A sceptical public will rightly ask whether a difference is "significant" when considered
as one of a large number of parameters estimated. This presentation demonstrates some solutions to
this problem, using the unofficial Stata packages `parmest` and `smileplot`.
The `parmest` package allows the calculation of Bonferroni-corrected or Sidak-corrected
confidence intervals for multiple estimated parameters. The `smileplot` package contains two programs,
`multproc` (which carries out multiple test procedures) and `smileplot` (which presents their
results graphically by plotting the *P*-value on a reverse log scale on the vertical axis
against the parameter estimate on the horizontal axis). A multiple test procedure takes, as input,
a set of estimates and *P*-values, and rejects a subset (possibly empty) of the null hypotheses
corresponding to these *P*-values. Multiple test procedures have traditionally controlled the
family-wise error rate (FWER), typically enabling the user to be 95% confident that *all* the rejected
null hypotheses are false, and that *all* the corresponding "discoveries" are real.
The price of this confidence is that the power to detect a difference of a given size tends to zero
as the number of measured parameters becomes large. Therefore, recent work has concentrated on procedures
that control the false discovery rate (FDR), such as the Simes procedure and the Yekutieli-Benjamini procedure.
FDR-controlling procedures attempt to control the number of false discoveries as a proportion
of the total number of discoveries, typically enabling the user to be 95% confident that *some*
of the discoveries are real, or 90% confident that *most* of the discoveries are real.
This less stringent requirement causes power to "bottom out" at a non-zero level as the number
of tests becomes large. The `smileplot` package offers a selection of multiple test procedures of both kinds.
This presentation uses data provided by the ALSPAC Study Team at the Institute of Child Health
at Bristol University, UK.
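
The contrast between FWER and FDR control can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions, not the `multproc` code: the function names and the *P*-values are invented, and "Simes procedure" is taken here to mean the step-up rule usually attributed to Simes (1986) and to Benjamini and Hochberg (1995).

```python
def bonferroni_level(alpha, m):
    """Per-comparison level that controls the FWER at alpha over m tests."""
    return alpha / m


def sidak_level(alpha, m):
    """Sidak per-comparison level (exact if the m tests are independent)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def step_up_rejections(pvalues, q):
    """Indices rejected by the Simes-type step-up rule at FDR level q.

    Sort the P-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the hypotheses with the k smallest P-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    return set(order[:k_max])


# Invented P-values for five parameters.
pvals = [0.001, 0.025, 0.029, 0.038, 0.60]

fwer_rejected = {i for i, p in enumerate(pvals)
                 if p <= bonferroni_level(0.05, len(pvals))}
fdr_rejected = step_up_rejections(pvals, 0.05)

print(sorted(fwer_rejected), sorted(fdr_rejected))
```

With these invented inputs the Bonferroni rule rejects only the smallest *P*-value, while the step-up rule rejects the four smallest, which is the extra power that FDR control buys at the cost of a weaker guarantee.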

Download presentation

`parmest` and friends.
Presented at the 8th UK Stata User Meeting, 20-21 May, 2002.
Statisticians make their living mostly by producing confidence intervals and
*P*-values. However, the ones supplied in the Stata log are not in any fit state to be
delivered to the end user, who usually at least wants them tabulated and formatted,
and may appreciate them even more if they are plotted on a graph for immediate impact.
The `parmest` package was developed to make this easy, and consists of two programs.
These are `parmest`, which converts the latest estimation results to a data set
with one observation per estimated parameter and data on confidence intervals, *P*-values
and other estimation results,
and `parmby`, a "quasi-byable" front end to `parmest`, which is like
`statsby`, but creates a data set with one observation per parameter per by-group
instead of a data set with one observation per by-group.
The `parmest` package can be used together with a team of other Stata programs
to produce a wide range of tables and plots of confidence intervals and *P*-values.
The programs `descsave` and `factext` can be used with `parmby`
to create plots of confidence intervals against values of a categorical factor included in the
fitted model, using dummy variables produced by `xi` or `tabulate`.
The user may easily fit multiple models, produce a `parmby` output data set
for each one, and concatenate these output data sets using the program `dsconcat`
to produce a combined data set, which can then be used to produce tables or plots
involving parameters from all the models. For instance, the user might tabulate or
plot unadjusted and adjusted regression parameters side by side, together with their
confidence limits and/or *P*-values. The `parmest` team is particularly useful
when dealing with large volumes of results derived from multiple multi-parameter
models, which are particularly common in the world of epidemiology.
This version of the presentation is a post-publication update, made in response to changes
in the `parmest` package suggested by Bill Gould of StataCorp after seeing the original presentation.
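
The "one observation per parameter per by-group" layout can be sketched in a few lines of Python. This is only an illustration of the data structure, not of `parmest` or `parmby` themselves: the function name, the field names and the estimates are assumptions, and the limits and *P*-values use the standard Normal distribution rather than whatever distribution a given estimation command would use.

```python
from statistics import NormalDist


def parameter_rows(fits, level=0.95):
    """Flatten {by_group: {parm: (estimate, stderr)}} into one row per
    parameter per by-group, adding Normal-based confidence limits and a
    two-sided P-value for the hypothesis that the parameter is zero."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2.0)
    rows = []
    for group, params in fits.items():
        for parm, (est, se) in params.items():
            z = est / se
            rows.append({
                "group": group,
                "parm": parm,
                "estimate": est,
                "stderr": se,
                "z": z,
                "p": 2.0 * (1.0 - std_normal.cdf(abs(z))),
                "lower": est - z_crit * se,
                "upper": est + z_crit * se,
            })
    return rows


# Invented estimates and standard errors for two by-groups.
fits = {
    "domestic": {"weight": (-0.0060, 0.0006), "_cons": (39.6, 2.0)},
    "foreign": {"weight": (-0.0104, 0.0016), "_cons": (48.9, 3.9)},
}

rows = parameter_rows(fits)
for row in rows:
    print(row["group"], row["parm"], round(row["lower"], 4), round(row["upper"], 4))
```

Because every row carries its own group and parameter labels, rows from several models can simply be appended to one another before tabulating or plotting.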

Download presentation

Splines are traditionally used to model non-linear relationships
involving continuous predictors, usually confounders. One example is in asthma
epidemiology, where splines are used to model a seasonal and longer-term time trend in
asthma-related hospital admissions,
which must be eliminated in a search for shorter-term epidemics caused by pollution
episodes. Usually, the spline is included in a regression model by defining a basis
of splines, and including this basis amongst the *X*-variates, together with the
predictors of interest. The basis is typically a plus-function basis, a truncated-power
basis, or a Schoenberg *B*-spline basis. With any of these options, the parameters
estimated by the regression model will not be easy to explain in words to non-mathematicians.
An STB insert (sg151 in STB-57) presented two programs for generating spline bases. One of
these (`bspline`) generates Schoenberg *B*-splines. The other program (`frencurv`,
short for "French curve")
generates an alternative spline basis, whose parameters are simply values of the spline
at reference points along the horizontal axis. In the example from asthma epidemiology,
these parameters might be the expected hospital admissions counts on the first day of
each month, in the absence of a pollution episode. The expected pollution-free
admissions counts on other days of the month are interpolated between the parameters,
using the spline. These parameters can be presented, with their confidence limits,
to non-technical people. Confidence limits can also be computed for differences and/or
ratios between expected values at different reference points, using `lincom`.
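
The idea of a basis whose parameters are the spline's values at reference points can be shown in its simplest case with a Python sketch. This covers only a piecewise-linear (degree 1) spline and is not the `frencurv` algorithm, which handles splines of general degree; the function names, the reference points and the admissions counts are assumptions invented for the example, as is the choice to hold the spline constant outside the reference range.

```python
def reference_basis(x, refpts):
    """Row of a degree-1 reference-spline basis evaluated at x.

    There is one column per reference point. Column j equals 1 at refpts[j],
    0 at every other reference point, and is linear in between. Outside the
    range of refpts the nearest end column is held at 1 (an assumption).
    refpts must be sorted in strictly increasing order.
    """
    row = [0.0] * len(refpts)
    if x <= refpts[0]:
        row[0] = 1.0
        return row
    if x >= refpts[-1]:
        row[-1] = 1.0
        return row
    for j in range(len(refpts) - 1):
        left, right = refpts[j], refpts[j + 1]
        if left <= x <= right:
            w = (x - left) / (right - left)
            row[j] = 1.0 - w
            row[j + 1] = w
            break
    return row


def spline_value(x, refpts, values):
    """Spline at x, given its values at the reference points."""
    return sum(b * v for b, v in zip(reference_basis(x, refpts), values))


# Reference points: day of year for the first day of three months.
refpts = [1, 32, 60]
# Invented expected admissions counts on those days.
values = [40.0, 50.0, 30.0]

print(reference_basis(32, refpts))
print(spline_value(16.5, refpts, values))
```

If the columns returned by `reference_basis` are used as the *X*-variates of a regression model with no separate constant term, each fitted coefficient is the expected value at one reference point, which is what makes the parameters easy to present.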

Download presentation

So-called "non-parametric" methods are in fact based on population
parameters, which are zero under the null hypothesis.
Two of these parameters are Kendall's tau-a and Somers' *D*,
both of which measure ordinal correlation between two variables
*X* and *Y*. If *X* is a binary variable,
then Somers' *D(Y|X)* is the parameter tested by a Wilcoxon rank-sum test.
It is more informative to have
confidence limits for these parameters than *P*-values alone,
for three main reasons. First, it *might* discourage people from
arguing that a high *P*-value proves a null hypothesis.
Second, for continuous
data, Kendall's tau-a is often related to the classical Pearson
correlation by Greiner's relation,
so we can use Kendall's tau-a to
define robust confidence limits for Pearson's correlation.
Third, we might want to know confidence limits for differences
between two Kendall's tau-a or Somers' *D* parameters,
because a larger Kendall's tau-a or Somers' *D*
cannot be secondary to a smaller one.
The program `somersd` calculates confidence intervals for Somers' *D* or Kendall's tau-a,
using jackknife variances. There is a choice of transformations,
including Fisher's *z*, Daniels' arcsine, Greiner's rho, and the
*z*-transform of Greiner's rho. A `cluster` option is available, intended
for measuring intra-class correlation (such as exists between
measurements on pairs of sisters). The estimation results are
saved as for a model fit, so that differences can be estimated using `lincom`.
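
For readers who want to see the parameters themselves, the following Python sketch computes sample versions of Kendall's tau-a and Somers' *D(Y|X)* by direct pair counting, and applies Greiner's relation. It gives point estimates only: the jackknife variances and confidence limits produced by `somersd` are not reproduced, and the function names and data are assumptions invented for the example.

```python
import math
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def tau_a(x, y):
    """Kendall's tau-a: mean of sign(x_i - x_j) * sign(y_i - y_j) over all pairs."""
    pairs = list(combinations(range(len(x)), 2))
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    return total / len(pairs)


def somers_d(y, x):
    """Somers' D(Y|X) = tau_a(X, Y) / tau_a(X, X)."""
    return tau_a(x, y) / tau_a(x, x)


def greiner_rho(tau):
    """Greiner's relation: the Pearson correlation implied by a given tau-a."""
    return math.sin(math.pi * tau / 2.0)


# Invented data: x is a binary group indicator, y an ordinal outcome.
x = [0, 0, 0, 1, 1, 1]
y = [1, 2, 4, 3, 5, 6]

print(tau_a(x, y), somers_d(y, x), greiner_rho(0.5))
```

With a binary *X*, `somers_d(y, x)` is the difference between the proportion of between-group pairs in which the group 1 value of *Y* is larger and the proportion in which it is smaller, which is the quantity underlying the Wilcoxon rank-sum test.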

Download presentation

Roger B. Newson

Email: [email protected]

Text written: 11 September 2020. (Papers and presentations may have been revised since then.)
