4 duplicates — Report, tag, or drop duplicate observations
Options for duplicates examples and duplicates list
Main
compress, nocompress, fast, abbreviate(#), string(#); see [D] list.
Options
table, display, header, noheader, header(#), clean, divider, separator(#),
sepby(varlist), nolabel; see [D] list.
Summary
mean[(varlist)], sum[(varlist)], N[(varlist)], labvar(varname); see [D] list.
Advanced
constant[(varlist)], notrim, absolute, nodotz, subvarname, linesize(#); see [D] list.
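As a minimal sketch of how these list options pass through, the following builds a small toy dataset (the variable names name and age and all values are hypothetical) and applies two of the options named above to duplicates examples and duplicates list.

```stata
* Toy data (hypothetical): observations 1 and 3 are identical, as are 2 and 5
clear
input str5 name byte age
"Ann" 30
"Bob" 41
"Ann" 30
"Cy"  25
"Bob" 41
end

* One example observation per group of duplicates, in the compact "clean" style
duplicates examples, clean

* Every duplicated observation, with divider lines drawn between columns
duplicates list, divider
```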
Option for duplicates tag
generate(newvar) is required and specifies the name of a new variable that will tag duplicates.
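A short illustration of generate(), assuming a toy dataset with hypothetical variables name and age. The new variable (here called dup, an arbitrary name) records, for each observation, how many other observations duplicate it, so unique observations are tagged 0.

```stata
* Toy data (hypothetical): "Ann" 30 appears three times
clear
input str5 name byte age
"Ann" 30
"Bob" 41
"Ann" 30
"Cy"  25
"Ann" 30
end

* dup = number of surplus copies of each observation (0 if unique)
duplicates tag name age, generate(dup)

* Inspect only the observations that have at least one duplicate
list if dup > 0

* Distribution of the tag across the dataset
tabulate dup
```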
Option for duplicates drop
force specifies that observations duplicated with respect to a named varlist be dropped. The force
option is required when such a varlist is given as a reminder that information may be lost by
dropping observations, given that those observations may differ on any variable not included in
varlist.
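The difference between dropping with and without a varlist can be sketched as follows. The data and variable names are hypothetical; the second command discards an observation that differs on age, which is the loss of information that force asks you to acknowledge.

```stata
* Toy data (hypothetical): two exact copies, plus one "Ann" with a different age
clear
input str5 name byte age
"Ann" 30
"Bob" 41
"Ann" 30
"Ann" 31
"Bob" 41
end

* No varlist: keep the first of each group identical on all variables
duplicates drop

* Varlist given: force is required, because the dropped "Ann" differs on age
duplicates drop name, force

list
```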
Remarks and examples
Current data management and analysis may hinge on detecting (and sometimes dropping) duplicate
observations. In Stata terms, duplicates are observations with identical values, either on all variables if
no varlist is specified or on a specified varlist; that is, 2 or more observations that are identical on all
specified variables form a group of duplicates. When the specified variables are a set of explanatory
variables, such a group is often called a covariate pattern or a covariate class.
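To make the role of the varlist concrete, this sketch runs duplicates report twice on a toy dataset (hypothetical variables name and age): once across all variables and once on name alone, where more observations count as duplicates.

```stata
* Toy data (hypothetical)
clear
input str5 name byte age
"Ann" 30
"Bob" 41
"Ann" 30
"Ann" 31
"Cy"  25
end

* Duplicates on all variables: only the two identical "Ann" 30 observations
duplicates report

* Duplicates on name only: all three "Ann" observations form one group
duplicates report name
```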
Linguistic purists will point out that duplicate observations are strictly only those that occur in
pairs, and they might prefer a more literal term, although the most obvious replacement, “replicates”,
already has another statistical meaning. However, the looser term appears in practice to be much
more frequently used for this purpose and to be as easy to understand.
Observations may occur as duplicates through some error; for example, the same observations
might have been entered more than once into your dataset. In contrast, some researchers deliberately
enter a dataset twice. Each entry is a check on the other, and all observations should occur as identical
pairs, assuming that one or more variables identify unique records. If there is just one copy, or more
than two copies, there has been an error in data entry.
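One way to check a double-entered dataset is sketched below, under the assumption that the two entries have already been appended into one dataset. The variables id and score and the tag name copies are hypothetical. Because duplicates tag counts surplus copies, a correctly double-entered observation is tagged 1, and anything else signals an entry problem.

```stata
* Toy double-entry data (hypothetical): id 1 is correct; id 2 was entered once,
* id 3 three times, and the two entries for id 4 disagree on score
clear
input int id byte score
1 10
1 10
2 12
3 15
3 15
3 15
4 9
4 8
end

* copies = number of other observations identical on all variables
duplicates tag, generate(copies)

* Anything not occurring as an identical pair needs checking
list if copies != 1
```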
Duplicate observations may also arise simply because some observations just happen to be
identical, which is especially likely with categorical variables or large datasets. In this second situation,
consider whether contract, which automatically produces a count of each distinct set of observations,
is more appropriate for your problem. See [D] contract.
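As a brief sketch of the contract alternative, the following collapses a toy dataset (hypothetical variables name and age) to one observation per distinct combination, with the count stored in a variable named n through the freq() option (by default the variable is named _freq).

```stata
* Toy data (hypothetical): coincidental repeats of categorical values
clear
input str5 name byte age
"Ann" 30
"Bob" 41
"Ann" 30
"Cy"  25
"Bob" 41
end

* Replace the data with one row per distinct (name, age) pair plus its frequency
contract name age, freq(n)

list
```

Note that contract replaces the dataset in memory, so save or preserve the original first if it is still needed.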