the loan at origination, including the names of the lender and servicer at origination, the
interest rate, the term, and the principal balance. The origination dataset includes all new
loans, including refinancing loans. The monthly performance updates provide information
on each loan's delinquency status, as well as the date on which a loan is terminated. Termination
can occur when a loan is paid off through a prepayment, the end of the term, or a refinancing.
In order to study comparable mortgages, we follow the literature and restrict our sample
to 30-year mortgages originated for purchases of primary residence homes. These loans are
fully amortizing and have full documentation. The loan-level origination dataset further
provides interest rates, FICO scores, LTVs, and DTIs, as well as loan size, type, purpose,
and location. It also identifies the originator that sold the loan to the GSE in cases where
the originator had sufficiently high origination market share in the reporting period.
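As an illustration only, the sample restriction described above could be sketched as follows. The record layout, field names, and category codes are hypothetical and do not reflect the actual GSE file layout; the term filter assumes that a 30-year mortgage is recorded as 360 months.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    loan_id: str
    term_months: int
    purpose: str    # hypothetical codes: "purchase", "refinance"
    occupancy: str  # hypothetical codes: "primary", "second", "investor"


def in_sample(loan: Loan) -> bool:
    """Keep 30-year purchase mortgages on primary residences."""
    return (
        loan.term_months == 360
        and loan.purpose == "purchase"
        and loan.occupancy == "primary"
    )


# Toy records, invented for illustration.
loans = [
    Loan("A", 360, "purchase", "primary"),
    Loan("B", 180, "purchase", "primary"),    # dropped: 15-year term
    Loan("C", 360, "refinance", "primary"),   # dropped: not a purchase loan
    Loan("D", 360, "purchase", "investor"),   # dropped: not a primary residence
]

sample = [loan for loan in loans if in_sample(loan)]
```

In this toy example only loan A survives the three filters.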
HMDA: HMDA is a mortgage-application-level dataset covering the near
universe of U.S. mortgage applications and originations. This data, used extensively in the
literature, includes lender identification, application outcome, and loan type, purpose, size,
year of origination, and location at the census tract level.
Since 2018, HMDA has also recorded several variables that are key to our
analysis: loan interest rate, non-interest-rate charges (including origination charges, discount
points, and lender credits), loan-to-value (LTV) ratio, and debt-to-income (DTI) ratio.
Consequently, while our main sample includes loans since 2005, we run several tests that include
these additional variables in the 2018–2021 sample.
Merging procedure: There is no direct crosswalk linking the GSE data to the HMDA
data. In addition to the two underlying data sources, we create a new merged dataset that
links refinanced mortgages to the new loans that replace them. To create this link, we first
match GSE loans to their HMDA counterparts. The match is based on year, zip code, loan amount,
loan purpose, occupancy type, and the original lender name. Then, among these matched loans, we
take the subsample that exits through a refinancing in the month in which the loan is recorded
as leaving the dataset. We forward match these loans to executed refinancings in the HMDA dataset.
Importantly, the substantial information regarding borrowers available in the HMDA dataset
facilitates this match. Rather than use loan characteristics, we use borrower characteristics
(year, census tract, race, ethnicity, loan amount) to match refinancings with their new loans.
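The two-stage link can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the procedure actually used: all field names and toy records are hypothetical, matches are exact, and only one-to-one matches are kept because the text gives no tie-breaking rule. Loan amount, which the text lists among the second-stage characteristics, is left out of the second-stage key here because the text does not say how the old and new amounts are compared.

```python
from collections import defaultdict

# Key fields named in the text; the field names themselves are hypothetical.
STAGE1_KEYS = ("year", "zip", "amount", "purpose", "occupancy", "lender")
# Assumption: loan amount is omitted from the second-stage key (see lead-in).
STAGE2_KEYS = ("year", "tract", "race", "ethnicity")


def match_unique(left, right, keys):
    """Pair records that agree on `keys`, keeping only one-to-one matches."""
    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            buckets[tuple(rec[k] for k in keys)].append(rec)
        return buckets

    left_idx, right_idx = index(left), index(right)
    return [
        (recs[0], right_idx[key][0])
        for key, recs in left_idx.items()
        if len(recs) == 1 and len(right_idx.get(key, [])) == 1
    ]


# Toy records, invented for illustration.
gse = [
    {"id": "G1", "year": 2015, "zip": "100", "amount": 200_000,
     "purpose": "purchase", "occupancy": "primary", "lender": "BANK A",
     "exit_reason": "refinance", "exit_year": 2019},
    {"id": "G2", "year": 2016, "zip": "200", "amount": 150_000,
     "purpose": "purchase", "occupancy": "primary", "lender": "BANK C",
     "exit_reason": "payoff", "exit_year": 2020},
]
hmda = [
    {"id": "H1", "year": 2015, "zip": "100", "tract": "T1", "amount": 200_000,
     "purpose": "purchase", "occupancy": "primary", "lender": "BANK A",
     "race": "r1", "ethnicity": "e1"},
    {"id": "H2", "year": 2019, "zip": "100", "tract": "T1", "amount": 190_000,
     "purpose": "refinance", "occupancy": "primary", "lender": "BANK B",
     "race": "r1", "ethnicity": "e1"},
    {"id": "H3", "year": 2016, "zip": "200", "tract": "T2", "amount": 150_000,
     "purpose": "purchase", "occupancy": "primary", "lender": "BANK C",
     "race": "r2", "ethnicity": "e2"},
]

# Stage 1: GSE loans to their HMDA counterparts on loan characteristics.
stage1 = match_unique(gse, hmda, STAGE1_KEYS)

# Stage 2: for matched loans that exit through a refinancing, search HMDA
# refinancings in the exit year using the borrower fields of the HMDA record.
queries = [
    {"old_id": g["id"], "year": g["exit_year"], "tract": h["tract"],
     "race": h["race"], "ethnicity": h["ethnicity"]}
    for g, h in stage1
    if g["exit_reason"] == "refinance"
]
refis = [h for h in hmda if h["purpose"] == "refinance"]
links = [(q["old_id"], r["id"]) for q, r in match_unique(queries, refis, STAGE2_KEYS)]
```

In the toy data both GSE loans find an HMDA counterpart in the first stage, but only G1 exits through a refinancing, so the forward match yields the single link from G1 to H2. Dropping ambiguous keys, as this sketch does, trades coverage for fewer false links; a production match would also need a rule for near-equal amounts and for lender-name spelling differences.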
To summarize, the empirical analysis for this project draws on several datasets: a random
sample from GSE-backed origination and monthly performance data, a random sample of
HMDA applications, and forward-linked refinanced loans.
Electronic copy available at: https://ssrn.com/abstract=4552425