One of the uses of a hazard curve is as a sanity check on your data and the technicalities of the default definition.
If you regularly find yourself analysing millions of records, you will know that every conceivable weird and wobbly data bug will happen, as well as a few that could never have been conceived of. Recalling a typical example from a loan portfolio: there were 10,000 accounts that opened and closed on the same day. This, not surprisingly, was some artefact of how the data systems coped with proposals or quotes (or something), but in reality these accounts were NTU (not taken up) and there was never any exposure to risk in respect of them. But, in amongst a quarter of a million accounts, it would be possible to miss their presence and to do some default analytics – and even some model building – with these accounts included as “closed good” accounts.
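A quick programmatic filter catches this class of artefact before it pollutes the analysis set. A minimal sketch in Python, where the record layout and field names are entirely hypothetical:

```python
from datetime import date

# Hypothetical minimal records: (account_id, open_date, close_date),
# with close_date=None for accounts still open.
accounts = [
    ("A001", date(2010, 3, 1), date(2010, 3, 1)),    # opened and closed same day
    ("A002", date(2010, 3, 1), date(2011, 6, 30)),
    ("A003", date(2010, 4, 15), None),
]

# Same-day open/close means the account was NTU and never exposed to risk;
# keep these out of the analysis set rather than counting them "closed good".
ntu = [a for a in accounts if a[2] is not None and a[1] == a[2]]
clean = [a for a in accounts if not (a[2] is not None and a[1] == a[2])]

print(len(ntu), len(clean))  # prints: 1 2
```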
<digress for a war story> One of those accounts even managed to span two months! It appeared in two consecutive calendar-month snapshot datasets – somehow allowed by time zone differences and the exact timing of month-end processing. A casual analysis might have assumed that this represented two months of exposure to risk – see also the comments about time grains. <end digression>
But coming to the point of this post, I have found that estimating the default and churn hazard is an excellent “sanity check” on the data that will quickly show up most issues that you would want to know about. The issue mentioned above showed up as a massive spike in the churn hazard at MOB=1.
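The empirical hazard itself is simple to compute from account lifetimes. A sketch, assuming hypothetical inputs: the MOB at closure for each closed account, and the current MOB for each still-open (censored) account. The numbers below are illustrative only, rigged to show the MOB=1 spike described above:

```python
from collections import Counter

def churn_hazard(closure_mob, censored_mob, max_mob):
    """Empirical churn hazard by month-on-book (MOB).

    closure_mob  - MOB at closure for each closed account
    censored_mob - current MOB for each still-open (censored) account
    Returns {mob: closures_at_mob / accounts_at_risk_at_mob}.
    """
    closures = Counter(closure_mob)
    hazard = {}
    for t in range(1, max_mob + 1):
        # At risk at MOB t: every account that survived to at least MOB t.
        at_risk = (sum(1 for m in closure_mob if m >= t)
                   + sum(1 for m in censored_mob if m >= t))
        hazard[t] = closures[t] / at_risk if at_risk else 0.0
    return hazard

# A book contaminated by a wave of same-day-closure artefacts shows up
# as an outsized hazard at MOB=1 (illustrative numbers only).
closed = [1] * 100 + [6] * 5 + [12] * 4
still_open = [24] * 900
h = churn_hazard(closed, still_open, 12)
print(round(h[1], 3))  # ~0.099, far above the rest of the curve
```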
Other features that might be noticeable in churn hazard curves are peaks of churn around key account ages, such as at MOB=6 if the product has a teaser rate for the first 6 months. Peaks at multiples of 12 MOB may also occur in certain pay-annual-interest-in-advance types of product. These are features that one might be on the lookout for, so finding them would be “reassuring” feedback rather than “alerting” feedback.
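One way to turn this into automated feedback is to separate spikes at expected account ages from the rest. A rough sketch, in which the set of expected MOBs, the median baseline and the spike threshold are all assumptions chosen for illustration:

```python
from statistics import median

def classify_spikes(hazard, expected_mobs, factor=5.0):
    """Split hazard spikes into 'reassuring' (at expected MOBs, e.g. the
    end of a 6-month teaser rate, or multiples of 12 for annual-in-advance
    products) and 'alerting' (everything else).

    A spike is crudely defined as hazard > factor * median hazard;
    both the factor and the baseline are illustrative choices.
    """
    base = median(hazard.values())
    reassuring, alerting = [], []
    for t, h in sorted(hazard.items()):
        if h > factor * base:
            (reassuring if t in expected_mobs else alerting).append(t)
    return reassuring, alerting

# Illustrative churn hazard with bumps at MOB=6 (teaser rate ends)
# and MOB=12 (annual product anniversary).
hz = {1: 0.002, 2: 0.002, 3: 0.003, 4: 0.002, 5: 0.002,
      6: 0.030, 7: 0.002, 8: 0.002, 9: 0.002, 10: 0.002,
      11: 0.002, 12: 0.025}
ok, alarm = classify_spikes(hz, expected_mobs={6, 12})
print(ok, alarm)  # prints: [6, 12] []
```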
Sanity checking is not only noticing what you didn’t expect, but also confirming what you did expect.
Features found in the default hazard curves may give important feedback about the way the default definition works. For example, with a 90DPD definition one may expect zero hazard for MOB=1, 2, 3, but there may in fact be genuine defaults in that zone triggered by supplementary business rules. However, what can happen is that the totality of rules in the default definition doesn’t quite produce the desired effect in practice. One example I recall caused the year-in-advance loans to be reflected as defaults after only 30DPD. This showed up as spikes at MOB=12, 24 and 36 and caused a review of the default definition as applied to this (relatively small) portion of the loan book.
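Such expectations can be coded up as explicit checks on the default hazard curve. A hedged sketch, where the tolerance and the spike threshold are illustrative assumptions rather than fixed rules:

```python
from statistics import median

def default_definition_checks(default_hazard, tol=1e-4, spike_factor=5.0):
    """Sanity checks on a default hazard curve keyed by MOB.

    Under a plain 90DPD definition the hazard should be (near) zero at
    MOB=1,2,3; non-zero values there point at supplementary business
    rules or a definition bug. Spikes at multiples of 12 may indicate
    that annual-in-advance loans are being flagged as default too early.
    """
    alerts = []
    for t in (1, 2, 3):
        if default_hazard.get(t, 0.0) > tol:
            alerts.append(f"non-zero default hazard at MOB={t} under 90DPD")
    base = median(default_hazard.values())
    for t, h in sorted(default_hazard.items()):
        if t % 12 == 0 and h > spike_factor * base:
            alerts.append(f"spike at MOB={t}: review annual-in-advance rules")
    return alerts

# Illustrative curve: clean early months, but anniversary spikes.
hz = {t: 0.001 for t in range(1, 25)}
hz.update({1: 0.0, 2: 0.0, 3: 0.0, 12: 0.012, 24: 0.011})
for a in default_definition_checks(hz):
    print(a)
```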
The data cleaning and sanity checking stage is helped by having some experience in similar analyses on similar products. But even in a completely new context, some data wobblies will produce such an unnatural effect on the hazard curve that you will be immediately alerted to follow up.
Hazard curves, being longitudinal, only help you examine default tendencies that relate to MOB. Cross-sectional effects, such as a sudden worsening in credit conditions in the economy, would be monitored in other ways.