Harking back to the issue of time granularity, and anticipating some default analytic calculations yet to come, let’s return to the small details and look inside the smallest data unit, i.e. the time grain.

For typical retail products this granularity would be a month, which is the example carried forward in this post.

Monthly data warehoused for analysis purposes would typically be on a calendar month basis. An alternative (for credit cards, perhaps) might be data on a monthly payment cycle basis.

Even though a grain is ‘small’ there is still latitude for vagueness because data recorded against a month may relate in several ways to the time axis within that month:

  • point in time at the beginning of the month
  • point in time in the middle of the month
  • point in time at the end of the month
  • the whole time window comprising that month

For most cross-sectional studies, the time axis is calendar date and the ‘status’ variables like account balance would usually relate to the end of the month, as that would be their most up-to-date value. Other variables that summarise or count transactions (for example) would relate to the whole time window. Certain calculated values (like hazards, which we will meet later) may relate to the mid-point of the month.

In cross-sectional studies there is no difficulty in finding the point-in-time variables as at the beginning of a month, because these will be the (end-of-month) values from the previous month’s record – i.e. closing balance for Feb = opening balance for March etc.
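As a concrete illustration, here is a minimal pandas sketch of that lag trick (the table and column names – account_id, month_key, closing_balance – are invented for the example):

```python
import pandas as pd

# One row per account per calendar month, with the status variable
# (closing_balance) measured at the end of the month.
snap = pd.DataFrame({
    "account_id":      [1, 1, 1],
    "month_key":       ["200802", "200803", "200804"],  # YYYYMM
    "closing_balance": [1000.0, 950.0, 900.0],
})

snap = snap.sort_values(["account_id", "month_key"])
# Opening balance for March = closing balance for February, and so on.
snap["opening_balance"] = snap.groupby("account_id")["closing_balance"].shift(1)
print(snap)
```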

If numeric date values are used as the key on a data table, they would perhaps most logically be set equal to the last day of each month, which is unfortunately a bit messy and harder (for a human) to remember than the obvious choice of the 1st of each month.
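For what it’s worth, computing the month-end key is only a couple of lines in most environments; a Python sketch using the standard library:

```python
import calendar
from datetime import date

def month_end(year: int, month: int) -> date:
    """Last calendar day of the month (28/29/30/31 depending on the month)."""
    return date(year, month, calendar.monthrange(year, month)[1])

def month_start(year: int, month: int) -> date:
    """First calendar day of the month -- the easy-to-remember alternative."""
    return date(year, month, 1)

print(month_end(2008, 2))    # 2008-02-29 (leap year)
print(month_start(2008, 2))  # 2008-02-01
```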

A non-numeric-date month key like “200805” avoids specifying any particular part of the month, and leaves it up to the user to figure the time relationships from the metadata. A slight disadvantage of such a key is that date arithmetic (figuring out the difference between two dates) becomes non-trivial.
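Non-trivial, but not difficult either; a small helper along these lines (the function name is mine) handles the month arithmetic:

```python
def months_between(key_a: str, key_b: str) -> int:
    """Whole months from key_a to key_b, where keys are 'YYYYMM' strings."""
    year_a, month_a = int(key_a[:4]), int(key_a[4:6])
    year_b, month_b = int(key_b[:4]), int(key_b[4:6])
    return (year_b - year_a) * 12 + (month_b - month_a)

print(months_between("200805", "200905"))  # 12
print(months_between("200811", "200902"))  # 3  (note: 200811 + 3 is not 200814)
```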

Longitudinal studies would typically rely on performance data for each individual account that is stored cross-sectionally, i.e. by calendar month. This introduces a slight wrinkle because the account opening date can fall anywhere within a month, whereas the performance data is only available at month ends. So the first performance measurement point an account reaches may come up in only 1-2 days (if the account opened on the 29th or 30th of a month) or alternatively may represent up to 30 days of exposure-to-risk. Longitudinal studies have MOB (months on book) rather than calendar date as their time axis, and this means that the MOB=1 analysis really represents on average about 0.5 months of exposure, and likewise all subsequent MOB points really represent on average half a month less than their nominal value. (This assumes your MOB counting convention starts at 1 rather than 0.) In any case, it would be most representative to start at 0.5 and count upwards as 1.5, 2.5, etc.
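To see how variable that first exposure period is, here is a small sketch (the opening dates are invented) measuring the days from account opening to the first month-end performance point:

```python
import calendar
from datetime import date

def first_month_end(open_date: date) -> date:
    """Month-end of the calendar month in which the account was opened."""
    last_day = calendar.monthrange(open_date.year, open_date.month)[1]
    return date(open_date.year, open_date.month, last_day)

for opened in (date(2008, 5, 2), date(2008, 5, 15), date(2008, 5, 30)):
    exposure_days = (first_month_end(opened) - opened).days
    print(opened, "->", exposure_days, "days of exposure before the first measurement")
```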

The above may sound picky, but it can quite easily come about that one analyst’s 12-month outcome window (OW) is another analyst’s 13-month OW due to choices at this level, and this could make a significant difference to risk measures.

Further intra-grain issues will be met when calculating hazards. This basically means dividing the number of defaults (at a certain MOB) by the number of accounts that were exposed-to-risk of default. In a longitudinal study the number of exposed-to-risk accounts will always be declining, as accounts close in good standing or go into default. Good practice is therefore to use the average number of exposed-to-risk accounts during the month, (month_start + month_end)/2, as the denominator of the hazard.
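A minimal sketch of that convention, with invented counts:

```python
def hazard(defaults: int, exposed_start: int, exposed_end: int) -> float:
    """Monthly hazard = defaults / average number exposed-to-risk during the month."""
    exposed_average = (exposed_start + exposed_end) / 2.0
    return defaults / exposed_average

# e.g. 10,000 accounts exposed at the start of the MOB-12 month, 9,900 at the
# end, with 25 defaults during that month
print(hazard(defaults=25, exposed_start=10_000, exposed_end=9_900))  # ~0.0025
```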

Actuaries are good at these deliberations because of the care and expertise put into estimation of mortality statistics. If you can’t find a tame actuary, the recommended approach is large diagrams on whiteboards and a bottle of headache tablets.