Data pruning best practices

35 Kudos

Overview
Optimizing data retention
Data retention configuration
Figure 1: Days before snapshot deletion, days before certifications are archived, and days before ce...
Figure 2: Snapshot frequency in days, and index history granularity
Perform maintenance task configuration
Figure 3: Setup > tasks > perform maintenance
Identity refresh task configuration
Data archive strategies
Large data cleanup examples
Additional areas to consider
Additional resources

Overview

Original content by Scott.Lindsey and Mike.Hovis

This document discusses the configuration, processes and best practices for pruning old data from your IdentityIQ environment, in order to maintain performance and database disk space allocation.

Optimizing data retention

IdentityIQ should be configured to store the appropriate amount of historical data to optimize database size and overall system performance based on your business requirements for retaining identity and certification data.

The following objects have the most effect on database size and system performance:

Object Type	Description	Impact
Identity Snapshots	Stores a copy of the governance Identity data for historical purposes. These snapshots are created during certification generation, and by using the option “Maintain identity histories” in the Identity Refresh task. The number of Identity Snapshots can affect the time to generate new certifications, the time to render Identities in the UI, and the size of the database. It is recommended that you define an expiration policy for identity snapshots that matches your certification archive policy.	Performance: High Table Space: High
Certifications	As certifications are generated, large amounts of historical data is captured and stored in the certification tables along with identity snapshots, history, and entitlement snapshot data. After the certification cycle is complete and audit reports are generated, the certification data becomes stale. It is recommended that you keep at least one previous cycle of data to use during the next certification generation, for comparison purposes. You may choose to keep live certification data in the system for one or more complete certification cycles to provide historical data for current certifications. Older certifications can be archived and deleted to improve certification performance and to reduce disk space.	Performance: Medium Table Space: High
Task Results	Any time a task is executed, a task result is saved. Task results include report results that in many cases consume a large portion of the utilized disk space. Depending on the TaskDefinition configuration, the previous result may be renamed or deleted. In the case of renames to keep historical task results, a large amount of task result data can accumulate, especially for scheduled daily or hourly tasks. It may be important to keep this task data for some period of time for debugging and monitoring purposes. Depending on the business and operational requirements task results should be cleaned up as often as necessary to reduce disk space.	Performance: Low Table Space: Medium
Identities	Identities accumulate over time as new systems are aggregated into IdentityIQ and users complete their life cycle at a company. As the number of identities increases, the time to search, refresh and maintain these identities increases. This document won’t discuss strategies for optimizing the execution of these daily tasks but will focus on pruning the dead weight from your identity warehouse.	Performance: Medium Table Space: Low
Requests	Each time an email is sent, a new request object is created. For large certification cycles or daily-automated tasks, the request queue can become very large. Request objects are also created for future data workflow events and other object expiration events. Although the completed requests have no effect on system performance they do take disk space.	Performance: Low Table Space: Low

Data retention configuration

For each of the object types it is important to define the number of days before the objects will be deleted or archived to optimize the amount of data retained, maintain system performance, and control the size of the data footprint. The system can also be configured to capture more or less granular data based on your business requirements. Figure 1 and Figure 2 below show the settings available from the System Setup > IdentityIQ Configuration section. Refer to the product documentation for specific explanations.

Each setting is applied when the task listed is executed
A value of 0 indicates never to remove objects
Deleted objects cannot be recovered

WARNING: Enabling these options can result in deleted data

Setting and Type	Recommended Retention Policies	Task Configuration
Days before snapshot deletion	At a minimum this should be set to match your shortest certification cycle and at a maximum it should match your certification archive days.	Perform Maintenance Option: Prune Identity Snapshots
Days before task result deletion	This should be set to match the longest schedule for tasks where you need to see previous results. For some customers this is every 60 days, where their longest schedule of interest is every 30 days. This option applies to Task Results and Report Results.	Perform Maintenance Option: Prune Task Results
Days before certifications are archived	When this option is enabled (not set to 0) completed certifications are serialized into a single Character Large Object (CLOB) and persisted into the spt_certification_archive table. The certification objects are then deleted from the live tables and no longer appear in the UI or are available for reporting. Archived certification objects should be considered as permanently removed but not yet deleted. A recommended practice is to archive your certifications that are older than the last 2 cycles. Certification data should be extracted into a data warehouse or report before archiving.	Perform Maintenance Option: Archive and Prune Certifications
Days before certification archive deletion	To complete the cleanup of the old certifications this option will cause the archive object to be completely deleted. Use this option with caution, as the data is not recoverable.	Perform Maintenance Option: Archive and Prune Certifications
Snapshot Frequency in Days	In general, this should be set to the highest acceptable value, in order to reduce the amount of snapshot data being stored. This option is used only if the Maintain Identity history option is enabled in the Identity Refresh task.	Refresh Identity Cube: Maintain identity histories
Index History Granularity	This determines the increments at which to store history. For example, if the Identity history is set to ‘Week’, snapshots are preserved on a weekly basis. This option should be set to the largest granular window that is acceptable to the user, to reduce the number of updates to the snapshot data.	Refresh Identity Cube: Maintain identity histories

Figure 1: Days before snapshot deletion, days before certifications are archived, and days before certification archive deletion

Figure 2: Snapshot frequency in days, and index history granularity

Perform maintenance task configuration

The Perform Maintenance task controls which objects are pruned, and when. It is recommended that you configure a dedicated task that executes on a separate cycle to prune and archive objects. You may choose to have a task for each object type for example.

WARNING: Enabling the prune options will result in deleted data

Create a dedicated “Object Pruning and Archive” task definition
1. Decide if you need one for each type
Schedule the task to run frequently enough to keep pace with data creation
1. Snapshots and Task Results: weekly or daily
2. Certifications: monthly or longer
The first execution may take many days to run, depending on existing historical data
Prune requests should be run separately based on a set schedule, such as every 60 or 90 days.
1. This option deletes all completed requests, not based on age of request.

Figure 3: Setup > tasks > perform maintenance

Identity refresh task configuration

The Identity Refresh task has options which control the amount of historical data retained for identities. For example, you can configure the Identity Refresh Task to maintain identity histories - this creates a snapshot of each identity that is refreshed when this task is run.

You can control the amount of data retained by this task in several ways:

Only enable the "Maintain identity histories" option if your business requires it.
If you want to maintain histories but don't want to create a snapshot every time an identity is refreshed (if for example you want to capture history information less frequently), you can set up a dedicated Identity Refresh task that is specifically for capturing snapshots, and is separate from your main Identity Refresh task
You can configure filters to only snapshot specific Identities, using the "Optional filter string..." and "Optional list or group of populations..." options in the task configuration, to select a specific identity or set of identities to snapshot.

Data archive strategies

There are several strategies for preserving the certification, identity, and snapshot data for historical purposes. Choose a strategy that best fits your operating and business needs.

Certification Archive Reports
- Generate certification archive reports and save these exported data sets outside of IdentityIQ for long term retention
Certification Data Warehouse
- Export all certification data into a Data Warehouse. The Data Export task is a good example of a custom database process to sync the data.
Offline IdentityIQ table archives
- Database sync scripts to an offline archive of the tables
Full Database Backup
- Database backups are used to restore old data in extreme cases

Large data cleanup examples

The following SQL and HQL (for querying only) examples can be used as methods to remove large data sets directly at the database layer reducing the overall time to prune old data. Note that these are samples and may need adjusting based on your database type and IdentityIQ version. Any reference to time fields must be made in milliseconds since epoch; these can be generated by adding three digits for milliseconds to results generated from an epoch time converter, such as http://www.epochconverter.com/

The console can be used to run SQL or HQL queries.

SQL queries are run using:

sql “select count(*) from spt_identity”
sql “select max(created) from spt_identity_history”

HQL queries are run using HQL. Note that returned time fields will be translated, but for queries, epoch time must be used. For indication of the names of various columns, refer to the hbm.xml files within the identityiq.jar file:

hql “select count(*) from Identity”
hql “select max(created) from IdentityHistory”

For SQL, you must know which tables are involved (such as spt_identity for Identity). For HQL, all you need to know is the name of the type of object (such as Identity). Here is a list of the table names involved with the largest datasets:

spt_identity_archive
spt_identity_snapshot
spt_certification_archive
spt_identity_history
spt_identity_history_item

SQL Queries (note that timestamps are in epoch format and must be converted for queries):

delete from spt_identity_history_item where created < <predetermined max age of retention>
delete from spt_identity_history where id not in (select identity_history_id from spt_identity_history_item)
delete from spt_identity_snapshot where id not in (select snapshot_id from spt_certification_entity) and created < <predetermined max age of retention>”
delete from spt_identity_archive where created < <predetermined max age of retention>

Additional areas to consider

Audit Data – Audit data accumulates in the system indefinitely. There is no task or option to prune this data. Depending on your data retention requirements, you can create a custom task to prune the data, or manually prune this table using a simple SQL query to archive and delete the old records every X days.
IdentityRequest – In the Perform Identity Request Maintenance task, the Max age (in days) for Identity Request objects setting controls how long request objects are kept in the database. Leaving this set to 0 tells IdentityIQ to keep these objects indefinitely. You can set this to a different value according to your business needs.
Syslog – Log4j messages at the error or warning level are stored in the syslog file as well as in the log files on the respective host. If you don't want to retain these messages in the syslog file, you can change this setting in your IdentityIQ settings. Go to the Gear menu >Global Settings > IdentityIQ Configuration > Miscellaneous tab to set the level at which syslog events will be stored, and when they will be deleted (0 days = never delete).

Additional resources

Perform Maintenance (Housekeeper) Task explained

tonmay_bhattacharjee · ‎Apr 18, 2017

Great! Exactly what I was looking for!

bill_qian · ‎Oct 10, 2017

Awesome, this helps a lot!

BTW, I know the configuration on "Other Object Expiration" items and others are based on the business requirement. But sometimes, we can't get the business requirement for that. Do we have any best practice on that?

Thanks,

Bill

AnilMadduri · ‎Dec 18, 2018

Hi All,

Where we can configure the Prune Requests age.? Currently in my environment there are only 7 days of requests are holding

jennifer_mitchell · ‎Dec 18, 2018

Identity Requests are pruned by the Perform Identity Request Maintenance task. You can specify how long to retain them in that task.

If you have follow up questions, please ask them in the IdentityIQ Forums. You can put a link back to this article if you want, but more people follow the forums and you are likely to get more answers/participation there than here (plus it muddies the articles to have long threads of Q/A that are not specifically about clarifying doc content).

martin_dehn · ‎Feb 15, 2021

Hi,

great article. It was provided before sailpoint.object.ProvisioningTransaction have been intriduced to IdentityIQ. They contain good data but have a really high impact on database size and possibly performance. I think this should be added here.

Regards

Martin

mathieug · ‎Jul 25, 2022

Agreed with @martin_dehn, ProvisioningTransaction should be added in this documentation.

suresh_duraisamy · ‎Apr 07, 2023

Very good article.

ROBERTGUI · ‎Mar 06, 2024

Thanks !

Data pruning best practices