Data pruning best practices
- Overview
- Optimizing data retention
- Data retention configuration
- Figure 1: Days before snapshot deletion, days before certifications are archived, and days before ce...
- Figure 2: Snapshot frequency in days, and index history granularity
- Perform maintenance task configuration
- Figure 3: Setup > tasks > perform maintenance
- Identity refresh task configuration
- Data archive strategies
- Large data cleanup examples
- Additional areas to consider
- Additional resources
Overview
Original content by Scott.Lindsey and Mike.Hovis
This document discusses the configuration, processes and best practices for pruning old data from your IdentityIQ environment, in order to maintain performance and database disk space allocation.
Optimizing data retention
IdentityIQ should be configured to store the appropriate amount of historical data to optimize database size and overall system performance based on your business requirements for retaining identity and certification data.
The following objects have the most effect on database size and system performance:
Object Type | Description | Impact |
---|---|---|
Identity Snapshots | Stores a copy of the governance Identity data for historical purposes. These snapshots are created during certification generation, and by using the option “Maintain identity histories” in the Identity Refresh task. The number of Identity Snapshots can affect the time to generate new certifications, the time to render Identities in the UI, and the size of the database. It is recommended that you define an expiration policy for identity snapshots that matches your certification archive policy. |
Performance: High Table Space: High |
Certifications | As certifications are generated, large amounts of historical data is captured and stored in the certification tables along with identity snapshots, history, and entitlement snapshot data. After the certification cycle is complete and audit reports are generated, the certification data becomes stale. It is recommended that you keep at least one previous cycle of data to use during the next certification generation, for comparison purposes. You may choose to keep live certification data in the system for one or more complete certification cycles to provide historical data for current certifications. Older certifications can be archived and deleted to improve certification performance and to reduce disk space. |
Performance: Medium Table Space: High |
Task Results | Any time a task is executed, a task result is saved. Task results include report results that in many cases consume a large portion of the utilized disk space. Depending on the TaskDefinition configuration, the previous result may be renamed or deleted. In the case of renames to keep historical task results, a large amount of task result data can accumulate, especially for scheduled daily or hourly tasks. It may be important to keep this task data for some period of time for debugging and monitoring purposes. Depending on the business and operational requirements task results should be cleaned up as often as necessary to reduce disk space. |
Performance: Low Table Space: Medium |
Identities | Identities accumulate over time as new systems are aggregated into IdentityIQ and users complete their life cycle at a company. As the number of identities increases, the time to search, refresh and maintain these identities increases. This document won’t discuss strategies for optimizing the execution of these daily tasks but will focus on pruning the dead weight from your identity warehouse. |
Performance: Medium Table Space: Low |
Requests | Each time an email is sent, a new request object is created. For large certification cycles or daily-automated tasks, the request queue can become very large. Request objects are also created for future data workflow events and other object expiration events. Although the completed requests have no effect on system performance they do take disk space. |
Performance: Low Table Space: Low |
Data retention configuration
For each of the object types it is important to define the number of days before the objects will be deleted or archived to optimize the amount of data retained, maintain system performance, and control the size of the data footprint. The system can also be configured to capture more or less granular data based on your business requirements. Figure 1 and Figure 2 below show the settings available from the System Setup > IdentityIQ Configuration section. Refer to the product documentation for specific explanations.
- Each setting is applied when the task listed is executed
- A value of 0 indicates never to remove objects
- Deleted objects cannot be recovered
WARNING: Enabling these options can result in deleted data
Setting and Type | Recommended Retention Policies | Task Configuration |
---|---|---|
Days before snapshot deletion | At a minimum this should be set to match your shortest certification cycle and at a maximum it should match your certification archive days. | Perform Maintenance Option: Prune Identity Snapshots |
Days before task result deletion | This should be set to match the longest schedule for tasks where you need to see previous results. For some customers this is every 60 days, where their longest schedule of interest is every 30 days. This option applies to Task Results and Report Results. | Perform Maintenance Option: Prune Task Results |
Days before certifications are archived | When this option is enabled (not set to 0) completed certifications are serialized into a single Character Large Object (CLOB) and persisted into the spt_certification_archive table. The certification objects are then deleted from the live tables and no longer appear in the UI or are available for reporting. Archived certification objects should be considered as permanently removed but not yet deleted. A recommended practice is to archive your certifications that are older than the last 2 cycles. Certification data should be extracted into a data warehouse or report before archiving. | Perform Maintenance Option: Archive and Prune Certifications |
Days before certification archive deletion | To complete the cleanup of the old certifications this option will cause the archive object to be completely deleted. Use this option with caution, as the data is not recoverable. | Perform Maintenance Option: Archive and Prune Certifications |
Snapshot Frequency in Days | In general, this should be set to the highest acceptable value, in order to reduce the amount of snapshot data being stored. This option is used only if the Maintain Identity history option is enabled in the Identity Refresh task. | Refresh Identity Cube: Maintain identity histories |
Index History Granularity |
This determines the increments at which to store history. For example, if the Identity history is set to ‘Week’, snapshots are preserved on a weekly basis. This option should be set to the largest granular window that is acceptable to the user, to reduce the number of updates to the snapshot data. |
Refresh Identity Cube: Maintain identity histories |
Figure 1: Days before snapshot deletion, days before certifications are archived, and days before certification archive deletion
Figure 2: Snapshot frequency in days, and index history granularity
Perform maintenance task configuration
The Perform Maintenance task controls which objects are pruned, and when. It is recommended that you configure a dedicated task that executes on a separate cycle to prune and archive objects. You may choose to have a task for each object type for example.
WARNING: Enabling the prune options will result in deleted data
- Create a dedicated “Object Pruning and Archive” task definition
- Decide if you need one for each type
- Schedule the task to run frequently enough to keep pace with data creation
- Snapshots and Task Results: weekly or daily
- Certifications: monthly or longer
- The first execution may take many days to run, depending on existing historical data
- Prune requests should be run separately based on a set schedule, such as every 60 or 90 days.
- This option deletes all completed requests, not based on age of request.
Figure 3: Setup > tasks > perform maintenance
Identity refresh task configuration
The Identity Refresh task has options which control the amount of historical data retained for identities. For example, you can configure the Identity Refresh Task to maintain identity histories - this creates a snapshot of each identity that is refreshed when this task is run.
You can control the amount of data retained by this task in several ways:
- Only enable the "Maintain identity histories" option if your business requires it.
- If you want to maintain histories but don't want to create a snapshot every time an identity is refreshed (if for example you want to capture history information less frequently), you can set up a dedicated Identity Refresh task that is specifically for capturing snapshots, and is separate from your main Identity Refresh task
- You can configure filters to only snapshot specific Identities, using the "Optional filter string..." and "Optional list or group of populations..." options in the task configuration, to select a specific identity or set of identities to snapshot.
Data archive strategies
There are several strategies for preserving the certification, identity, and snapshot data for historical purposes. Choose a strategy that best fits your operating and business needs.
- Certification Archive Reports
- Generate certification archive reports and save these exported data sets outside of IdentityIQ for long term retention
- Certification Data Warehouse
- Export all certification data into a Data Warehouse. The Data Export task is a good example of a custom database process to sync the data.
- Offline IdentityIQ table archives
- Database sync scripts to an offline archive of the tables
- Full Database Backup
- Database backups are used to restore old data in extreme cases
Large data cleanup examples
The following SQL and HQL (for querying only) examples can be used as methods to remove large data sets directly at the database layer reducing the overall time to prune old data. Note that these are samples and may need adjusting based on your database type and IdentityIQ version. Any reference to time fields must be made in milliseconds since epoch; these can be generated by adding three digits for milliseconds to results generated from an epoch time converter, such as http://www.epochconverter.com/
The console can be used to run SQL or HQL queries.
SQL queries are run using:
- sql “select count(*) from spt_identity”
- sql “select max(created) from spt_identity_history”
HQL queries are run using HQL. Note that returned time fields will be translated, but for queries, epoch time must be used. For indication of the names of various columns, refer to the hbm.xml files within the identityiq.jar file:
- hql “select count(*) from Identity”
- hql “select max(created) from IdentityHistory”
For SQL, you must know which tables are involved (such as spt_identity for Identity). For HQL, all you need to know is the name of the type of object (such as Identity). Here is a list of the table names involved with the largest datasets:
- spt_identity_archive
- spt_identity_snapshot
- spt_certification_archive
- spt_identity_history
- spt_identity_history_item
SQL Queries (note that timestamps are in epoch format and must be converted for queries):
- delete from spt_identity_history_item where created < <predetermined max age of retention>
- delete from spt_identity_history where id not in (select identity_history_id from spt_identity_history_item)
- delete from spt_identity_snapshot where id not in (select snapshot_id from spt_certification_entity) and created < <predetermined max age of retention>”
- delete from spt_identity_archive where created < <predetermined max age of retention>
Additional areas to consider
- Audit Data – Audit data accumulates in the system indefinitely. There is no task or option to prune this data. Depending on your data retention requirements, you can create a custom task to prune the data, or manually prune this table using a simple SQL query to archive and delete the old records every X days.
- IdentityRequest – In the Perform Identity Request Maintenance task, the Max age (in days) for Identity Request objects setting controls how long request objects are kept in the database. Leaving this set to 0 tells IdentityIQ to keep these objects indefinitely. You can set this to a different value according to your business needs.
- Syslog – Log4j messages at the error or warning level are stored in the syslog file as well as in the log files on the respective host. If you don't want to retain these messages in the syslog file, you can change this setting in your IdentityIQ settings. Go to the Gear menu >Global Settings > IdentityIQ Configuration > Miscellaneous tab to set the level at which syslog events will be stored, and when they will be deleted (0 days = never delete).
Additional resources
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Great! Exactly what I was looking for!
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Awesome, this helps a lot!
BTW, I know the configuration on "Other Object Expiration" items and others are based on the business requirement. But sometimes, we can't get the business requirement for that. Do we have any best practice on that?
Thanks,
Bill
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Hi All,
Where we can configure the Prune Requests age.? Currently in my environment there are only 7 days of requests are holding
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Identity Requests are pruned by the Perform Identity Request Maintenance task. You can specify how long to retain them in that task.
If you have follow up questions, please ask them in the IdentityIQ Forums. You can put a link back to this article if you want, but more people follow the forums and you are likely to get more answers/participation there than here (plus it muddies the articles to have long threads of Q/A that are not specifically about clarifying doc content).
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Hi,
great article. It was provided before sailpoint.object.ProvisioningTransaction have been intriduced to IdentityIQ. They contain good data but have a really high impact on database size and possibly performance. I think this should be added here.
Regards
Martin
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Agreed with @martin_dehn, ProvisioningTransaction should be added in this documentation.
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Very good article.
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Content to Moderator
Thanks !