Analysis of Long-Running UBEs

Example Issue
UBE runs for three days and never finishes.
The problem question: WHERE IS THE CODE/SYSTEM SPENDING ITS TIME?
Need a valid profile to answer that question:
- How to get that profile?
- How to “get your arms around the process”?
In JDE.INI, under [DEBUG], set DumpLPDS=0. These settings:
- Reduce log file size; the debug log grows less quickly
- Give a more accurate profile by stripping out material that does not help with performance analysis
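For reference, a minimal sketch of the relevant JDE.INI fragment (only the DumpLPDS entry is taken from the text above; a real [DEBUG] section will contain other site-specific entries):

```ini
[DEBUG]
; Suppress data structure dumps in jdedebug.log, keeping the log smaller
; and easier to profile
DumpLPDS=0
```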
Approaches
- Run the unmodified production use case with debug on
- Run the UBE with Reduced Data Selection
- Run for 30-60 minutes, then terminate job
- Run for a long period, collecting log samples at intervals
- Capture runtime call stacks over an extended interval
Run the unmodified production use case with debug on
Pros:
- This is the ideal case: the entire run is captured from beginning to end
- No compromise to the data sample
Cons:
- For UBEs longer than two hours (absolute maximum), this method is almost certainly not viable
- Logs get too large too fast: several GB per hour
- Some customers have a 2 GB file size limit on the operating system
- Note that this limit is usually modifiable (a quick check is sketched after this list)
- Debug logging adds a factor of 2-3 to the runtime
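On the file size limit point, a generic check of the per-process limit in effect for the user that launches the UBE might look like the sketch below (plain Python, nothing JDE-specific; on your platform the limit may instead come from ulimit settings or the filesystem):

```python
import resource

# Query the per-process file size limit (RLIMIT_FSIZE).
# A ~2 GB soft limit means a large jdedebug.log will be cut off
# (the writing process typically receives SIGXFSZ).
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)

def fmt(limit):
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit / 2**30:.1f} GB"

print(f"File size limit: soft={fmt(soft)}, hard={fmt(hard)}")
```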
Run the UBE with Reduced Data Selection
Pros:
- Shorter, manageable run
- Smaller log size
- Job can finish, so a complete start-to-finish picture results
- No need to terminate the job
Cons:
- Under-sampling
- The short time frame skews the profile
- Fixed-cost / one-time hits at startup will be exaggerated
- A 10-minute run of a 10-hour job will NOT give a reliable profile (for example, a 5-minute startup cost looks like half the runtime in a 10-minute sample, but less than 1% of the real 10-hour run)
- Avoidance of the problems you are trying to observe
- If the problem is related to a specific data range, it may be missed
- The job may be stuck in an infinite loop
Run for 30-60 minutes, then terminate job
Pros:
- Less risk from under-sampling
- A reasonably sized log in the few-GB range
Cons:
- The UBE may have multiple sections, and the real problems may occur in a section that is never reached in the 30-60 minutes
- Get a listing of the UBE’s ER to help get a clearer picture
- Look at the UBE in the RDA tool
- What the UBE is doing in the first hour may NOT be the same as what it’s doing in the third hour
- Remember: a one-hour run with debug logging on probably represents only 20-30 minutes of runtime without it
- If the job is running the same code for a long period, but slowing down gradually, this will also be missed by capturing just the first hour
- Killing the job prematurely means not seeing the complete picture
- The graceful end of a UBE contains important indications of cache memory leaks:
- Such as jdeCacheInit() calls not matched with jdeCacheTerminateAll()
- Failure to call jdeCacheTerminateAll() results in jdeCacheDestroyAllUserCaches() being called instead
- It is difficult to count and match every jdeCacheInit() call by hand, since caches may be initialized and terminated across many different BSFNs (a rough counting script is sketched after this list)
- Multiple sections in UBEs may never be reached
- Problem data ranges in UBEs may never be reached
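To take some of the pain out of matching jdeCacheInit() with its terminates, a rough log-scanning sketch is shown below. It only counts occurrences of the cache call names mentioned above in a jdedebug log; how (and whether) each call shows up in the log varies by release and log level, so treat this as an approximation to adapt rather than a definitive tool:

```python
import sys
from collections import Counter

# Rough scan of a jdedebug log for JDE cache API activity.
# Usage: python cache_scan.py jdedebug_12345.log
# Counts are approximate: "jdeCacheInit" also matches longer call names that
# begin with the same prefix, and the log format differs between releases.
NAMES = ("jdeCacheInit", "jdeCacheTerminateAll", "jdeCacheDestroyAllUserCaches")

counts = Counter()
with open(sys.argv[1], errors="replace") as log:
    for line in log:
        for name in NAMES:
            counts[name] += line.count(name)

for name in NAMES:
    print(f"{name:30s} {counts[name]}")
```

A jdeCacheInit count far above the jdeCacheTerminateAll count, or a jdeCacheDestroyAllUserCaches entry near the end of the log, points at caches that were never explicitly terminated.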
Run for a long period, collecting log samples at intervals
Pros:
- Obtain profiles for a much longer period
- Perhaps collect a 30-minute log every 2-4 hours
Cons:
- The run needs monitoring and babysitting
- Process is a bit of a “kludge”
- Takes a long time, requires machine resources during that time
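A rough sketch of the "30-minute log every few hours" loop follows. How debug logging is switched on and off for a running job is site-specific (Server Manager, logging settings, and so on), so the enable/disable hooks below are hypothetical placeholders, and the log path is made up; only the timing and file-copy bookkeeping is meant literally:

```python
import shutil
import time
from datetime import datetime
from pathlib import Path

def enable_debug():
    """Placeholder: turn debug logging on for the running job (site-specific)."""

def disable_debug():
    """Placeholder: turn debug logging off again (site-specific)."""

LOG = Path("/jde/logs/jdedebug_12345.log")    # hypothetical path to the job's debug log
SAMPLE_DIR = Path("/jde/logs/samples")
SAMPLE_MINUTES = 30                           # collect a 30-minute log...
INTERVAL_HOURS = 3                            # ...every 2-4 hours

SAMPLE_DIR.mkdir(parents=True, exist_ok=True)
while True:                                   # stop with Ctrl-C once the UBE ends
    enable_debug()
    time.sleep(SAMPLE_MINUTES * 60)
    disable_debug()
    stamp = datetime.now().strftime("%Y%m%d_%H%M")
    shutil.copy2(LOG, SAMPLE_DIR / f"jdedebug_sample_{stamp}.log")
    time.sleep(INTERVAL_HOURS * 3600)
```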
Capture runtime call stacks over an extended interval
DO NOT enable EOne debug logging for this approach
Pros:
- Capture a set of snapshots at different intervals over a long run
- Can combine this method with other monitoring
- A poor man’s “manual” sampling approach, but it can be effective
- Use existing operating system commands
- Debug code NOT required
- Can help to spot infinite looping behavior
Cons:
- Raw call stacks are a bit obscure and not as intuitive to read
- The run needs monitoring, babysitting
- Takes a long time, requires machine resources during that time
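A minimal sketch of this "manual sampling", assuming a Unix-style enterprise server where a stack-dumping command such as pstack (Linux/Solaris) or procstack (AIX) is available and the PID of the UBE's runbatch process is known; the command name, PID, and intervals are all assumptions to adapt:

```python
import subprocess
import sys
import time
from datetime import datetime

# Usage: python stack_sample.py <pid> [interval_seconds] [count]
# Writes one timestamped call-stack snapshot per interval for the given process.
PSTACK = "pstack"   # swap for "procstack" on AIX, etc.

pid = sys.argv[1]
interval = int(sys.argv[2]) if len(sys.argv) > 2 else 60
count = int(sys.argv[3]) if len(sys.argv) > 3 else 30

for _ in range(count):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    result = subprocess.run([PSTACK, pid], capture_output=True, text=True)
    with open(f"stack_{pid}_{stamp}.txt", "w") as f:
        f.write(result.stdout or result.stderr)
    time.sleep(interval)
```

If snapshot after snapshot shows essentially the same stack, that is a strong hint of the looping (or gradually slowing) behaviour described above.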
Recommendations
- If the use case is less than two hours:
- Run the unmodified production use case with debug on
- In general, avoid reducing data selection
- But this MAY help in identifying cache leaks – since the UBE finishes gracefully
- Can view the end of the log file for missing cache terminates
- Perhaps a reduced data selection case could be run in addition to one of the longer use cases
- This would allow the end of the job to be captured
- Try a one-hour terminated run first
- In general, one single long-running SELECT should NOT drive the analysis
- Next, try log samples throughout the run
- Finally, try call stack samples