Considerable effort went into parsing this data into a usable format; the Perl scripts used to do so can be found in Appendix A. To load the data into R, I used the readr package, which breaks the date field down into automatically countable months, days, hours, and so on. While parsing, however, 1.51% of the Helix dataset and 0.54% of the Cadillac dataset produced errors. These errors stem from abnormalities in the number of columns in the original data, but they have not had a significant impact on the availability of data from those rows. The commands used to load the data into R are available in the source code of this document.
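As a rough sketch of this step (the file name, delimiter handling, and column specification here are assumptions; the real load commands are in this document's source), a readr load that also recovers the parse-error rate might look like:

```r
library(readr)

# Hypothetical sketch: assumes the Perl-preprocessed accounting log was
# written out as a delimited file named "helix_accounting.csv".
helix.full <- read_csv("helix_accounting.csv",
                       col_types = cols(Date = col_date(format = "%Y-%m-%d")))

# readr records every row it could not parse cleanly (e.g. rows with an
# abnormal column count); this recovers the percentage of affected rows.
probs <- problems(helix.full)
error.pct <- 100 * length(unique(probs$row)) / nrow(helix.full)
```

readr keeps the offending rows in the result (filling unparseable fields with NA), which is consistent with the affected rows remaining largely usable.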
In addition to parsing, the high density of the data led me to create two functions. The first groups jobs by month, producing monthly totals for: total walltime, number of successful jobs, number of failed jobs, total walltime of failed jobs, total walltime of successful jobs, number of unique users, total memory used, and total number of jobs. The second groups jobs by day, producing a total number of jobs for each day. Further functions were written to aid in the creation of sorted frequency tables.
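A minimal sketch of the monthly grouping function, assuming a parsed data frame with the columns described below (the function and column names here are illustrative, not the actual implementation):

```r
library(dplyr)

# Sketch of function (1): monthly aggregates over a parsed accounting
# data frame `jobs` with Date, ExitStatus, UsedWalltime, Owner, UsedMemory.
jobs.by.month <- function(jobs) {
  jobs %>%
    group_by(Month = format(Date, "%Y-%m")) %>%
    summarise(TotalWalltime   = sum(UsedWalltime),
              SuccessfulJobs  = sum(ExitStatus == 0),
              FailedJobs      = sum(ExitStatus != 0),
              FailedWalltime  = sum(UsedWalltime[ExitStatus != 0]),
              SuccessWalltime = sum(UsedWalltime[ExitStatus == 0]),
              UniqueUsers     = n_distinct(Owner),
              TotalMemory     = sum(UsedMemory),
              TotalJobs       = n())
}
```

The daily variant is the same pattern with `group_by(as.Date(Date))` and a single `n()` summary.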
Below are the first ten rows of this dataset. The chunk was taken from the Helix dataset, but the data structure is identical for both clusters.
Date | JobID | Group | JobName | Queue | CTime | QTime | ETime | StartTime | Owner | NeedNodes | NodeCT | ResourceNodes | ResourceWalltime | Session | EndTime | ExitStatus | UsedCPU | UsedMemory | UsedVirtualMemory | UsedWalltime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-09-27 | 3.helix-master.jax.org | jaxadmin | RFB-ExampleJob | batch | 1411821999 | 1411821999 | 1411821999 | 1411821999 | Aardvark1 | nodes=1 | 1 | nodes=1 | 0.0166667 | 0 | 1411822000 | -2 | 0 | 0 | 0 | 0.0002778 |
2014-09-27 | 7.helix-master.jax.org | jaxadmin | STDIN | batch | 1411825869 | 1411825869 | 1411825869 | 1411825869 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 0 | 1411825869 | -2 | 0 | 0 | 0 | 0.0000000 |
2014-09-27 | 8.helix-master.jax.org | jaxadmin | STDIN | batch | 1411825887 | 1411825887 | 1411825887 | 1411825887 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 0 | 1411825887 | -2 | 0 | 0 | 0 | 0.0000000 |
2014-09-27 | 12.helix-master.jax.org | jaxadmin | STDIN | batch | 1411826763 | 1411826763 | 1411826763 | 1411826763 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 29287 | 1411826793 | 0 | 0 | 3364 | 348856 | 0.0083333 |
2014-09-27 | 13.helix-master.jax.org | jaxadmin | STDIN | batch | 1411827333 | 1411827333 | 1411827333 | 1411827334 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 31661 | 1411827354 | 0 | 0 | 3368 | 348856 | 0.0055556 |
2014-09-27 | 14.helix-master.jax.org | jaxadmin | STDIN | batch | 1411832317 | 1411832317 | 1411832317 | 1411832317 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 52135 | 1411832327 | 0 | 0 | 0 | 0 | 0.0027778 |
2014-09-28 | 15.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905758 | 1411905758 | 1411905758 | 1411905768 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 20687 | 1411905789 | 0 | 0 | 3368 | 348856 | 0.0058333 |
2014-09-28 | 16.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905758 | 1411905758 | 1411905758 | 1411905779 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 20759 | 1411905809 | 0 | 0 | 3372 | 348856 | 0.0083333 |
2014-09-28 | 17.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905759 | 1411905759 | 1411905759 | 1411905789 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 20832 | 1411905809 | 0 | 0 | 1912 | 123972 | 0.0055556 |
2014-09-28 | 18.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905759 | 1411905759 | 1411905759 | 1411905819 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 21020 | 1411905840 | 0 | 0 | 3368 | 348856 | 0.0058333 |
The date range of the Helix dataset stretches from September 9th, 2014 to February 2nd, 2021; the Cadillac dataset stretches from April 4th, 2014 to January 31st, 2021. All dates are stored in the %Y-%m-%d format.
Each job has an associated JobID. Since this dataset only reports jobs that have ended, some JobIDs may be missing due to server errors, submission errors, cancellations, or other reasons.
The Group variable is not very informative: while some of the early jobs were assigned a specific group (such as compsci, jaxadmin, or jaxchurchill), many of the later jobs were recorded as simply jaxuser. The frequency tables for this variable, for both Helix and Cadillac, are below.
[Frequency tables: Group, Helix and Cadillac]
The JobName field is also mostly uninformative unless one is searching for a specific job: the name is chosen freely by the submitting user, so aggregate analysis of this field yields little insight.
The Queue variable can be used to describe the popularity of the queues users submit to. Below are the frequency tables for each cluster.
[Frequency tables: Queue, Helix and Cadillac]
These values are recorded as numeric Unix timestamps (seconds since the epoch). CTime is the time the job was created; QTime, the time it was queued; ETime, the time it became eligible to run; StartTime, the time it started; and EndTime, the time it ended.
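These epoch timestamps convert directly to dates in R, and their differences give queue wait and run duration. A sketch with toy values taken from the first row above:

```r
# Timestamp fields are Unix-epoch seconds; toy values from the sample rows.
ctime <- 1411821999   # job created
start <- 1411821999   # job started
endt  <- 1411822000   # job ended

# Convert an epoch timestamp to a readable date-time.
as.POSIXct(ctime, origin = "1970-01-01", tz = "UTC")

queue.wait <- start - ctime  # seconds spent queued before starting
run.time   <- endt - start   # seconds of actual execution
```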
This field records the owner, i.e. the submitter, of the job. It is useful for identifying the degree to which our top users use the cluster. For purposes of privacy, these usernames have been anonymised.
All-time

[Frequency tables: top owners, Helix and Cadillac]

Since 2017

[Frequency tables: top owners, Helix and Cadillac]

Since Clusters’ EOL Dates

Helix and Cadillac went EOL on July 1st, 2019.

[Frequency tables: top owners, Helix and Cadillac]

Since January 2020

[Frequency tables: top owners, Helix and Cadillac]

Last 3 Months

[Frequency tables: top owners, Helix and Cadillac]
These fields describe the number of nodes requested by the job submission. While this could be used to see how well users are profiling their jobs, it is of limited use here because we are more interested in raw CPU time and walltime than in the number of unique nodes requested.
These two fields compare the amount of walltime requested with the amount actually used by the job. ResourceWalltime is a decimal count of the hours of walltime originally requested; UsedWalltime records how many hours were actually consumed.
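The relationship between the two fields can be summarized as a utilization ratio. A minimal sketch with toy rows standing in for the real data frame (negative or zero requests, which appear in the data, are filtered out first):

```r
library(dplyr)

# Toy stand-in rows (hours of walltime); the real data frame is helix.full.
jobs <- data.frame(ResourceWalltime = c(1, 24, 0.5),
                   UsedWalltime     = c(0.25, 20, 0.5))

jobs %>%
  filter(ResourceWalltime > 0) %>%   # drop invalid/negative requests
  summarise(mean.efficiency = mean(UsedWalltime / ResourceWalltime))
```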
For now we will simply observe some basic statistics, as the amount of utilized walltime in hours will be analyzed later in monthly grouped data.
```r
print(descr(helix.full$ResourceWalltime, stats = "common"),
      method = 'render', table.classes = 'st-small')
```
Statistic | ResourceWalltime |
---|---|
Mean | 110.47 |
Std.Dev | 1649.00 |
Min | -11.00 |
Median | 10.00 |
Max | 734647.00 |
N.Valid | 11991016 |
Pct.Valid | 99.61 |
Generated by summarytools 0.9.8 (R version 4.0.4)
2021-03-15
Statistic | Value |
---|---|
Mean | 39280.71 |
Std.Dev | 1279454.53 |
Min | -3.90 |
Median | 0.02 |
Max | 74016884.00 |
N.Valid | 11998868 |
Pct.Valid | 99.67 |
All of these statistics reflect the amount of resources consumed by the job. UsedCPU records how many hours of CPU time were utilized; UsedMemory, how much RAM the job used in kB; UsedVirtualMemory, how much virtual memory the job used on the nodes in kB.
This field records the exit code the job returned. An exit code of “0” represents a successful job; any other exit code represents a failure.
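This success/failure split underlies the monthly counts described earlier. A minimal sketch with toy exit codes (values modeled on the sample rows above, where -2 appears alongside 0):

```r
# Classify jobs by exit code: 0 is success, anything else is a failure.
exit.status <- c(0, 0, -2, 1, 0)

n.success <- sum(exit.status == 0)
n.failed  <- sum(exit.status != 0)
failure.rate <- n.failed / length(exit.status)  # 2/5 = 0.4
```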
All-time

[Frequency tables: exit codes, Helix and Cadillac]

Last 3 Months

[Frequency tables: exit codes, Helix and Cadillac]