Data Exploration and Summary

Much of the effort in loading and exploring this data went into parsing it into a usable format. The Perl scripts used for this can be found in Appendix A. When loading the data into R, I used the readr package, which breaks the date format down into countable months, days, hours, and so on. Parsing errors occurred in 1.51% of the Helix dataset and 0.54% of the Cadillac dataset. These errors stem from abnormalities in the number of columns in the original data, but they have not had a significant impact on the availability of data from those rows. The commands used to load the data into R are available in the source code of this document.
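As an illustration of how the share of rows with an abnormal column count could be measured, here is a minimal sketch in base R. It is not the code used for this report: the sample lines are invented, the file is a temporary stand-in, and it uses `count.fields()` from base R rather than readr.

```r
# Hypothetical sample standing in for one of the parsed log files:
# a header plus four records, one of which is missing a column.
lines <- c(
  "Date JobID Queue ExitStatus",
  "2014-09-27 3.helix-master.jax.org batch -2",
  "2014-09-27 7.helix-master.jax.org batch -2",
  "2014-09-27 8.helix-master.jax.org batch",
  "2014-09-27 12.helix-master.jax.org batch 0"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# count.fields() returns the number of whitespace-separated fields per line.
n_fields <- count.fields(path)
expected <- n_fields[1]   # the header defines the expected width
records  <- n_fields[-1]  # drop the header itself

# Percentage of records whose column count differs from the header.
pct_malformed <- 100 * mean(records != expected)
print(pct_malformed)
```

A check like this only flags rows with the wrong number of columns; other kinds of parsing problems (for example, a non-numeric value in a numeric column) would need a separate test.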

In addition to parsing, the high density of the data led me to create two summary functions. The first groups jobs by month and produces monthly totals for: total walltime, number of successful jobs, number of failed jobs, walltime of failed jobs, walltime of successful jobs, number of unique users, amount of memory used, and number of jobs. The second groups jobs by day and produces the total number of jobs for each day. Additional functions were written to aid in the creation of sorted frequency tables.
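The two grouping functions might look roughly like the following base R sketch. The function names `summarise_by_month` and `jobs_per_day`, the output column names, and the sample values are all hypothetical; the actual functions are in the source code of this document. The sketch also assumes no missing values in the inputs.

```r
# Hypothetical job records; column names follow the dataset, values are invented.
jobs <- data.frame(
  Date         = as.Date(c("2014-09-27", "2014-09-28", "2014-10-01", "2014-10-02")),
  Owner        = c("Aardvark1", "Aardvark1", "Dog1", "Koala1"),
  ExitStatus   = c(0, 1, 0, 0),
  UsedWalltime = c(1.0, 0.5, 2.0, 3.0),
  UsedMemory   = c(3364, 0, 3368, 1912),
  stringsAsFactors = FALSE
)

# Monthly totals: one output row per calendar month.
summarise_by_month <- function(df) {
  df$Month <- format(df$Date, "%Y-%m")
  ok <- df$ExitStatus == 0  # exit status 0 is treated as success
  per_month <- lapply(split(seq_len(nrow(df)), df$Month), function(i) {
    data.frame(
      Month              = df$Month[i][1],
      TotalJobs          = length(i),
      SuccessfulJobs     = sum(ok[i]),
      FailedJobs         = sum(!ok[i]),
      TotalWalltime      = sum(df$UsedWalltime[i]),
      SuccessfulWalltime = sum(df$UsedWalltime[i][ok[i]]),
      FailedWalltime     = sum(df$UsedWalltime[i][!ok[i]]),
      UniqueUsers        = length(unique(df$Owner[i])),
      TotalMemory        = sum(df$UsedMemory[i]),
      stringsAsFactors   = FALSE
    )
  })
  out <- do.call(rbind, per_month)
  rownames(out) <- NULL
  out
}

# Daily totals: number of jobs per calendar day.
jobs_per_day <- function(df) {
  as.data.frame(table(Date = df$Date), responseName = "Jobs")
}

monthly <- summarise_by_month(jobs)
daily   <- jobs_per_day(jobs)
print(monthly)
print(daily)
```

On the real data, rows with missing `ExitStatus` or `UsedWalltime` would need to be handled explicitly (for example with `na.rm = TRUE`), since the sums above would otherwise return `NA`.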

Data Structure

Below are the first ten rows of the dataset. They are taken from the Helix dataset, but the data structure is identical for both clusters.

Date JobID Group JobName Queue CTime QTime ETime StartTime Owner NeedNodes NodeCT ResourceNodes ResourceWalltime Session EndTime ExitStatus UsedCPU UsedMemory UsedVirtualMemory UsedWalltime
2014-09-27 3.helix-master.jax.org jaxadmin RFB-ExampleJob batch 1411821999 1411821999 1411821999 1411821999 Aardvark1 nodes=1 1 nodes=1 0.0166667 0 1411822000 -2 0 0 0 0.0002778
2014-09-27 7.helix-master.jax.org jaxadmin STDIN batch 1411825869 1411825869 1411825869 1411825869 Aardvark1 nodes=1 1 nodes=1 1.0000000 0 1411825869 -2 0 0 0 0.0000000
2014-09-27 8.helix-master.jax.org jaxadmin STDIN batch 1411825887 1411825887 1411825887 1411825887 Aardvark1 nodes=1 1 nodes=1 1.0000000 0 1411825887 -2 0 0 0 0.0000000
2014-09-27 12.helix-master.jax.org jaxadmin STDIN batch 1411826763 1411826763 1411826763 1411826763 Aardvark1 nodes=1 1 nodes=1 1.0000000 29287 1411826793 0 0 3364 348856 0.0083333
2014-09-27 13.helix-master.jax.org jaxadmin STDIN batch 1411827333 1411827333 1411827333 1411827334 Aardvark1 nodes=1 1 nodes=1 1.0000000 31661 1411827354 0 0 3368 348856 0.0055556
2014-09-27 14.helix-master.jax.org jaxadmin STDIN batch 1411832317 1411832317 1411832317 1411832317 Aardvark1 nodes=1 1 nodes=1 1.0000000 52135 1411832327 0 0 0 0 0.0027778
2014-09-28 15.helix-master.jax.org jaxadmin STDIN batch 1411905758 1411905758 1411905758 1411905768 Aardvark1 nodes=1 1 nodes=1 1.0000000 20687 1411905789 0 0 3368 348856 0.0058333
2014-09-28 16.helix-master.jax.org jaxadmin STDIN batch 1411905758 1411905758 1411905758 1411905779 Aardvark1 nodes=1 1 nodes=1 1.0000000 20759 1411905809 0 0 3372 348856 0.0083333
2014-09-28 17.helix-master.jax.org jaxadmin STDIN batch 1411905759 1411905759 1411905759 1411905789 Aardvark1 nodes=1 1 nodes=1 1.0000000 20832 1411905809 0 0 1912 123972 0.0055556
2014-09-28 18.helix-master.jax.org jaxadmin STDIN batch 1411905759 1411905759 1411905759 1411905819 Aardvark1 nodes=1 1 nodes=1 1.0000000 21020 1411905840 0 0 3368 348856 0.0058333

Information Fields

Date

The date range on the Helix dataset stretches from September 9th, 2014 to February 2nd, 2021. The date range on the Cadillac dataset stretches from April 4th, 2014 to January 31st, 2021. All dates are in the %m/%d/%Y format.
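A minimal sketch of parsing dates in the `%m/%d/%Y` layout with base R follows. The input strings are illustrative, and the variable names are hypothetical.

```r
# Illustrative raw date strings in the %m/%d/%Y layout described above.
raw   <- c("09/27/2014", "09/28/2014", "02/02/2021")
dates <- as.Date(raw, format = "%m/%d/%Y")

# Once parsed, calendar components can be extracted for grouping.
month_keys <- format(dates, "%Y-%m")
print(dates)
print(month_keys)
```

Strings that do not match the format come back as `NA`, so `anyNA(dates)` is a quick way to spot rows that failed to parse.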

JobID

Each job has an associated ID. Because this dataset only reports jobs that have ended, some JobIDs may be missing due to server errors, submission errors, cancellations, or other reasons.

Group

The Group variable is not very informative: while some of the early jobs were assigned to a specific group (such as compsci, jaxadmin, or jaxchurchill), many of the later jobs were simply assigned to jaxuser. The frequency tables for this variable are shown below for both Helix and Cadillac.

Helix:
Group Freq
jaxuser 11839060
jaxchurchill 118914
smrtanalysis 44019
compsci 15951
jaxadmin 11744
jaxgraber 5621
jaxchesler 1360
jaxhibbs 876
28003 740
galaxy 66
jaxcarter 35
splunk 13

Cadillac:
Group Freq
jaxuser 2893660
jaxchurchill 375759
jaxhibbs 94775
jaxchesler 61744
compsci 16545
jaxgraber 16149
jaxadmin 5015
smrtanalysis 5006
galaxy 3992
jaxcarter 289
jaxcgd 38
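A sorted frequency table of this kind can be built in base R with `table()` and `order()`. The helper below is a hypothetical sketch (the name `sorted_freq` and its `name` argument are not from the original code), shown on invented values.

```r
# Build a frequency table for a vector, sorted from most to least common.
# `name` is the label to use for the value column in the result.
sorted_freq <- function(x, name = "Value") {
  freq <- as.data.frame(table(x), stringsAsFactors = FALSE)
  names(freq) <- c(name, "Freq")
  freq <- freq[order(-freq$Freq), ]
  rownames(freq) <- NULL
  freq
}

# Invented sample of group labels.
groups <- c("jaxuser", "compsci", "jaxuser", "jaxadmin", "jaxuser", "compsci")
group_freq <- sorted_freq(groups, name = "Group")
print(group_freq)
```

The same helper applies unchanged to any other categorical column, such as the queue name.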

Job Name

The job name is also mostly uninformative unless one is looking for a specific job. Because the name is chosen freely by the user, aggregate analysis of this field is unlikely to be meaningful.

Queue

This variable shows how frequently users submit jobs to each queue. Below are the frequency tables for each cluster.

Helix:
Queue Freq
batch 11659470
batch2 116769
special 102003
dev_centos7 45358
test 23979
CLIA 21859
short 19568
interactive 15229
high_mem 11146
gpu 7960
long 7573
ruan_priority 3077
ccs 3013
htps 873

Cadillac:
Queue Freq
batch 3313734
short 142760
gpu 6701
long 4990
interactive 2260
high_mem 1554
CLIA 665
test 308

CTime, QTime, ETime, StartTime, EndTime

These values are recorded as Unix timestamps (seconds since January 1st, 1970). CTime is the time the job was created. QTime is the time the job was queued. ETime is the time the job became eligible to run. StartTime is the time the job started. EndTime is the time the job ended.
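The sketch below shows how these timestamp fields could be converted and compared in base R. The three values are copied from the sample row for job 18.helix-master.jax.org; the helper name `to_time` is hypothetical, and displaying the result in UTC is an assumption, since the report does not state which time zone was used.

```r
# Values copied from the sample row for job 18.helix-master.jax.org.
job <- data.frame(
  QTime     = 1411905759,
  StartTime = 1411905819,
  EndTime   = 1411905840
)

# Interpret the numbers as seconds since 1970-01-01 00:00:00 UTC.
# Displaying in UTC is an assumption made for this sketch.
to_time <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")
start <- to_time(job$StartTime)

# Differences between the raw values are durations in seconds.
queue_wait_sec <- job$StartTime - job$QTime
run_time_sec   <- job$EndTime - job$StartTime

print(start)
print(c(queue_wait_sec = queue_wait_sec, run_time_sec = run_time_sec))
```

The converted start date matches the Date column of the same row (2014-09-28), which supports reading these fields as epoch seconds.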

Owner

This field represents the owner, or submitter, of the job. It is useful for identifying how heavily our top users use the cluster. For privacy, these usernames have been anonymised.

All-time

Helix:
Owner n
Dog1 1086195
Koala1 957174
Gar1 836677
Antelope1 667965
Moorhen1 582832
Bonobo1 487060
Coati1 462973
Reindeer1 422476
Horse1 395046
Rottweiler1 357363

Cadillac:
Owner n
Cat3 270686
Puffin2 261889
Dodo3 246801
Turkey2 234585
Cuttlefish3 220317
Reindeer2 209401
Raccoon2 204107
Swan2 182920
Gorilla3 133021
Wrasse2 131644

Since 2017

Helix:
Owner n
Dog1 1075048
Koala1 957174
Gar1 836677
Antelope1 617375
Moorhen1 582832
Bonobo1 484839
Coati1 459176
Reindeer1 422476
Horse1 395046
Rottweiler1 357363

Cadillac:
Owner n
Cat3 270686
Puffin2 248309
Dodo3 246801
Cuttlefish3 220317
Turkey2 176039
Reindeer2 172844
Raccoon2 171715
Gorilla3 133021
Wrasse2 121287
Abyssinian3 86849

Since Clusters’ EOL Dates

Helix and Cadillac reached end-of-life (EOL) on July 1st, 2019.

Helix:
Owner n
Antelope1 289014
Koala1 228717
Rottweiler1 174553
Newfoundland1 137935
Snake1 126767
Grasshopper2 106730
Leopard1 84819
Bonobo1 79329
Bonobo2 78036
Dog1 60622

Cadillac:
Owner n
Dodo3 90946
Reindeer2 47536
Cat3 18120
Javanese3 13065
Moorhen3 12398
Hedgehog3 11685
Wrasse2 9388
Kangaroo3 8053
Gharial3 5186
Indri3 3120

Since January 2020

Helix:
Owner n
Antelope1 274082
Rottweiler1 112992
Leopard1 78217
Bonobo1 74650
Koala1 45723
Quoll1 43988
Grasshopper2 41675
Birman2 34497
Bonobo2 29789
Rattlesnake1 29508

Cadillac:
Owner n
Dodo3 55923
Moorhen3 12398
Wrasse2 9138
Fox3 970
Gharial3 514
Rat2 500
Javanese3 388
Indri3 229
Bulldog3 93
Quail2 88

Last 3 Months

Helix:
Owner n
Rottweiler1 20318
Salamander1 12205
Chamois1 6537
Quoll1 5335
Rattlesnake1 1767
Pheasant2 997
Leopard1 848
Snake1 766
Lionfish1 715
Goat2 673

Cadillac:
Owner n
Dodo3 13275
Moorhen3 2725
Wrasse2 494
Javanese3 20
Gharial3 1
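The windowed rankings in this section amount to filtering jobs by date and then counting jobs per owner. A minimal base R sketch is given below; the function name `top_owners`, its arguments, and the sample records are hypothetical and are not the code used to produce the tables.

```r
# Invented job records; only the Date and Owner columns are needed.
jobs <- data.frame(
  Date  = as.Date(c("2016-05-01", "2018-03-10", "2018-06-01",
                    "2019-08-02", "2020-02-14", "2020-11-30")),
  Owner = c("Dog1", "Dog1", "Dog1", "Antelope1", "Antelope1", "Koala1"),
  stringsAsFactors = FALSE
)

# Rank owners by job count, optionally restricted to jobs on or after `since`.
top_owners <- function(df, since = NULL, top = 10) {
  if (!is.null(since)) df <- df[df$Date >= as.Date(since), ]
  counts <- sort(table(as.character(df$Owner)), decreasing = TRUE)
  out <- data.frame(Owner = names(counts), n = as.vector(counts),
                    stringsAsFactors = FALSE)
  head(out, top)
}

all_time  <- top_owners(jobs)
since_eol <- top_owners(jobs, since = "2019-07-01")
print(all_time)
print(since_eol)
```

Passing a different `since` value reproduces each window (for example "2017-01-01" or "2020-01-01"); a "last 3 months" window would use a cutoff computed from the final date in the dataset.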

NeedNodes, NodeCT, ResourceNodes

These fields describe the number of nodes requested by the job submission. While this could show how well users profile their jobs, it is of limited use here because we are more interested in raw CPU time and walltime than in the number of unique nodes requested.

ResourceWalltime, UsedWalltime

These two fields contrast the walltime requested with the walltime actually used. ResourceWalltime is the number of hours of walltime originally requested by the job, expressed as a decimal. UsedWalltime is the amount of walltime the job actually used, in the same units.
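As a quick sanity check on the units, the recorded walltime can be compared with the elapsed time recomputed from the timestamps. The sketch below uses values copied from two of the sample rows (jobs 12 and 18 on Helix); the derived column names are hypothetical, and the agreement is only demonstrated for these sample rows, not for the whole dataset.

```r
# Values copied from two of the sample rows (jobs 12 and 18 on Helix).
jobs <- data.frame(
  StartTime        = c(1411826763, 1411905819),
  EndTime          = c(1411826793, 1411905840),
  ResourceWalltime = c(1.0, 1.0),
  UsedWalltime     = c(0.0083333, 0.0058333)
)

# Elapsed time recomputed from the timestamps, converted from seconds to hours.
jobs$ElapsedHours <- (jobs$EndTime - jobs$StartTime) / 3600

# Fraction of the requested walltime that was actually consumed.
jobs$WalltimeUtilisation <- jobs$UsedWalltime / jobs$ResourceWalltime

# TRUE where the recorded and recomputed walltimes agree to within rounding.
jobs$Consistent <- abs(jobs$ElapsedHours - jobs$UsedWalltime) < 1e-6

print(jobs)
```

A ratio such as `WalltimeUtilisation` gives a per-job view of how closely users match their requests to actual usage; jobs with a zero or missing request would need to be excluded before dividing.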

For now we will only look at some basic statistics; utilized walltime in hours is analyzed later using the monthly grouped data.

print(descr(helix.full$ResourceWalltime, stats="common"), method='render', table.classes = 'st-small')

Descriptive Statistics
N: 12038399

Statistic Value
Mean 110.47
Std.Dev 1649.00
Min -11.00
Median 10.00
Max 734647.00
N.Valid 11991016
Pct.Valid 99.61

Generated by summarytools 0.9.8 (R version 4.0.4), 2021-03-15

print(descr(helix.full$UsedWalltime, stats="common"), method='render', table.classes = 'st-small')

Descriptive Statistics
N: 12038399

Statistic Value
Mean 39280.71
Std.Dev 1279454.53
Min -3.90
Median 0.02
Max 74016884.00
N.Valid 11998868
Pct.Valid 99.67


UsedCPU, UsedMemory, UsedVirtualMemory

These fields reflect the resources consumed by the job. UsedCPU is the number of hours of CPU time used. UsedMemory is the amount of RAM used by the job, in KB. UsedVirtualMemory is the amount of virtual memory the job used on the nodes, in KB.

ExitStatus

This field reflects the exit code the job returned. An exit code of “0” represents a successful job, and any other exit code represents a failure.
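The success/failure rule can be applied directly to a table of exit-code counts. The sketch below uses the three counts from the Cadillac "Last 3 Months" table in this section; the `Outcome` column and the variable names are hypothetical.

```r
# Counts copied from the Cadillac "Last 3 Months" exit-status table.
exit_counts <- data.frame(
  ExitStatus = c(1, 0, -11),
  n          = c(13656, 2842, 17)
)

# Exit code 0 is a success; every other code counts as a failure.
exit_counts$Outcome <- ifelse(exit_counts$ExitStatus == 0, "success", "failure")

# Total jobs per outcome, and the share of jobs that failed.
totals       <- tapply(exit_counts$n, exit_counts$Outcome, sum)
failure_rate <- unname(totals["failure"] / sum(totals))

print(totals)
print(failure_rate)
```

The same classification can be applied per job before grouping, which is how per-month counts of successful and failed jobs could be derived.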

All-time

Helix:
ExitStatus n
0 10740144
1 702411
2 96586
271 92195
127 70128
255 65665
-9 62017
-11 59000
3 9714
126 7926

Cadillac:
ExitStatus n
0 3047777
1 257093
271 25381
-9 20872
2 20470
-11 17146
127 13839
255 9080
137 3293
132 3102

Last 3 Months

Helix:
ExitStatus n
0 49087
1 1730
-11 507
2 445
127 339
255 90
271 64
254 48
139 43
265 36

Cadillac:
ExitStatus n
1 13656
0 2842
-11 17