Considerable effort went into parsing this data into a usable format; the Perl scripts used to do so can be found in Appendix A. To load the data into R, I used the readr package, which breaks the date field down into automatically countable months, days, hours, and so on. While parsing, however, 1.51% of the Helix dataset and 0.54% of the Cadillac dataset produced errors. These errors stem from abnormalities in the number of columns in the original data, but they have not had a significant impact on the availability of data from those rows. The commands used to load the data into R are available in the source code of this document.
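As a rough sketch of this step (the file name, delimiter handling, and column specification here are assumptions; the real load commands are in this document's source), a readr load that also recovers the parse-error rate might look like:

```r
library(readr)

# Hypothetical sketch: assumes the Perl-preprocessed accounting log was
# written out as a delimited file named "helix_accounting.csv".
helix.full <- read_csv("helix_accounting.csv",
                       col_types = cols(Date = col_date(format = "%Y-%m-%d")))

# readr records every row it could not parse cleanly (e.g. rows with an
# abnormal column count); this recovers the percentage of affected rows.
probs <- problems(helix.full)
error.pct <- 100 * length(unique(probs$row)) / nrow(helix.full)
```

readr keeps the offending rows in the result (filling unparseable fields with NA), which is consistent with the affected rows remaining largely usable.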
In addition to parsing, the high density of the data led me to create two functions. The first groups jobs by month, producing monthly totals for: total walltime, number of successful jobs, number of failed jobs, total walltime of failed jobs, total walltime of successful jobs, number of unique users, total memory used, and total number of jobs. The second groups jobs by day, producing a total number of jobs for each day. Further functions were written to aid in the creation of sorted frequency tables.
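A minimal sketch of the monthly grouping function, assuming a parsed data frame with the columns described below (the function and column names here are illustrative, not the actual implementation):

```r
library(dplyr)

# Sketch of function (1): monthly aggregates over a parsed accounting
# data frame `jobs` with Date, ExitStatus, UsedWalltime, Owner, UsedMemory.
jobs.by.month <- function(jobs) {
  jobs %>%
    group_by(Month = format(Date, "%Y-%m")) %>%
    summarise(TotalWalltime   = sum(UsedWalltime),
              SuccessfulJobs  = sum(ExitStatus == 0),
              FailedJobs      = sum(ExitStatus != 0),
              FailedWalltime  = sum(UsedWalltime[ExitStatus != 0]),
              SuccessWalltime = sum(UsedWalltime[ExitStatus == 0]),
              UniqueUsers     = n_distinct(Owner),
              TotalMemory     = sum(UsedMemory),
              TotalJobs       = n())
}
```

The daily variant is the same pattern with `group_by(as.Date(Date))` and a single `n()` summary.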
Below are the first ten rows of this dataset. The chunk was taken from the Helix dataset, but the data structure is identical for both clusters.
Date | JobID | Group | JobName | Queue | CTime | QTime | ETime | StartTime | Owner | NeedNodes | NodeCT | ResourceNodes | ResourceWalltime | Session | EndTime | ExitStatus | UsedCPU | UsedMemory | UsedVirtualMemory | UsedWalltime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-09-27 | 3.helix-master.jax.org | jaxadmin | RFB-ExampleJob | batch | 1411821999 | 1411821999 | 1411821999 | 1411821999 | Aardvark1 | nodes=1 | 1 | nodes=1 | 0.0166667 | 0 | 1411822000 | -2 | 0 | 0 | 0 | 0.0002778 |
2014-09-27 | 7.helix-master.jax.org | jaxadmin | STDIN | batch | 1411825869 | 1411825869 | 1411825869 | 1411825869 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 0 | 1411825869 | -2 | 0 | 0 | 0 | 0.0000000 |
2014-09-27 | 8.helix-master.jax.org | jaxadmin | STDIN | batch | 1411825887 | 1411825887 | 1411825887 | 1411825887 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 0 | 1411825887 | -2 | 0 | 0 | 0 | 0.0000000 |
2014-09-27 | 12.helix-master.jax.org | jaxadmin | STDIN | batch | 1411826763 | 1411826763 | 1411826763 | 1411826763 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 29287 | 1411826793 | 0 | 0 | 3364 | 348856 | 0.0083333 |
2014-09-27 | 13.helix-master.jax.org | jaxadmin | STDIN | batch | 1411827333 | 1411827333 | 1411827333 | 1411827334 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 31661 | 1411827354 | 0 | 0 | 3368 | 348856 | 0.0055556 |
2014-09-27 | 14.helix-master.jax.org | jaxadmin | STDIN | batch | 1411832317 | 1411832317 | 1411832317 | 1411832317 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 52135 | 1411832327 | 0 | 0 | 0 | 0 | 0.0027778 |
2014-09-28 | 15.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905758 | 1411905758 | 1411905758 | 1411905768 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 20687 | 1411905789 | 0 | 0 | 3368 | 348856 | 0.0058333 |
2014-09-28 | 16.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905758 | 1411905758 | 1411905758 | 1411905779 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 20759 | 1411905809 | 0 | 0 | 3372 | 348856 | 0.0083333 |
2014-09-28 | 17.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905759 | 1411905759 | 1411905759 | 1411905789 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 20832 | 1411905809 | 0 | 0 | 1912 | 123972 | 0.0055556 |
2014-09-28 | 18.helix-master.jax.org | jaxadmin | STDIN | batch | 1411905759 | 1411905759 | 1411905759 | 1411905819 | Aardvark1 | nodes=1 | 1 | nodes=1 | 1.0000000 | 21020 | 1411905840 | 0 | 0 | 3368 | 348856 | 0.0058333 |
The date range of the Helix dataset stretches from September 9th, 2014 to February 2nd, 2021; the Cadillac dataset stretches from April 4th, 2014 to January 31st, 2021. All dates are stored in the %Y-%m-%d format.
Each job has an associated JobID. Since this dataset only reports jobs that have ended, some JobIDs may be missing due to server errors, submission errors, cancellations, or other reasons.
The Group variable is not very informative: while some of the early jobs were assigned a specific group (such as compsci, jaxadmin, or jaxchurchill), many of the later jobs were recorded as simply jaxuser. The frequency tables for this variable, for both Helix and Cadillac, are below.
[Frequency tables: Group, Helix and Cadillac]
The JobName field is also mostly uninformative unless one is searching for a specific job: the name is chosen freely by the submitting user, so aggregate analysis of this field yields little insight.
The Queue variable can be used to describe the popularity of the queues users submit to. Below are the frequency tables for each cluster.
[Frequency tables: Queue, Helix and Cadillac]
These values are recorded as numeric Unix timestamps (seconds since the epoch). CTime is the time the job was created; QTime, the time it was queued; ETime, the time it became eligible to run; StartTime, the time it started; and EndTime, the time it ended.
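These epoch timestamps convert directly to dates in R, and their differences give queue wait and run duration. A sketch with toy values taken from the first row above:

```r
# Timestamp fields are Unix-epoch seconds; toy values from the sample rows.
ctime <- 1411821999   # job created
start <- 1411821999   # job started
endt  <- 1411822000   # job ended

# Convert an epoch timestamp to a readable date-time.
as.POSIXct(ctime, origin = "1970-01-01", tz = "UTC")

queue.wait <- start - ctime  # seconds spent queued before starting
run.time   <- endt - start   # seconds of actual execution
```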
This field records the owner, i.e. the submitter, of the job. It is useful for identifying the degree to which our top users use the cluster. For purposes of privacy, these usernames have been anonymised.
All-time

[Frequency tables: top owners, Helix and Cadillac]

Since 2017

[Frequency tables: top owners, Helix and Cadillac]

Since Clusters’ EOL Dates

Helix and Cadillac went EOL on July 1st, 2019.

[Frequency tables: top owners, Helix and Cadillac]

Since January 2020

[Frequency tables: top owners, Helix and Cadillac]

Last 3 Months

[Frequency tables: top owners, Helix and Cadillac]
These fields describe the number of nodes requested by the job submission. While this could be used to see how well users are profiling their jobs, it is of limited use here because we are more interested in raw CPU time and walltime than in the number of unique nodes requested.
These two fields compare the amount of walltime requested with the amount actually used by the job. ResourceWalltime is a decimal count of the hours of walltime originally requested; UsedWalltime records how many hours were actually consumed.
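The relationship between the two fields can be summarized as a utilization ratio. A minimal sketch with toy rows standing in for the real data frame (negative or zero requests, which appear in the data, are filtered out first):

```r
library(dplyr)

# Toy stand-in rows (hours of walltime); the real data frame is helix.full.
jobs <- data.frame(ResourceWalltime = c(1, 24, 0.5),
                   UsedWalltime     = c(0.25, 20, 0.5))

jobs %>%
  filter(ResourceWalltime > 0) %>%   # drop invalid/negative requests
  summarise(mean.efficiency = mean(UsedWalltime / ResourceWalltime))
```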
For now we will simply observe some basic statistics, as the amount of utilized walltime in hours will be analyzed later in monthly grouped data.
```r
print(descr(helix.full$ResourceWalltime, stats = "common"),
      method = 'render', table.classes = 'st-small')
```
Statistic | ResourceWalltime |
---|---|
Mean | 110.47 |
Std.Dev | 1649.00 |
Min | -11.00 |
Median | 10.00 |
Max | 734647.00 |
N.Valid | 11991016 |
Pct.Valid | 99.61 |
Generated by summarytools 0.9.8 (R version 4.0.4)
2021-03-15
Statistic | Value |
---|---|
Mean | 39280.71 |
Std.Dev | 1279454.53 |
Min | -3.90 |
Median | 0.02 |
Max | 74016884.00 |
N.Valid | 11998868 |
Pct.Valid | 99.67 |
All of these statistics reflect the amount of resources consumed by the job. UsedCPU records how many hours of CPU time were utilized; UsedMemory, how much RAM the job used in kB; UsedVirtualMemory, how much virtual memory the job used on the nodes in kB.
This field records the exit code the job returned. An exit code of “0” represents a successful job; any other exit code represents a failure.
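This success/failure split underlies the monthly counts described earlier. A minimal sketch with toy exit codes (values modeled on the sample rows above, where -2 appears alongside 0):

```r
# Classify jobs by exit code: 0 is success, anything else is a failure.
exit.status <- c(0, 0, -2, 1, 0)

n.success <- sum(exit.status == 0)
n.failed  <- sum(exit.status != 0)
failure.rate <- n.failed / length(exit.status)  # 2/5 = 0.4
```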
All-time

[Frequency tables: exit codes, Helix and Cadillac]

Last 3 Months

[Frequency tables: exit codes, Helix and Cadillac]