Maryland Green Jobs

Job Information

Johns Hopkins University Systems Engineer in Baltimore, Maryland

General Summary/Purpose:

The HPC Systems Engineer for the Maryland Advanced Research Computing Center (MARCC) provides Linux systems management for a clustered HPC system with over 23,000 cores and several petabytes of storage, serving the HPC and data-intensive science needs of researchers in the UMCP and Johns Hopkins schools. The Systems Engineer contributes to the planning, design, testing, organization and implementation of cutting-edge technology projects for the facility. The systems team is responsible for the day-to-day administration of HPC clusters, high-performance storage systems, backups, networking, security and any other services related to the operation of a large HPC center.

Specific duties & responsibilities:

70% -- Install, Configure, Maintain and Troubleshoot

  • With Senior Systems Engineer, design, organize, plan, test and implement cutting-edge hardware designs for an HPC environment

  • Maintain monitoring systems to facilitate quick, proactive responses to routine failures, and to provide comprehensive performance data logging.

  • Provide general system administration backup and escalation for other staff.

  • Consult with and provide expertise to building engineers and other staff on new facilities to be under control of MARCC.

  • Assist with facilities-related issues that directly affect MARCC

  • Automate user account creation, management, and purging

  • Manage data access restrictions on a per user and group basis

  • Implement and maintain monitoring measures for data and system access

  • Audit and maintain user access, authorization and authentication

  • Plan and deploy patches and updates to the operating system and application software.

  • Maintain an effective schedule for systems backups and archive operations for mission critical systems.

  • Other Systems Tasks as assigned by supervisor

20% -- Analysis and Design

  • With an understanding of HPC technical needs, work closely with Sr. Systems Engineer and the facility’s director and oversight groups to successfully implement policies and procedures.

  • Document systems processes so that users can easily find useful information and other IT staff can perform routine tasks and provide backup.

  • Participate in the design of future clusters, and plan the retirement of aging systems.

  • Offer technical advice on new projects that directly involve HPC computing at Hopkins.

  • Research and implement new technologies that could be beneficial to HPC.

  • Test and vet new technology in support of HPC efforts

  • Work with vendors to procure prototypes and demo units.

10% -- Training/Education

  • Continuously evaluate new tools and technologies for use in existing and future clusters.

  • Attend department and University-sponsored training to increase knowledge, improve skills, and learn new skills. May substitute University training for supervisor-approved commercial job-related course offerings.

Position Overview:

The Maryland Advanced Research Computing Center (MARCC) is a state-of-the-art High Performance Computing (HPC) facility that provides resources (HPC, storage and analytics) for researchers at Johns Hopkins University, the University of Maryland at College Park and, eventually, all other schools in the state of Maryland. The Systems Engineer contributes to the planning, design, testing, organization and implementation of cutting-edge technology projects for the facility. The systems team is responsible for the day-to-day administration of HPC clusters, high-performance storage systems, backups, networking, security and any other services related to the operation of a large HPC center.

Describe the position’s roles & interactions:

Systems support

  • With Senior Systems Engineer, design, organize, plan, test and implement cutting-edge hardware designs for an HPC environment

  • Ensure solutions released to the community are stable and usable.

  • Plan and deploy patches and updates to the operating system and application software

  • Ensure resources meet the community’s needs and are highly available to the group with limited interruption.

  • Automate user account creation, audit and maintain user access, authentication and authorization

  • Maintain an effective schedule for systems backups and archive operations for mission critical systems.

General HPC support

  • Extensively document processes so that users can easily find useful information and other IT staff can perform routine tasks and provide backup.

  • Conduct extensive research to resolve HPC challenges

  • Work closely with the systems and application groups to successfully implement policies and procedures

  • Continuously evaluate new tools and technologies for use in existing and future clusters

  • Recommend solutions and new technologies

  • Provide required facility activity data for University and government reports.

Training/Education

  • Contribute to the development of materials and workshops describing best practices for application development

  • Attend department and University-sponsored training to increase knowledge, improve skills, and learn new skills. May substitute University training for supervisor-approved commercial job-related course offerings.

Describe the specific systems, applications, projects for which the position is responsible:

The MARCC high performance computing center consists primarily of a large compute cluster used for high-speed analysis of big data or the execution of highly complex models. The cluster runs CentOS Linux. In addition, there are two storage systems: a Lustre filesystem for data in use and a set of ZFS storage nodes for longer-term data storage. The ZFS storage is to be replicated via high-speed network to comparable storage at the Homewood Campus. The position is responsible for the monitoring and maintenance of the compute nodes, including OS maintenance and security monitoring, as well as the filesystems.

Describe scale/size of area, project and/or system supported (# of users, # of servers, # of machines, # of systems supported, transaction volume, # of schools/areas that use system, # of environments, geography, # of interfaces/integration with other systems, etc.):

Blue Crab is the main cluster at MARCC with over 23,000 cores (June 2018) and a combined theoretical performance of over 1.4 PFLOPS. The compute nodes are a combination of Intel Ivy Bridge (large memory nodes), Haswell, Broadwell and Skylake processors and several Nvidia K80/P100 GPUs, linked via FDR-14 InfiniBand interconnects. It also features two types of storage: 2 PB Lustre (IEEL) and 14 PB ZFS on Linux.

The standard compute nodes are Intel Xeon E5-2680v3 (Haswell, 12 cores per CPU), E5-2690v4 (Broadwell, 14 cores per CPU) and Gold 6126 (Skylake, 12 cores per CPU) with 128/96 GB DDR4, running at 2.5/2.6/2.6 GHz (marked TDP frequency) or 2.1/2.6/2.3 GHz AVX2 (AVX512) base frequency.

The large memory nodes are Dell PowerEdge R920 servers with quad Intel “Ivy Bridge” Xeon E7-8857v2 (3.0 GHz, 12 core, 30 MB, 130W). Each node has 1024 GB RAM.

The GPU nodes are Dell PowerEdge R730 servers with dual Intel Haswell Xeon E5-2680v3 (12 core, 2.5 GHz, 120W; AVX frequency: 2.1 GHz), 128 GB of 2133 MHz DDR4 RAM, and two Nvidia K80s per node.

The FDR-14 InfiniBand topology is 2:1 with 56 Gbps bandwidth. The Lustre file system provides an aggregate bandwidth of 25 GBps (read) and 20 GBps (write).

The MARCC facility is used by researchers throughout JHU, as well as researchers at University System of Maryland institutions.

Minimum qualifications (mandatory):

  • Bachelor’s degree required.

  • Five years related experience.

  • Additional education can be substituted for experience and additional experience can be substituted for education.

Preferred qualifications:

  • Five years' experience managing Linux servers as a high-level Linux system administrator.

  • Experience with open source software compilation.

  • In-depth knowledge of TCP/IP networking and related protocols, InfiniBand, etc.

  • Excellent scripting skills (Python, Perl, shell).

  • Ability to maintain confidentiality.

  • Excellent customer service skills.

  • Excellent communication skills.

  • Must demonstrate strong critical thinking and analytical reasoning.

  • Programming skills in C, C++, or a scientific language, desired but not required.

  • Experience with MySQL or MariaDB database programming, desired but not required.

  • Expert-level knowledge of configuration management and monitoring tools (Puppet, Nagios, etc.).

  • Familiarity or experience with data subject to restrictions, desired but not required

Special knowledge, skills, and abilities:

  • Must have strong knowledge of Linux systems administration, including familiarity with cluster configurations

  • Experience with configuration management tools, such as Bright, xCAT, Puppet, IPMI, ROCKS, for maintenance of Linux clusters, supercomputers, storage systems, and smaller systems

  • Strong knowledge of networked filesystems, such as NFS and ZFS

  • Strong knowledge of Active Directory for user management, authentication and authorization

  • Strong critical and analytical skills

  • Ability to work on multiple priorities effectively

  • Ability to complete tasks in a timely fashion

  • Ability to work collaboratively as a team member

List required & preferred skills specific to position:

  • Experience as a high-level Linux system administrator.

  • Experience with open source software compilation.

  • In-depth knowledge of TCP/IP networking and related protocols, InfiniBand, etc.

  • Excellent scripting skills (Python, Perl, shell).

  • Ability to maintain confidentiality

  • Excellent customer service skills

  • Excellent communication skills

  • Must demonstrate strong critical thinking and analytical reasoning

  • Understanding of massive high-performance parallel storage and methodologies.

  • Understand, implement, troubleshoot, and support batch and workload management systems, including diagnosis of failed jobs, implementation of policies, and investigations of new features and services.

  • Install and configure infrastructure applications by following industry best practices to deliver effective solutions.

  • Must have the ability to multi-task and prioritize.

  • Must be adaptable and able to meet conflicting deadlines.

  • Exceptional organizational skills.

  • The ability to interact with peer institutions to support HPC directives effectively, furthering the goals of the MARCC facility.

  • Excellent oral and written interpersonal skills for customer service, training, evangelism of new technologies, negotiation, and persuasion.

  • Produce effective and thorough technical documentation.

  • Provide outstanding direct and indirect user support.

  • Research, recommend, and implement new technologies based on the value to the research facility.

  • Object oriented design experience

  • Experience with industry-standard software development tools (e.g., Subversion, Eclipse)

  • Understanding of software lifecycle, design, implementation, testing

Classified Title: Systems Engineer

Working Title: Systems Engineer

Role/Level/Range: ATP/04/PE

Starting Salary Range: $69,140 to $95,005 annually

Employee group: Full Time

Schedule: M-F, 8:30 a.m. to 5 p.m.

Exempt Status: Exempt

Location: 01-MD:Homewood Campus

Department name: 10001346-Dean Office of

Personnel area: School of Arts & Sciences

The successful candidate(s) for this position will be subject to a pre-employment background check.

If you are interested in applying for employment with The Johns Hopkins University and require special assistance or accommodation during any part of the pre-employment process, please contact the HR Business Services Office at jhurecruitment@jhu.edu. For TTY users, call via Maryland Relay or dial 711.

The following additional provisions may apply depending on the campus at which you will work. Your recruiter will advise accordingly.

During the Influenza ("the flu") season, as a condition of employment, The Johns Hopkins Institutions require all employees who provide ongoing services to patients or work in patient care or clinical care areas to have an annual influenza vaccination or possess an approved medical or religious exception. Failure to meet this requirement may result in termination of employment.

The pre-employment physical for positions in clinical areas, laboratories, working with research subjects, or involving community contact requires documentation of immune status against Rubella (German measles), Rubeola (Measles), Mumps, Varicella (chickenpox), Hepatitis B and documentation of having received the Tdap (Tetanus, diphtheria, pertussis) vaccination. This may include documentation of having two (2) MMR vaccines; two (2) Varicella vaccines; or antibody status to these diseases from laboratory testing. Blood tests for immunities to these diseases are ordinarily included in the pre-employment physical exam except for those employees who provide results of blood tests or immunization documentation from their own health care providers. Any vaccinations required for these diseases will be given at no cost in our Occupational Health office.

Equal Opportunity Employer

Note: Job Postings are updated daily and remain online until filled.

EEO is the Law

Learn more:

https://www1.eeoc.gov/employers/upload/eeocselfprint_poster.pdf

Equal Opportunity Employer:

Johns Hopkins University is an equal opportunity employer and does not discriminate on the basis of race, color, gender, religion, age, sexual orientation, national or ethnic origin, disability, marital status, veteran status, or any other occupationally irrelevant criteria. The university promotes affirmative action for minorities, women, disabled persons, and veterans.
