HPC Systems Engineer III
NCAR - National Center for Atmospheric Research
Boulder, Colorado (primary) & Cheyenne, Wyoming (sometimes)
Session: Job Fair
Event Type: Job Posting
Registration Categories: TP, W, TUT, XO / EX
Time: Monday, 15 November 2021, 12pm - 3pm CST
Location: Online

Description
What you will do:

As part of the High Performance Computing Systems Group (HSG), provides system engineering leadership and support for the Computational & Information Systems Laboratory’s (CISL) high-performance supercomputers, block and object storage systems, data archival systems, high-performance networks, and data transfer services. The environment is composed of multi-vendor resources with numerous specialized hardware components and requires coordination and communication with the other groups and divisions within CISL.

The primary job location is Boulder, Colorado. The production systems supported are located at the NCAR Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming. May be required to work at the NWSC during periods of system installation, system upgrade, or system troubleshooting.

Responsibilities:

Software Engineering and Development
Develops, implements, and documents new features or capabilities in system administration and system monitoring software. Develops and maintains systems software as necessary for the deployment and management of all aspects of high-performance supercomputers, clusters, storage, and network fabrics. Develops and maintains security monitoring and analysis software. Performs installation and necessary hardware and software integration as part of the HPC infrastructure deployments and upgrades. Helps define group standards and guidelines for software development and documentation. Leads software development projects including requirements gathering, design, and project management. Writes code to enhance system management capabilities of the HPC infrastructure and automate repeatedly performed system administration tasks. Manages, designs, and develops benchmarking tool suites for use during procurement and for ongoing performance monitoring of the high-performance computing environment. Develops acceptance testing criteria and applications for system procurement.

Research and Evaluation
Researches new and emerging technology (e.g., cloud), evaluates the potential impact of the new hardware and software technology on workflows and plans, and makes recommendations to the HPCD division and CISL management for future procurement of hardware and software products, configurations, and functional enhancements or upgrades in support of the high-performance computing environment. Performs evaluations and benchmarks, and compiles reports on new hardware and software systems related to the high-performance computing environment (i.e., computing, storage, networking).

Participates in projects relating to the high-performance computing environment and may have direct responsibility for design and procurement decisions. This may include development of systems level code to support the various aspects of the HPC infrastructure software and hardware. Participates in the RFP process by contributing to the technical specification, requirements definition, review, decision making, acceptance, and implementation for future procurement.

Operational Monitoring and Troubleshooting
Operates and monitors the behavior of the group-managed supercomputers, clusters, servers, storage, and network fabrics on a routine, daily basis to ensure proper and efficient operations. Alerts other HPC Systems Group staff, vendor representatives, and/or NWSC staff of abnormal conditions or behaviors, as appropriate, and takes remedial actions as necessary. Diagnoses and may repair failed software and/or hardware components, or may mentor or assist other staff in doing so.

Provides service on a 7x24 on-call basis, troubleshooting and resolving system-related problems presented by users, other sections in CISL, and vendor-employed engineers and analysts. Refers and escalates problems to senior members of the HPC Systems Group or appropriate staff as necessary. Documents troubleshooting and operational techniques and best practices, and mentors other team members when necessary.

Systems Administration
Provides systems support for diverse hardware and software architectures. Leads the installation and upgrades of system hardware and software, including computational systems, clusters, standalone machines, storage systems and a variety of network fabrics including Ethernet, InfiniBand, and Fibre Channel. Helps define standards and guidelines for operation and maintenance, and produces systems operation and procedural documentation. Compiles, installs and maintains commercial and open source application software. Documents system administration tasks and mentors other team members when necessary.

Project Management
Leads team projects utilizing standard project management tools and techniques. Under the direction of the HSG group lead, provides project coordination, technical expertise, and planning for system deployment projects. Develops budgets, project timelines, and task structures for the group. May guide and review the tasks of team members as necessary. May participate in cross-group and cross-division projects as necessary, including taking a lead role.

Organizational Representation and Reporting
Provides regular HSG activity reports to management and may contribute to CISL or NCAR annual reports and development plans. Attends group, division, and laboratory meetings and may represent HSG and its activities at such meetings. May represent the group at larger organizational meetings and broader community events as appropriate.

Supervisory responsibilities: None

Work location requirements:

This position is expected to support a hybrid format (remote and in-person work) with some days each week in-person at the primary Boulder, CO office. Production systems supported are located at the NCAR Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming, and the Systems Engineer will be required to work at the NWSC to assist with new supercomputers, storage commissioning, major upgrades, outages, downtimes, etc.

Salary:
Hiring Range $100,326 - $125,407
Full Salary Range $100,326 - $163,030
Requirements
What You Need:

Bachelor’s degree and eight to twelve years of progressive experience or equivalent combination of education and experience in one or more of the following fields: Computer Science, Mathematics, Computer/Electrical Engineering, Information Sciences, Software Engineering, or equivalent related field.

Experience is desired in the following areas:
Experience with the Linux operating system environment with an emphasis on networking, storage, and performance
Experience with installation and management in a Linux-based supercomputer environment
Experience with job scheduling systems (e.g., PBS, SLURM)
Experience with clustered file systems (e.g., Spectrum Scale, Lustre)
Experience with at least one form of HPC storage (e.g., block, object, cloud, or tape)
Experience with clustered system management tools and techniques
Experience with version control and configuration management systems
Experience in Linux scripting languages and at least one higher level programming language
Experience with disaster recovery of single image systems and clusters

REQUIRED
Demonstrated skill in the installation, configuration, administration, troubleshooting, and securing of compute clusters
Demonstrated skill in the administration of high-performance storage systems including clustered/parallel file systems and/or tape-based storage systems
Demonstrated skill in the configuration and troubleshooting of high-performance network fabrics (e.g., Ethernet, InfiniBand)
Demonstrated skill with containers and virtualization
Demonstrated skill in common scripting and programming languages (e.g., ANSI/GNU C, Python, Perl) and general software engineering practices
Demonstrated skill in performance tuning and benchmarking I/O systems and networks
Demonstrated skill in performing tasks requiring organization and attention to detail
Excellent written and verbal communication skills and the ability to write and interpret systems documentation
Communicates effectively with lab and/or program. May communicate with entire organization. Able to explain concepts with high technical complexity to others of various technical backgrounds. This may include risks, control, and impacts. Employs active listening to lab or program needs to create solutions to technical problems at a high level of complexity. Makes formal presentations at lab or program level and advocates for proposed solutions.
Ability to work collaboratively with teams of different skill levels and backgrounds
Ability to mentor team members and collaborators
May supervise and mentor technical staff, including other software engineers, support staff, or groups
Ability to function effectively within a matrixed, multidisciplinary team
Represents the organization as the prime technical contact on contracts and projects. Interacts with senior external personnel on significant technical matters often requiring coordination between organizations. Maintains professional contact with members of industry and sponsors. May interact at national level with sponsors/presentations.

DESIRED BUT NOT REQUIRED
Experience with high-performance computing, supercomputers, and related technologies
Experience with GPU administration, performance tuning, and system support
Knowledge of Artificial Intelligence/Machine learning techniques
Experience with public and/or private cloud systems
Experience with hardware troubleshooting & component replacement
Experience with documentation and presentation development
Experience with project management and leading small project teams
Experience with writing technical proposals and reports
Experience with procurement and budget management

OTHER REQUIREMENTS:
Occasional travel to the NCAR Wyoming Supercomputing Center, which is approximately 90 miles north of Boulder
Periodic 7x24 on-call support in rotation with other staff
Providing assessment and feedback on vendor technology roadmap, RFI/RFP to the HSG group head and the HPCD division director

Please note that while the position description details both minimum requirements as well as desired skills and experience, we want to remind applicants that you do not need to have all the desired skills and experience to be considered for this role. If you have the passion for the work along with experience in a related field, you are encouraged to apply. We can provide on-the-job training for the rest.
Company Description
Where You Will Work:

Located in Boulder, Colorado, the National Center for Atmospheric Research (NCAR) is one of the world’s premier scientific institutions, with an internationally recognized staff and research program dedicated to advancing knowledge, providing community-based resources, and building human capacity in the atmospheric and related sciences. NCAR is sponsored by the National Science Foundation (NSF) and managed by the University Corporation for Atmospheric Research (UCAR).

In addition, NCAR’s Computational and Information Systems Laboratory (CISL) is a leader in supercomputing and data services necessary for the advancement of atmospheric and geospace science. CISL’s mission is to remain a leader at the forefront of ensuring that research universities, NCAR, and the larger atmospheric, oceanographic, and related research communities have access to the computational resources they need for their research. To fulfill the need for a stronger workforce at the intersection of High Performance Computing (HPC) and geoscience problems, CISL engages in education and outreach activities to inspire and attract a diverse future workforce.

Benefits:
UCAR’s rich package of employee benefits includes medical, dental, vision, education assistance, retirement, and life insurance. UCAR offers a variety of programs designed to assist with work-life balance including flexible work alternatives, paid time off and 14 weeks of paid parental leave. To learn more about our benefits, see: https://www.ucar.edu/opportunities/careers/benefits

Vaccine Requirements:
As required by Executive Order 14042, all Federal Contractor employees are required to be fully vaccinated against COVID-19 regardless of the employee’s duty location or work arrangement (e.g., telework, remote work, etc.), subject to such exceptions as required by law.

Effective immediately, UCAR requires all new employees to be fully vaccinated prior to entering on duty, subject to such exceptions as required by law. If selected, you will be required to be vaccinated against COVID-19 and submit documentation of proof of vaccination by December 21, 2021 or before appointment or onboarding with UCAR, if after December 21, 2021. UCAR will provide additional information regarding what information or documentation will be needed and how you can request a legally required exception from this requirement.
2021-11-17