CRC Looking Forward to 2025

CRC is always working to improve our service and capabilities, and to innovate. During 2024, we distributed a survey specifically for MPI users and a wide-ranging survey for all users. We closely studied the feedback and worked to correct the problems our users encounter. Our roadmap for 2025 and beyond reflects both what we heard from users about existing concerns and our team’s exploration of the forefront of computing, such as generative AI.

ChatCRC
One of the most significant new offerings will be ChatCRC, an internal large language model (LLM) to augment our support site. ChatCRC will be built upon open-source foundational LLMs, and we will use AI techniques such as RAG (Retrieval-Augmented Generation), LoRA (Low-Rank Adaptation of Large Language Models), and fine-tuning to provide CRC-specific context to the LLMs.
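
To give a feel for what RAG means in practice, here is a minimal sketch of the retrieval step: find the stored passage most similar to a question and place it in the prompt as context. This is not ChatCRC's implementation. The function names (`retrieve`, `build_prompt`), the bag-of-words similarity, and the placeholder passages in `DOCS` are all assumptions made up for illustration; a production system would more likely use learned embeddings and a vector index.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Prepend the retrieved passages to the question as context for an LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical placeholder passages, not real CRC documentation.
DOCS = [
    "Placeholder note about submitting a MATLAB batch job with a job script.",
    "Placeholder note about requesting a storage quota increase.",
    "Placeholder note about logging in to the JupyterHub portal.",
]

QUESTION = "How do I create a job submission script for MATLAB?"
print(build_prompt(QUESTION, DOCS))
```

The prompt that comes out carries the retrieved passage ahead of the question, which is how site-specific context reaches a general-purpose model without retraining it.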

ChatCRC will become an integral part of CRC’s support. It will be trained on data from the CRC context, including help tickets, and be accessible via a portal with a user interface. For example, you could ask ChatCRC how to create a job submission script for MATLAB, and it would be able to provide you with a usable job submission script. Essentially, you will be able to ask any question that would go into a help ticket submission.

Our aim is that ChatCRC will be able to provide not a generalized answer, but an answer that aligns with the specific setup of the CRC ecosystem. ChatCRC will be a co-pilot to augment our user support site. Currently, when we respond to a ticket, we first ask whether you have read the manual. With the launch of ChatCRC, we will ask whether you have read the manual and chatted with ChatCRC. If you still have an issue, one of our consultants will work with you to find a solution.

We envision that ChatCRC will be able to address many of the problems, both simple and hard, that you may encounter using our system, and that it will learn and improve over time.

Teach Cluster
CRC offers computing resources for coursework as well as for research. After hearing feedback from instructors and students who sometimes have difficulty accessing resources for term projects, we have created a dedicated teaching cluster with access to CPUs and GPUs from a new JupyterHub portal, https://jupyter.crc.pitt.edu. When it’s midnight and you are trying to finish homework and need computing resources, CRC has got you covered.
  
Improved User Manual
One significant comment we received in the user survey was that our User Manual and Documentation were hard to use. Users pointed out that information is sometimes outdated because we have changed the cluster settings. We are responding to this feedback in a number of ways. The first is a new streamlined site, which will prune away many of the items that have accumulated over the life of CRC. The User Manual has also migrated from the web pages to GitHub, which is more searchable than the previous site. ChatCRC will be trained using material from our User Manual, hence the importance of making sure the material is up to date and accurate.

Improved Storage Addresses Performance Issues
Survey respondents also cited performance issues, such as when data-intensive processes like high-throughput computing and AI bring the file system to a crawl and impact all users. To address the need for performant storage, we are working with Pitt IT to engage a vendor to provision new storage, targeting a system that can accommodate high IOPS, high bandwidth, and scalability. The aim is to create storage with unified performance, rather than relying on the CRC team customizing (babysitting, really) the existing storage systems. This will free up the team to dedicate their expertise and creativity to other forward-looking initiatives.

Replacing Technology via the Tic-Toc Cadence
CRC has created a plan in which each technology is replaced over five years, but with two separate purchases for each technology. For MPI, for example, we make a new purchase in year one. That’s tic. Then in year three we start a new purchase. That’s toc.

In this way, CRC straddles the gap between old and new technology. Rather than making one big purchase in year one and running the technology for five years, by which time the hardware starts showing its age, we will split purchases so that we have new hardware in years one and three.
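
The effect of splitting purchases can be worked out with a little arithmetic. The sketch below compares the age of the newest hardware in service under a single year-one purchase and under the tic-toc plan. The helper name `newest_hardware_age` is hypothetical, and the sketch only looks at one five-year window; it makes no claim about budgets or about how later cycles are scheduled.

```python
def newest_hardware_age(purchase_years, year):
    """Age in years of the most recently purchased hardware in service during `year`."""
    bought = [p for p in purchase_years if p <= year]
    if not bought:
        raise ValueError("no hardware has been purchased by this year")
    return year - max(bought)


SINGLE = [1]     # one big purchase in year one
TIC_TOC = [1, 3]  # tic in year one, toc in year three

print("year  single  tic-toc")
for year in range(1, 6):
    single_age = newest_hardware_age(SINGLE, year)
    split_age = newest_hardware_age(TIC_TOC, year)
    print(f"{year:>4}  {single_age:>6}  {split_age:>7}")
```

Under these assumptions, the newest hardware available in year five is two years old with the split plan, compared with four years old after a single purchase.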

Updating Red Hat
Red Hat Enterprise Linux is the operating system for CRC compute nodes. We’re running Version 7, which is officially at end of life with no more support. This ties our hands in terms of some of the software that we can install as well as some of the new modalities for access. We are planning to update to the most recent version of Red Hat in the summer of 2025.

We plan to stage the update as much as possible so that it will be just a one-day outage, but it may become a two- to three-day outage. After the upgrade, there may be downstream side effects, such as needing to update software that you were hoping to use. Our team will help address any of these downstream side effects.

- Brian Connelly

 

Tuesday, October 22, 2024