by Scott McCarty

What generative AI can do for sysadmins

Aug 07, 2024

For IT admins, engineers, and architects, language models will save time and frustration and increase confidence in troubleshooting, configuration, and many other tasks. Here are six ways they’ll make operations easier.


With any mention of generative AI, or AI in general, we need to be clear that our goal should be to make people’s lives better, not just to cut costs. From my perspective, we need to focus on how this technology—and yes, I view AI as an enabling technology, not a product—enables really smart people who have really stressful jobs: systems administrators, site reliability engineers, enterprise architects, network admins, database admins, and others, sometimes collectively referred to as operations people. These are my people, so I will address their concerns.

Generative AI is a maturing technology. Now that we’ve had a minute (just!) to sit with it, we’re coming to understand when—and for what—it should and shouldn’t be used. For operations people like systems administrators, generative AI has the potential to save time and frustration, as well as to increase confidence. Here are six ways that generative AI and language models can make sysadmins more productive and precise, framed by a discussion of the importance of balancing the use of generative AI with human intervention.

Performing log analysis

Operations people, be they sysadmins, network admins, or database admins, are constantly reviewing system logs to identify patterns among millions of alerts. They learn over time when 40,000 of one type of alert, 30,000 of another, and 5,000 of another are OK, but also when an anomalous alert is cause for serious alarm. (Actually, the onesies and twosies are the alerts that usually spell real trouble.) There are products that can help with this analysis, but in the end it’s a human who is making the call, and it’s a human who ought to continue making the call.

Generative AI doesn’t change the need for this kind of human intervention, but it can help humans analyze alerts more efficiently and confidently. Because, really, what are system, network, and database admins doing when they research an alert? They are training themselves by doing research—typically through a Google search. They look up the definition of an alert, then synthesize its meaning in the context of their specific infrastructure or environment. Sounds a lot like training a language model, right? Precisely!

Determining whether a log entry is good, bad, or indifferent has always been something of an art form. For example, when restoring a service after an outage at 2 a.m., a human never has 100% confidence. We kinda “think” the service is up again, we check the logs, give them a glance, and declare the service “back to normal,” meaning not necessarily perfect, but working again. At some point, we go back to bed!

The same is true when analyzing logs for “error messages.” A human programmer writes the English error messages, and a human admin analyzes those error messages in a log somewhere. The communication between these two parties is imperfect at best, and that imperfection can create false positives and slow an administrator down. The admin can’t be sure, and neither can a language model, but language models can increase the chances that humans make the right call about suspicious-looking alerts. This is a potential bionic ability for humans, and hopefully a win that makes their lives better during working hours, or at 2 a.m. during an outage.
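
To make this concrete, here is a minimal sketch, in Python, of what LLM-assisted log triage could look like. The ask_model() helper is hypothetical and stubbed out, standing in for whatever model you actually have access to; the pattern, not the plumbing, is the point: bucket alerts by signature, and ask the model only about the rare ones.

    import collections
    import re

    def ask_model(prompt: str) -> str:
        # Hypothetical helper: replace the body with a call to whatever
        # model you actually use (local or hosted). Stubbed here so the
        # sketch stays self-contained and runnable.
        return "(model response would appear here)"

    def triage(log_lines: list[str], rare_threshold: int = 5) -> None:
        # Group lines by a rough signature: strip digits so timestamps,
        # PIDs, and counters don't make every line look unique.
        def signature(line: str) -> str:
            return re.sub(r"\d+", "N", line).strip()

        counts = collections.Counter(signature(line) for line in log_lines)
        for sig, count in counts.items():
            if count <= rare_threshold:  # the onesies and twosies
                answer = ask_model(
                    f"This alert appeared only {count} time(s) in a log:\n"
                    f"{sig}\nWhat might it indicate? Be brief."
                )
                print(f"[{count}x] {sig}\n  -> {answer}\n")

The model narrows the haystack; the human still makes the call.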

Generating config files

Raise your hand if you enjoy generating config files from scratch. Anyone? Anyone? Format complexity, baroque syntax, variations in syntax between software versions, environment-specific requirements, validation, security concerns, integration issues … the list of challenges goes on and on. And getting just one server personality (Bind, Apache, Nginx, Redis, etc.) to do all the things it needs to do in a production environment can take a combination of five, 10, or 20 different config files in total. The admin has to make sure the network interface, DNS, NTP, web server, and so on are all configured perfectly.

For all of these reasons, using a language model to generate config files is pretty awesome—a huge time saver, potentially trimming hundreds of human work hours down to just a few. It’s not OK, however, to leave it completely to generative AI to generate config files. Humans must review and validate files to ensure that they address organization-specific factors, for example, or comply with industry standards and regulatory mandates. A human also needs to make sure config files are documented to help avoid problems with future translation. (See “Translating config files” below.)
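
As a hedged illustration of that division of labor, the sketch below drafts a config with a model, then machine-checks the syntax before a human ever reviews it. The ask_model() helper is again hypothetical, and nginx -t is just one example of a validator; the point is that generated files get validated, then reviewed.

    import subprocess
    import tempfile

    def ask_model(prompt: str) -> str:
        # Hypothetical helper: wire this to your model of choice.
        return "# model-drafted config would appear here\n"

    prompt = (
        "Generate an nginx server block that redirects HTTP to HTTPS "
        "for example.com and proxies / to http://127.0.0.1:8080."
    )
    draft = ask_model(prompt)

    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
        f.write(draft)
        path = f.name

    # 'nginx -t -c <file>' tests a config's syntax without touching the
    # running service; nginx writes the test results to stderr.
    result = subprocess.run(["nginx", "-t", "-c", path],
                            capture_output=True, text=True)
    print(result.stderr)
    # A clean syntax check is not a review: a human still signs off.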

We know this is coming because, if you look at GitHub Copilot or Ansible Lightspeed, language models are already generating formal language syntax such as Python, Ruby, and JavaScript. Extending this to even more limited syntaxes like config files should be an easy win in the coming months and years. Better still, Ansible Lightspeed cites its work, showing what source code it was trained on, a feature I think we should all demand of any syntax-generating tool.

Translating config files

If generating config files is a drag, translating them might be even worse. Say you’re upgrading a server or software, and the config file format changes just a little bit. You need to translate the existing config file to a newer format to ensure that the service (Apache, Redis, Nginx, etc.) will start and run properly. You have to maintain the functional integrity of the config file—which was written by a person doing his or her own thing, maybe years ago—while accurately translating what needs to be changed.

In my days as a sysadmin, I spent many, many (many!) hours in frustration, hacking my way through painful config files: Sendmail, Bind, Apache redirects, anybody? (And people wonder why “make sure your applications are updated” isn’t the easiest security best practice to follow.) How can a language model help? Again, we’re looking for more confidence, not 100% confidence, but a machine learning model could easily tell you which config options have been deprecated and which new ones are in place. The sysadmin must be the final filter on what’s OK, but a machine learning model can provide support along the way.

Also, I’ll leave you with a best practice I’ve discovered from using large language models: Save the prompt text with the artifact you’re creating or translating. With images I generate for presentations, I put the “prompt text” in the speaker notes, but for a config file, I’d save it as a comment in the file. This will help future admins understand what you were thinking and trying to achieve, even if that future sysadmin is you. Come on, you’ve all looked at your code six months later, cussed to yourself, and cursed whoever wrote it, only to do a “git blame” and discover that person was you. 🙂
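
A minimal sketch of that habit, using a save_with_prompt() helper of my own invention: write the translated (or generated) file with the prompt preserved in a comment header, so the next admin can see exactly what the model was asked to do. The Redis directives in the example are real; everything else is illustrative.

    import datetime

    def save_with_prompt(path: str, prompt: str, body: str) -> None:
        # Prepend the prompt as a comment header so future admins
        # (possibly you) know what the model was asked to do, and when.
        header_lines = [
            f"Generated {datetime.date.today()} with an LLM. Prompt:",
            *prompt.splitlines(),
            "Reviewed by a human before deployment.",
        ]
        header = "\n".join(f"# {line}" for line in header_lines)
        with open(path, "w") as f:
            f.write(header + "\n\n" + body)

    save_with_prompt(
        "redis.conf",
        "Translate this Redis 6 config to Redis 7 and flag any "
        "deprecated directives.",
        "maxmemory 2gb\nmaxmemory-policy allkeys-lru\n",
    )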

Providing ‘peer’ perspective

The best advice comes from people who have been there, done that, but every software upgrade in your specific environment is uncharted territory. There are always little nuances and specifics about the standard operating environment (SOE) your company uses, or worse, one-off changes to that SOE that a specific workload had to make to ensure that workload runs well (e.g., disabling SELinux).

Sure, you can comb Reddit, LinkedIn, and other places admins gather. You can try to read between the lines of vendor-supplied (and therefore super-positive) use cases. Here’s a dirty secret: Although vendors try their best, they can never test the exact permutation of software and configuration that you have in your specific environment.

Synthesizing all the information you need to know for your specific environment is challenging, to say the least. That’s where language models can come in. Without giving up your own company data, you could use ChatGPT, Bard, Perplexity, or even a local model like Granite or Mistral to ask what kind of experience companies have had when moving from, say, a specific public cloud to a hybrid model using a specific hardware vendor on-premises, or from one version of a software platform to another.

These types of “stories” generated by AI can be quite powerful, because large language models are actually quite good at statistically scraping Reddit, Stack Overflow, blogs, and the like, and weaving the results into a narrative that surfaces themes, saving you hours of research work. I have successfully used LLMs to uncover common gotchas and best practices, providing plenty of food for thought. That said, you should verify the narrative responses you get. Trust, but verify, of course. Using LLMs in this cautious way has given me confidence when I’m making tough architectural decisions.

Powering shells

I’m starting to see examples of language models being integrated into shells and CLIs. These are such an elegant place to use LLMs because shell commands have developed organically over 30 or 40 years. The commands themselves have terse syntax that is difficult to understand, and the man pages typically aren’t a lot of help. Often a command and its man page were written by a single person to remind themselves how to use the tool they wrote, and not all of us are UX geniuses. If our goal is to bring people along and enable Linux for a wider audience, LLM-enabled shells are a perfect way to make people feel more comfortable.

LLM-enabled shells ease the transition for new users and help long-time users alike. I can’t count the number of times I’ve forgotten the exact syntax to do something I haven’t done in a while, especially when it requires complex options. LLM-enabled shells work by allowing users to interact with the computer using natural language. For example, a user can ask “Which files in this directory are the oldest?” or “Find all of the files larger than 237MB” or “Remove the numbers from the names of all of the files in this directory.” These kinds of commands can be very complex to construct with awk, sed, and bash, using cryptic syntax long forgotten by most of us.
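
Here is a minimal sketch of the pattern. The ask_model() helper is hypothetical and stubbed with a canned GNU find command for the 237MB example above; a real implementation would call an actual model. The important part is the confirm-before-run loop: the model proposes, the human disposes.

    import subprocess

    def ask_model(prompt: str) -> str:
        # Hypothetical helper, stubbed with a canned answer so the
        # sketch runs; a real model would translate arbitrary requests.
        return "find . -maxdepth 1 -type f -size +237M"

    def llm_shell(request: str) -> None:
        command = ask_model(
            f"Translate into a single shell command, no explanation: {request}"
        )
        print(f"Suggested: {command}")
        # Never auto-execute a generated command; show it and ask first.
        if input("Run it? [y/N] ").strip().lower() == "y":
            subprocess.run(command, shell=True)

    llm_shell("Find all of the files larger than 237MB in this directory")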

While it used to be a “flex” to show people how good your awk skills are, today’s cloud operators and admins have to support so many pieces of technology that they can never become deeply expert at any one technology. Leveraging LLM-enabled shells, plugins, and wrappers will likely become key to supporting more and more technology.

Powering vendor software

In addition to publicly available tools such as ChatGPT and the use of internal language models, technology providers have started to build generative AI capabilities into their products and will continue to do so over time. The definitions of AIOps and MLOps seem to have quickly evolved to mean “LLM-enabled everything” related to infrastructure.

That said, I do see some interesting use cases coming. One can easily imagine that LLM-enabled tools will be able to help you generate an SOE for complex software like operating systems, enterprise databases, CRM software, and other large-scale systems.

Deploying large, enterprise workloads often requires reading piles and piles of documentation to garner requirements and best practices. Then several architects (enterprise, storage, network, and database) have to work together to synthesize the knowledge they’ve gained from the documentation with their specific security standards, compliance rules, network configurations, storage configurations, and architectural standards.

This is no small task…

I see a future where, instead of reading a reference architecture and piles of documentation for SAP, Oracle NetSuite, Microsoft SQL Server, or Red Hat Enterprise Linux and synthesizing how to put the pieces together yourself, you use an LLM-enabled tool provided by one of these companies to generate the configurations necessary to run the workload within your organization, with your security standards, your compliance requirements, and your network and storage needs. While this is still months or perhaps years in the future, it’s one to keep your eye on.

Language models for admins and architects

No matter how sysadmins consume language models and generative AI, it will be important to consider accuracy, performance, resource management, and data privacy, among many other factors. And, with the growing move to smaller, more purpose-built language models, system administrators should think about how multiple models can be used together. Moreover, these models will require the same life cycle work that applications need: upgrades, testing, replacement, and so on.

LLMs are an exciting technology for admins and architects because they work with natural language quite well. They work on stories, and stories are everywhere in our work, whether we realize it or not. All the error messages a programmer embeds in their application, which are later dumped into log files, are essentially stories told by those developers to the admin. They are not constrained by any rules, and they are not always accurate. They require interpretation and synthesis to truly understand. This is true for log files, but also for documentation, reference architectures, and so on. None of these sources of information has ever provided 100% accurate information.

But perhaps the most important common denominator among the suggestions I make here and any new use cases that arise in the future is that operations folks — and everyone else, for that matter — should not just hand over their power and responsibility to generative AI and large language models. And that’s not because you don’t want AI to take your job; it’s because AI can’t take your job. Rather, infrastructure and operations experts should look for ways that AI can be used to help them do their jobs … better.

Admins and architects have always been key, and they will remain key!

At Red Hat, Scott McCarty is senior principal product manager for RHEL Server, arguably the largest open source software business in the world. Scott is a social media startup veteran, an e-commerce old-timer, and a weathered government research technologist, with experience across a variety of companies and organizations, from seven-person startups to 12,000-employee technology companies. This has culminated in a unique perspective on open source software development, delivery, and maintenance.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.