Software services underpin reliable communication and networking, so Site Reliability Engineering (SRE) is essential for keeping these systems reliable and performant in cloud-native environments. SRE practice relies on tools such as Prometheus and Grafana to monitor system metrics, and on well-defined Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to maintain high service standards. However, many developers lack an in-depth understanding of these tools and of the intricacies involved in defining appropriate SLIs and SLOs. To bridge this gap, we propose SRE-Llama, a novel SRE platform enhanced by Generative AI, Federated Learning, Blockchain, and Non-Fungible Tokens (NFTs). The platform aims to automate and simplify monitoring, SLI/SLO generation, and alert management, making these tasks more accessible and efficient for developers. The system captures metrics from cloud-native services and stores them in a time-series database such as Prometheus or Mimir. Using this stored data, the platform employs Federated Learning models to identify the most relevant and impactful SLI metrics for different services and SLOs while addressing data-privacy concerns. A fine-tuned Meta Llama-3 LLM then generates SLIs, SLOs, error budgets, and associated alerting mechanisms from the identified SLI metrics. A distinctive aspect of the platform is that the generated SLIs and SLOs are encoded as NFT objects and stored on a Blockchain, which provides immutable record-keeping and facilitates verification and auditing of the SRE metrics and objectives. The platform's automation is governed by blockchain smart contracts. A prototype of SRE-Llama has been implemented with a use case featuring a customized Open5GS 5G Core.
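To make the terms SLO and error budget concrete, the following is a minimal Python sketch and not the SRE-Llama implementation. It assumes a request-based availability SLI (good events divided by total events). The names `SLO`, `error_budget`, `budget_remaining`, and `fingerprint` are hypothetical and chosen only for illustration. The `fingerprint` helper shows one plausible way to derive a tamper-evident digest of an SLO definition that could be recorded on a ledger; the abstract does not specify the actual NFT encoding.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO record: a named target for a ratio-style SLI."""
    name: str
    target: float  # required fraction of good events, e.g. 0.99


def error_budget(slo: SLO, total_events: int) -> float:
    """Number of bad events the SLO tolerates over total_events."""
    return (1.0 - slo.target) * total_events


def budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed = error_budget(slo, total_events)
    bad = total_events - good_events
    if allowed == 0:
        return 1.0 if bad == 0 else 0.0  # a 100% target leaves no budget
    return 1.0 - bad / allowed


def fingerprint(slo: SLO) -> str:
    """SHA-256 digest of a canonical JSON form of the SLO (assumed encoding)."""
    canonical = json.dumps(asdict(slo), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    slo = SLO(name="registration-availability", target=0.99)
    print(budget_remaining(slo, good_events=9_950, total_events=10_000))
    print(fingerprint(slo))
```

With a 99% target over 10,000 requests, the budget is about 100 failed requests, so 50 failures leave roughly half of the budget. An alerting rule would typically fire when the remaining fraction drops below a chosen threshold.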