The struggle of high-volume logging is real, and in response to the widespread interest in this topic we recently hosted a webinar titled 'Loki at Scale: Navigating High-Volume Logging Challenges.' Our guest for the event was Sreejith S., DevOps Lead at Capillary Technologies.
Managing approximately 1.5 TB of logs per cluster daily, Sreejith played a critical role in overseeing successful Loki adoption at Capillary, in close collaboration with the Facets Team.
Representing Facets, we were joined by Rohit Raveendran, Co-founder and VP Engineering, and Pramodh Ayyappan, DevOps Tech Lead. Together, they shared the learnings from implementing Loki.
The Challenge - Prelude to Loki Adoption
Capillary Technologies was facing challenges with its log management solution. Its log volume exceeded 1.5 TB per day, and the logs had to be retained for over a year for compliance and daily operations. Compared with that ingest volume, search traffic was almost non-existent: only a couple of hundred searches per day.
Adding to this challenge was the need for dedicated engineering manpower just to manage these logs. Their old solution was simple and reliable, but over time it proved expensive and hard to scale. With high ingest volume and very few searches, the ROI was poor. They were sitting on a data goldmine but could not use it for business metrics or alerts.
They needed a solution wherein a similar investment would yield higher returns and add to the overall efficiency of the team. The team embarked on an evaluative journey, sifting through options like ELK, Parseable, and New Relic, and eventually zeroed in on Loki. The decision hinged on Loki's scalability, cost-effectiveness, ROI, and seamless Grafana integration – all of which aligned perfectly with Capillary’s needs.
Core Differentiator: Loki’s Architecture
What sets Loki apart is its minimal indexing strategy: it indexes only the labels attached to each log stream, not the full log content. This approach not only boosts query speeds but also significantly reduces storage requirements. Inspired by Prometheus, Loki’s design is scalable and robust, making it a perfect choice for the large-scale logging challenges at Capillary. The four key differentiators were:
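To make the label-only indexing idea concrete, here is a toy Python sketch. It is not Loki's implementation. The class name, label names, and log lines are all hypothetical, and a real system stores compressed chunks in object storage rather than Python lists. It only shows the shape of the trade-off: a small index over labels selects candidate streams, and the text filter is a brute-force scan over just those streams.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: indexes label sets only, never the log text."""

    def __init__(self):
        self.streams = {}               # stream key -> list of raw log lines
        self.index = defaultdict(set)   # (label, value) -> set of stream keys

    def push(self, labels, line):
        # A stream is identified by its full, sorted set of labels.
        key = tuple(sorted(labels.items()))
        if key not in self.streams:
            self.streams[key] = []
            for pair in key:
                self.index[pair].add(key)
        self.streams[key].append(line)

    def query(self, selector, contains=""):
        # Step 1: use the small label index to pick candidate streams.
        candidates = None
        for pair in selector.items():
            matches = self.index.get(pair, set())
            candidates = matches if candidates is None else candidates & matches
        # Step 2: brute-force scan only those streams for the text filter.
        return [
            line
            for key in sorted(candidates or ())
            for line in self.streams[key]
            if contains in line
        ]


store = LabelIndexedStore()
store.push({"app": "checkout", "env": "prod"}, "payment failed for order 17")
store.push({"app": "checkout", "env": "prod"}, "payment ok for order 18")
store.push({"app": "search", "env": "prod"}, "payment keyword searched")

hits = store.query({"app": "checkout"}, contains="failed")
print(hits)              # only the checkout stream was scanned
print(len(store.index))  # index size depends on labels, not on log volume
```

The index grows with the number of distinct label pairs rather than with the number of log lines. That is the property that keeps storage low when ingest is heavy and searches are rare.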
Implementing Loki - A Meticulous Strategy
Although Loki offered everything Sreejith and his team were looking for, the implementation was not that straightforward. The team took a methodical approach when moving from their legacy system to a modern Loki architecture.
The approach was twofold: a technical strategy built on phased deployment, comprehensive load testing, and continuous performance monitoring for optimization; and a change management strategy that prioritized extensive training, accessible documentation, and an active feedback loop.
Technical Strategy:
Phased Deployment: The existing approach needed extensive testing and adjustment, so Loki was rolled out step by step to ensure that its integration did not disrupt existing operations. This gave the team room to tune performance under different scenarios.
Load Testing: The team conducted rigorous load testing to understand how Loki performs under different stress levels. These insights were critical for fine-tuning system configurations and ensuring scalability and reliability when fully deployed.
Monitoring and Optimization: The team implemented continuous monitoring to track Loki's performance and resource usage in real time. They utilized these metrics to optimize configurations, improving efficiency and reducing costs.
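The webinar recap does not describe the load-testing tooling the team used, so the following is only a minimal sketch of one way to generate synthetic batches. It builds a JSON body in the shape accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`). The labels, line sizes, helper names, and the local URL in the final comment are assumptions for illustration.

```python
import json
import time
import urllib.request


def build_push_payload(labels, lines, start_ns=None):
    """Build a JSON body in the shape accepted by Loki's push endpoint."""
    start_ns = time.time_ns() if start_ns is None else start_ns
    # Each value is [timestamp in nanoseconds as a string, log line].
    values = [[str(start_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def synthetic_lines(count, size_bytes):
    """Generate `count` filler lines padded to `size_bytes` characters."""
    return [f"line={i} ".ljust(size_bytes, "x") for i in range(count)]


def send(payload, url):
    """POST one batch; `url` is a placeholder for your own test endpoint."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


batch = build_push_payload(
    {"job": "loadtest", "env": "staging"},  # hypothetical labels
    synthetic_lines(count=3, size_bytes=64),
    start_ns=1_700_000_000_000_000_000,
)
print(len(batch["streams"][0]["values"]))
# send(batch, "http://localhost:3100/loki/api/v1/push")  # assumed local test instance
```

Varying `count`, `size_bytes`, and the number of distinct label sets across runs is one way to probe how a deployment behaves under different stress levels before a full rollout.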
Managing Change in Developer Processes:
Comprehensive Training: A comprehensive training program was developed covering Loki's architecture, features, and best practices. Hands-on sessions were conducted to help developers become familiar with the new system.
Documentation and Support: The developers were provided with detailed documentation and support channels. This ensured that they had access to the information and assistance they needed, facilitating a smoother transition and integration into their workflows.
Feedback Loop: The team established a feedback loop with developers to gather insights on Loki's implementation challenges and successes. This feedback loop proved invaluable for the continuous improvement and adaptation of both the technical strategy and developer support processes.
Overcoming Implementation Challenges
Transitioning to Loki brought its own set of challenges, each demanding a specific resolution. Rate limiting, ingester overloads, and S3 API rate limits all surfaced during the implementation. The team tackled these by adjusting ingestion rates and stream sizes and by optimizing query performance.
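Ingestion limits of this kind are commonly enforced with a token bucket: a sustained rate plus a burst allowance. The sketch below is a generic Python illustration of that mechanism, not Loki's internal code, and the numbers are hypothetical. Loki exposes settings along these lines, such as `ingestion_rate_mb` and `ingestion_burst_size_mb` under `limits_config`. The values Capillary settled on were not shared, so check the configuration reference for your version.

```python
class TokenBucket:
    """Generic token bucket: a sustained rate plus a burst allowance."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, size, now):
        # Refill according to elapsed time, capped at the burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # caller should back off and retry later


# Hypothetical numbers: 4 MB/s sustained, 6 MB burst.
bucket = TokenBucket(rate_per_sec=4.0, burst=6.0)
decisions = [
    bucket.allow(5.0, now=0.0),  # fits inside the burst allowance
    bucket.allow(5.0, now=0.1),  # rejected: too few tokens have refilled
    bucket.allow(5.0, now=1.5),  # accepted once the bucket has refilled
]
print(decisions)
```

Raising the rate helps with sustained load, while raising the burst helps with short spikes. Tuning the two separately is what lets a system absorb bursty traffic without overloading the ingesters.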
Rate Limiting and Ingester Overloads: Adjusting ingestion rates and stream sizes was key to managing the load efficiently. This calibration let the ingesters process high volumes of data smoothly without compromising data integrity or latency.
S3 API Rate Limits: Addressing this involved implementing caching strategies and query sharding, which effectively reduced the number of calls to the S3 API. This approach not only diminished latency but also significantly improved the system's overall responsiveness to queries.
Query Performance: Query performance improved through changes to how queries were processed and managed. By introducing more efficient data retrieval techniques and optimizing the indexing strategy, the team significantly reduced query times, giving developers faster access to logs.
Collaborative Optimization: The partnership between Facets and Capillary allowed for a tailor-made optimization of Loki's setup. This co-development effort focused on creating a configuration that specifically addressed the challenges faced, leading to a more efficient, scalable logging solution perfectly suited to the operational requirements.
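The caching point under S3 API rate limits can be illustrated with a small Python sketch. It shows time-based query splitting with a results cache. This is a simpler relative of the sharding the team describes, not their actual configuration. The class, the `fake_fetch` backend, and the minute-based timestamps are all hypothetical, and a real query frontend would run the sub-queries in parallel rather than in a loop.

```python
def split_range(start, end, step):
    """Split [start, end) into sub-ranges aligned to multiples of `step`."""
    pieces = []
    cursor = start
    while cursor < end:
        boundary = min(end, (cursor // step + 1) * step)
        pieces.append((cursor, boundary))
        cursor = boundary
    return pieces


class CachedQuerier:
    """Answers range queries piece by piece, caching each piece's result."""

    def __init__(self, fetch, step):
        self.fetch = fetch       # function that would hit the object store
        self.step = step
        self.cache = {}          # (start, end) -> cached result
        self.backend_calls = 0

    def query(self, start, end):
        results = []
        for piece in split_range(start, end, self.step):
            if piece not in self.cache:
                self.backend_calls += 1
                self.cache[piece] = self.fetch(*piece)
            results.extend(self.cache[piece])
        return results


def fake_fetch(start, end):
    # Stand-in for an object-store read; one "log" per minute.
    return [f"log@{t}" for t in range(start, end)]


querier = CachedQuerier(fake_fetch, step=60)
first = querier.query(30, 200)   # pieces: 30-60, 60-120, 120-180, 180-200
calls_after_first = querier.backend_calls
second = querier.query(60, 240)  # reuses 60-120 and 120-180 from the cache
print(calls_after_first, querier.backend_calls)
```

Because the pieces are aligned to fixed boundaries, overlapping queries from different users land on the same cache keys. That reuse is how a cache cuts the number of calls reaching the object store.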
The Business Impact of Implementing Loki
Adopting Loki marked a significant improvement in both log management and troubleshooting efficiency. Developers gained faster access to logs, accelerating issue resolution. The Loki architecture also made it easier to extract actionable business metrics from logs, enhancing overall data analytics capabilities. Of the many business impacts, here are a few that stand out:
More Efficient Log Management: Loki made log management more efficient while handling large volumes of log data, reducing the time and effort required for log processing and management.
Accelerated Issue Resolution: Developers saw a significant reduction in the time taken to access logs. Faster access meant quicker identification and resolution of issues, leading to reduced downtime and enhanced system reliability.
Enhanced Analytical Capabilities: By extracting valuable metrics from logs, Loki has empowered Capillary Technologies to delve deeper into their data analytics, offering a clearer understanding of system performance, user behavior, and potential areas for optimization.
Scalability and Flexibility: The adoption of Loki brought scalability and flexibility to Capillary's logging infrastructure. This flexibility is crucial in managing varying log volumes and ensures that the system can scale up efficiently as the organization grows.
Cost-Effectiveness: The minimalist indexing strategy, coupled with efficient storage, translates to lower storage costs, making it a financially viable solution for large-scale log management.
This has set a new benchmark in their log management approach, driving operational excellence and supporting business growth. Check out some of the key takeaways from the implementation:
In Conclusion:
In summary, the "Loki at Scale" webinar was not just an exploration of a tool but a broader narrative on overcoming the complexities of high-volume logging. One size doesn't fit all, and extensive domain knowledge is required to tackle the challenges. For professionals in DevOps and platform engineering, embracing tools like Loki isn't just about keeping pace with technology—it's about leveraging it to drive operational excellence and business growth. If you'd like to watch the full webinar and get the in-depth details, here's a link for you.