1. System Maintenance and Monitoring: Responsible for designing, building, and maintaining highly reliable production systems. Continuously monitor system performance to ensure compliance with established Service Level Objectives (SLOs).
2. Incident Analysis and Resolution: Respond quickly to system outages and performance issues, conduct Root Cause Analysis (RCA), and implement long-term solutions to prevent recurrence.
3. Automation and Tool Development: Develop and deploy automation tools to improve system efficiency and reduce human errors. This includes automating deployment, failure recovery, and other routine maintenance tasks.
4. Cross-Departmental Collaboration: Work closely with development, operations, and product management teams to ensure technical solutions meet functional and performance requirements. Actively participate in the design and improvement process of products, providing feedback on reliability and maintainability.
5. Performance Optimization: Analyze the performance of existing systems, identify bottlenecks, and implement optimization strategies to enhance efficiency and reduce costs.
6. Continuous Learning and Technology Upkeep: Stay current with industry developments, and learn and adopt new technologies to keep improving system reliability and performance.
7. Documentation and Maintenance: Write and maintain detailed system architecture documents, configuration documents, and operations manuals so that team members can understand and operate the systems.
If You
● Are a self-driven DevOps Engineer with proven experience in large-scale micro-service systems hosted on AWS.
● Have a deep understanding of cloud architecture, AWS technologies and cloud security best practices.
● Follow the latest industry trends and are passionate about cloud computing for large-scale systems.
Key Responsibilities
● Work in a team of DevOps and DBA professionals (initially three people, growing as we expand into new countries)
● Improve existing infrastructure and CI/CD procedures
● Holistically improve all aspects of our infrastructure, including reducing costs, improving build and deployment times, streamlining environment provisioning, lowering load times, incorporating the latest techniques and technologies, and more
● Monitor and maintain the existing cloud infrastructure through autoscaling and automated alerts
● Take ownership and responsibility for our cloud operation activities
● Liaise with external security agencies for annual audits as well as perform our own internal security sweeps
● Aid in reconfiguring existing architecture to allow for rapid deployments to new countries
● Report to the DevOps Leader/Director
Our Stack
● Backend Application Framework: Spring Boot (Java Config + Embedded Tomcat)
● Frontend Application Framework: VueJS
● Micro Service Framework: Spring Cloud Dalston (Netflix Eureka + Netflix Ribbon + Feign)
● Database: AWS RDS, RDS Proxy, MongoDB
● Public Cache: AWS ElastiCache + Redis
● Message Queue: Apache RocketMQ, RabbitMQ
● Distributed Scheduling: Dangdang Elastic Job
● Data Index and Search: Elasticsearch
● Log Real-time Visualization: Elasticsearch + Logstash + Kibana, Grafana Loki
● Business Monitoring: Prometheus + Grafana
● Reverse Proxy: Nginx
● CDN: Cloudflare
● Server Virtualization Container: AWS EKS + AWS EC2
● Server Operating System: CentOS
● Static File Storage: AWS S3
● Internal DNS Resolution: AWS Route 53
● Network Management: AWS VPC
● Cluster Management and Scaling: AWS OpsWorks
● Cluster Monitoring: Prometheus + AWS CloudWatch
● HTTPS Certificate Management: AWS Certificate Manager
● Malicious Attack Defense: AWS WAF & Shield
● Cluster Alert: AWS SNS + Slack
● Continuous Integration/Deployment: Jenkins, Rancher, ArgoCD
● Configuration Tool: Ansible, Chef, Salt
We are looking for a Site Reliability Engineer (SRE) to make sure our cloud-based commerce platform is up, running, and healthy.
As an SRE for iKala Commerce, you will be responsible for everything from our cloud infrastructure and operating systems to developing tools for code deployment and service monitoring. You will also review our code and system design and partner with developers to build our applications.
The SRE is an integral member of our product development team. You will be part of the team that makes crucial decisions about how to manage and scale complex, high-performance distributed systems. You will also provide your own perspective on our backend systems and continually develop innovative ways to improve how we manage the underlying infrastructure. Our ideal candidate can develop applications independently, but is even more eager to accelerate the whole team by building systems that improve performance and operational efficiency.
Ultimately, you should be involved in all stages of software development to define and improve our SLOs, SLAs & SLIs.
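To illustrate how SLIs and SLOs relate in day-to-day work, here is a minimal Python sketch of an availability SLI and the error budget derived from an SLO target. It is not taken from our codebase: the class name, function names, the 99.9% target, and the 30-day window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySlo:
    """Hypothetical SLO: the fraction of requests that must succeed in a window."""
    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int = 30  # assumed rolling window, for documentation only


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of successful requests."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the objective as met
    return good_requests / total_requests


def error_budget_remaining(slo: AvailabilitySlo,
                           good_requests: int,
                           total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures
```

With a 99.9% target and one million requests, roughly 1,000 failed requests are allowed; 500 actual failures would leave about half of the budget unspent.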
Our current tech stack includes:
GCP, Terraform, Kubernetes, Helm, ArgoCD, GitLab CI/CD, Grafana LGTM
【Key Responsibilities】
1. Designing & implementing infrastructure for collecting metrics, crunching data and improving service monitoring to detect problems before they're visible to our customers.
2. Building systems to automate our server lifecycle, from configuration management, CI/CD to server bootstrap and decommission.
3. Troubleshooting, performing root cause analysis, and resolving production issues from the application and network layers all the way down to the system level.
4. Participating in solution design and advising other developers when building new features so that they're scalable, maintainable, and performant.
5. Improving the observability of our applications through monitoring, alerting, logging, tracing and profiling, and building such observability features into a common platform.
6. Practicing sustainable incident response and blameless postmortems.
7. Proactively identifying and reducing issues through design, testing, and implementation of software-based solutions.
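As one possible reading of "detecting problems before they're visible to our customers", the sketch below shows a multi-window burn-rate check, a common SRE alerting pattern. It is an illustration only: the function names are hypothetical, and the default threshold of 14.4 is an assumed value (it corresponds to spending 2% of a 30-day error budget in one hour), not a setting from our platform.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening, which avoids paging for incidents already over.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

For example, with a 99.9% SLO, a 2% error ratio in both windows is a burn rate of about 20 and would page, while a long-window spike that has already subsided in the short window would not.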
More info: https://www.ikala.ai
We're seeking a Lead DevOps Engineer to oversee our DevOps team and ensure top-notch system and infrastructure performance. If you have a strong background in UNIX/Linux environments, cloud services, and leadership in a dynamic setting, we'd love to have you steer our team to success!
#Responsibilities
1. Oversee the management and monitoring of all installed systems and infrastructure.
2. Ensure the highest levels of systems and infrastructure availability.
3. Lead the identification and resolution of capacity and performance issues to ensure uninterrupted operation of systems.
4. Direct the monitoring and testing of application performance, identifying bottlenecks and collaborating with developers on solutions.
5. Develop and maintain security, backup, and redundancy strategies.
6. Create and maintain custom scripts to enhance system efficiency and reduce manual intervention.
7. Design and build internal tools to assist with development or debugging.
8. Drive improvements in monitoring, database, and storage optimizations.
9. Lead the on-call rotation and manage incident response.
10. Collaborate with developers to automate deployments and testing (CI/CD).
#Requirements
General Must Have:
1. Good English communication skills.
2. Previous experience at both startups and established, successful companies.
3. Strong management experience.
4. Good at performance tracking:
- Provide tracking systems.
- Provide examples.
- Identify underperforming employees.
- Help underperforming employees improve.
5. Documentation and organization skills.
Good to Have:
1. Industry knowledge.
2. Good at EKS/K8S.
3. Experience with infrastructure as code, such as Terraform.
Specific Must Have:
1. Mastery of AWS.
2. Good communication across departments and teams.
3. Knowledge of cost-reduction strategies.