Title: Sr Monitoring & Observability Engineer, Los Angeles (On-Site)
About Us
Data Analysis Incorporated (DAI) is the controlling entity of the O’Neil family of businesses. DAI and its subsidiaries operate in diverse industries worldwide, including global equity markets, health care, financial services, digital news, and insurance. Our global footprint allows our teams to be responsive to customer needs in a timely and efficient manner. We are dedicated to using technology and innovation to bring change and growth to our businesses. We believe in a dynamic workplace, creating engaging, informative products and services that help our customers succeed. Integrity is an essential characteristic for our firms and our associates; if this describes you, please apply!
Summary
The Senior Monitoring & Observability Engineer (TOC) is a senior-level infrastructure and reliability engineering role responsible for designing, implementing, optimizing, and supporting enterprise monitoring and observability platforms across networks, systems, cloud environments, and critical business applications. The position combines observability engineering, cloud and infrastructure operations, automation, and incident management responsibilities, including ownership of monitoring tools such as Datadog, Splunk, SolarWinds, Dynatrace, AppDynamics, Nagios, PRTG, and Zabbix. Acting as a technical escalation point, the role partners closely with Infrastructure, Security, DevOps, and IT Operations teams to improve system reliability, alert quality, operational efficiency, and service availability while supporting SRE-aligned practices such as automation, root cause analysis, SLIs/SLOs, and continuous operational improvement.
Compensation and Location
- Targeted pay range: $115K - $125K base pay, plus a 10% yearly bonus target.
- Location: 11065 Beatrice St., Los Angeles, CA 90066
Duties and Responsibilities
- Monitor and manage IT infrastructure, network systems, and business applications using enterprise monitoring tools, aligned with the TOC Sr. Engineer scope.
- Serve as the first point of escalation for TOC Engineers, providing advanced troubleshooting, guidance, and root cause analysis.
- Lead or support incident response, root cause analysis, escalation, and post-incident review processes; ensure issues are properly classified, escalated, and resolved efficiently.
- Take key roles in ITIL Incident, Problem, and Change Management processes.
- Build and tune monitoring and observability tooling — instrumentation, integrations, dashboards, alert logic, synthetic checks, log pipelines, and APM configuration — not just consume them.
- Develop and implement automation scripts and tooling to improve operational efficiency, alerting quality, and response times (Python, PowerShell, Bash, Ansible, or similar).
- Analyze system logs, network traffic, event data, and performance metrics to identify trends, reduce alert noise, and prevent outages.
- Document monitoring standards, troubleshooting steps, system configurations, dashboards, and runbooks for knowledge sharing.
- Collaborate with IT, Security, and DevOps teams to maintain system reliability and security posture.
- Work with vendors and service providers to resolve tool, platform, and infrastructure issues.
- Participate in 24/7 on-call rotations and provide leadership during major incidents, helping coordinate cross-functional resolution efforts.
- Mentor junior TOC/NOC engineers on monitoring tools, dashboards, alert handling, and incident response practices.
Qualifications & Requirements
Required Education, Experience, Certification/Licensure
- Bachelor’s degree in IT, Computer Science, Networking, or a related field (or equivalent work experience).
- 3+ years of experience in IT operations, network monitoring, or system administration, with hands-on experience implementing and tuning enterprise monitoring/observability platforms.
- Demonstrated experience building or implementing (not just using) one or more of: Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds Orion, Orion DPA, Nagios, PRTG, or Zabbix.
- Advanced understanding of network protocols (TCP/IP, BGP, OSPF, VLANs, VPN, DNS, DHCP).
- Proficiency in Windows/Linux environments and at least one major cloud platform (AWS, Azure, or GCP).
- Familiarity with ITIL best practices for incident, problem, and change management.
- Scripting and automation experience using Python, PowerShell, Bash, Ansible, or similar tools.
- Working knowledge of cybersecurity best practices, firewall configurations, and SIEM tools.
- Strong leadership, communication, and collaboration skills, including the ability to translate monitoring data into clear operational action across cross-functional teams.
- Ability to work in a high-stress, dynamic environment while handling multiple high-priority incidents.
Preferred / "Ideal Candidate" Attributes
- Hands-on experience designing, implementing, or significantly maturing monitoring and observability platforms in an enterprise environment — not limited to acknowledging alerts or interpreting dashboards.
- Strong understanding of the relationship between infrastructure, networking, systems, cloud platforms, logs, metrics, traces, alerts, dashboards, and incident workflows.
- Experience in or exposure to a Site Reliability Engineering (SRE) environment, including reliability practices, automation, observability, service health, SLIs/SLOs, error budgets, and post-incident improvement.
- Experience reducing alert noise and improving signal quality through threshold tuning, deduplication, correlation, and runbook-driven response.
- Comfort working with APIs, configuration-as-code, and CI/CD pipelines as they relate to monitoring deployment and management.
Knowledge, Skills and Abilities (KSAs) / Certifications
- CompTIA A+, Network+, or Security+
- Microsoft Fundamentals Certifications (Azure, M365, or Windows Server)
- AWS Cloud Practitioner or Azure Fundamentals
- ITIL Foundation certification (preferred for Incident Management responsibilities)
- Vendor certifications in Datadog, Splunk, Dynatrace, AppDynamics, or SolarWinds are a plus
Working Conditions
Must be able to perform the essential job duties. Work is performed primarily in an office environment. Typically requires the ability to sit for extended periods of time (66%+ of each work day), hear the telephone, and enter data on a computer; may also require the ability to lift up to 10 pounds.
Equal Opportunity Employer
Data Analysis Inc is an equal opportunity employer. All aspects of employment, including the decision to hire, promote, discipline, or discharge, will be based on merit, competence, performance, and business needs. We do not discriminate on the basis of race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, genetic information, gender, sexual orientation, gender identity or expression, veteran status, or any other status protected under federal, state, or local law.