Aiven provides built-in monitoring and logging for all services, along with integrations to export metrics and logs to external platforms. Monitor service health, performance, and troubleshoot issues with comprehensive observability tools.
Monitoring overview
Aiven offers multiple levels of monitoring:
Built-in Metrics
- Real-time service metrics
- CPU, memory, disk, network
- Service-specific metrics
- Available in Console
Service Logs
- Service operation logs
- Error and debug messages
- Connection logs
- 4-day retention
Audit Logs
- Organization events
- Project events
- User actions
- Configuration changes
Service metrics
View real-time metrics for all your services:
Built-in metrics
Available for every service without additional configuration:
- Host Metrics
- Service-Specific
Infrastructure-level metrics
- Percentage of CPU resources consumed by the service
- Percentage of memory utilized by the service
- Percentage of disk space used
- 5-minute average CPU load indicating system computational load
- Input/output operations per second for disk reads
- Input/output operations per second for disk writes
- Network traffic received by the service
- Network traffic transmitted by the service
Viewing metrics
Advanced metrics integration
For detailed service-specific metrics, set up metrics integration:
Metrics integration requires separate PostgreSQL and Grafana services (additional cost). Predefined dashboards are automatically created and maintained.
Service logs
View logs for troubleshooting and monitoring:
Accessing service logs
- Aiven Console
- Aiven CLI
Log retention
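The overview above lists a 4-day retention for service logs. As a minimal sketch of what that window means in practice, the helper below checks whether a log entry of a given age would still be available. The function name and the sample timestamps are illustrative assumptions, and the retention value is taken from the overview rather than read from any Aiven API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 4-day figure comes from the monitoring overview above.
SERVICE_LOG_RETENTION = timedelta(days=4)


def is_still_retained(log_time: datetime, now: datetime) -> bool:
    """Return True if an entry written at log_time is inside the retention window."""
    return now - log_time <= SERVICE_LOG_RETENTION


# Illustrative timestamps only.
now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
three_days_old = datetime(2024, 5, 7, 12, 0, tzinfo=timezone.utc)
five_days_old = datetime(2024, 5, 5, 12, 0, tzinfo=timezone.utc)

print(is_still_retained(three_days_old, now))  # True
print(is_still_retained(five_days_old, now))   # False
```

Entries older than the window are the ones that a log integration would need to have captured already.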
Log integration
Send logs to OpenSearch for long-term storage and analysis:
All services in the project can send logs to the same OpenSearch service. Create one OpenSearch service for centralized logging.
Audit logs
Track administrative actions and changes:
Organization audit logs
View organization-level events:
Project event logs
View project-level events:
API access to audit logs
Organization audit logs require the organization:audit_logs:read permission. Project logs require the project:audit_logs:read permission.
Prometheus integration
Expose metrics in Prometheus format for scraping:
Enabling Prometheus
Create Prometheus endpoint
Navigate to project → Integration endpoints → Add new endpoint → Prometheus
Enable for service
Service → Overview → Service integrations → Manage integrations → Prometheus → Enable
Get metrics endpoint
Service → Overview → Connection information → Prometheus tab
Copy the Service URI and credentials
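Once the Service URI and credentials are copied, a scraper needs to present them on each request. The sketch below builds such a request with Python's standard library, assuming the endpoint accepts HTTP Basic authentication. The URI, username, and password are placeholders, and `build_scrape_request` is a hypothetical helper name.

```python
import base64
import urllib.request


def build_scrape_request(service_uri: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request carrying HTTP Basic credentials (assumed auth scheme)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(service_uri)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder values: substitute the Service URI and credentials from the Prometheus tab.
req = build_scrape_request("https://example.invalid/metrics", "prom-user", "prom-pass")
print(req.full_url)

# To actually fetch the metrics text (requires network access to the service):
# body = urllib.request.urlopen(req, timeout=10).read().decode("utf-8")
```

In a real Prometheus deployment the same values would normally go into the scrape configuration rather than into application code.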
Prometheus in VPC
If using VPC peering, enable public Prometheus access:
Prometheus metrics
Available metrics include:
- System metrics: CPU, memory, disk, network
- Service metrics: Connections, queries, cache hits
- Custom metrics: Application-specific (service dependent)
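To inspect what the endpoint returns without running a full Prometheus server, the response body can be parsed directly. The sketch below handles the common shape of the Prometheus text format (comment lines, an optional label set, a numeric value). It is a simplified reader, not a complete implementation of the format, and the metric names in the sample are made up for illustration.

```python
def parse_metrics(text: str) -> dict:
    """Group samples by metric name as lists of (raw_label_string, value)."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        value = float(tail.split()[0])  # ignore an optional trailing timestamp
        samples.setdefault(name, []).append((labels, value))
    return samples


# Hypothetical sample; real metric names depend on the service.
SAMPLE = """\
# HELP demo_cpu_usage Hypothetical CPU usage percentage
# TYPE demo_cpu_usage gauge
demo_cpu_usage{host="demo-1"} 12.5
demo_cpu_usage{host="demo-2"} 40
demo_connections 17
"""

parsed = parse_metrics(SAMPLE)
print(sorted(parsed))  # ['demo_connections', 'demo_cpu_usage']
```

For production use, an established Prometheus client library is a safer choice than a hand-written parser.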
External integrations
Integrate Aiven metrics and logs with external platforms:
Datadog integration
Send metrics to Datadog:
Create Datadog endpoint
Project → Integration endpoints → Datadog
Enter your Datadog API key and site (US/EU)
Jolokia (JMX) integration
Access JMX metrics for Kafka and other Java services:
Rsyslog integration
Send logs to external syslog servers:
Alerts and notifications
Set up alerts for service issues:
Email notifications
Manage project and service notifications:
Service contacts
Add contacts per service:
Grafana alerts
Set up alerts in Grafana for metrics:
Configure alert rule
- Set threshold (e.g., CPU > 80%)
- Set evaluation interval
- Configure for clause (duration)
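The three settings above interact: the rule is checked once per evaluation interval, and the for clause delays firing until the threshold has been exceeded continuously for the configured duration. The sketch below is a simplified model of that behavior, not Grafana's implementation. `AlertRule` and `first_firing_time` are hypothetical names, and the samples are synthetic.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class AlertRule:
    threshold: float   # e.g. 80.0 for "CPU > 80%"
    for_seconds: int   # how long the breach must last before the alert fires


def first_firing_time(rule: AlertRule, samples: Iterable[Tuple[int, float]]) -> Optional[int]:
    """samples are (timestamp_seconds, value) pairs, one per evaluation, in time order."""
    breach_start = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts          # condition just became true
            if ts - breach_start >= rule.for_seconds:
                return ts                  # held long enough: fire
        else:
            breach_start = None            # any dip resets the pending period
    return None


# Synthetic data: 60-second evaluation interval, a short spike, a dip, then a long breach.
rule = AlertRule(threshold=80.0, for_seconds=600)
samples = (
    [(t, 85.0) for t in range(0, 360, 60)]
    + [(360, 70.0)]
    + [(t, 90.0) for t in range(420, 1140, 60)]
)
print(first_firing_time(rule, samples))  # 1020
```

The short spike at the start never fires because it lasts only 300 seconds, which is the noise-suppression effect the for clause is meant to provide.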
Common alert scenarios
High CPU usage
Alert: CPU usage > 80% for 10 minutes
Actions:
- Review slow queries or processes
- Check for unusual traffic patterns
- Consider upgrading service plan
Low disk space
Alert: Disk usage > 85%
Actions:
- Review disk usage by table/index
- Clean up unnecessary data
- Enable disk autoscaler
- Upgrade to larger plan
High connection count
Alert: Connections > 80% of max
Actions:
- Check for connection leaks in applications
- Implement connection pooling
- Review connection limits
- Upgrade service plan
Replication lag
Alert: Replication lag > 60 seconds
Actions:
- Check network connectivity
- Review write load on primary
- Check replica performance
- Contact support if the lag persists
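The four scenarios above share one shape: a metric compared against a limit. As a rough sketch, they can be written as a small table and checked against a snapshot of current values. The dictionary keys such as `cpu_percent` are hypothetical names chosen for this example, and the 10-minute duration on the CPU alert is deliberately left out to keep the check stateless.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    metric: str    # hypothetical key in the snapshot dictionary
    limit: float


# Limits copied from the scenarios above.
SCENARIOS = [
    Threshold("High CPU usage", "cpu_percent", 80.0),
    Threshold("Low disk space", "disk_percent", 85.0),
    Threshold("High connection count", "connections_percent_of_max", 80.0),
    Threshold("Replication lag", "replication_lag_seconds", 60.0),
]


def breached(snapshot: dict) -> list:
    """Return scenario names whose limit is exceeded; missing metrics count as healthy."""
    return [t.name for t in SCENARIOS if snapshot.get(t.metric, 0.0) > t.limit]


# Made-up readings for illustration.
snapshot = {
    "cpu_percent": 91.0,
    "disk_percent": 60.0,
    "connections_percent_of_max": 82.5,
    "replication_lag_seconds": 12.0,
}
print(breached(snapshot))  # ['High CPU usage', 'High connection count']
```

Keeping the limits in one table makes it easier to review them together and to keep alert rules in different tools consistent.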
Monitoring best practices
Monitor key service metrics
Focus on:
- Resource utilization (CPU, memory, disk)
- Connection count and errors
- Query performance and slow queries
- Replication lag (if applicable)
Troubleshooting
Metrics not appearing in Grafana
Cause: Integration not properly configured or needs time to populate
Solution:
- Verify metrics integration is active
- Wait 1-2 minutes for initial data
- Check PostgreSQL has space and is running
- Verify network connectivity if using VPC
Cannot access Prometheus endpoint
Cause: Service in VPC without public Prometheus access
Solution:
- Enable public access: public_access.prometheus=true
- Or access from peered VPC
- Check IP allowlist includes your scraper’s IP
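The `public_access.prometheus=true` setting above is written in dotted form. Assuming the dots denote nesting, which is the usual reading of this notation, the sketch below expands such a string into the equivalent nested structure. `expand_dotted` is a hypothetical helper, and how the resulting configuration is submitted to the service is not covered here.

```python
import json


def expand_dotted(setting: str) -> dict:
    """Turn 'a.b=value' into {'a': {'b': value}}, reading true/false as booleans."""
    key, _, raw = setting.partition("=")
    raw = raw.strip()
    value = {"true": True, "false": False}.get(raw.lower(), raw)
    nested = value
    for part in reversed(key.strip().split(".")):
        nested = {part: nested}  # wrap from the innermost key outward
    return nested


config = expand_dotted("public_access.prometheus=true")
print(json.dumps(config))  # {"public_access": {"prometheus": true}}
```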
Logs not appearing in OpenSearch
Cause: Integration not enabled or OpenSearch full
Solution:
- Verify log integration is active
- Check OpenSearch disk space
- Review index lifecycle management settings
- Check for ingestion errors in OpenSearch
Not receiving alert emails
Cause: Email addresses not configured or notifications disabled
Solution:
- Verify email addresses in project notifications
- Check spam folder
- Verify notification types are enabled
- Test with manual service restart
API reference
Next steps
Service Integrations
Set up metrics and log integrations
Security
Review security audit logs
VPC & Networking
Configure network for Prometheus access
Users & Permissions
Grant audit log access permissions