監視とは
- 公式
- 計画する内容:
-
監視項目
- OS
- PostgreSQLサーバプロセス
-
監視間隔
- SLA等に照らして必要十分な間隔で計測する
-
監視内容により監視間隔も変わってくる
- 死活は数秒とか
- 性能は数分とか
-
閾値
- 想定、実績、閾値をもって、「異常」の定義をする
監視項目の選定
- 最終目的は「データベースが健全に動作しているか」の監視
-
そのためにブレークダウン
- サーバに問題が起きていないか
- PostgreSQLに問題が起きていないか
サーバに問題が起きていないか?を監視する
- Unixの監視ツールをつかう
-
sar
- CPU,メモリ,I/O,ネットワークぜんぶ
- sysstatデーモンが動いていて
/var/log/sysstat/sa<日>
にログを書き出すやつ - sarクライアントで取り出したりする
-
iostat
- I/O
- CPU
Linux 4.19.84-microsoft-standard (DESKTOP-2PJLLS0) 02/02/20 _x86_64_ (6 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.41 0.00 0.14 0.00 0.00 99.44
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.56 0.01 283.69 165 4195632
sdb 3.89 49.66 43.73 734501 646704
-
vmstat
- virtual memory
- CPUも
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 24867800 136884 835760 0 0 8 54 16 43 0 0 99 0 0
-
top
- 各プロセスのCPU、メモリ
-
netstat
-
ネットワーク
- TCP/IPとかUnixドメインソケット通信とか
-
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:33154 localhost:6010 ESTABLISHED
tcp 0 0 localhost:6010 localhost:33144 ESTABLISHED
...
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
unix 3 [ ] STREAM CONNECTED 1564
unix 3 [ ] STREAM CONNECTED 1563
...
PostgreSQLに問題が起きていないか?を監視する
必要なプロセスは正常に動作しているか
ps
コマンド-
pg_stat_activity
ビュー- 公式
-
1行あたり1サーバプロセス
- 実行中のクエリ情報とか
postgres=# TABLE pg_stat_activity;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+----------+-----+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+--------+-------------+--------------+-------------------------+------------------------------
| | 60 | | | | | | | 2020-02-02 05:55:26.707822+00 | | | | Activity | AutoVacuumMain | | | | | autovacuum launcher
| | 62 | 10 | postgres | | | | | 2020-02-02 05:55:26.708161+00 | | | | Activity | LogicalLauncherMain | | | | | logical replication launcher
13408 | postgres | 674 | 10 | postgres | psql | | | -1 | 2020-02-02 07:52:49.013246+00 | 2020-02-02 07:53:08.898073+00 | 2020-02-02 07:53:08.898073+00 | 2020-02-02 07:53:08.898074+00 | | | active | | 489 | TABLE pg_stat_activity; | client backend
| | 58 | | | | | | | 2020-02-02 05:55:26.707524+00 | | | | Activity | BgWriterHibernate | | | | | background writer
| | 57 | | | | | | | 2020-02-02 05:55:26.707363+00 | | | | Activity | CheckpointerMain | | | | | checkpointer
| | 59 | | | | | | | 2020-02-02 05:55:26.707688+00 | | | | Activity | WalWriterMain | | | | | walwriter
(6 rows)
-
pg_stat_archiver
ビュー- 公式
- WALアーカイバの活動監視
postgres=# TABLE pg_stat_archiver;
archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset
----------------+-------------------+--------------------+--------------+-----------------+------------------+-------------------------------
0 | | | 0 | | | 2020-02-02 05:55:26.500914+00
(1 row)
-
pg_stat_progress_vacuum
ビュー- 公式
-
1行あたりVACUUMを実行しているバックエンド1プロセス
- autovacuumワーカ含む
postgres=# TABLE pg_stat_progress_vacuum;
pid | datid | datname | relid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
-----+-------+---------+-------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
(0 rows)
- C-S接続確認
ディスク容量に問題はないか
df
,du
コマンドpg_database_size(<database_name>)
関数
postgres=# SELECT pg_database_size('postgres');
pg_database_size
------------------
8135535
(1 row)
pg_total_relation_size(<relation_size>)
関数
postgres=# CREATE TABLE tbl (col int);
CREATE TABLE
postgres=# INSERT INTO tbl values (1);
INSERT 0 1
postgres=# SELECT pg_total_relation_size('tbl');
pg_total_relation_size
------------------------
8192
(1 row)
- 1ページ(8kB)に収まっている
- テーブル、インデックスの容量やスキャン履歴も監視対象
postgres=# CREATE INDEX idx ON tbl USING btree (col);
CREATE INDEX
pg_stat_user_tables
ビュー
postgres=# TABLE pg_stat_user_tables;
relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
-------+------------+---------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------+-----------------+--------------+------------------+--------------+------------------+---------------+-------------------
16386 | public | tbl | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 2 | | | | | 0 | 0 | 0 | 0
(1 row)
pg_stat_user_indexes
ビュー
postgres=# TABLE pg_stat_user_indexes;
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+---------+--------------+----------+--------------+---------------
16386 | 16389 | public | tbl | idx | 0 | 0 | 0
(1 row)
想定通りの性能を出せているか
log_min_duration_statement
でスロークエリをロギングできる
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this number
# of milliseconds
- 当該設定項目について
postgres=# SELECT name,setting,context,unit,boot_val FROM pg_settings WHERE name='log_min_duration_statement';
name | setting | context | unit | boot_val
----------------------------+---------+-----------+------+----------
log_min_duration_statement | -1 | superuser | ms | -1
(1 row)
- スーパーユーザのSET文で設定変更可能
- ミリ秒単位
- -1で無効
- 0以上ならば、設定値ミリ秒よりも長時間かかったクエリをロギング
postgres=# SET log_min_duration_statement=0;
SET
postgres=# SELECT * FROM tbl;
col
-----
1
2
(2 rows)
- stderr(デフォルトのログ出力先)を見てみる
2020-02-02 08:20:34.384 UTC [674] LOG: duration: 0.041 ms statement: SET log_min_duration_statement=0;
2020-02-02 08:20:44.917 UTC [674] LOG: duration: 0.101 ms statement: SELECT * FROM tbl;
log_min_duration_statement
の設定を変更したSET
文自体のログも出るよう
PostgreSQLのログに異常を知らせる通知はないか
- デフォルトで最低限なので適宜設定する
PostgreSQLログの設定
-
実運用ではデータベースに触れることが許されないことがある
- システムカタログを参照できない
- のでログが大事
PostgreSQLログをどこに出力するか
postgresql.conf
#------------------------------------------------------------------------------
# REPORTING AND LOGGING
#------------------------------------------------------------------------------
...
# - Where to Log -
#log_destination = 'stderr' # Valid values are combinations of
# stderr, csvlog, syslog, and eventlog,
# depending on platform. csvlog
# requires logging_collector to be on.
# This is used when logging to stderr:
#logging_collector = off # Enable capturing of stderr and csvlog
# into log files. Required to be on for
# csvlogs.
# (change requires restart)
# These are only used if logging_collector is on:
#log_directory = 'log' # directory where log files are written,
# can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
# can include strftime() escapes
#log_file_mode = 0600 # creation mode for log files,
# begin with 0 to use octal notation
#log_truncate_on_rotation = off # If on, an existing log file with the
# same name as the new log file will be
# truncated rather than appended to.
# But such truncation only occurs on
# time-driven rotation, not on restarts
# or size-driven rotation. Default is
# off, meaning append to existing files
# in all cases.
#log_rotation_age = 1d # Automatic rotation of logfiles will
# happen after that time. 0 disables.
#log_rotation_size = 10MB # Automatic rotation of logfiles will
# happen after that much log output.
# 0 disables.
# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'
#syslog_sequence_numbers = on
#syslog_split_messages = on
# This is only relevant when logging to eventlog (win32):
# (change requires restart)
#event_source = 'PostgreSQL'
PostgreSQLログをいつ出力するか
postgresql.conf
#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------
# - Statement Behavior -
#client_min_messages = notice # values in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# log
# notice
# warning
# error
...
#------------------------------------------------------------------------------
# REPORTING AND LOGGING
#------------------------------------------------------------------------------
...
# - When to Log -
#log_min_messages = warning # values in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# info
# notice
# warning
# error
# log
# fatal
# panic
#log_min_error_statement = error # values in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# info
# notice
# warning
# error
# log
# fatal
# panic (effectively off)
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this number
# of milliseconds
#log_transaction_sample_rate = 0.0 # Fraction of transactions whose statements
# are logged regardless of their duration. 1.0 logs all
# statements from all transactions, 0.0 never logs.
PostgreSQLログに何を出力するか
postgresql.conf
#------------------------------------------------------------------------------
# AUTOVACUUM
#------------------------------------------------------------------------------
...
#log_autovacuum_min_duration = -1 # -1 disables, 0 logs all actions and
# their durations, > 0 logs only
# actions running at least this number
# of milliseconds.
...
#------------------------------------------------------------------------------
# REPORTING AND LOGGING
#------------------------------------------------------------------------------
# - What to Log -
#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = on
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_duration = off
#log_error_verbosity = default # terse, default, or verbose messages
#log_hostname = off
#log_line_prefix = '%m [%p] ' # special values:
# %a = application name
# %u = user name
# %d = database name
# %r = remote host and port
# %h = remote host
# %p = process ID
# %t = timestamp without milliseconds
# %m = timestamp with milliseconds
# %n = timestamp with milliseconds (as a Unix epoch)
# %i = command tag
# %e = SQL state
# %c = session ID
# %l = session line number
# %s = session start timestamp
# %v = virtual transaction ID
# %x = transaction ID (0 if none)
# %q = stop here in non-session
# processes
# %% = '%'
# e.g. '<%u%%%d> '
#log_lock_waits = off # log lock waits >= deadlock_timeout
#log_statement = 'none' # none, ddl, mod, all
#log_replication_commands = off
#log_temp_files = -1 # log temporary files equal or larger
# than the specified size in kilobytes;
# -1 disables, 0 logs all temp files
log_timezone = 'Etc/UTC'
- ログ出力内容例
2020-02-02 08:20:44.917 UTC [674] LOG: duration: 0.101 ms statement: SELECT * FROM tbl;
- prefixはこの部分:
2020-02-02 08:20:44.917 UTC [674]
-
デフォルトは
'%m [%p] '
%m
: millisec単位のタイムスタンプ (TZ)%p
: PID
PostgreSQLログをどのように保持するか
-
ログローテート
-
syslog使用時
- syslogのログローテート機能を使う
-
syslog不使用時
- PostgreSQLのログローテート機能を使う
-
このへん
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
# can include strftime() escapes
...
#log_truncate_on_rotation = off # If on, an existing log file with the
# same name as the new log file will be
# truncated rather than appended to.
# But such truncation only occurs on
# time-driven rotation, not on restarts
# or size-driven rotation. Default is
# off, meaning append to existing files
# in all cases.
#log_rotation_age = 1d # Automatic rotation of logfiles will
# happen after that time. 0 disables.
#log_rotation_size = 10MB # Automatic rotation of logfiles will
# happen after that much log output.
# 0 disables.
- 間隔、もしくはログファイルサイズ、もしくはその両方でログローテートできる
異常時の判断基準
-
何をもって異常とするか
- 想定と実績と閾値で判断する
- そのための閾値を決定する