PgFincore

PgFincore is a set of functions to manage blocks in memory.

Each Table or Index is truncated in segment of (usually) 1GB, and each segment is truncated in block in filesystem.

Those functions let you know which and how many disk block from a relation are in the page cache of the operating system, and eventually write the result to a file. Then using this file, it is possible to restore the page cache state for each block of the relation.

Other functions are used to set a POSIX_FADVISE flag on the entire relation (each segment). The more usefull are probably WILLNEED and DONTNEED which push and pop blocks of each segments of a relation from page cache, respectively.

Each functions are call with at least a table name or an index name (or oid) as a parameter and walk each segment of the relation.

Download

Lastest release is 0.4.1 : direct download

You can grab the latest code with git :

 git clone git://git.postgresql.org/git/pgfincore.git

Or browse it

And the project is on pgfoundry : PgFincore

Install

First, make stage :

USE_PGXS=1 make clean
USE_PGXS=1 make
su
USE_PGXS=1 make install

Then, create the functions from the sql script (it should be in your contrib directory) :

psql mydb -f pgfincore.sql

Use Cases

Here are some examples of usage. If you want more details go to Documentation

Load a table or an index in OS Page Cache

You may want to try to keep a table or an index into the OS Page Cache, or preload a table before your well know big query is executed (reducing the query time).

To do so, just execute the following query:

cedric=# SELECT * FROM pgfadv_willneed('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     111882
 base/16384/24598.1 |       4096 |      55318 |      56764
(2 ROWS)
 
TIME: 39309,294 ms

The column “block_size” report that block size of the filesystem is 4KB. The column “block_disk” is the number of blocks of the specified file. The column “block_free” is the number of free blocks in memory (for caching).

Snapshot and Restore the OS Page Cache state of a table or an index (or more)

You may want to restore a table or an index into the OS Page Cache as it was while you did the snapshot. For example if you have to reboot your server, then when PostgreSQL start up the first queries might be slow because nether PostgreSQL or the OS have cache pages about the relations involved in those first queries. Executing a snapshot and a restore is very simple :

-- Snapshot
cedric=# SELECT * FROM pgmincore_snapshot('pgbench_accounts');
      relpath       | block_size | block_disk | block_mem | group_mem 
--------------------+------------+------------+-----------+-----------
 base/16384/24598   |       4096 |     262144 |    131745 |         1
 base/16384/24598.1 |       4096 |      55318 |     55318 |         1
 
-- Restore
cedric=# SELECT * FROM pgfadv_willneed_snapshot('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     105335
 base/16384/24598.1 |       4096 |      55318 |      50217
(2 ROWS)
 
TIME: 38745,140 ms

The column “block_mem” report how many blocks of the file are in memory. The column “group_mem” report that all the bloks in memory are contigous (only one group).

A more complete example is at my postgres stuff.

Documentation

Several functions are actually provided :

  • pgsysconf
  • pgmincore
  • pgmincore_snapshot
  • pgfadv_dontneed
  • pgfadv_willneed
  • pgfadv_willneed_snapshot
  • pgfadv_normal
  • pgfadv_random
  • pgfadv_sequential

pgsysconf

This function output size of OS blocks, number of free page in the OS Page Cache.

Example

cedric=# select * from pgsysconf();
 block_size | block_free
------------+------------
       4096 |     417534

pgmincore

This function provide information about the file system cache (page cache).

For the specified relation (can be call with tableoid too) it return :

  • relpath : the relation path
  • block_size : the size of one block disk
  • block_disk : the total number of file system blocks of the relation
  • block_mem : the total number of file system blocks of the relation in page cache. (not the shared buffers from PostgreSQL but the OS cache)
  • group_mem : the number of groups of adjacent block_mem

Example

cedric=# select * from pgmincore('pgbench_accounts');
      relpath       | block_size | block_disk | block_mem | group_mem
--------------------+------------+------------+-----------+-----------
 base/16384/16603   |       4096 |     262144 |         0 |         0
 base/16384/16603.1 |       4096 |      65726 |         0 |         0

cedric=# select * from pgbench_accounts limit 10000;

cedric=# select * from pgmincore('pgbench_accounts');
      relpath       | block_size | block_disk | block_mem | group_mem
--------------------+------------+------------+-----------+-----------
 base/16384/16603   |       4096 |     262144 |       414 |         1
 base/16384/16603.1 |       4096 |      65726 |         0 |         0

pgmincore_snapshot

This function write a file with _mincore suffix for each segment of the relation. So it does a snapshot of on memory blocks per segments, allowing a reload with pgfadv_willneed_snapshot.

Example

cedric=# select * from pgmincore_snapshot('pgbench_accounts');
          relpath           | block_size | block_disk | block_mem | group_mem
----------------------------+------------+------------+-----------+-----------
 base/16385/49240_mincore   |       4096 |     262144 |         0 |         0
 base/16385/49240.1_mincore |       4096 |     262144 |    238180 |         2
 base/16385/49240.2_mincore |       4096 |     262144 |     56478 |         2
 base/16385/49240.3_mincore |       4096 |      46902 |         0 |         0

# ls -1 /var/lib/postgresql/8.4/main/base/16385/49240*
49240
49240.1
49240.1_mincore
49240.2
49240.2_mincore
49240.3
49240.3_mincore
49240_fsm
49240_mincore
49240_vm

pgfadv_*

All those function call fadvise with the prefix as the flag and return the same columns :

For the specified relation (can be call with tableoid too) it return :

  • relpath : the relation path
  • block_size : The size in bytes of the filesystem block.
  • block_disk : the total number of file system blocks of the relation
  • block_free : the number of free blocks in memory (for caching).

pgfadv_WILLNEED

This function set WILLNEED flag on the current relation. It means that OS will try to load as much blocks as possible of the relation. Main idea is to preload files on server startup, perhaps using cache hit/miss ratio or most required relations/indexes.

Example
cedric=# select * from pgfadv_willneed('pgbench_accounts');
      relpath       | block_size | block_disk | block_free
--------------------+------------+------------+------------
 base/16384/16603   |       4096 |     262144 |       3744
 base/16384/16603.1 |       4096 |      65726 |       4236

pgfadv_WILLNEED_snapshot

This function set WILLNEED flag on each block which were in memory when pgmincore_snapshot was called .

Example
cedric=# select * from pgfadv_willneed_snapshot('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     105335
 base/16384/24598.1 |       4096 |      55318 |      50217

pgfadv_DONTNEED

This function set DONTNEED flag on the current relation. It means that OS will first unload blocks of the file if it need to free so memory. Main idea is to unload files when they are not usefull anymore (instead of perhaps more interesting blocks)

Example
cedric=# select * from pgfadv_dontneed('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     178743
 base/16384/24598.1 |       4096 |      55318 |     234078

pgfadv_NORMAL

This function set NORMAL flag on the current relation.

Example
cedric=# select * from pgfadv_NORMAL('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     233954
 base/16384/24598.1 |       4096 |      55318 |     233954

pgfadv_RANDOM

This function set RANDOM flag on the current relation.

Example
cedric=# select * from pgfadv_RANDOM('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     234233
 base/16384/24598.1 |       4096 |      55318 |     234233

pgfadv_SEQUENTIAL

This function set SEQUENTIAL flag on the current relation.

Example
cedric=# select * from pgfadv_SEQUENTIAL('pgbench_accounts');
      relpath       | block_size | block_disk | block_free 
--------------------+------------+------------+------------
 base/16384/24598   |       4096 |     262144 |     233985
 base/16384/24598.1 |       4096 |      55318 |     233985

Debug

You can debug the function with the following error level : DEBUG1 and DEBUG5

set client_min_messages TO debug1; -- debug5 is only usefull to trace each block

Notes

* linux 2.6.34 : POSIX_FADV_RANDOM now does not deactivate readahead 100% . see http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0141450f66c3c12a3aaa869748caa64241885cdf

  1. → impact O_DIRECT
  2. → improv

More

A talk about PgFincore has been scheduled at PgDay.eu.

A talk about PgFincore has been scheduled at PgCon10.

 
Haut de page
projects/pgfincore/v0.4.txt · Dernière modification : 2011/05/20 00:20 de 127.0.0.1
 
 
CC Attribution-Share Alike 4.0 International
chimeric.de = chi`s home Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0 Google Analytics