Troubleshooting a database that cannot start because the control file and datafile header SCNs are inconsistent: fuzzy SCN

Submitted by 霸气de小男生 on 2019-12-02 08:50:19

Reference:
https://www.askmaclean.com/archives/rman-06026-absolute_fuzzy_change.html
https://blog.csdn.net/songxixi/article/details/7010934


RMAN> run{
debug on;
set until time "to_date('2013-08-08 19:12:03','yyyy-mm-dd hh24:mi:ss')";
restore database ;
debug off;
}
2> 3> 4> 5> 6> 
RMAN-03036: Debugging set to level=9, types=ALL

RMAN-03023: executing command: SET until clause

RMAN-03090: Starting restore at 2013-08-15 10:19:14
RMAN-06009: using target database control file instead of recovery catalog
RMAN-08030: allocated channel: ORA_DISK_1
RMAN-08605: channel ORA_DISK_1: SID=661 instance=PTRDDB1 device type=DISK

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 08/15/2013 10:19:18
RMAN-06026: some targets not found - aborting restore
RMAN-06100: no channel to restore a backup or copy of datafile 7
RMAN-06100: no channel to restore a backup or copy of datafile 4
RMAN-06100: no channel to restore a backup or copy of datafile 3
RMAN-06100: no channel to restore a backup or copy of datafile 2
RMAN-06100: no channel to restore a backup or copy of datafile 1


RMAN> run{
debug on;
set until time "to_date('2013-08-08 19:12:03','yyyy-mm-dd hh24:mi:ss')";
restore database preview;
debug off;
}2> 3> 4> 5> 6> 

RMAN-03036: Debugging set to level=9, types=ALL

RMAN-03023: executing command: SET until clause

RMAN-03090: Starting restore at 2013-08-15 10:19:48
RMAN-12016: using channel ORA_DISK_1

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 08/15/2013 10:19:54
RMAN-06026: some targets not found - aborting restore
RMAN-06100: no channel to restore a backup or copy of datafile 7
RMAN-06100: no channel to restore a backup or copy of datafile 4
RMAN-06100: no channel to restore a backup or copy of datafile 3
RMAN-06100: no channel to restore a backup or copy of datafile 2
RMAN-06100: no channel to restore a backup or copy of datafile 1

 

 

 

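Checking ABSOLUTE_FUZZY_CHANGE# for the datafile backups shows that the problem datafiles 1, 2, 3, 4 and 7 all have a non-zero ABSOLUTE_FUZZY_CHANGE# that is higher than their CHECKPOINT_CHANGE#.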

 

 

RECID  FILE#  CHECKPOINT_TIME      CHECKPOINT_CHANGE#  ABSOLUTE_FUZZY_CHANGE#
 3934      3  2013-08-08 19:00:50           424443795               424454469
 3935      4  2013-08-08 19:00:50           424443795               424454521
 3936      7  2013-08-08 19:00:50           424443795               424456295
 3937      8  2013-08-08 19:00:50           424443795                       0
 3938      9  2013-08-08 19:00:50           424443795                       0
 3939      2  2013-08-08 19:00:50           424443795               424452449
 3940      5  2013-08-08 19:00:50           424443795                       0
 3941      1  2013-08-08 19:00:50           424443795               424453386
 3942     10  2013-08-08 19:00:50           424443795                       0
 3943      6  2013-08-08 19:00:50           424443795                       0
 3944     11  2013-08-08 19:00:50           424443795                       0
 3945      0  2013-08-08 19:00:50           424443795                       0
 3946      0  2013-08-08 19:00:50           424443795                       0
 3948      0  2013-08-08 19:00:50           424443795                       0
 3949      0  2013-08-08 19:27:03           424492130                       0

 

 

 

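For a restore with UNTIL TIME, RMAN treats a datafile backup as valid only if the point in time corresponding to its ABSOLUTE_FUZZY_CHANGE# is earlier than the specified until time; otherwise it skips that backup.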

 

 

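ABSOLUTE_FUZZY_CHANGE# is the highest SCN that the server process found in the data blocks it read during the RMAN backup. To keep the restored files consistent, the point that the restore targets must be no lower than both the CHECKPOINT_CHANGE# and the ABSOLUTE_FUZZY_CHANGE# of the backup.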

 

 

For details, see the My Oracle Support note Common Causes for RMAN-06023 and RMAN-06026 (Doc ID 1366610.1):

 

 

Backup start on T1 (SCN=1000) and ends on T2 (SCN=1050), than the backup can ONLY be used if the UNTIL SCN is 1050 or higher.
So if the ‘UNTIL TIME T2’ is converted to SCN 1045, than this backup will NOT be used.
V$BACKUP_DATAFILE / RC_BACKUP_DATAFILE is giving more info on this.
CHECKPOINT_CHANGE# corresponds with T1
ABSOLUTE_FUZZY_CHANGE# corresponds with T2. When ABSOLUTE_FUZZY_CHANGE# is NULL, than it is the same as the CHECKPOINT_CHANGE#
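
The rule quoted above can be written down as a small check. The following Python sketch only illustrates that rule and is not RMAN's actual code; the function names are made up for this example, and treating a fuzzy value of 0 like NULL is an assumption based on the listing earlier, where non-fuzzy files show 0.

```python
from typing import Optional


def backup_end_scn(checkpoint_change: int, absolute_fuzzy_change: Optional[int]) -> int:
    """SCN at which the backup of one datafile becomes consistent (T2 in the note)."""
    # NULL (None) means "same as CHECKPOINT_CHANGE#"; 0 is treated the same way
    # here, which is an assumption of this sketch.
    if not absolute_fuzzy_change:
        return checkpoint_change
    return max(checkpoint_change, absolute_fuzzy_change)


def backup_usable(until_scn: int, checkpoint_change: int,
                  absolute_fuzzy_change: Optional[int]) -> bool:
    """True if the backup may be used for a restore up to until_scn."""
    return until_scn >= backup_end_scn(checkpoint_change, absolute_fuzzy_change)


# The example from the note: the backup starts at SCN 1000 and ends at SCN 1050.
print(backup_usable(1050, 1000, 1050))  # True
print(backup_usable(1045, 1000, 1050))  # False
```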

 

 

 

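We tested a restore with UNTIL SCN, specifying an SCN no lower than the highest ABSOLUTE_FUZZY_CHANGE# (424456295, from datafile 7), and this works around the problem: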

 

 

RMAN> run
{
set until scn 424456295;
restore database preview;
}2> 3> 4> 5> 

executing command: SET until clause

Starting restore at 2013-08-15 13:30:39
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=649 instance=PTRDDB1 device type=DISK

List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time    
------- ---- -- ---------- ----------- ------------ -------------------
6719    Full    554.91M    DISK        00:02:38     2013-08-08 19:07:21
        BP Key: 6719   Status: AVAILABLE  Compressed: YES  Tag: TAG20130808T190511
        Piece Name: /export/home/oracle/rman/ptddb_before_3nogq6j8
  List of Datafiles in backup set 6719
  File LV Type Ckp SCN    Ckp Time            Name
  ---- -- ---- ---------- ------------------- ----
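
Why 424456295? It is the highest ABSOLUTE_FUZZY_CHANGE# in the V$BACKUP_DATAFILE listing shown earlier (datafile 7), so it appears to be the lowest SCN at which every datafile backup taken at checkpoint 424443795 is usable. A minimal Python sketch of that arithmetic, with the rows copied by hand from the listing (the variable names are illustrative):

```python
# (file#, checkpoint_change#, absolute_fuzzy_change#) copied from the listing
# for datafiles 1-11, all backed up at checkpoint 424443795.
rows = [
    (3, 424443795, 424454469),
    (4, 424443795, 424454521),
    (7, 424443795, 424456295),
    (8, 424443795, 0),
    (9, 424443795, 0),
    (2, 424443795, 424452449),
    (5, 424443795, 0),
    (1, 424443795, 424453386),
    (10, 424443795, 0),
    (6, 424443795, 0),
    (11, 424443795, 0),
]

# Files whose backup is fuzzy: blocks newer than the checkpoint were read.
fuzzy_files = sorted(f for f, ckp, fuzzy in rows if fuzzy > ckp)

# Lowest UNTIL SCN that is not below any checkpoint or fuzzy SCN.
min_until_scn = max(max(ckp, fuzzy) for _, ckp, fuzzy in rows)

print(fuzzy_files)    # [1, 2, 3, 4, 7]
print(min_until_scn)  # 424456295
```

The fuzzy files are exactly the datafiles named in the RMAN-06100 messages.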

 

 

 

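However, the restore fails when the time corresponding to that SCN is used instead.

Dumping the log file gives the times that correspond to the SCNs: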

 

 

scn 424456290        194CB062    194cb062   08/08/2013 19:07:00
scn 424464586        194CD0CA   194cd0ca   08/08/2013 19:12:03
scn 424456295        194CB067    194cb067   08/08/2013 19:07:00  ==> time of the SCN that worked earlier
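
In these dump lines the second and third columns are the same SCN in hexadecimal. A quick Python check of the conversion (the helper name is only for illustration):

```python
def scn_from_hex(hex_scn: str) -> int:
    """Convert a hexadecimal SCN as printed in the dump to decimal."""
    return int(hex_scn, 16)


for hex_scn in ("194CB062", "194CD0CA", "194CB067"):
    print(hex_scn, scn_from_hex(hex_scn))
# 194CB062 424456290
# 194CD0CA 424464586
# 194CB067 424456295
```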

 

 

 

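The SCN that worked, 424456295, corresponds to 08/08/2013 19:07:00, yet SET UNTIL TIME with a timestamp around that moment (19:07:20 in the run below) still raises the error: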

 

 

 

 RMAN>  run{
set until time "to_date('2013-08-08 19:07:20','yyyy-mm-dd hh24:mi:ss')";
restore database preview;
}2> 3> 4> 

executing command: SET until clause

Starting restore at 2013-08-15 13:28:41
using channel ORA_DISK_1

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 08/15/2013 13:28:42
RMAN-06026: some targets not found - aborting restore
RMAN-06100: no channel to restore a backup or copy of datafile 7
RMAN-06100: no channel to restore a backup or copy of datafile 4
RMAN-06100: no channel to restore a backup or copy of datafile 3
RMAN-06100: no channel to restore a backup or copy of datafile 2
RMAN-06100: no channel to restore a backup or copy of datafile 1

 

 

 

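Before the database is open there is no strict mapping table between SCNs and times, so for SET UNTIL TIME RMAN can only estimate the SCN that corresponds to the given time. This matters most for a timestamp very close to the end of a backup: if that timestamp is converted to an SCN lower than the ABSOLUTE_FUZZY_CHANGE# of a datafile in the backup, the backup becomes unusable.
The issue is described in Common Causes for RMAN-06023 and RMAN-06026 (Doc ID 1366610.1):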
When an SET UNTIL TIME is being used, RMAN will convert it to an UNTIL SCN. This is an estimate as there is NO hard relation between a timestamp and an SCN.

RMAN is making an estimate. Especially when a timestamp is used which is close to the end-time of the backup, than this might be an issue. If the conversion to an SCN is generating an SCN which is BEFORE the end fuzziness of the datafiles in the backup, than the backup can NOT be used.

 

 

 

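The hot backup procedure itself is not explained in detail here; instead, let's look at one specific problem that comes up.

If DBWR happens to be writing a block while we are copying that same block, the copy we get may be inconsistent. How does Oracle keep this from causing trouble at recovery time?

1. BEGIN BACKUP triggers a checkpoint, so that as many dirty blocks as possible are written to the file.

2. For a file in BEGIN BACKUP state, the first time a block is changed the whole block is written to the redo log, not just the change vector. At recovery time, redo can then be applied starting from this complete block image to recover the block's data.

3. The datafile keeps the SCN of the last complete checkpoint, and this SCN does not change during the backup, until END BACKUP. At recovery time Oracle therefore knows from which point blocks need to be rolled forward.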

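When a backup copies a block, the OS block size and the Oracle block size differ (for example, 512 bytes for the OS and 8 KB for Oracle). The copy might therefore read four OS blocks, that is 2 KB, at a time, so one Oracle block takes four I/Os to copy. The data copied by these four I/Os may be inconsistent (for example, the Oracle block is changed after the first I/O), in which case the copied block cannot be used at recovery time. This is called a fractured block.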
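
The fractured-block scenario can be reproduced with a toy simulation. This Python sketch uses the sizes from the example above (an 8 KB block copied in four 2 KB reads) and a hypothetical writer that rewrites the whole block after the first read. It models only the timing problem, not how any real copy tool works.

```python
BLOCK_SIZE = 8192   # Oracle block size from the example
CHUNK_SIZE = 2048   # four 512-byte OS blocks per read


def copy_block(datafile: bytearray, change_after_read=None) -> bytes:
    """Copy one block chunk by chunk; optionally let a writer run in between."""
    copied = bytearray()
    for read_no, offset in enumerate(range(0, BLOCK_SIZE, CHUNK_SIZE), start=1):
        copied += datafile[offset:offset + CHUNK_SIZE]
        if read_no == change_after_read:
            # Simulated DBWR write: the whole block gets a new version.
            datafile[:] = b"B" * BLOCK_SIZE
    return bytes(copied)


def is_fractured(block: bytes) -> bool:
    """A block is fractured if it mixes bytes from different versions."""
    return len(set(block)) > 1


quiet = copy_block(bytearray(b"A" * BLOCK_SIZE))
torn = copy_block(bytearray(b"A" * BLOCK_SIZE), change_after_read=1)

print(is_fractured(quiet))  # False
print(is_fractured(torn))   # True
```

In the torn copy the first 2 KB come from the old version and the remaining 6 KB from the new one.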
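When a tablespace is put into BEGIN BACKUP state, all files of that tablespace are checkpointed: every dirty block of these files is linked onto the LRU-XR list, which is handed to DBWR so that the dirty blocks are written to the files.
After that, the first modification of a block writes the whole block to the log buffer rather than only the change vector. At recovery time the fractured block is then not used; the block image in the redo log serves as the base for rolling forward.
This shows that modifying a datafile during a hot backup costs more than usual. We should therefore keep each datafile in BEGIN BACKUP state for as short a time as possible, which reduces the amount of redo generated.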
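
A rough model of the mechanism just described: the first change to a block after BEGIN BACKUP logs the whole block, later changes log only change vectors, and recovery rebuilds the block from the logged image instead of the possibly fractured backup copy. The Python sketch below is deliberately simplified and rests on assumptions: a block is a dict of column values, a change vector is a single column assignment, and the image is logged before the change is applied. None of these structures corresponds to Oracle's real redo format.

```python
class HotBackupRedo:
    """Toy redo log for one block while its file is in BEGIN BACKUP state."""

    def __init__(self, block: dict):
        self.block = dict(block)
        self.redo = []               # ("image", dict) or ("vector", (column, value))
        self.image_logged = False

    def change(self, column: str, value) -> None:
        if not self.image_logged:
            # First change since BEGIN BACKUP: log the complete block.
            self.redo.append(("image", dict(self.block)))
            self.image_logged = True
        self.redo.append(("vector", (column, value)))
        self.block[column] = value


def recover(backup_copy: dict, redo: list) -> dict:
    """Roll forward; a logged block image replaces the backup copy."""
    block = dict(backup_copy)
    for kind, payload in redo:
        if kind == "image":
            block = dict(payload)
        else:
            column, value = payload
            block[column] = value
    return block


live = HotBackupRedo({"A": 1, "B": 5})
live.change("A", 5)
live.change("B", 6)

fractured_copy = {"A": "garbage", "B": None}   # stand-in for a torn block
print([kind for kind, _ in live.redo])         # ['image', 'vector', 'vector']
print(recover(fractured_copy, live.redo))      # {'A': 5, 'B': 6}
```

The extra "image" entry is also why redo volume grows while a file stays in BEGIN BACKUP state.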
 

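From Oracle's point of view, I think the approach of re-copying is unrealistic. First, re-copying cannot avoid fractured blocks either, unless the whole block is locked. Second, it is inefficient: by the time the copy finishes, all the blocks may have changed again. (What Oracle actually does differs from an OS copy command: RMAN runs as an Oracle session and can read a data block as a unit. When RMAN backs up a file, it sets the absolute fuzzy flag and the absolute fuzzy SCN in the file header, and other backups of the file are not allowed during this time. The absolute fuzzy SCN is initially empty. As RMAN reads each buffer, it compares the buffer's SCN with the current fuzzy SCN and, if the buffer's SCN is higher, replaces the fuzzy SCN with it. When the whole file has been backed up, the fuzzy SCN in the file header is therefore the highest SCN in the file. At the end of the backup, the absolute fuzzy SCN is compared with the checkpoint SCN. If the checkpoint SCN is already higher than the fuzzy SCN, all the data needed to recover the file to this point has been written to disk, and the fuzzy bit is simply cleared. If the checkpoint SCN is still lower, the fuzzy SCN is kept.) (The SCNs involved are explained in my other posts, so the detailed flow is not repeated here.)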
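
The bookkeeping described in the parentheses can be sketched in a few lines. This Python model follows the description above only; the function name is hypothetical, and returning 0 when the fuzzy bit is cleared is an assumption chosen to match the V$BACKUP_DATAFILE listing earlier, where non-fuzzy files show 0.

```python
from typing import Iterable, Optional


def absolute_fuzzy_after_backup(block_scns: Iterable[int], checkpoint_scn: int) -> int:
    """Return the fuzzy SCN recorded for one datafile backup (0 if not fuzzy)."""
    fuzzy_scn: Optional[int] = None          # initially empty
    for block_scn in block_scns:             # one SCN per block read
        if fuzzy_scn is None or block_scn > fuzzy_scn:
            fuzzy_scn = block_scn            # keep the highest SCN seen
    # End of backup: if the checkpoint already covers every block read,
    # the fuzzy bit is cleared; otherwise the fuzzy SCN is kept.
    if fuzzy_scn is None or checkpoint_scn >= fuzzy_scn:
        return 0
    return fuzzy_scn


# Checkpoint SCN taken from the listing; the block SCNs are made-up samples.
print(absolute_fuzzy_after_backup([424440000, 424443795], 424443795))             # 0
print(absolute_fuzzy_after_backup([424440000, 424456295, 424450000], 424443795))  # 424456295
```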

Here is an experiment.

As SYS:

SQL> ALTER TABLESPACE INDX BEGIN BACKUP;

Tablespace altered.

SQL> select max(ktuxescnw * power(2, 32) + ktuxescnb) from x$ktuxe;

MAX(KTUXESCNW*POWER(2,32)+KTUXESCNB)
------------------------------------
                            18074167

 

Then, as SCOTT:

SQL> UPDATE TBK SET A=5 WHERE B=5;

1 row updated.

SQL> COMMIT;

Commit complete.

Then, as SYS:

SQL> select max(ktuxescnw * power(2, 32) + ktuxescnb) from x$ktuxe;

MAX(KTUXESCNW*POWER(2,32)+KTUXESCNB)
------------------------------------
                            18074182

SQL> alter system dump logfile 'd:\oracle\oradata\ora92\redo01.log' scn min 18074167 SCN MAX 18074182;

System altered.

The redo log dump follows:

REDO RECORD - Thread:1 RBA: 0x0000ad.0000dfdc.0010 LEN: 0x0048 VLD: 0x02
SCN: 0x0000.0113ca40 SUBSCN:  1 03/26/2008 11:44:54
CHANGE #1 MEDIA RECOVERY MARKER SCN:0x0000.00000000 SEQ:  0 OP:23.1  ------ a routine record (blocks written to disk)
 Block Written - afn: 2 rdba: 0x00809528(2,38184) 
                   scn: 0x0000.0113b6bb seq: 0x01 flg:0x04
 Block Written - afn: 2 rdba: 0x00806e6d(2,28269)
                   scn: 0x0000.0113b6ab seq: 0x01 flg:0x04
 
REDO RECORD - Thread:1 RBA: 0x0000ad.0000dfdd.0010 LEN: 0x1018 VLD: 0x01
SCN: 0x0000.0113ca44 SUBSCN:  1 03/26/2008 11:45:03
CHANGE #1 TYP:3 CLS: 1 AFN:5 DBA:0x01403606 SCN:0x0000.0113c851 SEQ:  1 OP:18.1     ----- this one is new: a log block image


Log block image redo entry
Dump of memory from 0x047C0220 to 0x047C1208
47C0220 00000001 00007D75 0113C664 00000000  [....u}..d.......]
47C0230 00320002 01403601 000E0006 000074AA  [..2..6@......t..]
47C0240 00805FC4 002006DE 00008000 01137D68  [._.... .....h}..]
47C0250 000B0003 0000765F 00800101 002B0787  [...._v........+.]
47C0260 00002001 0113C851 00000000 00000000  [. ..Q...........]
47C0270 00050100 001CFFFF 0F4F0F6B 00000F4F  [........k.O.O...]
47C0280 0F8F0005 0F7D0F86 0F6B0F74 00000000  [......}.t.k.....]
47C0290 00000000 00000000 00000000 00000000  [................]
        Repeat 243 times
47C11D0 00000000 00000000 2C000000 C1020200  [...........,....]
47C11E0 07C10202 0202002C C10202C1 02022C06  [....,........,..]
47C11F0 0204C102 002C05C1 04C10202 2C04C102  [......,........,]
47C1200 C1020200 03C10203                    [........]        
Dump of memory from 0x047C1208 to 0x047C1209
47C1200                   00000006                   [....]    
 
REDO RECORD - Thread:1 RBA: 0x0000ad.0000dfe5.00a8 LEN: 0x01ec VLD: 0x01
SCN: 0x0000.0113ca44 SUBSCN:  1 03/26/2008 11:45:03
CHANGE #1 TYP:0 CLS:33 AFN:2 DBA:0x00800111 SCN:0x0000.0113ca11 SEQ:  1 OP:5.2
ktudh redo: slt: 0x0010 sqn: 0x00007663 flg: 0x0012 siz: 128 fbi: 0
            uba: 0x00800247.072a.2b    pxid:  0x0000.000.00000000
CHANGE #2 TYP:0 CLS:34 AFN:2 DBA:0x00800247 SCN:0x0000.0113ca10 SEQ:  1 OP:5.1
ktudb redo: siz: 128 spc: 654 flg: 0x0012 seq: 0x072a rec: 0x2b
            xid:  0x0009.010.00007663  
ktubl redo: slt: 16 rci: 0 opc: 11.1 objn: 32117 objd: 32117 tsn: 5
Undo type:  Regular undo        Begin trans    Last buffer split:  No 
Temp Object:  No 
Tablespace Undo:  No 
             0x00000000  prev ctl uba: 0x00800247.072a.2a 
prev ctl max cmt scn:  0x0000.0113c99e  prev tx cmt scn:  0x0000.0113c9a2 
KDO undo record:
KTB Redo 
op: 0x04  ver: 0x01  
op: L  itl: xid:  0x0006.00e.000074aa uba: 0x00805fc4.06de.20
                      flg: C---    lkc:  0     scn: 0x0000.01137d68
KDO Op code: URP row dependencies Disabled
  xtype: XA  bdba: 0x01403606  hdba: 0x01403603
itli: 1  ispac: 0  maxfr: 2401
tabn: 0 slot: 3(0x3) flag: 0x2c lock: 0 ckix: 0   ----- undo information (col 0 below, c1 02, is decimal 1: the value of this column before the change)
ncol: 2 nnew: 1 size: 0
col  0: [ 2]  c1 02
CHANGE #3 TYP:2 CLS: 1 AFN:5 DBA:0x01403606 SCN:0x0000.0113c851 SEQ:  1 OP:11.5   ----- this is a standard UPDATE
KTB Redo 
op: 0x01  ver: 0x01  
op: F  xid:  0x0009.010.00007663    uba: 0x00800247.072a.2b
KDO Op code: URP row dependencies Disabled
  xtype: XA  bdba: 0x01403606  hdba: 0x01403603
itli: 1  ispac: 0  maxfr: 2401
tabn: 0 slot: 3(0x3) flag: 0x2c lock: 1 ckix: 0     ---- slot 3 (the fourth row), the row we just updated
ncol: 2 nnew: 1 size: 0
col  0: [ 2]  c1 06   ---- decimal 5, the new value we set
CHANGE #4 MEDIA RECOVERY MARKER SCN:0x0000.00000000 SEQ:  0 OP:5.19
session number   = 8
serial  number   = 5
current username = SCOTT
login   username = SCOTT
client info      = 
OS username      = JACKSONXU\jackson xu
Machine name     = WORKGROUP\JACKSONXU
OS terminal      = JACKSONXU
OS process id    = 3868:3988
OS program name  = sqlplus.exe
transaction name = 
 
REDO RECORD - Thread:1 RBA: 0x0000ad.0000dfe7.0010 LEN: 0x0054 VLD: 0x01
SCN: 0x0000.0113ca46 SUBSCN:  1 03/26/2008 11:45:06
CHANGE #1 TYP:0 CLS:33 AFN:2 DBA:0x00800111 SCN:0x0000.0113ca44 SEQ:  1 OP:5.4
ktucm redo: slt: 0x0010 sqn: 0x00007663 srt: 0 sta: 9 flg: 0x2 
ktucf redo: uba: 0x00800247.072a.2b ext: 2 spc: 524 fbi: 0 
END OF REDO DUMP
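
Two encodings in the dump above are worth decoding by hand. SCNs are printed as wrap.base in hexadecimal, which maps to a decimal SCN with the same wrap * 2^32 + base formula used in the X$KTUXE query; column values such as c1 06 use Oracle's internal NUMBER format. The Python sketch below handles only zero and positive numbers, and both function names are invented for this example.

```python
from decimal import Decimal


def scn_from_dump(text: str) -> int:
    """Convert an SCN printed as 'wrap.base' in hex, e.g. '0x0000.0113ca44'."""
    wrap_hex, base_hex = text.split(".")
    return int(wrap_hex, 16) * 2**32 + int(base_hex, 16)


def decode_number(raw: bytes) -> Decimal:
    """Decode a zero or positive Oracle NUMBER from its internal bytes."""
    if raw == b"\x80":
        return Decimal(0)
    if raw[0] < 0x80:
        raise ValueError("negative numbers are not handled in this sketch")
    exponent = raw[0] - 193          # power of 100 of the first digit
    value = Decimal(0)
    for position, byte in enumerate(raw[1:]):
        digit = byte - 1             # each base-100 digit is stored plus one
        value += digit * Decimal(100) ** (exponent - position)
    return value


print(scn_from_dump("0x0000.0113ca40"))       # 18074176
print(scn_from_dump("0x0000.0113ca46"))       # 18074182
print(decode_number(bytes.fromhex("c102")))   # 1
print(decode_number(bytes.fromhex("c106")))   # 5
```

Both SCNs fall inside the 18074167 to 18074182 range requested in the dump command.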

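In fact, as long as Oracle puts the complete image of the changed block into the redo log, there is no problem: recovery will not fail because of a fractured block.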


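----------------
Copyright notice: this is an original article by CSDN blogger 「太阳上有风」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/songxixi/article/details/7010934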

 

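Recommendations:

For RESTORE DATABASE/DATAFILE, prefer SET UNTIL SCN with an explicit SCN. The SCN should be no lower than both the CHECKPOINT_CHANGE# and the ABSOLUTE_FUZZY_CHANGE# of the backup.
ABSOLUTE_FUZZY_CHANGE# can be read from the V$BACKUP_DATAFILE view.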
