<pre class='metadata'>
Group: AOM
Status: WGD
Title: Immersive Audio Model and Formats
Editor: SungHee Hwang, Samsung, [email protected]
Editor: Felicia Lim, Google, [email protected]
Repository: AOMediaCodec/iamf
Shortname: iamf
URL: https://aomediacodec.github.io/iamf/
Date: 2023-07-17
Abstract: This document specifies an immersive audio (IA) model, a standalone IA sequence format and an [[!ISOBMFF]]-based IA container format.
Local Boilerplate: footer yes
</pre>
<pre class="anchors">
url: https://www.iso.org/standard/68960.html#; spec: ISOBMFF; type: dfn;
text: AudioSampleEntry
text: grouping_type
text: channelcount
text: samplerate
text: roll_distance
text: SamplingRateBox
url: https://www.iso.org/standard/68960.html#; spec: ISOBMFF; type: property;
text: iso6
text: stsd
text: edts
text: stts
text: roll
text: elst
text: trun
text: ctts
url: https://aomedia.org/av1/specification/conventions/; spec: AV1-Convention; type: dfn;
text: leb128()
text: Clip3
url: https://www.iso.org/standard/43345.html#; spec: AAC; type: dfn;
text: raw_data_block()
text: ADTS
text: Low Complexity Profile
url: https://opus-codec.org/docs/opus_in_isobmff.html#; spec: OPUS-IN-ISOBMFF; type: dfn;
text: OpusSpecificBox
text: OutputChannelCount
text: OutputGain
text: ChannelMappingFamily
text: PreSkip
text: InputSampleRate
url: https://opus-codec.org/docs/opus_in_isobmff.html#; spec: OPUS-IN-ISOBMFF; type: property;
text: Opus
text: dOps
url: https://www.iso.org/standard/55688.html#; spec: MP4-Systems; type: dfn;
text: objectTypeIndication
text: streamType
text: upstream
text: decSpecificInfo()
text: DecoderConfigDescriptor()
text: Syntactic Description Language
url: https://www.iso.org/standard/76383.html#; spec: MP4-Audio; type: dfn;
text: AudioSpecificConfig()
text: audioObjectType
text: channelConfiguration
text: GASpecificConfig()
text: frameLengthFlag
text: dependsOnCoreCoder
text: extensionFlag
text: samplingFrequencyIndex
url: https://www.iso.org/standard/79110.html#; spec: MP4; type: dfn;
text: ESDBox
url: https://www.iso.org/standard/79110.html#; spec: MP4; type: property;
text: mp4a
text: esds
url: https://tools.ietf.org/html/rfc6381#; spec: RFC6381; type: property;
text: codecs
url: https://tools.ietf.org/html/rfc8486#; spec: RFC8486; type: dfn;
text: channel count
url: https://tools.ietf.org/html/rfc7845#; spec: RFC7845; type: dfn;
text: ID Header
text: Magic Signature
text: Output Channel Count
text: Output Gain
text: Pre-skip
url: https://tools.ietf.org/html/rfc6716#; spec: RFC6716; type: dfn;
text: Opus packet
url: https://tools.ietf.org/html/rfc8174#; spec: RFC8174; type: property;
text:
url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf#; spec: ITU1770-4; type: dfn;
text: LKFS
url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf#; spec: ITU2051-3; type: dfn;
text: Loudspeaker configuration for Sound System A (0+2+0)
text: Loudspeaker configuration for Sound System B (0+5+0)
text: Loudspeaker configuration for Sound System C (2+5+0)
text: Loudspeaker configuration for Sound System D (4+5+0)
text: Loudspeaker configuration for Sound System E (4+5+1)
text: Loudspeaker configuration for Sound System F (3+7+0)
text: Loudspeaker configuration for Sound System G (4+9+0)
text: Loudspeaker configuration for Sound System H (9+10+3)
text: Loudspeaker configuration for Sound System I (0+7+0)
text: Loudspeaker configuration for Sound System J (4+7+0)
text: SP Label
url: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2127-0-201906-I!!PDF-E.pdf#; spec: ITU2127-0; type: dfn;
text:
url: https://en.wikipedia.org/wiki/Q_(number_format); spec: Q-Format; type: dfn;
text:
url: https://xiph.org/flac/format.html; spec: FLAC; type: dfn;
text: METADATA_BLOCK
text: METADATA_BLOCK_STREAMINFO
text: FRAME
text: FRAME_HEADER
text: SUBFRAME
text: FRAME_FOOTER
text: minimum block size
text: maximum block size
text: minimum frame size
text: maximum frame size
text: number of channels
text: MD5 signature
text: Block size in inter-channel samples
text: Sample rate
text: Channel assignment
text: Sample size in bits
url: https://xiph.org/flac/format.html; spec: FLAC; type: property;
text: fLaC
url: https://www.iso.org/standard/77752.html#; spec: MP4-PCM; type: dfn;
text: format_flags
text: PCM_sample_size
url: https://www.iso.org/standard/77752.html#; spec: MP4-PCM; type: property;
text: ipcm
</pre>
<pre class='biblio'>
{
"AI-CAD-Mixing": {
"title": "AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework",
"status": "Paper",
"publisher": "AES",
"href": "https://www.aes.org/e-lib/browse.cfm?elib=21489"
},
"AAC": {
"title": "Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC)",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/43345.html"
},
"MP4-Audio": {
"title": "Information technology — Coding of audio-visual objects — Part 3: Audio",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/76383.html"
},
"MP4-Systems": {
"title": "Information technology — Coding of audio-visual objects — Part 1: Systems",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/55688.html"
},
"OPUS-IN-ISOBMFF": {
"title": "Encapsulation of Opus in ISO Base Media File Format",
"status": "Best Practice",
"publisher": "IETF",
"href": "https://opus-codec.org/docs/opus_in_isobmff.html"
},
"ISOIEC-23091-3-2018": {
"title": "Information Technology - Coding-Independent Code Points - Part 3: Audio",
"status" : "Standard",
"publisher" : "ISO/IEC",
"href" : "https://www.iso.org/standard/73413.html"
},
"ITU1770-4": {
"title": "Algorithms to measure audio programme loudness and true-peak audio level",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf"
},
"ITU2051-3": {
"title": "Advanced sound system for programme production",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf"
},
"Q-Format": {
"title": "Q (number format)",
"status": "Best Practice",
"publisher": "Wikipedia",
"href": "https://en.wikipedia.org/wiki/Q_(number_format)"
},
"BCP47": {
"title": "BCP 47",
"status": "Best Practice",
"publisher": "IETF",
"href": "https://www.rfc-editor.org/info/bcp47"
},
"FLAC": {
"title": "Free Lossless Audio Codec",
"status": "Best Practice",
"publisher": "xiph.org",
"href": "https://xiph.org/flac/format.html"
},
"AV1-Convention": {
"title": "Conventions",
"status": "Spec",
"publisher": "aomedia.org",
"href": "https://aomedia.org/av1/specification/conventions/"
},
"ITU2127-0": {
"title": "Audio Definition Model renderer for advanced sound systems",
"status": "Standard",
"publisher": "ITU",
"href": "https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2127-0-201906-I!!PDF-E.pdf"
},
"EBU-Tech-3396": {
"title": "BINAURAL EBU ADM RENDERER (BEAR) FOR OBJECT-BASED SOUND OVER HEADPHONES",
"status": "Spec",
"publisher": "EBU",
"href": "https://tech.ebu.ch/publications/tech3396"
},
"Resonance-Audio": {
"title": "Efficient Encoding and Decoding of Binaural Sound with Resonance Audio",
"status": "Paper",
"publisher": "AES",
"href": "https://www.aes.org/e-lib/browse.cfm?elib=20446"
},
"AmbiX": {
"title": "AMBIX - A SUGGESTED AMBISONICS FORMAT",
"status": "Paper",
"publisher": "Ambisonics Symposium, June 2011",
"href": "https://iem.kug.ac.at/fileadmin/media/iem/projects/2011/ambisonics11_nachbar_zotter_sontacchi_deleflie.pdf"
},
"MP4-PCM": {
"title": "Information technology — MPEG audio technologies — Part 5: Uncompressed audio in MPEG-4 file format",
"status": "Standard",
"publisher": "ISO/IEC",
"href": "https://www.iso.org/standard/77752.html"
}
}
</pre>
# Introduction # {#introduction}
This specification defines an immersive audio model and formats (IAMF) to provide an immersive audio experience to end-users.
- The term <dfn noexport>Immersive Audio</dfn> (IA) means the combination of [=3D audio signal=]s recreating a sound experience close to that of a natural environment.
- The term <dfn noexport>3D audio signal</dfn> means a representation of sound that incorporates additional information beyond traditional stereo or surround sound formats. Examples include Ambisonics (scene-based), object-based audio, and channel-based audio (e.g., 3.1.2ch or 7.1.4ch).
IAMF is used to provide [=Immersive Audio=] content for presentation on a wide range of devices in both streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g., headphones, mobile phones, tablets, TVs, sound bars, home theater systems, and big screens.
Here are some typical IAMF use cases and examples of how to instantiate the model for the use cases.
- UC1: One [=Audio Element=] (e.g., 3.1.2ch or First Order Ambisonics (FOA)) is delivered to a big-screen TV (in a home) or a mobile device through a unicast network. It is rendered to a loudspeaker layout (e.g., 3.1.2ch) or headphones with loudness normalization, and is played back on loudspeakers built into the big-screen TV or headphones connected to the mobile device, respectively.
- UC2: Two [=Audio Element=]s (e.g., 5.1.2ch and Stereo) are delivered to a big-screen TV through a unicast network. Both are rendered to the same loudspeaker layout built into the big-screen TV and are mixed. After applying loudness normalization appropriate to the home environment, the [=Rendered Mix Presentation=] is played back on the loudspeakers.
- UC3: Two [=Audio Element=]s (e.g., FOA and Non-diegetic Stereo) are delivered to a mobile device through a unicast network. FOA is rendered to Binaural (or Stereo) and Non-diegetic Stereo is rendered to Stereo. After they are mixed, the [=Rendered Mix Presentation=] is processed with loudness normalization and is played back on headphones through the mobile device.
Example 1: UC1 with [=3D audio signal=] = 3.1.2ch.
- Audio Substream: The left (L) and right (R) channels are coded as one audio stream, the left top front (Ltf) and right top front (Rtf) channels as one audio stream, the Center channel as one audio stream, and the low-frequency effects (LFE) channel as one audio stream.
- Audio Element (3.1.2ch): Consists of 4 Audio Substreams which are grouped into one [=ChannelGroup=].
- Mix Presentation: Provides rendering algorithms for rendering the Audio Element to popular loudspeaker layouts and headphones, and the loudness information of the [=3D audio signal=].
Example 2: UC2 with two [=3D audio signal=]s = 5.1.2ch and Stereo.
- Audio Substream: The L and R channels are coded as one audio stream, the left surround (Ls) and right surround (Rs) channels as one audio stream, the Ltf and Rtf channels as one audio stream, the Center channel as one audio stream, and the LFE channel as one audio stream.
- Audio Element 1 (5.1.2ch): Consists of 5 Audio Substreams which are grouped into one [=ChannelGroup=].
- Audio Element 2 (Stereo): Consists of 1 Audio Substream which is grouped into one [=ChannelGroup=].
- Parameter Substream 1-1: Contains mixing parameter values that are applied to Audio Element 1 by considering the home environment.
- Parameter Substream 1-2: Contains mixing parameter values that are applied to Audio Element 2 by considering the home environment.
- Mix Presentation: Provides rendering algorithms for rendering Audio Elements 1 & 2 to popular loudspeaker layouts, mixing information based on Parameter Substreams 1-1 & 1-2, and loudness information of the [=Rendered Mix Presentation=].
Example 3: UC3 with two [=3D audio signal=]s = first order Ambisonics (FOA) and Non-diegetic Stereo.
- Audio Substream: The L and R channels are coded as one audio stream and each channel of the FOA signal as one audio stream.
- Audio Element 1 (FOA): Consists of 4 Audio Substreams which are grouped into one [=ChannelGroup=].
- Audio Element 2 (Non-diegetic Stereo): Consists of 1 Audio Substream which is grouped into one [=ChannelGroup=].
- Parameter Substream 1-1: Contains mixing parameter values that are applied to Audio Element 1 by considering the mobile environment.
- Parameter Substream 1-2: Contains mixing parameter values that are applied to Audio Element 2 by considering the mobile environment.
- Mix Presentation: Provides rendering algorithms for rendering Audio Elements 1 & 2 to popular loudspeaker layouts and headphones, mixing information based on Parameter Substreams 1-1 & 1-2, and loudness information of the [=Rendered Mix Presentation=].
# Immersive Audio Model # {#iamodel}
This specification defines a model for representing [=Immersive Audio=] contents based on [=Audio Substream=]s contributing to [=Audio Element=]s meant to be rendered and mixed to form one or more presentations as depicted in the figure below.
<center><img src="images/decoding_flow_cropped.svg" width="800"></center>
<center><figcaption>Processing flow to decode, reconstruct, render, and mix the 3D audio signals for immersive audio playback.</figcaption></center>
The model comprises a number of coded [=Audio Substream=]s and the metadata that describes how to decode, render and mix the [=Audio Substream=]s for playback. The model itself is codec-agnostic; any supported audio codec may be used to code the [=Audio Substream=]s.
The model includes one or more [=Audio Element=]s, each of which consists of one or more [=Audio Substream=]s. The [=Audio Substream=]s that make up an [=Audio Element=] are grouped into one or more [=ChannelGroup=]s. The model further includes [=Mix Presentation=]s and [=Parameter Substream=]s.
## Terminology ## {#terminology}
The term <dfn noexport>Audio Substream</dfn> means a sequence of audio samples, which may be encoded with any compatible audio codec.
The term <dfn noexport>Audio Element</dfn> means a [=3D audio signal=], and is constructed from one or more [=Audio Substream=]s and the metadata describing them. The [=Audio Substream=]s associated with one [=Audio Element=] use the same audio codec.
The term <dfn noexport>ChannelGroup</dfn> means a set of [=Audio Substream=](s) which is(are) able to provide a spatial resolution of audio contents by itself or which is(are) able to provide an enhanced spatial resolution of audio contents by combining with the preceding [=ChannelGroup=]s.
The term <dfn noexport>Parameter Substream</dfn> means a sequence of parameter values that are associated with the algorithms used for reconstructing, rendering, and mixing. It is applied to its associated [=Audio Element=] or [=Mix Presentation=].
- [=Parameter Substream=]s may change their values over time and may further be animated; for example, any changes in values may be smoothed over some time duration. As such, they may be viewed as a 1D signal with different metadata specified for different time durations.
The term <dfn noexport>Mix Presentation</dfn> means a series of processes to present [=Immersive Audio=] contents to end-users by using [=Audio Element=](s). It contains metadata that describes how the [=Audio Element=](s) is(are) rendered and mixed together for playback through physical loudspeakers or headphones, and loudness information.
The term <dfn noexport>Rendered Mix Presentation</dfn> means a [=3D audio signal=] after the [=Audio Element=](s) defined in a [=Mix Presentation=] is(are) rendered and mixed together for playback through physical loudspeakers or headphones.
## Architecture ## {#architecture}
Based on the model, this specification defines the immersive audio model and formats (<dfn noexport>IAMF</dfn>) architecture as depicted in the figure below.
<center><img src="images/Hypothetical IAMF Architecture.png" style="width:100%; height:auto;"></center>
<center><figcaption>IAMF Architecture</figcaption></center>
For a given input 3D audio,
- A Pre-Processor generates [=ChannelGroup=](s), [=Descriptors=] and [=Parameter Substream=](s).
- A Codec Enc generates coded [=Audio Substream=](s).
- An OBU Packetizer generates an [=IA Sequence=] from the coded [=Audio Substream=](s) and [=Descriptors=] and [=Parameter Substream=](s).
- A File Packager (ISOBMFF Encapsulation) generates an IAMF File by encapsulating the [=IA Sequence=] into [[!ISOBMFF]] track(s).
- A File Parser (ISOBMFF Parser) reconstructs the [=IA Sequence=] by decapsulating the IAMF File.
- An OBU Parser outputs the coded [=Audio Substream=](s) and the [=Parameter Substream=](s).
- A Codec Dec outputs decoded [=ChannelGroup=](s) after decoding of the coded [=Audio Substream=](s).
- A Post-Processor outputs [=Immersive Audio=] by using the [=ChannelGroup=](s), the [=Descriptors=] and the [=Parameter Substream=](s).
- Pre-Processor, [=ChannelGroup=](s), Codec Enc and OBU Packetizer are defined in [[#iamfgeneration]].
- [=IA Sequence=] is defined in [[#iasequence]].
- ISOBMFF Encapsulation, IAMF file (ISOBMFF file), and ISOBMFF Parser are defined in [[#isobmff]].
- OBU Parser, Codec Dec, and Post-Processor are defined in [[#processing]].
## Bitstream Structure ## {#bitstream}
### IA Sequence ### {#iasequence}
An [=IA Sequence=] is a bitstream to represent [=Immersive Audio=] contents and consists of [=Descriptors=] and [=IA Data=].
The metadata in the [=Descriptors=] and [=IA Data=] are packetized into individual Open Bitstream Units (OBUs). An Open Bitstream Unit (OBU) is the concrete, physical unit used to represent the components in the model.
### Use of OBU ### {#use-of-obu}
#### Descriptors #### {#bitstream-descriptors}
<dfn noexport>Descriptors</dfn> contain all the information that is required to set up and configure the decoders, reconstruction algorithm, renderers, and mixers. [=Descriptors=] do not contain audio signals.
- <dfn noexport>IA Sequence Header OBU</dfn> indicates the start of a full [=IA Sequence=] description and contains information related to profiles.
- <dfn noexport>Codec Config OBU</dfn> provides information to set up a decoder for a coded [=Audio Substream=].
- <dfn noexport>Audio Element OBU</dfn> provides information to combine one or more [=Audio Substream=]s to reconstruct an [=Audio Element=].
- <dfn noexport>Mix Presentation OBU</dfn> provides information to render and mix one or more [=Audio Element=]s to generate the final 3D audio output.
- Multiple [=Mix Presentation=]s can be defined as alternatives to each other within the same [=IA Sequence=]. Furthermore, the choice of which [=Mix Presentation=] to use at playback is left to the user. For example, multi-language support is implemented by defining different [=Mix Presentation=]s, where the first mix describes the use of the [=Audio Element=] with English dialogue, and the second mix describes the use of the [=Audio Element=] with French dialogue.
#### IA Data #### {#iadata}
<dfn noexport>IA Data</dfn> contains the time-varying data that is required in the generation of the final 3D audio output.
- <dfn noexport>Audio Frame OBU</dfn> provides the coded audio frame for an [=Audio Substream=]. Each frame has an implied start timestamp and an explicitly defined duration. A coded [=Audio Substream=] is represented as a sequence of [=Audio Frame OBU=]s with the same identifier, in time order.
- <dfn noexport>Parameter Block OBU</dfn> provides the parameter values in a block for a [=Parameter Substream=]. Each block has an implied start timestamp and an explicitly defined duration. A time-varying [=Parameter Substream=] is represented as a sequence of parameter values in [=Parameter Block OBU=]s with the same identifier, in time order.
- <dfn noexport>Temporal Delimiter OBU</dfn> identifies the [=Temporal Unit=]s. It may or may not be present in [=IA Sequence=]. If present, the first OBU of every [=Temporal Unit=] is the [=Temporal Delimiter OBU=].
## Timing Model ## {#timingmodel}
A coded [=Audio Substream=] is made of consecutive [=Audio Frame OBU=]s. Each [=Audio Frame OBU=] is made of audio samples at a given sample rate. The decode duration of an [=Audio Frame OBU=] is the number of audio samples divided by the sample rate. The presentation duration of an [=Audio Frame OBU=] is the number of audio samples remaining after trimming divided by the sample rate. The decode start time (respectively presentation start time) of an [=Audio Frame OBU=] is the sum of the decode durations (respectively presentation durations) of the previous [=Audio Frame OBU=]s of the same [=Audio Substream=] in the [=IA Sequence=] if any, or 0 otherwise. The decode duration (respectively presentation duration) of a coded [=Audio Substream=] is the sum of the decode durations (respectively presentation durations) of all its [=Audio Frame OBU=]s. The decode start time of an [=Audio Substream=] is the decode start time of its first [=Audio Frame OBU=]. The presentation start time of an [=Audio Substream=] is the presentation start time of its first [=Audio Frame OBU=] which is not entirely trimmed.
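The frame timing rules above can be sketched as follows. This is a non-normative illustration: the `Frame` class and the `frame_timings` function are hypothetical names introduced here, and exact rational arithmetic is used only to keep the example free of rounding effects.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Frame:
    """Hypothetical stand-in for one Audio Frame OBU of a single substream."""
    num_samples: int     # samples obtained by decoding the frame
    trim_start: int = 0  # num_samples_to_trim_at_start
    trim_end: int = 0    # num_samples_to_trim_at_end


def frame_timings(frames, sample_rate):
    """Return one tuple per frame, in seconds:
    (decode_start, decode_duration, presentation_start, presentation_duration).
    """
    decode_time = Fraction(0)   # sum of previous decode durations
    present_time = Fraction(0)  # sum of previous presentation durations
    timings = []
    for frame in frames:
        decode_dur = Fraction(frame.num_samples, sample_rate)
        kept = frame.num_samples - frame.trim_start - frame.trim_end
        present_dur = Fraction(kept, sample_rate)
        timings.append((decode_time, decode_dur, present_time, present_dur))
        decode_time += decode_dur
        present_time += present_dur
    return timings


# Example: the first frame is entirely trimmed, the second partially.
frames = [
    Frame(1024, trim_start=1024),
    Frame(1024, trim_start=512),
    Frame(1024),
    Frame(1024, trim_end=24),
]
timings = frame_timings(frames, 48000)
substream_presentation_duration = sum(t[3] for t in timings)
```

In this example the substream's presentation start time is that of the second frame, since the first frame is entirely trimmed and contributes a presentation duration of 0.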
A [=Parameter Substream=] is made of consecutive [=Parameter Block OBU=]s. Each [=Parameter Block OBU=] is made of parameter values at a given sample rate. The decode duration of a [=Parameter Block OBU=] is the number of parameter values divided by the sample rate. The decode start time of a [=Parameter Block OBU=] is the sum of the decode durations of the previous [=Parameter Block OBU=]s if any, or 0 otherwise. The decode duration of a [=Parameter Substream=] is the sum of the decode durations of all its [=Parameter Block OBU=]s. The start time of a [=Parameter Substream=] is the decode start time of its first [=Parameter Block OBU=]. When all parameter values in a [=Parameter Substream=] are constant, [=Parameter Block OBU=]s may be absent from the [=IA Sequence=].
Within an [=Audio Element=], the presentation start times of all [=Audio Substream=]s coincide and define the presentation start time of the [=Audio Element=]. All [=Audio Substream=]s have the same presentation duration, which is the presentation duration of the [=Audio Element=].
- The decode start times of all coded [=Audio Substream=]s and all [=Parameter Substream=]s coincide and define the decode start time of the [=Audio Element=].
- All coded [=Audio Substream=]s and all [=Parameter Substream=]s have the same decode duration, which is the decode duration of the [=Audio Element=].
Within a [=Mix Presentation=], the presentation start times of all [=Audio Element=]s coincide, and all [=Audio Element=]s have the same duration, which defines the duration of the [=Mix Presentation=].
Within an [=IA Sequence=], all [=Mix Presentation=]s have the same duration, defining the duration of the [=IA Sequence=], and have the same presentation start time defining the presentation start time of the [=IA Sequence=].
The term <dfn noexport>Temporal Unit</dfn> means a set of all [=Audio Frame OBU=]s with the same decode start time and the same duration from all coded [=Audio Substream=]s and all non-redundant [=Parameter Block OBU=]s with the decode start time within the duration.
The figure below shows an example of the Timing Model in terms of the decode start times and durations of the coded [=Audio Substream=] and [=Parameter Substream=].
<center><img src="images/IAMF Timing Model.png" style="width:100%; height:auto;"></center>
<center><figcaption>An example of the IAMF Timing Model. AFO: Audio Frame OBU, PBO: Parameter Block OBU, PT<code>x</code>: time <code>x</code> (ms) on the presentation layer's timeline, DT<code>y</code>: time <code>y</code> (ms) on the decoding layer's timeline.</figcaption></center>
NOTE: For a given decoded [=Audio Substream=] (before trimming) and its associated [=Parameter Substream=](s), a decoder MAY apply trimming in one of two ways:
<br/>
1) The decoder processes the [=Audio Substream=] using the [=Parameter Substream=](s), and then trims the processed audio samples.
<br/>
2) The decoder trims both the [=Audio Substream=] and the [=Parameter Substream=](s). Then, the decoder processes the trimmed [=Audio Substream=] using the trimmed [=Parameter Substream=](s).
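<br/>
As a non-normative illustration of the two orderings in the note above, the sketch below assumes the simplest possible processing: a per-sample gain, with one parameter value per audio sample. The function names are hypothetical. Under that assumption both orderings produce the same output; processing that has memory across samples (e.g., filtering) would need more care.

```python
def apply_gain(samples, gains):
    """Sample-wise processing: multiply each sample by its parameter value."""
    return [s * g for s, g in zip(samples, gains)]


def trim(values, trim_start, trim_end):
    """Drop trim_start values from the front and trim_end from the back."""
    return values[trim_start:len(values) - trim_end]


def process_then_trim(samples, gains, trim_start, trim_end):
    # Way 1: process the full substream, then trim the processed samples.
    return trim(apply_gain(samples, gains), trim_start, trim_end)


def trim_then_process(samples, gains, trim_start, trim_end):
    # Way 2: trim audio and parameters alike, then process what remains.
    return apply_gain(trim(samples, trim_start, trim_end),
                      trim(gains, trim_start, trim_end))


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gains = [0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
way1 = process_then_trim(samples, gains, 2, 1)
way2 = trim_then_process(samples, gains, 2, 1)
```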
# Open Bitstream Unit (OBU) Syntax and Semantics # {#obu-syntax}
The [=IA Sequence=] uses the OBU syntax.
This section specifies the OBU syntax elements and their semantics.
## Immersive Audio OBU Syntax and Semantics ## {#immersiveaudio-obu}
OBUs are structured with an obu_header() and an OBU payload.
obu_header() and all OBU payloads including reserved_obu() are byte aligned.
<b>Syntax</b>
```
class ia_open_bitstream_unit() {
  obu_header();
  if (obu_type == OBU_IA_Sequence_Header)
    ia_sequence_header_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu(true);
  else if (obu_type >= 6 && obu_type <= 23)
    audio_frame_obu(false);
  else if (obu_type >= 24 && obu_type <= 30)
    reserved_obu();
}
```
<b>Semantics</b>
If the syntax element [=obu_type=] is equal to OBU_IA_Sequence_Header, an ordered series of OBUs is presented to the decoding process as a string of bytes.
## OBU Header Syntax and Semantics ## {#obu-header}
<b>Syntax</b>
```
class obu_header() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;
  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag == 1) {
    leb128() extension_header_size;
    unsigned int (8*extension_header_size) extension_header_bytes;
  }
}
```
<b>Semantics</b>
<dfn noexport>obu_type</dfn> specifies the type of data structure contained in the OBU payload.
<pre class = "def">
obu_type: Name of obu_type
0 : OBU_IA_Codec_Config
1 : OBU_IA_Audio_Element
2 : OBU_IA_Mix_Presentation
3 : OBU_IA_Parameter_Block
4 : OBU_IA_Temporal_Delimiter
5 : OBU_IA_Audio_Frame
6~23 : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17
24~30 : Reserved
31 : OBU_IA_Sequence_Header
</pre>
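The table above can be expressed as a small lookup. This is a non-normative sketch; the function name `obu_type_name` is hypothetical and only the value-to-name mapping comes from the table.

```python
# Names for the obu_type values that map to a single fixed name.
_FIXED_OBU_TYPE_NAMES = {
    0: "OBU_IA_Codec_Config",
    1: "OBU_IA_Audio_Element",
    2: "OBU_IA_Mix_Presentation",
    3: "OBU_IA_Parameter_Block",
    4: "OBU_IA_Temporal_Delimiter",
    5: "OBU_IA_Audio_Frame",
    31: "OBU_IA_Sequence_Header",
}


def obu_type_name(obu_type):
    """Map a 5-bit obu_type value to its name in the table."""
    if not 0 <= obu_type <= 31:
        raise ValueError("obu_type is a 5-bit field")
    if obu_type in _FIXED_OBU_TYPE_NAMES:
        return _FIXED_OBU_TYPE_NAMES[obu_type]
    if 6 <= obu_type <= 23:
        # Values 6..23 map to OBU_IA_Audio_Frame_ID0..ID17.
        return f"OBU_IA_Audio_Frame_ID{obu_type - 6}"
    return "Reserved"  # values 24..30
```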
<dfn noexport>obu_redundant_copy</dfn> indicates whether this OBU is a redundant copy of the previous OBU with the same [=obu_type=] in the [=IA Sequence=]. A value of 1 indicates that it is a redundant copy, while a value of 0 indicates that it is not.
It SHALL always be set to 0 for the following [=obu_type=] values:
- OBU_IA_Temporal_Delimiter
- OBU_IA_Audio_Frame
- OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17
If a decoder encounters an OBU with [=obu_redundant_copy=] = 1, and it has also received the previous non-redundant OBU, it MAY ignore the redundant OBU. If the decoder has not received the previous non-redundant OBU, it SHALL treat the redundant copy as a non-redundant OBU and process the OBU accordingly.
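A minimal, non-normative sketch of this behavior follows. It assumes OBUs arrive as hypothetical `(obu_type, obu_redundant_copy, payload)` tuples and tracks only whether some OBU of the same [=obu_type=] has already been processed. A real decoder would likely need finer bookkeeping, for example per identifier.

```python
def select_obus_to_process(obus):
    """Drop redundant copies whose original has already been processed.

    obus: iterable of (obu_type, obu_redundant_copy, payload) tuples.
    Returns the list of tuples the decoder goes on to process.
    """
    seen_types = set()
    to_process = []
    for obu_type, redundant_copy, payload in obus:
        if redundant_copy == 1 and obu_type in seen_types:
            # The previous OBU of this type was received: the copy MAY be ignored.
            continue
        # Either a non-redundant OBU, or a redundant copy whose original was
        # missed; the latter is treated as non-redundant.
        to_process.append((obu_type, redundant_copy, payload))
        seen_types.add(obu_type)
    return to_process


# A decoder joining mid-stream sees a redundant Codec Config first.
received = [
    (0, 1, b"cfg"),   # original missed: processed
    (0, 1, b"cfg"),   # now a true duplicate: ignored
    (1, 0, b"elem"),  # non-redundant: processed
    (1, 1, b"elem"),  # duplicate: ignored
]
processed = select_obus_to_process(received)
```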
<dfn noexport>obu_trimming_status_flag</dfn> indicates whether this OBU has audio samples to be trimmed. It SHALL be set only when [=obu_type=] is set to OBU_IA_Audio_Frame or OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17.
For a given coded [=Audio Substream=],
- If an [=Audio Frame OBU=] has its [=num_samples_to_trim_at_start=] field set to a non-zero value N, the decoder SHALL discard the first N audio samples.
- If an [=Audio Frame OBU=] has its [=num_samples_to_trim_at_end=] field set to a non-zero value N, the decoder SHALL discard the last N audio samples.
NOTE: Because of coding dependency, discarding a sample can sometimes mean decoding the entire audio frame.
- For a given [=Audio Frame OBU=], the sum of [=num_samples_to_trim_at_start=] and [=num_samples_to_trim_at_end=] SHALL be less than or equal to the number of samples in the [=Audio Frame OBU=] (i.e., [=num_samples_per_frame=]).
NOTE: This means that if one of the values is set to the number of samples in the [=Audio Frame OBU=] (i.e., [=num_samples_per_frame=]), the other value is set to 0.
- When [=num_samples_to_trim_at_start=] is non-zero, all [=Audio Frame OBU=]s with the same [=audio_substream_id=], and preceding this OBU back until the [=Codec Config OBU=] defining this [=Audio Substream=], SHALL have their [=num_samples_to_trim_at_start=] field equal to the number of samples in the corresponding [=Audio Frame OBU=] (i.e., [=num_samples_per_frame=]).
- When [=num_samples_to_trim_at_end=] is non-zero in an [=Audio Frame OBU=], there SHALL be no subsequent [=Audio Frame OBU=] with the same [=audio_substream_id=] until a non-redundant [=Codec Config OBU=] defining an [=Audio Substream=] with the same [=audio_substream_id=].
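As an informative illustration of the trimming rules above, the following Python sketch applies both trim counts to one decoded frame and rejects counts whose sum exceeds the frame length. The function name <code>trim_decoded_frame</code> and the list-based sample representation are assumptions made for this example only.

```python
def trim_decoded_frame(samples, num_samples_to_trim_at_start, num_samples_to_trim_at_end):
    """Return the samples that remain after trimming one decoded audio frame.

    `samples` is assumed to hold exactly num_samples_per_frame decoded samples.
    """
    num_samples_per_frame = len(samples)
    if num_samples_to_trim_at_start < 0 or num_samples_to_trim_at_end < 0:
        raise ValueError("trim counts cannot be negative")
    # The sum of both trim counts must not exceed the frame length.
    if num_samples_to_trim_at_start + num_samples_to_trim_at_end > num_samples_per_frame:
        raise ValueError("trim counts exceed num_samples_per_frame")
    end = num_samples_per_frame - num_samples_to_trim_at_end
    return samples[num_samples_to_trim_at_start:end]
```

A fully trimmed frame (one count equal to the frame length, the other 0) yields an empty result, which matches the NOTE above.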
<dfn noexport>obu_extension_flag</dfn> indicates whether the [=extension_header_size=] field is present. If it is set to 0, the [=extension_header_size=] field SHALL NOT be present. Otherwise, the [=extension_header_size=] field SHALL be present.
This flag SHALL be set to 0 for this version of the specification. An OBU parser that is conformant with this version of the specification SHOULD ignore the [=extension_header_bytes=].
NOTE: A future version of the specification may use this flag to specify an extension header field by setting [=obu_extension_flag=] = 1 and setting the size of the extended header to [=extension_header_size=].
<dfn noexport>obu_size</dfn> indicates the size, in bytes, of the portion of this OBU that immediately follows the obu_size field. An OBU MAY contain extra bytes beyond those consumed by its OBU syntax definition. Parsers compliant with this version of the specification SHOULD ignore the extra bytes.
<dfn noexport>num_samples_to_trim_at_start</dfn> indicates the number of samples that need to be trimmed from the start of the samples in this [=Audio Frame OBU=].
<dfn noexport>num_samples_to_trim_at_end</dfn> indicates the number of samples that need to be trimmed from the end of the samples in this [=Audio Frame OBU=].
<dfn noexport>extension_header_size</dfn> indicates the size in bytes of the extension header immediately following this field.
<dfn noexport>extension_header_bytes</dfn> indicates the byte representations of the syntaxes of the extension header.
## Reserved OBU Syntax and Semantics ## {#obu-reserved}
Reserved OBUs SHOULD be ignored by parsers compliant with this version of the specification. Future versions of the specification MAY define semantics for these reserved OBUs that would only be supported by parsers compliant with these future versions.
<b>Syntax</b>
```
class reserved_obu() {
}
```
## IA Sequence Header OBU Syntax and Semantics ## {#obu-iasequenceheader}
This section specifies the OBU payload of OBU_IA_Sequence_Header.
This OBU is used to indicate the start of an [=IA Sequence=]. Therefore, the first OBU in an [=IA Sequence=] SHALL have [=obu_type=] = OBU_IA_Sequence_Header.
NOTE: When an [=IA Sequence=] is stored in a file, the [=IA Sequence Header OBU=] can be used to identify that the file contains an [=IA Sequence=].
This OBU MAY be placed frequently within one single [=IA Sequence=] for an application such as broadcasting or multicasting. In that case, all [=IA Sequence Header OBU=]s except the first one SHALL be marked as redundant (i.e., [=obu_redundant_copy=] = 1).
<b>Syntax</b>
```
class ia_sequence_header_obu() {
unsigned int (32) ia_code;
unsigned int (8) primary_profile;
unsigned int (8) additional_profile;
}
```
<b>Semantics</b>
<dfn noexport>ia_code</dfn> is a ‘four-character code’ (4CC), <code>iamf</code>.
NOTE: When IA OBUs are delivered over a protocol that does not provide explicit [=IA Sequence=] boundaries, a parser may locate the [=IA Sequence=] start by searching for the code <code>iamf</code> preceded by specific OBU header values. For example, by assuming that [=obu_extension_flag=] is set to 0 and because [=obu_trimming_status_flag=] is set to 0 for an [=IA Sequence Header OBU=], the OBU header can be 0xF806 or 0xFC06.
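The search described in the NOTE above can be sketched as follows (informative). The sketch assumes that the two header bytes immediately precede <code>iamf</code>, that [=obu_extension_flag=] is 0, and that the payload is exactly the 6 bytes of this version's syntax. The function name <code>find_ia_sequence_header</code> is hypothetical.

```python
IA_CODE = b"iamf"
# obu_type = 31, obu_trimming_status_flag = 0, obu_extension_flag = 0, obu_size = 6.
# The two candidates differ only in obu_redundant_copy (0 or 1).
CANDIDATE_HEADERS = (b"\xF8\x06", b"\xFC\x06")


def find_ia_sequence_header(data, start=0):
    """Return the offset of the first candidate IA Sequence Header OBU, or -1."""
    pos = data.find(IA_CODE, start)
    while pos != -1:
        # The candidate OBU header occupies the two bytes right before ia_code.
        if pos >= 2 and data[pos - 2:pos] in CANDIDATE_HEADERS:
            return pos - 2
        pos = data.find(IA_CODE, pos + 1)
    return -1
```

A match is only a candidate: the byte pattern can also occur by chance inside other data, so a parser would normally continue to validate the OBUs that follow.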
<dfn noexport>primary_profile</dfn> indicates the primary profile that this [=IA Sequence=] complies with. Parsers compliant with this version of the specification SHOULD discard the [=IA Sequence=] if they do not support the value indicated here.
The mappings below are applied for both [=primary_profile=] and [=additional_profile=].
- 0: Simple Profile
- 1: Base Profile
- 2~255: Reserved
<dfn noexport>additional_profile</dfn> indicates an additional profile that this [=IA Sequence=] complies with. If an [=IA Sequence=] only complies with the [=primary_profile=], this field SHALL be set to the same value as [=primary_profile=].
NOTE: If a future version defines a new profile, e.g., HypotheticalProfile, that is backward compatible with the Base profile, for example by defining new OBUs that would be ignored by the Base-compatible parser, an IA writer can decide to set the [=primary_profile=] to "Base Profile" while setting the [=additional_profile=] to "HypotheticalProfile". This way an old processor will know it can parse and produce an acceptable rendering, while a new processor still knows it can produce a better result because it will not ignore the additional features.
## Codec Config OBU Syntax and Semantics ## {#obu-codecconfig}
This section specifies the OBU payload of OBU_IA_Codec_Config.
<b>Syntax</b>
```
class codec_config_obu() {
leb128() codec_config_id;
codec_config();
}
class codec_config() {
unsigned int (32) codec_id;
leb128() num_samples_per_frame;
signed int (16) audio_roll_distance;
decoder_config(codec_id);
}
```
<b>Semantics</b>
<dfn noexport>codec_config_id</dfn> defines an identifier for a codec configuration. Within an [=IA Sequence=], there SHALL be one unique [=codec_config_id=] per codec. There SHALL be exactly one [=Codec Config OBU=] with a given identifier in a set of [=Descriptors=]. [=Audio Element=]s use this identifier to indicate that their corresponding [=Audio Substream=]s are coded with this codec configuration.
<dfn noexport>codec_id</dfn> indicates a ‘four-character code’ (4CC) to identify the codec used to generate the coded [=Audio Substream=]s. For this version of the specification, it SHALL be set to one of the four [=codec_id=] values defined below:
- 'Opus': All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL comply with the [[!RFC6716]] specification and the [=decoder_config()=] structure SHALL comply with the constraints given in [[#opus-specific]].
- 'mp4a': All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL comply with the [[!AAC]] specification and the [=decoder_config()=] structure SHALL comply with the constraints given in [[#aac-lc-specific]].
- 'fLaC': All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL comply with the [[!FLAC]] specification and the [=decoder_config()=] structure SHALL comply with the constraints given in [[#flac-specific]].
- 'ipcm': All coded [=Audio Substream=]s referred to by all [=Audio Element=]s with this codec configuration SHALL contain linear PCM (LPCM) audio samples and the [=decoder_config()=] structure SHALL comply with the constraints given in [[#lpcm-specific]].
Parsers compliant with this version of the specification SHOULD ignore [=Codec Config OBU=]s with an unknown [=codec_id=].
NOTE: 'ipcm' should not be confused with <code>lpcm</code>, which is another 4CC to identify codecs in other container formats (e.g., QuickTime).
<dfn noexport>num_samples_per_frame</dfn> indicates the frame length, in samples, of the [=audio_frame()=] provided in the audio_frame_obu(). It SHALL NOT be set to zero. If the [=decoder_config()=] structure for a given codec specifies a value for the frame length, the two values SHALL be equal.
<dfn noexport>audio_roll_distance</dfn> indicates how many audio frames prior to the current audio frame need to be decoded (and the decoded samples discarded) to set the decoder in a state that will produce the perfect decoded audio signal. It SHALL always be a negative value or zero. For some audio codecs, even if an audio frame can be decoded independently, the decoded signal after decoding only that frame may not represent a perfect, decoded audio signal, even ignoring compression artifacts. This can be due to overlap transforms. While potentially acceptable when starting to decode an [=Audio Substream=], it may be problematic when automatically switching between similar [=Audio Substream=]s of different quality and/or bitrate.
- It SHALL be set to -R when [=codec_id=] is set to 'Opus', where R is <code>ceil(3840 / [=num_samples_per_frame=])</code>.
- It SHALL be set to -1 when [=codec_id=] is set to 'mp4a'.
- It SHALL be set to 0 when [=codec_id=] is set to 'fLaC' or 'ipcm'.
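The per-codec values above can be computed as in the following informative sketch. The function name <code>expected_audio_roll_distance</code> is hypothetical, and the [=codec_id=] is passed as a 4-character string for readability.

```python
def expected_audio_roll_distance(codec_id, num_samples_per_frame):
    """Return the audio_roll_distance required for the given codec_id."""
    if num_samples_per_frame <= 0:
        raise ValueError("num_samples_per_frame SHALL NOT be zero")
    if codec_id == "Opus":
        # R = ceil(3840 / num_samples_per_frame), using integer arithmetic.
        r = (3840 + num_samples_per_frame - 1) // num_samples_per_frame
        return -r
    if codec_id == "mp4a":
        return -1
    if codec_id in ("fLaC", "ipcm"):
        return 0
    raise ValueError("unknown codec_id: " + repr(codec_id))
```

For example, an Opus configuration with 960 samples per frame gives R = 4, so the field is set to -4.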
<dfn noexport>decoder_config()</dfn> specifies the set of codec parameters required to decode the [=Audio Substream=]. It is byte aligned.
## Audio Element OBU Syntax and Semantics ## {#obu-audioelement}
This section specifies the OBU payload of OBU_IA_Audio_Element.
<b>Syntax</b>
```
class audio_element_obu() {
leb128() audio_element_id;
unsigned int (3) audio_element_type;
unsigned int (5) reserved;
leb128() codec_config_id;
leb128() num_substreams;
for (i = 0; i < num_substreams; i++) {
leb128() audio_substream_id;
}
leb128() num_parameters;
for (i = 0; i < num_parameters; i++) {
leb128() param_definition_type;
if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
DemixingParamDefinition demixing_info;
}
else if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
ReconGainParamDefinition recon_gain_info;
}
else if (param_definition_type > 2) {
leb128() param_definition_size;
unsigned int (8*param_definition_size) param_definition_bytes;
}
}
if (audio_element_type == CHANNEL_BASED) {
scalable_channel_layout_config();
} else if (audio_element_type == SCENE_BASED) {
ambisonics_config();
} else {
leb128() audio_element_config_size;
unsigned int (8*audio_element_config_size) audio_element_config_bytes;
}
}
```
```
class DemixingParamDefinition() extends ParamDefinition() {
default_demixing_info_parameter_data();
}
```
```
class default_demixing_info_parameter_data() extends demixing_info_parameter_data() {
unsigned int (4) default_w;
unsigned int (4) reserved;
}
```
```
class ReconGainParamDefinition() extends ParamDefinition() {
}
```
<b>Semantics</b>
<dfn noexport>audio_element_id</dfn> defines an identifier for an [=Audio Element=]. Within an [=IA Sequence=], there SHALL be one unique [=audio_element_id=] per [=Audio Element=]. There SHALL be exactly one [=Audio Element OBU=] with a given identifier in a set of [=Descriptors=]. [=Mix Presentation=]s refer to a particular [=Audio Element=] using this identifier.
<dfn noexport>audio_element_type</dfn> specifies the audio representation of this [=Audio Element=], which is constructed from one or more [=Audio Substream=]s. Parsers compliant with this version of the specification SHOULD ignore [=Audio Element OBU=]s with a reserved [=audio_element_type=].
<pre class = "def">
audio_element_type: The type of audio representation.
0 : CHANNEL_BASED
1 : SCENE_BASED
2~7 : Reserved
</pre>
<dfn value noexport for="audio_element_obu()">codec_config_id</dfn> indicates the identifier for the codec configuration which this [=Audio Element=] refers to. Parsers compliant with this version of the specification SHOULD ignore [=Audio Element OBU=]s with a [=codec_config_id=] identifying an unknown [=codec_id=].
<dfn noexport>num_substreams</dfn> specifies the number of [=Audio Substream=]s that are used to reconstruct this [=Audio Element=]. It SHALL NOT be set to 0.
<dfn value noexport for="audio_element_obu()">audio_substream_id</dfn> indicates the identifier for an [=Audio Substream=] which this [=Audio Element=] refers to.
Let a particular [=ChannelGroup=]'s [=Audio Substream=]s be indexed as [<dfn noexport>c</dfn>, <dfn noexport>n_c</dfn>], where a [=ChannelGroup=] generation rule is described in [[#iamfgeneration-scalablechannelaudio-channelgroupgenerationrule]] and
- [=c=] = [1, ..., C] is the [=ChannelGroup=] index and C is the number of [=ChannelGroup=]s.
- [=n_c=] = [1, ..., N_c] is the [=Audio Substream=] index in the c-th [=ChannelGroup=] and N_c is the number of [=Audio Substream=]s in the c-th [=ChannelGroup=].
Then, the i-th [=audio_substream_id=] maps to a [=ChannelGroup=]'s [=Audio Substream=]s as follows, where i is the index of the array:
```
[
[1, 1], [1, 2], ..., [1, N_1],
[2, 1], [2, 2], ..., [2, N_2],
...,
[C, 1], [C, 2], ..., [C, N_C]
]
```
The order of the [=Audio Substream=]s in each [=ChannelGroup=] (i.e., the semantics of n_c) is specified in [[#syntax-scalable-channel-layout-config]].
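The index mapping above can be sketched as follows (informative). The sketch assumes the caller already knows the number of [=Audio Substream=]s in each [=ChannelGroup=] (N_1, ..., N_C). The function name and the tuple layout are hypothetical.

```python
def map_substreams_to_channel_groups(audio_substream_ids, substreams_per_group):
    """Return (audio_substream_id, c, n_c) tuples with 1-based c and n_c.

    substreams_per_group is the list N_1, ..., N_C.
    """
    if sum(substreams_per_group) != len(audio_substream_ids):
        raise ValueError("group sizes do not add up to num_substreams")
    mapping = []
    i = 0  # index into the audio_substream_id array
    for c, group_size in enumerate(substreams_per_group, start=1):
        for n_c in range(1, group_size + 1):
            mapping.append((audio_substream_ids[i], c, n_c))
            i += 1
    return mapping
```

With five substream identifiers and group sizes 2 and 3, the first two identifiers map to ChannelGroup 1 and the remaining three to ChannelGroup 2.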
<dfn noexport>num_parameters</dfn> specifies the number of [=Parameter Substream=]s that are used by the algorithms specified in this [=Audio Element=].
- When [=audio_element_type=] = 0, this field SHALL be set to 0, 1, or 2 for this version of the specification.
- When [=audio_element_type=] = 1, this field SHALL be set to 0 for this version of the specification.
- Parsers compliant with this version of the specification SHALL be able to parse any value of [=num_parameters=].
NOTE: For a given [=audio_element_type=], a future version of the specification may define a new [=Parameter Substream=] which may be ignored by IA decoders compliant with this version of the specification. In that case, a new [=param_definition_type=] will be defined in a future version of [=Audio Element OBU=].
<dfn noexport>param_definition_type</dfn> specifies the type of the parameter definition. All parameter definition types described in this version of the specification are listed in the table below, along with their associated parameter definitions.
<table class = "def">
<tr>
<th>param_definition_type</th><th>Parameter definition type</th><th>Parameter definition</th>
</tr>
<tr>
<td>0</td><td>PARAMETER_DEFINITION_MIX_GAIN</td><td>MixGainParamDefinition</td>
</tr>
<tr>
<td>1</td><td>PARAMETER_DEFINITION_DEMIXING</td><td>DemixingParamDefinition</td>
</tr>
<tr>
<td>2</td><td>PARAMETER_DEFINITION_RECON_GAIN</td><td>ReconGainParamDefinition</td>
</tr>
</table>
- The type PARAMETER_DEFINITION_MIX_GAIN SHALL NOT be present in [=Audio Element OBU=].
- The type SHALL NOT be duplicated in one [=Audio Element OBU=].
- When [=codec_id=] = 'fLaC' or 'ipcm', the type PARAMETER_DEFINITION_RECON_GAIN SHALL NOT be present.
- When [=num_layers=] > 1, the type PARAMETER_DEFINITION_RECON_GAIN SHALL be present.
- When the highest [=loudspeaker_layout=] of the (non-)scalable channel audio (i.e., [=num_layers=] = 1) is less than or equal to 3.1.2ch, the type PARAMETER_DEFINITION_DEMIXING SHALL NOT be present.
- When the highest [=loudspeaker_layout=] of the scalable channel audio (i.e., [=num_layers=] > 1) is greater than 3.1.2ch, both PARAMETER_DEFINITION_DEMIXING and PARAMETER_DEFINITION_RECON_GAIN types SHALL be present.
- When [=num_layers=] = 1 and [=loudspeaker_layout=] is greater than 3.1.2ch, the type PARAMETER_DEFINITION_DEMIXING MAY be present.
- An OBU parser that is conformant with this version of the specification SHALL be able to parse [=param_definition_type=] = P (where P > 2) and [=param_definition_size=]. The OBU Parser SHOULD ignore the bytes indicated by [=param_definition_size=].
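The last rule above (parse, then ignore, an unknown parameter definition) can be sketched as follows (informative). The sketch assumes that <code>leb128()</code> is an unsigned little-endian base-128 value of at most 8 bytes, as defined elsewhere in this specification. Both function names are hypothetical.

```python
def read_leb128(data, pos):
    """Decode an unsigned leb128 value at pos; return (value, next_pos)."""
    value = 0
    for i in range(8):  # assumption: at most 8 bytes per leb128 value
        if pos + i >= len(data):
            raise ValueError("truncated leb128")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:  # a cleared top bit marks the last byte
            return value, pos + i + 1
    raise ValueError("leb128 longer than 8 bytes")


def skip_unknown_param_definition(data, pos):
    """Skip one parameter definition whose param_definition_type is > 2.

    pos points at param_definition_type; returns (type, next_pos).
    """
    param_definition_type, pos = read_leb128(data, pos)
    if param_definition_type <= 2:
        raise ValueError("type is defined in this version; parse it instead")
    param_definition_size, pos = read_leb128(data, pos)
    if pos + param_definition_size > len(data):
        raise ValueError("param_definition_bytes exceed the available data")
    # param_definition_bytes are ignored by skipping over them.
    return param_definition_type, pos + param_definition_size
```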
<dfn noexport>demixing_info</dfn> provides the parameter definition for the demixing information, which is used to reconstruct a scalable channel audio representation. The parameter definition is provided by DemixingParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in [=demixing_info_parameter_data()=].
In this parameter definition,
- [=parameter_rate=] SHALL be set to the sample rate of this [=Audio Element=].
- [=param_definition_mode=] SHALL be set to 0.
- [=duration=] SHALL be the same as [=num_samples_per_frame=] of this [=Audio Element=].
- [=num_subblocks=] SHALL be set to 1.
- [=constant_subblock_duration=] SHALL be the same as [=duration=].
<dfn noexport>recon_gain_info</dfn> provides the parameter definition for the gain value, which is used to reconstruct a scalable channel audio representation. The parameter definition is provided by ReconGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in [=recon_gain_info_parameter_data()=].
In this parameter definition,
- [=parameter_rate=] SHALL be set to the sample rate of this [=Audio Element=].
- [=param_definition_mode=] SHALL be set to 0.
- [=duration=] SHALL be the same as [=num_samples_per_frame=] of this [=Audio Element=].
- [=num_subblocks=] SHALL be set to 1.
- [=constant_subblock_duration=] SHALL be the same as [=duration=].
<dfn noexport>param_definition_size</dfn> indicates the size in bytes of [=param_definition_bytes=].
<dfn noexport>param_definition_bytes</dfn> represents reserved bytes for future use when new [=param_definition_type=] values are defined. Parsers compliant with this version of the specification SHOULD ignore these bytes.
<dfn noexport>scalable_channel_layout_config()</dfn> provides the metadata required for combining the [=Audio Substream=]s referred to here in order to reconstruct a scalable channel layout.
<dfn noexport>ambisonics_config()</dfn> provides the metadata required for combining the [=Audio Substream=]s referred to here in order to reconstruct an Ambisonics layout.
<dfn noexport>audio_element_config_size</dfn> indicates the size in bytes of [=audio_element_config_bytes=].
<dfn noexport>audio_element_config_bytes</dfn> represents reserved bytes for future use when new [=audio_element_type=] values are defined. Parsers compliant with this version of the specification SHOULD ignore these bytes.
<dfn noexport>default_demixing_info_parameter_data()</dfn> provides the default demixing parameter data to apply to all audio samples when there are no [=Parameter Block OBU=]s (with the same [=parameter_id=] defined in this DemixingParamDefinition()) provided.
- In this class, [=w_idx_offset=] in [=demixing_info_parameter_data()=] SHALL be ignored.
- Instead, [=default_w=] directly indicates the weight value [=w(k)=].
<dfn noexport>default_w</dfn> indicates the weight value [=w(k)=] for the [=TF2toT2 de-mixer=] specified in [[#processing-scalablechannelaudio-demixer]].
The mapping of [=default_w=] to [=w(k)=] SHOULD be as follows:
<pre class = "def">
default_w : w(k)
0 : 0
1 : 0.0179
2 : 0.0391
3 : 0.0658
4 : 0.1038
5 : 0.25
6 : 0.3962
7 : 0.4342
8 : 0.4609
9 : 0.4821
10 : 0.5
11 ~ 15 : reserved
</pre>
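The mapping table above amounts to a simple lookup, sketched below (informative). The function name <code>w_from_default_w</code> is hypothetical, and returning <code>None</code> for reserved values is a choice made for this example only.

```python
# w(k) for default_w = 0..10, copied from the table above.
DEFAULT_W_TO_W = (0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
                  0.3962, 0.4342, 0.4609, 0.4821, 0.5)


def w_from_default_w(default_w):
    """Return w(k) for a 4-bit default_w, or None for a reserved value."""
    if not 0 <= default_w <= 15:
        raise ValueError("default_w is a 4-bit field")
    if default_w >= len(DEFAULT_W_TO_W):
        return None  # 11 ~ 15 are reserved
    return DEFAULT_W_TO_W[default_w]
```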
A default recon gain value of 0 dB is implied when there are no [=Parameter Block OBU=]s (with the same [=parameter_id=] defined in this ReconGainParamDefinition()) provided.
### Parameter Definition Syntax and Semantics ### {#parameter-definition}
Parameter definition classes inherit from the abstract <dfn noexport>ParamDefinition()</dfn> class.
<b>Syntax</b>
```
abstract class ParamDefinition() {
leb128() parameter_id;
leb128() parameter_rate;
unsigned int (1) param_definition_mode;
unsigned int (7) reserved;
if (param_definition_mode == 0) {
leb128() duration;
leb128() constant_subblock_duration;
if (constant_subblock_duration == 0) {
leb128() num_subblocks;
for (i=0; i< num_subblocks; i++) {
leb128() subblock_duration;
}
}
}
}
```
<b>Semantics</b>
<dfn value noexport for="ParamDefinition()">parameter_id</dfn> indicates the identifier for the [=Parameter Substream=] which this parameter definition refers to. There SHALL be one unique [=parameter_id=] per [=Parameter Substream=].
<dfn noexport>parameter_rate</dfn> specifies the rate used by this [=Parameter Substream=], expressed as ticks per second. Time-related fields associated with this [=Parameter Substream=], such as durations, SHALL be expressed in the number of ticks.
- The rate SHALL be a value such that (the rate * [=num_samples_per_frame=]) / (the sample rate of [=Audio Element=]) is a non-zero integer.
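The constraint above can be checked with integer arithmetic, as in this informative sketch. The function name <code>parameter_ticks_per_frame</code> is hypothetical. It returns the number of ticks that span one audio frame when the constraint holds.

```python
def parameter_ticks_per_frame(parameter_rate, num_samples_per_frame, sample_rate):
    """Return the ticks spanning one audio frame, or None if the rate is invalid."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    product = parameter_rate * num_samples_per_frame
    # The quotient must be a non-zero integer.
    if product <= 0 or product % sample_rate != 0:
        return None
    return product // sample_rate
```

For example, with 960 samples per frame at 48000 Hz, a [=parameter_rate=] of 100 gives 2 ticks per frame, whereas a rate of 30 gives 0.6 and is therefore not allowed.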
<dfn noexport>param_definition_mode</dfn> indicates whether this parameter definition specifies the [=duration=], [=num_subblocks=], [=constant_subblock_duration=] and [=subblock_duration=] fields for the parameter blocks with the same [=parameter_id=].
- When this field is set to 0, all of the [=duration=], [=num_subblocks=], [=constant_subblock_duration=], and [=subblock_duration=] fields SHALL be specified in this parameter definition. None of the parameter blocks with the same [=parameter_id=] SHALL specify these same fields.
- When this field is set to 1, none of the [=duration=], [=num_subblocks=], [=constant_subblock_duration=], and [=subblock_duration=] fields SHALL be specified in this parameter definition. Instead, each parameter block with the same [=parameter_id=] SHALL specify these same fields.
<dfn noexport>duration</dfn> specifies the duration for which each parameter block with the same [=parameter_id=] is valid and applicable. It SHALL NOT be set to 0.
<dfn noexport>constant_subblock_duration</dfn> specifies the duration of each subblock, in the case where all subblocks except the last subblock have equal durations. If all subblocks except the last subblock do not have equal durations, the value of constant_subblock_duration SHALL be set to 0.
Let <dfn noexport>D</dfn> = the value of [=duration=], <dfn noexport>NS</dfn> = the value of [=num_subblocks=], <dfn noexport>CSD</dfn> = the value of [=constant_subblock_duration=] and <dfn noexport>SD</dfn> = the value of [=subblock_duration=].
- When [=CSD=] != 0, [=num_subblocks=] is implicitly calculated as [=NS=] = ceil([=D=] / [=CSD=]).
- If [=NS=] * [=CSD=] > [=D=], the actual duration of the last subblock SHALL be [=D=] - ([=NS=] - 1) * [=CSD=].
- When [=CSD=] = 0, the summation of all [=SD=]s in this parameter block SHALL be equal to [=D=].
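The rules above for deriving the subblock durations can be sketched as follows (informative). The function name <code>subblock_durations</code> and the <code>explicit_durations</code> argument, which stands for the list of [=subblock_duration=] values read when [=CSD=] = 0, are hypothetical.

```python
def subblock_durations(duration, constant_subblock_duration, explicit_durations=()):
    """Return the list of subblock durations, in ticks, for one parameter block."""
    if duration <= 0:
        raise ValueError("duration SHALL NOT be 0")
    csd = constant_subblock_duration
    if csd != 0:
        ns = -(-duration // csd)            # NS = ceil(D / CSD)
        last = duration - (ns - 1) * csd    # shorter last subblock if NS * CSD > D
        return [csd] * (ns - 1) + [last]
    if any(sd == 0 for sd in explicit_durations):
        raise ValueError("subblock_duration SHALL NOT be 0")
    if sum(explicit_durations) != duration:
        raise ValueError("subblock durations do not sum to duration")
    return list(explicit_durations)
```

For D = 10 and CSD = 4, this gives NS = 3 subblocks of 4, 4, and 2 ticks.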
<dfn noexport>num_subblocks</dfn> specifies the number of different sets of parameter values specified in each parameter block with the same [=parameter_id=], where each set describes a different subblock of the timeline, contiguously.
<dfn noexport>subblock_duration</dfn> specifies the duration for the given subblock. It SHALL NOT be set to 0.
The values for [=duration=], [=constant_subblock_duration=], and [=subblock_duration=] SHALL be expressed as the number of ticks at the [=parameter_rate=] specified in the corresponding parameter definition.
### Scalable Channel Layout Config Syntax and Semantics ### {#syntax-scalable-channel-layout-config}
[=scalable_channel_layout_config()=] provides the configuration for a given scalable channel audio representation.
<b>Syntax</b>
```
class scalable_channel_layout_config() {
unsigned int (3) num_layers;
unsigned int (5) reserved;
for (i = 1; i <= num_layers; i++) {
channel_audio_layer_config(i);
}
}
class channel_audio_layer_config(i) {
unsigned int (4) loudspeaker_layout(i);
unsigned int (1) output_gain_is_present_flag(i);
unsigned int (1) recon_gain_is_present_flag(i);
unsigned int (2) reserved;
unsigned int (8) substream_count(i);
unsigned int (8) coupled_substream_count(i);
if (output_gain_is_present_flag(i) == 1) {
unsigned int (6) output_gain_flags(i);
unsigned int (2) reserved;
signed int (16) output_gain(i);
}
}
```
When an [=Audio Element=] is composed of G(r) number of [=Audio Substream=]s, its scalable channel audio representation is layered into [=num_layers=] = r number of [=ChannelGroup=]s.
- The order of the [=ChannelGroup=]s in each [=Temporal Unit=] SHALL be the same as the order of channel_audio_layer_config()s in scalable_channel_layout_config().
- The q-th [=ChannelGroup=] consists of G(q) - G(q-1) number of [=Audio Substream=]s, where q = 1, 2, ..., r and G(0) = 0.
- Let the term "Audio Frames" mean the set of all [=Audio Frame OBU=]s (for this [=Audio Element=]) that have the same start timestamp. All Audio Frames in an [=IA Sequence=] SHALL have the same number of [=Audio Frame OBU=]s.
- [=Parameter Block OBU=]s MAY be associated with Audio Frames.
<center><img src="images/Immersive Audio Sequence with scalable channel audio (before OBU packing).png" style="width:100%; height:auto;"></center>
<center><figcaption>Immersive Audio Sequence with scalable channel audio (before OBU packing). See [[#standalone]] for related details on OBU ordering within an IA Sequence.</figcaption></center>
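The relationship between the cumulative substream counts G(q) and the size of each [=ChannelGroup=] can be sketched as follows (informative). The function name is hypothetical. The input list holds G(1), ..., G(r), and each group is assumed to contain at least one [=Audio Substream=].

```python
def substreams_per_channel_group(cumulative_counts):
    """Given G(1), ..., G(r), return the group sizes G(q) - G(q-1) for q = 1..r."""
    sizes = []
    previous = 0  # G(0) = 0
    for g in cumulative_counts:
        if g <= previous:
            raise ValueError("each ChannelGroup needs at least one substream")
        sizes.append(g - previous)
        previous = g
    return sizes
```

For example, G = (1, 4, 6) describes three ChannelGroups with 1, 3, and 2 Audio Substreams.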
Each [=ChannelGroup=] (or scalable audio channel layer) is associated with a different [=loudspeaker_layout=]. The IA decoder SHALL select one of the layers according to the following rules, in order:
- The IA decoder SHOULD first attempt to select the layer with a [=loudspeaker_layout=] that matches the physical playback layout.
- If there is no match, the IA decoder SHOULD select the layer with the closest [=loudspeaker_layout=] to the physical layout and then apply up- or down-mixing appropriately, after decoding and reconstruction of the channel audio. Sections [[#iamfgeneration-scalablechannelaudio-downmixmechanism]] and [[#processing-downmixmatrix]] provide examples of dynamic and static down-mixing matrices for some common layouts that MAY be used.
<b>Semantics</b>
<dfn noexport>num_layers</dfn> indicates the number of [=ChannelGroup=]s for scalable channel audio. It SHALL NOT be set to zero and its maximum value SHALL be 6.
- If [=loudspeaker_layout=] is set to Binaural, this field SHALL be set to 1.
<dfn noexport>channel_audio_layer_config()</dfn> provides the i-th [=ChannelGroup=]'s configuration, where i is the layer index provided as input argument to this class.
<dfn noexport>loudspeaker_layout</dfn> indicates the channel layout to be reconstructed from the preceding [=ChannelGroup=]s and the current [=ChannelGroup=]. When a reserved value for [=loudspeaker_layout=] is used, parsers compliant with this version of the specification SHOULD skip the [=channel_audio_layer_config()=] for that layer and all subsequent ones, if any.
In this version of the specification, [=loudspeaker_layout=] indicates one of the 10 channel layouts listed below, where
- <dfn noexport>Stereo</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System A (0+2+0)=] of [[!ITU2051-3]].
- <dfn noexport>5.1ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System B (0+5+0)=] of [[!ITU2051-3]].
- <dfn noexport>5.1.2ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System C (2+5+0)=] of [[!ITU2051-3]].
- <dfn noexport>5.1.4ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System D (4+5+0)=] of [[!ITU2051-3]].
- <dfn noexport>7.1ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System I (0+7+0)=] of [[!ITU2051-3]].
- <dfn noexport>7.1.2ch</dfn> is the combination of the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System I (0+7+0)=] of [[!ITU2051-3]] and the left and right top front pair of the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System J (4+7+0)=] of [[!ITU2051-3]].
- <dfn noexport>7.1.4ch</dfn> is the loudspeaker configuration as depicted in [=Loudspeaker configuration for Sound System J (4+7+0)=] of [[!ITU2051-3]].
- <dfn noexport>3.1.2ch</dfn> is the front subset (L/C/R/Ltf/Rtf/LFE) of [=7.1.4ch=].
<pre class = "def">
Loudspeaker Layout (4 bits) : Channel Layout : Loudspeaker Location Ordering
0000 : Mono : C
0001 : Stereo : L/R
0010 : 5.1ch : L/C/R/Ls/Rs/LFE
0011 : 5.1.2ch : L/C/R/Ls/Rs/Ltf/Rtf/LFE
0100 : 5.1.4ch : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
0101 : 7.1ch : L/C/R/Lss/Rss/Lrs/Rrs/LFE
0110 : 7.1.2ch : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
0111 : 7.1.4ch : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
1000 : 3.1.2ch : L/C/R/Ltf/Rtf/LFE
1001 : Binaural : L/R
others : reserved :
</pre>
```
Where C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround,
Rs: Right Surround, Rss: Right Side Surround, Lrs: Left Rear Surround, Rrs: Right Rear Surround,
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear,
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects
```
NOTE: The Ltr and Rtr of 5.1.4ch down-mixed from 7.1.4ch are within the range of Ltb and Rtb of 7.1.4ch, in terms of their positions according to [[!ITU2051-3]].
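The layout table above can be represented as a lookup from the 4-bit field value to the layout name and its loudspeaker location ordering, as in this informative sketch. The names <code>LOUDSPEAKER_LAYOUTS</code> and <code>describe_loudspeaker_layout</code> are hypothetical, and returning <code>None</code> for reserved values is a choice made for this example only.

```python
# 4-bit loudspeaker_layout value -> (channel layout, loudspeaker location ordering)
LOUDSPEAKER_LAYOUTS = {
    0b0000: ("Mono", "C"),
    0b0001: ("Stereo", "L/R"),
    0b0010: ("5.1ch", "L/C/R/Ls/Rs/LFE"),
    0b0011: ("5.1.2ch", "L/C/R/Ls/Rs/Ltf/Rtf/LFE"),
    0b0100: ("5.1.4ch", "L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE"),
    0b0101: ("7.1ch", "L/C/R/Lss/Rss/Lrs/Rrs/LFE"),
    0b0110: ("7.1.2ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE"),
    0b0111: ("7.1.4ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE"),
    0b1000: ("3.1.2ch", "L/C/R/Ltf/Rtf/LFE"),
    0b1001: ("Binaural", "L/R"),
}


def describe_loudspeaker_layout(value):
    """Return (name, list of channel labels), or None for a reserved value."""
    if not 0 <= value <= 0b1111:
        raise ValueError("loudspeaker_layout is a 4-bit field")
    entry = LOUDSPEAKER_LAYOUTS.get(value)
    if entry is None:
        return None
    name, ordering = entry
    return name, ordering.split("/")
```

The number of channels of a layout is the length of the returned label list, for example 12 for 7.1.4ch.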
For a given input audio with [=audio_element_type=] = CHANNEL_BASED, if the input audio has height channels (e.g., 7.1.4ch or 5.1.2ch), it is RECOMMENDED to use channel layouts with height channels (i.e., higher than or equal to 3.1.2ch) for all [=loudspeaker_layouts=].
- Examples for RECOMMENDED list of channel layouts: 3.1.2ch/5.1.2ch, 3.1.2ch/5.1.2ch/7.1.4ch, 5.1.2ch/7.1.4ch, etc.
- Examples for NOT RECOMMENDED list of channel layouts: 2ch/3.1.2ch/5.1.2ch, 2ch/3.1.2ch/5.1.2ch/7.1.4ch, 2ch/5.1.2ch/7.1.4ch, 2ch/7.1.4ch, etc.
NOTE: This specification allows down-mixing mechanisms (e.g., as specified in [[#iamfgeneration-scalablechannelaudio-downmixmechanism]]) to drop the height channel if the output layout has no height channels. An example is down-mixing from 7.1.4ch to Mono, Stereo, 5.1ch or 7.1ch. Therefore, given an input audio with height channels, an encoder may generate a set of scalable audio channel groups with layouts that do not have height channels.
<dfn noexport>output_gain_is_present_flag</dfn> indicates if the output_gain information fields for the [=ChannelGroup=] are present.
- 0: No output_gain information fields for the [=ChannelGroup=] are present.
- 1: output_gain information fields for the [=ChannelGroup=] are present. In this case, [=output_gain_flags=] and [=output_gain=] fields are present.
<dfn noexport>recon_gain_is_present_flag</dfn> indicates if the recon_gain information fields for the [=ChannelGroup=] are present in [=recon_gain_info_parameter_data()=].
- 0: No recon_gain information fields for the [=ChannelGroup=] are present in [=recon_gain_info_parameter_data()=].
- 1: recon_gain information fields for the [=ChannelGroup=] are present in [=recon_gain_info_parameter_data()=]. In this case, the [=recon_gain_flags=] and [=recon_gain=] fields are present.
<dfn noexport>substream_count</dfn> specifies the number of [=Audio Substream=]s. The sum of all [=substream_count=]s in this OBU SHALL be the same as [=num_substreams=] in this OBU. It SHALL NOT be set to 0.
<dfn noexport>coupled_substream_count</dfn> specifies the number of referenced [=Audio Substream=]s, each of which is coded as coupled stereo channels.
Each pair of coupled stereo channels in the same [=ChannelGroup=] SHALL be coded in stereo mode to generate one single coded [=Audio Substream=] and each of the non-coupled channels in the same [=ChannelGroup=] SHALL be coded in mono mode to generate one single coded [=Audio Substream=].
- <dfn noexport>Coupled stereo channels</dfn>: L/R, Ls/Rs, Lss/Rss, Lrs/Rrs, Ltf/Rtf, Ltb/Rtb
- <dfn noexport>Non-coupled channels</dfn>: C, LFE, L
The order of the [=Audio Substream=]s in each [=ChannelGroup=] SHALL be as follows:
- Coupled substreams come first and are followed by non-coupled substreams.
- Coupled substreams for surround channels come first and are followed by the coupled substreams for top channels.
- Coupled substreams for front channels come first and are followed by the coupled substreams for the side, rear and back channels.
- Coupled substreams for side channels come first and are followed by the coupled substreams for rear channels.
- Center channel comes first and is followed by LFE, and then L.
- Here, a <dfn noexport>non-coupled substream</dfn> is a coded [=Audio Substream=] from one of the non-coupled channels.
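Because each coupled substream carries two channels and each non-coupled substream carries one, the number of channels carried by a [=ChannelGroup=] follows from the two counts, as in this informative sketch. The sketch assumes that [=coupled_substream_count=] counts a subset of the substreams counted by [=substream_count=]. The function name is hypothetical.

```python
def channel_group_channel_count(substream_count, coupled_substream_count):
    """Return the number of channels carried by one ChannelGroup."""
    if substream_count <= 0:
        raise ValueError("substream_count SHALL NOT be 0")
    if not 0 <= coupled_substream_count <= substream_count:
        raise ValueError("coupled_substream_count is out of range")
    non_coupled = substream_count - coupled_substream_count
    return 2 * coupled_substream_count + non_coupled
```

For example, a single-layer 5.1ch element ordered as L/R, Ls/Rs, C, LFE has 4 substreams, 2 of them coupled, which gives 6 channels.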
<dfn noexport>output_gain_flags</dfn> indicates the channels which [=output_gain=] is applied to. If a bit is set to 1, [=output_gain=] SHALL be applied to the channel. Otherwise, [=output_gain=] SHALL NOT be applied to the channel.
<pre class = "def">
Bit position : Channel Name
b5(MSB) : Left channel (L1, L2, L3)
b4 : Right channel (R2, R3)
b3 : Left Surround channel (Ls5)
b2 : Right Surround channel (Rs5)
b1 : Left Top Front channel (Ltf)
b0 : Right Top Front channel (Rtf)
</pre>
<dfn noexport>output_gain</dfn> indicates the gain value to be applied to the mixed channels which are indicated by [=output_gain_flags=], where each mixed channel is generated by downmixing two or more input channels. It is 20*log10 of the factor by which to scale the mixed channels. It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e., Q7.8 in [[!Q-Format]]).
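The following informative sketch interprets the two output gain fields: it lists the channels selected by [=output_gain_flags=] and converts the raw 16-bit [=output_gain=] field into decibels and a linear scale factor. The function names and the short channel labels are hypothetical.

```python
# Channel selected by each bit of output_gain_flags, from b5 (MSB) down to b0.
OUTPUT_GAIN_FLAG_CHANNELS = ("Left", "Right", "Left Surround", "Right Surround",
                             "Left Top Front", "Right Top Front")


def channels_with_output_gain(output_gain_flags):
    """Return the names of the channels whose flag bit is set."""
    if not 0 <= output_gain_flags <= 0b111111:
        raise ValueError("output_gain_flags is a 6-bit field")
    return [name for i, name in enumerate(OUTPUT_GAIN_FLAG_CHANNELS)
            if output_gain_flags & (1 << (5 - i))]


def decode_output_gain(raw_bits):
    """Convert the raw 16-bit field into (gain in dB, linear scale factor)."""
    if not 0 <= raw_bits <= 0xFFFF:
        raise ValueError("output_gain is a 16-bit field")
    # Interpret as two's complement, then scale by 2^-8 (8 fractional bits).
    signed = raw_bits - 0x10000 if raw_bits & 0x8000 else raw_bits
    gain_db = signed / 256.0
    # gain_db = 20 * log10(factor), so factor = 10^(gain_db / 20).
    return gain_db, 10.0 ** (gain_db / 20.0)
```

For example, a raw value of 0x0600 is 6 dB, a scale factor of about 2, and 0xFF00 is -1 dB.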
### Ambisonics Config Syntax and Semantics ### {#syntax-ambisonics-config}
[=ambisonics_config()=] provides the configuration for a given Ambisonics representation. In this specification, the [[!AmbiX]] format is adopted, which uses Ambisonics Channel Number (ACN) channel ordering and normalizes the channels with Schmidt Semi-Normalization (SN3D).