Parallel Reverb Processor

Carl Doersch and Keith Bare

Proposal

Milestone 1

Milestone 2

Final Report

Samples



Parallel Freeverb Code

Final Report

April 30, 2008

Introduction:

When a musical instrument is played in a small room or on a synthesizer, it usually doesn't sound very good--it tends to sound "flat" or "dry." This is because, natural, large spaces, "reverberate" sound--that is, the sound reflects off of the room's surfaces. Hence, the listener hears not only the initial sound, but thousands of echos that are delayed by different amounts. Musicians often add artificial reverberation effects to digital audio, but good computer reverb algorithms can be computationally intensive.



Reverb is applied to a series of samples



Hence, the goal of this project is to parallelize a reverb algorithm. The parallel version of the algorithm should be able to outperform the sequential version of the algorithm. Ideally, the parallel algorithm will speed up linearly as more processors are used.

Methods and Process:

We began with an open source reverb audio effects package called Freeverb3, and focused on the implementation of a Schroeder/Moorer reverb model. Freeverb3 encapsulates this in an object called "revmodel." A revmodel then takes in a stream of audio samples and outputs a stream containing the audio with reverb applied. Internally, the revmodel contains 24 "filters," which also take in a stream of audio samples and output a transformed version of the audio. Since we're processing stereo, 12 filters process the left stream and 12 process the right stream. On each side, there are 8 "comb" filters and 4 "allpass" filters. Comb filters are slightly more computationally intensive than allpass filters. The revmodel passes the input through each comb filter in parallel and then through each allpass filter in series, summing the output from the 8 comb filters to create the input stream for the first allpass filter. Filters are stateful; the samples must be passed through each filter in the same order that they appear in the original audio stream.

One of our initial ideas for parallelizing the algorithm was to simply run each filter on a separate thread, and pass the streams of audio data from one processor to another. However, we ultimately gave up on this idea, because it makes load balancing extremely difficult. First of all, different types of filters require different amounts of computation. Second, we didn't want to constrain the number of processors used in the computation. Hence, in order to assure good load balancing, we would have needed to build some mechanism to sense bottlenecks in the stream, and then we would have needed to pass overloaded filters to underloaded processors, along with the buffered samples that were not processed. Of course, there are many cases where this sort of passing would simply cause the bottleneck to reappear on the recipient processor. Hence, we concluded that this design would probably not lead to reliable speedups.

The next design we considered, and the one we finally implemented, involves distributing the audio data between the processors. The audio is first divided up into chunks. Processors each calculate separately which chunk of the file to process, and then separately read the correct portion of the file and process it.

Next, we decided to implement a basic version of this program to get some baseline performance numbers. We decided to use MPI because there were no operations where a shared memory system would be helpful. This basic version showed us that the slowest operation was not the computation, but instead reading and writing the files. Speedups other than I/O were reasonable. Reading and writing speeds were similar to each other, equal to about 1.0-1.5 MB/s. Since modern commodity disks can transfer at rates up to 125 MB/s, it seems that I/O is having a larger effect on our times than would be expected on a commodity PC. This first version did reading in parallel, but not writing. We decided that it made sense to parallelize I/O for the sake of measuring and analyzing performance, but we would expect that on a commodity system the parallelizing of I/O would not be as helpful.

Next, we needed to implement a mechanism for splicing together two chunks that have already had reverb applied to them. The problem is that reverb is inherently stateful. Chunks in the middle of the file lack information required to correctly calculate reverb, since the sound in the previous chunk will have echos that last into the current chunk. As a result, a revmodel can't generate the echos for the sounds in the previous chunk. However, over time, the effect of samples before the chunk diminishes, and the output begins to converge toward the expected output.

We considered several designs that made use of this convergence phenomenon. Ultimately, we decided that a processor, processing chunk n, should serialize its entire revmodel and pass it on to the processor processing chunk n+1. This struck us as an efficient way to provide the reverb state from the previous chunk; all the data necessary for the revmodel is transferred, without any extra information.

Finally, we parallelized the writing process. This was done by simply writing the samples produced by each processor to a separate file. We looked at ways to write all the samples to the same file, but the library we use to write wav files, libsndfile, makes it difficult to correctly create a file header when there are multiple writers. The obvious solution to this problem is modification to the sound file library, but that isn't in the scope of this project. We did, however, write a short program to sequentially combine the output files into a single result.

There was one odd phenomenon that we haven't yet been able to explain. To perform the splicing, we would pass revmodels from one processor to another. When the revmodels arrive at the destination processor, the destination processor feeds zero-samples into it, until the revmodel's output is sufficiently silent. This usually ends up calculating 2 seconds of audio. We discovered that when the zeros fed into the revmodel are allocated at a single memory location (either a constant on the stack or dynamically allocated on the heap), the processing would go about 30 times slower than when we allocated a large array of zeros on the heap and then fed a different zero into the revmodel at each step of the processing. We noticed this when a separate bug (silence threshold was too low) caused us to feed an entire chunk-size worth of zeros into the revmodel, and this phenomenon caused our computation times to be dominated by zero processing. At the moment, our solution that avoids wasting the memory required in allocating an array full of zeros is a custom "feedzero" method inside the revmodel that does the computation slightly differently. We speculate that this phenomenon could be due to some sort of odd caching effects, or the single memory address blocking some compiler optimizations.

Results:

We tested our code on Cobalt as well as Bigben. We tested with two different files - the "medium" file was approximately 65 MB (equivalent to about 6 minutes of sound), and the "large" file was approximately 440 MB (equivalent to about 45 minutes of audio). We have attained significant speedups on both machines, although the speedups were not quite linear. Data from Cobalt may be seen below. Bigben, however, is having problems running our jobs at the moment, and so our data is incomplete. We expect to have information about the Bigben runs posted on the website once Bigben is working properly.







We speculate that the reason for most of the non-linearity is contention. There is one large burst of communication near the end of reverb processing when the revmodels are passed. At some point, this burst maxes out the bandwidth of the interconnect. Similarly, there are a number of overheads present on each processor (notably, the computation needed to splice the chunks together). These are constant, and so any single processor cannot finish in less time than is allowed by these overheads. Judging by the shape of the curve however (a sudden, sharp leveling of performance and even drops in performance for large numbers of processors), contention is probably the more important issue.

The speedup graphs also show that overheads due to I/O are important, especially with large numbers of processors. Hence, we plotted the amount of time spent on each operation (reading, computation, and writing) with respect to the number of processors. Note that these graphs use a logarithmic scale; hence, ideal speedups would give rise to straight lines on the graph.







From the graphs above, we can see that I/O performance does not scale with increased processors as well as computation performance. For both problems, speedup due to parallelization of I/O seems to be good for 8 and fewer processors. However, with more than 8 processors performance starts to degrade, especially for reads. We're not sure exactly why this is the case. One of our theories is that as the number of processors performing the read increases, the bandwidth requirements to complete the reads become large enough to saturate the machine's interconnect, and reads take longer due to the contention.

Write performance seems to scale a bit better than read performance. There is a slight bump in the write time with 12 processors, but after that bump write times continue to decrease. One suspicion is that this could be due to buffering, either in the operating system or the interconnection network. Another is that, during writing, the system has more flexibility to chose where, physically, the data is written to. Hence, it can write to p different storage devices when it's writing from p processors, but it can't choose how many storage devices to read from.

Conclusion:

In the end, we were able to achieve decent speedups performing reverb processing in parallel. In many ways, this was an interesting problem. Dealing with digital audio samples increases the importance of high performance I/O, as the inputs for processing can be very large. We observed speedups that weren't perfect, but this was expected; our method of parallelization requires each processor to do some extra work to correctly handle transitions between chunks.

We believe our work is of interest mostly due to the emergence of multicore workstations. Parallelizing algorithms for audio processing will allow more complex calculations to be used. Since processor clock rates are no longer increasing, parallelism will be the only way to continue to do more work in the same amount of time.

Appendix (May 1, 2008):

Now that Bigben is working again, here's our data from the bigben runs.











Clearly I/O is much more important on Bigben; the majority of the runtime is spent doing reading and writing. Our parallelizations of the I/O seem to have helped greatly; speedups on reading and writing are near linear.

It is somewhat interesting that, in terms of computation on the large problem, Bigben seems to perform best when the jobs are allocated on a number of processors that is a power of 2. Since every processor p needs to communicate with processor p+1 mod p, it perhaps Bigben lays out the jobs on processors in such a way that processor i is close to processor i+1 and processors that have numbers that are a power of 2 are close to processor 0.

This does not seem to be the case on the medium problem, however. It is possible that, in these cases, other overheads are dominating (a much larger proportion of the time is spent doing the splicing, since the amount of time required to do the splice is not dependent on the length of the file).

We were also surprised that the total speedups for the small problem look better than the speedups for the large problem on 24 processors. We aren't sure why this happened; it seems like the overheads for I/O should remain constant per byte read or written, and seek times should remain constant regardless of the number of bytes read or written. It is possible that some of these effects are a result of others on the machine; perhaps there is a limit on I/O bandwidth over a certain amount of time and our largest reads and writes exceed that bandwith. It is also possible that they are a result of coincidental interactions with other jobs writing to disk.





This final chart shows a comparison of the times for both machines. It is difficult to compare the speedups for the machines since Bigben takes so much longer for I/O than Cobalt. It would seem that Bigben overall scales better in terms of processor performance and I/O performance; Cobalt is faster for a moderate number of processors, but for a higher number of processors its performance levels off, whereas Bigben seems to continue to get higher performance at as many as 64 processors. Presumably this is due to the interconnect; Bigben's topology probably allows it to avoid contention where Cobalt cannot.

Data (large problem):

Processors

Computation Time

Computation Speedup

Total Time

Total Speedup


COBALT

BIGBEN

COBALT

BIGBEN

COBALT

BIGBEN

COBALT

BIGBEN

1

80.0255

34.5718

1.0000

1.0000

94.5851

499.3740

1.0000

1.0000

2

39.9669

19.1783

2.0023

1.8027

47.4466

259.6760

1.9935

1.9231

4

20.0072

9.6634

3.9998

3.5776

24.3191

179.2920

3.8893

2.7853

8

10.0954

5.8955

7.9269

5.8641

12.6481

77.8826

7.4782

6.4119

12

6.8702

4.5187

11.6482

7.6508

9.7304

69.1050

9.7206

7.2263

16

5.1426

3.3363

15.5613

10.3624

7.5507

51.9893

12.5267

9.6053

24

3.5501

3.6620

22.5419

9.4406

5.8098

49.8869

16.2803

10.0101

32

2.7952

1.1637

28.6293

29.7080

5.0586

27.3689

18.6979

18.2460

40


1.0706


32.2935


24.8149


20.1240

48

2.7859

1.0666

28.7249

32.4137

4.9947

22.1412

18.9370

22.5541

56


1.0056


34.3800


16.8916


29.5635

64

3.0551

0.7427

26.1940

46.5491

4.8460

14.7431

19.5181

33.8717



Data (medium problem):

Processors

Computation Time

Computation Speedup

Total Time

Total Speedup


COBALT

BIGBEN

COBALT

BIGBEN

COBALT

BIGBEN

COBALT

BIGBEN

1

11.593

5.2844

1.0000

1.0000

13.6662

92.066

1.0000

1.0000

2

5.82614

2.98339

1.9898

1.7712

6.96535

55.8371

1.9620

1.6488

4

3.03349

1.6407

3.8216

3.2208

3.60864

20.3676

3.7870

4.5202

8

1.52072

1.0456

7.6233

5.0539

1.90229

13.2046

7.1840

6.9722

12

1.19466

0.840953

9.7040

6.2838

1.67296

8.35388

8.1688

11.0207

16

0.811849

0.939031

14.2797

5.6275

1.22487

10.0013

11.1572

9.2054

24

0.741781

0.911556

15.6286

5.7971

1.00156

5.91694

13.6449

15.5597