Discussion:
[Audacity-devel] Automatic time-syncing feature
Raphaël Marinier
2016-06-26 14:27:07 UTC
Hi Audacity developers,

One feature that I have been missing in Audacity is automatic time-syncing
of two audio tracks.

The use case is when one has multiple recordings of the same event, not
time-synced, and wants to align them together. This happens for instance
when the two tracks come from multiple devices (e.g. video camera and
portable audio recorder).
Right now, the user has to manually time-shift tracks to make sure they
align, which is cumbersome and imprecise.

I've researched the subject a bit, and I think it would be doable to
implement auto-syncing of tracks in an efficient way using a combination of
audio fingerprinting (see for instance
https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf) for approximate
syncing, and maximization of cross-correlation for the fine-tuning.
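
As a rough sketch of the cross-correlation fine-tuning step (illustrative
Python/scipy with made-up names, not Audacity code), the offset between two
roughly pre-aligned tracks could be estimated like this:

    import numpy as np
    from scipy.signal import correlate

    def estimate_offset_seconds(a, b, rate):
        """Lag (in seconds) by which track b must be delayed to best line up
        with track a.  Computed via FFT, so the cost is O(n log n) rather
        than O(n * max_shift)."""
        corr = correlate(a, b, mode="full", method="fft")
        lag = np.argmax(corr) - (len(b) - 1)  # full-correlation index -> lag in samples
        return lag / rate

    # Toy check: the same 2 s burst starts at 1.0 s in `a` and 0.5 s in `b`,
    # so the estimate should come out close to +0.5 s.
    rate = 8000
    burst = np.random.default_rng(0).standard_normal(2 * rate)
    a = np.concatenate([np.zeros(rate), burst])
    b = np.concatenate([np.zeros(rate // 2), burst])
    print(estimate_offset_seconds(a, b, rate))  # ~0.5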

I could implement such a feature in Audacity as a new effect. Would this
contribution be welcome in Audacity? Is it possible for the output of an
effect to be a "time-shift"?

Thanks,

Raphaël
Roger Dannenberg
2016-06-26 15:29:22 UTC
This is an interesting problem. Offhand, I would guess results would be
better if you emphasized low frequencies in some way: high
frequencies/short wavelengths are more easily reflected so if recording
devices or objects around them move, reflection paths could change,
shifting the timing by milliseconds (about 1ms per 1ft change in path
length). Of course, low frequencies will have less timing precision, so
there's a tradeoff.

Another important consideration is the difference in sample rates
between recordings. Even if 2 devices claim to record at 44.1kHz, the
*actual* sample rate is slightly different. A 0.01% difference, which is
very likely in consumer devices, over 20 minutes (1200s) of recording
time would result in a drift of 0.12s (!), so any time-syncing should
estimate time shift at multiple points and try to correct for sample
rate differences.

-Roger
James Crook
2016-06-26 15:49:16 UTC
Raphaël, we are interested, but in something more general than this.
http://wiki.audacityteam.org/wiki/Proposal_Audio_Diff

Aligning two tracks is a special case.
You should look at VAMP plug-ins for Audacity. These can do analysis,
not just effects.
Here is some information about the MATCH plug-in for audio diff.

https://code.soundsoftware.ac.uk/projects/match-vamp
Calculate alignment between two performances in separate channel inputs.

The code for doing the alignment is one part of the problem. We also
need to design a good user interface for using it. My view is that when
designing an interface to align two audio sequences without inserting
gaps, we should at the same time be thinking about the interface for
aligning them with gaps. Otherwise we will eventually end up with two
different interfaces doing 'the same thing'.

I would very much like it if you worked with the VAMP MATCH plug-in,
got the details sorted, and wrote it up for the manual so that we would
want to ship it with Audacity.

--James.
Bill Unruh
2016-06-26 16:00:53 UTC
Post by Raphaël Marinier
Hi Audacity developers,
One feature that I have been missing in Audacity is automatic time-syncing of two audio tracks.
The use case is when one has multiple recordings of the same event, not time-synced, and wants to
align them together. This happens for instance when the two tracks come from multiple devices
(e.g. video camera and portable audio recorder).
Right now, the user has to manually time-shift tracks to make sure they align, which is cumbersome
and imprecise.
Well, time shift is not the only problem, since most recordings are not at the
same frequency even if they have the same nominal frequency. 44100 vs 48000
is obvious, but 44100 vs 44150 is far more likely with standard consumer-grade
sound cards. Of course one could break the material into blocks and time-shift
each one. (For example, it would take about 800 sec for the above two
frequencies to drift out of sync by 1 sec, so time-shifting once a second
could be done. But even then, dropping or adding 50 frames would surely be
noticeable.) I.e., one should also do frequency shifting if it were to work.
One could of course do a time shift at the beginning and the end of a block
and use the difference to also implement a frequency shift.
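
As a rough illustration of that last point (a Python/numpy sketch with
made-up helper names; it assumes the two tracks have already been coarsely
aligned, so the windows overlap the same material):

    import numpy as np
    from scipy.signal import correlate

    def local_offset(a, b, rate, start_s, window_s):
        """Residual offset (seconds) between a and b inside a window at start_s."""
        i, n = int(start_s * rate), int(window_s * rate)
        c = correlate(a[i:i + n], b[i:i + n], mode="full", method="fft")
        return (np.argmax(c) - (n - 1)) / rate

    def clock_ratio(a, b, rate, window_s=30.0):
        """Estimate the relative clock-speed error of b from the offsets near
        the start and the end of the overlap; time-scaling b by the returned
        factor removes a linear drift."""
        t1 = min(len(a), len(b)) / rate - window_s
        d0 = local_offset(a, b, rate, 0.0, window_s)
        d1 = local_offset(a, b, rate, t1, window_s)
        return 1.0 + (d1 - d0) / t1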
Raphaël Marinier
2016-06-26 20:11:59 UTC
Varying sampling rate is indeed an issue that will need to be taken care
of. Do the actual frequencies of multiple devices tend to only differ by a
constant multiplier (e.g. constant 44100 vs constant 44150), or is it
common to have small variations of sampling rate in a recording from a
single device (e.g. device first records at 44100, and then drifts to
44150)? The former is of course easier to solve.

James, thanks for the background and advice.
Indeed the "Audio Diff" proposal is more general. It also seems quite
harder, at least because of the variations in the way of playing, speed and
potential gaps, as you mentioned, and because of all the UI questions
around the handling of imprecise and partial matches, time expansion,
errors, etc.. Also, the algorithms will of course be more generic and
complex than for aligning two recordings of the same performance. I had a
quick look at the MATCH paper
<http://www.eecs.qmul.ac.uk/~simond/pub/2005/ismir05.pdf>, and the max
errors for commercial recordings on page 5 shows that the algorithm is far
from perfect.

I'll have a look into the MATCH plugin and do some tests. Do you think
there would be space for both features: (1) Simple alignment of N
recordings of the same sound (my original proposal) (2) Audio Diff, with
advanced UI to visualize and work with diffs? Is there any other software
doing (2), so that we can have an idea of the user experience?

Raphaël
James Crook
2016-06-26 23:01:09 UTC
Post by Raphaël Marinier
Do you think there would be space for both features: (1) Simple
alignment of N recordings of the same sound (my original proposal) (2)
Audio Diff, with advanced UI to visualize and work with diffs?
Yes.
All I am suggesting is that in designing the UI for the special case the
more general case be thought about.

For example in the general case we might have indications of how
stretchy different parts of the audio are. Silence and vowel sounds are
stretchy. Percussion sounds are not. The visuals and interaction for
indicating that could be used for the 'stretchy' pieces at the beginning
and ends of the audio in the simpler case of just time shifting whole
sequence without otherwise changing it.

Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track. Under the hood you really need source separation, to pick
out the central instruments, and delayed right and left instruments
(respectively). The alignment of left and right audio channels is
ambiguous if you are not allowed to split the sources out.
Post by Raphaël Marinier
Is there any other software doing (2), so that we can have an idea of
the user experience?
I'm not aware of it for audio, but have not researched it.

I am aware of it for DNA and protein sequence alignment editors (from
the 90s). The sync-lock we have in Audacity is a starting point for a
manual alignment editor. We would need to be able to lock particular
segments of audio together, not just the whole sequence, and to be able to
turn on and off those local sync-locks easily. Our time ruler would need
to allow insertions and deletions in it just like any other track. One
exercise is to think about how the 'Truncate Silence' effect would look
if it did the same thing as now but affected the timeline/ruler rather
than the waveform.

As well as an alignment view we would want a dotplot view.

--James.
Roger Dannenberg
2016-06-27 00:38:13 UTC
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.

Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.

-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
Robert Hänggi
2016-06-27 07:19:10 UTC
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
For the background see:
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific post.

In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal; however, there are some difficulties in
properly aligning it with the base audio:
Base left=LFront + LBack (for normal stereo playback)
FM Left= LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter etc.

In summary, the alignment has to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and so on...
For the moment, I will use cross-correlation and least-squares fitting
for certain "anchor" points.
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.

Cheers
Robert
Raphaël Marinier
2016-07-07 22:00:26 UTC
Thanks for the information.

I did some testing of the MATCH vamp plugin, running it via sonic analyzer,
which integrates it already.

First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a
max allowed time shift of 60s, it takes 6 minutes on a recent processor
(Intel i5-5200U), and takes about 8GB of RAM. Using it for larger time
shifts such as 10 minutes will be quite expensive...

I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment
analysis, but does not actually align the tracks.

(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.

(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio is
wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.

(3) 2 audio tracks recorded from the same concert (left and right channels
from the same device), except for a 30s hole filled with pink noise, with a
time-shift of about 15s between them.
Same issues as (2), no new issues.

(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds ratios of tempos that are as divergent as
<0.8 or >1.2 a significant fraction of the time. This is pretty bad since a
correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved with non-default parameters, lowering the cost of
the diagonal to 1.5 and enabling the "path smoothing" feature, but the tempo
ratio still routinely hovers around 0.9 - 1.1.

(5) 2 recordings of two performances of the same composition, time shift of
about 15s, and hole of about 30s.
Default parameters lead to big issues at boundaries around the hole (10s
and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly improves
the match by mostly fixing the boundaries around the hole. There is still a
small issue with the first 0.5s of the performance that remains incorrectly
matched.
I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.

My conclusion is that the match plugin cannot be used that easily, even for
the simple case of 2 recordings of the same event, because of accuracy and
performance. The former could be fixable by imposing stronger regularity of
the path (e.g. piecewise linear). The latter might be harder.

I propose to start working on an algorithm and feature specific to the case
of 2 recordings of the same event, which is an easier case to start with
both in terms of algorithm and UI.
I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.

Cheers,

--
Raphaël
James Crook
2016-07-13 22:02:58 UTC
Sorry for the delay in getting back to you on this thread.


If you do use a dynamic programming approach, there is a neat trick I
invented (in context of DNA sequence matching) that caters for different
kinds of matching. The trick is to run two 'match matrices' at the same
time, and have a penalty for switching between them. This is excellent
where there is a mix of signal and noise, as in your test examples. For
aligning noise you want a fairly sloppy, not very precisely
discriminating comparison that picks up broad characteristics.
What's great about running two match matrices is that the algorithm
naturally switches to using the best kind of matching for different
sections.
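
Here is a minimal sketch of the idea in Python (hypothetical strict/sloppy
scoring callbacks, switching only allowed on diagonal moves, nothing tuned),
just to make the two-matrix recurrence concrete:

    import numpy as np

    def dual_matrix_align_score(x, y, strict, sloppy, gap=-1.0, switch=-4.0):
        """Global alignment score run with two scoring schemes at once.
        strict(a, b) and sloppy(a, b) score a match between frames a and b;
        one DP matrix is kept per scheme and a `switch` penalty is paid to
        move between them, so noisy stretches fall to the sloppy scheme and
        clean stretches to the strict one."""
        n, m = len(x), len(y)
        S = [np.full((n + 1, m + 1), -np.inf) for _ in range(2)]
        S[0][0, 0] = S[1][0, 0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                for k, score in ((0, strict), (1, sloppy)):
                    best = S[k][i, j]  # keeps the 0 at the origin
                    if i and j:
                        diag = max(S[k][i-1, j-1], S[1-k][i-1, j-1] + switch)
                        best = max(best, diag + score(x[i-1], y[j-1]))
                    if i:
                        best = max(best, S[k][i-1, j] + gap)
                    if j:
                        best = max(best, S[k][i, j-1] + gap)
                    S[k][i, j] = best
        return max(S[0][n, m], S[1][n, m])

    # e.g. strict = lambda a, b: 1.0 if abs(a - b) < 0.05 else -1.0
    #      sloppy = lambda a, b: 0.2 if abs(a - b) < 0.5 else -0.2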


On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer
approach. Instead of allocating space length x max-shift you sample
evenly and only allocate space of k x max-shift for some small value of
k such as 100. The cost is that you have to repeat the analysis
log(length-of-sequence) times, where the log is to base k. So aligning to
the nearest 10ms on two 1hr sequences with a shift of up to 20 mins
would take about 50MB of storage (for one match matrix) or 100MB (with two
in parallel), and the analysis would be repeated 3 times. Because you stay
in cache during the analysis and write much less to external memory, it's a
big net win in both storage and speed over a single-pass approach.
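
A quick sanity check of those numbers, assuming 4-byte DP cells:

    import math

    frames = 3600 * 100        # 1 h summarised at 10 ms resolution
    max_shift = 20 * 60 * 100  # 20 min of maximum shift, in 10 ms frames
    k = 100                    # evenly sampled rows kept per pass
    cell_bytes = 4             # assumed size of one DP cell

    print(k * max_shift * cell_bytes / 1e6)  # ~48 MB for one match matrix
    print(math.ceil(math.log(frames, k)))    # 3 passes (log base k of the length)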

I haven't written versions for sound. This is extrapolating from back
in old times, in the late 80's when I was analysing DNA and protein
sequences on a PC with a fraction of the power and storage of modern
PCs. You had to be inventive to get any decent performance at all.
This kind of trick can pay off in a big way, even today.

I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!

--James.
Vaughan Johnson
2016-07-13 22:26:13 UTC
James: "This is extrapolating from back in old times, in the late 80's when
I was analysing DNA and protein sequences..."



Didn't know that! I was doing similar work then, with Blackboard systems,
on the PROTEAN project at Stanford KSL,
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf .

Yes I've known about dynamic programming since about then. Good work, James
-- I like your trick.

-- V
Raphaël Marinier
2017-06-10 10:51:43 UTC
Hi all,

After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity that aligns different recordings
of the same event.

You can see the code there:
https://github.com/RaphaelMarinier/audacity/commit/3276106c66c35e390c8169d0ac9bfab22e352567

The algorithm is as follows:
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.

There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
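
As a rough sketch of steps 1 and 4 in illustrative Python/numpy (made-up
names; the actual prototype linked above is C++ inside Audacity). The drift
fit assumes the summaries have already been shifted by the global offset
found in step 3:

    import numpy as np
    from scipy.signal import correlate

    def summarize(samples, rate, window_s=0.025):
        """Step 1: one RMS value per 25 ms window (a stand-in for the real summary)."""
        samples = np.asarray(samples, dtype=float)
        n = int(window_s * rate)
        trimmed = samples[:len(samples) // n * n].reshape(-1, n)
        return np.sqrt((trimmed ** 2).mean(axis=1))

    def drift_from_chunks(s1, s2, chunk=2000):
        """Step 4: align summary chunks 1:1 and least-squares fit
        offset ~ slope * t + intercept; `slope` is the relative clock-speed
        difference (e.g. 1e-4 for a 0.01% mismatch)."""
        centers, offsets = [], []
        n = min(len(s1), len(s2))
        for start in range(0, n - chunk, chunk):
            c = correlate(s1[start:start + chunk], s2[start:start + chunk],
                          mode="full", method="fft")
            offsets.append(np.argmax(c) - (chunk - 1))
            centers.append(start + chunk // 2)
        slope, intercept = np.polyfit(centers, offsets, 1)
        return slope, intercept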

I created a benchmark out of a few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resampling time if needed); memory
requirements are very small (on the order of 3MB for two 1h tracks).

Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal Audacity
feature (for example shown in the Tracks menu)?

There are of course some limitations that should still be addressed:
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.

The benchmarking code is completely ad-hoc; it would also be great to
find a way to run this kind of automated benchmark in a uniform way
across the Audacity code base (I guess other parts of Audacity could
benefit as well).

James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.

Raphaël
Roger Dannenberg
2017-06-10 14:54:05 UTC
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had alignment points
every 10s, you could make a piecewise-linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
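
This is not Nyquist code, but as a language-neutral illustration of the
S(f(t)) idea, here is a numpy sketch that warps a signal through a
piecewise-linear time map built from alignment points (plain linear
interpolation, much cruder than Nyquist's resampler):

    import numpy as np

    def warp(signal, rate, anchors_out, anchors_in):
        """Evaluate S(f(t)): output time anchors_out[i] should play the audio
        found at input time anchors_in[i], with f piecewise linear in between.
        Plain linear-interpolation resampling only."""
        t_out = np.arange(0.0, anchors_out[-1], 1.0 / rate)
        t_in = np.interp(t_out, anchors_out, anchors_in)               # f(t)
        return np.interp(t_in * rate, np.arange(len(signal)), signal)  # S at f(t)

    # e.g. a constant 1.01x speed-up over 10 s of audio:
    # warped = warp(sig, 44100, anchors_out=[0.0, 10.0], anchors_in=[0.0, 10.1])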

Some thoughts about alignment: What happens if you have recordings made
at different locations, of sources that are themselves at different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?

(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)

-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c66c35e390c8169d0ac9bfab22e352567
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), memory
requirements are very small (on the order of 3MBs for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal audacity
feature (for example shown in the Tracks menu)?
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.
The benchmarking code is completely ad-hoc, it would also be great to
find a way to run this kind of automated benchmarks in a uniform way
across Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
Post by Vaughan Johnson
James: "This is extrapolating from back in old times, in the late 80's when
I was analysing DNA and protein sequences..."
Didn't know that! I was doing similar work then, with Blackboard systems,
on the PROTEAN project at Stanford KSL,
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf .
Yes I've known about dynamic programming since about then. Good work, James
-- I like your trick.
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in context of DNA sequence matching) that caters for different
kinds of matching. The trick is to run two 'match matrices' at the same
time, and have a penalty for switching between them. This is excellent
where there is a mix of signal and noise, as in your test examples. For
aligning noise you want a fairly sloppy not very precisely discriminating
comparison that is picking up broad characteristics. What's great about
running two match matrices is that the algorithm naturally switches in to
using the best kind of matching for different sections.
On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer approach.
Instead of allocating space length x max-shift you sample evenly and only
allocate space of k x max-shift for some small value of k such as 100. The
cost is that you have to repeat the analysis log( length-of-sequence) times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would be
repeated 3 times. Because you stay in cache in the analysis and write much
less to external memory it's a big net win both in storage and speed over a
single pass approach.
I haven't written versions for sound. This is extrapolating from back in
old times, in the late 80's when I was analysing DNA and protein sequences
on a PC with a fraction of the power and storage of modern PCs. You had to
be inventive to get any decent performance at all. This kind of trick can
pay off in a big way, even today.
I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Thanks for the information.
I did some testing of the MATCH vamp plugin, running it via sonic
analyzer, which integrates it already.
First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a max
allowed time shift of 60s, it takes 6 minutes on a recent processor (Intel
i5-5200U), and takes about 8 GB of RAM. Using it for larger time shifts such
as 10 minutes would be quite expensive...
I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment analysis,
but does not actually align the tracks.
(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.
(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio
is wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.
(3) 2 audio tracks recorded from the same concert (left right channels
from same device), except for a 30s hole filled with pink noise, with a
time-shift of about 15s between them.
Same issues as (2); no new issues.
(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds tempo ratios as divergent as
<0.8 or >1.2 for a significant fraction of the time. This is pretty bad, since a
correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved using non-default parameters of lowering the cost of
the diagonal to 1.5, and enabling the "path smoothing" feature, but tempo
ratio still routinely hovers around 0.9 - 1.1.
(5) 2 recordings of two performances of the same composition, time shift
of about 15s, and hole of about 30s.
Default parameters lead to big issues at boundaries around the hole (10s
and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly improves
the match by mostly fixing the boundaries around the hole. There is still a
small issue with the first 0.5s of the performance that remains incorrectly
matched.
I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.
My conclusion is that the MATCH plugin cannot be used that easily, even
for the simple case of 2 recordings of the same event, because of accuracy
and performance issues. The former could be fixed by imposing stronger regularity
on the path (e.g. piecewise linear). The latter might be harder.
I propose to start working on an algorithm and feature specific to the
case of 2 recordings of the same event, which is an easier case to start
with both in terms of algorithm and UI.
I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.
Cheers,
--
Raphaël
Post by Robert Hänggi
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific post.
In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal; however, there are some difficulties because
the four signals are mixed:
Base Left = LFront + LBack (for normal stereo playback)
FM Left = LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter etc.
In summary, the alignment had to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and on and on...
For the moment, I will use cross-correlation and least square fitting
for certain "anchor" points.
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.
Cheers
Robert
Post by Roger Dannenberg
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.
Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.
-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
Robert Hänggi
2017-06-10 15:38:07 UTC
Permalink
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an alignment point
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
I often find that Audacity crashes when I use 'resample' or
'resamplev', especially when the selection is a bit long or when the
(static) factor exceeds about 1:19.

Robert
Post by Roger Dannenberg
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)
-Roger
Raphaël Marinier
2017-06-17 22:49:34 UTC
Permalink
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an alignment point
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph:
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeWM/view?usp=sharing>).

However, the algorithm performs relatively coarse alignment. We fit an
affine function to those time differences versus track time, and just apply
this affine transformation globally to one of the tracks.
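For illustration, a minimal Python sketch of this coarse correction (names
and the interpolation-based warp are illustrative stand-ins for Audacity's
actual shifting and resampling): fit an affine offset model to the measured
(track time, time difference) points by least squares, then evaluate one
track on the warped clock, in the spirit of the S(f(t)) mapping quoted above.

    import numpy as np

    def fit_affine_offset(times, offsets):
        # Least-squares fit offset(t) ~ slope * t + intercept to the per-chunk
        # (track time, detected time difference) points from the second phase.
        slope, intercept = np.polyfit(times, offsets, deg=1)
        return slope, intercept

    def warp_track(samples, rate, slope, intercept):
        # Evaluate one track on the corrected clock f(t) = t + offset(t),
        # here with plain linear interpolation as a stand-in for real resampling.
        samples = np.asarray(samples, dtype=float)
        t = np.arange(len(samples)) / rate
        return np.interp(t + slope * t + intercept, t, samples, left=0.0, right=0.0)

Whether the offset is added or subtracted, and which track gets warped,
depends on which track is taken as the reference.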
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?

Thanks,

Raphaël
Post by Roger Dannenberg
(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)
-Roger
Roger Dannenberg
2017-06-18 02:57:45 UTC
Permalink
Here are some random thoughts: It would make sense to compute alignment
at many points and do some sort of smoothing. You might ask (and try to
solve): What alignment function minimizes the sum-of-squares of
alignment errors, considering only *plausible* alignment functions, i.e.
those that could be produced by real crystal clocks? I'm not even sure
of a reasonable model for clock drift, but one approach might be to just
take the alignment function, treat it as a signal and low-pass it. The
cut-off frequency would be very low, a tiny fraction of 1 Hz, and you'd
have to be careful not to introduce phase shift or lag: The standard
trick is to run an IIR filter over the signal, reverse it, filter it
again, and reverse it again, so that phase shifts or lags cancel. I
think getting the start and end of the signal right, i.e. initializing
the filter state, is also tricky. Another approach might be
least-squares regression to fit a higher-order polynomial rather than a
line to the data. At least, it seems that linear regression over a bunch
of alignment points would do a good job assuming clocks are stable and
just running at slightly different speeds.
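The forward-backward filtering described here is what SciPy exposes as
filtfilt; a minimal illustrative sketch of smoothing a series of per-chunk
offsets that way (the cut-off and filter order are arbitrary example values,
e.g. offsets measured every 10 s with a 0.001 Hz cut-off):

    import numpy as np
    from scipy.signal import butter, filtfilt

    def smooth_alignment(offsets, chunk_period_s, cutoff_hz=0.001):
        # offsets: detected time differences at evenly spaced points along the track.
        # Zero-phase low-pass: filter forward, reverse, filter again, reverse again,
        # so the smoothed alignment function has no lag relative to the raw estimates.
        fs = 1.0 / chunk_period_s               # sampling rate of the offset series
        b, a = butter(2, cutoff_hz / (fs / 2))  # 2nd-order low-pass, normalized cut-off
        return filtfilt(b, a, np.asarray(offsets, dtype=float))  # needs more than a handful of points

filtfilt pads the series and chooses initial filter conditions to limit the
start/end transients mentioned above; a simple alternative in the same spirit
is np.polyfit on the offsets, i.e. the polynomial regression idea.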
-Roger
Raphaël Marinier
2017-06-18 16:27:27 UTC
Permalink
Post by Roger Dannenberg
Here are some random thoughts: It would make sense to compute alignment at
many points and do some sort of smoothing. You might ask (and try to
solve): What alignment function minimizes the sum-of-squares of alignment
errors, considering only *plausible* alignment functions, i.e. those that
could be produced by real crystal clocks? I'm not even sure of a reasonable
model for clock drift, but one approach might be to just take the alignment
function, treat it as a signal and low-pass it. The cut-off frequency would
be very low, a tiny fraction of 1 Hz, and you'd have to be careful not to
introduce phase shift or lag: The standard trick is to run an IIR filter
over the signal, reverse it, filter it again, and reverse it again, so that
phase shifts or lags cancel. I think getting the start and end of the
signal right, i.e. initializing the filter state, is also tricky. Another
approach might be least-squares regression to fit a higher-order polynomial
rather than a line to the data. At least, it seems that linear regression
over a bunch of alignment points would do a good job assuming clocks are
stable and just running at slightly different speeds.
Note that the current algorithm already does a linear regression over
multiple alignment points. This indeed corrects slightly different clock
speeds, assuming the speed differences are stable.

I'll go further and fit continuous piece-wise linear functions, to catch
unstable clock differences. I'll place the knots of the function ~10
minutes apart.
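
To make this concrete, here is a minimal C++ sketch of what such a fit
could look like (illustrative only, not the prototype's code; all names are
invented). The drift function is written as a sum of "hat" basis functions
on equally spaced knots, so each coefficient is simply the fitted time
shift at one knot, and the coefficients come from ordinary least squares:

// Minimal sketch, not the prototype's code: fit a continuous piece-wise
// linear drift function to (trackTime, detectedShift) observations by
// least squares.  The function is a sum of "hat" basis functions on
// equally spaced knots, so coefficient k is the fitted shift at knot k.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hat (tent) basis centred on `knot`, with uniform knot spacing `h`.
static double Hat(double t, double knot, double h)
{
   return std::max(0.0, 1.0 - std::fabs(t - knot) / h);
}

// Returns the fitted shift at each of `numKnots` (>= 2) equally spaced knots.
std::vector<double> FitPiecewiseLinear(const std::vector<double>& t,
                                       const std::vector<double>& y,
                                       int numKnots)
{
   const double t0 = *std::min_element(t.begin(), t.end());
   const double t1 = *std::max_element(t.begin(), t.end());
   const double h = (t1 - t0) / (numKnots - 1);   // e.g. ~600 s

   std::vector<double> knots(numKnots);
   for (int k = 0; k < numKnots; ++k)
      knots[k] = t0 + k * h;

   // Normal equations A c = b; a tiny ridge term keeps A non-singular
   // even if some knot interval contains no observations.
   std::vector<double> A(numKnots * numKnots, 0.0);
   std::vector<double> b(numKnots, 0.0), c(numKnots, 0.0);
   for (std::size_t i = 0; i < t.size(); ++i)
      for (int j = 0; j < numKnots; ++j) {
         const double bj = Hat(t[i], knots[j], h);
         if (bj == 0.0) continue;
         b[j] += bj * y[i];
         for (int k = 0; k < numKnots; ++k)
            A[j * numKnots + k] += bj * Hat(t[i], knots[k], h);
      }
   for (int j = 0; j < numKnots; ++j)
      A[j * numKnots + j] += 1e-9;

   // Solve the small dense system by Gaussian elimination with pivoting.
   c = b;
   for (int col = 0; col < numKnots; ++col) {
      int pivot = col;
      for (int r = col + 1; r < numKnots; ++r)
         if (std::fabs(A[r * numKnots + col]) > std::fabs(A[pivot * numKnots + col]))
            pivot = r;
      for (int k = 0; k < numKnots; ++k)
         std::swap(A[col * numKnots + k], A[pivot * numKnots + k]);
      std::swap(c[col], c[pivot]);
      for (int r = col + 1; r < numKnots; ++r) {
         const double f = A[r * numKnots + col] / A[col * numKnots + col];
         for (int k = col; k < numKnots; ++k)
            A[r * numKnots + k] -= f * A[col * numKnots + k];
         c[r] -= f * c[col];
      }
   }
   for (int col = numKnots - 1; col >= 0; --col) {
      for (int k = col + 1; k < numKnots; ++k)
         c[col] -= A[col * numKnots + k] * c[k];
      c[col] /= A[col * numKnots + col];
   }
   return c;   // evaluate f(t) by linear interpolation between the knots
}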

When we evaluate the alignment at many positions, some detected time
differences will be completely wrong. This can be because the algorithm did
not succeed (e.g. the time window considered is mostly filled with
silence), or because the two tracks only partially overlap. The model has
to be robust to these outliers and nonsensical values. The more complex the
function we fit, the harder it is, so I am very much in favor of keeping the
fitting function as simple as possible.
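
One cheap way to get that robustness, as a sketch and assuming the gross
errors are much larger than the real drift, is to drop points whose
detected shift sits far from the median shift (in units of the median
absolute deviation) before fitting anything:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

static double Median(std::vector<double> v)
{
   std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
   return v[v.size() / 2];
}

// Returns the indices of the alignment points whose detected shift lies
// within `nSigmas` robust standard deviations of the median shift.
std::vector<std::size_t> FilterOutliers(const std::vector<double>& shift,
                                        double nSigmas = 5.0)
{
   const double med = Median(shift);
   std::vector<double> absDev(shift.size());
   for (std::size_t i = 0; i < shift.size(); ++i)
      absDev[i] = std::fabs(shift[i] - med);
   // 1.4826 * MAD estimates the standard deviation for Gaussian noise.
   const double scale = 1.4826 * Median(absDev) + 1e-12;

   std::vector<std::size_t> kept;
   for (std::size_t i = 0; i < shift.size(); ++i)
      if (std::fabs(shift[i] - med) <= nSigmas * scale)
         kept.push_back(i);
   return kept;
}

If the drift itself were large, the residuals would have to be taken
against a rough first-pass linear fit rather than against the plain median.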

Regarding the resampling, I've seen that soxr, currently used by Audacity,
supports piece-wise linear functions, and it seems straightforward to use
from Audacity's code. If we wanted to do this with Nyquist, I'd have to
execute Nyquist instructions from my code, which seems considerably more
complicated. Is there an easy way to do it from code that I missed?
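
Just to illustrate the operation needed (this is not soxr's API, and the
naive linear interpolation below is only a stand-in for a real resampler),
applying a piece-wise linear time mapping means reading the input at a
warped position for every output sample:

#include <cstddef>
#include <vector>

// f(outTime) = inTime, with anchors sorted by outTime (at least two).
struct Anchor { double outTime, inTime; };

static double MapTime(const std::vector<Anchor>& a, double t)
{
   std::size_t i = 1;
   while (i + 1 < a.size() && t > a[i].outTime)
      ++i;
   const double u = (t - a[i - 1].outTime) / (a[i].outTime - a[i - 1].outTime);
   return a[i - 1].inTime + u * (a[i].inTime - a[i - 1].inTime);
}

// Produce `outLength` samples of the warped track at sample rate `rate`.
std::vector<float> WarpTrack(const std::vector<float>& in, double rate,
                             const std::vector<Anchor>& anchors,
                             std::size_t outLength)
{
   std::vector<float> out(outLength, 0.0f);
   for (std::size_t n = 0; n < outLength; ++n) {
      const double srcPos = MapTime(anchors, n / rate) * rate;
      if (srcPos < 0.0 || srcPos + 1.0 >= static_cast<double>(in.size()))
         continue;                        // outside the input: leave silence
      const std::size_t i = static_cast<std::size_t>(srcPos);
      const double frac = srcPos - static_cast<double>(i);
      out[n] = static_cast<float>((1.0 - frac) * in[i] + frac * in[i + 1]);
   }
   return out;
}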

Raphaël
Roger Dannenberg
2017-06-18 18:36:48 UTC
Permalink
Post by Raphaël Marinier
Note that the current algorithm already does a linear regression over
multiple alignment points. This indeed corrects slightly different
clock speeds, assuming the speed differences are stable.
I'll go further and fit continuous piece-wise linear functions, to
catch unstable clock differences. I'll place the knots of the function
~10 minutes apart.
That sounds like a good idea. When you do regression on different
sections, the end-points will not match, creating another curve-fitting
problem. You may have a better idea, but here's one: Just use linear
regression within each 10-minute segment to estimate the alignment at the
segment's exact midpoint; then connect the midpoints to make a continuous
piece-wise linear function. I guess the first and last 5 minutes can just
be a linear extrapolation of that curve.
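
A sketch of that idea, with invented names and nothing Audacity-specific:
run an ordinary regression inside each segment and keep only the fitted
value at the segment's midpoint:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Knot { double time, shift; };

// t/y are the alignment observations (track time, detected shift).
// Returns one knot per segment of length `segmentLen` (e.g. 600 seconds);
// the alignment function is the piece-wise linear interpolation of the
// knots, extrapolated linearly before the first and after the last one.
std::vector<Knot> MidpointKnots(const std::vector<double>& t,
                                const std::vector<double>& y,
                                double segmentLen)
{
   std::vector<Knot> knots;
   if (t.empty())
      return knots;
   const double tBegin = *std::min_element(t.begin(), t.end());
   const double tEnd = *std::max_element(t.begin(), t.end());
   for (double s = tBegin; s < tEnd; s += segmentLen) {
      double n = 0, st = 0, sy = 0, stt = 0, sty = 0;
      for (std::size_t i = 0; i < t.size(); ++i)
         if (t[i] >= s && t[i] < s + segmentLen) {
            n += 1; st += t[i]; sy += y[i];
            stt += t[i] * t[i]; sty += t[i] * y[i];
         }
      const double denom = n * stt - st * st;
      if (n < 2 || std::fabs(denom) < 1e-12)
         continue;                         // too little data in this segment
      const double slope = (n * sty - st * sy) / denom;
      const double intercept = (sy - slope * st) / n;
      const double mid = s + segmentLen / 2;
      knots.push_back({ mid, intercept + slope * mid });
   }
   return knots;
}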
Post by Raphaël Marinier
When we evaluate the alignment at many positions, some detected time
differences will be completely wrong. This can be because the algorithm
did not succeed (e.g. the time window considered is mostly filled with
silence), or because the two tracks only partially overlap. The model
has to be robust to these outliers and nonsensical values. The more
complex the function we fit, the harder it is, so I am very much in favor
of keeping the fitting function as simple as possible.
Another thought: Sometimes you can do much better by estimating both the
alignment and confidence in the alignment. Then you can do a weighted
linear regression using confidence as weights.
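
For instance (a sketch; the weight would be whatever score the alignment
step produces, e.g. a normalized correlation peak height):

#include <cstddef>
#include <vector>

struct Line { double slope, intercept; };

// Weighted least-squares fit of shift = intercept + slope * time,
// where w[i] is the confidence of the i-th alignment measurement.
Line WeightedLinearFit(const std::vector<double>& t,
                       const std::vector<double>& y,
                       const std::vector<double>& w)
{
   double sw = 0, st = 0, sy = 0, stt = 0, sty = 0;
   for (std::size_t i = 0; i < t.size(); ++i) {
      sw += w[i];
      st += w[i] * t[i];
      sy += w[i] * y[i];
      stt += w[i] * t[i] * t[i];
      sty += w[i] * t[i] * y[i];
   }
   const double slope = (sw * sty - st * sy) / (sw * stt - st * st);
   return { slope, (sy - slope * st) / sw };
}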
Post by Raphaël Marinier
Regarding the resampling, I've seen that soxr, currently used by
Audacity, supports piece-wise linear functions, and it seems
straightforward to use from Audacity's code. If we wanted to do this
with Nyquist, I'd have to execute Nyquist instructions from my code,
which seems considerably more complicated. Is there an easy way to do it
from code that I missed?
I think it would be hard to call into Nyquist unless you did the whole
thing in Nyquist. I'm not sure what the soxr code does or what the API
looks like. I do know that resampling is tricky, so it might be worth
putting in some test signals with impulses or something very
distinguishable to test that resampling is working as intended. Since
the resampling algorithm is likely to use a number of windowing
operations, it's really easy to end up with shifted samples (or in some
implementations, shifting might be considered correct by the implementers).
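
The check could look roughly like this (a sketch; the resampler is passed
in as a black box, so the same test applies to soxr or to anything else):

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Resampler = std::function<std::vector<float>(const std::vector<float>&)>;

// Push a single impulse through `resample` (which converts by `ratio` =
// output rate / input rate) and check that the output peak lands within
// `tolerance` samples of where the time mapping says it should.
bool ImpulseStaysPut(const Resampler& resample, double ratio,
                     std::size_t impulseAt = 44100,
                     std::size_t length = 3 * 44100,
                     double tolerance = 2.0)
{
   std::vector<float> in(length, 0.0f);
   in[impulseAt] = 1.0f;

   const std::vector<float> out = resample(in);
   if (out.empty())
      return false;

   std::size_t peak = 0;
   for (std::size_t i = 1; i < out.size(); ++i)
      if (std::fabs(out[i]) > std::fabs(out[peak]))
         peak = i;

   const double expected = impulseAt * ratio;
   return std::fabs(static_cast<double>(peak) - expected) <= tolerance;
}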

-Roger
Federico Miyara
2017-06-19 01:45:33 UTC
Permalink
Roger,

In this article
http://www.electronicdesign.com/analog/minimize-frequency-drift-crystals
there are some insights as to crystal oscillator frequency drift.

For sealed crystals, where humidity and pressure are not a concern,
drift is mostly due to temperature (and mechanical stress, which I
presume is a long-term effect which will not affect a one-session
recording). As a worst case, a CT-cut crystal will change less than 20
parts per million in the range 15 °C to 40 °C, which covers most ambient
temperature situations. This amounts to at most 72 ms in a one-hour
recording.

A possible approach would be to align the beginnings, then measure the
time shift every 5 min or so to get about 12 reference points per
hour, fit a polynomial to those time shifts by least squares, and
apply variable resampling to align. The choice of 5 min is meant to
ensure that the error between reference points stays small and does
not significantly alter the natural delay that may exist between both
signals.
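
As a rough illustration of that idea (a Python/numpy sketch, not actual
Audacity code; it assumes the offset at each reference point has already
been measured, and all names are made up):

import numpy as np
from scipy.interpolate import interp1d

def warp_to_reference(track, sr, ref_times, offsets, degree=2):
    # Least-squares polynomial fit of measured offset vs. time
    # (the "about 12 reference points per hour" above).
    drift = np.poly1d(np.polyfit(ref_times, offsets, degree))
    # Reference time t should be read from track time t + drift(t).
    t_ref = np.arange(len(track)) / sr
    t_src = t_ref + drift(t_ref)
    # "Variable resampling", done here by plain linear interpolation;
    # a real implementation would use a high-quality resampler.
    read = interp1d(t_ref, track, bounds_error=False, fill_value=0.0)
    return read(t_src)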

There is still a problem. Any attempt to align by correlation peaks
will override the natural delay caused by different distances from
source to microphone (which in turn will vary with the position of the
source, and there could certainly be several sources!). A workaround
would be to find the best overall peak alignment and then interactively
fine-adjust the time delay to get the most realistic render.

However, this may present another problem: the alignment may change if
the main source at a given time is different from the main source at a
different time. There is no obvious solution for this.

Another approach would be to assume that both crystals have the same
drift (hopefully they will experience the same temperature changes), so
the correction might be based just on the start and end instants,
assuming the desynchronization is a linear cumulative phase drift
caused by slightly different frequencies.
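
In that two-point case the correction collapses to a single resampling
ratio plus a constant offset; for instance (all times made up):

# A clap near the start and another near the end, located in both tracks
# (e.g. by cross-correlating short regions around those two instants).
t0a, t1a = 2.00, 3502.00   # times of the two events in track A, in seconds
t0b, t1b = 5.50, 3505.85   # times of the same events in track B
ratio = (t1b - t0b) / (t1a - t0a)   # B's clock rate relative to A's (1.0001 here)
offset = t0b - ratio * t0a          # where A's time zero falls in B
# Aligning B then means reading B at time ratio*t + offset for output time t:
# a constant-rate resample by 1/ratio plus a constant shift.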

Regards

Federico
Post by Roger Dannenberg
Here are some random thoughts: It would make sense to compute
alignment at many points and do some sort of smoothing. You might ask
(and try to solve): What alignment function minimizes the
sum-of-squares of alignment errors, considering only *plausible*
alignment functions, i.e. those that could be produced by real crystal
clocks? I'm not even sure of a reasonable model for clock drift, but
one approach might be to just take the alignment function, treat it as
a signal and low-pass it. The cut-off frequency would be very low, a
tiny fraction of 1 Hz, and you'd have to be careful not to introduce
phase shift or lag: The standard trick is to run an IIR filter over
the signal, reverse it, filter it again, and reverse it again, so that
phase shifts or lags cancel. I think getting the start and end of the
signal right, i.e. initializing the filter state, is also tricky.
Another approach might be least-squares regression to fit a
higher-order polynomial rather than a line to the data. At least, it
seems that linear regression over a bunch of alignment points would do
a good job assuming clocks are stable and just running at slightly
different speeds.
-Roger
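
For what it's worth, scipy's filtfilt implements exactly the
filter-reverse-filter-reverse trick described above, and it pads the ends
to reduce the start/end transients. A toy sketch on a made-up series of
one-per-second offset estimates:

import numpy as np
from scipy.signal import butter, filtfilt

t = np.arange(3600.0)                        # one offset estimate per second
raw_offsets = 1e-5 * t + 0.002 * np.random.randn(t.size)   # drift + jitter
b, a = butter(2, 0.005 / 0.5)                # low-pass, 0.005 Hz cutoff at fs = 1 Hz
smoothed = filtfilt(b, a, raw_offsets)       # forward + backward: zero phase lag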
James Crook
2017-06-18 09:12:12 UTC
Permalink
It all depends on what you are doing the alignment for.

IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.

For recordings of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.

--James.
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an aligned points
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeWM/view?usp=sharing>
)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?
Thanks,
Raphaël
Raphaël Marinier
2017-06-18 16:47:49 UTC
Permalink
Post by James Crook
It all depends what you are doing the alignment for.
IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.
For recording of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.
For sure I assume it's the same performance. Aligning different performances
is a harder problem, and I think we need more powerful features for it to
work, such as note onset detection. The MATCH plugin does that, but
unfortunately it does not do a good job in the simpler case of different
recordings of the same performance.

I think we should be able to align when there are slightly different clock
speeds, and also different microphone placement.

Could you expand on the model for sources/reverb?

At the end of the day, the proposed algorithm uses peaks in the
cross-correlation function, evaluated for multiple time windows, to produce
the final alignment. This should align mostly based on the louder
components of the signal, not the reverberated components.
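
For reference, the coarse cross-correlation step boils down to something
like the following numpy sketch (the real implementation is the C++ in the
prototype commit; the names here are made up):

import numpy as np

def coarse_shift_seconds(summary_a, summary_b, hop_s=0.025):
    # Normalize the per-window summaries so overall level differences matter less.
    a = (summary_a - summary_a.mean()) / (summary_a.std() + 1e-12)
    b = (summary_b - summary_b.mean()) / (summary_b.std() + 1e-12)
    # Cross-correlation in O(n log n) via the FFT / convolution theorem.
    nfft = 1 << (len(a) + len(b) - 2).bit_length()
    corr = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    lags = np.arange(-(len(b) - 1), len(a))
    corr = np.concatenate((corr[-(len(b) - 1):], corr[:len(a)]))
    # Strongest peak = best shift, converted from summary windows to seconds
    # (positive means the second recording started later).
    return lags[np.argmax(corr)] * hop_s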

Thanks!

Raphaël

--James.
Post by James Crook
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an aligned points
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeW
M/view?usp=sharing>
)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?
Thanks,
Raphaël
(By the way, Nyquist's phase-vocoder works the same way, but in this
Post by Roger Dannenberg
case resampling would be the right operation.)
-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c6
6c35e390c8169d0ac9bfab22e352567
Post by Roger Dannenberg
Post by Raphaël Marinier
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), memory
requirements are very small (on the order of 3MBs for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal audacity
feature (for example shown in the Tracks menu)?
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this
configurable.
- If the time drift is very small, we may want to avoid resampling
tracks.
- We could use a much smaller time window in the second alignment
Post by Raphaël Marinier
phase. This could make the alignment more precise, while still keeping
the algorithm fast.
The benchmarking code is completely ad-hoc, it would also be great to
find a way to run this kind of automated benchmarks in a uniform way
across Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
James: "This is extrapolating from back in old times, in the late 80's
Post by Raphaël Marinier
when
I was analysing DNA and protein sequences..."
Post by Raphaël Marinier
Post by Vaughan Johnson
Didn't know that! I was doing similar work then, with Blackboard
systems,
on the PROTEAN project at Stanford KSL,
Post by Raphaël Marinier
Post by Vaughan Johnson
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf
.
Yes I've known about dynamic programming since about then. Good work,
James
-- I like your trick.
Post by Raphaël Marinier
Post by Vaughan Johnson
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in context of DNA sequence matching) that caters for
different
kinds of matching. The trick is to run two 'match matrices' at the
Post by Raphaël Marinier
Post by Vaughan Johnson
same
time, and have a penalty for switching between them. This is excellent
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
where there is a mix of signal and noise, as in your test examples.
For
aligning noise you want a fairly sloppy not very precisely
Post by Raphaël Marinier
Post by Vaughan Johnson
discriminating
comparison that is picking up broad characteristics. What's great
Post by Raphaël Marinier
Post by Vaughan Johnson
about
running two match matrices is that the algorithm naturally switches in
Post by Raphaël Marinier
Post by Vaughan Johnson
to
using the best kind of matching for different sections.
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer
approach.
Instead of allocating space length x max-shift you sample evenly and
Post by Raphaël Marinier
Post by Vaughan Johnson
only
allocate space of k x max-shift for some small value of k such as
Post by Raphaël Marinier
Post by Vaughan Johnson
100. The
cost is that you have to repeat the analysis log( length-of-sequence)
Post by Raphaël Marinier
Post by Vaughan Johnson
times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would
be
repeated 3 times. Because you stay in cache in the analysis and write
Post by Raphaël Marinier
Post by Vaughan Johnson
much
less to external memory it's a big net win both in storage and speed
Post by Raphaël Marinier
Post by Vaughan Johnson
over a
single pass approach.
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
I haven't written versions for sound. This is extrapolating from back
in
old times, in the late 80's when I was analysing DNA and protein
Post by Raphaël Marinier
Post by Vaughan Johnson
sequences
on a PC with a fraction of the power and storage of modern PCs. You
Post by Raphaël Marinier
Post by Vaughan Johnson
had to
be inventive to get any decent performance at all. This kind of trick
Post by Raphaël Marinier
Post by Vaughan Johnson
can
pay off in a big way, even today.
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Thanks for the information.

I did some testing of the MATCH vamp plugin, running it via sonic
analyzer, which integrates it already.

First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a
max allowed time shift of 60s, it takes 6 minutes on a recent processor
(Intel i5-5200U), and takes about 8GB of RAM. Using it for larger time
shifts such as 10 minutes will be quite expensive...

I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment
analysis, but does not actually align the tracks.

(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.

(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio
is wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.

(3) 2 audio tracks recorded from the same concert (left and right channels
from the same device), except for a 30s hole filled with pink noise, with
a time-shift of about 15s between them.
Same issues as (2), no new issues.

(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds ratios of tempos that are as divergent as
<0.8 or >1.2 a significant fraction of the time. This is pretty bad since
a correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved with non-default parameters, lowering the cost of
the diagonal to 1.5 and enabling the "path smoothing" feature, but the
tempo ratio still routinely hovers around 0.9 - 1.1.

(5) 2 recordings of two performances of the same composition, time shift
of about 15s, and a hole of about 30s.
Default parameters lead to big issues at the boundaries around the hole
(10s and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly
improves the match by mostly fixing the boundaries around the hole. There
is still a small issue with the first 0.5s of the performance that remains
incorrectly matched.

I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.

My conclusion is that the MATCH plugin cannot be used that easily, even
for the simple case of 2 recordings of the same event, because of accuracy
and performance. The former could be fixable by imposing stronger
regularity of the path (e.g. piecewise linear). The latter might be harder.

I propose to start working on an algorithm and feature specific to the
case of 2 recordings of the same event, which is an easier case to start
with both in terms of algorithm and UI.

I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.

Cheers,
--
Raphaël
On Mon, Jun 27, 2016 at 9:19 AM, Robert Hänggi <
Post by Robert Hänggi
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific post.
In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal, however, there are some difficulties in
Base left = LFront + LBack (for normal stereo playback)
FM Left = LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter etc.
In summary, the alignment had to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and on and on...
For the moment, I will use cross-correlation and least square fitting
for certain "anchor" points.
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.
Cheers
Robert
Post by Roger Dannenberg
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.
Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.
-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
James Crook
2017-06-18 20:25:25 UTC
Permalink
Post by Raphaël Marinier
Post by James Crook
It all depends what you are doing the alignment for.
IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.
For recording of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.
For sure I assume it's the same performance. Aligning different performances
is a harder problem, and I think we need more powerful features for it to
work, such as note onset detection. The MATCH plugin does that, but
unfortunately, does not do a good job in the simpler case of different
recordings of the same performance.
It is possible to adapt an algorithm that aligns performances to instead
align for clock drift by greatly increasing the penalties for small
time-excursions. Alignment algorithms (for DNA) usually have the
penalties set low, to showcase their strength in making small local
adjustments. Set those penalties high, and the results are closer to
what you want for the simpler case.
Post by Raphaël Marinier
I think we should be able to align when there are slightly different clock
speeds, and also different microphone placement.
Could you expand on the model for sources/reverb?
Only a little, really.

I think alignment can require a certain amount of source separation. In
turn, source separation (in my view) entails having a model for the
sources, for example that one source is primarily percussive and another
primarily sustained tones.

You can align for percussive. You can align for sustained tones. You
can also align for both at the same time, simply by having an additional
small penalty (like the small penalty for a time excursion) for
switching between the two models. The alignment then tracks the time
alignment AND the preferred model, moment by moment, at minimal extra
computing cost compared to running two alignments separately.

The advantage is that instead of a result that is an average of two
models, you have an alignment that combines the best sections of both.

I have not tried this with sound. I'm extrapolating from work on DNA /
protein sequences.
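A minimal sketch of that dual-matrix idea, extrapolated from the description
above rather than taken from any existing code: two cost models are run in
parallel over a DTW-style grid, with a small penalty for switching model.
The cost_a/cost_b frame distances and the gap penalty are placeholders; the
gap value plays the role of the time-excursion penalty mentioned earlier
(set it high to approximate the pure clock-drift case). This naive version
fills the full matrix and ignores the banded divide-and-conquer storage trick.

# Illustrative only: dynamic-programming alignment with two scoring models
# evaluated in parallel and a penalty for switching between them.
import numpy as np

def dual_model_align(a, b, cost_a, cost_b, gap=1.0, switch=0.5):
    """a, b: sequences of feature frames. Returns the total cost of the
    best alignment path, which may switch between the two models."""
    n, m = len(a), len(b)
    INF = float("inf")
    costs = (cost_a, cost_b)
    # D[k, i, j] = best cost aligning a[:i] with b[:j], currently using model k.
    D = np.full((2, n + 1, m + 1), INF)
    D[:, 0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for k in range(2):
                best = INF
                for pk in range(2):                 # model used at the previous step
                    pen = 0.0 if pk == k else switch
                    if i > 0 and j > 0:             # match a[i-1] with b[j-1]
                        best = min(best, D[pk, i-1, j-1] + costs[k](a[i-1], b[j-1]) + pen)
                    if i > 0:                       # time excursion: skip a frame of a
                        best = min(best, D[pk, i-1, j] + gap + pen)
                    if j > 0:                       # time excursion: skip a frame of b
                        best = min(best, D[pk, i, j-1] + gap + pen)
                D[k, i, j] = best
    return min(D[0, n, m], D[1, n, m])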
Post by Raphaël Marinier
At the end of the day, the proposed algorithm uses peaks in the
cross-correlation function, evaluated for multiple time windows, to produce
the final alignment. This should align mostly based on the louder
components of the signal, not the reverberated components.
Thanks!
Raphaël
--James.
Post by James Crook
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had aligned points
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
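The same idea can be mimicked outside Nyquist. The rough numpy sketch below
(not Nyquist code, and using plain linear interpolation of samples rather
than Nyquist's sub-sample-accurate resampling) builds a piece-wise linear
f(t) from alignment points and evaluates one track on the warped timeline;
all names are illustrative:

import numpy as np

def warp_to_reference(track2, rate, anchors_t2, anchors_t1):
    """anchors_t2[i] (seconds in track2) corresponds to anchors_t1[i]
    (seconds in track1); both lists are assumed increasing. Returns track2
    resampled onto track1's timeline."""
    n_out = int(round(anchors_t1[-1] * rate))
    t1 = np.arange(n_out) / rate                    # output timeline (track1 time)
    f_t = np.interp(t1, anchors_t1, anchors_t2)     # piece-wise linear time map f(t)
    src_pos = f_t * rate                            # fractional sample positions in track2
    idx = np.arange(len(track2))
    return np.interp(src_pos, idx, track2)          # linear-interpolated samples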
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph:
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeWM/view?usp=sharing>)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?
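For illustration, the global affine fit described here amounts to a
least-squares line through the per-chunk time differences; a minimal sketch
with made-up names and synthetic data (15s offset plus 0.01% drift, matching
the cases discussed in this thread):

import numpy as np

def fit_affine_shift(chunk_times, time_diffs):
    """chunk_times: centre of each analysis chunk in track A (seconds).
    time_diffs: measured offset of track B relative to track A per chunk.
    Returns (drift, offset) such that predicted_diff = drift * t + offset."""
    drift, offset = np.polyfit(chunk_times, time_diffs, deg=1)
    return drift, offset

# Example: a constant 15 s shift plus 0.01% clock drift should be recovered.
t = np.arange(0, 3600, 30.0)
measured = 15.0 + 1e-4 * t + np.random.normal(0, 0.002, t.size)
print(fit_affine_shift(t, measured))   # ~ (1e-4, 15.0)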
Thanks,
Raphaël
(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)
-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c66c35e390c8169d0ac9bfab22e352567
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem (a rough sketch
follows this list).
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long tracks.
5. Apply the shift, and resample one track if need be.
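A rough illustration of steps 2-3 (assumed, not the prototype's actual
code): cross-correlate the two per-window summary sequences via the FFT and
take the lag with the highest correlation as the coarse shift.

import numpy as np

def coarse_shift(summary_a, summary_b, window_seconds=0.025):
    """summary_a, summary_b: 1-D arrays, one summary value per 25 ms window.
    Returns the lag (in seconds) at which b best correlates with a.
    The sign convention should be checked against a known-offset example."""
    a = np.asarray(summary_a, float); a = a - a.mean()
    b = np.asarray(summary_b, float); b = b - b.mean()
    n = len(a) + len(b) - 1
    nfft = 1 << (n - 1).bit_length()                      # next power of two
    # Cross-correlation via the convolution theorem.
    corr = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    pos = corr[:len(a)]                                   # lags 0 .. len(a)-1
    neg = corr[nfft - (len(b) - 1):] if len(b) > 1 else corr[:0]
    full = np.concatenate([neg, pos])                     # lags -(len(b)-1) .. len(a)-1
    lags = np.arange(-(len(b) - 1), len(a))
    return lags[int(np.argmax(full))] * window_seconds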
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of a few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc.). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), and memory
requirements are very small (on the order of 3MB for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal Audacity
feature (for example shown in the Tracks menu)?

- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this
configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.

The benchmarking code is completely ad-hoc; it would also be great to
find a way to run this kind of automated benchmark in a uniform way
across the Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
James: "This is extrapolating from back in old times, in the late 80's
Post by Raphaël Marinier
when
I was analysing DNA and protein sequences..."
Post by Raphaël Marinier
Post by Vaughan Johnson
Didn't know that! I was doing similar work then, with Blackboard
systems,
on the PROTEAN project at Stanford KSL,
Post by Raphaël Marinier
Post by Vaughan Johnson
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf
.
Yes I've known about dynamic programming since about then. Good work,
James
-- I like your trick.
Post by Raphaël Marinier
Post by Vaughan Johnson
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in the context of DNA sequence matching) that caters for
different kinds of matching. The trick is to run two 'match matrices' at
the same time, and have a penalty for switching between them. This is
excellent where there is a mix of signal and noise, as in your test
examples. For aligning noise you want a fairly sloppy, not very precisely
discriminating comparison that is picking up broad characteristics.
What's great about running two match matrices is that the algorithm
naturally switches into using the best kind of matching for different
sections.

On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer approach.
Instead of allocating space length x max-shift you sample evenly and only
allocate space of k x max-shift for some small value of k such as 100. The
cost is that you have to repeat the analysis log(length-of-sequence) times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would be
repeated 3 times. Because you stay in cache in the analysis and write much
less to external memory it's a big net win both in storage and speed over a
single pass approach.

I haven't written versions for sound. This is extrapolating from back in
old times, in the late 80's when I was analysing DNA and protein sequences
on a PC with a fraction of the power and storage of modern PCs. You had to
be inventive to get any decent performance at all. This kind of trick can
pay off in a big way, even today.

I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Federico Miyara
2017-06-19 05:33:49 UTC
Permalink
Friends,

Alignment of performances is a vastly more difficult task than alignment
of different versions of the same performance (such as recordings taken
with different recorders). It is very difficult to keep a constant tempo,
let alone keep pace with a different performer (or even oneself) at a
different time, so the time difference across different performances for
equivalent events may be huge. It wouldn't be possible to make it just
with variable resampling. That works fine as long as the tuning change is
below the just noticeable difference for pitch, which wouldn't be the
case here.

In some areas (for instance speech recognition) an algorithm based on
dynamic programming called "dynamic time warping" is used for optimal
time alignment. See

https://en.wikipedia.org/wiki/Dynamic_time_warping

https://www.google.com.ar/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0ahUKEwinspr0kcnUAhWEhZAKHWvYDyMQFghFMAQ&url=http%3A%2F%2Fwww.springer.com%2Fcda%2Fcontent%2Fdocument%2Fcda_downloaddocument%2F9783540740476-c1.pdf%3FSGWID%3D0-0-45-452103-p173751818&usg=AFQjCNEoEbKBzZV-2qc3ujKtlAZuEYETng

This method, however, requires some sort of previous translation of the
audio features into symbols, a sort of WAV to MID conversion.
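For reference, the core DTW recurrence is compact; a textbook sketch
(illustrative only), with a per-frame distance function that does not have
to come from a symbolic transcription:

import numpy as np

def dtw_cost(x, y, dist=lambda u, v: abs(u - v)):
    """Cost of optimally warping feature sequence x onto sequence y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            # classic DTW recurrence: insertion, deletion, or match
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_cost([0, 1, 2, 3, 2, 0], [0, 1, 1, 2, 3, 2, 1, 0]))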

I think some of this has been accomplished in some cases involving a
performance by a great artist of the past whose recording was superb as a
musical rendering but very poor regarding audio quality. See, for instance,

http://www.npr.org/templates/story/story.php?storyId=10439850

I'm curious about what other applications performance alignment would have...

Regards,

Federico
James Crook
2017-06-19 08:31:17 UTC
Permalink
Post by Federico Miyara
Friends,
Alignment of performances is a vastly more difficult task than
alignment of different versions of the same performance (such as
recordings taken with different recorders). It is very difficult to
keep a constant tempo, let alone keep pace with a different performer
(or even oneself) at a different time, so the time difference across
different performances for equivalent events may be huge. It wouldn't
be possible to make it just with variable resampling. That works fine
as long as the tuning change is below the just noticeable difference
for pitch, which wouldn't be the case here.
In some areas (for instance speech recognition) an algorithm based on
dynamic programming called "dynamic time warping" is used for optimal
time alignment. See
https://en.wikipedia.org/wiki/Dynamic_time_warping
https://www.google.com.ar/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0ahUKEwinspr0kcnUAhWEhZAKHWvYDyMQFghFMAQ&url=http%3A%2F%2Fwww.springer.com%2Fcda%2Fcontent%2Fdocument%2Fcda_downloaddocument%2F9783540740476-c1.pdf%3FSGWID%3D0-0-45-452103-p173751818&usg=AFQjCNEoEbKBzZV-2qc3ujKtlAZuEYETng
Dynamic time/sequence warping is what I am familiar with from DNA and
protein sequence alignment.

In the case of audio you HAVE to identify 'stretchy' regions of audio.
Silence, white noise, sustained vowel sounds are all good candidates.
Post by Federico Miyara
This method, however, requires some sort of previous translation of
the audio features into symbols, a sort of WAV to MID conversion.
Not strictly so. The method requires that you can score the
similarity/difference between two short segments of sound. This may or
may not involve a 'translation' into a more symbolic representation.
Also the 'translation' does not have to be locked in, in the sense that
you can provide multiple alternative translations, and the alignment
selects between them in doing the alignment.
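As one hedged illustration of "scoring the similarity of two short segments
of sound" without any symbolic translation, a per-frame spectral distance
such as the one below could drive a DTW or dual-matrix alignment. The window
and distance choice here are just plausible defaults, not something proposed
in the thread:

import numpy as np

def frame_distance(seg_a, seg_b):
    """seg_a, seg_b: equal-length arrays of samples (e.g. 25 ms windows).
    Returns a cosine distance between their log-magnitude spectra
    (0 means identical spectra)."""
    window = np.hanning(len(seg_a))
    spec_a = np.log1p(np.abs(np.fft.rfft(seg_a * window)))
    spec_b = np.log1p(np.abs(np.fft.rfft(seg_b * window)))
    denom = np.linalg.norm(spec_a) * np.linalg.norm(spec_b) + 1e-12
    return 1.0 - float(np.dot(spec_a, spec_b)) / denom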
Post by Federico Miyara
I think some of this has been accomplished in some cases where a
performance by a great artist of the past whose recording was superb
as musical rendering but very poor regarding audio quality. See, for
instance,
http://www.npr.org/templates/story/story.php?storyId=10439850
I'm curious about what other applications performance alignmentwould have...
Regards,
Federico
Post by James Crook
Post by Raphaël Marinier
Post by James Crook
It all depends what you are doing the alignment for.
IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.
For recording of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the
alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.
For sure I assume it's the same performance. Aligning different performance
is a harder problem, and I think we need more powerful features for it to
work, such as note onset detection. The MATCH plugin does that, but
unfortunately, does not do a good job in the simpler case of different
recordings of the same performance.
It is possible to adapt an algorithm that aligns performances to
instead align for clock drift by much increasing the penalties for
small time-excursions. Alignment algorithms (for DNA) usually have
the penalties set low, to showcase their strength in making small
local adjustments. Set those penalty high, and the results are
closer to what you want for the simpler case.
Post by Raphaël Marinier
I think we should be able to align when there are slightly different clock
speeds, and also different microphone placement.
Could you expand on the model for sources/reverb?
Only a little, really.
I think alignment can require a certain amount of source-separation.
In turn, source separation (in my view) entails having a model for
the sources, for example that one source is primarily percussive and
another primarily sustained tones.
You can align for percussive. You can align for sustained tones. You
can also align for both at the same time, simply by having an
additional small penalty (like the small penalty for a time
excursion) for switching between the two models. The alignment then
tracks the time alignment AND the preferred model, moment by moment,
at minimal extra computing cost compared to running two alignments
separately.
The advantage is that instead of a result that is an average of two
models, you have an alignment that combines the best sections of both.
I have not tried this with sound. I'm extrapolating from work on DNA
/ protein sequences.
Post by Raphaël Marinier
At the end of the day, the proposed algorithm uses peaks in the
cross-correlation function, evaluated for multiple time windows, to produce
the final alignment. This should align mostly based on the louder
components of the signal, not the reverberated components.
Thanks!
Raphaël
--James.
Post by James Crook
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an aligned points
every 10s, you could make a piece-wise linear function
interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have
recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeW
M/view?usp=sharing>
)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying
time-stretching
that jumps to the loudest source?
Thanks,
Raphaël
(By the way, Nyquist's phase-vocoder works the same way, but in this
Post by Roger Dannenberg
case resampling would be the right operation.)
-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c6
6c35e390c8169d0ac9bfab22e352567
Post by Roger Dannenberg
Post by Raphaël Marinier
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), memory
requirements are very small (on the order of 3MBs for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal audacity
feature (for example shown in the Tracks menu)?
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this
configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.
The benchmarking code is completely ad hoc; it would also be great to
find a way to run this kind of automated benchmark in a uniform way
across the Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
On Thu, Jul 14, 2016 at 12:26 AM, Vaughan Johnson
James: "This is extrapolating from back in old times, in the late 80's
Post by Raphaël Marinier
when
I was analysing DNA and protein sequences..."
Post by Raphaël Marinier
Post by Vaughan Johnson
Didn't know that! I was doing similar work then, with Blackboard
systems,
on the PROTEAN project at Stanford KSL,
Post by Raphaël Marinier
Post by Vaughan Johnson
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf
.
Yes I've known about dynamic programming since about then. Good work,
James
-- I like your trick.
Post by Raphaël Marinier
Post by Vaughan Johnson
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in the context of DNA sequence matching) that caters for different
kinds of matching. The trick is to run two 'match matrices' at the same
time, and have a penalty for switching between them. This is excellent
where there is a mix of signal and noise, as in your test examples. For
aligning noise you want a fairly sloppy, not very precisely discriminating
comparison that is picking up broad characteristics. What's great about
running two match matrices is that the algorithm naturally switches into
using the best kind of matching for different sections.
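To make the two-matrix idea concrete, here is a toy DTW-style version (my own
sketch, not James's code; the two local costs are arbitrary stand-ins, and it
uses naive O(n*m) storage, which the divide-and-conquer point below is
precisely about avoiding):

    import numpy as np

    def two_matrix_align_cost(feat_a, feat_b, switch_penalty=2.0):
        # Two local costs: a "precise" one and a "sloppy" one that only
        # compares broad loud/quiet character (both are arbitrary examples).
        med_a, med_b = np.median(feat_a), np.median(feat_b)
        def precise(x, y):
            return abs(x - y)
        def sloppy(x, y):
            return 0.25 * abs(float(x > med_a) - float(y > med_b))

        n, m = len(feat_a), len(feat_b)
        # cost[k, i, j]: best cost aligning the first i and j values, with the
        # last comparison scored by measure k (0 = precise, 1 = sloppy).
        cost = np.full((2, n + 1, m + 1), np.inf)
        cost[:, 0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                local = (precise(feat_a[i - 1], feat_b[j - 1]),
                         sloppy(feat_a[i - 1], feat_b[j - 1]))
                for k in (0, 1):
                    best_prev = np.inf
                    for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                        best_prev = min(best_prev,
                                        cost[k, pi, pj],                       # stay
                                        cost[1 - k, pi, pj] + switch_penalty)  # switch
                    cost[k, i, j] = local[k] + best_prev
        return min(cost[0, n, m], cost[1, n, m])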
On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer approach.
Instead of allocating space length x max-shift you sample evenly and only
allocate space of k x max-shift for some small value of k such as 100. The
cost is that you have to repeat the analysis log(length-of-sequence) times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would be
repeated 3 times. Because you stay in cache in the analysis and write much
less to external memory it's a big net win both in storage and speed over a
single pass approach.
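A back-of-the-envelope check of those figures (my arithmetic, not James's; the
4-byte cell size is an assumption):

    import math

    frames    = 3600 * 100      # 1 hour at 10 ms resolution -> 360,000 frames
    max_shift = 20 * 60 * 100   # up to 20 min of shift      -> 120,000 frames
    k         = 100             # rows kept per pass
    cell      = 4               # assumed bytes per matrix cell

    print(k * max_shift * cell / 1e6)       # ~48 MB per match matrix
    print(math.ceil(math.log(frames, k)))   # 3 passes (log base k of the length)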
I haven't written versions for sound. This is extrapolating from back in
old times, in the late 80's when I was analysing DNA and protein sequences
on a PC with a fraction of the power and storage of modern PCs. You had to
be inventive to get any decent performance at all. This kind of trick can
pay off in a big way, even today.
I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Thanks for the information.
I did some testing of the MATCH vamp plugin, running it via sonic
analyzer, which integrates it already.
First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a max
allowed time shift of 60s, it takes 6 minutes on a recent processor (Intel
i5-5200U), and takes about 8GB of RAM. Using it for larger time shifts such
as 10 minutes will be quite expensive...
I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment analysis,
but does not actually align the tracks.
(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.
(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio
is wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.
(3) 2 audio tracks recorded from the same concert (left right channels
from same device), except for a 30s hole filled with pink noise, with a
time-shift of about 15s between them.
Same issues as (2), no new issues.
(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds ratios of tempos that are as divergent as
<0.8 or >1.2 a significant fraction of the time. This is pretty bad since a
correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved using non-default parameters, lowering the cost of
the diagonal to 1.5 and enabling the "path smoothing" feature, but the tempo
ratio still routinely hovers around 0.9 - 1.1.
(5) 2 recordings of two performances of the same composition, time shift
of about 15s, and hole of about 30s.
Default parameters lead to big issues at boundaries around the hole (10s
and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly
improves the match by mostly fixing the boundaries around the hole. There is
still a small issue with the first 0.5s of the performance that remains
incorrectly matched.
I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.
My conclusion is that the MATCH plugin cannot be used that easily, even
for the simple case of 2 recordings of the same event, because of accuracy
and performance. The former could be fixable by imposing stronger regularity
of the path (e.g. piecewise linear). The latter might be harder.
I propose to start working on an algorithm and feature specific to the
case of 2 recordings of the same event, which is an easier case to start
with both in terms of algorithm and UI.
I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.
Cheers,
--
Raphaël
On Mon, Jun 27, 2016 at 9:19 AM, Robert Hänggi <
Post by Robert Hänggi
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific
post.
In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal, however, there are some difficulties:
Base left = LFront + LBack (for normal stereo playback)
FM left = LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter, etc.
In summary, the alignment has to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and on and on...
For the moment, I will use cross-correlation and least square fitting
for certain "anchor" points.
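For reference, the sum/difference relation Robert describes inverts trivially,
but only once the two demodulated tracks are exactly time- and gain-aligned,
which is precisely the hard part; a sketch with made-up sample values:

    import numpy as np

    base_left = np.array([0.20, 0.50, -0.10])   # LFront + LBack (made-up samples)
    fm_left   = np.array([0.00, 0.30, -0.30])   # LFront - LBack

    # Recover the two discrete channels from the sum/difference pair.
    l_front = 0.5 * (base_left + fm_left)
    l_back  = 0.5 * (base_left - fm_left)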
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.
Cheers
Robert
Post by Roger Dannenberg
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.
Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.
-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
Federico Miyara
2017-06-20 00:22:08 UTC
Permalink
James,
Post by James Crook
In the case of audio you HAVE to identify 'stretchy' regions of
audio. Silence, white noise, sustained vowel sounds are all good
candidates.
One difficulty is that frequently those candidates are not isolated but
occur simultaneously with non-stretchy events such as transients.
Post by James Crook
Post by Federico Miyara
This method, however, requires some sort of previous translation of
the audio features into symbols, a sort of WAV to MID conversion.
Not strictly so. The method requires that you can score the
similarity/difference between two short segments of sound. This may
or may not involve a 'translation' into a more symbolic representation.
Also the 'translation' does not have to be locked in, in the sense
that you can provide multiple alternative translations, and the
alignment selects between them in doing the alignment.
You are right. When I said symbols I really meant (quantitative)
features. Those segments are replaced by some features previously
detected by a front-end processor, and those features are compared. One
of the important features is pitch; others may be spectral features such
as formants, cepstrum, etc.
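As a small illustration of scoring the similarity of two short segments
directly from quantitative features, with no symbolic transcription (my
sketch, not anything from the thread; the particular features, RMS and
spectral centroid, are just examples):

    import numpy as np

    def segment_features(seg, rate):
        # Two cheap features for a short segment: RMS level and spectral centroid.
        spectrum = np.abs(np.fft.rfft(seg * np.hanning(len(seg))))
        freqs = np.fft.rfftfreq(len(seg), d=1.0 / rate)
        centroid = (freqs * spectrum).sum() / (spectrum.sum() + 1e-12)
        return np.array([np.sqrt(np.mean(seg ** 2)), centroid / (rate / 2)])

    def similarity(seg_a, seg_b, rate):
        # Higher is more similar; the alignment only needs this score.
        return -np.linalg.norm(segment_features(seg_a, rate) - segment_features(seg_b, rate))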

Regards,

Federico

Roger Dannenberg
2017-06-19 13:34:41 UTC
Permalink
Post by Federico Miyara
I'm curious about what other applications performance alignment would have...
Federico,
Are you familiar with my work on computer accompaniment? This is
closely related to performance alignment (one performance is typically
in a symbolic representation). See SmartMusic, which was based on my
patent, and there are at least a couple of other products for computers
accompanying live musicians and/or turning pages on music displays.
Another application is music search, e.g.

Hu, Dannenberg, and Tzanetakis. “Polyphonic Audio Matching and Alignment
for Music Retrieval
<http://www.cs.cmu.edu/%7Erbd/bib-musund.html#waspaa03alignment>,” in
/2003 IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics/, New York: IEEE (2003), pp. 185-188.

Also, we've used alignment to scores to find note onsets and use
that as training data for onset detection:

Hu and Dannenberg, “Bootstrap Learning for Accurate Onset Detection
<http://www.cs.cmu.edu/%7Erbd/bib-beattrack.html#bootstrap>," /Machine
Learning/ 65(2-3) (December 2006), pp. 457-471.

The idea of using scores and alignment to guide signal processing tasks
is sometimes called "score-informed", e.g.

Woodruff, Pardo, and Dannenberg, “Remixing Stereo Music with
Score-Informed Source Separation
<http://www.cs.cmu.edu/%7Erbd/subjbib2.html#remix-ismir06>,” in
/Proceedings of the 7th International Conference on Music Information
Retrieval/, Victoria, BC, Canada, October 2006, pp. 314-319.

Finally, there has been a lot of discussion and thinking about using
alignment of multiple takes in audio editors (including Audacity) to
line up multiple "takes" where a click track is not present, as a first
step to comparing takes, selecting the best elements, and producing an
edited combination of takes:

Liu, Dannenberg, and Cai, “The Intelligent Music Editor: Towards an
Automated Platform for Music Analysis and Editing
<http://www.cs.cmu.edu/%7Erbd/bib-musund.html#IME-ICIC-2010>,” in
/Proceedings of the Seventh International Conference on Intelligent
Computing/, Cairo, Egypt, December 2010, pp. 123-131.
Federico Miyara
2017-06-19 23:36:29 UTC
Permalink
Roger,

Thank you, I wasn't aware of these quite interesting applications!

Regards,

Federico
Federico Miyara
2016-06-26 23:10:53 UTC
Permalink
Raphael,

If the sample rate is derived from a crystal oscillator (as I think is
the case for the vast majority of A/D converters), the following link

http://kunz-pc.sce.carleton.ca/thesis/CrystalOscillators.pdf

lists a number of causes of frequency drift, for instance
temperature, warm-up, hysteresis, and aging. Temperature variations seem to
be the most relevant cause of short-term drift.

In the very worst case (a very cheap crystal) we may be around the 0.01 %
mentioned by Roger Dannenberg (though he may have been talking of
nominal frequency offset errors). Assuming that the temperature
variation is bounded to about 20 ºC during recording, we would have a
variation of at most 0.005 %. This means a drift of about 90 ms per hour
(assuming a steady increase of temperature with time), so the effect may
actually be relevant.
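A quick check of the 90 ms figure (my arithmetic; the linear ramp from zero to
the maximum error is the assumption Federico states):

    max_error  = 0.005 / 100    # 0.005 % frequency error reached at the end of the hour
    duration_s = 3600.0

    # With the error growing linearly from 0, the accumulated drift is the
    # time-integral of the error, i.e. half of max_error * duration.
    drift_ms = 0.5 * max_error * duration_s * 1000.0
    print(f"{drift_ms:.1f} ms per hour")   # -> 90.0 ms per hour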

Using not-so-cheap converters which include temperature-compensated
crystals, we are probably at least one order of magnitude below that,
i.e., about 10 ms per hour. It may still be too much for certain
applications.

This would indicate that it is worth doing dynamic synchronization.

Note that as figure 3 of the linked article suggests, due to
manufacturing tolerance of the cutting angle of the crystal, two
particular recorders may have opposite drifts with temperature. The best
one should be used as the master.

Regards,

Federico
Post by Raphaël Marinier
Varying sampling rate is indeed an issue that will need to be taken
care of. Do the actual frequencies of multiple devices tend to only
differ by a constant multiplier (e.g. constant 44100 vs constant
44150), or is it common to have small variations of sampling rate in a
recording from a single device (e.g. device first records at 44100,
and then drifts to 44150)? The former is of course easier to solve.
James, thanks for the background and advice.
Indeed the "Audio Diff" proposal is more general. It also seems quite
harder, at least because of the variations in the way of playing,
speed and potential gaps, as you mentioned, and because of all the UI
questions around the handling of imprecise and partial matches, time
expansion, errors, etc. Also, the algorithms will of course be more
generic and complex than for aligning two recordings of the same
performance. I had a quick look at the MATCH paper
<http://www.eecs.qmul.ac.uk/%7Esimond/pub/2005/ismir05.pdf>, and the
max errors for commercial recordings on page 5 show that the
algorithm is far from perfect.
I'll have a look into the MATCH plugin and do some tests. Do you think
there would be space for both features: (1) Simple alignment of N
recordings of the same sound (my original proposal) (2) Audio Diff,
with advanced UI to visualize and work with diffs? Is there any other
software doing (2), so that we can have an idea of the user experience?
Raphaël
Hi Audacity developers,
One feature that I have been missing in Audacity is automatic
time-syncing of two audio tracks.
The use case is when one has multiple recordings of the same
event, not time-synced, and wants to
align them together. This happens for instance when the two
tracks come from multiple devices
(e.g. video camera and portable audio recorder).
Right now, the user has to manually time-shift tracks to make
sure they align, which is cumbersome
and imprecise.
Well, time shift is not the only problem, since most recordings are not at the
same frequency even if they have the same nominal frequency. 44100 and 48000
are obvious, but 44100 and 44150 are far more possible with standard consumer
grade sound cards. Of course one could break up the item into blocks, and time
shift each one. (For example, it would take about 800 sec for the above two
frequencies to be out by 1 sec in their time sync, so time-shifting once a
second could be done. But even then a dropping or adding of 50 frames would
surely be noticeable.) I.e., one should also do frequency shifting as well if
it were to work. One could of course do time shift at the beginning and the
end of a block and use the difference to also implement a freq shift.
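That last idea, sketched with made-up numbers (the measured shifts would come
from cross-correlating short excerpts at each end of the block):

    # Hypothetical measured time shifts (s) at the start and end of a 600 s block.
    shift_start, shift_end = 0.120, 0.800
    block_len = 600.0

    # The change in shift across the block gives the relative clock error,
    # which resampling by this ratio would correct.
    freq_ratio = 1.0 + (shift_end - shift_start) / block_len
    print(f"{freq_ratio:.5f}")   # ~1.00113, i.e. about 0.11 % (roughly 44100 vs 44150)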
Roger Dannenberg
2016-06-27 00:27:16 UTC
Permalink
I don't have extensive experience or measurements, but I believe
frequencies tend to differ mainly by a scale factor because they're all
crystal controlled, and the error comes from uncalibrated inexpensive
crystals. However, inexpensive devices are also not thermally
compensated, so if you turn on a cold cheap converter and it gets hot,
you should expect a small drift over that time period while it warms up.
Once things warm up, they'll still drift according to power supplies,
phase of the moon, etc., but I think the variation will be an order of
magnitude less than the calibration and warm-up effects. -Roger