Discussion:
[Audacity-devel] Automatic time-syncing feature
Raphaël Marinier
2016-06-26 14:27:07 UTC
Hi Audacity developers,

One feature that I have been missing in Audacity is automatic time-syncing
of two audio tracks.

The use case is when one has multiple recordings of the same event, not
time-synced, and wants to align them together. This happens for instance
when the two tracks come from multiple devices (e.g. video camera and
portable audio recorder).
Right now, the user has to manually time-shift tracks to make sure they
align, which is cumbersome and imprecise.

I've researched the subject a bit, and I think it would be doable to
implement auto-syncing of tracks in an efficient way using a combination of
audio fingerprinting (see for instance
https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf) for approximate
syncing, and maximization of cross-correlation for the fine-tuning.
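
As a rough sketch of the cross-correlation fine-tuning step (illustrative
Python/scipy with made-up names, not Audacity code), the offset between two
roughly pre-aligned tracks could be estimated like this:

    import numpy as np
    from scipy.signal import correlate

    def estimate_offset_seconds(a, b, rate):
        """Lag (in seconds) by which track b must be delayed to best line up
        with track a.  Computed via FFT, so the cost is O(n log n) rather
        than O(n * max_shift)."""
        corr = correlate(a, b, mode="full", method="fft")
        lag = np.argmax(corr) - (len(b) - 1)  # full-correlation index -> lag in samples
        return lag / rate

    # Toy check: the same 2 s burst starts at 1.0 s in `a` and 0.5 s in `b`,
    # so the estimate should come out close to +0.5 s.
    rate = 8000
    burst = np.random.default_rng(0).standard_normal(2 * rate)
    a = np.concatenate([np.zeros(rate), burst])
    b = np.concatenate([np.zeros(rate // 2), burst])
    print(estimate_offset_seconds(a, b, rate))  # ~0.5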

I could implement such a feature in Audacity as a new effect. Would this
contribution be welcome in Audacity? Is it possible for the output of an
effect to be a "time-shift"?

Thanks,

Raphaël
Roger Dannenberg
2016-06-26 15:29:22 UTC
This is an interesting problem. Offhand, I would guess results would be
better if you emphasized low frequencies in some way: high
frequencies/short wavelengths are more easily reflected so if recording
devices or objects around them move, reflection paths could change,
shifting the timing by milliseconds (about 1ms per 1ft change in path
length). Of course, low frequencies will have less timing precision, so
there's a tradeoff.

Another important consideration is the difference in sample rates
between recordings. Even if 2 devices claim to record at 44.1kHz, the
*actual* sample rate is slightly different. A 0.01% difference, which is
very likely in consumer devices, over 20 minutes (1200s) of recording
time would result in a drift of 0.12s (!), so any time-syncing should
estimate time shift at multiple points and try to correct for sample
rate differences.

-Roger
James Crook
2016-06-26 15:49:16 UTC
Raphaël, we are interested, but in something more general than this.
http://wiki.audacityteam.org/wiki/Proposal_Audio_Diff

Aligning two tracks is a special case.
You should look at VAMP plug-ins for Audacity. These can do analysis,
not just effects.
Here is some information about the MATCH plug-in for audio diff.

https://code.soundsoftware.ac.uk/projects/match-vamp
Calculate alignment between two performances in separate channel inputs.

The code for doing the alignment is one part of the problem. We also
need to design a good user interface for using it. My view is that when
designing an interface to align two audio sequences without inserting
gaps, we should at the same time be thinking about the interface for
aligning them with gaps. Otherwise we will eventually end up with two
different interfaces doing 'the same thing'.

I would very much like it if you worked with the VAMP MATCH plug-in,
got the details sorted, and wrote it up for the manual so that we would
want to ship it with Audacity.

--James.
Bill Unruh
2016-06-26 16:00:53 UTC
Post by Raphaël Marinier
Hi Audacity developers,
One feature that I have been missing in Audacity is automatic time-syncing of two audio tracks.
The use case is when one has multiple recordings of the same event, not time-synced, and wants to
align them together. This happens for instance when the two tracks come from multiple devices
(e.g. video camera and portable audio recorder).
Right now, the user has to manually time-shift tracks to make sure they align, which is cumbersome
and imprecise.
Well, time shift is not the only problem, since most recordings are not at the
same frequency even if they have the same nominal frequency. 44100 vs 48000
is obvious, but 44100 vs 44150 is far more likely with standard consumer-grade
sound cards. Of course one could break the material into blocks and time-shift
each one. (For example, it would take about 800 sec for the above two
frequencies to drift out of sync by 1 sec, so time-shifting once a second
could be done. But even then, dropping or adding 50 frames would surely be
noticeable.) I.e., one should also do frequency shifting if it were to work.
One could of course do a time shift at the beginning and the end of a block
and use the difference to also implement a frequency shift.
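
As a rough illustration of that last point (a Python/numpy sketch with
made-up helper names; it assumes the two tracks have already been coarsely
aligned, so the windows overlap the same material):

    import numpy as np
    from scipy.signal import correlate

    def local_offset(a, b, rate, start_s, window_s):
        """Residual offset (seconds) between a and b inside a window at start_s."""
        i, n = int(start_s * rate), int(window_s * rate)
        c = correlate(a[i:i + n], b[i:i + n], mode="full", method="fft")
        return (np.argmax(c) - (n - 1)) / rate

    def clock_ratio(a, b, rate, window_s=30.0):
        """Estimate the relative clock-speed error of b from the offsets near
        the start and the end of the overlap; time-scaling b by the returned
        factor removes a linear drift."""
        t1 = min(len(a), len(b)) / rate - window_s
        d0 = local_offset(a, b, rate, 0.0, window_s)
        d1 = local_offset(a, b, rate, t1, window_s)
        return 1.0 + (d1 - d0) / t1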
Raphaël Marinier
2016-06-26 20:11:59 UTC
Varying sampling rate is indeed an issue that will need to be taken care
of. Do the actual frequencies of multiple devices tend to only differ by a
constant multiplier (e.g. constant 44100 vs constant 44150), or is it
common to have small variations of sampling rate in a recording from a
single device (e.g. device first records at 44100, and then drifts to
44150)? The former is of course easier to solve.

James, thanks for the background and advice.
Indeed the "Audio Diff" proposal is more general. It also seems quite
harder, at least because of the variations in the way of playing, speed and
potential gaps, as you mentioned, and because of all the UI questions
around the handling of imprecise and partial matches, time expansion,
errors, etc.. Also, the algorithms will of course be more generic and
complex than for aligning two recordings of the same performance. I had a
quick look at the MATCH paper
<http://www.eecs.qmul.ac.uk/~simond/pub/2005/ismir05.pdf>, and the max
errors for commercial recordings on page 5 shows that the algorithm is far
from perfect.

I'll have a look into the MATCH plugin and do some tests. Do you think
there would be space for both features: (1) Simple alignment of N
recordings of the same sound (my original proposal) (2) Audio Diff, with
advanced UI to visualize and work with diffs? Is there any other software
doing (2), so that we can have an idea of the user experience?

Raphaël
James Crook
2016-06-26 23:01:09 UTC
Post by Raphaël Marinier
Do you think there would be space for both features: (1) Simple
alignment of N recordings of the same sound (my original proposal) (2)
Audio Diff, with advanced UI to visualize and work with diffs?
Yes.
All I am suggesting is that in designing the UI for the special case the
more general case be thought about.

For example in the general case we might have indications of how
stretchy different parts of the audio are. Silence and vowel sounds are
stretchy. Percussion sounds are not. The visuals and interaction for
indicating that could be used for the 'stretchy' pieces at the beginning
and ends of the audio in the simpler case of just time shifting whole
sequence without otherwise changing it.

Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track. Under the hood you really need source separation, to pick
out the central instruments, and delayed right and left instruments
(respectively). The alignment of left and right audio channels is
ambiguous if you are not allowed to split the sources out.
Post by Raphaël Marinier
Is there any other software doing (2), so that we can have an idea of
the user experience?
I'm not aware of it for audio, but have not researched it.

I am aware of it for DNA and protein sequence alignment editors (from
the 90s). The sync-lock we have in Audacity is a starting point for a
manual alignment editor. We would need to be able to lock particular
segments of audio together, not just the whole sequence, and to be able to
turn on and off those local sync-locks easily. Our time ruler would need
to allow insertions and deletions in it just like any other track. One
exercise is to think about how the 'Truncate Silence' effect would look
if it did the same thing as now but affected the timeline/ruler rather
than the waveform.

As well as an alignment view we would want a dotplot view.

--James.
Roger Dannenberg
2016-06-27 00:38:13 UTC
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.

Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.

-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
Robert Hänggi
2016-06-27 07:19:10 UTC
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
For the background see:
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific post.

In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal; however, there are some difficulties in
properly aligning it with the base audio:
Base left=LFront + LBack (for normal stereo playback)
FM Left= LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter etc.

In summary, the alignment has to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and so on...
For the moment, I will use cross-correlation and least-squares fitting
for certain "anchor" points.
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.

Cheers
Robert
Raphaël Marinier
2016-07-07 22:00:26 UTC
Thanks for the information.

I did some testing of the MATCH vamp plugin, running it via sonic analyzer,
which integrates it already.

First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a
max allowed time shift of 60s, it takes 6 minutes on a recent processor
(Intel i5-5200U), and takes about 8GB of RAM. Using it for larger time
shifts such as 10 minutes will be quite expensive...

I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment
analysis, but does not actually align the tracks.

(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.

(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio is
wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.

(3) 2 audio tracks recorded from the same concert (left and right channels
from the same device), except for a 30s hole filled with pink noise, with a
time-shift of about 15s between them.
Same issues as (2), no new issues.

(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds ratios of tempos that are as divergent as
<0.8 or >1.2 a significant fraction of the time. This is pretty bad since a
correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved with non-default parameters, lowering the cost of
the diagonal to 1.5 and enabling the "path smoothing" feature, but the tempo
ratio still routinely hovers around 0.9 - 1.1.

(5) 2 recordings of two performances of the same composition, time shift of
about 15s, and hole of about 30s.
Default parameters lead to big issues at boundaries around the hole (10s
and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly improves
the match by mostly fixing the boundaries around the hole. There is still a
small issue with the first 0.5s of the performance that remains incorrectly
matched.
I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.

My conclusion is that the match plugin cannot be used that easily, even for
the simple case of 2 recordings of the same event, because of accuracy and
performance. The former could be fixable by imposing stronger regularity of
the path (e.g. piecewise linear). The latter might be harder.

I propose to start working on an algorithm and feature specific to the case
of 2 recordings of the same event, which is an easier case to start with
both in terms of algorithm and UI.
I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.

Cheers,

--
Raphaël
James Crook
2016-07-13 22:02:58 UTC
Sorry for the delay in getting back to you on this thread.


If you do use a dynamic programming approach, there is a neat trick I
invented (in context of DNA sequence matching) that caters for different
kinds of matching. The trick is to run two 'match matrices' at the same
time, and have a penalty for switching between them. This is excellent
where there is a mix of signal and noise, as in your test examples. For
aligning noise you want a fairly sloppy, not very precisely
discriminating comparison that picks up broad characteristics.
What's great about running two match matrices is that the algorithm
naturally switches to using the best kind of matching for different
sections.
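
Here is a minimal sketch of the idea in Python (hypothetical strict/sloppy
scoring callbacks, switching only allowed on diagonal moves, nothing tuned),
just to make the two-matrix recurrence concrete:

    import numpy as np

    def dual_matrix_align_score(x, y, strict, sloppy, gap=-1.0, switch=-4.0):
        """Global alignment score run with two scoring schemes at once.
        strict(a, b) and sloppy(a, b) score a match between frames a and b;
        one DP matrix is kept per scheme and a `switch` penalty is paid to
        move between them, so noisy stretches fall to the sloppy scheme and
        clean stretches to the strict one."""
        n, m = len(x), len(y)
        S = [np.full((n + 1, m + 1), -np.inf) for _ in range(2)]
        S[0][0, 0] = S[1][0, 0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                for k, score in ((0, strict), (1, sloppy)):
                    best = S[k][i, j]  # keeps the 0 at the origin
                    if i and j:
                        diag = max(S[k][i-1, j-1], S[1-k][i-1, j-1] + switch)
                        best = max(best, diag + score(x[i-1], y[j-1]))
                    if i:
                        best = max(best, S[k][i-1, j] + gap)
                    if j:
                        best = max(best, S[k][i, j-1] + gap)
                    S[k][i, j] = best
        return max(S[0][n, m], S[1][n, m])

    # e.g. strict = lambda a, b: 1.0 if abs(a - b) < 0.05 else -1.0
    #      sloppy = lambda a, b: 0.2 if abs(a - b) < 0.5 else -0.2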


On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer
approach. Instead of allocating space length x max-shift you sample
evenly and only allocate space of k x max-shift for some small value of
k such as 100. The cost is that you have to repeat the analysis
log(length-of-sequence) times, where the log is to base k. So aligning to
the nearest 10ms on two 1hr sequences with a shift of up to 20 mins
would take about 50MB of storage (for one match matrix) or 100MB (with two
in parallel), and the analysis would be repeated 3 times. Because you stay
in cache during the analysis and write much less to external memory, it's a
big net win in both storage and speed over a single-pass approach.
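
A quick sanity check of those numbers, assuming 4-byte DP cells:

    import math

    frames = 3600 * 100        # 1 h summarised at 10 ms resolution
    max_shift = 20 * 60 * 100  # 20 min of maximum shift, in 10 ms frames
    k = 100                    # evenly sampled rows kept per pass
    cell_bytes = 4             # assumed size of one DP cell

    print(k * max_shift * cell_bytes / 1e6)  # ~48 MB for one match matrix
    print(math.ceil(math.log(frames, k)))    # 3 passes (log base k of the length)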

I haven't written versions for sound. This is extrapolating from back
in old times, in the late 80's when I was analysing DNA and protein
sequences on a PC with a fraction of the power and storage of modern
PCs. You had to be inventive to get any decent performance at all.
This kind of trick can pay off in a big way, even today.

I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!

--James.
Vaughan Johnson
2016-07-13 22:26:13 UTC
James: "This is extrapolating from back in old times, in the late 80's when
I was analysing DNA and protein sequences..."



Didn't know that! I was doing similar work then, with Blackboard systems,
on the PROTEAN project at Stanford KSL,
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf .

Yes I've known about dynamic programming since about then. Good work, James
-- I like your trick.

-- V
Raphaël Marinier
2017-06-10 10:51:43 UTC
Hi all,

After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity that aligns different recordings
of the same event.

You can see the code there:
https://github.com/RaphaelMarinier/audacity/commit/3276106c66c35e390c8169d0ac9bfab22e352567

The algorithm is as follows:
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.

There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
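
As a rough sketch of steps 1 and 4 in illustrative Python/numpy (made-up
names; the actual prototype linked above is C++ inside Audacity). The drift
fit assumes the summaries have already been shifted by the global offset
found in step 3:

    import numpy as np
    from scipy.signal import correlate

    def summarize(samples, rate, window_s=0.025):
        """Step 1: one RMS value per 25 ms window (a stand-in for the real summary)."""
        samples = np.asarray(samples, dtype=float)
        n = int(window_s * rate)
        trimmed = samples[:len(samples) // n * n].reshape(-1, n)
        return np.sqrt((trimmed ** 2).mean(axis=1))

    def drift_from_chunks(s1, s2, chunk=2000):
        """Step 4: align summary chunks 1:1 and least-squares fit
        offset ~ slope * t + intercept; `slope` is the relative clock-speed
        difference (e.g. 1e-4 for a 0.01% mismatch)."""
        centers, offsets = [], []
        n = min(len(s1), len(s2))
        for start in range(0, n - chunk, chunk):
            c = correlate(s1[start:start + chunk], s2[start:start + chunk],
                          mode="full", method="fft")
            offsets.append(np.argmax(c) - (chunk - 1))
            centers.append(start + chunk // 2)
        slope, intercept = np.polyfit(centers, offsets, 1)
        return slope, intercept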

I created a benchmark out of a few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resampling time if needed); memory
requirements are very small (on the order of 3MB for two 1h tracks).

Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal Audacity
feature (for example shown in the Tracks menu)?

There are of course some limitations that should still be addressed:
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.

The benchmarking code is completely ad-hoc; it would also be great to
find a way to run this kind of automated benchmark in a uniform way
across the Audacity code base (I guess other parts of Audacity could
benefit as well).

James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.

Raphaël
Roger Dannenberg
2017-06-10 14:54:05 UTC
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had alignment points
every 10s, you could make a piecewise-linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
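
This is not Nyquist code, but as a language-neutral illustration of the
S(f(t)) idea, here is a numpy sketch that warps a signal through a
piecewise-linear time map built from alignment points (plain linear
interpolation, much cruder than Nyquist's resampler):

    import numpy as np

    def warp(signal, rate, anchors_out, anchors_in):
        """Evaluate S(f(t)): output time anchors_out[i] should play the audio
        found at input time anchors_in[i], with f piecewise linear in between.
        Plain linear-interpolation resampling only."""
        t_out = np.arange(0.0, anchors_out[-1], 1.0 / rate)
        t_in = np.interp(t_out, anchors_out, anchors_in)               # f(t)
        return np.interp(t_in * rate, np.arange(len(signal)), signal)  # S at f(t)

    # e.g. a constant 1.01x speed-up over 10 s of audio:
    # warped = warp(sig, 44100, anchors_out=[0.0, 10.0], anchors_in=[0.0, 10.1])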

Some thoughts about alignment: What happens if you have recordings made
at different locations, of sources that are themselves at different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?

(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)

-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c66c35e390c8169d0ac9bfab22e352567
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), memory
requirements are very small (on the order of 3MBs for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal audacity
feature (for example shown in the Tracks menu)?
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.
The benchmarking code is completely ad-hoc, it would also be great to
find a way to run this kind of automated benchmarks in a uniform way
across Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
Post by Vaughan Johnson
James: "This is extrapolating from back in old times, in the late 80's when
I was analysing DNA and protein sequences..."
Didn't know that! I was doing similar work then, with Blackboard systems,
on the PROTEAN project at Stanford KSL,
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf .
Yes I've known about dynamic programming since about then. Good work, James
-- I like your trick.
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in context of DNA sequence matching) that caters for different
kinds of matching. The trick is to run two 'match matrices' at the same
time, and have a penalty for switching between them. This is excellent
where there is a mix of signal and noise, as in your test examples. For
aligning noise you want a fairly sloppy not very precisely discriminating
comparison that is picking up broad characteristics. What's great about
running two match matrices is that the algorithm naturally switches in to
using the best kind of matching for different sections.
On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer approach.
Instead of allocating space length x max-shift you sample evenly and only
allocate space of k x max-shift for some small value of k such as 100. The
cost is that you have to repeat the analysis log( length-of-sequence) times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would be
repeated 3 times. Because you stay in cache in the analysis and write much
less to external memory it's a big net win both in storage and speed over a
single pass approach.
I haven't written versions for sound. This is extrapolating from back in
old times, in the late 80's when I was analysing DNA and protein sequences
on a PC with a fraction of the power and storage of modern PCs. You had to
be inventive to get any decent performance at all. This kind of trick can
pay off in a big way, even today.
I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Thanks for the information.
I did some testing of the MATCH vamp plugin, running it via sonic
analyzer, which integrates it already.
First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a max
allowed time shift of 60s, it takes 6 minutes on a recent processor (Intel
i5-5200U), and takes about 8 GB of RAM. Using it for larger time shifts such
as 10 minutes would be quite expensive...
I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment analysis,
but does not actually align the tracks.
(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.
(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio
is wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.
(3) 2 audio tracks recorded from the same concert (left right channels
from same device), except for a 30s hole filled with pink noise, with a
time-shift of about 15s between them.
Same issues as (2); no new issues.
(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds tempo ratios as divergent as
<0.8 or >1.2 for a significant fraction of the time. This is pretty bad, since a
correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved using non-default parameters of lowering the cost of
the diagonal to 1.5, and enabling the "path smoothing" feature, but tempo
ratio still routinely hovers around 0.9 - 1.1.
(5) 2 recordings of two performances of the same composition, time shift
of about 15s, and hole of about 30s.
Default parameters lead to big issues at boundaries around the hole (10s
and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly improves
the match by mostly fixing the boundaries around the hole. There is still a
small issue with the first 0.5s of the performance that remains incorrectly
matched.
I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.
My conclusion is that the MATCH plugin cannot be used that easily, even
for the simple case of 2 recordings of the same event, because of accuracy
and performance issues. The former could be fixed by imposing stronger regularity
on the path (e.g. piecewise linear). The latter might be harder.
I propose to start working on an algorithm and feature specific to the
case of 2 recordings of the same event, which is an easier case to start
with both in terms of algorithm and UI.
I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.
Cheers,
--
Raphaël
Post by Robert Hänggi
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific post.
In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal; however, there are some difficulties because
the four signals are mixed:
Base Left = LFront + LBack (for normal stereo playback)
FM Left = LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter etc.
In summary, the alignment had to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and on and on...
For the moment, I will use cross-correlation and least square fitting
for certain "anchor" points.
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.
Cheers
Robert
Post by Roger Dannenberg
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.
Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.
-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
Robert Hänggi
2017-06-10 15:38:07 UTC
Permalink
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an alignment point
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
I often find that Audacity crashes when I use 'resample' or
'resamplev', especially when the selection is a bit long or when the
(static) factor exceeds about 1:19.

Robert
Post by Roger Dannenberg
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)
-Roger
Raphaël Marinier
2017-06-17 22:49:34 UTC
Permalink
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an alignment point
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph:
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeWM/view?usp=sharing>).

However, the algorithm performs relatively coarse alignment. We fit an
affine function to those time differences versus track time, and just apply
this affine transformation globally to one of the tracks.
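For illustration, a minimal Python sketch of this coarse correction (names
and the interpolation-based warp are illustrative stand-ins for Audacity's
actual shifting and resampling): fit an affine offset model to the measured
(track time, time difference) points by least squares, then evaluate one
track on the warped clock, in the spirit of the S(f(t)) mapping quoted above.

    import numpy as np

    def fit_affine_offset(times, offsets):
        # Least-squares fit offset(t) ~ slope * t + intercept to the per-chunk
        # (track time, detected time difference) points from the second phase.
        slope, intercept = np.polyfit(times, offsets, deg=1)
        return slope, intercept

    def warp_track(samples, rate, slope, intercept):
        # Evaluate one track on the corrected clock f(t) = t + offset(t),
        # here with plain linear interpolation as a stand-in for real resampling.
        samples = np.asarray(samples, dtype=float)
        t = np.arange(len(samples)) / rate
        return np.interp(t + slope * t + intercept, t, samples, left=0.0, right=0.0)

Whether the offset is added or subtracted, and which track gets warped,
depends on which track is taken as the reference.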
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?

Thanks,

Raphaël
Post by Roger Dannenberg
(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)
-Roger
Roger Dannenberg
2017-06-18 02:57:45 UTC
Permalink
Here are some random thoughts: It would make sense to compute alignment
at many points and do some sort of smoothing. You might ask (and try to
solve): What alignment function minimizes the sum-of-squares of
alignment errors, considering only *plausible* alignment functions, i.e.
those that could be produced by real crystal clocks? I'm not even sure
of a reasonable model for clock drift, but one approach might be to just
take the alignment function, treat it as a signal and low-pass it. The
cut-off frequency would be very low, a tiny fraction of 1 Hz, and you'd
have to be careful not to introduce phase shift or lag: The standard
trick is to run an IIR filter over the signal, reverse it, filter it
again, and reverse it again, so that phase shifts or lags cancel. I
think getting the start and end of the signal right, i.e. initializing
the filter state, is also tricky. Another approach might be
least-squares regression to fit a higher-order polynomial rather than a
line to the data. At least, it seems that linear regression over a bunch
of alignment points would do a good job assuming clocks are stable and
just running at slightly different speeds.
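The forward-backward filtering described here is what SciPy exposes as
filtfilt; a minimal illustrative sketch of smoothing a series of per-chunk
offsets that way (the cut-off and filter order are arbitrary example values,
e.g. offsets measured every 10 s with a 0.001 Hz cut-off):

    import numpy as np
    from scipy.signal import butter, filtfilt

    def smooth_alignment(offsets, chunk_period_s, cutoff_hz=0.001):
        # offsets: detected time differences at evenly spaced points along the track.
        # Zero-phase low-pass: filter forward, reverse, filter again, reverse again,
        # so the smoothed alignment function has no lag relative to the raw estimates.
        fs = 1.0 / chunk_period_s               # sampling rate of the offset series
        b, a = butter(2, cutoff_hz / (fs / 2))  # 2nd-order low-pass, normalized cut-off
        return filtfilt(b, a, np.asarray(offsets, dtype=float))  # needs more than a handful of points

filtfilt pads the series and chooses initial filter conditions to limit the
start/end transients mentioned above; a simple alternative in the same spirit
is np.polyfit on the offsets, i.e. the polynomial regression idea.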
-Roger
Raphaël Marinier
2017-06-18 16:27:27 UTC
Permalink
Post by Roger Dannenberg
Here are some random thoughts: It would make sense to compute alignment at
many points and do some sort of smoothing. You might ask (and try to
solve): What alignment function minimizes the sum-of-squares of alignment
errors, considering only *plausible* alignment functions, i.e. those that
could be produced by real crystal clocks? I'm not even sure of a reasonable
model for clock drift, but one approach might be to just take the alignment
function, treat it as a signal and low-pass it. The cut-off frequency would
be very low, a tiny fraction of 1 Hz, and you'd have to be careful not to
introduce phase shift or lag: The standard trick is to run an IIR filter
over the signal, reverse it, filter it again, and reverse it again, so that
phase shifts or lags cancel. I think getting the start and end of the
signal right, i.e. initializing the filter state, is also tricky. Another
approach might be least-squares regression to fit a higher-order polynomial
rather than a line to the data. At least, it seems that linear regression
over a bunch of alignment points would do a good job assuming clocks are
stable and just running at slightly different speeds.
Note that the current algorithm already does a linear regression over
multiple alignment points. This indeed corrects slightly different clock
speeds, assuming the speed differences are stable.

I'll go further and fit continuous piece-wise linear functions, to catch
unstable clock differences. I'll place the knots of the function ~10
minutes apart.
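
To make this concrete, here is a minimal C++ sketch of what such a fit
could look like (illustrative only, not the prototype's code; all names are
invented). The drift function is written as a sum of "hat" basis functions
on equally spaced knots, so each coefficient is simply the fitted time
shift at one knot, and the coefficients come from ordinary least squares:

// Minimal sketch, not the prototype's code: fit a continuous piece-wise
// linear drift function to (trackTime, detectedShift) observations by
// least squares.  The function is a sum of "hat" basis functions on
// equally spaced knots, so coefficient k is the fitted shift at knot k.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hat (tent) basis centred on `knot`, with uniform knot spacing `h`.
static double Hat(double t, double knot, double h)
{
   return std::max(0.0, 1.0 - std::fabs(t - knot) / h);
}

// Returns the fitted shift at each of `numKnots` (>= 2) equally spaced knots.
std::vector<double> FitPiecewiseLinear(const std::vector<double>& t,
                                       const std::vector<double>& y,
                                       int numKnots)
{
   const double t0 = *std::min_element(t.begin(), t.end());
   const double t1 = *std::max_element(t.begin(), t.end());
   const double h = (t1 - t0) / (numKnots - 1);   // e.g. ~600 s

   std::vector<double> knots(numKnots);
   for (int k = 0; k < numKnots; ++k)
      knots[k] = t0 + k * h;

   // Normal equations A c = b; a tiny ridge term keeps A non-singular
   // even if some knot interval contains no observations.
   std::vector<double> A(numKnots * numKnots, 0.0);
   std::vector<double> b(numKnots, 0.0), c(numKnots, 0.0);
   for (std::size_t i = 0; i < t.size(); ++i)
      for (int j = 0; j < numKnots; ++j) {
         const double bj = Hat(t[i], knots[j], h);
         if (bj == 0.0) continue;
         b[j] += bj * y[i];
         for (int k = 0; k < numKnots; ++k)
            A[j * numKnots + k] += bj * Hat(t[i], knots[k], h);
      }
   for (int j = 0; j < numKnots; ++j)
      A[j * numKnots + j] += 1e-9;

   // Solve the small dense system by Gaussian elimination with pivoting.
   c = b;
   for (int col = 0; col < numKnots; ++col) {
      int pivot = col;
      for (int r = col + 1; r < numKnots; ++r)
         if (std::fabs(A[r * numKnots + col]) > std::fabs(A[pivot * numKnots + col]))
            pivot = r;
      for (int k = 0; k < numKnots; ++k)
         std::swap(A[col * numKnots + k], A[pivot * numKnots + k]);
      std::swap(c[col], c[pivot]);
      for (int r = col + 1; r < numKnots; ++r) {
         const double f = A[r * numKnots + col] / A[col * numKnots + col];
         for (int k = col; k < numKnots; ++k)
            A[r * numKnots + k] -= f * A[col * numKnots + k];
         c[r] -= f * c[col];
      }
   }
   for (int col = numKnots - 1; col >= 0; --col) {
      for (int k = col + 1; k < numKnots; ++k)
         c[col] -= A[col * numKnots + k] * c[k];
      c[col] /= A[col * numKnots + col];
   }
   return c;   // evaluate f(t) by linear interpolation between the knots
}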

When we evaluate the alignment at many positions, some detected time
differences will be completely wrong. This can be because the algorithm did
not succeed (e.g. the time window considered is mostly filled with
silence), or because the two tracks only partially overlap. The model has
to be robust to these outliers and nonsensical values. The more complex the
function we fit, the harder it is, so I am very much in favor of keeping the
fitting function as simple as possible.
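
One cheap way to get that robustness, as a sketch and assuming the gross
errors are much larger than the real drift, is to drop points whose
detected shift sits far from the median shift (in units of the median
absolute deviation) before fitting anything:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

static double Median(std::vector<double> v)
{
   std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
   return v[v.size() / 2];
}

// Returns the indices of the alignment points whose detected shift lies
// within `nSigmas` robust standard deviations of the median shift.
std::vector<std::size_t> FilterOutliers(const std::vector<double>& shift,
                                        double nSigmas = 5.0)
{
   const double med = Median(shift);
   std::vector<double> absDev(shift.size());
   for (std::size_t i = 0; i < shift.size(); ++i)
      absDev[i] = std::fabs(shift[i] - med);
   // 1.4826 * MAD estimates the standard deviation for Gaussian noise.
   const double scale = 1.4826 * Median(absDev) + 1e-12;

   std::vector<std::size_t> kept;
   for (std::size_t i = 0; i < shift.size(); ++i)
      if (std::fabs(shift[i] - med) <= nSigmas * scale)
         kept.push_back(i);
   return kept;
}

If the drift itself were large, the residuals would have to be taken
against a rough first-pass linear fit rather than against the plain median.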

Regarding the resampling, I've seen that soxr, currently used by Audacity,
supports piece-wise linear functions, and it seems straightforward to use
from Audacity's code. If we wanted to do this with Nyquist, I'd have to
execute Nyquist instructions from my code, which seems considerably more
complicated. Is there an easy way to do it from code that I missed?
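
Just to illustrate the operation needed (this is not soxr's API, and the
naive linear interpolation below is only a stand-in for a real resampler),
applying a piece-wise linear time mapping means reading the input at a
warped position for every output sample:

#include <cstddef>
#include <vector>

// f(outTime) = inTime, with anchors sorted by outTime (at least two).
struct Anchor { double outTime, inTime; };

static double MapTime(const std::vector<Anchor>& a, double t)
{
   std::size_t i = 1;
   while (i + 1 < a.size() && t > a[i].outTime)
      ++i;
   const double u = (t - a[i - 1].outTime) / (a[i].outTime - a[i - 1].outTime);
   return a[i - 1].inTime + u * (a[i].inTime - a[i - 1].inTime);
}

// Produce `outLength` samples of the warped track at sample rate `rate`.
std::vector<float> WarpTrack(const std::vector<float>& in, double rate,
                             const std::vector<Anchor>& anchors,
                             std::size_t outLength)
{
   std::vector<float> out(outLength, 0.0f);
   for (std::size_t n = 0; n < outLength; ++n) {
      const double srcPos = MapTime(anchors, n / rate) * rate;
      if (srcPos < 0.0 || srcPos + 1.0 >= static_cast<double>(in.size()))
         continue;                        // outside the input: leave silence
      const std::size_t i = static_cast<std::size_t>(srcPos);
      const double frac = srcPos - static_cast<double>(i);
      out[n] = static_cast<float>((1.0 - frac) * in[i] + frac * in[i + 1]);
   }
   return out;
}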

Raphaël
Roger Dannenberg
2017-06-18 18:36:48 UTC
Permalink
Post by Raphaël Marinier
Note that the current algorithm already does a linear regression over
multiple alignment points. This indeed corrects slightly different
clock speeds, assuming the speed differences are stable.
I'll go further and fit continuous piece-wise linear functions, to
catch unstable clock differences. I'll place the knots of the function
~10 minutes apart.
That sounds like a good idea. When you do regression on different
sections, the end-points will not match, creating another curve-fitting
problem. You may have a better idea, but here's one: Just use linear
regression within each 10-minute segment to estimate the alignment at the
segment's exact midpoint; then connect the midpoints to make a continuous
piece-wise linear function. I guess the first and last 5 minutes can just
be a linear extrapolation of that curve.
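
A sketch of that idea, with invented names and nothing Audacity-specific:
run an ordinary regression inside each segment and keep only the fitted
value at the segment's midpoint:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Knot { double time, shift; };

// t/y are the alignment observations (track time, detected shift).
// Returns one knot per segment of length `segmentLen` (e.g. 600 seconds);
// the alignment function is the piece-wise linear interpolation of the
// knots, extrapolated linearly before the first and after the last one.
std::vector<Knot> MidpointKnots(const std::vector<double>& t,
                                const std::vector<double>& y,
                                double segmentLen)
{
   std::vector<Knot> knots;
   if (t.empty())
      return knots;
   const double tBegin = *std::min_element(t.begin(), t.end());
   const double tEnd = *std::max_element(t.begin(), t.end());
   for (double s = tBegin; s < tEnd; s += segmentLen) {
      double n = 0, st = 0, sy = 0, stt = 0, sty = 0;
      for (std::size_t i = 0; i < t.size(); ++i)
         if (t[i] >= s && t[i] < s + segmentLen) {
            n += 1; st += t[i]; sy += y[i];
            stt += t[i] * t[i]; sty += t[i] * y[i];
         }
      const double denom = n * stt - st * st;
      if (n < 2 || std::fabs(denom) < 1e-12)
         continue;                         // too little data in this segment
      const double slope = (n * sty - st * sy) / denom;
      const double intercept = (sy - slope * st) / n;
      const double mid = s + segmentLen / 2;
      knots.push_back({ mid, intercept + slope * mid });
   }
   return knots;
}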
Post by Raphaël Marinier
When we evaluate the alignment at many positions, some detected time
differences will be completely wrong. This can be because the algorithm
did not succeed (e.g. the time window considered is mostly filled with
silence), or because the two tracks only partially overlap. The model
has to be robust to these outliers and nonsensical values. The more
complex the function we fit, the harder it is, so I am very much in favor
of keeping the fitting function as simple as possible.
Another thought: Sometimes you can do much better by estimating both the
alignment and confidence in the alignment. Then you can do a weighted
linear regression using confidence as weights.
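
For instance (a sketch; the weight would be whatever score the alignment
step produces, e.g. a normalized correlation peak height):

#include <cstddef>
#include <vector>

struct Line { double slope, intercept; };

// Weighted least-squares fit of shift = intercept + slope * time,
// where w[i] is the confidence of the i-th alignment measurement.
Line WeightedLinearFit(const std::vector<double>& t,
                       const std::vector<double>& y,
                       const std::vector<double>& w)
{
   double sw = 0, st = 0, sy = 0, stt = 0, sty = 0;
   for (std::size_t i = 0; i < t.size(); ++i) {
      sw += w[i];
      st += w[i] * t[i];
      sy += w[i] * y[i];
      stt += w[i] * t[i] * t[i];
      sty += w[i] * t[i] * y[i];
   }
   const double slope = (sw * sty - st * sy) / (sw * stt - st * st);
   return { slope, (sy - slope * st) / sw };
}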
Post by Raphaël Marinier
Regarding the resampling, I've seen that soxr, currently used by
Audacity, supports piece-wise linear functions, and it seems
straightforward to use from Audacity's code. If we wanted to do this
with Nyquist, I'd have to execute Nyquist instructions from my code,
which seems considerably more complicated. Is there an easy way to do it
from code that I missed?
I think it would be hard to call into Nyquist unless you did the whole
thing in Nyquist. I'm not sure what the soxr code does or what the API
looks like. I do know that resampling is tricky, so it might be worth
putting in some test signals with impulses or something very
distinguishable to test that resampling is working as intended. Since
the resampling algorithm is likely to use a number of windowing
operations, it's really easy to end up with shifted samples (or in some
implementations, shifting might be considered correct by the implementers).
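
The check could look roughly like this (a sketch; the resampler is passed
in as a black box, so the same test applies to soxr or to anything else):

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Resampler = std::function<std::vector<float>(const std::vector<float>&)>;

// Push a single impulse through `resample` (which converts by `ratio` =
// output rate / input rate) and check that the output peak lands within
// `tolerance` samples of where the time mapping says it should.
bool ImpulseStaysPut(const Resampler& resample, double ratio,
                     std::size_t impulseAt = 44100,
                     std::size_t length = 3 * 44100,
                     double tolerance = 2.0)
{
   std::vector<float> in(length, 0.0f);
   in[impulseAt] = 1.0f;

   const std::vector<float> out = resample(in);
   if (out.empty())
      return false;

   std::size_t peak = 0;
   for (std::size_t i = 1; i < out.size(); ++i)
      if (std::fabs(out[i]) > std::fabs(out[peak]))
         peak = i;

   const double expected = impulseAt * ratio;
   return std::fabs(static_cast<double>(peak) - expected) <= tolerance;
}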

-Roger
Federico Miyara
2017-06-19 01:45:33 UTC
Permalink
Roger,

In this article
http://www.electronicdesign.com/analog/minimize-frequency-drift-crystals
there are some insights as to crystal oscillator frequency drift.

For sealed crystals, where humidity and pressure are not a concern,
drift is mostly due to temperature (and mechanical stress, which I
presume is a long-term effect which will not affect a one-session
recording). As a worst case, a CT-cut crystal will change less than 20
parts per million in the range 15 °C to 40 °C, which covers most ambient
temperature situations. This amounts to at most 72 ms in a one-hour
recording.

A possible approach would be to align the beginnings, then measure the
time shift every 5 min or so to get about 12 reference points per
hour, fit a polynomial to those time shifts by least squares, and
apply variable resampling to align. The choice of 5 min is meant to
ensure that the error between reference points stays small and does
not significantly alter the natural delay that may exist between both
signals.
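
As a rough illustration of that idea (a Python/numpy sketch, not actual
Audacity code; it assumes the offset at each reference point has already
been measured, and all names are made up):

import numpy as np
from scipy.interpolate import interp1d

def warp_to_reference(track, sr, ref_times, offsets, degree=2):
    # Least-squares polynomial fit of measured offset vs. time
    # (the "about 12 reference points per hour" above).
    drift = np.poly1d(np.polyfit(ref_times, offsets, degree))
    # Reference time t should be read from track time t + drift(t).
    t_ref = np.arange(len(track)) / sr
    t_src = t_ref + drift(t_ref)
    # "Variable resampling", done here by plain linear interpolation;
    # a real implementation would use a high-quality resampler.
    read = interp1d(t_ref, track, bounds_error=False, fill_value=0.0)
    return read(t_src)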

There is still a problem. Any attempt to align by correlation peaks
will override the natural delay caused by different distances from
source to microphone (which in turn will vary with the position of the
source, and there could certainly be several sources!). A workaround
would be to find the best overall peak alignment and then interactively
fine-adjust the time delay to get the most realistic render.

However, this may present another problem: the alignment may change if
the main source at a given time is different from the main source at a
different time. There is no obvious solution for this.

Another approach would be to assume that both crystals have the same
drift (hopefully they will experience the same temperature changes), so
the correction might be based just on the start and end instants,
assuming the desynchronization is a linear cumulative phase drift
caused by slightly different frequencies.
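
In that two-point case the correction collapses to a single resampling
ratio plus a constant offset; for instance (all times made up):

# A clap near the start and another near the end, located in both tracks
# (e.g. by cross-correlating short regions around those two instants).
t0a, t1a = 2.00, 3502.00   # times of the two events in track A, in seconds
t0b, t1b = 5.50, 3505.85   # times of the same events in track B
ratio = (t1b - t0b) / (t1a - t0a)   # B's clock rate relative to A's (1.0001 here)
offset = t0b - ratio * t0a          # where A's time zero falls in B
# Aligning B then means reading B at time ratio*t + offset for output time t:
# a constant-rate resample by 1/ratio plus a constant shift.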

Regards

Federico
Post by Roger Dannenberg
Here are some random thoughts: It would make sense to compute
alignment at many points and do some sort of smoothing. You might ask
(and try to solve): What alignment function minimizes the
sum-of-squares of alignment errors, considering only *plausible*
alignment functions, i.e. those that could be produced by real crystal
clocks? I'm not even sure of a reasonable model for clock drift, but
one approach might be to just take the alignment function, treat it as
a signal and low-pass it. The cut-off frequency would be very low, a
tiny fraction of 1 Hz, and you'd have to be careful not to introduce
phase shift or lag: The standard trick is to run an IIR filter over
the signal, reverse it, filter it again, and reverse it again, so that
phase shifts or lags cancel. I think getting the start and end of the
signal right, i.e. initializing the filter state, is also tricky.
Another approach might be least-squares regression to fit a
higher-order polynomial rather than a line to the data. At least, it
seems that linear regression over a bunch of alignment points would do
a good job assuming clocks are stable and just running at slightly
different speeds.
-Roger
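
For what it's worth, scipy's filtfilt implements exactly the
filter-reverse-filter-reverse trick described above, and it pads the ends
to reduce the start/end transients. A toy sketch on a made-up series of
one-per-second offset estimates:

import numpy as np
from scipy.signal import butter, filtfilt

t = np.arange(3600.0)                        # one offset estimate per second
raw_offsets = 1e-5 * t + 0.002 * np.random.randn(t.size)   # drift + jitter
b, a = butter(2, 0.005 / 0.5)                # low-pass, 0.005 Hz cutoff at fs = 1 Hz
smoothed = filtfilt(b, a, raw_offsets)       # forward + backward: zero phase lag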
James Crook
2017-06-18 09:12:12 UTC
Permalink
It all depends on what you are doing the alignment for.

IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.

For recordings of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.

--James.
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an aligned points
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeWM/view?usp=sharing>
)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?
Thanks,
Raphaël
Raphaël Marinier
2017-06-18 16:47:49 UTC
Permalink
Post by James Crook
It all depends what you are doing the alignment for.
IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.
For recording of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.
For sure I assume it's the same performance. Aligning different performances
is a harder problem, and I think we need more powerful features for it to
work, such as note onset detection. The MATCH plugin does that, but
unfortunately it does not do a good job in the simpler case of different
recordings of the same performance.

I think we should be able to align when there are slightly different clock
speeds, and also different microphone placement.

Could you expand on the model for sources/reverb?

At the end of the day, the proposed algorithm uses peaks in the
cross-correlation function, evaluated for multiple time windows, to produce
the final alignment. This should align mostly based on the louder
components of the signal, not the reverberated components.
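
For reference, the coarse cross-correlation step boils down to something
like the following numpy sketch (the real implementation is the C++ in the
prototype commit; the names here are made up):

import numpy as np

def coarse_shift_seconds(summary_a, summary_b, hop_s=0.025):
    # Normalize the per-window summaries so overall level differences matter less.
    a = (summary_a - summary_a.mean()) / (summary_a.std() + 1e-12)
    b = (summary_b - summary_b.mean()) / (summary_b.std() + 1e-12)
    # Cross-correlation in O(n log n) via the FFT / convolution theorem.
    nfft = 1 << (len(a) + len(b) - 2).bit_length()
    corr = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    lags = np.arange(-(len(b) - 1), len(a))
    corr = np.concatenate((corr[-(len(b) - 1):], corr[:len(a)]))
    # Strongest peak = best shift, converted from summary windows to seconds
    # (positive means the second recording started later).
    return lags[np.argmax(corr)] * hop_s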

Thanks!

Raphaël

--James.
Post by James Crook
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an aligned points
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeW
M/view?usp=sharing>
)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?
Thanks,
Raphaël
(By the way, Nyquist's phase-vocoder works the same way, but in this
Post by Roger Dannenberg
case resampling would be the right operation.)
-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c6
6c35e390c8169d0ac9bfab22e352567
Post by Roger Dannenberg
Post by Raphaël Marinier
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), memory
requirements are very small (on the order of 3MBs for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal audacity
feature (for example shown in the Tracks menu)?
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this
configurable.
- If the time drift is very small, we may want to avoid resampling
tracks.
- We could use a much smaller time window in the second alignment
Post by Raphaël Marinier
phase. This could make the alignment more precise, while still keeping
the algorithm fast.
The benchmarking code is completely ad-hoc, it would also be great to
find a way to run this kind of automated benchmarks in a uniform way
across Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
James: "This is extrapolating from back in old times, in the late 80's
Post by Raphaël Marinier
when
I was analysing DNA and protein sequences..."
Post by Raphaël Marinier
Post by Vaughan Johnson
Didn't know that! I was doing similar work then, with Blackboard
systems,
on the PROTEAN project at Stanford KSL,
Post by Raphaël Marinier
Post by Vaughan Johnson
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf
.
Yes I've known about dynamic programming since about then. Good work,
James
-- I like your trick.
Post by Raphaël Marinier
Post by Vaughan Johnson
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in context of DNA sequence matching) that caters for
different
kinds of matching. The trick is to run two 'match matrices' at the
Post by Raphaël Marinier
Post by Vaughan Johnson
same
time, and have a penalty for switching between them. This is excellent
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
where there is a mix of signal and noise, as in your test examples.
For
aligning noise you want a fairly sloppy not very precisely
Post by Raphaël Marinier
Post by Vaughan Johnson
discriminating
comparison that is picking up broad characteristics. What's great
Post by Raphaël Marinier
Post by Vaughan Johnson
about
running two match matrices is that the algorithm naturally switches in
Post by Raphaël Marinier
Post by Vaughan Johnson
to
using the best kind of matching for different sections.
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer
approach.
Instead of allocating space length x max-shift you sample evenly and
Post by Raphaël Marinier
Post by Vaughan Johnson
only
allocate space of k x max-shift for some small value of k such as
Post by Raphaël Marinier
Post by Vaughan Johnson
100. The
cost is that you have to repeat the analysis log( length-of-sequence)
Post by Raphaël Marinier
Post by Vaughan Johnson
times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would
be
repeated 3 times. Because you stay in cache in the analysis and write
Post by Raphaël Marinier
Post by Vaughan Johnson
much
less to external memory it's a big net win both in storage and speed
Post by Raphaël Marinier
Post by Vaughan Johnson
over a
single pass approach.
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
I haven't written versions for sound. This is extrapolating from back
in
old times, in the late 80's when I was analysing DNA and protein
Post by Raphaël Marinier
Post by Vaughan Johnson
sequences
on a PC with a fraction of the power and storage of modern PCs. You
Post by Raphaël Marinier
Post by Vaughan Johnson
had to
be inventive to get any decent performance at all. This kind of trick
Post by Raphaël Marinier
Post by Vaughan Johnson
can
pay off in a big way, even today.
Post by Raphaël Marinier
Post by Vaughan Johnson
Post by James Crook
I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Thanks for the information.

I did some testing of the MATCH vamp plugin, running it via sonic
analyzer, which integrates it already.

First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a
max allowed time shift of 60s, it takes 6 minutes on a recent processor
(Intel i5-5200U), and takes about 8GB of RAM. Using it for larger time
shifts such as 10 minutes will be quite expensive...

I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment
analysis, but does not actually align the tracks.

(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.

(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio
is wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.

(3) 2 audio tracks recorded from the same concert (left and right channels
from the same device), except for a 30s hole filled with pink noise, with
a time-shift of about 15s between them.
Same issues as (2), no new issues.

(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds ratios of tempos that are as divergent as
<0.8 or >1.2 a significant fraction of the time. This is pretty bad since
a correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved with non-default parameters, lowering the cost of
the diagonal to 1.5 and enabling the "path smoothing" feature, but the
tempo ratio still routinely hovers around 0.9 - 1.1.

(5) 2 recordings of two performances of the same composition, time shift
of about 15s, and a hole of about 30s.
Default parameters lead to big issues at the boundaries around the hole
(10s and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly
improves the match by mostly fixing the boundaries around the hole. There
is still a small issue with the first 0.5s of the performance that remains
incorrectly matched.

I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.

My conclusion is that the MATCH plugin cannot be used that easily, even
for the simple case of 2 recordings of the same event, because of accuracy
and performance. The former could be fixable by imposing stronger
regularity of the path (e.g. piecewise linear). The latter might be harder.

I propose to start working on an algorithm and feature specific to the
case of 2 recordings of the same event, which is an easier case to start
with both in terms of algorithm and UI.

I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.

Cheers,
--
Raphaël
On Mon, Jun 27, 2016 at 9:19 AM, Robert Hänggi <
Post by Robert Hänggi
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific post.
In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal, however, there are some difficulties in
Base left = LFront + LBack (for normal stereo playback)
FM Left = LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter etc.
In summary, the alignment had to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and on and on...
For the moment, I will use cross-correlation and least square fitting
for certain "anchor" points.
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.
Cheers
Robert
Post by Roger Dannenberg
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.
Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.
-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
James Crook
2017-06-18 20:25:25 UTC
Permalink
Post by Raphaël Marinier
Post by James Crook
It all depends what you are doing the alignment for.
IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.
For recording of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.
For sure I assume it's the same performance. Aligning different performances
is a harder problem, and I think we need more powerful features for it to
work, such as note onset detection. The MATCH plugin does that, but
unfortunately, does not do a good job in the simpler case of different
recordings of the same performance.
It is possible to adapt an algorithm that aligns performances to instead
align for clock drift by greatly increasing the penalties for small
time-excursions. Alignment algorithms (for DNA) usually have the
penalties set low, to showcase their strength in making small local
adjustments. Set those penalties high, and the results are closer to
what you want for the simpler case.
Post by Raphaël Marinier
I think we should be able to align when there are slightly different clock
speeds, and also different microphone placement.
Could you expand on the model for sources/reverb?
Only a little, really.

I think alignment can require a certain amount of source separation. In
turn, source separation (in my view) entails having a model for the
sources, for example that one source is primarily percussive and another
primarily sustained tones.

You can align for percussive. You can align for sustained tones. You
can also align for both at the same time, simply by having an additional
small penalty (like the small penalty for a time excursion) for
switching between the two models. The alignment then tracks the time
alignment AND the preferred model, moment by moment, at minimal extra
computing cost compared to running two alignments separately.

The advantage is that instead of a result that is an average of two
models, you have an alignment that combines the best sections of both.

I have not tried this with sound. I'm extrapolating from work on DNA /
protein sequences.
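A minimal sketch of that dual-matrix idea, extrapolated from the description
above rather than taken from any existing code: two cost models are run in
parallel over a DTW-style grid, with a small penalty for switching model.
The cost_a/cost_b frame distances and the gap penalty are placeholders; the
gap value plays the role of the time-excursion penalty mentioned earlier
(set it high to approximate the pure clock-drift case). This naive version
fills the full matrix and ignores the banded divide-and-conquer storage trick.

# Illustrative only: dynamic-programming alignment with two scoring models
# evaluated in parallel and a penalty for switching between them.
import numpy as np

def dual_model_align(a, b, cost_a, cost_b, gap=1.0, switch=0.5):
    """a, b: sequences of feature frames. Returns the total cost of the
    best alignment path, which may switch between the two models."""
    n, m = len(a), len(b)
    INF = float("inf")
    costs = (cost_a, cost_b)
    # D[k, i, j] = best cost aligning a[:i] with b[:j], currently using model k.
    D = np.full((2, n + 1, m + 1), INF)
    D[:, 0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for k in range(2):
                best = INF
                for pk in range(2):                 # model used at the previous step
                    pen = 0.0 if pk == k else switch
                    if i > 0 and j > 0:             # match a[i-1] with b[j-1]
                        best = min(best, D[pk, i-1, j-1] + costs[k](a[i-1], b[j-1]) + pen)
                    if i > 0:                       # time excursion: skip a frame of a
                        best = min(best, D[pk, i-1, j] + gap + pen)
                    if j > 0:                       # time excursion: skip a frame of b
                        best = min(best, D[pk, i, j-1] + gap + pen)
                D[k, i, j] = best
    return min(D[0, n, m], D[1, n, m])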
Post by Raphaël Marinier
At the end of the day, the proposed algorithm uses peaks in the
cross-correlation function, evaluated for multiple time windows, to produce
the final alignment. This should align mostly based on the louder
components of the signal, not the reverberated components.
Thanks!
Raphaël
--James.
Post by James Crook
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had aligned points
every 10s, you could make a piece-wise linear function interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
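The same idea can be mimicked outside Nyquist. The rough numpy sketch below
(not Nyquist code, and using plain linear interpolation of samples rather
than Nyquist's sub-sample-accurate resampling) builds a piece-wise linear
f(t) from alignment points and evaluates one track on the warped timeline;
all names are illustrative:

import numpy as np

def warp_to_reference(track2, rate, anchors_t2, anchors_t1):
    """anchors_t2[i] (seconds in track2) corresponds to anchors_t1[i]
    (seconds in track1); both lists are assumed increasing. Returns track2
    resampled onto track1's timeline."""
    n_out = int(round(anchors_t1[-1] * rate))
    t1 = np.arange(n_out) / rate                    # output timeline (track1 time)
    f_t = np.interp(t1, anchors_t1, anchors_t2)     # piece-wise linear time map f(t)
    src_pos = f_t * rate                            # fractional sample positions in track2
    idx = np.arange(len(track2))
    return np.interp(src_pos, idx, track2)          # linear-interpolated samples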
Some thoughts about alignment: What happens if you have recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph:
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeWM/view?usp=sharing>)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying time-stretching
that jumps to the loudest source?
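For illustration, the global affine fit described here amounts to a
least-squares line through the per-chunk time differences; a minimal sketch
with made-up names and synthetic data (15s offset plus 0.01% drift, matching
the cases discussed in this thread):

import numpy as np

def fit_affine_shift(chunk_times, time_diffs):
    """chunk_times: centre of each analysis chunk in track A (seconds).
    time_diffs: measured offset of track B relative to track A per chunk.
    Returns (drift, offset) such that predicted_diff = drift * t + offset."""
    drift, offset = np.polyfit(chunk_times, time_diffs, deg=1)
    return drift, offset

# Example: a constant 15 s shift plus 0.01% clock drift should be recovered.
t = np.arange(0, 3600, 30.0)
measured = 15.0 + 1e-4 * t + np.random.normal(0, 0.002, t.size)
print(fit_affine_shift(t, measured))   # ~ (1e-4, 15.0)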
Thanks,
Raphaël
(By the way, Nyquist's phase-vocoder works the same way, but in this
case resampling would be the right operation.)
-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c66c35e390c8169d0ac9bfab22e352567
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem (a rough sketch
follows this list).
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long tracks.
5. Apply the shift, and resample one track if need be.
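A rough illustration of steps 2-3 (assumed, not the prototype's actual
code): cross-correlate the two per-window summary sequences via the FFT and
take the lag with the highest correlation as the coarse shift.

import numpy as np

def coarse_shift(summary_a, summary_b, window_seconds=0.025):
    """summary_a, summary_b: 1-D arrays, one summary value per 25 ms window.
    Returns the lag (in seconds) at which b best correlates with a.
    The sign convention should be checked against a known-offset example."""
    a = np.asarray(summary_a, float); a = a - a.mean()
    b = np.asarray(summary_b, float); b = b - b.mean()
    n = len(a) + len(b) - 1
    nfft = 1 << (n - 1).bit_length()                      # next power of two
    # Cross-correlation via the convolution theorem.
    corr = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    pos = corr[:len(a)]                                   # lags 0 .. len(a)-1
    neg = corr[nfft - (len(b) - 1):] if len(b) > 1 else corr[:0]
    full = np.concatenate([neg, pos])                     # lags -(len(b)-1) .. len(a)-1
    lags = np.arange(-(len(b) - 1), len(a))
    return lags[int(np.argmax(full))] * window_seconds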
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of a few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc.). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), and memory
requirements are very small (on the order of 3MB for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal Audacity
feature (for example shown in the Tracks menu)?

- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this
configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.

The benchmarking code is completely ad-hoc; it would also be great to
find a way to run this kind of automated benchmark in a uniform way
across the Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
James: "This is extrapolating from back in old times, in the late 80's
Post by Raphaël Marinier
when
I was analysing DNA and protein sequences..."
Post by Raphaël Marinier
Post by Vaughan Johnson
Didn't know that! I was doing similar work then, with Blackboard
systems,
on the PROTEAN project at Stanford KSL,
Post by Raphaël Marinier
Post by Vaughan Johnson
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf
.
Yes I've known about dynamic programming since about then. Good work,
James
-- I like your trick.
Post by Raphaël Marinier
Post by Vaughan Johnson
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in the context of DNA sequence matching) that caters for
different kinds of matching. The trick is to run two 'match matrices' at
the same time, and have a penalty for switching between them. This is
excellent where there is a mix of signal and noise, as in your test
examples. For aligning noise you want a fairly sloppy, not very precisely
discriminating comparison that is picking up broad characteristics.
What's great about running two match matrices is that the algorithm
naturally switches into using the best kind of matching for different
sections.

On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer approach.
Instead of allocating space length x max-shift you sample evenly and only
allocate space of k x max-shift for some small value of k such as 100. The
cost is that you have to repeat the analysis log(length-of-sequence) times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would be
repeated 3 times. Because you stay in cache in the analysis and write much
less to external memory it's a big net win both in storage and speed over a
single pass approach.

I haven't written versions for sound. This is extrapolating from back in
old times, in the late 80's when I was analysing DNA and protein sequences
on a PC with a fraction of the power and storage of modern PCs. You had to
be inventive to get any decent performance at all. This kind of trick can
pay off in a big way, even today.

I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Federico Miyara
2017-06-19 05:33:49 UTC
Permalink
Friends,

Alignment of performances is a vastly more difficult task than alignment
of different versions of the same performance (such as recordings taken
with different recorders). It is very difficult to keep a constant tempo,
let alone keep pace with a different performer (or even oneself) at a
different time, so the time difference across different performances for
equivalent events may be huge. It wouldn't be possible to make it just
with variable resampling. That works fine as long as the tuning change is
below the just noticeable difference for pitch, which wouldn't be the
case here.

In some areas (for instance speech recognition) an algorithm based on
dynamic programming called "dynamic time warping" is used for optimal
time alignment. See

https://en.wikipedia.org/wiki/Dynamic_time_warping

https://www.google.com.ar/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0ahUKEwinspr0kcnUAhWEhZAKHWvYDyMQFghFMAQ&url=http%3A%2F%2Fwww.springer.com%2Fcda%2Fcontent%2Fdocument%2Fcda_downloaddocument%2F9783540740476-c1.pdf%3FSGWID%3D0-0-45-452103-p173751818&usg=AFQjCNEoEbKBzZV-2qc3ujKtlAZuEYETng

This method, however, requires some sort of previous translation of the
audio features into symbols, a sort of WAV to MID conversion.
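For reference, the core DTW recurrence is compact; a textbook sketch
(illustrative only), with a per-frame distance function that does not have
to come from a symbolic transcription:

import numpy as np

def dtw_cost(x, y, dist=lambda u, v: abs(u - v)):
    """Cost of optimally warping feature sequence x onto sequence y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            # classic DTW recurrence: insertion, deletion, or match
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_cost([0, 1, 2, 3, 2, 0], [0, 1, 1, 2, 3, 2, 1, 0]))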

I think some of this has been accomplished in some cases involving a
performance by a great artist of the past whose recording was superb as a
musical rendering but very poor regarding audio quality. See, for instance,

http://www.npr.org/templates/story/story.php?storyId=10439850

I'm curious about what other applications performance alignment would have...

Regards,

Federico
James Crook
2017-06-19 08:31:17 UTC
Permalink
Post by Federico Miyara
Friends,
Alignment of performances is a vastly more difficult task than
alignment of different versions of the same performance (such as
recordings taken with different recorders). It is very difficult to
keep a constant tempo, let alone keep pace with a different performer
(or even oneself) at a different time, so the time difference across
different performances for equivalent events may be huge. It wouldn't
be possible to make it just with variable resampling. That works fine
as long as the tuning change is below the just noticeable difference
for pitch, which wouldn't be the case here.
In some areas (for instance speech recognition) an algorithm based on
dynamic programming called "dynamic time warping" is used for optimal
time alignment. See
https://en.wikipedia.org/wiki/Dynamic_time_warping
https://www.google.com.ar/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0ahUKEwinspr0kcnUAhWEhZAKHWvYDyMQFghFMAQ&url=http%3A%2F%2Fwww.springer.com%2Fcda%2Fcontent%2Fdocument%2Fcda_downloaddocument%2F9783540740476-c1.pdf%3FSGWID%3D0-0-45-452103-p173751818&usg=AFQjCNEoEbKBzZV-2qc3ujKtlAZuEYETng
Dynamic time/sequence warping is what I am familiar with from DNA and
protein sequence alignment.

In the case of audio you HAVE to identify 'stretchy' regions of audio.
Silence, white noise, sustained vowel sounds are all good candidates.
Post by Federico Miyara
This method, however, requires some sort of previous translation of
the audio features into symbols, a sort of WAV to MID conversion.
Not strictly so. The method requires that you can score the
similarity/difference between two short segments of sound. This may or
may not involve a 'translation' into a more symbolic representation.
Also the 'translation' does not have to be locked in, in the sense that
you can provide multiple alternative translations, and the alignment
selects between them in doing the alignment.
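As one hedged illustration of "scoring the similarity of two short segments
of sound" without any symbolic translation, a per-frame spectral distance
such as the one below could drive a DTW or dual-matrix alignment. The window
and distance choice here are just plausible defaults, not something proposed
in the thread:

import numpy as np

def frame_distance(seg_a, seg_b):
    """seg_a, seg_b: equal-length arrays of samples (e.g. 25 ms windows).
    Returns a cosine distance between their log-magnitude spectra
    (0 means identical spectra)."""
    window = np.hanning(len(seg_a))
    spec_a = np.log1p(np.abs(np.fft.rfft(seg_a * window)))
    spec_b = np.log1p(np.abs(np.fft.rfft(seg_b * window)))
    denom = np.linalg.norm(spec_a) * np.linalg.norm(spec_b) + 1e-12
    return 1.0 - float(np.dot(spec_a, spec_b)) / denom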
Post by Federico Miyara
I think some of this has been accomplished in some cases where a
performance by a great artist of the past whose recording was superb
as musical rendering but very poor regarding audio quality. See, for
instance,
http://www.npr.org/templates/story/story.php?storyId=10439850
I'm curious about what other applications performance alignmentwould have...
Regards,
Federico
Post by James Crook
Post by Raphaël Marinier
Post by James Crook
It all depends what you are doing the alignment for.
IF the assumption is that it is clock drift/different clocks, then you
need a different model for scoring alignments than if alignment is for
aligning multiple takes of the same song, and different again from
recordings of the same performance by different microphones.
For recording of the same performance by different microphones, you have
to have some model for the sources/reverb. In a sense the
alignment is
then deciding on both the most probable alignment and the most probable
model parameters at the same time.
For sure I assume it's the same performance. Aligning different performance
is a harder problem, and I think we need more powerful features for it to
work, such as note onset detection. The MATCH plugin does that, but
unfortunately, does not do a good job in the simpler case of different
recordings of the same performance.
It is possible to adapt an algorithm that aligns performances to
instead align for clock drift by much increasing the penalties for
small time-excursions. Alignment algorithms (for DNA) usually have
the penalties set low, to showcase their strength in making small
local adjustments. Set those penalty high, and the results are
closer to what you want for the simpler case.
Post by Raphaël Marinier
I think we should be able to align when there are slightly different clock
speeds, and also different microphone placement.
Could you expand on the model for sources/reverb?
Only a little, really.
I think alignment can require a certain amount of source-separation.
In turn, source separation (in my view) entails having a model for
the sources, for example that one source is primarily percussive and
another primarily sustained tones.
You can align for percussive. You can align for sustained tones. You
can also align for both at the same time, simply by having an
additional small penalty (like the small penalty for a time
excursion) for switching between the two models. The alignment then
tracks the time alignment AND the preferred model, moment by moment,
at minimal extra computing cost compared to running two alignments
separately.
The advantage is that instead of a result that is an average of two
models, you have an alignment that combines the best sections of both.
I have not tried this with sound. I'm extrapolating from work on DNA
/ protein sequences.
Post by Raphaël Marinier
At the end of the day, the proposed algorithm uses peaks in the
cross-correlation function, evaluated for multiple time windows, to produce
the final alignment. This should align mostly based on the louder
components of the signal, not the reverberated components.
Thanks!
Raphaël
--James.
Post by James Crook
Post by Raphaël Marinier
Post by Roger Dannenberg
Just a comment on implementation: Nyquist has high-quality resampling,
and unlike most implementations that simply resample with some scale
factor, Nyquist allows you to construct a mapping from one clock to
another, e.g. if the signal is S, you can compute S(f(t)) where f(t) is
any monotonically increasing function (for example, to do a simple
speed-up, you can use f(t) = t * 1.01). In the implementation, f(t) is
actually a Nyquist Sound, so for example, if you had an aligned points
every 10s, you could make a piece-wise linear function
interpolating the
alignment points, thus compensating for clocks that are slowly changing
speed. Results are sub-sample accurate.
Some thoughts about alignment: What happens if you have
recordings from
different locations recording sources from different locations? There
may be no perfect alignment, e.g. in one recording, source A might be
earlier than source B, but in the other source B is before source A.
Does this cause alignment to jump to the loudest source and introduce a
lot of timing jitter?
I checked a few examples that have the property you mention. When doing
local alignment (second phase of the algorithm) with very small windows
(e.g. 1ms), I indeed see varying detected time differences at different
positions in the two tracks. They seem to follow the loudest source. E.g.
detected time differences hover between -20 and +20ms for two recordings
~15 meters apart, of sources ~10 meters apart (see this graph
<https://drive.google.com/file/d/0B7V5I4sAuUdfNDNsaWYyZGFQeW
M/view?usp=sharing>
)
However, the algorithm performs relatively coarse alignment. We fit an
affine function on those time differences vs track time, and just apply
this affine transformation globally to one of the tracks.
As you mention, we could of course fit a piece-wise linear function
instead, but do we want to introduce this kind of varying
time-stretching
that jumps to the loudest source?
Thanks,
Raphaël
(By the way, Nyquist's phase-vocoder works the same way, but in this
Post by Roger Dannenberg
case resampling would be the right operation.)
-Roger
Post by Raphaël Marinier
Hi all,
After almost one year, I finally managed to spend some time on a
prototype implementation in Audacity, that aligns different recordings
of the same event.
https://github.com/RaphaelMarinier/audacity/commit/3276106c6
6c35e390c8169d0ac9bfab22e352567
Post by Roger Dannenberg
Post by Raphaël Marinier
1. Summarize each track by computing summary values on a sliding time
window. Typically the window is 25ms.
2. Compute the cross-correlation between the summaries. This is done
in O(n log n) thanks to the FFT and convolution theorem.
3. Find the best shift from the cross-correlation function.
4. Split summaries into small chunks, and align them 1:1. This allows
detecting small clock speed differences between devices. It has been
tested successfully with 0.01% clock speed difference on 1h long
tracks.
5. Apply the shift, and resample one track if need be.
There are multiple algorithms and parameters that can be chosen at
each step, in particular regarding summarization of a window of audio
data, and finding the best peaks from the cross-correlation function.
I created a benchmark out of few recordings, with a few automated
audio transformations (low pass, high pass, forced clock speed
difference, etc..). With the best parameters, I get about 96% success
rate out of 150 audio pairs.
The run time is pretty reasonable, taking less than 10s for 1h audio
tracks on a recent laptop (plus resample time if it happens), memory
requirements are very small (on the order of 3MBs for two 1h tracks).
Would you like to have this in Audacity? If yes, what would be the
best way to integrate it? Note that we need to be able to shift tracks
by some offset, and resample them if need be. Does any plugin system
allow shifting the tracks without having to rewrite the samples?
Should this feature just be integrated as an ad-hoc internal audacity
feature (for example shown in the Tracks menu)?
- Sync lock track group handling.
- Alignment uses left channel only. We might want to make this
configurable.
- If the time drift is very small, we may want to avoid resampling tracks.
- We could use a much smaller time window in the second alignment
phase. This could make the alignment more precise, while still keeping
the algorithm fast.
The benchmarking code is completely ad hoc; it would also be great to
find a way to run this kind of automated benchmark in a uniform way
across the Audacity code base (I guess other parts of Audacity could
benefit as well).
James, thanks for your algorithmic suggestions. For now I went the
route of using a mix of global and local cross-correlation.
Raphaël
On Thu, Jul 14, 2016 at 12:26 AM, Vaughan Johnson
James: "This is extrapolating from back in old times, in the late 80's
Post by Raphaël Marinier
when
I was analysing DNA and protein sequences..."
Post by Raphaël Marinier
Post by Vaughan Johnson
Didn't know that! I was doing similar work then, with Blackboard
systems,
on the PROTEAN project at Stanford KSL,
Post by Raphaël Marinier
Post by Vaughan Johnson
http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19870014670.pdf
.
Yes I've known about dynamic programming since about then. Good work,
James
-- I like your trick.
Post by Raphaël Marinier
Post by Vaughan Johnson
-- V
Post by James Crook
Sorry for the delay in getting back to you on this thread.
If you do use a dynamic programming approach, there is a neat trick I
invented (in the context of DNA sequence matching) that caters for different
kinds of matching. The trick is to run two 'match matrices' at the same
time, and have a penalty for switching between them. This is excellent
where there is a mix of signal and noise, as in your test examples. For
aligning noise you want a fairly sloppy, not very precisely discriminating
comparison that is picking up broad characteristics. What's great about
running two match matrices is that the algorithm naturally switches into
using the best kind of matching for different sections.
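To make the two-matrix idea concrete, here is a toy DTW-style version (my own
sketch, not James's code; the two local costs are arbitrary stand-ins, and it
uses naive O(n*m) storage, which the divide-and-conquer point below is
precisely about avoiding):

    import numpy as np

    def two_matrix_align_cost(feat_a, feat_b, switch_penalty=2.0):
        # Two local costs: a "precise" one and a "sloppy" one that only
        # compares broad loud/quiet character (both are arbitrary examples).
        med_a, med_b = np.median(feat_a), np.median(feat_b)
        def precise(x, y):
            return abs(x - y)
        def sloppy(x, y):
            return 0.25 * abs(float(x > med_a) - float(y > med_b))

        n, m = len(feat_a), len(feat_b)
        # cost[k, i, j]: best cost aligning the first i and j values, with the
        # last comparison scored by measure k (0 = precise, 1 = sloppy).
        cost = np.full((2, n + 1, m + 1), np.inf)
        cost[:, 0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                local = (precise(feat_a[i - 1], feat_b[j - 1]),
                         sloppy(feat_a[i - 1], feat_b[j - 1]))
                for k in (0, 1):
                    best_prev = np.inf
                    for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                        best_prev = min(best_prev,
                                        cost[k, pi, pj],                       # stay
                                        cost[1 - k, pi, pj] + switch_penalty)  # switch
                    cost[k, i, j] = local[k] + best_prev
        return min(cost[0, n, m], cost[1, n, m])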
On storage requirements, these can be reduced dramatically relative to
MATCH, even allowing large time shifts, by a divide and conquer approach.
Instead of allocating space length x max-shift you sample evenly and only
allocate space of k x max-shift for some small value of k such as 100. The
cost is that you have to repeat the analysis log(length-of-sequence) times,
where log is to the base k. So aligning to the nearest 10ms on two 1hr
sequences with a shift of up to 20 mins would take 50Mb storage (if one
match matrix) or 100Mb (with two in parallel), and the analysis would be
repeated 3 times. Because you stay in cache in the analysis and write much
less to external memory it's a big net win both in storage and speed over a
single pass approach.
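A back-of-the-envelope check of those figures (my arithmetic, not James's; the
4-byte cell size is an assumption):

    import math

    frames    = 3600 * 100      # 1 hour at 10 ms resolution -> 360,000 frames
    max_shift = 20 * 60 * 100   # up to 20 min of shift      -> 120,000 frames
    k         = 100             # rows kept per pass
    cell      = 4               # assumed bytes per matrix cell

    print(k * max_shift * cell / 1e6)       # ~48 MB per match matrix
    print(math.ceil(math.log(frames, k)))   # 3 passes (log base k of the length)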
I haven't written versions for sound. This is extrapolating from back in
old times, in the late 80's when I was analysing DNA and protein sequences
on a PC with a fraction of the power and storage of modern PCs. You had to
be inventive to get any decent performance at all. This kind of trick can
pay off in a big way, even today.
I can spell out in more detail if you might go down the dynamic
programming route, as I realise I have been a bit abbreviated in my
description here!
--James.
Thanks for the information.
I did some testing of the MATCH vamp plugin, running it via sonic
analyzer, which integrates it already.
First of all, the algorithm is pretty expensive, and its runtime seems
linear in the max time shift allowed. For aligning two 1h tracks, with a max
allowed time shift of 60s, it takes 6 minutes on a recent processor (Intel
i5-5200U), and takes about 8GB of RAM. Using it for larger time shifts such
as 10 minutes will be quite expensive...
I also tested the quality of the results, to the extent sonic-analyzer
allowed me - it can only report graphical results of the alignment analysis,
but does not actually align the tracks.
(1) 2 identical audio tracks of a recorded concert, with a time-shift of
about 15s between them.
Alignment seems perfect.
(2) 2 identical audio tracks of a recorded concert, except for a 30s hole
filled with pink noise, with a time-shift of about 15s between them.
There are 1-2 second zones at the boundaries of the hole where the audio
is wrongly aligned. This will be quite problematic when building a feature
that allows mixing and matching different versions of each passage.
(3) 2 audio tracks recorded from the same concert (left right channels
from same device), except for a 30s hole filled with pink noise, with a
time-shift of about 15s between them.
Same issues as (2), no new issues.
(4) 2 audio tracks of the same concert, recorded with 2 different devices.
Throughout the match, it finds ratios of tempos that are as divergent as
<0.8 or >1.2 a significant fraction of the time. This is pretty bad since a
correct match should find a tempo ratio of 1 throughout the recording.
Things can be improved using non-default parameters, lowering the cost of
the diagonal to 1.5 and enabling the "path smoothing" feature, but the tempo
ratio still routinely hovers around 0.9 - 1.1.
(5) 2 recordings of two performances of the same composition, time shift
of about 15s, and hole of about 30s.
Default parameters lead to big issues at boundaries around the hole (10s
and 30s of incorrect matches).
However, using a non-default cost for the diagonal again significantly
improves the match by mostly fixing the boundaries around the hole. There is
still a small issue with the first 0.5s of the performance that remains
incorrectly matched.
I cannot really evaluate the match more than that, because sonic-analyzer
just produces the graphs, but does not actually match the tracks.
My conclusion is that the MATCH plugin cannot be used that easily, even
for the simple case of 2 recordings of the same event, because of accuracy
and performance. The former could be fixable by imposing stronger regularity
of the path (e.g. piecewise linear). The latter might be harder.
I propose to start working on an algorithm and feature specific to the
case of 2 recordings of the same event, which is an easier case to start
with both in terms of algorithm and UI.
I also agree that we won't be able to align perfectly, in particular
because of stereo. All we can do is best-effort given the sources. I will
allow for piecewise linear ratios between frequencies (with additional
regularity restrictions), to account for varying clock drifts.
Cheers,
--
Raphaël
On Mon, Jun 27, 2016 at 9:19 AM, Robert Hänggi <
Post by Robert Hänggi
Hi
Incidentally, I've just stumbled over a real-life example where this
alignment would really be of great use to me.
I'm modelling a CD4 demodulation plug-in.
http://forum.audacityteam.org/viewtopic.php?p=307553#p307553
There are also two test (calibration) recordings in this specific
post.
In essence, four tracks are embedded in a single stereo track.
The aim is to reverse-engineer what is in a hardware phono demodulator.
I can demodulate the signal, however, there are some difficulties:
Base left = LFront + LBack (for normal stereo playback)
FM left = LFront - LBack
(ditto for right)
Thus, I can't simply align them until they cancel.
What's more, the frequencies do not match exactly because we have RIAA
in combination with a noise reduction expander, a delay caused by the
low/high pass filter, etc.
In summary, the alignment has to be very exact but at the same time
insensitive to noise, phase & amplitude deviations, and on and on...
For the moment, I will use cross-correlation and least square fitting
for certain "anchor" points.
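For reference, the sum/difference relation Robert describes inverts trivially,
but only once the two demodulated tracks are exactly time- and gain-aligned,
which is precisely the hard part; a sketch with made-up sample values:

    import numpy as np

    base_left = np.array([0.20, 0.50, -0.10])   # LFront + LBack (made-up samples)
    fm_left   = np.array([0.00, 0.30, -0.30])   # LFront - LBack

    # Recover the two discrete channels from the sum/difference pair.
    l_front = 0.5 * (base_left + fm_left)
    l_back  = 0.5 * (base_left - fm_left)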
I look forward to seeing the aligning feature someday implemented in
Audacity. Good luck.
Cheers
Robert
Post by Roger Dannenberg
Excellent point. Also, aligning anything to a stereo track will generate
similar problems. I would suggest that if you're recording with multiple
microphones and devices, you're guaranteed to hit phase and multiple
source problems. In the spirit of the "principle of least surprise" I
would expect an alignment effect to just do a reasonable job given the
sources. E.g. if acoustic sources are spread over 10 meters (~30ms at
the speed of sound), I'd hope individual sources would be aligned within
30ms. If there were a single source, I'd hope for much better.
Another possibility is aligning to multiple tracks representing the same
collection of sound sources recorded from different locations. It's
subtly different from aligning to a single track.
-Roger
Post by James Crook
Something else to think about is what happens if you attempt to align
two mono tracks that happen actually to be left and right audio of a
stereo track.
Federico Miyara
2017-06-20 00:22:08 UTC
Permalink
James,
Post by James Crook
In the case of audio you HAVE to identify 'stretchy' regions of
audio. Silence, white noise, sustained vowel sounds are all good
candidates.
One difficulty is that frequently those candidates are not isolated but
occur simultaneously with non-stretchy events such as transients.
Post by James Crook
Post by Federico Miyara
This method, however, requires some sort of previous translation of
the audio features into symbols, a sort of WAV to MID conversion.
Not strictly so. The method requires that you can score the
similarity/difference between two short segments of sound. This may
or may not involve a 'translation' into a more symbolic representation.
Also the 'translation' does not have to be locked in, in the sense
that you can provide multiple alternative translations, and the
alignment selects between them in doing the alignment.
You are right. When I said symbols I really meant (quantitative)
features. Those segments are replaced by some features previously
detected by a front-end processor, and those features are compared. One
of the important features is pitch; others may be spectral features such
as formants, cepstrum, etc.
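As a small illustration of scoring the similarity of two short segments
directly from quantitative features, with no symbolic transcription (my
sketch, not anything from the thread; the particular features, RMS and
spectral centroid, are just examples):

    import numpy as np

    def segment_features(seg, rate):
        # Two cheap features for a short segment: RMS level and spectral centroid.
        spectrum = np.abs(np.fft.rfft(seg * np.hanning(len(seg))))
        freqs = np.fft.rfftfreq(len(seg), d=1.0 / rate)
        centroid = (freqs * spectrum).sum() / (spectrum.sum() + 1e-12)
        return np.array([np.sqrt(np.mean(seg ** 2)), centroid / (rate / 2)])

    def similarity(seg_a, seg_b, rate):
        # Higher is more similar; the alignment only needs this score.
        return -np.linalg.norm(segment_features(seg_a, rate) - segment_features(seg_b, rate))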

Regards,

Federico

Roger Dannenberg
2017-06-19 13:34:41 UTC
Permalink
Post by Federico Miyara
I'm curious about what other applications performance alignment would have...
Federico,
Are you familiar with my work on computer accompaniment? This is
closely related to performance alignment (one performance is typically
in a symbolic representation). See SmartMusic, which was based on my
patent, and there are at least a couple of other products for computers
accompanying live musicians and/or turning pages on music displays.
Another application is music search, e.g.

Hu, Dannenberg, and Tzanetakis. “Polyphonic Audio Matching and Alignment
for Music Retrieval
<http://www.cs.cmu.edu/%7Erbd/bib-musund.html#waspaa03alignment>,” in
/2003 IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics/, New York: IEEE (2003), pp. 185-188.

Also, we've used alignment to scores to find note onsets and use
that as training data for onset detection:

Hu and Dannenberg, “Bootstrap Learning for Accurate Onset Detection
<http://www.cs.cmu.edu/%7Erbd/bib-beattrack.html#bootstrap>," /Machine
Learning/ 65(2-3) (December 2006), pp. 457-471.

The idea of using scores and alignment to guide signal processing tasks
is sometimes called "score-informed", e.g.

Woodruff, Pardo, and Dannenberg, “Remixing Stereo Music with
Score-Informed Source Separation
<http://www.cs.cmu.edu/%7Erbd/subjbib2.html#remix-ismir06>,” in
/Proceedings of the 7th International Conference on Music Information
Retrieval/, Victoria, BC, Canada, October 2006, pp. 314-319.

Finally, there has been a lot of discussion and thinking about using
alignment of multiple takes in audio editors (including Audacity) to
line up multiple "takes" where a click track is not present, as a first
step to comparing takes, selecting the best elements, and producing an
edited combination of takes:

Liu, Dannenberg, and Cai, “The Intelligent Music Editor: Towards an
Automated Platform for Music Analysis and Editing
<http://www.cs.cmu.edu/%7Erbd/bib-musund.html#IME-ICIC-2010>,” in
/Proceedings of the Seventh International Conference on Intelligent
Computing/, Cairo, Egypt, December 2010, pp. 123-131.
Federico Miyara
2017-06-19 23:36:29 UTC
Permalink
Roger,

Thank you, I wasn't aware of these quite interesting applications!

Regards,

Federico
Federico Miyara
2016-06-26 23:10:53 UTC
Permalink
Raphael,

If the sample rate is derived from a crystal oscillator (as I think is
the case for the vast majority of A/D converters), the following link

http://kunz-pc.sce.carleton.ca/thesis/CrystalOscillators.pdf

lists a number of causes of frequency drift, for instance
temperature, warm-up, hysteresis, and aging. Temperature variations seem to
be the most relevant cause of short-term drift.

In the very worst case (a very cheap crystal) we may be around the 0.01 %
mentioned by Roger Dannenberg (though he may have been talking of
nominal frequency offset errors). Assuming that the temperature
variation is bounded to about 20 ºC during recording, we would have a
variation of at most 0.005 %. This means a drift of about 90 ms per hour
(assuming a steady increase of temperature with time), so the effect may
actually be relevant.
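A quick check of the 90 ms figure (my arithmetic; the linear ramp from zero to
the maximum error is the assumption Federico states):

    max_error  = 0.005 / 100    # 0.005 % frequency error reached at the end of the hour
    duration_s = 3600.0

    # With the error growing linearly from 0, the accumulated drift is the
    # time-integral of the error, i.e. half of max_error * duration.
    drift_ms = 0.5 * max_error * duration_s * 1000.0
    print(f"{drift_ms:.1f} ms per hour")   # -> 90.0 ms per hour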

Using not-so-cheap converters which include temperature-compensated
crystals, we are probably at least one order of magnitude below that,
i.e., about 10 ms per hour. It may still be too much for certain
applications.

This would indicate that it is worth doing dynamic synchronization.

Note that as figure 3 of the linked article suggests, due to
manufacturing tolerance of the cutting angle of the crystal, two
particular recorders may have opposite drifts with temperature. The best
one should be used as the master.

Regards,

Federico
Post by Raphaël Marinier
Varying sampling rate is indeed an issue that will need to be taken
care of. Do the actual frequencies of multiple devices tend to only
differ by a constant multiplier (e.g. constant 44100 vs constant
44150), or is it common to have small variations of sampling rate in a
recording from a single device (e.g. device first records at 44100,
and then drifts to 44150)? The former is of course easier to solve.
James, thanks for the background and advice.
Indeed the "Audio Diff" proposal is more general. It also seems quite
harder, at least because of the variations in the way of playing,
speed and potential gaps, as you mentioned, and because of all the UI
questions around the handling of imprecise and partial matches, time
expansion, errors, etc. Also, the algorithms will of course be more
generic and complex than for aligning two recordings of the same
performance. I had a quick look at the MATCH paper
<http://www.eecs.qmul.ac.uk/%7Esimond/pub/2005/ismir05.pdf>, and the
max errors for commercial recordings on page 5 show that the
algorithm is far from perfect.
I'll have a look into the MATCH plugin and do some tests. Do you think
there would be space for both features: (1) Simple alignment of N
recordings of the same sound (my original proposal) (2) Audio Diff,
with advanced UI to visualize and work with diffs? Is there any other
software doing (2), so that we can have an idea of the user experience?
Raphaël
Hi Audacity developers,
One feature that I have been missing in Audacity is automatic
time-syncing of two audio tracks.
The use case is when one has multiple recordings of the same
event, not time-synced, and wants to
align them together. This happens for instance when the two
tracks come from multiple devices
(e.g. video camera and portable audio recorder).
Right now, the user has to manually time-shift tracks to make
sure they align, which is cumbersome
and imprecise.
Well, time shift is not the only problem, since most recordings are not at the
same frequency even if they have the same nominal frequency. 44100 and 48000
are obvious, but 44100 and 44150 are far more possible with standard consumer
grade sound cards. Of course one could break up the item into blocks, and time
shift each one. (For example, it would take about 800 sec for the above two
frequencies to be out by 1 sec in their time sync, so time-shifting once a
second could be done. But even then a dropping or adding of 50 frames would
surely be noticeable.) I.e., one should also do frequency shifting as well if
it were to work. One could of course do time shift at the beginning and the
end of a block and use the difference to also implement a freq shift.
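That last idea, sketched with made-up numbers (the measured shifts would come
from cross-correlating short excerpts at each end of the block):

    # Hypothetical measured time shifts (s) at the start and end of a 600 s block.
    shift_start, shift_end = 0.120, 0.800
    block_len = 600.0

    # The change in shift across the block gives the relative clock error,
    # which resampling by this ratio would correct.
    freq_ratio = 1.0 + (shift_end - shift_start) / block_len
    print(f"{freq_ratio:.5f}")   # ~1.00113, i.e. about 0.11 % (roughly 44100 vs 44150)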
Roger Dannenberg
2016-06-27 00:27:16 UTC
Permalink
I don't have extensive experience or measurements, but I believe
frequencies tend to differ mainly by a scale factor because they're all
crystal controlled, and the error comes from uncalibrated inexpensive
crystals. However, inexpensive devices are also not thermally
compensated, so if you turn on a cold cheap converter and it gets hot,
you should expect a small drift over that time period while it warms up.
Once things warm up, they'll still drift according to power supplies,
phase of the moon, etc., but I think the variation will be an order of
magnitude less than the calibration and warm-up effects. -Roger