
Error during fbank computing after the 1.9 release #835

Closed
teowenshen opened this issue Oct 4, 2022 · 2 comments · Fixed by #844
@teowenshen (Contributor)

When computing fbank features, the newer version of lhotse fails inside compute_and_store_features, called from my compute_fbank_xxx.py script. I suspect it has something to do with the cut.py refactor.

The error message is below.

    root@2313535c79e8:/workspace/icefall/egs/csj/ASR# python local/compute_fbank_tedxjp10k.py --manifest-dir data/manifests --fbank-dir /mnt/minami_data_server/t2131178/corpus/TEDxJP-10K_v1.1/fbank
    2022-10-04 10:22:37,936 INFO [compute_fbank_tedxjp10k.py:37] Manifests read.
    2022-10-04 10:22:37,951 INFO [compute_fbank_tedxjp10k.py:45] Processing new
    /opt/conda/lib/python3.7/site-packages/lhotse/lazy.py:395: UserWarning: A lambda was passed to LazyMapper: it may prevent you from forking this process. If you experience issues with num_workers > 0 in torch.utils.data.DataLoader, try passing a regular function instead.
      "A lambda was passed to LazyMapper: it may prevent you from forking this process. "
    Traceback (most recent call last):
      File "local/compute_fbank_tedxjp10k.py", line 114, in <module>
        main()
      File "local/compute_fbank_tedxjp10k.py", line 106, in main
        num_mel_bins = 80
      File "local/compute_fbank_tedxjp10k.py", line 65, in compute_fbank_csj
        storage_type=ChunkedLilcomHdf5Writer
      File "/opt/conda/lib/python3.7/site-packages/lhotse/cut/set.py", line 1455, in compute_and_store_features
        cut_sets = self.split(num_jobs, shuffle=True)
      File "/opt/conda/lib/python3.7/site-packages/lhotse/cut/set.py", line 541, in split
        self, num_splits=num_splits, shuffle=shuffle, drop_last=drop_last
      File "/opt/conda/lib/python3.7/site-packages/lhotse/utils.py", line 331, in split_sequence
        seq = list(seq)
      File "/opt/conda/lib/python3.7/site-packages/lhotse/cut/set.py", line 1988, in __len__
        return len(self.cuts)
      File "/opt/conda/lib/python3.7/site-packages/lhotse/lazy.py", line 231, in __len__
        return sum(len(it) for it in self.iterators)
      File "/opt/conda/lib/python3.7/site-packages/lhotse/lazy.py", line 231, in <genexpr>
        return sum(len(it) for it in self.iterators)
      File "/opt/conda/lib/python3.7/site-packages/lhotse/lazy.py", line 432, in __len__
        "LazyFlattener does not support __len__ because it would require "
    NotImplementedError: LazyFlattener does not support __len__ because it would require iterating over the whole iterator, which is not possible in a lazy fashion. If you really need to know the length, convert to eager mode first using .to_eager(). Note that this will require loading the whole iterator into memory.
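The failure mode can be reproduced with a minimal stand-in for a lazy flattening wrapper (the `LazyFlat` class below is a hypothetical sketch, not lhotse's actual `LazyFlattener`): a wrapper over generators cannot answer `len()` without a full pass over the data.

```python
class LazyFlat:
    """Minimal stand-in (hypothetical, not lhotse's code) for a lazy
    wrapper that flattens several iterables into one stream."""

    def __init__(self, iterators):
        self.iterators = iterators

    def __iter__(self):
        for it in self.iterators:
            yield from it

    def __len__(self):
        # Answering len() would require a full pass over the data,
        # which defeats the point of laziness.
        raise NotImplementedError("no len in lazy mode; use to_eager()")

    def to_eager(self):
        # Materialize into memory. A comprehension is used instead of
        # list(self): list() probes __len__ as a length hint, which
        # would re-raise the error above.
        return [item for item in self]


lazy = LazyFlat([range(3), range(2)])
try:
    len(lazy)  # raises, like LazyFlattener in the traceback
except NotImplementedError:
    pass
eager = lazy.to_eager()
assert len(eager) == 5  # an eager list supports len() normally
```

Note that `list(seq)` in the traceback fails for the same reason: the list constructor asks the source for a length hint via `__len__`, which propagates the `NotImplementedError`.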

Below is my code. My supervisions and recordings are simple JSON files, so they should not be loaded lazily. I am not sure if there is a way to circumvent this lazy loading.

    manifests = read_manifests_if_cached(
        dataset_parts=FULL_DATA_PARTS,
        output_dir=manifest_dir,
        prefix="tedxjp10k",
        suffix="json",
    )
    assert manifests

    logging.info("Manifests read.")

    with get_executor() as ex:  # Initialise the executor only once
        for partition, m in manifests.items():
            if (manifest_dir / f"tedxjp10k_cuts_{partition}.jsonl.gz").is_file():
                logging.info(f"{partition} already exists - skipping.")
                continue

            logging.info(f"Processing {partition}")
            cut_set = CutSet.from_manifests(
                recordings=m["recordings"],
                supervisions=m["supervisions"],
            )

            cut_set = cut_set.trim_to_supervisions(keep_overlapping=False)

            if "train" in partition:
                cut_set = (
                    cut_set
                    + cut_set.perturb_speed(0.9)
                    + cut_set.perturb_speed(1.1)
                )

            cut_set = cut_set.compute_and_store_features(
                extractor=extractor,
                storage_path=(fbank_dir / f"feats_{partition}").as_posix(),
                num_jobs=num_jobs if ex is None else 80,
                executor=ex,
                storage_type=ChunkedLilcomHdf5Writer,
            )
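For context on where the chain breaks: per the traceback, compute_and_store_features first calls split(num_jobs, ...), and splitting a sequence into roughly equal chunks requires materializing it and knowing its total length. A rough sketch of that kind of splitting follows (illustrative only; lhotse's real split_sequence differs in detail, e.g. drop_last handling):

```python
import random


def split_sequence(seq, num_splits, shuffle=False):
    """Carve a sequence into num_splits roughly equal chunks.

    Illustrative sketch only, not lhotse's actual implementation.
    """
    seq = list(seq)  # this is the step that fails for a lazy CutSet
    if shuffle:
        random.shuffle(seq)
    n = len(seq)  # chunk sizes require knowing the total length
    chunk = (n + num_splits - 1) // num_splits  # ceiling division
    return [seq[i * chunk : (i + 1) * chunk] for i in range(num_splits)]


parts = split_sequence(range(10), num_splits=3)
assert [len(p) for p in parts] == [4, 4, 2]
```

So the parallelization machinery, not the feature extraction itself, is what demands a known length.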

Below is my lhotse version.
lhotse 1.9.0.dev0+git.97bf4b0.clean pypi_0 pypi

It is okay if this cannot be easily resolved. For now, I am computing fbanks with an older Docker image from before the cut.py refactor.

@pzelasko (Collaborator) commented Oct 4, 2022

Ah, I see what the issue is. I should fix it in Lhotse so that compute_and_store_features works with lazy manifests. In the meantime, you can fix it on your end by adding this line just before computing features: cut_set = cut_set.to_eager()
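The shape of that workaround can be mimicked in self-contained Python (the LazyCuts class and compute_features function below are hypothetical stand-ins, not lhotse's API beyond the to_eager name): convert once, before the first operation that needs len(), and keep the eager result. The trade-off is that the whole manifest is loaded into memory.

```python
class LazyCuts:
    """Hypothetical stand-in for a lazily-evaluated cut manifest."""

    def __init__(self, source):
        self._source = source  # a factory returning a fresh iterator

    def __iter__(self):
        return iter(self._source())

    def __len__(self):
        raise NotImplementedError("lazy manifests have no cheap len()")

    def to_eager(self):
        # One full pass; afterwards len(), shuffling, and splitting
        # all become possible at the cost of holding everything in RAM.
        return [c for c in self]


def compute_features(cuts, num_jobs):
    # Stand-in for compute_and_store_features: the first thing the
    # real method does is split the cut set, which needs len(cuts).
    chunk = (len(cuts) + num_jobs - 1) // num_jobs
    return [cuts[i : i + chunk] for i in range(0, len(cuts), chunk)]


cut_set = LazyCuts(lambda: (f"cut-{i}" for i in range(6)))
cut_set = cut_set.to_eager()  # the suggested one-line fix
jobs = compute_features(cut_set, num_jobs=2)
assert len(jobs) == 2 and len(jobs[0]) == 3
```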

@teowenshen (Contributor, Author)

Thanks! Converting it to eager worked.
