Chapter 12: Where should you run automated analyses?

If you use LENA, most often you will have purchased one of the licenses that allows you to do your analyses on the cloud. This is great because it means that you don’t have to have a particularly powerful computer, and you’ll probably benefit from software and algorithm updates without even noticing. One downside, however, is that certain laws restrict storage and sharing of data across national boundaries, so if you are not based in the USA, you should consult local regulations to see if it is nonetheless possible for you to use LENA’s cloud-based service.

Another option within LENA is to purchase their XP license. LENA does not recommend this, because the license only works on machines running Windows 7 or 10, and the software is no longer being updated. This means you either need to keep a computer that always runs one of these operating systems, or use a virtual machine system (see Resources for an explanation), which in our experience is not always easy to set up.

If you are running open-source software, like the ACLEW options we discuss in another video, you have a choice between running your analyses in the cloud, on a cluster, or locally on your computer. If you’re not sure what these words mean, check out our resources for more information – but in a nutshell, when I talk about cloud-based services, I’m talking about Amazon Web Services, OVH, and other companies that have made their computing power available over the web. That means that in order to use them, you need to transfer your data to the cloud. This does not mean that you are sharing it or making it public: the transfer can be secure, and so can the storage. So ask around in your institution before you decide against this option.
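For instance, here is a minimal sketch of what a secure transfer from the shell can look like, assuming an rsync-over-SSH setup; the server address and paths are placeholders you would replace with your own:

    # Copy a folder of recordings to a cloud machine over SSH
    # (encrypted in transit); the address and paths are placeholders.
    rsync -av --progress ./recordings/ user@cloud.example.org:/data/recordings/

    # Re-run with checksums (-c) and --dry-run to verify the files arrived intact.
    rsync -avc --dry-run ./recordings/ user@cloud.example.org:/data/recordings/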

The second word I used was “cluster” – in a nutshell, this just means a group of computers hooked up together. Your institution or your country may have cluster services that you can consider. Here too, you’ll have to make sure that both the transfer and the storage are secure. For instance, you may want to ask whether anyone else will have access to the data while it is stored there.
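As a concrete example, here is one way you might check and tighten who can read your files on a shared cluster from the shell; the directory path is a placeholder:

    # Show the permissions on the data directory (owner, group, others).
    ls -ld /scratch/myproject/recordings

    # Restrict access so that only you (the owner) can read, write, or enter it.
    chmod 700 /scratch/myproject/recordings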

Returning to cloud-based services: these do require a bit of technical expertise to set up, and you do need to check local regulations, but we think it is very likely that if you are based in the Global North, you’ll be able to find a cloud-based service that has everything you need. We know less about the Global South, but if your local regulations allow transfer of data to Europe, you can probably use a European-based cloud service, which should fulfill all regulations and ethical constraints. The Echolalia team, who are based in the same department as we are and are also working on this topic, looked most thoroughly into Amazon Web Services (AWS for short). If you are in Europe, this is an option that is GDPR-compliant, and in the USA it is HIPAA-compliant. In other locations, please check local regulations, or get in touch to see if we have more information on this.

If you cannot use cloud-based services, the next best alternative is a local cluster. Many institutions have these, so you can reach out to your local IT department to ask. This option also requires a bit of technical background, so you probably want to collaborate with someone knowledgeable to get the analysis pipeline set up. Once everything is installed, running the ACLEW analyses involves simply copy-pasting a series of commands (see the sketch below), so you just need minimal knowledge of the Unix shell. If you don’t know what that is, there are short courses – in our experience, people who already know another computer language (like Python or R) learn bash basics in less than an hour. If you don’t know any other language, it may take you 3-4 hours, so it’s still not very time consuming. We provide a link to a tutorial in the resources section.
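To give you a flavor of what “copy-pasting a series of commands” looks like, here is a hypothetical shell session; the folder and script names are invented for illustration, so the actual ACLEW documentation will differ:

    # Move into the folder that holds the analysis pipeline (hypothetical name).
    cd ~/aclew-pipeline

    # List the files to check you are in the right place.
    ls

    # Run a (hypothetical) script over a folder of recordings.
    bash run_analysis.sh ./recordings/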

The big advantage of those two options is that you’ll be relying on powerful machines, which will make processing faster. If neither of those works, you’ll be left with only one option: running these tools on your personal computer. You may think this is a good option because then you don’t need to move your data around. However, as we explain in the video on storage and backup, it is a very bad idea to have only one copy of your data, as it may be accidentally destroyed or damaged, so you need several copies anyway. Running the software on your personal computer is not a great option for at least two reasons.

First, you still need quite a bit of knowledge to get everything properly installed. We have had myriad problems trying to get this done, as each machine has a slightly different operating system and different software already installed. You also need more technical knowledge than in the cluster case – not just how to use a shell (assuming that, on a cluster, someone else helps you install things). Getting these tools installed on your machine will probably require general informatics knowledge, and lots of patience, as you look up errors and try out recommended solutions.
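As a rough illustration of what “getting everything installed” tends to involve, here is a hypothetical sequence; the repository URL and file names are placeholders, not the actual installation instructions for any of these tools:

    # Fetch the tool's source code (placeholder URL).
    git clone https://github.com/example/some-tool.git
    cd some-tool

    # Create an isolated software environment from a file shipped with the
    # tool (a common pattern; the file and environment names vary by tool,
    # and the environment name here is assumed).
    conda env create -f environment.yml
    conda activate some-tool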

The second disadvantage is that even when you’ve installed everything, the trouble is not over, as it will take quite a bit of time to actually process the data. We haven’t thoroughly benchmarked ALICE, but the table below shows how long the voice type classifier takes to process files, expressed as a proportion of audio duration. Batch size is the number of sequences your computer processes in parallel; increasing it can decrease the running time, but it increases the memory consumption of the program. Personal computers typically have only a CPU, not a dedicated GPU. In a nutshell, it’ll take about 4h to process a 16h recording on your personal computer (see the worked example after the table). The precise numbers, however, will change depending on the specifications of your personal computer.

Processing time, as a proportion of audio duration:

batch_size    GPU     CPU
32            1/35    0.27
64            1/45    0.26
128           OOM*    0.25
256           OOM     0.24

Note:
*OOM stands for “out of memory”, i.e., the batch was too large to fit in the GPU’s memory.
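To turn these numbers into an estimate for your own corpus, multiply the recording duration by the relevant factor from the table. For example, on a CPU at batch size 32, a 16-hour recording takes about 16 × 0.27 ≈ 4.3 hours. Here is a one-line shell sketch of the same arithmetic:

    # Estimated CPU processing time for a 16-hour recording,
    # using the 0.27 factor (batch size 32) from the table above.
    awk 'BEGIN { hours = 16; factor = 0.27; print hours * factor, "hours" }'
    # Prints: 4.32 hours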

Of course, if you are in this situation, you can always try! We do know some people who were patient enough to run VTC on their 100+ files on a MacBook, which took about two weeks. After all, you will ideally run this process only once, so it may make sense to do this if setting up cloud- or cluster-based computing would take you even longer.