Chapter 6 Tutorial of our ChildProject software

The ChildProject package is one part of a three-part solution to the difficulties faced when working with long-form recordings. The other two parts are DataLad and git-annex, which we do not detail here.

ChildProject is a package to manage daylong recordings. It is written in Python, but it can also be used from the command line, so no knowledge of Python is required. Note, however, that this does presuppose that you know your way around a terminal. If you have no idea what that means, we strongly encourage you to take the free online tutorial by Software Carpentry linked in the Resources section. As we said in another video, people who know another computer language (like Python or R) learn bash basics in less than an hour. And if you don’t know other languages, it may take you 3-4 hours, so it’s still not very time-consuming.

Within ChildProject, we have defined standards regarding the structure of the data, which we explained in previous videos. The package helps enforce these standards with tests. It also provides tools to convert annotations from a variety of input formats (LENA, ELAN, Praat, ALICE/VTC, etc.). And it has further functionalities, including batch audio conversion, sampling of the audio data for annotation, generation of files for annotation using ELAN, anonymization of LENA’s .its files, and more!

All of this power comes from having defined very clear standards as to how your dataset is structured. This means that you need to organize it in a specific way. We explain how to organize your data in Video 10. With ChildProject, you can check whether you have actually organized it correctly by running a simple command: child-project validate.
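To give an idea of what a structure check involves, here is a minimal sketch in Python. It is not ChildProject’s own implementation, and the function name missing_paths is hypothetical. The list of required paths is an assumption made for illustration only; the layout that is actually required is the one described in Video 10 and in the ChildProject documentation.

```python
from pathlib import Path
import tempfile

# Assumed for illustration only: the authoritative layout is the one
# described in Video 10 and in the ChildProject documentation.
REQUIRED_PATHS = [
    "metadata/children.csv",
    "metadata/recordings.csv",
    "recordings/raw",
]


def missing_paths(dataset_root, required=REQUIRED_PATHS):
    """Return the required relative paths that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [rel for rel in required if not (root / rel).exists()]


# Demo on a throwaway dataset that deliberately lacks metadata/recordings.csv.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "metadata").mkdir()
    (root / "metadata" / "children.csv").write_text("child_id\n")
    (root / "recordings" / "raw").mkdir(parents=True)
    demo_missing = missing_paths(root)
    print(demo_missing)
```

The real validation goes further than checking that files exist (for instance, it also tests the content of the metadata), so this sketch only conveys the general idea.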

The package also provides some commands to import annotations into a standardized .csv format. With these commands, you can get the package to convert raw annotations (from humans, LENA, etc.) to CSV dataframes, and index them in metadata/annotations.csv. This makes it easy to then do other manipulations, like finding which sections of the audio are in common between two human annotators, or between LENA and a human annotator, for instance to calculate reliability.
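To make the idea of “sections in common” concrete, here is a minimal sketch in Python of how the overlap between two annotators’ coverage can be computed. It does not use ChildProject’s own functions. The names merge and intersect, and the representation of each annotated section as an (onset, offset) pair in milliseconds, are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: one annotated section = (onset_ms, offset_ms).
Segment = Tuple[int, int]


def merge(segments: List[Segment]) -> List[Segment]:
    """Sort segments and merge those that overlap or touch."""
    merged: List[Segment] = []
    for onset, offset in sorted(segments):
        if merged and onset <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged


def intersect(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Return the portions of audio covered by both a and b."""
    a, b = merge(a), merge(b)
    i = j = 0
    common: List[Segment] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # keep only non-empty overlaps
            common.append((start, end))
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return common


annotator_1 = [(0, 60_000), (120_000, 180_000)]
annotator_2 = [(30_000, 150_000)]
common = intersect(annotator_1, annotator_2)
print(common)  # [(30000, 60000), (120000, 150000)]
```

Reliability would then be calculated only on the sections returned by intersect, since those are the only portions of audio that both annotators covered.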

Another reason to adopt our framework is that, beyond letting you perform such computations and enrich your dataset, it lets you keep track of the changes you make to your dataset. For this, we use DataLad.

DataLad is a decentralized system for integrated discovery, management, and publication of digital objects of science. It builds on git-annex, which builds on git. In a nutshell, git is a system for version control, which keeps a history of all the files you put in a place and all the changes made to them. If you’ve used the history in Google Drive or Dropbox, it’s a bit like that, except that you have more control over how the versions are defined, and you can add comments as to how they differ from previous ones.

One limitation of git is that it does not handle large or binary files very well: it has a hard time finding changes in these kinds of files. git-annex is an extension of git that handles large files.

DataLad is built on git-annex, and adds abstraction and more features, like analysis reproducibility, dataset nesting, and others. We won’t cover all of these in this series because once you can understand the principles of dataset nesting, you’re ready to take on purely written instructions. So for these more “advanced” operations, we recommend checking the ChildProject documentation and the Discussion section, both of which are linked in the Resources section.

One last thing: if you use our framework, you will also find really easy-to-follow instructions to create a private archive on GIN, the G-Node Infrastructure, which is a university-led project trying to help neuroscientists use modern data management techniques. We have been lucky enough that they have allowed us to host our datasets with them. If these grow very large, we may need to pay some costs. But for now, it is an amazing resource because it has a git-annex-compatible interface, and its servers are in Germany and are GDPR-compliant.

6.1 Resources