10.1 Creating many_lists_lenght from many_lists

many_lists is a nested list (a list of lists) that was created in one of the earlier code blocks in your Google Colab notebook for week 7’s lab session:

files = ['AEE31160.1.fa', 'NP_194967.1.fa', 'NP_057185.1.fa', 'NP_171654.1.fa'] #these are our file IDs
many_lists = [] #this is a list that will contain the data from each file

for i in files:
    currentFile = [] #This temporary variable stores the result from each file, and is reset at the start of each loop iteration
    curFile = open(i, 'r').readlines() #read the file as a list of strings. Each list element is a row in the file
    currentFile += [curFile[0].split(" ")[0][1:]] #Grab the protein ID (accession), add the string to currentFile
    currentFile += [curFile[0].split(".1 ")[1].split(" [")[0]] #Grab annotation...
    currentFile += [curFile[0].split("[")[1][:-2]] #Grab the organism...

    temp = ''
    for row in curFile[1:]: #the protein sequence starts from row 2 and spans multiple rows, hence [1:] (from 2nd line on, do...)
        temp+=row.rstrip() #each sequence in a row ends with a newline \n character, which we remove with .rstrip()

    currentFile+=[temp] #we add the whole protein sequence to currentFile
    ##currentFile now contains [accession, annotation, organism, sequence]
    many_lists.append(currentFile) #we then save the result from current file to master list


print('This is the many_lists list')
print(many_lists)

And many_lists takes on the following form:

[[accession, annotation, organism, sequence], [accession, annotation, organism, sequence], ...]
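To see how the three header-parsing lines in the loop above pull out the accession, annotation and organism, here is a minimal sketch on a single header line. The header string is made up for illustration (it is not taken from the lab files), and the sketch assumes the header follows the same `>accession.1 annotation [organism]` layout that the loop relies on.

```python
# Hypothetical FASTA header line, made up for illustration (not from the lab files)
header = ">XX_000001.1 example protein [Example organism]\n"

accession = header.split(" ")[0][1:]                # first word, minus the leading ">"
annotation = header.split(".1 ")[1].split(" [")[0]  # text between ".1 " and " ["
organism = header.split("[")[1][:-2]                # text after "[", minus the trailing "]" and "\n"

print(accession)   # XX_000001.1
print(annotation)  # example protein
print(organism)    # Example organism
```

Note that the `[:-2]` slice only works when the header ends with `]` followed by a newline; a header without the trailing newline would lose one extra character.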

Here is the problem that we need to solve: we want to create another variable, many_lists_lenght, that takes on the following form:

[[accession, sequence length, sequence], [accession, sequence length, sequence], ...]

In other words, there is no need to modify the many_lists variable itself (you will also need to reuse this variable later on)! Instead, we can apply our knowledge of list indices to build many_lists_lenght!
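As a quick refresher on the list indices involved, here is a small sketch using one made-up record in the same [accession, annotation, organism, sequence] layout. The values are hypothetical and only there to show what each index returns.

```python
# Hypothetical record in the [accession, annotation, organism, sequence] layout
record = ["XX_000001.1", "example protein", "Example organism", "MKTAYIAK"]

print(record[0])        # first element, the accession: XX_000001.1
print(record[-1])       # last element, the sequence: MKTAYIAK
print(len(record[-1]))  # number of characters in the sequence: 8
```

Reading elements this way never changes the record itself, which is why many_lists stays intact.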

From inspection, we need the first and the last element of each sublist of many_lists, plus the length of the last element. We can get all three using a for loop:

for i in many_lists:
  accession, sequence, sequenceLength = i[0], i[-1], len(i[-1])
  # Rest of my code here...

But before filling in the rest, let’s not forget to define the many_lists_lenght variable as an empty list first:

many_lists_lenght = []

for i in many_lists:
  accession, sequence, sequenceLength = i[0], i[-1], len(i[-1])
  # Rest of my code here...

Since many_lists_lenght is a list of lists (just like many_lists), we can create a temporary list, tempList, to hold the contents of each sublist. We fill tempList using the .append() method of the list class, and then re-use the same method to store tempList in many_lists_lenght:

many_lists_lenght = []

for i in many_lists:
  tempList = []
  accession, sequence, sequenceLength = i[0], i[-1], len(i[-1])
  tempList.append(accession)
  tempList.append(sequenceLength)
  tempList.append(sequence)
  many_lists_lenght.append(tempList)

And there is the first part of the problem done! Granted, the code above is a little verbose, but it does get the job done!
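If you would like to convince yourself that the loop behaves as described, the sketch below runs it on a small stand-in for many_lists. The two records are made up for illustration (they are not the lab data), and the `original` snapshot is an extra name introduced here only to check that many_lists is left untouched.

```python
import copy

# Hypothetical stand-in for many_lists (made-up records, not the lab data)
many_lists = [
    ["XX_000001.1", "example protein", "Example organism", "MKTAYIAK"],
    ["XX_000002.1", "another protein", "Other organism", "MSLQ"],
]
original = copy.deepcopy(many_lists)  # snapshot to compare against afterwards

many_lists_lenght = []

for i in many_lists:
    tempList = []
    accession, sequence, sequenceLength = i[0], i[-1], len(i[-1])
    tempList.append(accession)
    tempList.append(sequenceLength)
    tempList.append(sequence)
    many_lists_lenght.append(tempList)

print(many_lists_lenght)
# [['XX_000001.1', 8, 'MKTAYIAK'], ['XX_000002.1', 4, 'MSLQ']]
print(many_lists == original)
# True, so many_lists was not modified
```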

10.1.1 A more concise approach

Interestingly, the same problem can also be solved in one line of code:

many_lists_lenght = [[i[0], len(i[-1]), i[-1]] for i in many_lists]

Here, the general idea is the same as described above, but a list comprehension is used to make things more concise!
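As an optional variation, the comprehension can also unpack each sublist into named variables instead of using indices, which some people find easier to read. This is only a sketch on made-up records (not the lab data), and the unpacking form assumes every sublist has exactly four elements; the `unpacked` name is introduced here just for the comparison.

```python
# Hypothetical stand-in for many_lists (made-up records, not the lab data)
many_lists = [
    ["XX_000001.1", "example protein", "Example organism", "MKTAYIAK"],
    ["XX_000002.1", "another protein", "Other organism", "MSLQ"],
]

# Index-based comprehension, as in the one-liner above
many_lists_lenght = [[i[0], len(i[-1]), i[-1]] for i in many_lists]

# Same result with unpacking; "_" marks the fields we do not need
unpacked = [[accession, len(sequence), sequence]
            for accession, _, _, sequence in many_lists]

print(unpacked == many_lists_lenght)  # True
```

The index-based version is a little more forgiving, since it keeps working even if a sublist gains extra fields between the first and the last element.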