Let’s add one final complication. Suppose that molecules didn’t have END markers but instead just a COMPND line followed by one or more ATOM lines. How would we read multiple molecules from a single file in that case?
| COMPND AMMONIA |
| ATOM 1 N 0.257 -0.363 0.000 |
| ATOM 2 H 0.257 0.727 0.000 |
| ATOM 3 H 0.771 -0.727 0.890 |
| ATOM 4 H 0.771 -0.727 -0.890 |
| COMPND METHANOL |
| ATOM 1 C -0.748 -0.015 0.024 |
| ATOM 2 O 0.558 0.420 -0.278 |
| ATOM 3 H -1.293 -0.202 -0.901 |
| ATOM 4 H -1.263 0.754 0.600 |
| ATOM 5 H -0.699 -0.934 0.609 |
| ATOM 6 H 0.716 1.404 0.137 |
At first glance, it doesn’t seem much different from the problem we just solved: read_molecule could extract the molecule’s name from the COMPND line and then read ATOM lines until it got either an empty string signaling the end of the file or another COMPND line signaling the start of the next molecule. But once it has read that COMPND line, the line isn’t available for the next call to read_molecule, so how can we get the name of the second molecule (and all the ones following it)?
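The difficulty can be seen in a small sketch. The function `read_molecule_naive` below is hypothetical (it is not part of the program being developed), and `io.StringIO` stands in for an open file. It stops when it sees the next COMPND line, but that line has already been consumed, so the second call starts in the wrong place:

```python
import io

def read_molecule_naive(reader):
    """Hypothetical naive reader: stops at the next COMPND line,
    but has no way to hand that line back to its caller."""
    line = reader.readline()
    if not line:
        return None
    molecule = [line.split()[1]]
    line = reader.readline()
    while line and not line.startswith('COMPND'):
        fields = line.split()
        molecule.append(fields[2:6])  # atom type and x, y, z
        line = reader.readline()
    # If line holds the next molecule's COMPND line, it is lost here.
    return molecule

data = io.StringIO(
    'COMPND AMMONIA\n'
    'ATOM 1 N 0.257 -0.363 0.000\n'
    'COMPND METHANOL\n'
    'ATOM 1 C -0.748 -0.015 0.024\n'
)
first = read_molecule_naive(data)
second = read_molecule_naive(data)
print(first[0])   # AMMONIA
print(second[0])  # 1, taken from the ATOM line, not METHANOL
```

The second call treats the first ATOM line of methanol as if it were a COMPND line, so the molecule's name is never seen.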
To solve this problem, our functions must always “look ahead” one line. Let’s start with the function that reads multiple molecules:
| from typing import TextIO |
| |
| def read_all_molecules(reader: TextIO) -> list: |
|     """Read zero or more molecules from reader, |
|     returning a list of the molecules read. |
|     """ |
| |
|     result = [] |
|     line = reader.readline() |
|     while line: |
|         molecule, line = read_molecule(reader, line) |
|         result.append(molecule) |
| |
|     return result |
This function begins by reading the first line of the file. Provided that line is not the empty string (that is, the file being read is not empty), it passes both the opened file to read from and the line into read_molecule, which is supposed to return two things: the next molecule in the file and the first line immediately after the end of that molecule (or an empty string if the end of the file has been reached).
This simple description is enough to get us started writing the read_molecule function. The first thing it has to do is extract the molecule’s name from line, which is expected to be the COMPND line that starts the molecule. It then reads lines from reader one at a time, looking for one of three situations:
The end of the file, which signals the end of both the current molecule and the file
Another COMPND line, which signals the end of this molecule and the start of the next one
An ATOM line, which is to be added to the current molecule
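As a rough illustration of these three situations, the hypothetical helper `classify_line` below (a name introduced only for this sketch) labels a line the way the reading loop has to treat it. It relies on the fact that `readline` returns the empty string only at the end of the file, while a blank line inside the file comes back as `'\n'`:

```python
def classify_line(line: str) -> str:
    """Hypothetical helper: say which situation a line represents."""
    if not line:
        # readline() returns '' only at the end of the file.
        return 'end of file'
    if line.startswith('COMPND'):
        return 'next molecule'
    fields = line.split()
    if fields and fields[0] == 'ATOM':
        return 'atom'
    # Anything else (for example a blank line) is not an atom.
    return 'other'

print(classify_line(''))                               # end of file
print(classify_line('COMPND METHANOL\n'))              # next molecule
print(classify_line('ATOM 1 N 0.257 -0.363 0.000\n'))  # atom
```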
The most important thing is that when this function returns, it returns both the molecule and the next line so that its caller can keep processing. The result is probably the most complicated function we have seen so far, but it is easier to follow once you keep that one idea in mind:
| from typing import TextIO, Tuple |
| |
| def read_molecule(reader: TextIO, line: str) -> Tuple[list, str]: |
|     """Read a molecule from reader, where line refers to the first line of |
|     the molecule to be read. Return the molecule and the first line after |
|     it (or the empty string if the end of file has been reached). |
|     """ |
| |
|     # line is the COMPND line; its second field is the molecule's name. |
|     fields = line.split() |
|     molecule = [fields[1]] |
| |
|     # Read ahead until the end of the file or the next COMPND line. |
|     line = reader.readline() |
|     while line and not line.startswith('COMPND'): |
|         fields = line.split() |
|         if fields[0] == 'ATOM': |
|             key, num, atom_type, x, y, z = fields |
|             molecule.append([atom_type, x, y, z]) |
|         line = reader.readline() |
| |
|     # line is now either '' or the next molecule's COMPND line. |
|     return molecule, line |
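To see the two functions working together, here is a sketch that restates both in compact form so the block runs on its own. It feeds them the ammonia and methanol data shown earlier. Using `io.StringIO` in place of an opened file is an assumption made for the example, as is the name `sample`:

```python
import io
from typing import TextIO, Tuple

def read_molecule(reader: TextIO, line: str) -> Tuple[list, str]:
    """Return the molecule that starts at line and the first line after it."""
    molecule = [line.split()[1]]
    line = reader.readline()
    while line and not line.startswith('COMPND'):
        fields = line.split()
        if fields[0] == 'ATOM':
            key, num, atom_type, x, y, z = fields
            molecule.append([atom_type, x, y, z])
        line = reader.readline()
    return molecule, line

def read_all_molecules(reader: TextIO) -> list:
    """Return a list of all molecules in reader."""
    result = []
    line = reader.readline()
    while line:
        # The look-ahead line returned here starts the next molecule.
        molecule, line = read_molecule(reader, line)
        result.append(molecule)
    return result

# Stand-in for a file containing two molecules and no END markers.
sample = io.StringIO(
    'COMPND AMMONIA\n'
    'ATOM 1 N 0.257 -0.363 0.000\n'
    'ATOM 2 H 0.257 0.727 0.000\n'
    'ATOM 3 H 0.771 -0.727 0.890\n'
    'ATOM 4 H 0.771 -0.727 -0.890\n'
    'COMPND METHANOL\n'
    'ATOM 1 C -0.748 -0.015 0.024\n'
    'ATOM 2 O 0.558 0.420 -0.278\n'
    'ATOM 3 H -1.293 -0.202 -0.901\n'
    'ATOM 4 H -1.263 0.754 0.600\n'
    'ATOM 5 H -0.699 -0.934 0.609\n'
    'ATOM 6 H 0.716 1.404 0.137\n'
)

molecules = read_all_molecules(sample)
for molecule in molecules:
    # The first item is the name; the rest are atoms.
    print(molecule[0], len(molecule) - 1)
# prints:
# AMMONIA 4
# METHANOL 6
```

Because each call hands the look-ahead line back to its caller, the COMPND line for methanol is read only once and is still available when the second molecule is processed.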