Wednesday 4 January 2017

How to import data with genfromtxt?

NumPy provides several functions to create arrays from tabular data. This post focuses on genfromtxt.
genfromtxt runs two main loops:
1. The first loop converts each line of the file into a sequence of strings.
2. The second loop converts each string to the appropriate data type.
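
The two passes can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not NumPy's implementation; the helper names split_lines and convert_fields are hypothetical.

```python
# Illustrative two-pass parser; NOT how NumPy implements genfromtxt.
# split_lines and convert_fields are hypothetical helper names.

def split_lines(text, delimiter=None):
    """Pass 1: turn every non-empty line into a list of strings."""
    return [line.split(delimiter) for line in text.splitlines() if line.strip()]


def convert_fields(rows, missing=float("nan")):
    """Pass 2: convert each string to float, substituting `missing` for empty fields."""
    return [
        [float(field) if field.strip() else missing for field in row]
        for row in rows
    ]


rows = split_lines("1,2,3\n4,,6", delimiter=",")
table = convert_fields(rows, missing=-1.0)
print(rows)   # strings produced by the first pass
print(table)  # floats produced by the second pass
```

Keeping the passes separate is what leaves room for per-column decisions, such as what to do with an empty field, before any number is parsed.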

Syntax:
numpy.genfromtxt(fname, dtype=<type 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None)

The two-loop mechanism is slower than a single loop, but it gives more flexibility. In particular, genfromtxt can take missing data into account, whereas faster and simpler functions such as loadtxt cannot.
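
As a rough illustration of the missing-data handling, the sketch below reads a row with an empty field. It is written for Python 3 (StringIO comes from io), and the sample data and the fill value -1 are made up for the example.

```python
from io import StringIO

import numpy as np

data = "1,,3\n4,5,6"  # the second field of the first row is empty

# With the default float dtype, an empty field becomes nan.
with_nan = np.genfromtxt(StringIO(data), delimiter=",")

# filling_values supplies a replacement for missing entries instead.
filled = np.genfromtxt(StringIO(data), delimiter=",", filling_values=-1)

print(with_nan)
print(filled)
```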

Process:
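1. Defining the input:
The only mandatory argument of genfromtxt is the source of the data. It can be a file name, an open file-like object, a list of strings, or a generator. A file name can point to a plain text file or to an archive; gzip and bz2 archives are recognized by their extension and decompressed before reading.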
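
A minimal Python 3 sketch of three kinds of input follows. The file name example.csv.gz and the sample values are hypothetical, and the archive is written to a temporary directory only so that the example is self-contained.

```python
import gzip
import os
import tempfile
from io import StringIO

import numpy as np

text = "1,2\n3,4\n"

# 1. An open file-like object.
from_buffer = np.genfromtxt(StringIO(text), delimiter=",")

# 2. A list of strings, one per line.
from_lines = np.genfromtxt(text.splitlines(), delimiter=",")

# 3. A gzip archive, recognized by its .gz extension.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv.gz")  # hypothetical file name
    with gzip.open(path, "wt") as handle:
        handle.write(text)
    from_archive = np.genfromtxt(path, delimiter=",")

print(from_buffer)
print(from_lines)
print(from_archive)
```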

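2. Splitting the lines into columns:
The delimiter argument
The delimiter argument defines how each line is split into columns. It can be a string such as "," or "\t", a single integer giving a fixed column width, or a sequence of integers giving one width per column. By default (delimiter=None), lines are split on whitespace.
Examples: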
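>>> import numpy as np
>>> from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
>>> data = "1, 2, 3\n4, 5, 6"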
>>> np.genfromtxt(StringIO(data), delimiter=',')
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])


>>> data = "  1  2  3\n  4  5 67\n890123  4"
>>> np.genfromtxt(StringIO(data), delimiter=3)
array([[   1.,    2.,    3.],
       [   4.,    5.,   67.],
       [ 890.,  123.,    4.]])


>>> data = "123456789\n   4  7 9\n   4567 9"
>>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
array([[ 1234.,   567.,    89.],
       [    4.,     7.,     9.],
       [    4.,   567.,     9.]])
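
Two further cases round out the examples above: a tab as the delimiter string, and the default of splitting on whitespace. This is a small Python 3 sketch with made-up data.

```python
from io import StringIO

import numpy as np

tabbed = "1\t2\t3\n4\t5\t6"
spaced = "1   2 3\n4 5     6"  # irregular runs of spaces

by_tab = np.genfromtxt(StringIO(tabbed), delimiter="\t")
by_whitespace = np.genfromtxt(StringIO(spaced))  # delimiter=None

print(by_tab)
print(by_whitespace)
```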


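The autostrip argument
By default, the strings obtained by splitting a line keep their leading and trailing whitespace. Setting autostrip=True strips them: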

>>> data = "1, abc , 2\n 3, xxx, 4"
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], 
      dtype='|S5')
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5", autostrip=True)
array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], 
      dtype='|S5')
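
The session above is from Python 2, where "|S5" yields plain strings. On Python 3 the same dtype yields bytes, so a sketch that asks for a unicode dtype ("U5") may be easier to read. Passing encoding="utf-8" is a choice made for this example, not something the original session used.

```python
from io import StringIO

import numpy as np

data = "1, abc , 2\n 3, xxx, 4"

# Without autostrip the padding around each field is kept.
raw = np.genfromtxt(StringIO(data), delimiter=",", dtype="U5",
                    encoding="utf-8")

# With autostrip=True every field is stripped after splitting.
stripped = np.genfromtxt(StringIO(data), delimiter=",", dtype="U5",
                         encoding="utf-8", autostrip=True)

print(repr(raw[0, 1]))
print(stripped)
```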


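The comments argument
The comments argument defines the string that marks the beginning of a comment. By default it is "#". Any text after the comment marker on a line is ignored:
>>> data = """#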
... # Skip me!
... #Skip me too!
... 1,2
... 3,4
... 5,6
... 7,8
... #And here comes the last line
... 9,0
... """
>>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.],
       [ 7.,  8.],
       [ 9.,  0.]])
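
The marker does not have to be "#". The Python 3 sketch below uses "%" instead and also shows a comment trailing a data line. The data and the instrument mentioned in it are invented for the example.

```python
from io import StringIO

import numpy as np

data = """% generated by a hypothetical instrument
1,2
3,4  % trailing remark
% 5,6 is commented out
7,8
"""

result = np.genfromtxt(StringIO(data), comments="%", delimiter=",")
print(result)
```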


3. Skipping lines and choosing columns:
The skip_header and skip_footer arguments
The presence of a header in the file can hinder data processing. In that case, we need to use the skip_header optional argument. Its value must be an integer giving the number of lines to skip at the beginning of the file, before any other action is performed. Similarly, we can skip the last n lines of the file by setting the skip_footer argument to n:

>>> data = "\n".join(str(i) for i in range(10))
>>> np.genfromtxt(StringIO(data))
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> np.genfromtxt(StringIO(data),
...              skip_header=3, skip_footer=5)
array([ 3.,  4.])
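
A slightly more realistic case is a file with a title, a column-label line, and a summary line at the end. The following Python 3 sketch uses made-up content to show both arguments working together.

```python
from io import StringIO

import numpy as np

data = "\n".join([
    "Measurements 2017",  # title line (skipped)
    "x y",                # column labels (skipped)
    "1 2",
    "3 4",
    "total 10",           # summary line (skipped)
])

body = np.genfromtxt(StringIO(data), skip_header=2, skip_footer=1)
print(body)
```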


The usecols argument:
In some cases, we are not interested in all the columns of the data but only a few of them. We can select which columns to import with the usecols argument. This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import. Remember that by convention, the first column has an index of 0. Negative integers behave the same as regular Python negative indexes.

>>> data = "1 2 3\n4 5 6"
>>> data
'1 2 3\n4 5 6'
>>> np.genfromtxt(StringIO(data), usecols=(0, -1))
array([[ 1.,  3.],
       [ 4.,  6.]])

If the columns have names, we can also select which columns to import by giving their name to the usecols argument, either as a sequence of strings or a comma-separated string:

>>> data = "1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data), names="a, b, c", usecols=("a", "c"))
array([(1.0, 3.0), (4.0, 6.0)], 
      dtype=[('a', '<f8'), ('c', '<f8')])
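
Selecting by name also works when the names come from the file itself. The Python 3 sketch below assumes a header line with the invented labels id, temp and hum, and reads it with names=True.

```python
from io import StringIO

import numpy as np

data = "id temp hum\n1 20.5 30\n2 21.0 35"  # made-up sample values

# names=True takes the field names from the first line of the data.
subset = np.genfromtxt(StringIO(data), names=True, usecols=("id", "hum"))

print(subset.dtype.names)
print(subset["hum"])
```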



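4. Setting the names:
The names argument
For tabular data it is often convenient to refer to columns by name. One way to name them is to pass an explicit structured dtype: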

>>> data = StringIO("1 2 3\n 4 5 6")
>>> data 
<StringIO.StringIO instance at 0x7f38ac086680>
>>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
array([(1, 2, 3), (4, 5, 6)], 
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])


Another way is the names argument, given as a sequence of strings or as a comma-separated string:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, names="A, B, C")
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], 
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])


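If the names are stored in the file itself, pass names=True. They are then read from the first line that remains after skip_header lines have been skipped, even if that line is commented out:

>>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")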
>>> np.genfromtxt(data, skip_header=1, names=True)
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], 
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
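
Once the fields have names, a column can be pulled out by name and a row by position. This Python 3 sketch reuses the data from the last example above.

```python
from io import StringIO

import numpy as np

data = "So it goes\n#a b c\n1 2 3\n4 5 6"
table = np.genfromtxt(StringIO(data), skip_header=1, names=True)

col_b = table["b"]    # one column, selected by field name
first_row = table[0]  # one record, selected by position

print(table.dtype.names)
print(col_b)
print(first_row)
```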


The defaultfmt argument
If names=None but a structured dtype is expected, names are defined with the standard NumPy default of "f%i", yielding names like f0, f1 and so forth:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)], 
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

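In the same way, if fewer names are given than the dtype has fields, the missing names are generated from the same default template:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> data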
<StringIO.StringIO instance at 0x7f38a709fc20>
>>> np.genfromtxt(data, dtype=(int, float, int), names="a")
array([(1, 2.0, 3), (4, 5.0, 6)], 
      dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])
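
The examples above rely on the default template. The defaultfmt argument listed in the signature replaces it, as in this Python 3 sketch; the template "var_%02i" is chosen only for illustration.

```python
from io import StringIO

import numpy as np

data = StringIO("1 2 3\n 4 5 6")

# defaultfmt replaces the "f%i" template used for generated field names.
custom = np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")

print(custom.dtype.names)  # ('var_00', 'var_01', 'var_02')
```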