
Pandas - append

Appending to a DataFrame makes a copy every time, so there's a big performance hit when used iteratively. Collecting rows as plain dicts and building the DataFrame once at the end is fastest.

It's been a long time, but I faced the same problem too, and found a lot of interesting answers here, so I was confused about which method to use. When adding a lot of rows to a DataFrame I was interested in speed, so I tried the 3 most popular methods and checked their speed.

SPEED PERFORMANCE 

  1. Using .append (NPE's answer)
  2. Using .loc (fred's answer and FooBar's answer)
  3. Using dict and creating the DataFrame at the end (ShikharDua's answer)

Results (in secs): 

Adding     1000 rows   5000 rows   10000 rows
.append    1.04        4.84        9.56
.loc       1.16        5.59        11.50
dict       0.23        0.26        0.34

From <https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe>
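For a concrete picture of why dict wins, here's a minimal sketch of ShikharDua's pattern (collect plain dicts in a list, build the DataFrame once at the end) timed against the row-by-row .append loop. The column names and values are made up for illustration, and the .append variant assumes a pandas version before 2.0, where DataFrame.append was removed:

import timeit

import pandas as pd

def via_dicts(n):
    # no DataFrame copies inside the loop: gather plain dicts,
    # then build the frame once at the end
    rows = []
    for i in range(n):
        rows.append({'a': i, 'b': 2 * i})
    return pd.DataFrame(rows)

def via_append(n):
    # each .append copies every existing row, so the cost grows per iteration
    df = pd.DataFrame(columns=['a', 'b'])
    for i in range(n):
        df = df.append({'a': i, 'b': 2 * i}, ignore_index=True)
    return df

print('dict:   ', timeit.timeit(lambda: via_dicts(1000), number=1))
print('.append:', timeit.timeit(lambda: via_append(1000), number=1))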

It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

frames = [ process_your_file(f) for f in files ] 
result = pd.concat(frames) 

From <https://pandas.pydata.org/pandas-docs/stable/merging.html>

Expressed as a for loop:

import glob
import pandas as pd

# 'here' is the base directory, defined elsewhere in these notes.
# Note: .append returns a new DataFrame rather than modifying in place,
# so the result must be reassigned each pass (a full copy every iteration).
csv_data = pd.DataFrame()
for csv in glob.iglob(here + '/logs/**/*.csv', recursive=True):
    csv_data = csv_data.append(pd.read_csv(csv,
                                           error_bad_lines=False,
                                           warn_bad_lines=True,
                                           index_col=False),
                               ignore_index=True)  # .append kwarg, not read_csv's

As list comprehension:

csv_data = [
    pd.read_csv(csv, error_bad_lines=False, warn_bad_lines=True, index_col=False)
    for csv in glob.iglob(here + '/logs/**/*.csv', recursive=True)
]
merged_csv = pd.concat(csv_data, ignore_index=True)

And the Spyder Python IDE is awesome. I'm a convert. :) The winning killer feature over the many others I've used to date: the interactive live Variable Explorer.
