Import too large csv data file with strings

Christos Antonakopoulos on 16 Nov 2015

  • Link

    Direct link to this question

    https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings

Commented: Jenny Smith on 19 Jul 2018

Accepted Answer: Guillaume

My file is about 72 MB, with almost 850000 rows and 7 columns on average (the number of columns sometimes changes). The data consists mostly of strings, so I used csvimport from the File Exchange:

http://www.mathworks.com/matlabcentral/fileexchange/23573-csvimport

as

name= 'etch.csv';

[C1, C2, C3, C4, C5, C6, C7] = csvimport(name, 'columns', [1:7], 'noHeader', true, 'delimiter', ';' );

(I am interested only in the first 7 columns, even though some rows have more data.) This works perfectly for small data sets, but for my file it took almost 30 minutes or even more. Any idea for something faster? Thank you.

PS My data type is:

1: Device Name,Category,Date,Time,Source,Message,Condition,Name,Act

2: string1,string2,mm/dd/yyyy,hh:mm:ss.sss,string,string,string,1 or 0

.....

850000: and it goes on as line 2

The last column usually has no data, but it does not interest me.

Mohammad Abouali on 16 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323558

Have you tried readtable?
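
A minimal sketch of what that suggestion could look like, assuming a semicolon-delimited file with a header row. The temporary file and its three columns are made up for illustration, and how readtable copes with rows that have differing numbers of columns is not checked here.

```matlab
% Write a tiny hypothetical sample file so the sketch is self-contained.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Device;Category;Date\n');
fprintf(fid, 'dev1;Event;08/26/2010\n');
fprintf(fid, 'dev2;Trip;08/26/2010\n');
fclose(fid);

% Read the whole file into a table, telling readtable the delimiter explicitly.
T = readtable(fname, 'Delimiter', ';');

delete(fname);   % clean up the temporary file
```

Columns are then available by name, for example T.Device.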

Christos Antonakopoulos on 17 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323618

Accepted Answer

Guillaume on 17 Nov 2015

  • Link

    Direct link to this answer

    https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#answer_200123

No matter what, you're bound by the reading speed of MATLAB. Probably the fastest way to read the file is to read it all at once with fileread. You can then split it into lines with strsplit. It is then a choice of applying textscan, strsplit or regexp to each line. You would have to see which is faster.

Here is how I would do it using regexp:

filecontent = fileread('etch.csv');

filelines = strsplit(filecontent, {'\r', '\n'}); %split at line ending. Copes with linux and windows termination

fields = regexp(filelines, '^([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);', 'tokens', 'once'); %only keep the first seven fields

fields = vertcat(fields{:}); %stack into an N-by-7 cell array; the semicolon avoids printing every row

The above takes about 3 seconds on my machine to read 85000 rows (only 8 MB of text though).
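
To make the regular expression less opaque, here is a small sketch that applies the same pattern to a single line taken from the sample data posted further down the thread. The variable names are only illustrative.

```matlab
% One line in the semicolon-separated layout of the file.
sampleline = 'CCT AC800 PEC Local;Event;08/26/2010;16:47:09.9550;PEC_MSG_25_10;SerialCommFault;Active;1;1;1';

% [^;]* matches a run of zero or more characters that are not ';'.
% The parentheses capture that run as a token; the ';' after it is the separator.
onefield = '([^;]*);';
pattern  = ['^' repmat(onefield, 1, 7)];   % the same expression as above, built by repetition

% 'once' returns the tokens of the first match as a 1x7 cell array.
tokens = regexp(sampleline, pattern, 'tokens', 'once');
```

For a comma-separated file, the same idea should carry over by replacing each ';' in the pattern with ','.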

One thing it hasn't done is parse the date. This is fairly trivial to do with datetime if needed and takes no time at all.
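
A possible sketch of that date parsing, assuming columns 3 and 4 of the cell array hold the date and time in the form shown in the sample data (mm/dd/yyyy and hh:mm:ss.ssss). The two rows below are a stand-in for the real result, and datetime needs R2014b or later.

```matlab
% Two rows standing in for the N-by-7 "fields" cell array.
fields = {'CCT AC800 PEC Local','Event','08/26/2010','16:47:09.9550','PEC_MSG_25_10','SerialCommFault','Active'; ...
          'CCT AC800 PEC Local','Trip', '08/26/2010','16:46:50.2530','PEC_MSG_1_08', 'LineUndervoltage','Active'};

% Join date and time with a space; strcat keeps whitespace for cell inputs.
stamps = strcat(fields(:,3), {' '}, fields(:,4));

% Convert all rows in one call; the format must match the text exactly.
t = datetime(stamps, 'InputFormat', 'MM/dd/yyyy HH:mm:ss.SSSS');
```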

Christos Antonakopoulos on 17 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323656

Thank you, your code worked exactly as needed. For my 850000 lines I needed about 70 seconds, where before I needed 30 minutes. I will check the other commands to see which is the quickest.

Guillaume on 17 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323688

Oh, I was out by a factor of 10 on my test data. That explains the difference in size.

From my testing, the longest operation is the strsplit into individual lines. I've just realised that this operation is not actually needed and that the exact same regular expression I wrote can be used on the whole file. You just need to change one option of the regex:

filecontent = fileread('etch.csv');

fields = regexp(filecontent, '^([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);', 'tokens', 'lineanchors');

fields = vertcat(fields{:}); %stack into an N-by-7 cell array; the semicolon avoids printing every row

On my machine, it halves the processing time (5 seconds vs 11 seconds for 850000 lines; you need a better computer!).

Also note that the vertcat is also an expensive operation. If you're happy to access your data as a cell array (lines) of cell arrays (tokens), then you can dispense with it.

Finally, note that if a line has 7 or fewer fields, the regex won't match, because it requires a ';' after each of the first seven fields. That can be worked around by modifying the regex, at the expense of more processing time.
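
One way such a modified regex might look, as a sketch rather than a tested solution for the real file: every separator is made optional, and line breaks are excluded from the fields so that a short line cannot borrow text from the next one when the pattern runs over the whole file. The in-memory text below is made up, and the sketch cannot tell a missing field from an empty one.

```matlab
% Hypothetical stand-in for the file contents: 9 fields, then only 3.
filecontent = sprintf('a;b;c;d;e;f;g;h;i\nx;y;z\n');

field   = '([^;\r\n]*)';                            % a field holds no ';' and no line break
pattern = ['^' field repmat([';?' field], 1, 6)];   % seven fields, separators optional

fields = regexp(filecontent, pattern, 'tokens', 'lineanchors');
fields = vertcat(fields{:});                        % N-by-7 cell; missing fields come back empty
```

Long rows are truncated to the first seven fields and short rows are padded with empty strings.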

Christos Antonakopoulos on 18 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323862

I see. Yes, you are right, my time was also reduced, but I still need a better PC. Thank you again.

Jenny Smith on 19 Jul 2018

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_591065

Hello, I am trying to follow this thread and I'm reading through the regex documentation. I don't understand what [^;]* does in this expression. I have a very similar problem: my text is separated by commas and I have seven columns, and I am trying to understand how to use this function in the same way.

More Answers (1)

dpb on 16 Nov 2015

  • Link

    Direct link to this answer

    https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#answer_200062

Can't do anything w/o at least a sample of the data file with whatever warts there are as far as missing fields, but why not go to the root i/o routines directly? For larger files, "as near to the metal as you can get" is bound to be the ploy.

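fid = fopen('etch.csv'); % textscan reads from an open file identifier

fmt = '%s %s %d/%d/%d %d:%d:%f %s %s %s %*[^\n]'; % allows the 4-digit year and fractional seconds in the data

d = textscan(fid, fmt, 'delimiter', ';', 'headerlines', 1); % ';' matches the delimiter used in the question

fclose(fid);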

The result above will be a 1-by-11 cell array, one cell per conversion in the format string (the date and the time each take three numeric cells); if you want separate variables instead, try the same format string with textread.

Note there's a new %D formatting string in recent releases to parse dates on input directly; I don't have anything past R2012b, so I return the m/d/y and h/m/s as numerics above. If you do want to retain the strings instead and do the conversion later (or perhaps don't need them any other way), it should be obvious where to replace the formatting to do so.

dpb on 16 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323525

ADDENDUM OBTW, it might turn out to be faster to use a looping construct and read a smaller subset of the file each pass rather than the whole thing at once...with textscan you can pick up from previous read automagically; textread in this regards always closes the file so it would have to reopen it every time with an updated 'headerlines' argument; probably a losing proposition.

I don't know if this would help or not; you'd just have to 'spearmint to see if less memory requirements per read operation would outperform the alternate.
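
A rough sketch of that looping idea, assuming a simple three-column semicolon-delimited file. The temporary file, the format string and the block size of 4 rows are all invented for illustration; a real run would use a much larger block.

```matlab
% Write a small hypothetical file: a header plus 10 data rows.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Name;Category;Value\n');
for k = 1:10
    fprintf(fid, 'dev%d;Trip;%d\n', k, k);
end
fclose(fid);

blockSize = 4;                 % rows read per pass (illustrative value)
fmt = '%s %s %d';
names = {};
values = [];

fid = fopen(fname, 'r');
fgetl(fid);                    % skip the header line once
while ~feof(fid)
    % Each call resumes where the previous one stopped.
    blk = textscan(fid, fmt, blockSize, 'Delimiter', ';');
    names  = [names;  blk{1}]; %#ok<AGROW>
    values = [values; blk{3}]; %#ok<AGROW>
end
fclose(fid);
delete(fname);                 % clean up the temporary file
```

Whether this beats a single read depends on the file and the machine, so it would need timing on the real data.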

Christos Antonakopoulos on 17 Nov 2015

Direct link to this comment

https://matlabcentral.mathworks.com/matlabcentral/answers/255103-import-too-large-csv-data-file-with-strings#comment_323620

Edited: Stephen23 on 17 Nov 2015

Device Name;Category;Date;Time;Source;Message;Condition Name;Act;Ack;Ena

CCT AC800 PEC Local;Event;08/26/2010;16:47:09.9550;PEC_MSG_25_10;SerialCommFault;Active;1;1;1

CCT AC800 PEC Local;Trip;08/26/2010;16:46:50.2530;PEC_MSG_1_08;LineUndervoltage;Active;1;1;1

CCT AC800 PEC Local;Trip;08/26/2010;16:46:50.2530;PEC_MSG_1_11;LineUnderfrequency;Active;1;1;1

CCT AC800 PEC Local;Trip;08/26/2010;16:47:09.9550;PEC_MSG_26_10;WaterPressure Fault;Active;1;0;1

Those are exactly the first 5 lines; I am not interested in the last 3 columns, though. As I said, there are cases in which rows have fewer or more than 10 columns, which is why the csvimport function solved my problem: those cases were handled through padding or truncation.
