Efficiently count word frequencies in python

Answer by Pradeep Singh for Efficiently count word frequencies in python

January 26, 2020, 1:10 am

Combining everyone else's views and some of my own :) Here is what I have for you:

    from collections import Counter
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    text='''Note...

Answer by Murtadha Alrahbi for Efficiently count word frequencies in python

February 28, 2019, 11:05 pm

You can try with sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    data = ['i am student', 'the student suffers a lot']
    transformed_data...

Answer by nat gillin for Efficiently count word frequencies in python

October 6, 2016, 1:59 am

Here are some benchmarks. It'll look strange, but the crudest code wins.

[code]:

    from collections import Counter, defaultdict
    import io, time
    import numpy as np
    from sklearn.feature_extraction.text import...
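The benchmark code itself is cut off above, so the following is only a minimal sketch of how such a comparison could be set up with the standard library. The function names, `SAMPLE_WORDS`, and the `benchmark` helper are hypothetical, and no timings are claimed here; results depend on the corpus and the Python version.

```python
import timeit
from collections import Counter, defaultdict


def count_with_counter(words):
    # Let Counter consume the iterable directly.
    return Counter(words)


def count_with_defaultdict(words):
    # Missing keys start at 0, so the increment never raises KeyError.
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return counts


def count_with_dict(words):
    # The "crude" version: a plain dict and dict.get with a default.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample data; substitute a real corpus for meaningful numbers.
SAMPLE_WORDS = ("the quick brown fox jumps over the lazy dog " * 1000).split()


def benchmark(repeat=5, number=10):
    # Return the best wall-clock time (in seconds) per strategy.
    results = {}
    for func in (count_with_counter, count_with_defaultdict, count_with_dict):
        timings = timeit.repeat(lambda: func(SAMPLE_WORDS), number=number, repeat=repeat)
        results[func.__name__] = min(timings)
    return results
```

Calling `benchmark()` returns a dict that maps each function name to its best time, which makes it easy to check the "crudest code wins" claim on your own data.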

Answer by Nizam Mohamed for Efficiently count word frequencies in python

March 13, 2016, 12:52 pm

Instead of decoding all the bytes read from the URL, I process the binary data directly. Because bytes.translate expects its second argument to be a byte string, I UTF-8 encode the punctuation. After removing...
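As a rough illustration of the `bytes.translate` step described above (not the answer's actual code, which is truncated), here is a small sketch. The function name `count_words_in_bytes`, the use of `string.punctuation`, and the lowercasing step are assumptions.

```python
import string
from collections import Counter

# bytes.translate needs the characters to delete as a bytes object,
# so the ASCII punctuation string is encoded first.
PUNCTUATION = string.punctuation.encode("utf-8")


def count_words_in_bytes(data: bytes) -> Counter:
    # table=None leaves the remaining bytes unchanged; the second
    # argument lists the byte values to delete.
    cleaned = data.translate(None, PUNCTUATION)
    # Assumption: fold ASCII case, then split on whitespace.
    return Counter(cleaned.lower().split())


counts = count_words_in_bytes(b"Hello, world! Hello again.")
```

The keys of `counts` are `bytes` objects such as `b"hello"`; decode them only when you need to display the results, which avoids decoding the whole input up front.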

Answer by alvas for Efficiently count word frequencies in python

October 6, 2016, 12:22 am

A memory-efficient and accurate way is to make use of:

- CountVectorizer in scikit (for ngram extraction)
- NLTK for word_tokenize
- numpy matrix sum to collect the counts
- collections.Counter for collecting the...

Answer by ShadowRanger for Efficiently count word frequencies in python

November 24, 2023, 9:55 am

The most succinct approach is to use the tools Python gives you.

    from future_builtins import map  # Only on Python 2
    from collections import Counter
    from itertools import chain

    def countInFile(filename):...
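The body of `countInFile` is cut off above, so the following is only one plausible way to combine `Counter`, `chain`, and `map` on Python 3. The names `count_words` and `count_in_file` and the `encoding` parameter are hypothetical, not taken from the answer.

```python
from collections import Counter
from itertools import chain


def count_words(lines):
    # map(str.split, lines) lazily yields one list of words per line;
    # chain.from_iterable flattens those lists into one stream of words,
    # which Counter consumes without building an intermediate list.
    return Counter(chain.from_iterable(map(str.split, lines)))


def count_in_file(filename, encoding="utf-8"):
    # A text file object iterates line by line, so the whole file
    # is never held in memory at once.
    with open(filename, encoding=encoding) as f:
        return count_words(f)
```

Because `count_words` accepts any iterable of strings, it can be tried on a plain list, for example `count_words(["a b a", "b c"])`, before pointing `count_in_file` at a real file.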

Answer by Goodies for Efficiently count word frequencies in python

March 7, 2016, 6:14 pm

This should suffice.

    def countinfile(filename):
        d = {}
        with open(filename, "r") as fin:
            for line in fin:
                words = line.strip().split()
                for word in words:
                    try:
                        d[word] += 1
                    except KeyError:
                        d[word] = 1
    ...

Answer by Stephen Grimes for Efficiently count word frequencies in python

March 7, 2016, 6:10 pm

Skip CountVectorizer and scikit-learn. The file may be too large to load into memory, but I doubt the Python dictionary gets too large. The easiest option for you may be to split the large file into...
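How the answer splits the file is not visible in this excerpt. As one possible reading, the sketch below processes the input in fixed-size batches of lines and merges per-batch counts, so memory is bounded by the batch size plus the vocabulary rather than by the file size. The names `iter_batches` and `count_in_batches` and the default `batch_size` are assumptions.

```python
from collections import Counter
from itertools import islice


def iter_batches(lines, batch_size):
    # Yield successive lists of at most batch_size items from any iterable.
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_in_batches(lines, batch_size=100_000):
    # Count each batch separately, then fold it into the running total.
    total = Counter()
    for batch in iter_batches(lines, batch_size):
        batch_counts = Counter(word for line in batch for word in line.split())
        total.update(batch_counts)
    return total
```

Passing an open text file as `lines` works because file objects iterate line by line; the per-batch counters could also be computed in separate processes and merged the same way.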
