
Stop Words

Filtering out stop words

NLTK ships with predefined lists of stop words for several languages. They are available through the NLTK corpus module (the stopwords corpus has to be downloaded once, for example with nltk.download('stopwords')):

from nltk.corpus import stopwords

To see the stop words stored for English:

stop_words = set(stopwords.words("english"))
print(stop_words)

The following example uses the stop_words set to remove stop words from a given text:

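from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

# A set gives fast membership tests
stop_words = set(stopwords.words('english'))

# Split the sentence into word and punctuation tokens
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words.
# The comparison is case-sensitive, so "This" is kept:
# the English stop word list contains only lowercase words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop:
# filtered_sentence = []
# for w in word_tokens:
#     if w not in stop_words:
#         filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)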
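
The membership test in the example above is case-sensitive, and NLTK's English stop word list is all lowercase, so a capitalized token such as "This" is not filtered out. A minimal sketch of a case-insensitive variant follows. The helper name remove_stop_words and the small SAMPLE_STOP_WORDS set are hypothetical stand-ins used so that the snippet runs without the NLTK data; in practice you would likely pass set(stopwords.words("english")) and the output of word_tokenize instead.

```python
from typing import Iterable, List, Set

# Stand-in stop word set (an assumption for illustration only); in practice
# use set(stopwords.words("english")) from nltk.corpus.
SAMPLE_STOP_WORDS: Set[str] = {"this", "is", "a", "off", "the"}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Return the tokens whose lowercase form is not in stop_words.

    Hypothetical helper, not part of NLTK. The original casing of the
    tokens that are kept is preserved.
    """
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Hand-written tokens for the example sentence, standing in for the
# output of word_tokenize.
tokens = ["This", "is", "a", "sample", "sentence", ",", "showing", "off",
          "the", "stop", "words", "filtration", "."]

filtered = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

Lowercasing only for the comparison keeps the surviving tokens unchanged, which may matter if later steps (such as POS tagging) rely on capitalization. Punctuation tokens are left in place because they are not stop words.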
