Concept:
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical technique used in:
- Natural Language Processing (NLP)
- Information retrieval
- Text mining
It helps determine how important a word is to a document relative to a collection of documents (corpus).
Step 1: Term Frequency (TF).
- Measures how often a word appears in a document.
- Higher frequency → more importance in that document.
\[
TF = \frac{\text{Number of times term appears in document}}{\text{Total number of terms in document}}
\]
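As a minimal sketch of the TF formula above (pure Python, with a hypothetical document already tokenized into words), term frequency can be computed as:

```python
def term_frequency(term, document):
    """TF = (count of term in document) / (total terms in document)."""
    return document.count(term) / len(document)

# Hypothetical example document, tokenized into lowercase words.
doc = ["the", "cat", "sat", "on", "the", "mat"]
print(term_frequency("the", doc))  # 2 occurrences out of 6 terms ≈ 0.333
```

Real systems often apply variants (e.g. raw counts or log-scaled counts), but this plain ratio matches the formula as stated.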
Step 2: Inverse Document Frequency (IDF).
- Measures how rare a word is across all documents.
- Rare words get higher importance.
- Common words (like "the", "is") get lower weight.
\[
IDF = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the term}} \right)
\]
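The IDF formula can be sketched the same way (hypothetical three-document corpus; this sketch assumes the term appears in at least one document, whereas production implementations usually add smoothing to avoid division by zero):

```python
import math

def inverse_document_frequency(term, corpus):
    """IDF = log(total documents / documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

# Hypothetical corpus: a list of tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "bird", "flew"],
]
print(inverse_document_frequency("the", corpus))  # log(3/2) ≈ 0.405
print(inverse_document_frequency("cat", corpus))  # log(3/1) ≈ 1.099
```

Note how "the", appearing in two of three documents, gets a lower weight than "cat", which appears in only one.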
Step 3: TF-IDF Calculation.
\[
TF\text{-}IDF = TF \times IDF
\]
This gives a score indicating the importance of a word in a document.
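Putting the two steps together, a minimal sketch of the full TF-IDF score (hypothetical mini-corpus; no smoothing or normalization):

```python
import math

def tf(term, document):
    return document.count(term) / len(document)

def idf(term, corpus):
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, document, corpus):
    """TF-IDF = TF * IDF."""
    return tf(term, document) * idf(term, corpus)

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["birds", "fly", "south"],
]
doc = corpus[0]
# "the" is frequent in doc but appears in 2 of 3 documents, so its IDF is low.
# "mat" appears once, but only in this document, so it scores higher overall.
print(tf_idf("the", doc, corpus))  # ≈ 0.135
print(tf_idf("mat", doc, corpus))  # ≈ 0.183
```

Even though "the" occurs twice and "mat" only once, "mat" receives the higher score, which is exactly the behavior the formula is designed to produce.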
Step 4: Purpose of TF-IDF.
- Identifies important keywords in documents.
- Down-weights common but less informative words.
- Converts text into numerical form for machine learning.
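To illustrate the last point, here is a sketch of converting each document into a TF-IDF feature vector over a shared vocabulary (hypothetical mini-corpus; real pipelines typically use a library implementation such as scikit-learn's TfidfVectorizer, which also applies smoothing and normalization):

```python
import math

def tfidf_vectors(corpus):
    """Map each tokenized document to a TF-IDF vector over the corpus vocabulary."""
    vocab = sorted({term for doc in corpus for term in doc})
    n_docs = len(corpus)
    idf = {
        term: math.log(n_docs / sum(1 for doc in corpus if term in doc))
        for term in vocab
    }
    vectors = [
        [doc.count(term) / len(doc) * idf[term] for term in vocab]
        for doc in corpus
    ]
    return vocab, vectors

corpus = [
    ["spam", "offer", "now"],
    ["meeting", "at", "noon"],
]
vocab, vectors = tfidf_vectors(corpus)
print(vocab)    # sorted shared vocabulary
print(vectors)  # one numeric vector per document
```

Each document becomes a fixed-length row of numbers, which is the form classifiers and other machine learning models can consume.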
Step 5: Applications.
- Search engines (ranking results)
- Document classification
- Chatbots and NLP models
- Spam detection
Conclusion:
TF-IDF is used to evaluate the importance of words in text by combining frequency within a document and rarity across documents, making it a key technique in NLP and text analysis.