NLP Portfolio
This portfolio summarizes various skills learned and used in projects throughout the course.
1. NLP Techniques Implemented
- Converting text to tokens by splitting on spaces, removing punctuation and leading spaces, removing stop words, and converting text to lowercase; a minimal sketch follows this list. (nlp-02-text-processing)
- Frequency analysis of individual tokens and bigrams using tables and histograms. (nlp-03-text-exploration)
- API-based text analysis of JSON responses, using the X API to extract X posts and potterapi to gather a list of Harry Potter spells and their descriptions. (2nd and 4th projects)
- Web scraping / content extraction from HTML, using the Python Requests library to retrieve pages and BeautifulSoup for parsing. See nlp-01-getting-started for Wikipedia articles and nlp-05-web-documents for working with arXiv paper abstracts.
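A minimal sketch of the tokenization steps above; the stop-word list and sample sentence are placeholders rather than the project's actual data:

```python
import string

# Small illustrative stop-word list; the project used a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on spaces, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t.strip() for t in text.split(" ") if t.strip()]
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("The quick brown fox, jumping over a lazy dog!"))
# ['quick', 'brown', 'fox', 'jumping', 'over', 'lazy', 'dog']
```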
2. Systems and Data Sources
- Fetched HTML with the Python Requests library
- Parsed HTML with the BeautifulSoup library
- Fetched JSON from API endpoints
- Extracted text from a PDF file using pypdf (a condensed sketch of these steps follows this list)
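A condensed sketch of the fetch-and-parse steps above; the URLs and file name are placeholders, and error handling is omitted:

```python
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

# Fetch HTML and extract paragraph text with BeautifulSoup.
html = requests.get("https://en.wikipedia.org/wiki/Natural_language_processing").text
soup = BeautifulSoup(html, "html.parser")
paragraph_text = " ".join(p.get_text() for p in soup.find_all("p"))

# Fetch JSON from an API endpoint (placeholder URL).
data = requests.get("https://example.com/api/spells").json()

# Extract text from a local PDF with pypdf (placeholder file name).
reader = PdfReader("paper.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
```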
3. Pipeline Structure (EVTL)
- Extract - Fetch data from source
- Validate - Check that data structure is as expected
- Transform - Perform NLP processing steps, including tokenization and finding token counts
- Load (to sink) - Create outputs, including a CSV file of transformed data and image files containing graphs (histogram, word cloud); a skeleton sketch of the full flow follows this list
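A skeleton of the EVTL flow with placeholder function bodies, sketching the structure rather than the project's exact code:

```python
import csv
import requests

def extract(url: str) -> str:
    """Extract: fetch raw text from the source."""
    return requests.get(url).text

def validate(raw: str) -> str:
    """Validate: check that the data looks as expected before transforming."""
    if not raw.strip():
        raise ValueError("Source returned no text")
    return raw

def transform(raw: str) -> dict[str, int]:
    """Transform: tokenize (simplified here to lowercasing and splitting) and count tokens."""
    counts: dict[str, int] = {}
    for token in raw.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def load(counts: dict[str, int], path: str = "token_counts.csv") -> None:
    """Load: write transformed data to a CSV sink; graph images would be saved similarly."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["token", "count"])
        writer.writerows(sorted(counts.items(), key=lambda kv: -kv[1]))

if __name__ == "__main__":
    load(transform(validate(extract("https://example.com/article"))))
```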
4. Signals and Analysis Methods
- Word frequency: computed token totals to identify the most common keywords in the corpus
- Co-occurrence (bigrams): found the most common adjacent word pairs, giving more information about context than single-token counts (a brief sketch of both follows this list)
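Both signals can be computed with `collections.Counter`; a brief sketch over an already-tokenized list:

```python
from collections import Counter

tokens = ["natural", "language", "processing", "turns", "natural", "language", "into", "data"]

# Word frequency: most common individual tokens.
unigram_counts = Counter(tokens)
print(unigram_counts.most_common(3))
# [('natural', 2), ('language', 2), ('processing', 1)]

# Co-occurrence: most common adjacent pairs (bigrams).
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(3))
# [(('natural', 'language'), 2), (('language', 'processing'), 1), (('processing', 'turns'), 1)]
```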
5. Representative Work
X API Data Extraction and Analysis
Uses the X API to capture the most recent posts that contain "AI" and identifies the tokens with the highest frequency.
Insights: Posts did not share many AI-related keywords, but words used to address an audience, such as "you", were much more common.
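A hedged sketch of the kind of request involved, assuming the X API v2 recent search endpoint and a bearer token held in an environment variable (placeholders for the project's actual setup):

```python
import os
import requests

# Placeholder: a bearer token from the developer portal is assumed to be in the environment.
headers = {"Authorization": f"Bearer {os.environ['X_BEARER_TOKEN']}"}
params = {"query": "AI", "max_results": 100}

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers=headers,
    params=params,
)
resp.raise_for_status()

# Each returned post's text can then feed the same tokenization and counting steps.
posts = [item["text"] for item in resp.json().get("data", [])]
```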
PDF Parsing Pipeline
Modified an existing pipeline to take PDFs from arXiv as the input. Headings are parsed so that the paper can be split into sections and token frequencies compared across them.
Insights: As one would expect, words related to the paper's domain were common. Further work can be done to verify that the pipeline also works for arXiv PDFs of varying formats.
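One way to approximate the heading-based split, sketched with pypdf and a naive regex heuristic (the heading pattern and file name are assumptions, not the pipeline's actual rules):

```python
import re
from pypdf import PdfReader

reader = PdfReader("paper.pdf")  # placeholder file name
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Naive heuristic: treat short, numbered, title-like lines (e.g. "3 Method") as section headings.
heading_re = re.compile(r"^\d+\s+[A-Z][A-Za-z ]{2,40}$")

sections: dict[str, list[str]] = {"preamble": []}
current = "preamble"
for line in full_text.splitlines():
    if heading_re.match(line.strip()):
        current = line.strip()
        sections[current] = []
    else:
        sections[current].append(line)

# Token frequencies can then be computed per section and compared.
for name, lines in sections.items():
    print(name, len(" ".join(lines).split()))
```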
6. Skills
- Utilize different APIs to retrieve data in JSON format
- Use Python Requests library for web scraping
- Parse returned data into tokens for analysis
- Generate visualizations for token frequency
- Communicate results via console logging and markdown text