NLP Hacking on HN Job listings

One of the nifty parts of Hacker News (HN) is the monthly jobs thread. Between the "Who is hiring?" and "Who wants to be hired?" threads you can see a wide spread of companies and positions, and more recently a plethora of remote positions. The threads themselves are quite long, stretching across several pages, with almost a thousand companies getting involved this month. Searching through the job listings can pose some headaches, though. Several websites have been made by HN users to help, but from my perspective there’s always a nagging voice saying "you aren’t searching with the right keywords". And when using only a few general keywords, there’s so much noise in the results that you’re likely to overlook a company which could be a good fit.

This month I set out to at least partially address my concern about my previous keyword searches. To figure out what I wanted my searches to return, I had to go through all of the posts. I did so using the thread-collapsing feature built into the website, hiding any role which was a hard no for me. At the end of a few hours’ worth of reading there was a large pile of hard nos and some tentative maybes. The results could certainly be trimmed down further; however, that would mean using information from outside the posts, which makes for an unfair comparison against a text search on the posts themselves. With what amounts to a reasonable amount of labeled text data, I thought it could be fun to apply some light natural language processing (NLP) to it. While I work on plenty of machine learning tasks, my primary focus isn’t NLP, so don’t expect sentiment analysis or complicated models from this look through this month’s "Who is hiring?" thread.

Let’s start by taking a look at keywords and n-grams from the overall dataset. An n-gram is just a sequence of N consecutive words, taken in the order they appear in the text. While there’s more nuance involved than that definition suggests, it’s sufficient for this quick modeling.

For each job post we can strip out any HTML tags, remove punctuation, optionally remove common connector words (i.e. stop words), and count the frequencies of the resulting tokens or token sequences (i.e. n-grams). With ranked frequencies we can start to see patterns in the overall structure. For instance, if you go through the posts in the thread you might notice that some of the language is fairly repetitive, and that’s loosely reflected in the 4-gram frequencies shown below.

Rank 1,  136x, ["we", "are", "looking", "for"]
Rank 2,   37x, ["our", "mission", "is", "to"]
Rank 3,   27x, ["to", "learn", "more", "about"]
Rank 4,   25x, ["reach", "out", "to", "me"]
Rank 5,   25x, ["if", "you", "are", "interested"]
Rank 6,   22x, ["senior", "full", "stack", "engineer"]
Rank 7,   20x, ["if", "you", "have", "any"]
Rank 8,   20x, ["we", "are", "hiring", "for"]
Rank 9,   20x, ["engineer", "full", "time", "remote"]
Rank 10,  19x, ["to", "join", "our", "team"]
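
A ranking like the one above could be produced along these lines. This is only a minimal sketch, not the original code: the helper names (tokenize, ngram_counts, top_ngrams), the tiny stop-word list, and the regex-based tag stripping are all assumptions made for illustration, and a real HTML parser such as nokogiri would be more robust than the regex.

```ruby
# Tiny illustrative stop-word list (an assumption, not the list used for the post).
STOP_WORDS = %w[a an and the of to in is are for we our].freeze

# Strip tags and punctuation, lowercase, and split on whitespace.
def tokenize(html, remove_stop_words: false)
  text = html.gsub(/<[^>]+>/, " ") # crude tag removal
  tokens = text.downcase.gsub(/[^a-z0-9+%\s\-]/, " ").split
  remove_stop_words ? tokens - STOP_WORDS : tokens
end

# Count every run of n consecutive tokens, post by post.
def ngram_counts(posts, n, remove_stop_words: false)
  counts = Hash.new(0)
  posts.each do |post|
    tokenize(post, remove_stop_words: remove_stop_words).each_cons(n) do |gram|
      counts[gram] += 1
    end
  end
  counts
end

# Most frequent n-grams first; ties are broken alphabetically.
def top_ngrams(posts, n, limit: 10, remove_stop_words: false)
  ngram_counts(posts, n, remove_stop_words: remove_stop_words)
    .sort_by { |gram, count| [-count, gram] }
    .first(limit)
end

posts = [
  "<p>We are looking for a Senior Backend Engineer.</p>",
  "<p>We are looking for engineers to join our team!</p>"
]
top_ngrams(posts, 4, limit: 1).each do |gram, count|
  puts "#{count}x, #{gram.inspect}"
end
# prints: 2x, ["we", "are", "looking", "for"]
```

Counting per post means n-grams never span two different posts, though they can still span two sentences within one post.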

"we are looking for" is a pretty common opener which gets recycled by a number of users, making up 136 repetitions in this month’s thread. Of note, the rank 6 and rank 9 entries about full stack engineers and remote work corroborate my two major observations about this month’s thread before doing any analysis:

There are a lot of people hiring full stack engineers
There is a huge shift towards remote work compared to past threads

While further analysis does make it seem like there’s more of a frontend/backend split, full-stack is still a very popular job title within the thread.

We can shift over to the 3-gram frequency distribution to get a closer look at shorter phrases, which is where a lot of the job titles start appearing. Additionally, to trim some of the noise we can consolidate a few words which were previously split during tokenization (e.g. full stack → full-stack, full time → full-time, etc.). Below, the job roles picked up in the thread are highlighted. Quite a few locations are picked up at this level as well, with 22 instances of "san francisco ca" appearing in the thread.

Rank 1, #35 - ["senior", "backend", "engineer"]
Rank 2, #26 - ["senior", "frontend", "engineer"]
Rank 3, #24 - ["senior", "full-stack", "engineer"]
Rank 4, #22 - ["san", "francisco", "ca"]
Rank 5, #22 - ["work", "life", "balance"]
Rank 6, #20 - ["or", "remote", "usa"]
Rank 7, #18 - ["remote", "usa", "full-time"]
Rank 8, #18 - ["remote", "usa", "only"]
Rank 9, #15 - ["onsite", "or", "remote"]
Rank 10, #15 - ["full-time", "remote", "usa"]
Rank 11, #14 - ["remote", "or", "onsite"]
Rank 12, #14 - ["engineer", "senior", "backend"]
Rank 13, #14 - ["backend", "engineer", "senior"]
Rank 14, #13 - ["senior", "product", "designer"]
Rank 15, #13 - ["engineer", "remote", "usa"]
Rank 16, #13 - ["senior", "devops", "engineer"]
Rank 17, #13 - ["tech", "stack", "includes"]
Rank 18, #13 - ["senior", "product", "manager"]
Rank 19, #13 - ["engineer", "full-time", "remote"]
Rank 20, #13 - ["learn", "engineering", "culture"]
Rank 21, #12 - ["react", "react", "native"]
Rank 22, #12 - ["remote", "usa", "canada"]
Rank 23, #11 - ["engineer", "san", "francisco"]
Rank 24, #11 - ["or", "remote", "full-time"]
Rank 25, #11 - ["stack", "ruby", "rails"]
Rank 26, #11 - ["full-time", "remote", "or"]
Rank 27, #11 - ["culture", "right", "place"]
Rank 28, #11 - ["engineering", "culture", "right"]
Rank 29, #10 - ["senior", "ios", "engineer"]
Rank 30, #10 - ["lead", "product", "designer"]
Rank 31, #9 - ["multiple", "uk", "locations"]
Rank 32, #9 - ["3+", "years", "experience"]
Rank 33, #9 - ["engineer", "backend", "engineer"]
Rank 34, #9 - ["frontend", "engineer", "senior"]
Rank 35, #9 - ["node", "js", "typescript"]
Rank 36, #9 - ["software-engineer", "remote", "full-time"]
Rank 37, #9 - ["place", "values", "profile"]
Rank 38, #9 - ["right", "place", "values"]
Rank 39, #8 - ["engineer", "senior", "software-engineer"]
Rank 40, #8 - ["5+", "years", "experience"]
Rank 41, #8 - ["new", "york", "ny"]
Rank 42, #8 - ["nyc", "or", "remote"]
Rank 43, #8 - ["node", "js", "react"]
Rank 44, #8 - ["remote", "usa", "or"]
Rank 45, #8 - ["senior", "software-engineer", "backend"]
Rank 46, #8 - ["tech", "stack", "python"]
Rank 47, #8 - ["full-stack", "engineer", "full-time"]
Rank 48, #7 - ["lead", "frontend", "engineer"]
Rank 49, #7 - ["google", "cloud", "platform"]
Rank 50, #7 - ["engineer", "remote", "full-time"]
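
The consolidation of split words (full stack → full-stack, full time → full-time) can be sketched as a small table of regular expressions applied before tokenization. The table contents and the helper name consolidate are assumptions for illustration; the list actually used for the post was not published.

```ruby
# Illustrative merge table: pattern => single hyphenated token.
MERGES = {
  /\bfull[ -]?stack\b/i         => "full-stack",
  /\bfull[ -]?time\b/i          => "full-time",
  /\bsoftware[ -]engineers?\b/i => "software-engineer"
}.freeze

# Apply every merge in turn so each phrase survives tokenization as one token.
def consolidate(text)
  MERGES.reduce(text) { |acc, (pattern, token)| acc.gsub(pattern, token) }
end

puts consolidate("Senior Full Stack Engineer | Full time | Software Engineers")
# prints: Senior full-stack Engineer | full-time | software-engineer
```

Running this before the tokenizer keeps a job title like "full stack engineer" from being spread across separate n-grams.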

Similar to the trigrams, the bigrams capture a fair portion of the job titles, though the bigrams have additional utility in characterizing the positions or the company. You can find a number of examples of culture being discussed in the job posts, such as the 23 instances talking about work/life balance.

Rank 1, #120 - ["tech", "stack"]
Rank 2, #105 - ["remote", "usa"]
Rank 3, #93 - ["full-time", "remote"]
Rank 4, #88 - ["remote", "full-time"]
Rank 5, #86 - ["or", "remote"]
Rank 6, #80 - ["backend", "engineer"]
Rank 7, #77 - ["senior", "software-engineer"]
Rank 8, #77 - ["full-stack", "engineer"]
Rank 9, #77 - ["san", "francisco"]
Rank 10, #65 - ["frontend", "engineer"]
Rank 11, #65 - ["engineering", "team"]
Rank 12, #65 - ["engineer", "senior"]
Rank 13, #61 - ["node", "js"]
Rank 14, #59 - ["product", "manager"]
Rank 15, #58 - ["open", "roles"]
Rank 16, #55 - ["senior", "backend"]
Rank 17, #54 - ["fully", "remote"]
Rank 18, #52 - ["ruby", "rails"]
Rank 19, #48 - ["engineering", "manager"]
Rank 20, #45 - ["react", "native"]
Rank 21, #42 - ["engineer", "remote"]
Rank 22, #42 - ["product", "designer"]
Rank 23, #42 - ["help", "build"]
Rank 24, #41 - ["remote", "or"]
Rank 25, #39 - ["bay", "area"]
Rank 26, #39 - ["senior", "full-stack"]
Rank 27, #38 - ["devops", "engineer"]
Rank 28, #37 - ["new", "york"]
Rank 29, #36 - ["years", "experience"]
Rank 30, #36 - ["senior", "frontend"]
Rank 31, #35 - ["full-time", "onsite"]
Rank 32, #33 - ["frontend", "backend"]
Rank 33, #31 - ["cutting", "edge"]
Rank 34, #31 - ["remote", "ok"]
Rank 35, #30 - ["typescript", "react"]
Rank 36, #30 - ["team", "members"]
Rank 37, #30 - ["multiple", "positions"]
Rank 38, #30 - ["open", "positions"]
Rank 39, #30 - ["early", "stage"]
Rank 40, #29 - ["senior", "product"]
Rank 41, #29 - ["software", "development"]
Rank 42, #28 - ["real", "time"]
Rank 43, #27 - ["remote", "first"]
Rank 44, #27 - ["around", "world"]
Rank 45, #27 - ["usa", "remote"]
Rank 46, #27 - ["engineer", "full-time"]
Rank 47, #26 - ["small", "team"]
Rank 48, #26 - ["currently", "hiring"]
Rank 49, #26 - ["software-engineer", "remote"]
Rank 50, #25 - ["full-stack", "developer"]
Rank 51, #25 - ["stack", "includes"]
Rank 52, #25 - ["engineering", "culture"]
Rank 53, #24 - ["security", "engineer"]
Rank 54, #24 - ["berlin", "germany"]
Rank 55, #24 - ["world", "class"]
Rank 56, #24 - ["100%", "remote"]
Rank 57, #23 - ["life", "balance"]
Rank 58, #23 - ["work", "life"]
Rank 59, #23 - ["usa", "full-time"]
Rank 60, #23 - ["backend", "engineers"]
Rank 61, #23 - ["react", "typescript"]
Rank 62, #23 - ["send", "resume"]
Rank 63, #22 - ["los", "angeles"]
Rank 64, #22 - ["every", "day"]
Rank 65, #22 - ["francisco", "ca"]
Rank 66, #22 - ["or", "onsite"]
Rank 67, #22 - ["has", "been"]
Rank 68, #22 - ["fast", "growing"]
Rank 69, #21 - ["experience", "building"]
Rank 70, #21 - ["usa", "only"]

After eliminating filler words, the third most frequent individual word is "remote", so it certainly looks like a shift for the industry :)

Rank 1, #1425 - "engineer"
Rank 2, #819 - "team"
Rank 3, #810 - "remote"
Rank 4, #605 - "product"
Rank 5, #600 - "or"
Rank 6, #596 - "senior"
Rank 7, #574 - "build"
Rank 8, #522 - "full-time"
Rank 9, #522 - "work"
Rank 10, #429 - "experience"
Rank 11, #387 - "company"
Rank 12, #386 - "software-engineer"
Rank 13, #375 - "data"
Rank 14, #349 - "developer"
Rank 15, #345 - "role"
Rank 16, #342 - "backend"
Rank 17, #317 - "software"
Rank 18, #303 - "frontend"
Rank 19, #293 - "help"
Rank 20, #289 - "platform"
Rank 21, #286 - "their"
Rank 22, #281 - "react"
Rank 23, #274 - "technology"
Rank 24, #264 - "stack"
Rank 25, #259 - "people"
Rank 26, #259 - "manager"
Rank 27, #257 - "usa"
Rank 28, #255 - "apply"
Rank 29, #250 - "but"
Rank 30, #248 - "full-stack"
Rank 31, #246 - "hiring"
Rank 32, #240 - "new"
Rank 33, #219 - "position"
Rank 34, #218 - "customer"
Rank 35, #218 - "tech"
Rank 36, #216 - "working"
Rank 37, #214 - "development"
Rank 38, #205 - "time"
Rank 39, #204 - "python"
Rank 40, #198 - "email"
Rank 41, #185 - "open"
Rank 42, #178 - "aws"
Rank 43, #178 - "onsite"
Rank 44, #170 - "world"
Rank 45, #170 - "other"
Rank 46, #169 - "across"
Rank 47, #168 - "systems"
Rank 48, #167 - "cloud"
Rank 125, #84 - "research"
Rank 135, #81 - "machine-learning"
Rank 204, #63 - "data-engineer"
Rank 227, #58 - "healthcare"

Instead of looking at all of the job posts, we can narrow our scope. Let’s see what happens when I switch over to the roles which caught my eye on a pass through this month’s thread. After downloading the webpages, it’s possible to parse out which postings were hidden and which ones were still visible. It’s a much smaller set, but it’s the one that I would personally be interested in searching for. Right away there’s a shift: work-life balance is referenced more often, California-only roles give way to global roles, and machine learning roles are favored over web dev.

Rank 1, #9 - ["multiple", "uk", "location"]
Rank 2, #5 - ["work", "life", "balance"]
Rank 3, #4 - ["engineer", "multiple", "uk"]
Rank 4, #4 - ["senior", "machine-learning", "engineer"]
Rank 5, #3 - ["engineer", "build", "manage"]
Rank 6, #3 - ["senior", "full-stack", "engineer"]
Rank 7, #3 - ["full-stack", "engineer", "senior"]
Rank 8, #2 - ["frontend", "engineer", "senior"]
Rank 9, #2 - ["sf", "nyc", "norway"]
Rank 10, #2 - ["usa", "sf", "nyc"]

The remote bias of the original data continues, and some low-ranking terms from the previous stage, like deep learning, start appearing. To help identify how well my previous searches captured positions of interest, I’ve highlighted any n-gram which roughly corresponds to search terms used in the past.

Rank 1, #10 - ["remote", "usa"]
Rank 2, #10 - ["senior", "software-engineer"]
Rank 3, #9 - ["uk", "locations"]
Rank 4, #9 - ["multiple", "uk"]
Rank 5, #9 - ["engineer", "senior"]
Rank 6, #9 - ["remote", "full-time"]
Rank 7, #8 - ["full-time", "remote"]
Rank 8, #8 - ["deep", "learning"]
Rank 9, #7 - ["open", "roles"]
Rank 10, #7 - ["tech", "stack"]
Rank 11, #7 - ["senior", "data-engineer"]
Rank 12, #6 - ["site-reliability", "engineer"]
Rank 13, #6 - ["backend", "engineer"]
Rank 14, #6 - ["machine-learning", "engineer"]
Rank 15, #5 - ["san", "francisco"]
Rank 16, #5 - ["engineer", "python"]
Rank 17, #5 - ["life", "balance"]
Rank 18, #5 - ["work", "life"]
Rank 19, #5 - ["apply", "online"]
Rank 20, #4 - ["engineer", "multiple"]
Rank 21, #4 - ["engineering", "manager"]
Rank 22, #4 - ["or", "remote"]
Rank 23, #4 - ["software-engineer", "senior"]
Rank 24, #4 - ["senior", "software"]
Rank 25, #4 - ["senior", "data-science"]
Rank 26, #4 - ["open", "positions"]
Rank 27, #4 - ["time", "off"]
Rank 28, #4 - ["cutting", "edge"]
Rank 29, #4 - ["help", "build"]
Rank 30, #4 - ["development", "experience"]
Rank 31, #4 - ["devops", "engineer"]
Rank 32, #4 - ["engineer", "build"]
Rank 33, #4 - ["long", "term"]
Rank 34, #4 - ["full-stack", "engineer"]
Rank 35, #4 - ["senior", "machine-learning"]
Rank 36, #4 - ["climate", "change"]
Rank 37, #4 - ["data-science", "senior"]
Rank 38, #4 - ["fully", "remote"]
Rank 39, #4 - ["send", "resume"]
Rank 40, #3 - ["toronto", "senior"]

The final individual tokens aren’t too far off from what I’d previously search for, though keep in mind that a term like "machine-learning" could have been "ML", "machine learning", or "machine-learning" in the original text. Likewise, "data-science" could have been "data scientist", "data-science", …, which makes simpler searches a bit more cumbersome. I’d expect somewhat better results using transforms on keywords or key phrases, though how much better is hard (for me) to quantify.

Rank 1, #67 - "engineer"
Rank 2, #66 - "senior"
Rank 3, #57 - "data"
Rank 4, #56 - "remote"
Rank 5, #45 - "work"
Rank 6, #41 - "or"
Rank 7, #35 - "product"
Rank 8, #34 - "full-time"
Rank 9, #32 - "team"
Rank 10, #30 - "software-engineer"
Rank 11, #30 - "experience"
Rank 12, #29 - "engineering"
Rank 13, #26 - "help"
Rank 14, #26 - "but"
Rank 15, #25 - "software"
Rank 16, #25 - "email"
Rank 17, #25 - "engineers"
Rank 18, #24 - "data-science"
Rank 19, #23 - "python"
Rank 20, #23 - "machine-learning"
Rank 21, #22 - "build"
Rank 22, #21 - "usa"
Rank 23, #21 - "stack"
Rank 24, #20 - "roles"
Rank 25, #20 - "interested"
Rank 26, #20 - "building"
Rank 27, #20 - "apply"
Rank 28, #20 - "backend"
Rank 29, #20 - "frontend"
Rank 30, #20 - "people"
Rank 66, #14 - "research"
Rank 167, #7 - "computer-vision"
Rank 168, #7 - "machine"
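
One way to picture a "transform on keywords" is to expand a canonical search term into its known spellings before matching it against the raw post text. The sketch below is an assumption about how that might look: the VARIANTS table and the helper names are hypothetical, with the spellings taken from the examples mentioned above.

```ruby
# Hypothetical variant table: canonical keyword => spellings seen in raw posts.
VARIANTS = {
  "machine-learning" => ["ML", "machine learning", "machine-learning"],
  "data-science"     => ["data scientist", "data science", "data-science"]
}.freeze

# One case-insensitive pattern matching any spelling, with an optional plural "s".
def keyword_pattern(keyword)
  spellings = VARIANTS.fetch(keyword, [keyword])
  alternation = spellings.map { |s| Regexp.escape(s) }.join("|")
  Regexp.new("\\b(?:#{alternation})s?\\b", Regexp::IGNORECASE)
end

# Return the posts that mention the keyword in any of its spellings.
def search(posts, keyword)
  pattern = keyword_pattern(keyword)
  posts.select { |post| post.match?(pattern) }
end

posts = [
  "Hiring ML engineers (remote)",
  "Senior Data Scientist, Berlin",
  "Ruby on Rails developer, HTML/CSS"
]
puts search(posts, "machine-learning").inspect
# prints: ["Hiring ML engineers (remote)"]
```

The word boundaries keep "ML" from matching inside "HTML", but a short abbreviation like this can still produce false positives in other contexts.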

Up to this point I’ve mostly referenced what’s inside the interesting roles; I haven’t mentioned what’s missing. Using the relative frequency of various tokens/bigrams, we can see that some technologies don’t seem to be in the job posts that were selected. Personally, I use some of those technologies, like Docker, and I find others, like Rust, interesting, so their omission isn’t based on me explicitly excluding them. Of note though, the results below are very much cherry-picked, as it’s not too interesting to see that "business" is said 5x more often in other posts than in the selected ones. I’d say one of the most robust elements for future searches is actually "covid", which appeared almost exclusively in job postings that planned on returning to the office once the new post-covid norm was established.

1 in 6121 (self) vs 1 in 1231 (BG) ["node", "js"]
1 in 6121 (self) vs 1 in 3754 (BG) ["vue", "js"]
1 in Inf (self) vs 1 in 4417 (BG) ["js", "react"]
1 in Inf (self) vs 1 in 4693 (BG) ["ios", "engineer"]
1 in Inf (self) vs 1 in 4218 (BG) crypto
1 in Inf (self) vs 1 in 2920 (BG) covid
1 in 3115 (self) vs 1 in 2233 (BG) vue
1 in 3115 (self) vs 1 in 2169 (BG) blockchain
1 in 3115 (self) vs 1 in 2052 (BG) elixir
1 in 6231 (self) vs 1 in 1765 (BG) rust
1 in 6231 (self) vs 1 in 1725 (BG) saas
1 in 3115 (self) vs 1 in 1432 (BG) equity
1 in 6231 (self) vs 1 in 1725 (BG) postgresql
1 in 3115 (self) vs 1 in 1355 (BG) docker
1 in 3115 (self) vs 1 in 1332 (BG) graphql
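
A "1 in N" comparison like the one above can be sketched as follows. This assumes N is the total token count divided by the number of occurrences, which may not be exactly how the numbers in this post were derived; the helper names are hypothetical. A term that never occurs gets an infinite N, printed as "Inf".

```ruby
# How rare a term is, as "one occurrence every N tokens".
# Assumption: N = total tokens / occurrences; absent terms get infinity.
def one_in(term, tokens)
  hits = tokens.count(term)
  hits.zero? ? Float::INFINITY : (tokens.length.to_f / hits).round
end

def format_rate(n)
  n == Float::INFINITY ? "Inf" : n.to_s
end

# Compare a term's rarity in the selected posts (self) and the background (BG).
def compare(term, selected_tokens, background_tokens)
  "1 in #{format_rate(one_in(term, selected_tokens))} (self) vs " \
    "1 in #{format_rate(one_in(term, background_tokens))} (BG) #{term}"
end

selected   = %w[python remote machine-learning engineer]
background = %w[rust remote engineer rust node engineer remote team]
puts compare("rust", selected, background)
# prints: 1 in Inf (self) vs 1 in 4 (BG) rust
```

The same functions work for bigrams if the token lists are replaced by lists of two-element arrays.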

So, that’s a wrap for this hacky NLP look into the job listings. All of the code used to sift through the data was Ruby, using nokogiri to parse the HTML and a load of regular expressions to manipulate the words. Ideally, next time I’ll break down the data with Python using a library suited for the task, but based upon some quick tests using TextBlob and a look at NLTK, I’m skeptical the results will be too different, as that tooling is built for more structured long-form text. I do expect that the libraries solve the problem of singularization, which was a minor headache when working with the data using DIY Ruby code. I was amused to find that ActiveSupport/Inflector was recommended when searching for a library to do the job, yet it falls flat on some of the basics: singular('this') should be 'this' and not 'thi', though I do recognize that I’m putting in a pronoun instead of a noun.
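
To give a feel for the DIY singularization headache, here is a deliberately naive rule-based sketch. It is not the code used for this post: the rules, the exception list, and the name singularize are all assumptions, and the approach still mishandles plenty of English words.

```ruby
# Words ending in "s" that should be left alone (illustrative list only).
NOT_PLURAL = %w[always aws ios js devops kubernetes series].freeze

def singularize(word)
  return word if NOT_PLURAL.include?(word)

  case word
  when /ies\z/           then word.sub(/ies\z/, "y") # companies -> company
  when /(ss|us|is)\z/    then word                   # business, status, this
  when /(ch|sh|x|z)es\z/ then word.sub(/es\z/, "")   # branches -> branch
  when /s\z/             then word.sub(/s\z/, "")    # engineers -> engineer
  else word
  end
end

%w[engineers companies business this].each do |word|
  puts "#{word} -> #{singularize(word)}"
end
```

Here "this" survives only because of the rule that leaves words ending in "is" untouched, which is the kind of special-casing that piles up quickly.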

One problem which would almost certainly be solved by a competent NLP library is n-gram behavior at the ends of sentences. For a corpus like "We’re looking for Software Engineers. Data Engineers can apply at:", the 3-gram sequence should include [for Software Engineers], then [Data Engineers can] right afterwards. My implementation also produces the invalid trigrams [Software Engineers Data] and [Engineers Data Engineers], which cross the period between the sentences. The lack of boundary handling accounts for some of the odd sequences observed earlier in this post, but it doesn’t really change the analysis much.
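
A rough way to respect sentence boundaries without an NLP library is to split the text into sentences first and only build n-grams within each one. The sketch below assumes that sentence-ending punctuation followed by whitespace marks a boundary, which is a simplification (abbreviations like "e.g. " would be split incorrectly); the function name is hypothetical.

```ruby
# Build n-grams per sentence so that none of them cross a sentence boundary.
# Assumption: ".", "!", "?" or ":" followed by whitespace ends a sentence.
def sentence_ngrams(text, n)
  text.split(/(?<=[.!?:])\s+/).flat_map do |sentence|
    sentence.downcase.scan(/[a-z0-9'+\-]+/).each_cons(n).to_a
  end
end

text = "We're looking for Software Engineers. Data Engineers can apply at:"
sentence_ngrams(text, 3).each { |gram| puts gram.inspect }
# prints:
# ["we're", "looking", "for"]
# ["looking", "for", "software"]
# ["for", "software", "engineers"]
# ["data", "engineers", "can"]
# ["engineers", "can", "apply"]
# ["can", "apply", "at"]
```

Because a token like "node.js" has no whitespace after the period, it is not treated as a sentence break.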