Camgirlpedia Tag Finder

The surprisingly difficult journey to make a useful tag suggestion algorithm.

Created: ,
Last Updated:

The original purpose of Camgirlpedia was to list detailed statistics for almost every tag on a variety of cam-sites. This is useful information if you already know which tags you prefer to use, but what if you want to find tags that fit your niche?

This problem has now been solved by Camgirlpedia’s newest tool, Tag Finder. With Tag Finder, you can input any tag that you feel describes you, and a set of tags that may suit a similar niche will be generated.

While this may sound like a simple enough task, multiple iterations were required to create a usable beta. This article will detail some key findings from the development process.

What makes for a good suggestion?

The first question to ask when creating a suggestion algorithm is: “what metric can be used to assess the quality of the suggestions generated?” While this is an open-ended question, the following requirements were set for version 1 of Tag Finder.

  1. All tags generated must be real; malformed or nonsensical tags should not be present.
  2. Tags should be, in a describable way, related to one another. It is not enough to simply list every single tag, nor list a random subset of tags.
  3. There should be a reasonable explanation for the order in which tags are listed. It is not necessary that the first tag is “more correct” than the second, but the reasoning behind the order should be describable; a random order is not acceptable.
  4. Tag suggestions should be specific to gender/category (Female, Male, Trans, Couple) if possible.

The Process.

The next step was to design the process by which suggestions would be generated. There were a number of viable mechanisms by which this could be accomplished, and while machine learning was an appealing option, a more basic approach was deemed appropriate for version 1 so as not to introduce unnecessary complication.

The fundamental approach relied on the idea that at least some models must be aware of more than one tag that fits their niche, so associations between tags could be found simply by logging the tags used by all models. A higher number of occurrences of a particular association between tags would therefore increase the odds that these tags were tightly coupled within a niche, and that they should be suggested more strongly.

This reasoning satisfied all of the requirements, so it was used as the basis of iteration 1.

The steps to create Tag Finder were:

  1. Collect tags.
  2. Parse and analyze the collected tags into a useable format.
  3. Design the algorithm to generate suggestions.

Tag Collection

To simplify the task, tag collection would occur only from Chaturbate, as the list of possible tags is partially limited by the site (unlike MyFreeCams) but still generous enough in the number of possible tags that most niches could be covered (unlike Streamate).

Collection took place over the course of a week at various hours of the day, to reduce the bias that might arise if collection took place only at certain hours. For example, if collection occurred exclusively during the hours when Europeans were most active, the dataset might be skewed and therefore suboptimal for Americans.

Tag Parsing

Initially tags were naively parsed as follows:

If a model had tags A, B, and C selected for their stream, an entry for tag A would have sub-entries B and C incremented by 1. Similarly, entry B would have sub-entries A and C incremented by 1, and entry C would have sub-entries A and B incremented by 1. This process is demonstrated in Figure 1 below.

[Figure: example of the iteration 1 data storage process for two models.]
Figure 1: Demonstration of the process by which tag associations were captured. Assume that the top image showing Jane Doe is the first set of model tags captured, and Kim Roe is the second.
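The naive parsing described above can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the actual Tag Finder code: the function name `count_associations` and the placeholder tags A, B and C are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_associations(tag_sets):
    """Build the iteration 1 structure: tag -> Counter of co-occurring tags."""
    associations = defaultdict(Counter)
    for tags in tag_sets:
        # For every ordered pair (a, b) of distinct tags on one stream,
        # increment the sub-entry b inside the entry for a.
        for a, b in permutations(set(tags), 2):
            associations[a][b] += 1
    return associations


# Hypothetical tag sets for two models (placeholder tag names).
streams = [
    ["A", "B", "C"],  # first model captured
    ["A", "B"],       # second model captured
]

associations = count_associations(streams)
for tag in sorted(associations):
    print(tag, dict(associations[tag]))
```

Note that only the pairwise counts are kept; the original per-model tag sets are discarded once they have been counted.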

Given the assumption that “tags with the most associations must be the most similar,” this appeared to be an appropriate method of storing and parsing data. This would later prove to be incorrect.

Iteration 1

Iteration 1 was constructed from the data collected. Upon testing, however, it was instantly apparent that there was a flaw in the algorithm: more popular tags get used more often, regardless of whether they fit a niche. For example, “lovense” is one of the most used tags, so almost every other tag had a strong association with it, regardless of whether that association was logical or useful.

This rendered the entire algorithm nearly useless, as it was heavily skewed toward suggesting the most popular tags. That defeated the purpose, since the most popular tags could already be found using Camgirlpedia.
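The skew is easy to reproduce on a toy dataset. In the sketch below, only “lovense” comes from the example above; the other tag names and all counts are made up for illustration, and `count_associations` is a hypothetical helper, not the actual Tag Finder code.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_associations(tag_sets):
    """Pairwise co-occurrence counts: tag -> Counter of co-occurring tags."""
    associations = defaultdict(Counter)
    for tags in tag_sets:
        for a, b in permutations(set(tags), 2):
            associations[a][b] += 1
    return associations


# Made-up streams in which one very popular tag appears on every stream.
streams = [
    ["lovense", "cosplay", "anime"],
    ["lovense", "cosplay"],
    ["lovense", "anime"],
    ["lovense", "gamer"],
    ["lovense", "gamer"],
]

associations = count_associations(streams)

# Ranking by raw co-occurrence puts the popular tag first for every seed.
for seed in ("cosplay", "anime", "gamer"):
    top_tag, count = associations[seed].most_common(1)[0]
    print(seed, "->", top_tag, count)
```

Every seed tag ends up with the same top suggestion, which says more about overall popularity than about the niche.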

Iteration 2

To avoid recollecting the dataset, iteration 2 attempted to salvage the data already collected by taking into account additional values, including the total number of associations of each tag and the total number of times a tag was present in the collected data.

Unfortunately, because of the way associations had been represented in the collected data, the actual number of times a tag was present in the dataset was impossible to ascertain.

For example, if entry A = {B: 1, C: 1}, there is no way to tell whether one model had tags A and B and another had tags A and C (so A occurred twice), or whether A, B and C all came from one model (so A occurred once).
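The ambiguity can be shown directly: two different raw datasets, in which tag A occurs a different number of times, collapse to exactly the same entry for A. This is a sketch with a hypothetical helper (`count_associations`) and placeholder tags.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_associations(tag_sets):
    """Pairwise co-occurrence counts: tag -> Counter of co-occurring tags."""
    associations = defaultdict(Counter)
    for tags in tag_sets:
        for a, b in permutations(set(tags), 2):
            associations[a][b] += 1
    return associations


one_model = [["A", "B", "C"]]          # tag A occurs once
two_models = [["A", "B"], ["A", "C"]]  # tag A occurs twice

entry_one = count_associations(one_model)["A"]
entry_two = count_associations(two_models)["A"]

# Both entries are {B: 1, C: 1}, so the occurrence count of A is unrecoverable.
print(entry_one == entry_two)  # True
```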

Multiple attempts to create a useable algorithm from the dataset failed to give satisfactory results, so it was decided that all tags must be recollected.

Iteration 3

Again, tags were collected from all models over the course of a number of days at different hours to avoid biasing the dataset. This time, however, an entry was created to include all tags per model, per collection run. This resulted in a bigger dataset, but far more useable data.
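A minimal sketch of this storage scheme follows, assuming one record per model per collection run. The field names (`run`, `model`, `tags`), the `aggregate` function and the sample records are all hypothetical. The point is that both occurrence counts and coincidence counts can be derived from the raw records at any time.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: every tag of every model, for every collection run.
raw_records = [
    {"run": 1, "model": "model_1", "tags": ["A", "B", "C"]},
    {"run": 1, "model": "model_2", "tags": ["A", "B"]},
    {"run": 2, "model": "model_1", "tags": ["A", "C"]},
]


def aggregate(records):
    """Derive per-tag occurrences and per-pair coincidences from raw records."""
    occurrences = Counter()
    coincidences = Counter()
    for record in records:
        tags = sorted(set(record["tags"]))
        occurrences.update(tags)
        # Each unordered pair is stored once, as an alphabetically sorted tuple.
        coincidences.update(combinations(tags, 2))
    return occurrences, coincidences


occurrences, coincidences = aggregate(raw_records)
print(dict(occurrences))
print(dict(coincidences))
```

Because the raw records are kept, the aggregation can be changed or rerun later without another collection pass.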

After some aggregation of the data, and surprisingly little time spent refining, a new frequency-based algorithm was created as follows:

\(Suggestion\ Index=Popularity\ of\ seed\ tag\ \ast\frac{\#\ Coincidences\ Seed\ tag\ \&\ Suggestion\ Tag}{\#\ Occurrences\ of\ seed\ tag}-\ \frac{\sum_{0}^{n}\frac{\#\ Coincidences\ Seed\ tag\ \&\ non-suggestion\ Tag}{\#\ Occurrences\ of\ non-seed\ tag}}{n}\)

This produced intuitively appropriate suggestions and also allowed the popularity of a tag to be used as a scale factor to generate an order that “looked right”.
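One possible reading of the formula is sketched below. Several details are assumptions, since the formula leaves them open: “popularity” is treated as a plain number supplied by the caller, the sum is taken over every known tag other than the seed and the candidate suggestion, and each term in the sum is divided by the occurrences of that other tag. The function name, data layout and sample counts are hypothetical.

```python
from collections import Counter


def suggestion_index(seed, candidate, occurrences, coincidences, popularity):
    """Score a candidate tag for a seed tag under the assumptions stated above."""

    def together(a, b):
        # Pairs are keyed by alphabetically sorted tuples; missing pairs count as 0.
        return coincidences[tuple(sorted((a, b)))]

    # First term: popularity * coincidences(seed, candidate) / occurrences(seed).
    first = popularity * together(seed, candidate) / occurrences[seed]

    # Second term: mean of coincidences(seed, other) / occurrences(other)
    # over all other tags (assumed meaning of the sum from 0 to n).
    others = [tag for tag in occurrences if tag not in (seed, candidate)]
    if others:
        penalty = sum(together(seed, t) / occurrences[t] for t in others) / len(others)
    else:
        penalty = 0.0
    return first - penalty


# Made-up aggregate counts for three placeholder tags.
occurrences = Counter({"A": 4, "B": 3, "C": 10})
coincidences = Counter({("A", "B"): 3, ("A", "C"): 2, ("B", "C"): 1})

scores = {
    candidate: suggestion_index("A", candidate, occurrences, coincidences, popularity=1.0)
    for candidate in ("B", "C")
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

With these made-up counts, tag B ranks above tag C for seed A, even though C is the more widely used tag overall.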

Final Thoughts and Lessons Learned.

This was an interesting challenge. It was intended to be quick and simple to implement but took far longer than expected to achieve a useable result. That result, once reached, was far more than satisfactory.

The biggest lesson learned is to ALWAYS store raw data, especially during development. Data can always be aggregated and formatted after collection, but source data CANNOT be recreated from aggregated data, so if you have to err, keeping raw data and aggregating later is far more forgiving than keeping only the aggregates.

Follow @LastechLabs on Twitter for the latest updates to Camgirlpedia and many other exciting projects.