Tracing Analysis of User Purchasing

I. Project Source Code

          Project source code URL: (This project is not open source at present.)

II. Introduction

          User behavior patterns are studied through statistical tracing analysis of user purchases. Session-based logs are analyzed by hour, day, week, month, and quarter to discover patterns in users' purchasing preferences. The proportion of users showing each pattern is summarized to obtain the distribution across the population.
          Typical patterns of user behavior are labeled in order to build user profiles.
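          As a minimal sketch of the multi-granularity analysis described above, the following Python snippet counts logged actions per hour, day, week, month, and quarter. The event layout `(user_id, action, timestamp)`, the helper names, and the sample events are assumptions made for illustration; the actual log schema is not given in this document.

```python
from collections import Counter
from datetime import datetime

GRANULARITIES = ("hour", "day", "week", "month", "quarter")

def time_buckets(ts: datetime) -> dict:
    """Map one timestamp to a bucket key for each granularity."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "hour": ts.strftime("%Y-%m-%d %H"),
        "day": ts.strftime("%Y-%m-%d"),
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": ts.strftime("%Y-%m"),
        "quarter": f"{ts.year}-Q{(ts.month - 1) // 3 + 1}",
    }

def count_by_granularity(events):
    """events: iterable of (user_id, action, timestamp) tuples (hypothetical schema).

    Returns {granularity: Counter keyed by (bucket, action)}.
    """
    counts = {g: Counter() for g in GRANULARITIES}
    for _user, action, ts in events:
        for granularity, bucket in time_buckets(ts).items():
            counts[granularity][(bucket, action)] += 1
    return counts

# Made-up sample events, for illustration only.
events = [
    ("u1", "order", datetime(2020, 1, 6, 9, 30)),
    ("u1", "order", datetime(2020, 1, 6, 9, 45)),
    ("u2", "order", datetime(2020, 2, 14, 20, 0)),
]
counts = count_by_granularity(events)
```

          Dividing each count by the number of distinct users in the same bucket would give the population proportions mentioned above.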

2.1 Preference Attributes of Commodity Three-Level Category

          The confidence degree reflects the user's potential preference. For example, if a user browses many commodities and applies filters more than 100 times, a confidence degree of about 99% may indicate a specific preference. However, 100 filter operations alone cannot guarantee 99.9%; perhaps only a confidence degree of 99.5% can be taken to indicate the preference. More attention should be paid to search words, preference and filter words, the ranking of commodity lists, and clicks or browses in the search result lists, because the product detail pages reached from these lists share common attributes. A counterexample is a user who stays on a list for a very short time, filtering only for useful information and ignoring noise. (Note: Dwell time is an instantaneous statistic. Dwell time within a session is mainly used to identify mistaken clicks and filter operations, not to judge long-term user behavior; that is, dwell time accumulated across many sessions is meaningless.)
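          The note on dwell time can be illustrated with a small sketch that separates suspected mistaken clicks from the other views inside a single session. The 2-second threshold, the `(sku, dwell_seconds)` layout, and the function name are assumptions chosen for illustration; the document does not state a threshold.

```python
def split_by_dwell(session_events, min_dwell_seconds=2.0):
    """Split the events of ONE session by dwell time.

    session_events: list of (sku, dwell_seconds) pairs (hypothetical schema).
    Events shorter than min_dwell_seconds are treated as suspected mistaken
    clicks. Dwell time is never summed across sessions here.
    """
    kept, suspected_mistakes = [], []
    for sku, dwell in session_events:
        if dwell >= min_dwell_seconds:
            kept.append(sku)
        else:
            suspected_mistakes.append(sku)
    return kept, suspected_mistakes

# Made-up session, for illustration only.
session = [("sku_a", 35.0), ("sku_b", 0.8), ("sku_c", 12.5)]
kept, suspected = split_by_dwell(session)
```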
          For the user profile, product keywords and key attribute words are extracted, including automatic mining of common attributes from the user's repurchase cycle, product words, browsing, clicks, and product attributes.

2.2 Paths of User’s Added-Cart and Order

          For each individual user, observe which entrance the user comes in through. Finally, a macro-level survey aggregates the results of all users to observe what proportion of the total population uses each entrance.
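          A minimal sketch of this two-stage view follows: entrances are first counted per user, then each user's most frequent entrance is aggregated into population shares. Taking the first page of a session as its entrance, the "most frequent entrance" rule, and the page names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def entrance_distribution(sessions):
    """sessions: iterable of (user_id, [page, page, ...]) ordered paths.

    Returns (per_user, shares):
      per_user: {user_id: Counter of entrances}, the single-user view
      shares:   {entrance: fraction of users whose most frequent entrance it is}
    """
    per_user = defaultdict(Counter)
    for user_id, path in sessions:
        if path:  # the first page of a path is taken as the entrance
            per_user[user_id][path[0]] += 1
    dominant = {u: c.most_common(1)[0][0] for u, c in per_user.items()}
    totals = Counter(dominant.values())
    shares = {entrance: n / len(dominant) for entrance, n in totals.items()}
    return per_user, shares

# Made-up sessions, for illustration only.
sessions = [
    ("u1", ["search", "list", "detail", "cart", "order"]),
    ("u1", ["search", "detail", "cart"]),
    ("u1", ["home_recommend", "detail", "cart"]),
    ("u2", ["home_recommend", "detail", "order"]),
    ("u3", ["search", "detail", "order"]),
    ("u4", ["promotion", "detail", "cart"]),
]
per_user, shares = entrance_distribution(sessions)
```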

2.3 Paths of User’s Click and Follow

          This part of the user profile covers user behavior that is not related to orders.

III. Schemes

          The session is used as the unit for recording the user's access behavior completely and for outlining the user's full access path. The data is analyzed on this basis.
          Steps:
(1) By reading the output data produced by Session Pruner, all of a user's sessions and purchasing paths can be found from log statistics over a long period, such as more than half a year, and any patterns or rules that exist can be summarized.
(2) There are two approaches:
          (a) Define the typical patterns that exist among some users, then label and profile those users accordingly.
          (b) Similarly, through deep statistical analysis of each user's session data along the time axis, repeated patterns can be found; these should also be defined and labeled.
(3) Based on these inherent patterns, the recommendation results and algorithms are optimized.
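          Step (2)(b) can be sketched as follows: for each user, sessions are ordered along the time axis and identical access paths that recur in several sessions are reported as candidate patterns. The `(user_id, session_start, actions)` layout and the `min_repeats` threshold are assumptions for illustration; the real output format of Session Pruner is not described in this document.

```python
from collections import Counter, defaultdict

def repeated_paths(sessions, min_repeats=2):
    """Find access paths that one user repeats across sessions.

    sessions: iterable of (user_id, session_start, [action, ...]) (hypothetical schema).
    Returns {user_id: [(path_tuple, count), ...]} for paths seen at least
    min_repeats times, most frequent first.
    """
    by_user = defaultdict(Counter)
    # Sort by user and start time so each user's sessions are read in time order.
    for user_id, _start, actions in sorted(sessions, key=lambda s: (s[0], s[1])):
        by_user[user_id][tuple(actions)] += 1
    return {
        user: [(path, n) for path, n in paths.most_common() if n >= min_repeats]
        for user, paths in by_user.items()
    }

# Made-up sessions, for illustration only.
sessions = [
    ("u1", 1, ["search", "detail", "cart", "order"]),
    ("u1", 2, ["search", "detail", "cart", "order"]),
    ("u1", 3, ["home", "detail"]),
    ("u2", 1, ["search", "detail"]),
]
repeated = repeated_paths(sessions)
```

          Paths reported here would still need to be reviewed and given a label by hand before they are used in a profile.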

IV. Data Analysis

          Time: 1 day, 3 days, 7 days. The number of SKUs within the dwell time, the average time, and the total time were counted.
          Features: Each feature is crossed with the time windows of 1 day, 3 days, and 7 days.
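          A minimal sketch of how such windowed features might be computed for one user is given below. The `(sku, timestamp, dwell_seconds)` layout, the feature names, and the choice to count distinct SKUs are assumptions for illustration.

```python
from datetime import datetime, timedelta

WINDOWS_DAYS = (1, 3, 7)

def window_features(events, now):
    """Compute SKU count, total dwell time, and average dwell time per window.

    events: list of (sku, timestamp, dwell_seconds) for ONE user (hypothetical schema).
    now:    reference time; each window covers the interval (now - days, now].
    """
    features = {}
    for days in WINDOWS_DAYS:
        start = now - timedelta(days=days)
        in_window = [(sku, dwell) for sku, ts, dwell in events if start < ts <= now]
        total = sum(dwell for _, dwell in in_window)
        features[f"sku_count_{days}d"] = len({sku for sku, _ in in_window})
        features[f"total_dwell_{days}d"] = total
        features[f"avg_dwell_{days}d"] = total / len(in_window) if in_window else 0.0
    return features

# Made-up events, for illustration only.
now = datetime(2020, 1, 8, 0, 0)
events = [
    ("a", datetime(2020, 1, 7, 12, 0), 30.0),
    ("b", datetime(2020, 1, 6, 12, 0), 10.0),
    ("a", datetime(2020, 1, 2, 12, 0), 20.0),
    ("c", datetime(2019, 12, 25, 12, 0), 50.0),
]
features = window_features(events, now)
```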

4.1 Experimental Data

Data Source                 Time        Data Size
browse and add-cart data    1 month     1.1 GB
order data                  3 months    3.0 GB

4.2 Methods and Results

          (1) The SKUs searched and purchased are aggregated into four-tuples, and the frequency of each four-tuple is counted. The tuples are clustered based on Jaccard similarity.
          (2) The four-tuples from method (1) are used as a corpus to train a word-to-vector model, and the resulting vectors are clustered with the K-means method.
          (3) The product words purchased by the same user over six months are collected as a corpus to train a word-to-vector model, and the resulting vectors are clustered with the K-means method.
          (4) Because the K-means method depends strongly on the selection of the initial nodes, the algorithm is executed many times to merge different parts of the data.
          (5) Based on one set of results clustered by the K-means algorithm, the cosine similarity between every product word vector and each cluster is calculated for a secondary partition.
          (6) Secondary clustering of the large categories with the K-means algorithm was attempted. However, the effect was not satisfactory, and the categories still could not be subdivided.
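          The building blocks of methods (1), (4), and (5) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the project's implementation: the four-tuples are treated as sets of items, the toy 2-D vectors stand in for trained word vectors, and repeated K-means runs are handled by keeping the run with the lowest inertia, which is a common simplification and not the merging strategy of method (4).

```python
import math
import random

def jaccard(a, b):
    """Jaccard similarity of two tuples treated as sets (assumed tuple semantics)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's K-means on a list of tuples; returns (centroids, inertia)."""
    centroids = rng.sample(points, k)  # random initial nodes
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means several times and keep the lowest-inertia result."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)), key=lambda r: r[1])

def assign_by_cosine(vectors, centroids):
    """Secondary partition: give each word the cluster whose centroid is most similar."""
    return {
        word: max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        for word, vec in vectors.items()
    }

# Toy data, for illustration only.
similarity = jaccard(("s1", "s2", "s3", "s4"), ("s2", "s3", "s4", "s5"))
vectors = {"phone": (1.0, 0.1), "charger": (0.9, 0.2),
           "milk": (0.1, 1.0), "bread": (0.2, 0.9)}
centroids, inertia = best_of_restarts(list(vectors.values()), k=2)
labels = assign_by_cosine(vectors, centroids)
```

          In practice the vectors would come from the trained word-to-vector model, and a library implementation of K-means would normally replace the hand-written loop.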

4.3 Results Analysis

          (1) Some clusters contain too many mixed and disordered product words and cannot be aggregated into a specific pattern. Furthermore, the data is sparse.
          (2) Some clusters contain product words that are highly similar to each other.

V. Improvement Scheme

          (1) Reduce the time span of the order data.
          (2) Data sampling should follow the rules of the user patterns.
