Can we use an independent t-test as a metric for feature importance?

I have a supervised binary classification problem. I tuned an XGBoost model on the training set and achieved a reasonably high accuracy on the test set. Now I want to interpret the results of the model. I used the SHAP library to interpret the results and, for the most part, they are consistent with what I would expect. However, there is one feature that, if we average over all SHAP values, is ranked as the 7th most important and I would …
Category: Data Science

Pagination redirect set in .htaccess file is not working

I'm trying to redirect like below but it's not working.
http://example.com/page/2 -> http://example.com/item/page/2
Below is the source of .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.hogehoge\.com
RewriteRule (.*) http://hogehoge.com/$1 [R=301,L]
RewriteRule ^hogehoge.com/page/(.*)$ http://hogehoge.com/item/page/$1 [R=301,L]
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
Category: Web

What is the difference between nDCG and rank correlation methods?

When do we use one or the other? My use case: I want to evaluate a linear space to see how good retrieval results are. I have a set of data X (m x n) and some weights W (m x 1). I want to measure the nearest neighbour retrieval performance on W'X with a ground truth value Y. This is a continuous value, so I can't use simple precision/recall. If I use rank correlation, I will find the correlation …
Category: Data Science
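
One way to see the difference asked about above is to score the same ranking mistakes with both kinds of metric. The sketch below is only illustrative: the helper names, the toy `truth` values, and the two score lists are assumptions, not data from the question. It uses the linear-gain variant of nDCG (which assumes non-negative ground-truth values) and a tie-free Spearman correlation.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(scores, truth):
    """nDCG of the ranking induced by `scores`, using `truth` as graded relevance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [truth[i] for i in order]
    ideal = sorted(truth, reverse=True)
    return dcg(ranked) / dcg(ideal)

def spearman(scores, truth):
    """Spearman rank correlation, assuming no ties in either list."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, i in enumerate(order):
            result[i] = rank
        return result
    n = len(scores)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(scores), ranks(truth)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ground truth Y and two hypothetical score vectors.
truth = [3.0, 2.0, 1.0, 0.5, 0.1]
swap_top = [4, 5, 3, 2, 1]     # the two most relevant items are swapped
swap_bottom = [5, 4, 3, 1, 2]  # the two least relevant items are swapped

print(spearman(swap_top, truth), spearman(swap_bottom, truth))
print(ndcg(swap_top, truth), ndcg(swap_bottom, truth))
```

In this toy setup, rank correlation treats both swaps as equally bad, while nDCG penalizes the swap at the top of the ranking more heavily. That position weighting is usually the deciding factor when only the first few retrieved neighbours matter.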

Allow Contributors to Upload Files

I would like to allow contributors to upload a file. So I used the recommended plugin WP Role Editor. After activating it, I went to the plugin from User > User Role Editor and selected contributor in the selection dropdown. After that I checked upload_files and hit update. Then I logged in with a contributor account to test uploading a file. Great, I see the media upload button, but when I click to upload a file, I get this …
Category: Web

ACF to select posts not displaying on blog page

I am trying to implement a repeater / post object so I can select posts that I want to display on the sidebar. I am using this code: <?php while ( have_rows('top_posts_repeater')) : the_row(); // loop through the repeater fields ?> <?php // set up post object $post_object = get_sub_field('selection'); if( $post_object ) : $post = $post_object; setup_postdata($post); ?> <article class="your-post"> <?php the_title(); ?> <?php the_post_thumbnail(); ?> <?php // whatever post stuff you want goes here ?> </article> <?php wp_reset_postdata(); …
Category: Web

How can I export the best classifier from my code to a model for real future usage?

# Read the CSV file
df = pd.read_csv('processed.csv', header=0, engine='python')
# Pre-processing the data
# Define X,Y features
X = df.drop('Class', axis=1)
Y = df['Class']
# prepare configuration for cross validation test harness
seed = 3
# prepare models
models = [('LR', LogisticRegression()), ('LDA', LinearDiscriminantAnalysis()), ('KNN', KNeighborsClassifier()), ('CART', DecisionTreeClassifier()), ('NB', GaussianNB()), ('SVM', SVC())]
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = …
Category: Data Science

Does it make sense to use target encoding together with tree-based models?

I'm working on a regression problem with a few high-cardinality categorical features (forecasting different items with a single model). Someone suggested using target encoding (mean/median of the target of each item) together with XGBoost. While I understand how this new feature would improve a linear model (or GLMs in general), I do not understand how this approach would fit into a tree-based model (regression trees, random forest, boosting). Given the feature is used for splitting, items with a mean below …
Category: Data Science

Can anyone tell me how I can get the following output?

Here is my code:
file_name = ['0a57bd3e-e558-4534-8315-4b0bd53df9d8.jpeg', '20d721fc-c443-49b2-aece-fd760f13ff7e.jpeg']
img_id = {}
images = []
for e, i in enumerate(range(len(file_name))):
    img_id['file_name'] = file_name[e]
    images.append(img_id)
print(images)
The output is:
[{'file_name': '20d721fc-c443-49b2-aece-fd760f13ff7e.jpeg'}, {'file_name': '20d721fc-c443-49b2-aece-fd760f13ff7e.jpeg'}]
I want it to be:
[{'file_name': '0a57bd3e-e558-4534-8315-4b0bd53df9d8.jpeg'}, {'file_name': '20d721fc-c443-49b2-aece-fd760f13ff7e.jpeg'}]
I don't understand why it saves only the last file name in the dictionary.
Topic: python
Category: Data Science
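
The loop in the question creates a single `img_id` dict before the loop and appends that same object on every iteration, so each list entry points to one dictionary whose `'file_name'` key ends up holding the last value written. A minimal sketch of one possible fix, building a new dict for each file name (the variable names are taken from the question):

```python
file_name = [
    '0a57bd3e-e558-4534-8315-4b0bd53df9d8.jpeg',
    '20d721fc-c443-49b2-aece-fd760f13ff7e.jpeg',
]

# Create a fresh dict per file name so every list entry is a distinct object.
images = [{'file_name': name} for name in file_name]

print(images)
```

Keeping the original `for` loop also works, provided `img_id = {}` is moved inside the loop body so that each iteration starts from a new dictionary.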

List of main statistics models

I am not able to find a list of the main statistical models. Is it possible to divide statistical models into categories such as supervised (regression, classification) and unsupervised (clustering), or is that something used in the field of machine learning but not for categorizing statistical models? Thank you
Category: Data Science

How to stop Gutenberg from outputting inline CSS for specific blocks?

I'd like to know how I can remove these individually. <style id='wp-block-separator-inline-css'> @charset "UTF-8";.wp-block-separator{border-bottom:1px soli ... </style> WP has a feature to add inline JS (and also CSS, I think) to registered scripts/styles. So dequeue should work, but it does not. add_action( 'wp_enqueue_scripts', __NAMESPACE__ . '\action_wp_enqueue_scripts', 99 ); function action_wp_enqueue_scripts() { wp_dequeue_style( 'wp-block-navigation' ); // Comes from a file and works wp_dequeue_style( 'wp-block-post-comments-form' ); // Comes from a file and works wp_dequeue_style( 'wp-block-seperator' ); // Does not work } …
Category: Web

Two steps optimization of a credit card limit

I have a problem similar to what is in the title but not the same. The problem in the title allows me to explain the dynamics of my need. I have to determine what the optimal value is for a variable called QUOTA or LIMIT for a credit card. The goal of the model is to allow me to minimize the probability of default, given this variable and others that characterize my customer. What is the best way to determine …
Category: Data Science

Using search.php without a 's' field in searchform.php

Sorry if this is a simple/daft question but I'm still getting to grips with how WordPress search functions. I want to completely replace the standard search within my template with a custom search that only queries a certain custom post type and its meta fields. I have a search form which does this and search.php which returns the correct data. However, the search will not function unless I include an input field named 's' and it is not empty. I …
Category: Web

Which string distance equation for fuzzy-matching person names is reliable?

A reproducible example with a small bit of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The fuzzytext library in R has the following available string methods c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). Our use case is matching (left-joining) basketball player names from 2 different sources. From the Stack Overflow post, we have the following concerns to account for when string matching names: The left join shouldn't …
Category: Data Science

How to get list of posts from permalinks?

I am using an API which tracks metrics on parts of my site. The only useful bit it saves though is the URL (permalink). What is the most efficient way to query all the posts that match those permalinks, given that I don't have access to the IDs to use post__in with WP_Query?
Category: Web

How to avoid memory error with Pandas pd.read_csv method call with GridSearchCV usage for DecisionTreeRegressor model?

I have been implementing a DecisionTreeRegressor model in an Anaconda environment with a data set sourced from a 20 million row, 12-dimensional CSV file. I could read the data set in chunks with chunksize set to 500,000 rows and compute the R-squared score on the training/test split data sets in each iteration of 500,000 rows up to iteration #20. sklearn.__version__: 0.19.0 pandas.__version__: 0.20.3 numpy.__version__: 1.13.1 The GridSearchCV() instance uses a parameter grid with parameter max_depth set to values …
Category: Data Science

Machine Learning for conditional density estimation

Suppose I have a set of examples $X = (x_1,x_2,..,x_n)$ with continuous numeric targets $Y = (y_1,y_2,..,y_n)$. While it is standard to use regression models to make point predictions of $y_i$ as $f(x_{i}) = \hat{y}_i$, I am interested in predicting a density function for $y_{i}$. What I want is analogous to the use of probabilities in classification instead of hard predictions (e.g. predict vs predict_proba in Scikit-learn), but for continuous regression problems. Specifically, a different density function (e.g. in the …
Category: Data Science

LSTM for multivariate sequence classification

How can I train a multivariate, multiclass sequence classifier using an LSTM in Keras? I have 50000 sequences, each 100 time points long. At every time point, I have 3 features (so the width is 3). I have 4 classes and I want to build a classifier to determine the class of each sequence. What is the best way to do so? I saw many guides for univariate sequence classification but none for multivariate, and I don't know how to apply this …
Category: Data Science

Random search grid not displaying scoring metric

I want to do a grid search over a few hyperparameters of an XGBClassifier for a binary class, but whenever I run it the score value (roc_auc) is not being displayed. I read in another question that this can be related to some error in model training, but I am not sure which one applies in this case. My model training data X_train is a np.array of (X, 19) and my y_train is a numpy.ndarray of shape (X, ) which …
Category: Data Science

How to compare distribution of 2 continuous variable datasets

I want to compare 2 datasets and check for their similarity. I have tried statistical tests like the KS test and z-test, but they gave a p-value of 0.0 for most columns. I then read that the KS test won't work because the dataset size is huge and it will exaggerate even slight differences. Then I tried Bhattacharyya distance and Hellinger distance, but the probability values are coming out as 0.01 (which is correct since it is a continuous variable). I am trying to …
Topic: distribution
Category: Data Science

Translate custom post type and taxonomy slug in URL?

I need help translating the custom post type and taxonomy slugs when using multisite and multiple languages. I am using subdirectories such as /en, /sv, etc. I am using the plugin Multisite Language Switcher, but cannot change the rewrite settings there. So I am guessing I have to change some rewrite? Or should I translate the post type with translation files (.mo, .po)? This is how the post type is set up in functions.php. Should I do something with the rewrite? …
Category: Web

Why is json_decode failing?

I'm trying to determine if this is related to my having the latest version of PHP on my server while using the latest version of WordPress, or if I'm just doing it wrong. Here's my function that is correctly returning values (I can see them when I do an echo or a var dump): function my_Get_CURL (){ $url = 'http://someotherserver/event/94303?radius=30'; // Initiate curl $ch = curl_init(); // Disable SSL verification curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Will return the response, if false …
Category: Web

Why does a filter need to be applied to the output of the input gate before cell state is added to?

In a neural network there are 4 gates: input, output, forget and a gate whose output performs element wise multiplication with the output of the input gate, which is added to the cell state (I don't know the name of this gate, but it's the one in the below picture with the output C_tilde). Why is the addition of the C_tilde gate required in the model? In order to allow the input gate to subtract from the cell state, we …
Category: Data Science

Comment count same for every post in homepage WP_Query

I have a WP_Query on my homepage template that works just fine for get_the_title() and MANY other functions for each post in the loop. However, any way of getting the comment count just grabs the real count for the first post and then repeats the same output for the rest of the posts. I've tried both echo get_comments_number($post->ID); and comments_number(); The kicker: if I use the same template NOT on the homepage, comment count works fine.
Category: Web

Analysis of prediction shift problem in gradient boosting

I was going through the CatBoost paper section 4.1 where they talk about the 'Analysis of prediction shift' using an example consisting of 2 features which are Bernoulli random variables. I am unable to wrap my head around the experimental setup. Since there are only 2 indicator features, we can have only 4 data points; everything else will be duplication. They mention that for train data points the output of the first estimator of the boosting model is biased, …
Category: Data Science

In Python, how can I transfer/remove duplicate columns from one dataset, such that the rows and columns of all datasets would be equal?

So I've been trying to improve my Random Decision Tree model for the Titanic Challenge on Kaggle by introducing a Validation Dataset, and now I encounter this roadblock, as shown by the images below: Validation Dataset Test Dataset After inspecting these datasets using the .info function, I've found that the Validation Dataset contains 178 and 714 non-null floats, while the Test Dataset contains an assorted 178 and 419 non-null floats and integers. Further, the Datasets contain duplicate rows, which I …
Category: Data Science

ValueError: Mixed precision training with AMP or APEX (`--fp16` or `--bf16`) and half precision evaluation (`--fp16_full_eval` or `--bf16_full_eval`) can only be used on CUDA devices

I'm fine-tuning the wav2vec-xlsr model. I've created a virtual env for that and installed CUDA 11.0 and tensorflow-gpu==2.5.0, but it gives the following error: ValueError: Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices. I want to fine-tune the model on a GPU. Any help?
Category: Data Science

Why is H2O using only a part of the data?

I have this dataframe:
> head(df_clas_sn)
   country serial_no_of_generator_1 serial_no_of_generator_2 serial_no_of_generator_3 unit_type
11 Germany                       XY                       01                     0620      ORiP
12   India                       XY                       01                     0631      ORiP
13 Germany                       XY                       02                     0683      ORiP
14 Germany                       XZ                       02                     0735      KRIT
15 England                       XY                       03                     0844      KRIT
16 Germany                       XZ                       05                     0243      ORiP
   position_in_unit hours_balance status_code
11                Y          2771           1
12               DE          3783           1
13                G          1267           1
14               DE          7798           1
15                G          1136           1
16                M          6197           1
with these dimensions:
> dim(df_clas_sn)
[1] …
Category: Data Science

Pretrained vs. finetuned model

I have a question regarding terminology. When dealing with Hugging Face transformer models, I often read about "using pretrained models for classification" vs. "fine-tuning a pretrained model for classification." I fail to understand what the exact difference between these two is. As I understand, pretrained models by themselves cannot be used for classification, regression, or any relevant task, without attaching at least one more dense layer and one more output layer, and then training the model. In this case, we would …
Category: Data Science

How to solve 404 permalink errors on nginx server

I can't solve this issue. I tried everything I found over the web. I first tried configuring my nginx.conf following the example on codex with no success. https://codex.wordpress.org/Nginx I found that many users encounter this issue but the most popular fix is this: location / { try_files $uri $uri/ /index.php?$args; } I still get 404 for all pages. This only happens if I don't use the Plain setting in permalinks structure. Any ideas on how can I fix this? Thank …
Category: Web

Odds vs Likelihood

Odds is the chance of an event occurring against the event not occurring. Likelihood is the probability of a set of parameters being supported by the data in hand. In logistic regression, we use log odds to convert a probability-based model to a likelihood-based model. In what way are odds & likelihood related? And can we call odds a type of conditional probability?
Category: Data Science
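
A small numeric sketch may help separate the two terms in the question above: odds are a transformation of a probability, while likelihood is a function of a parameter value given observed data. The function names, the toy `labels`, and the candidate probabilities below are assumptions made for illustration only.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Logit: the natural log of the odds."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

def bernoulli_log_likelihood(p, labels):
    """Log-likelihood of the parameter p given observed 0/1 labels."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in labels)

labels = [1, 1, 1, 0]          # hypothetical observations
candidates = [0.25, 0.5, 0.75]  # hypothetical parameter values to compare

# Likelihood ranks parameter values by how well they explain the data.
best_p = max(candidates, key=lambda p: bernoulli_log_likelihood(p, labels))

print(best_p, odds(best_p), log_odds(best_p), sigmoid(log_odds(best_p)))
```

In logistic regression the log-odds are modeled as a linear function of the features, and the coefficients are then chosen by maximizing a likelihood of this Bernoulli form. The odds describe how the model is parameterized; the likelihood is the criterion used to fit it.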

WordPress server banning IP

I am getting a weird problem with a client site. There is an existing WordPress installation on it that has been working without any problem. I have deployed a mobile application that uses the WordPress API. Suppose my laptop and mobile app are connected to the same Internet box. After using the mobile app, I am unable to access the website, even from my laptop. It seems as if the WordPress server or WordPress instance is banning my IP. …
Category: Web

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.