String utilities

Utilities for working with strings

Encoding

`encode_sequence`(sequence, encoding_map, …)	Extract the amino acid properties of the given sequence
`encode_all_sequences`(sequences, …)	Extract the amino acid feature vectors for each peptide sequence

Length manipulation

`pad_sequence`(seq, max_seq_len, pad_value, align)	Pad seq to max_seq_len with value based on the align strategy
`pad_trim_sequences`(seq_vec, pad_value, …)	Pad and/or trim a list of sequences to have common length
`trim_sequence`(seq, maxlen, align)	Trim seq to at most maxlen characters using align strategy

Other operations

`simple_fill`(text, width)	Split text into equal-sized chunks of length width
`split`(s, delimiters, maxsplit)	Split s on any of the given delimiters

Human-readable data type helpers

`bytes2human`(n, format)	Convert n bytes to a human-readable format
`human2bytes`(s)	Convert a human-readable byte string to an integer
`try_parse_float`(s)	Convert s to a float, if possible
`try_parse_int`(s)	Convert s to an integer, if possible
`str2bool`(s)	Convert s to a boolean value, if possible

Definitions

pyllars.string_utils.bytes2human(n: int, format: str = '%(value)i%(symbol)s') → str

Convert n bytes to a human-readable format

This code is adapted from: http://goo.gl/zeJZl

Parameters:
    n (int) – The number of bytes
    format (string) – The format string
Returns:	human_str – A human-readable version of the number of bytes
Return type:	string

Examples

>>> bytes2human(10000)
'9K'
>>> bytes2human(100001221)
'95M'
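
The outputs above are consistent with binary units (1K = 1024 bytes) and a value truncated by the `%i` format code. The following is a minimal sketch of that conversion under those assumptions; `bytes2human_sketch` is a hypothetical name for illustration and is not the pyllars source.

```python
def bytes2human_sketch(n: int, fmt: str = "%(value)i%(symbol)s") -> str:
    """Hypothetical re-implementation for illustration; not the pyllars source."""
    symbols = ("K", "M", "G", "T", "P", "E", "Z", "Y")
    # Assumption: binary units, so K = 2**10, M = 2**20, and so on.
    sizes = {s: 1 << (10 * (i + 1)) for i, s in enumerate(symbols)}
    # Try the largest unit first and stop at the first one that fits.
    for symbol in reversed(symbols):
        if n >= sizes[symbol]:
            value = n / sizes[symbol]
            # The "%i" code in the default format truncates the fraction.
            return fmt % {"value": value, "symbol": symbol}
    # Smaller than 1K: report plain bytes.
    return fmt % {"value": n, "symbol": "B"}


print(bytes2human_sketch(10000))                             # 9K
print(bytes2human_sketch(100001221))                         # 95M
print(bytes2human_sketch(10000, "%(value).1f %(symbol)s"))   # 9.8 K
```

The last call shows how a different format string keeps one decimal place instead of truncating.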

pyllars.string_utils.encode_all_sequences(sequences: Iterable[str], encoding_map: Mapping[str, numpy.ndarray], maxlen: Optional[int] = None, align: str = 'start', pad_value: str = 'J', same_length: bool = False, flatten: bool = False, return_as_numpy: bool = True, swap_axes: bool = False, progress_bar: bool = True) → Union[numpy.ndarray, List]

Extract the amino acid feature vectors for each peptide sequence

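See get_peptide_aa_features for more details. (No function of that name is listed in this module; encode_sequence is the single-sequence counterpart documented here.)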

Parameters:
    sequences (typing.Iterable[str]) – The sequences
    encoding_map (typing.Mapping[str, numpy.ndarray]) – The features for each character
    maxlen (typing.Optional[int])
    align (str)
    pad_value (str)
    same_length (bool)
    flatten (bool) – Whether to (attempt to) convert the features of each peptide into a single long vector (True) or leave as a (presumably) 2d position-feature vector.
    return_as_numpy (bool) – Whether to return as a 2d or 3d numpy array (True) or a list containing 1d or 2d numpy arrays. (The dimensionality depends upon flatten.)
    swap_axes (bool) – If the values are returned as a numpy tensor, swap axes 1 and 2. N.B. This flag is only compatible with return_as_numpy=True and flatten=False.
    progress_bar (bool) – Whether to show a progress bar for collecting the features.
Returns:	all_encoded_peptides – The resulting features. See the flatten and return_as_numpy parameters for the expected output.
Return type:	typing.Union[numpy.ndarray, typing.List]

pyllars.string_utils.encode_sequence(sequence: str, encoding_map: Mapping[str, numpy.ndarray], flatten: bool = False) → numpy.ndarray

Extract the amino acid properties of the given sequence

This function is designed with the idea of mapping from a sequence to numeric features (such as chemical properties or BLOSUM features for amino acid sequences). It may fail if other features are included in encoding_map.

Parameters:
    sequence (str) – The sequence
    encoding_map (typing.Mapping[str, numpy.ndarray]) – A mapping from each character to a set of features. Presumably, the features are numpy-like arrays, though they need not be.
    flatten (bool) – Whether to flatten the encoded sequence into a single, 1d array or leave them as-is.
Returns:	encoded_sequence – A 1d or 2d np.array, depending on flatten. By default (flatten=False), this is a 1d array of objects, in which the outer dimension indexes the position in the sequence. If flatten is True, then the function attempts to reshape the features into a single long feature vector. This will likely fail if the encoding_map values are not numpy-like arrays.
Return type:	numpy.ndarray
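
To make the character-to-feature mapping concrete, here is a minimal sketch that looks up each character in a toy encoding map and stacks the results. The name `encode_sequence_sketch` and the two-feature `toy_map` are hypothetical, and the stacked 2d layout is an assumption; the library's exact return layout for flatten=False may differ.

```python
import numpy as np


def encode_sequence_sketch(sequence, encoding_map, flatten=False):
    """Hypothetical illustration of per-character encoding; not the pyllars source."""
    # One feature vector per position in the sequence.
    rows = [encoding_map[ch] for ch in sequence]
    # Assumption: all vectors have the same length, so this stacks to 2d.
    encoded = np.array(rows)
    if flatten:
        # Concatenate the per-position vectors into one long vector.
        encoded = encoded.reshape(-1)
    return encoded


# Toy two-feature encoding (made up for this example).
toy_map = {
    "A": np.array([1.0, 0.0]),
    "C": np.array([0.0, 1.0]),
    "J": np.array([0.0, 0.0]),
}

encoded = encode_sequence_sketch("ACJ", toy_map)
flat = encode_sequence_sketch("ACJ", toy_map, flatten=True)
print(encoded.shape)  # (3, 2)
print(flat.shape)     # (6,)
```

A character missing from the map raises a KeyError in this sketch, which is one reason to pad with a character that the map defines.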

pyllars.string_utils.human2bytes(s: str) → int

Convert a human-readable byte string to an integer

This code is adapted from: http://goo.gl/zeJZl

Parameters:	s (string) – The human-readable byte string
Returns:	num_bytes – The number of bytes
Return type:	int

Examples

>>> human2bytes('1M')
1048576
>>> human2bytes('1G')
1073741824
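
The examples above imply binary multipliers (1M = 2**20). A minimal sketch of the reverse conversion under that assumption follows; `human2bytes_sketch` is a hypothetical name, and the sketch assumes the input is a number followed by a single unit letter, which may be narrower than what the library accepts.

```python
def human2bytes_sketch(s: str) -> int:
    """Hypothetical re-implementation for illustration; not the pyllars source."""
    symbols = ("B", "K", "M", "G", "T", "P", "E", "Z", "Y")
    s = s.strip()
    # Assumption: the last character is the unit, the rest is the number.
    number, symbol = s[:-1], s[-1].upper()
    if symbol not in symbols:
        raise ValueError(f"unrecognized unit in {s!r}")
    exponent = symbols.index(symbol)  # B -> 0, K -> 1, M -> 2, ...
    return int(float(number) * (1 << (10 * exponent)))


print(human2bytes_sketch("1M"))    # 1048576
print(human2bytes_sketch("1G"))    # 1073741824
print(human2bytes_sketch("1.5K"))  # 1536
```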

pyllars.string_utils.pad_sequence(seq: str, max_seq_len: int, pad_value: str = 'J', align: str = 'end') → str

Pad seq to max_seq_len with value based on the align strategy

If seq is already of length max_seq_len or longer, it will not be changed.

Parameters:
    seq (str) – The character sequence
    max_seq_len (int) – The maximum length for a sequence
    pad_value (str) – The value for padding. This should be a single character
    align (str) – The strategy for padding the string. Valid options are start, end, and center
Returns:	padded_seq – The padded string. If seq is already max_seq_len characters or longer, it is returned unchanged, so padded_seq can be longer than max_seq_len.
Return type:	str
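
The sketch below illustrates the three align strategies. It assumes that align names the side on which the original characters stay anchored (so align='end' pads on the left), and that center padding puts any odd leftover character on the right. Both are assumptions about the library's behavior, and `pad_sequence_sketch` is a hypothetical name.

```python
def pad_sequence_sketch(seq, max_seq_len, pad_value="J", align="end"):
    """Hypothetical illustration of the padding strategies; not the pyllars source."""
    # Sequences that are already long enough get no padding at all.
    n_missing = max(max_seq_len - len(seq), 0)
    if align == "end":
        # Assumption: the sequence stays at the end, so pad on the left.
        n_left, n_right = n_missing, 0
    elif align == "start":
        n_left, n_right = 0, n_missing
    elif align == "center":
        # Assumption: an odd leftover character goes on the right.
        n_left = n_missing // 2
        n_right = n_missing - n_left
    else:
        raise ValueError(f"unknown align strategy: {align!r}")
    return pad_value * n_left + seq + pad_value * n_right


print(pad_sequence_sketch("ACD", 5, align="end"))     # JJACD
print(pad_sequence_sketch("ACD", 5, align="start"))   # ACDJJ
print(pad_sequence_sketch("ACD", 5, align="center"))  # JACDJ
print(pad_sequence_sketch("ACDEFG", 4))               # ACDEFG
```

The last call shows the documented case where the input is longer than max_seq_len and comes back unchanged.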

pyllars.string_utils.pad_trim_sequences(seq_vec: Sequence[str], pad_value: str = 'J', maxlen: Optional[int] = None, align: str = 'start') → List[str]

Pad and/or trim a list of sequences to have common length

The procedure is as follows:

1. Pad the sequence with pad_value
2. Trim the sequence

Parameters:
    seq_vec (typing.Sequence[str]) – List of sequences that can have various lengths
    pad_value (str) – Neutral element with which to pad the sequence. This should be a single character.
    maxlen (typing.Optional[int]) – Length of padded/trimmed sequences. If None, maxlen is set to the longest sequence length.
    align (str) – To which end to align the sequences when trimming/padding. Valid options are start, end, and center
Returns:	padded_sequences – The padded and/or trimmed sequences
Return type:	typing.List[str]
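
The pad-then-trim procedure can be sketched as follows. This is a simplified illustration that only covers align='start' and align='end', and it assumes that align names the side on which the original characters are kept; `pad_trim_sketch` is a hypothetical name, not the library function.

```python
from typing import List, Optional, Sequence


def pad_trim_sketch(
    seq_vec: Sequence[str],
    pad_value: str = "J",
    maxlen: Optional[int] = None,
    align: str = "start",
) -> List[str]:
    """Hypothetical illustration of pad-then-trim; not the pyllars source."""
    if align not in ("start", "end"):
        raise ValueError("this sketch only covers 'start' and 'end'")
    if maxlen is None:
        # As documented: default to the longest sequence length.
        maxlen = max((len(seq) for seq in seq_vec), default=0)
    result = []
    for seq in seq_vec:
        if align == "start":
            # Step 1: pad on the right. Step 2: keep the first maxlen characters.
            padded = seq.ljust(maxlen, pad_value)
            result.append(padded[:maxlen])
        else:
            # Step 1: pad on the left. Step 2: keep the last maxlen characters.
            padded = seq.rjust(maxlen, pad_value)
            result.append(padded[len(padded) - maxlen:])
    return result


seqs = ["AC", "ACDEF", "ACD"]
print(pad_trim_sketch(seqs, maxlen=4))               # ['ACJJ', 'ACDE', 'ACDJ']
print(pad_trim_sketch(seqs, maxlen=4, align="end"))  # ['JJAC', 'CDEF', 'JACD']
print(pad_trim_sketch(seqs))                         # ['ACJJJ', 'ACDEF', 'ACDJJ']
```

Every returned sequence has exactly maxlen characters, which is what makes the result usable as input for a fixed-width encoding.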

pyllars.string_utils.simple_fill(text: str, width: int = 60) → str

Split text into equal-sized chunks of length width

This is a simplified version of textwrap.fill.

The code is adapted from: http://stackoverflow.com/questions/11781261

Parameters:
    text (string) – The text to split
    width (int) – The (exact) length of each line after splitting
Returns:	split_str – A single string with lines of length width (except possibly the last line)
Return type:	string
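
Unlike textwrap.fill, the behavior described here ignores word boundaries and cuts at fixed offsets. A minimal sketch of that idea, using the hypothetical name `simple_fill_sketch`:

```python
def simple_fill_sketch(text: str, width: int = 60) -> str:
    """Hypothetical illustration of fixed-width chunking; not the pyllars source."""
    # Slice the text every `width` characters and join the pieces with newlines.
    chunks = (text[i:i + width] for i in range(0, len(text), width))
    return "\n".join(chunks)


print(simple_fill_sketch("ABCDEFGHIJ", width=4))
# ABCD
# EFGH
# IJ
```

This kind of fixed-width wrapping suits sequence data such as FASTA records, where lines are broken by length and not at spaces.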

pyllars.string_utils.split(s: str, delimiters: Iterable[str], maxsplit: int = 0) → List[str]

Split s on any of the given delimiters

This code is adapted from: http://stackoverflow.com/questions/4998629/

Parameters:
    s (string) – The string to split
    delimiters (list of strings) – The strings to use as delimiters
    maxsplit (int) – The maximum number of splits (or 0 for no limit)
Returns:	splits – The split strings
Return type:	list of strings
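
A common way to split on several delimiters at once is to build one regular expression from the escaped delimiters and pass it to re.split. The sketch below shows that approach; `split_sketch` is a hypothetical name, and whether the library does exactly this is an assumption.

```python
import re
from typing import Iterable, List


def split_sketch(s: str, delimiters: Iterable[str], maxsplit: int = 0) -> List[str]:
    """Hypothetical illustration of multi-delimiter splitting; not the pyllars source."""
    # Escape each delimiter so characters like "." or "|" are matched literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    # As with re.split, maxsplit=0 means "no limit".
    return re.split(pattern, s, maxsplit=maxsplit)


print(split_sketch("a,b;c d", [",", ";", " "]))              # ['a', 'b', 'c', 'd']
print(split_sketch("a,b;c d", [",", ";", " "], maxsplit=1))  # ['a', 'b;c d']
```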

pyllars.string_utils.str2bool(s: str) → bool

Convert s to a boolean value, if possible

Parameters:	s (string) – A string which may represent a boolean value
Returns:	bool_s – True if s is in _TRUE_STRING, and False otherwise
Return type:	boolean
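
The contents of _TRUE_STRING are not shown on this page. The sketch below uses a made-up set of truthy strings and a case-insensitive comparison, both of which are assumptions, to illustrate the membership test described above; `str2bool_sketch` and `_TRUE_STRINGS_SKETCH` are hypothetical names.

```python
# Hypothetical stand-in for _TRUE_STRING; the real contents are not documented here.
_TRUE_STRINGS_SKETCH = frozenset({"true", "yes", "t", "y", "1"})


def str2bool_sketch(s: str) -> bool:
    """Hypothetical illustration of the membership test; not the pyllars source."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return s.strip().lower() in _TRUE_STRINGS_SKETCH


print(str2bool_sketch("Yes"))    # True
print(str2bool_sketch("0"))      # False
print(str2bool_sketch("maybe"))  # False
```

Note that, as documented, any string outside the truthy set maps to False; unrecognized input is not reported as an error.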

pyllars.string_utils.trim_sequence(seq: str, maxlen: int, align: str = 'end') → str

Trim seq to at most maxlen characters using align strategy

Parameters:
    seq (str) – The (amino acid) sequence
    maxlen (int) – The maximum length
    align (str) – The strategy for trimming the string. Valid options are start, end, and center
Returns:	trimmed_seq – The trimmed string. If seq is already maxlen characters or shorter, it is returned unchanged, so trimmed_seq can be shorter than maxlen.
Return type:	str

pyllars.string_utils.try_parse_float(s: str) → Optional[float]

Convert s to a float, if possible

Parameters:	s (string) – A string which may represent a float
Returns:	float_s – The parsed float, or None if s cannot be parsed into a float
Return type:	typing.Optional[float]

pyllars.string_utils.try_parse_int(s: str) → Optional[int]

Convert s to an integer, if possible

Parameters:	s (string) – A string which may represent an integer
Returns:	int_s – The parsed integer, or None if s cannot be parsed into an int
Return type:	typing.Optional[int]
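
Both try_parse_int and try_parse_float follow the same "parse or return None" pattern. A minimal sketch of that pattern follows, using the hypothetical names `try_parse_int_sketch` and `try_parse_float_sketch`; it assumes that a failed parse surfaces as a ValueError from the built-in int and float constructors.

```python
from typing import Optional


def try_parse_int_sketch(s: str) -> Optional[int]:
    """Hypothetical illustration; not the pyllars source."""
    try:
        return int(s)
    except ValueError:
        return None


def try_parse_float_sketch(s: str) -> Optional[float]:
    """Hypothetical illustration; not the pyllars source."""
    try:
        return float(s)
    except ValueError:
        return None


print(try_parse_int_sketch("42"))     # 42
print(try_parse_int_sketch("4.2"))    # None
print(try_parse_float_sketch("4.2"))  # 4.2
print(try_parse_float_sketch("abc"))  # None

# Compare against None explicitly: a parsed 0 is falsy but still valid.
parsed = [v for v in map(try_parse_int_sketch, ["1", "x", "0"]) if v is not None]
print(parsed)  # [1, 0]
```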